mirror of
https://github.com/dtomlinson91/street_group_tech_test
synced 2025-12-22 11:55:45 +00:00
# Usage

This page documents how to run the pipeline locally to complete the task for the [dataset for 2020](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1).

The pipeline also runs in GCP using Dataflow; that setup is discussed further on and can be viewed [here](../dataflow/index.md). We also discuss how to adapt the pipeline so it can run against [the full dataset](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file).

## Download dataset

By default, the input data should go in `./data/input`.

For convenience, the data is publicly available in a GCP Cloud Storage bucket.

Run:

```bash
wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input
```

to download the data for 2020 and place it in the input directory above.

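
Where `wget` is not available, the same download can be sketched in Python using only the standard library. This is a minimal sketch and not part of the repository: the helper name `download_input` is hypothetical, while the URL and target directory are the ones used in the command above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL and directory taken from the wget command above.
DATA_URL = (
    "https://storage.googleapis.com/"
    "street-group-technical-test-dmot-euw1/input/pp-2020.csv"
)
INPUT_DIR = Path("data/input")


def download_input(url: str = DATA_URL, dest_dir: Path = INPUT_DIR) -> Path:
    """Download `url` into `dest_dir` unless the file is already present.

    Hypothetical helper for illustration; it is not a function from the repo.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)  # create data/input if missing
    target = dest_dir / url.rsplit("/", 1)[-1]  # keep the remote file name
    if not target.exists():
        urlretrieve(url, str(target))
    return target
```

Calling `download_input()` from the root of the repo should leave the file at `data/input/pp-2020.csv`, which matches the pipeline's default input path.
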
## Entrypoint

The entrypoint to the pipeline is `analyse_properties.main`.

## Available options

Running

```bash
python -m analyse_properties.main --help
```

gives the following output:

```bash
usage: analyse_properties.main [-h] [--input INPUT] [--output OUTPUT]

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    Full path to the input file.
  --output OUTPUT  Full path to the output file without extension.
```

The default value for input is `./data/input/pp-2020.csv` and the default value for output is `./data/output/pp-2020`.

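
The help text and defaults above are consistent with a small `argparse` parser. The following is a sketch under that assumption, not the repository's actual code: `build_parser` is a hypothetical name, and using `parse_known_args` so that flags the parser does not define (such as a `--runner` flag) are passed through rather than rejected is likewise assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="analyse_properties.main")
    parser.add_argument(
        "--input",
        default="./data/input/pp-2020.csv",
        help="Full path to the input file.",
    )
    parser.add_argument(
        "--output",
        default="./data/output/pp-2020",
        help="Full path to the output file without extension.",
    )
    return parser


# parse_known_args returns the recognised options plus a list of everything
# else, so flags this parser does not define are not treated as an error.
known_args, other_args = build_parser().parse_known_args(
    ["--runner", "DirectRunner", "--output", "./data/output/pp-2020"]
)
print(known_args.input)  # the default input path, since --input was omitted
print(other_args)  # the flags left over for whatever consumes them next
```
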
## Run the pipeline

To run the pipeline and complete the task, run:

```bash
python -m analyse_properties.main \
    --runner DirectRunner \
    --input ./data/input/pp-2020.csv \
    --output ./data/output/pp-2020
```

from the root of the repo.

The pipeline will use the 2020 dataset located in `./data/input` and write the resulting `.json` to `./data/output`.
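
After a run, a quick check is to list the files written under the output prefix. The sketch below assumes the output sits next to the `--output` path with a `.json` extension, possibly with extra characters between the prefix and the extension (some runners add shard suffixes). `find_outputs` is a hypothetical helper, not part of the repository, and nothing is assumed about the layout of the JSON itself.

```python
from pathlib import Path
from typing import List


def find_outputs(prefix: str = "./data/output/pp-2020") -> List[Path]:
    """Return files whose names start with the output prefix and end in .json."""
    base = Path(prefix)
    return sorted(base.parent.glob(f"{base.name}*.json"))


for path in find_outputs():
    with path.open(encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\n")
    # Show the size and the start of each file without parsing the JSON.
    print(f"{path} ({path.stat().st_size} bytes): {first_line[:80]}")
```

If nothing is printed, no matching file was found, which usually means the run wrote elsewhere or did not finish.
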