Merge final release (#1)

* adding initial skeleton

* updating .gitignore

* updating dev dependencies

* adding report.py

* updating notes

* adding prospector.yaml

* updating beam to install gcp extras

* adding documentation

* adding data exploration report + code

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding debug.py

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding latest beam pipeline code

* updating .gitignore

* updating folder structure for data input/output

* updating prospector.yaml

* adding latest beam pipeline code

* updating prospector.yaml

* migrate beam pipeline to main.py

* updating .gitignore

* updating .gitignore

* adding download script for data set

* adding initial docs

* moving inputs/outputs to use pathlib

* removing shard_name_template from output file

* adding pyenv 3.7.9

* removing requirements.txt for documentation

* updating README.md

* updating download data script for new location in GCS

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* moving dataflow notes

* updating prospector.yaml

* adding latest beam pipeline code for dataflow

* updating beam pipeline to use GroupByKey

* updating download_data script with new bucket

* update prospector.yaml

* update dataflow documentation with new commands for vpc

* adding latest beam pipeline code for dataflow with group optimisation

* updating dataflow documentation

* adding latest beam pipeline code for dataflow with group optimisation

* updating download_data script with pp-2020 dataset

* adding temporary notes

* updating dataflow notes

* adding latest beam pipeline code

* updating dataflow notes

* adding latest beam pipeline code for dataflow

* adding debug print

* moving pandas-profiling report into docs

* updating report.py

* adding entrypoint command

* adding initial docs

* adding commands.md to notes

* commenting out debug imports

* updating documentation

* updating latest beam pipeline with default inputs

* updating poetry

* adding requirements.txt

* updating documentation
Author: dtomlinson91 (committed by GitHub)
Date: 2021-09-28 00:31:09 +01:00
parent 8a22bfebe1
commit 80376a662e
34 changed files with 5667 additions and 1 deletion


@@ -0,0 +1,31 @@
# Installation
The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested.
The task has been tested on macOS Big Sur and WSL2. It should also run on Windows, but this has not been tested.
For Beam 2.32.0, the supported Python SDK versions can be found [here](https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#sdk-for-python).
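The repo pins Python 3.7.9 via pyenv (see the commit history). If you use pyenv, a minimal sketch of matching the tested interpreter (assumes pyenv is already installed):
```bash
# Install and select the tested Python version for this repo
pyenv install 3.7.9
pyenv local 3.7.9
python --version  # should print Python 3.7.9
```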
## Pip
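If you don't already have a virtual environment, one common way to create and activate one (standard-library `venv`, assumed here) is:
```bash
# Create and activate a virtual environment in the repo root
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```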
In a virtual environment, run from the root of the repo:
```bash
pip install -r requirements.txt
```
## Poetry (Alternative)
Install [Poetry](https://python-poetry.org) *globally*.
From the root of the repo, install the dependencies with:
```bash
poetry install --no-dev
```
Activate the shell with:
```bash
poetry shell
```
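Alternatively, commands can be run inside the Poetry environment without activating the shell by prefixing them with `poetry run`, for example:
```bash
poetry run python -m analyse_properties.main --help
```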


@@ -0,0 +1,59 @@
# Usage
This page documents how to run the pipeline locally to complete the task for the [dataset for 2020](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1).
The pipeline also runs in GCP using Dataflow; this is discussed later on and can be viewed [here](../dataflow/index.md). We also discuss how to adapt the pipeline so it can run against [the full dataset](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file).
## Download dataset
By default, the input data should go in `./data/input`.
For convenience, the data is publicly available in a Google Cloud Storage bucket.
Run:
```bash
wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input
```
to download the data for 2020 and place it in the input directory above.
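Alternatively, if the Google Cloud SDK is installed, the same object can be copied with `gsutil` (the `gs://` path below is inferred from the HTTPS URL above):
```bash
# Copy the 2020 dataset from the public bucket into the input directory
gsutil cp gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv data/input/
```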
## Entrypoint
The entrypoint to the pipeline is `analyse_properties.main`.
## Available options
Running
```bash
python -m analyse_properties.main --help
```
gives the following output:
```bash
usage: analyse_properties.main [-h] [--input INPUT] [--output OUTPUT]
optional arguments:
-h, --help show this help message and exit
--input INPUT Full path to the input file.
--output OUTPUT Full path to the output file without extension.
```
The default value for input is `./data/input/pp-2020.csv` and the default value for output is `./data/output/pp-2020`.
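For reference, a minimal sketch of how these options could be parsed alongside Beam's own pipeline options; the names and defaults follow the `--help` output above, but the actual implementation in `analyse_properties/main.py` may differ:
```python
# Hypothetical sketch of the option parsing in analyse_properties.main.
import argparse

from apache_beam.options.pipeline_options import PipelineOptions


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="analyse_properties.main")
    parser.add_argument(
        "--input",
        default="./data/input/pp-2020.csv",
        help="Full path to the input file.",
    )
    parser.add_argument(
        "--output",
        default="./data/output/pp-2020",
        help="Full path to the output file without extension.",
    )
    # Anything argparse doesn't recognise (e.g. --runner) is handed to Beam.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, PipelineOptions(pipeline_args)
```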
## Run the pipeline
To run the pipeline and complete the task, run:
```bash
python -m analyse_properties.main \
--runner DirectRunner \
--input ./data/input/pp-2020.csv \
--output ./data/output/pp-2020
```
from the root of the repo.
The pipeline will use the 2020 dataset located in `./data/input` and output the resulting `.json` to `./data/output`.
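For orientation, a minimal sketch of the pipeline's overall shape: read the CSV, group rows with `GroupByKey` (as mentioned in the commit history), and write JSON. The key function and output format here are placeholders, not the task's actual logic:
```python
# Sketch only: the real transforms live in analyse_properties/main.py.
import json

import apache_beam as beam


def run(input_path, output_path, pipeline_options):
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_path)
            # Placeholder key: the first CSV column stands in for the
            # real grouping key derived from the price-paid data.
            | "KeyByRow" >> beam.Map(lambda line: (line.split(",")[0], line))
            | "Group" >> beam.GroupByKey()
            | "ToJson" >> beam.Map(
                lambda kv: json.dumps({"key": kv[0], "rows": list(kv[1])})
            )
            | "Write" >> beam.io.WriteToText(output_path, file_name_suffix=".json")
        )
```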