mirror of
https://github.com/dtomlinson91/street_group_tech_test
synced 2025-12-22 03:55:43 +00:00
Merge final release (#1)
* adding initial skeleton
* updating .gitignore
* updating dev dependencies
* adding report.py
* updating notes
* adding prospector.yaml
* updating beam to install gcp extras
* adding documentation
* adding data exploration report + code
* adding latest beam pipeline code
* adding latest beam pipeline code
* adding debug.py
* adding latest beam pipeline code
* adding latest beam pipeline code
* adding latest beam pipeline code
* updating .gitignore
* updating folder structure for data input/output
* updating prospector.yaml
* adding latest beam pipeline code
* updating prospector.yaml
* migrate beam pipeline to main.py
* updating .gitignore
* updating .gitignore
* adding download script for data set
* adding initial docs
* moving inputs/outputs to use pathlib
* removing shard_name_template from output file
* adding pyenv 3.7.9
* removing requirements.txt for documentation
* updating README.md
* updating download data script for new location in GCS
* adding latest beam pipeline code for dataflow
* adding latest beam pipeline code for dataflow
* adding latest beam pipeline code for dataflow
* moving dataflow notes
* updating prospector.yaml
* adding latest beam pipeline code for dataflow
* updating beam pipeline to use GroupByKey
* updating download_data script with new bucket
* update prospector.yaml
* update dataflow documentation with new commands for vpc
* adding latest beam pipeline code for dataflow with group optimisation
* updating dataflow documentation
* adding latest beam pipeline code for dataflow with group optimisation
* updating download_data script with pp-2020 dataset
* adding temporary notes
* updating dataflow notes
* adding latest beam pipeline code
* updating dataflow notes
* adding latest beam pipeline code for dataflow
* adding debug print
* moving panda-profiling report into docs
* updating report.py
* adding entrypoint command
* adding initial docs
* adding commands.md to notes
* commenting out debug imports
* updating documentation
* updating latest beam pipeline with default inputs
* updating poetry
* adding requirements.txt
* updating documentation
31 docs/documentation/installation.md Normal file
@@ -0,0 +1,31 @@
# Installation

The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested.

The task has been tested on macOS Big Sur and WSL2. It should also run on Windows, but this has not been tested.

For Beam 2.32.0, the supported versions of the Python SDK can be found [here](https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#sdk-for-python).
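
The commit history mentions adding pyenv 3.7.9; if you use [pyenv](https://github.com/pyenv/pyenv), a matching interpreter can be obtained with a sketch like:

```bash
# Build and install Python 3.7.9, then pin it for this directory
pyenv install 3.7.9
pyenv local 3.7.9
```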

## Pip

In a virtual environment, run from the root of the repo:

```bash
pip install -r requirements.txt
```
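
If you don't already have a virtual environment, a minimal sketch using the standard library's `venv` module (assuming a Python 3.7 interpreter is available as `python3.7`):

```bash
# Create a virtual environment in ./.venv and activate it (macOS/Linux/WSL2)
python3.7 -m venv .venv
source .venv/bin/activate
```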

## Poetry (Alternative)

Install [Poetry](https://python-poetry.org) *globally*.

From the root of the repo, install the dependencies with:

```bash
poetry install --no-dev
```

Activate the shell with:

```bash
poetry shell
```
59 docs/documentation/usage.md Normal file
@@ -0,0 +1,59 @@
# Usage

This page documents how to run the pipeline locally to complete the task for the [dataset for 2020](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1).

The pipeline also runs in GCP using Dataflow; this is discussed further on and can be viewed [here](../dataflow/index.md). We also discuss how to adapt the pipeline so it can run against [the full dataset](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file).

## Download dataset

By default, the input data should go in `./data/input`.

For convenience, the data is available publicly in a GCP Cloud Storage bucket.

Run the following to download the data for 2020 and place it in the input directory above:

```bash
wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input
```
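
If `wget` is unavailable (for example on a stock macOS install), `curl` works as an alternative; a sketch assuming the same target path:

```bash
# --create-dirs creates data/input if it does not already exist
curl --create-dirs -o data/input/pp-2020.csv \
  https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv
```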

## Entrypoint

The entrypoint to the pipeline is `analyse_properties.main`.

## Available options

Running

```bash
python -m analyse_properties.main --help
```

gives the following output:

```bash
usage: analyse_properties.main [-h] [--input INPUT] [--output OUTPUT]

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    Full path to the input file.
  --output OUTPUT  Full path to the output file without extension.
```

The default value for `--input` is `./data/input/pp-2020.csv` and the default value for `--output` is `./data/output/pp-2020`.
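
Because both options have defaults, a minimal local run (assuming the 2020 dataset is already at the default input path) can rely on them entirely:

```bash
# Uses the default --input and --output paths documented above
python -m analyse_properties.main
```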

## Run the pipeline

To run the pipeline and complete the task, run the following from the root of the repo:

```bash
python -m analyse_properties.main \
  --runner DirectRunner \
  --input ./data/input/pp-2020.csv \
  --output ./data/output/pp-2020
```

The pipeline will use the 2020 dataset located in `./data/input` and output the resulting `.json` to `./data/output`.
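
To sanity-check the result, you can peek at the first few records; a sketch assuming the output lands at `data/output/pp-2020.json` (the default output path plus the `.json` extension):

```bash
# Print the first five lines of the output file
head -n 5 data/output/pp-2020.json
```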