# Running on Dataflow

The pipeline runs as is on GCP Dataflow. The following documents how I deployed it to my personal GCP account, but the approach may vary depending on the project/account in GCP.

## Prerequisites

### Cloud Storage

- A Cloud Storage bucket with the following structure:

```
./input
./output
./tmp
```

- Place the input files into the `./input` directory in the bucket.
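
The layout above can be expressed as three URIs derived from one bucket name. A minimal sketch, assuming a hypothetical helper `bucket_layout` (not part of the repository) and using the bucket and file names from this page purely as sample values:

```python
from typing import Dict


def bucket_layout(bucket: str, input_file: str, output_prefix: str) -> Dict[str, str]:
    """Return the GCS URIs for the input file, output prefix and temp directory.

    Hypothetical helper, for illustration only.
    """
    root = f"gs://{bucket}"
    return {
        "input": f"{root}/input/{input_file}",
        "output": f"{root}/output/{output_prefix}",
        "temp_location": f"{root}/tmp",
    }


# Sample values taken from this page; substitute your own bucket name.
layout = bucket_layout("street-group-technical-test-dmot-euw1", "pp-2020.csv", "pp-2020")
for name, uri in layout.items():
    print(f"{name}: {uri}")
```

The three values line up with the `--input`, `--output` and `--temp_location` options the pipeline is run with.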
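
### VPC

To get around public IP quotas, I created a VPC with a subnet in the `europe-west1` region that has `Private Google Access` turned `On`. This lets workers without public IP addresses still reach Google APIs and services such as Cloud Storage.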
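
Dataflow is pointed at the subnet through the `--subnetwork` option, which takes either the full URL or the short `regions/REGION/subnetworks/SUBNETWORK` form. A minimal sketch of how that value is put together, assuming a hypothetical helper `subnetwork_url`; the project and subnet names are the ones used in the command further down this page:

```python
def subnetwork_url(project: str, region: str, subnet: str, short: bool = False) -> str:
    """Build the value for Dataflow's --subnetwork option.

    Hypothetical helper, for illustration only.
    """
    path = f"regions/{region}/subnetworks/{subnet}"
    if short:
        # Abbreviated form, without the API host and project.
        return path
    return f"https://www.googleapis.com/compute/v1/projects/{project}/{path}"


full = subnetwork_url("street-group", "europe-west1", "europe-west-1-dataflow")
short = subnetwork_url("street-group", "europe-west1", "europe-west-1-dataflow", short=True)
print(full)
print(short)
```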

## Command

!!! tip
    We need to choose a `worker_machine_type` with sufficient memory to run the pipeline. Because the pipeline uses a mapping table, and Dataflow autoscales on CPU rather than memory usage, we need a machine with more RAM than usual to ensure there is enough memory when running on a single worker. For `pp-2020.csv`, the type `n1-highmem-2` (2 vCPUs and 13 GB of RAM) was chosen, and the job completed successfully in ~10 minutes using only 1 worker.
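
The reasoning in the tip amounts to picking the smallest machine whose memory covers the pipeline's peak usage on a single worker. A rough sketch, assuming a hypothetical requirement of 10 GB (the real figure is not stated on this page) and a small hand-written table of N1 machine types:

```python
from typing import List, Tuple

# (machine type, vCPUs, memory in GB) for a few N1 machine types.
N1_TYPES: List[Tuple[str, int, float]] = [
    ("n1-standard-1", 1, 3.75),
    ("n1-standard-2", 2, 7.5),
    ("n1-highmem-2", 2, 13.0),
    ("n1-standard-4", 4, 15.0),
    ("n1-highmem-4", 4, 26.0),
]


def pick_machine_type(required_gb: float) -> str:
    """Return the listed type with the fewest vCPUs (then least memory) that fits."""
    candidates = [t for t in N1_TYPES if t[2] >= required_gb]
    if not candidates:
        raise ValueError(f"no listed machine type has {required_gb} GB of memory")
    name, _, _ = min(candidates, key=lambda t: (t[1], t[2]))
    return name


# 10 GB is an assumed, illustrative requirement, not a measured one.
print(pick_machine_type(10))
```

With these assumptions, a highmem type is selected over a standard type with the same vCPU count because only the former has enough memory.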

Assuming the `pp-2020.csv` file has been placed in the `./input` directory in the bucket, you can run a command similar to:

```bash
python -m analyse_properties.main \
    --runner DataflowRunner \
    --project street-group \
    --region europe-west1 \
    --input gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv \
    --output gs://street-group-technical-test-dmot-euw1/output/pp-2020 \
    --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
    --no_use_public_ips \
    --worker_machine_type=n1-highmem-2
```
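
A command like this mixes pipeline-specific flags (`--input`, `--output`) with Dataflow options. A common Beam pattern, sketched here without importing Beam, is to parse the known flags with `argparse` and forward the remainder to `PipelineOptions`. This is an assumption about how `analyse_properties.main` might handle its arguments, not a copy of it; `split_args` is a hypothetical name:

```python
import argparse
from typing import List, Optional, Tuple


def split_args(argv: Optional[List[str]] = None) -> Tuple[argparse.Namespace, List[str]]:
    """Separate the pipeline's own flags from the runner options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Input file (local path or gs:// URI)")
    parser.add_argument("--output", required=True, help="Output file prefix")
    # Flags argparse does not recognise are returned untouched; in a Beam
    # pipeline they would typically be passed to PipelineOptions(pipeline_args).
    return parser.parse_known_args(argv)


known, pipeline_args = split_args([
    "--runner", "DataflowRunner",
    "--region", "europe-west1",
    "--input", "gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv",
    "--output", "gs://street-group-technical-test-dmot-euw1/output/pp-2020",
    "--no_use_public_ips",
    "--worker_machine_type=n1-highmem-2",
])
print(known.input)
print(pipeline_args)
```

Keeping the split in one place means new Dataflow options can be added on the command line without touching the pipeline code.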

The output file from this pipeline is publicly available and can be downloaded [here](https://storage.googleapis.com/street-group-technical-test-dmot-euw1/output/pp-2020-00000-of-00001.json).
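
The `-00000-of-00001` part of the file name comes from Beam's default shard naming template, `-SSSSS-of-NNNNN`, applied to the `--output` prefix. A small sketch of that naming scheme, assuming a hypothetical helper `shard_file_name` and the `.json` suffix seen in the link above:

```python
def shard_file_name(prefix: str, shard: int, num_shards: int, suffix: str = "") -> str:
    """Mimic the '-SSSSS-of-NNNNN' shard template: zero-padded, five digits each.

    Hypothetical helper, for illustration only; Beam does this internally.
    """
    return f"{prefix}-{shard:05d}-of-{num_shards:05d}{suffix}"


name = shard_file_name("pp-2020", 0, 1, ".json")
print(name)
```

A single worker writing a single shard therefore produces exactly one file under the `./output` directory.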

The job graph for this pipeline is displayed below:

