# Dataflow

Quickstart: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python
## Examples

Full example of a Beam pipeline on Dataflow:
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
## Setup

Export the service account credentials environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/home/dtomlinson/git-repos/work/street_group/street_group_tech_test/street-group-0c490d23a9d0.json"
```
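As a quick check that the credentials are being picked up, the Google client libraries resolve `GOOGLE_APPLICATION_CREDENTIALS` through Application Default Credentials. A minimal sketch (not part of the repository):

```python
# Sanity-check Application Default Credentials before launching a job.
# google.auth resolves GOOGLE_APPLICATION_CREDENTIALS automatically.
import google.auth

credentials, project = google.auth.default()
print(f"Authenticated against project: {project}")
```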
## Run pipeline

### Dataflow

#### Yearly dataset

```bash
python -m analyse_properties.main \
    --runner DataflowRunner \
    --project street-group \
    --region europe-west1 \
    --input gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv \
    --output gs://street-group-technical-test-dmot-euw1/output/pp-2020 \
    --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
    --no_use_public_ips \
    --worker_machine_type=n1-highmem-2
```
#### Full dataset

```bash
python -m analyse_properties.main \
    --runner DataflowRunner \
    --project street-group \
    --region europe-west1 \
    --input gs://street-group-technical-test-dmot-euw1/input/pp-complete.csv \
    --output gs://street-group-technical-test-dmot-euw1/output/pp-complete \
    --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
    --no_use_public_ips \
    --worker_machine_type=n1-highmem-8 \
    --num_workers=3 \
    --autoscaling_algorithm=NONE
```
### Locally

Run the pipeline locally:

```bash
python -m analyse_properties.main --runner DirectRunner
```
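Both runners share the same entrypoint. Below is a hypothetical sketch of how `analyse_properties/main.py` might wire the flags above into a Beam pipeline; the `--input`/`--output` defaults and transform names are assumptions for illustration, not the repository's actual code:

```python
# Hypothetical sketch of an analyse_properties.main entrypoint.
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', default='data/input/pp-2020.csv')
    parser.add_argument('--output', default='data/output/pp-2020')
    known_args, pipeline_args = parser.parse_known_args(argv)

    # Flags this parser does not know about (--runner, --project,
    # --region, --temp_location, ...) pass straight through to
    # PipelineOptions, so the same entrypoint serves DirectRunner
    # and DataflowRunner.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'Read' >> beam.io.ReadFromText(known_args.input)
            | 'Write' >> beam.io.WriteToText(known_args.output)
        )


if __name__ == '__main__':
    run()
```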
## Errors

Unsubscriptable error on window.
## Documentation

Running in a private VPC without public IPs:

- https://stackoverflow.com/questions/58893082/which-compute-engine-quotas-need-to-be-updated-to-run-dataflow-with-50-workers
- https://cloud.google.com/dataflow/docs/guides/specifying-networks#subnetwork_parameter
Error help:

- https://cloud.google.com/dataflow/docs/guides/common-errors
- https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
## Scaling

- Using Dataflow Prime: https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime#enable-prime
  Enable it with `--experiments=enable_prime` (see the sketch below).
- Deploying a pipeline (with scaling options): https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline
- Available VM types (with pricing): https://cloud.google.com/compute/vm-instance-pricing#n1_predefined
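The Prime experiment can also be set programmatically when building pipeline options. A minimal sketch, assuming the flags from the commands above:

```python
# Enable Dataflow Prime from code instead of the command line.
from apache_beam.options.pipeline_options import DebugOptions, PipelineOptions

options = PipelineOptions(['--runner=DataflowRunner', '--project=street-group'])
options.view_as(DebugOptions).add_experiment('enable_prime')
```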
## Performance

Side input performance: https://stackoverflow.com/questions/48242320/google-dataflow-apache-beam-python-side-input-from-pcollection-kills-perform

Common use cases:

- Part 1: https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
- Part 2: https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-2

Slow-updating side inputs: https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs
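For reference, a minimal side-input sketch: a small lookup is broadcast to every element of the main PCollection with `beam.pvalue.AsDict`. The data here is illustrative only; large side inputs are where the performance problems in the links above tend to show up, and big keyed joins are usually better served by `CoGroupByKey`:

```python
# Broadcast a small lookup dict to every element via a side input.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    lookup = pipeline | 'Lookup' >> beam.Create([('SW1A', 'London')])
    codes = pipeline | 'Codes' >> beam.Create(['SW1A', 'M1'])

    joined = codes | 'Join' >> beam.Map(
        lambda code, table: (code, table.get(code, 'unknown')),
        table=beam.pvalue.AsDict(lookup),
    )
    joined | 'Print' >> beam.Map(print)
```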