Files
street_group_tech_test/notes/documentation/dataflow.md
dtomlinson91 80376a662e Merge final release (#1)
* adding initial skeleton

* updating .gitignore

* updating dev dependencies

* adding report.py

* updating notes

* adding prospector.yaml

* updating beam to install gcp extras

* adding documentation

* adding data exploration report + code

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding debug.py

* adding latesty beam pipeline code

* adding latest beam pipeline code

* adding latest beam pipeline code

* updating .gitignore

* updating folder structure for data input/output

* updating prospector.yaml

* adding latest beam pipeline code

* updating prospector.yaml

* migrate beam pipeline to main.py

* updating .gitignore

* updating .gitignore

* adding download script for data set

* adding initial docs

* moving inputs/outputs to use pathlib

* removing shard_name_template from output file

* adding pyenv 3.7.9

* removing requirements.txt for documentation

* updating README.md

* updating download data script for new location in GCS

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* moving dataflow notes

* updating prospector.yaml

* adding latest beam pipeline code for dataflow

* updating beam pipeline to use GroupByKey

* updating download_data script with new bucket

* update prospector.yaml

* update dataflow documentation with new commands for vpc

* adding latest beam pipeline code for dataflow with group optimisation

* updating dataflow documentation

* adding latest beam pipeline code for dataflow with group optimisation

* updating download_data script with pp-2020 dataset

* adding temporary notes

* updating dataflow notes

* adding latest beam pipeline code

* updating dataflow notes

* adding latest beam pipeline code for dataflow

* adding debug print

* moving panda-profiling report into docs

* updating report.py

* adding entrypoint command

* adding initial docs

* adding commands.md to notes

* commenting out debug imports

* updating documentation

* updating latest beam pipeline with default inputs

* updating poetry

* adding requirements.txt

* updating documentation
2021-09-28 00:31:09 +01:00

3.2 KiB

DataFlow

https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python

Examples

Full example of beam pipeline on dataflow:

https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset

Setup

Export env variable:

export GOOGLE_APPLICATION_CREDENTIALS="/home/dtomlinson/git-repos/work/street_group/street_group_tech_test/street-group-0c490d23a9d0.json"

Run pipeline

Dataflow

Yearly dataset

python -m analyse_properties.main \
    --runner DataflowRunner \
    --project street-group \
    --region europe-west1 \
    --input gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv \
    --output gs://street-group-technical-test-dmot-euw1/output/pp-2020 \
    --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
    --no_use_public_ips \
    --worker_machine_type=n1-highmem-2

Full dataset

python -m analyse_properties.main \
    --region europe-west1 \
    --input gs://street-group-technical-test-dmot-euw1/input/pp-complete.csv \
    --output gs://street-group-technical-test-dmot-euw1/output/pp-complete \
    --runner DataflowRunner \
    --project street-group \
    --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
    --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
    --no_use_public_ips \
    --worker_machine_type=n1-highmem-8 \
    --num_workers=3 \
    --autoscaling_algorithm=NONE

Locally

Run the pipeline locally:

python -m analyse_properties.main --runner DirectRunner

Errors

Unsubscriptable error on window:

https://stackoverflow.com/questions/42276520/what-does-object-of-type-unwindowedvalues-has-no-len-mean

Documentation

Running in its own private VPC without public IPs

Error help

Scaling

Using DataFlowPrime: https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime#enable-prime Use --experiments=enable_prime

Deploying a pipeline (with scaling options): https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline

Available VM types (with pricing): https://cloud.google.com/compute/vm-instance-pricing#n1_predefined

Performance

Sideinput performance: https://stackoverflow.com/questions/48242320/google-dataflow-apache-beam-python-side-input-from-pcollection-kills-perform

Common use cases:

Side inputs: https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs