updating dataflow notes

2021-09-27 03:18:49 +01:00
parent f2ed60426d
commit f60beb4565

@@ -18,18 +18,19 @@ Export env variable:
### Dataflow ### Dataflow
#### Monthly dataset #### Yearly dataset
```bash ```bash
python -m analyse_properties.main \ python -m analyse_properties.main \
--region europe-west1 \ --region europe-west1 \
--input gs://street-group-technical-test-dmot-euw1/input/pp-monthly-update-new-version.csv \ --input gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv \
--output gs://street-group-technical-test-dmot-euw1/output/pp-monthly-update-new-version \ --output gs://street-group-technical-test-dmot-euw1/output/pp-2020 \
--runner DataflowRunner \ --runner DataflowRunner \
--project street-group \ --project street-group \
--temp_location gs://street-group-technical-test-dmot-euw1/tmp \ --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
--subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \ --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
--no_use_public_ips --no_use_public_ips \
--worker_machine_type=n1-highmem-2
``` ```
#### Full dataset #### Full dataset
@@ -44,13 +45,11 @@ python -m analyse_properties.main \
--temp_location gs://street-group-technical-test-dmot-euw1/tmp \ --temp_location gs://street-group-technical-test-dmot-euw1/tmp \
--subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \ --subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
--no_use_public_ips \ --no_use_public_ips \
--worker_machine_type=n1-highmem-8 --worker_machine_type=n1-highmem-8 \
--num_workers=3 \
--autoscaling_algorithm=NONE
``` ```
—-disk_size_gb=50 \
### Locally ### Locally
Run the pipeline locally: Run the pipeline locally:
@@ -83,3 +82,7 @@ Use `--experiments=enable_prime`
Deploying a pipeline (with scaling options): <https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline> Deploying a pipeline (with scaling options): <https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline>
Available VM types (with pricing): <https://cloud.google.com/compute/vm-instance-pricing#n1_predefined> Available VM types (with pricing): <https://cloud.google.com/compute/vm-instance-pricing#n1_predefined>
Performance
Sideinput performance: <https://stackoverflow.com/questions/48242320/google-dataflow-apache-beam-python-side-input-from-pcollection-kills-perform>
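
The Dataflow commands in these notes share most of their flags and differ only in the dataset name and the worker options. As a rough sketch (not part of the repository), the shared flags could be kept in one place and the command assembled per run. The helper name `build_command` and the `COMMON_FLAGS` dict are hypothetical; the flag names and values are copied from the commands above. The sketch emits the `--name=value` form, which argparse-style parsers generally treat the same as `--name value`.

```python
import shlex

# Bucket used for input, output and temp files in the commands above.
BUCKET = "gs://street-group-technical-test-dmot-euw1"

# Flags that are identical in every Dataflow run in these notes.
# A value of True marks a switch that takes no argument.
COMMON_FLAGS = {
    "region": "europe-west1",
    "runner": "DataflowRunner",
    "project": "street-group",
    "temp_location": f"{BUCKET}/tmp",
    "subnetwork": (
        "https://www.googleapis.com/compute/v1/projects/street-group"
        "/regions/europe-west1/subnetworks/europe-west-1-dataflow"
    ),
    "no_use_public_ips": True,
}


def build_command(dataset, extra_flags):
    """Return the argv list for one run (hypothetical helper).

    `dataset` is the file stem, e.g. "pp-2020".
    `extra_flags` holds the per-run worker options.
    """
    flags = dict(COMMON_FLAGS)
    flags["input"] = f"{BUCKET}/input/{dataset}.csv"
    flags["output"] = f"{BUCKET}/output/{dataset}"
    flags.update(extra_flags)

    argv = ["python", "-m", "analyse_properties.main"]
    for name, value in flags.items():
        # Switches are emitted bare; everything else as --name=value.
        argv.append(f"--{name}" if value is True else f"--{name}={value}")
    return argv


# Yearly dataset: one worker option on top of the shared flags.
yearly = build_command("pp-2020", {"worker_machine_type": "n1-highmem-2"})
print(shlex.join(yearly))
```

For the full dataset, the same helper would take `{"worker_machine_type": "n1-highmem-8", "num_workers": 3, "autoscaling_algorithm": "NONE"}` as `extra_flags`.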