Merge final release (#1)

* adding initial skeleton

* updating .gitignore

* updating dev dependencies

* adding report.py

* updating notes

* adding prospector.yaml

* updating beam to install gcp extras

* adding documentation

* adding data exploration report + code

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding debug.py

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding latest beam pipeline code

* updating .gitignore

* updating folder structure for data input/output

* updating prospector.yaml

* adding latest beam pipeline code

* updating prospector.yaml

* migrate beam pipeline to main.py

* updating .gitignore

* updating .gitignore

* adding download script for data set

* adding initial docs

* moving inputs/outputs to use pathlib

* removing shard_name_template from output file

* adding pyenv 3.7.9

* removing requirements.txt for documentation

* updating README.md

* updating download data script for new location in GCS

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* moving dataflow notes

* updating prospector.yaml

* adding latest beam pipeline code for dataflow

* updating beam pipeline to use GroupByKey

* updating download_data script with new bucket

* update prospector.yaml

* update dataflow documentation with new commands for vpc

* adding latest beam pipeline code for dataflow with group optimisation

* updating dataflow documentation

* adding latest beam pipeline code for dataflow with group optimisation

* updating download_data script with pp-2020 dataset

* adding temporary notes

* updating dataflow notes

* adding latest beam pipeline code

* updating dataflow notes

* adding latest beam pipeline code for dataflow

* adding debug print

* moving pandas-profiling report into docs

* updating report.py

* adding entrypoint command

* adding initial docs

* adding commands.md to notes

* commenting out debug imports

* updating documentation

* updating latest beam pipeline with default inputs

* updating poetry

* adding requirements.txt

* updating documentation
This commit is contained in:
dtomlinson91
2021-09-28 00:31:09 +01:00
committed by GitHub
parent 8a22bfebe1
commit 80376a662e
34 changed files with 5667 additions and 1 deletions

6
notes/commands.md Normal file

@@ -0,0 +1,6 @@
# Commands
## mkdocs
`mkdocs serve`
`mkdocs gh-deploy`


@@ -0,0 +1,7 @@
# Answers
## CSV
Read a csv file into Beam:
<https://stackoverflow.com/a/41171867>
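
A minimal sketch of the per-line parsing step, assuming the file is read line by line (for example with `beam.io.ReadFromText`) and each line is then split with the standard `csv` module. The helper name `parse_csv_line` is hypothetical and is not taken from the pipeline code or the linked answer:

```python
import csv


def parse_csv_line(line: str) -> list:
    """Parse one CSV-formatted line into a list of field strings."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    # Quoted fields that contain commas are kept together as one field.
    return next(csv.reader([line]))


# Hypothetical sample line, shaped like the quoted rows in the price paid data.
sample = '"317500","2020-11-13 00:00","B90 3LA"'
fields = parse_csv_line(sample)
```

In a pipeline this function could be applied with `beam.Map(parse_csv_line)` after the read step. Reading line by line would not handle fields that contain embedded newlines.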


@@ -0,0 +1,12 @@
# Beam Documentation
## Transforms
FlatMap:
<https://beam.apache.org/documentation/transforms/python/elementwise/flatmap/>
I/O Transforms:
<https://beam.apache.org/documentation/io/built-in/>


@@ -0,0 +1,95 @@
# DataFlow
<https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python>
## Examples
Full example of beam pipeline on dataflow:
<https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset>
## Setup
Export env variable:
`export GOOGLE_APPLICATION_CREDENTIALS="/home/dtomlinson/git-repos/work/street_group/street_group_tech_test/street-group-0c490d23a9d0.json"`
## Run pipeline
### Dataflow
#### Yearly dataset
```bash
python -m analyse_properties.main \
--runner DataflowRunner \
--project street-group \
--region europe-west1 \
--input gs://street-group-technical-test-dmot-euw1/input/pp-2020.csv \
--output gs://street-group-technical-test-dmot-euw1/output/pp-2020 \
--temp_location gs://street-group-technical-test-dmot-euw1/tmp \
--subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
--no_use_public_ips \
--worker_machine_type=n1-highmem-2
```
#### Full dataset
```bash
python -m analyse_properties.main \
--region europe-west1 \
--input gs://street-group-technical-test-dmot-euw1/input/pp-complete.csv \
--output gs://street-group-technical-test-dmot-euw1/output/pp-complete \
--runner DataflowRunner \
--project street-group \
--temp_location gs://street-group-technical-test-dmot-euw1/tmp \
--subnetwork=https://www.googleapis.com/compute/v1/projects/street-group/regions/europe-west1/subnetworks/europe-west-1-dataflow \
--no_use_public_ips \
--worker_machine_type=n1-highmem-8 \
--num_workers=3 \
--autoscaling_algorithm=NONE
```
### Locally
Run the pipeline locally:
`python -m analyse_properties.main --runner DirectRunner`
## Errors
Unsubscriptable error on window:
<https://stackoverflow.com/questions/42276520/what-does-object-of-type-unwindowedvalues-has-no-len-mean>
## Documentation
Running in its own private VPC without public IPs
- <https://stackoverflow.com/questions/58893082/which-compute-engine-quotas-need-to-be-updated-to-run-dataflow-with-50-workers>
- <https://cloud.google.com/dataflow/docs/guides/specifying-networks#subnetwork_parameter>
Error help
- <https://cloud.google.com/dataflow/docs/guides/common-errors>
- <https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline>
Scaling
Using Dataflow Prime: <https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime#enable-prime>
Use `--experiments=enable_prime`
Deploying a pipeline (with scaling options): <https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline>
Available VM types (with pricing): <https://cloud.google.com/compute/vm-instance-pricing#n1_predefined>
Performance
Side input performance: <https://stackoverflow.com/questions/48242320/google-dataflow-apache-beam-python-side-input-from-pcollection-kills-perform>
Common use cases:
- Part 1 <https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1>
- Part 2 <https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-2>
Side inputs: <https://cloud.google.com/architecture/e-commerce/patterns/slow-updating-side-inputs>

5
notes/links.md Normal file

@@ -0,0 +1,5 @@
# Links
## Data
<https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads>

27
notes/tmp/errordata Normal file

@@ -0,0 +1,27 @@
"Error message from worker: Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 651, in do_work
work_executor.execute()
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 181, in execute
op.finish()
File "dataflow_worker/native_operations.py", line 93, in dataflow_worker.native_operations.NativeWriteOperation.finish
File "dataflow_worker/native_operations.py", line 94, in dataflow_worker.native_operations.NativeWriteOperation.finish
File "dataflow_worker/native_operations.py", line 95, in dataflow_worker.native_operations.NativeWriteOperation.finish
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/nativeavroio.py", line 308, in __exit__
self._data_file_writer.flush()
File "fastavro/_write.pyx", line 664, in fastavro._write.Writer.flush
File "fastavro/_write.pyx", line 639, in fastavro._write.Writer.dump
File "fastavro/_write.pyx", line 451, in fastavro._write.snappy_write_block
File "fastavro/_write.pyx", line 458, in fastavro._write.snappy_write_block
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filesystemio.py", line 200, in write
self._uploader.put(b)
File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/gcsio.py", line 720, in put
self._conn.send_bytes(data.tobytes())
File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
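
The last frames of the traceback show `multiprocessing/connection.py` packing the payload size `n` as a signed 32-bit integer with `struct.pack("!i", n)`. That suggests a single write larger than 2147483647 bytes (just under 2 GiB) was sent. A minimal sketch that reproduces the limit with the standard library only; the helper name `fits_in_int32_header` is hypothetical:

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error message


def fits_in_int32_header(n: int) -> bool:
    """Return True if n can be packed with the "!i" format from the traceback."""
    try:
        struct.pack("!i", n)
    except struct.error:
        # Raised when n is outside the signed 32-bit range.
        return False
    return True


largest_ok = fits_in_int32_header(INT32_MAX)
one_too_many = fits_in_int32_header(INT32_MAX + 1)
```

If this reading is right, keeping each written buffer well below 2 GiB, for example by producing smaller output elements, would avoid the overflow.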
"
"Out of memory: Killed process 2042 (python) total-vm:28616496kB, anon-rss:25684136kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:51284kB oom_score_adj:900"

44
notes/tmp/exampledata Normal file

@@ -0,0 +1,44 @@
[{
"property_id": "3cf3c06632c46754696f2017933702f3",
"flat_appartment": "",
"builing": "",
"number": "63",
"street": "ROTTON PARK STREET",
"locality": "",
"town": "BIRMINGHAM",
"district": "BIRMINGHAM",
"county": "WEST MIDLANDS",
"postcode": "B16 0AE",
"property_transactions": [
{ "price": "385000", "transaction_date": "2021-01-08", "year": "2021" },
{ "price": "701985", "transaction_date": "2019-03-28", "year": "2019" },
{ "price": "1748761", "transaction_date": "2020-05-27", "year": "2020" }
],
"latest_transaction_year": "2021"
},
{
"property_id": "c650d5d7bb0daf0a19bb2cacabbee74e",
"readable_address": "16 STATION ROAD\nPARKGATE\nNESTON\nCHESHIRE WEST AND CHESTER\nCH64 6QJ",
"flat_appartment": "",
"builing": "",
"number": "16",
"street": "STATION ROAD",
"locality": "PARKGATE",
"town": "NESTON",
"district": "CHESHIRE WEST AND CHESTER",
"county": "CHESHIRE WEST AND CHESTER",
"postcode": "CH64 6QJ",
"property_transactions": [
{
"price": "280000",
"transaction_date": "2020-11-30",
"year": "2020"
},
{
"price": "265000",
"transaction_date": "2020-05-29",
"year": "2020"
}
],
"latest_transaction_year": "2020"
}]
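
In both sample records, `latest_transaction_year` matches the largest `year` among the entries in `property_transactions`. A minimal sketch of that derivation, assuming this interpretation is correct. The function below is hypothetical and is not taken from the pipeline code:

```python
def latest_transaction_year(record: dict) -> str:
    """Return the most recent transaction year of a property record as a string."""
    years = [t["year"] for t in record["property_transactions"]]
    # Years are stored as strings in the sample data, so compare them numerically.
    return max(years, key=int)


# Trimmed copy of the first sample record above (address fields omitted).
record = {
    "property_id": "3cf3c06632c46754696f2017933702f3",
    "property_transactions": [
        {"price": "385000", "transaction_date": "2021-01-08", "year": "2021"},
        {"price": "701985", "transaction_date": "2019-03-28", "year": "2019"},
        {"price": "1748761", "transaction_date": "2020-05-27", "year": "2020"},
    ],
}
latest = latest_transaction_year(record)
```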

16
notes/tmp/runningdata Normal file

@@ -0,0 +1,16 @@
Create Mapping table
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
Condensing
{'fd4634faec47c29de40bbf7840723b41': ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', '']}
Prepared
GroupByKey
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41', 'fd4634faec47c29de40bbf7840723b41'])
deduplicated
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41'])
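
The trace above groups transaction IDs under a property key and then drops repeated IDs. A pure-Python sketch of those two steps, assuming that reading of the output. It is a stand-in for illustration, not Beam's `GroupByKey` and not the pipeline's own code, and the function names are hypothetical:

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def deduplicate(values):
    """Drop repeated values while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


# The same (property key, transaction id) pair appears twice, as in the trace.
pairs = [
    ("fe205bfe66bc7f18c50c8f3d77ec3e30", "fd4634faec47c29de40bbf7840723b41"),
    ("fe205bfe66bc7f18c50c8f3d77ec3e30", "fd4634faec47c29de40bbf7840723b41"),
]
grouped = group_by_key(pairs)
deduplicated = {key: deduplicate(ids) for key, ids in grouped.items()}
```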