Merge final release (#1)

* adding initial skeleton

* updating .gitignore

* updating dev dependencies

* adding report.py

* updating notes

* adding prospector.yaml

* updating beam to install gcp extras

* adding documentation

* adding data exploration report + code

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding debug.py

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding latest beam pipeline code

* updating .gitignore

* updating folder structure for data input/output

* updating prospector.yaml

* adding latest beam pipeline code

* updating prospector.yaml

* migrate beam pipeline to main.py

* updating .gitignore

* updating .gitignore

* adding download script for data set

* adding initial docs

* moving inputs/outputs to use pathlib

* removing shard_name_template from output file

* adding pyenv 3.7.9

* removing requirements.txt for documentation

* updating README.md

* updating download data script for new location in GCS

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* moving dataflow notes

* updating prospector.yaml

* adding latest beam pipeline code for dataflow

* updating beam pipeline to use GroupByKey

* updating download_data script with new bucket

* update prospector.yaml

* update dataflow documentation with new commands for vpc

* adding latest beam pipeline code for dataflow with group optimisation

* updating dataflow documentation

* adding latest beam pipeline code for dataflow with group optimisation

* updating download_data script with pp-2020 dataset

* adding temporary notes

* updating dataflow notes

* adding latest beam pipeline code

* updating dataflow notes

* adding latest beam pipeline code for dataflow

* adding debug print

* moving pandas-profiling report into docs

* updating report.py

* adding entrypoint command

* adding initial docs

* adding commands.md to notes

* commenting out debug imports

* updating documentation

* updating latest beam pipeline with default inputs

* updating poetry

* adding requirements.txt

* updating documentation
Author: dtomlinson91
Date: 2021-09-28 00:31:09 +01:00
Committed by: GitHub
Parent: 8a22bfebe1
Commit: 80376a662e
34 changed files with 5667 additions and 1 deletion


@@ -0,0 +1,30 @@
# Data Exploration Report

A brief exploration was done on the **full** dataset using the `pandas-profiling` module. The module uses `pandas` to load a dataset and automatically produces quantile/descriptive statistics, common values, extreme values, skew, kurtosis, etc., and generates an `.html` report that can be viewed interactively in your browser.

The script used to generate this report is located in `./exploration/report.py` and can be viewed below.
<details>
<summary>report.py</summary>

```python
--8<-- "exploration/report.py"
```

</details>
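For readers browsing outside the rendered docs (where the snippet include above is expanded), a minimal `pandas-profiling` report script looks roughly like the sketch below. The input path, report title, and output location are illustrative assumptions, not taken from the repository's `report.py`:

```python
from pathlib import Path

import pandas as pd
from pandas_profiling import ProfileReport

# Illustrative paths/title -- the real exploration/report.py may differ.
DATA_PATH = Path("./data/input/pp-complete.csv")
OUTPUT_PATH = Path("./docs/data-exploration-report.html")

df = pd.read_csv(DATA_PATH)

# Build the profiling report (descriptive statistics, common/extreme values,
# skew, kurtosis, missing-value counts) and write it out as interactive HTML.
profile = ProfileReport(df, title="Price Paid Data Exploration")
profile.to_file(OUTPUT_PATH)
```

Whatever the exact loading logic in the real script, the `ProfileReport(df, ...).to_file(...)` pattern is the core of `pandas-profiling` usage.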
The report can be viewed by clicking the Data Exploration Report tab at the top of the page.

## Interesting observations

When looking at the report we are mainly checking data quality and looking for missing observations. The statistics are interesting to see but are largely irrelevant for this task.

The data overall looks very good for a dataset of its size (~27 million records). For the most important fields there are no missing values:

- Every row has a price.
- Every row has a unique transaction ID.
- Every row has a transaction date.

Some fields that we will need do have missing data (a quick `pandas` check that reproduces these counts is sketched after this list):

- ~42,000 (0.2%) are missing a Postcode.
- ~4,000 (<0.1%) are missing a PAON (primary addressable object name).
- ~412,000 (1.6%) are missing a Street Name.
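As a rough way to reproduce these counts outside the profiling report, the missing-value summary can be computed directly with `pandas`. This is a minimal sketch assuming a local CSV copy of the dataset with named columns; the path is hypothetical and should be adjusted to wherever the download script places the data:

```python
from pathlib import Path

import pandas as pd

# Hypothetical local copy of the price paid dataset; adjust to the real location.
DATA_PATH = Path("./data/input/pp-complete.csv")

df = pd.read_csv(DATA_PATH)

# Missing values per column as a count and a percentage of all rows,
# mirroring the postcode/PAON/street figures quoted above.
missing_counts = df.isna().sum()
missing_pct = (missing_counts / len(df) * 100).round(2)

summary = pd.DataFrame({"missing": missing_counts, "missing_%": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

For the full ~27 million row dataset this loads everything into memory, so reading only the columns of interest (or a sample) is a sensible adjustment for a quick check.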