mirror of
https://github.com/dtomlinson91/street_group_tech_test
synced 2025-12-22 11:55:45 +00:00
Merge final release (#1)
* adding initial skeleton
* updating .gitignore
* updating dev dependencies
* adding report.py
* updating notes
* adding prospector.yaml
* updating beam to install gcp extras
* adding documentation
* adding data exploration report + code
* adding latest beam pipeline code
* adding latest beam pipeline code
* adding debug.py
* adding latest beam pipeline code
* adding latest beam pipeline code
* adding latest beam pipeline code
* updating .gitignore
* updating folder structure for data input/output
* updating prospector.yaml
* adding latest beam pipeline code
* updating prospector.yaml
* migrate beam pipeline to main.py
* updating .gitignore
* updating .gitignore
* adding download script for data set
* adding initial docs
* moving inputs/outputs to use pathlib
* removing shard_name_template from output file
* adding pyenv 3.7.9
* removing requirements.txt for documentation
* updating README.md
* updating download data script for new location in GCS
* adding latest beam pipeline code for dataflow
* adding latest beam pipeline code for dataflow
* adding latest beam pipeline code for dataflow
* moving dataflow notes
* updating prospector.yaml
* adding latest beam pipeline code for dataflow
* updating beam pipeline to use GroupByKey
* updating download_data script with new bucket
* update prospector.yaml
* update dataflow documentation with new commands for vpc
* adding latest beam pipeline code for dataflow with group optimisation
* updating dataflow documentation
* adding latest beam pipeline code for dataflow with group optimisation
* updating download_data script with pp-2020 dataset
* adding temporary notes
* updating dataflow notes
* adding latest beam pipeline code
* updating dataflow notes
* adding latest beam pipeline code for dataflow
* adding debug print
* moving pandas-profiling report into docs
* updating report.py
* adding entrypoint command
* adding initial docs
* adding commands.md to notes
* commenting out debug imports
* updating documentation
* updating latest beam pipeline with default inputs
* updating poetry
* adding requirements.txt
* updating documentation
30
docs/discussion/exploration.md
Normal file
@@ -0,0 +1,30 @@
# Data Exploration Report

A brief exploration was done on the **full** dataset using the `pandas-profiling` module. The module uses `pandas` to load a dataset and automatically computes quantile and descriptive statistics, common values, extreme values, skew, kurtosis, etc. It then produces an `.html` report that can be viewed interactively in your browser.

The script used to generate this report is located in `./exploration/report.py` and can be viewed below.

<details>
<summary>report.py</summary>
```python
--8<-- "exploration/report.py"
```
</details>

The report can be viewed by clicking the Data Exploration Report tab at the top of the page.

## Interesting observations

When reading the report we are mainly checking data quality and looking for missing observations. The statistics are interesting to see but are largely irrelevant for this task.

The data overall looks very good for a dataset of its size (~27 million records). For important fields there are no missing values:

- Every row has a price.
- Every row has a unique transaction ID.
- Every row has a transaction date.

Some fields that we will need are missing data:

- ~42,000 rows (0.2%) are missing a Postcode.
- ~4,000 rows (<0.1%) are missing a PAON (primary addressable object name).
- ~412,000 rows (1.6%) are missing a Street Name.
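
The figures above come from the profiling report. As a rough cross-check, similar counts could be recomputed by streaming the CSV with the standard library. The sketch below is illustrative only and is not the code in `report.py`: the function name `summarise_missing`, the column positions in `COLUMNS` (which assume a headerless CSV), and the three sample rows are all assumptions made for the example.

```python
import csv
import io

# ASSUMPTION: zero-based column positions in a headerless CSV.
# Adjust these to match the real file layout.
COLUMNS = {
    "transaction_id": 0,
    "price": 1,
    "date": 2,
    "postcode": 3,
    "paon": 7,
    "street": 9,
}


def summarise_missing(rows, columns=COLUMNS):
    """Return (row count, empty-value count per field, repeated transaction IDs)."""
    total = 0
    duplicate_ids = 0
    missing = {name: 0 for name in columns}
    seen_ids = set()  # memory-hungry for tens of millions of rows
    for row in rows:
        total += 1
        for name, index in columns.items():
            # Treat a short row or a blank cell as a missing value.
            value = row[index].strip() if index < len(row) else ""
            if not value:
                missing[name] += 1
            elif name == "transaction_id":
                if value in seen_ids:
                    duplicate_ids += 1
                seen_ids.add(value)
    return total, missing, duplicate_ids


# Made-up sample rows, for illustration only.
SAMPLE = (
    '"{A}","250000","2020-01-10 00:00","AB1 2CD","D","N","F","12","","HIGH STREET"\n'
    '"{B}","180000","2020-02-03 00:00","","T","N","F","4","","MILL LANE"\n'
    '"{C}","320000","2020-03-21 00:00","EF3 4GH","S","N","F","","",""\n'
)

total, missing, duplicate_ids = summarise_missing(csv.reader(io.StringIO(SAMPLE)))
for name, count in missing.items():
    print(f"{name}: {count} missing ({count / total:.1%})")
print(f"repeated transaction IDs: {duplicate_ids}")
```

For the real file, the reader would be built from an open file handle instead of `io.StringIO`. Streaming row by row keeps memory use low, apart from the set of transaction IDs, which may need a different approach at this scale.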