adding initial docs

This commit is contained in:
2021-09-27 21:19:28 +01:00
parent a73d7b74a4
commit cbb8a7e237
9 changed files with 261 additions and 18 deletions


@@ -0,0 +1 @@
# Approach

docs/discussion/cleaning.md

@@ -0,0 +1,115 @@
# Cleaning
In this page we discuss the cleaning stages and how best to prepare the data.
## Uniquely identifying a property
With the data we have, a property can be uniquely identified by its Postcode together with the PAON (or the SAON, or a combination of both).
### Postcode
Because so few properties are missing a postcode (0.2% of all records), we drop all rows that do not have one. We will lose some properties that could be identified uniquely with a bit more work, but the properties missing a postcode tend to be unusual/commercial/industrial (e.g. a power plant).
### PAON/SAON
The PAON has 3 possible formats:
- The street number.
- The building name.
- The building name and street number (comma delimited).
The SAON:
- Identifies the apartment/flat number for the building.
- If the SAON is present (in only 11.7% of rows) then the PAON will be either:
    - The building name.
    - The building name and street number.
Because of the way the PAON and SAON are defined, we drop any row that is missing **both** of these columns, since the postcode alone is generally not enough to uniquely identify a property.
!!! tip
    In a production environment we could send these rows to a sink table (in BigQuery, for example) rather than drop them outright. Collecting these rows over time might reveal patterns in how properties missing these fields could still be uniquely identified.
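A minimal sketch of how these two drop rules (missing postcode, missing both PAON and SAON) could be expressed as a Beam filter; the column positions are assumptions for illustration, not necessarily the pipeline's actual layout:
```python
import apache_beam as beam

# Assumed column positions in the raw Price Paid CSV (illustrative only).
POSTCODE, PAON, SAON = 3, 7, 8

def is_identifiable(row):
    """Keep rows that have a postcode and at least one of PAON/SAON."""
    return bool(row[POSTCODE].strip()) and (
        bool(row[PAON].strip()) or bool(row[SAON].strip())
    )

# Within the pipeline, once each CSV line has been split into a list of columns:
# identifiable = rows | "DropUnidentifiable" >> beam.Filter(is_identifiable)
```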
We split the PAON as part of the cleaning stage. If the PAON contains a comma then it contains the building name and street number. We keep the street number in the same position as the PAON and insert the building name as a new column at the end of the row. If the PAON does not contain a comma we insert a blank column at the end to keep the number of columns in the PCollection consistent.
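A sketch of that split, assuming (as described above) that the building name comes before the street number in a comma-delimited PAON:
```python
PAON = 7  # assumed column position, as in the sketch above

def split_paon(row):
    """Keep the street number in the PAON position and append the building name
    as a new trailing column; otherwise append a blank column so every row
    keeps the same number of columns."""
    if "," in row[PAON]:
        building_name, street_number = [part.strip() for part in row[PAON].split(",", 1)]
        row[PAON] = street_number
        row.append(building_name)
    else:
        row.append("")
    return row
```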
### Unneeded columns
To try to keep computation costs and time down, I decided to drop the categorical columns provided. These include:
- Property Type.
- Old/New.
- Duration.
- PPD Category Type.
- Record Status - monthly file only.
Initially I was attempting to work against the full dataset, so dropping these columns made a noticeable difference to the amount of data that needed processing.
These columns do provide some relevant information (old/new, duration, property type) and could be added back into the pipeline fairly easily, but due to time constraints I was unable to make this change.
In addition, I dropped the transaction unique identifier column. I wanted the IDs calculated in the pipeline to be consistent in format, and hashing a string with MD5 is cheap to calculate, with complexity $\mathcal{O}(n)$ in the length of the string.
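A sketch of how such an ID could be derived; which address fields go into the hash, and the column positions, are assumptions here:
```python
import hashlib

# Assumed column positions, as in the earlier sketches (illustrative only).
POSTCODE, PAON, SAON = 3, 7, 8

def property_id(row):
    """Build a stable, consistently formatted ID by hashing the identifying address fields."""
    key = "|".join([row[POSTCODE], row[PAON], row[SAON]])
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```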
### General cleaning
#### Upper case
As all strings in the dataset are upper case, we convert everything in the row to upper case to enforce consistency across the dataset.
#### Strip leading/trailing whitespace
We strip all leading/trailing whitespace from each column to enforce consistency.
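Both steps amount to a simple per-row map, sketched below (assuming rows are lists of string columns):
```python
def normalise(row):
    """Upper-case every column and strip leading/trailing whitespace."""
    return [column.strip().upper() for column in row]

# cleaned = rows | "Normalise" >> beam.Map(normalise)
```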
#### Repeated rows
Some of the data is repeated:
- Some rows are repeated, with the same date + price + address information but a unique transaction ID.
<details>
<summary>Example (PCollection)</summary>
```json
[
  {
    "fd4634faec47c29de40bbf7840723b41": [
      "317500",
      "2020-11-13 00:00",
      "B90 3LA",
      "1",
      "",
      "VERSTONE ROAD",
      "SHIRLEY",
      "SOLIHULL",
      "SOLIHULL",
      "WEST MIDLANDS",
      ""
    ]
  },
  {
    "fd4634faec47c29de40bbf7840723b41": [
      "317500",
      "2020-11-13 00:00",
      "B90 3LA",
      "1",
      "",
      "VERSTONE ROAD",
      "SHIRLEY",
      "SOLIHULL",
      "SOLIHULL",
      "WEST MIDLANDS",
      ""
    ]
  }
]
```
</details>
These rows will be deduplicated as part of the pipeline (a sketch of this is shown at the end of this section).
- Some rows have the same date + address information, but different prices.
It would be very unusual to see multiple transactions on the same date for the same property. One reason could be a data entry error, resulting in two different transactions with only one being the real price. As the date column does not contain the time (it is fixed at `00:00`), it is impossible to tell which.
Another reason could be missing building/flat/apartment information in these entries.
We **keep** these in the data, resulting in some properties having multiple transactions with different prices on the same date. Without a time or more information to go on, it is difficult to see how these could be filtered out.
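As mentioned above, here is a minimal, self-contained sketch of the exact-duplicate removal; the element shape (hash key plus a tuple of column values) and the use of `beam.Distinct` are assumptions about how this could be done, not necessarily the pipeline's implementation:
```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    keyed_rows = pipeline | beam.Create([
        ("fd4634faec47c29de40bbf7840723b41", ("317500", "2020-11-13 00:00", "B90 3LA")),
        ("fd4634faec47c29de40bbf7840723b41", ("317500", "2020-11-13 00:00", "B90 3LA")),
    ])
    # Distinct() only removes elements that are identical in every field, so
    # same-address rows with different prices would be kept.
    deduplicated = keyed_rows | "DropExactDuplicates" >> beam.Distinct()
```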


@@ -0,0 +1,30 @@
# Data Exploration Report
A brief exploration was done on the **full** dataset using the module `pandas-profiling`. The module uses `pandas` to load a dataset, automatically produces quantile/descriptive statistics, common values, extreme values, skew, kurtosis, etc., and writes a report `.html` file that can be viewed interactively in your browser.
The script used to generate this report is located in `./exploration/report.py` and can be viewed below.
<details>
<summary>report.py</summary>
```python
--8<-- "exploration/report.py"
```
</details>
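For reference, a minimal illustration of the `pandas-profiling` API; this is a sketch, not the contents of `report.py`, and the options used are assumptions:
```python
import pandas as pd
from pandas_profiling import ProfileReport

# The Price Paid CSV has no header row.
df = pd.read_csv("data/input/pp-2020.csv", header=None)

# minimal=True keeps the run tractable on a dataset of this size.
profile = ProfileReport(df, title="Price Paid Data", minimal=True)
profile.to_file("report.html")
```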
The report can be viewed by clicking the Data Exploration Report tab at the top of the page.
## Interesting observations
When looking at the report we are mainly looking for data quality issues and missing values. The statistics are interesting to see but are largely irrelevant for this task.
The data overall looks very good for a dataset of its size (~27 million records). For important fields there are no missing values:
- Every row has a price.
- Every row has a unique transaction ID.
- Every row has a transaction date.
Some fields that we will need are missing data:
- ~42,000 (0.2%) are missing a Postcode.
- ~4,000 (<0.1%) are missing a PAON (primary addressable object name).
- ~412,000 (1.6%) are missing a Street Name.


@@ -0,0 +1,9 @@
# Introduction
This section goes through some discussion of the test, including:
- Data exploration
- Cleaning the data
- Interpreting the results
- Deploying on GCP DataFlow
- Improvements


@@ -0,0 +1,26 @@
# Installation
The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested.
The task has been tested on macOS Big Sur and WSL2. It should also run on Windows, but this wasn't tested.
For Beam 2.32.0 the supported versions of the Python SDK can be found [here](https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#sdk-for-python).
## Poetry
The test uses [Poetry](https://python-poetry.org) for dependency management.
!!! info inline end
    If you already have Poetry installed globally you can go straight to the `poetry install` step.
In a virtual environment, install Poetry:
```bash
pip install poetry
```
From the root of the repo install the dependencies with:
```bash
poetry install --no-dev
```


@@ -0,0 +1,56 @@
# Usage
This page documents how to run the pipeline locally to complete the task for the [dataset for 2020](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1).
The pipeline also runs in GCP using DataFlow; this is discussed further on and can be viewed here. We also discuss how to adapt the pipeline so it can run against [the full dataset](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file).
## Download dataset
The input data by default should go in `./data/input`.
For convenience the data is available publicly in a GCP Cloud Storage bucket.
Run:
```bash
wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input
```
to download the data for 2020 and place it in the input directory above.
## Entrypoint
The entrypoint to the pipeline is `analyse-properties`.
## Available options
Running
```bash
analyse-properties --help
```
gives the following output:
```bash
usage: analyse-properties [-h] [--input INPUT] [--output OUTPUT]

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    Full path to the input file.
  --output OUTPUT  Full path to the output file without extension.
```
The default value for input is `./data/input/pp-2020.csv` and the default value for output is `./data/output/pp-2020`.
If passing in values for `input`/`output`, these should be **full** paths to the files. The test parses these inputs as a `str()` and passes them to `#!python beam.io.ReadFromText()`.
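A hypothetical sketch of how such an entrypoint can wire these options into Beam (the real `analyse-properties` implementation may differ); unknown arguments such as `--runner` are passed through to the pipeline options:
```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--input", default="./data/input/pp-2020.csv",
                    help="Full path to the input file.")
parser.add_argument("--output", default="./data/output/pp-2020",
                    help="Full path to the output file without extension.")
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
    rows = pipeline | "Read" >> beam.io.ReadFromText(str(known_args.input))
    # ... cleaning, grouping, and writing to str(known_args.output) + ".json" ...
```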
## Run the pipeline
To run the pipeline and complete the task run:
```bash
analyse-properties --runner DirectRunner
```
The pipeline will use the 2020 dataset located in `./data/input` and output the resulting `.json` to `./data/output`.


@@ -3,3 +3,10 @@
## Introduction
This documentation accompanies the technical test for the Street Group.
The following pages will guide the user through installing the requirements, and running the task to complete the test. In addition, there is some discussion around the approach, and any improvements that could be made.
Navigate sections using the tabs at the top of the page. Pages in this section can be viewed in order by using the section links in the left menu, or by using the bar at the bottom of the page. The table of contents in the right menu can be used to navigate the sections on each page.
!!! note
    All paths in this documentation, e.g. `./analyse_properties/data/output`, refer to the location of the directory/file relative to the root of the repo.