Merge final release (#1)

* adding initial skeleton

* updating .gitignore

* updating dev dependencies

* adding report.py

* updating notes

* adding prospector.yaml

* updating beam to install gcp extras

* adding documentation

* adding data exploration report + code

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding debug.py

* adding latest beam pipeline code

* adding latest beam pipeline code

* adding latest beam pipeline code

* updating .gitignore

* updating folder structure for data input/output

* updating prospector.yaml

* adding latest beam pipeline code

* updating prospector.yaml

* migrate beam pipeline to main.py

* updating .gitignore

* updating .gitignore

* adding download script for data set

* adding initial docs

* moving inputs/outputs to use pathlib

* removing shard_name_template from output file

* adding pyenv 3.7.9

* removing requirements.txt for documentation

* updating README.md

* updating download data script for new location in GCS

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* adding latest beam pipeline code for dataflow

* moving dataflow notes

* updating prospector.yaml

* adding latest beam pipeline code for dataflow

* updating beam pipeline to use GroupByKey

* updating download_data script with new bucket

* update prospector.yaml

* update dataflow documentation with new commands for vpc

* adding latest beam pipeline code for dataflow with group optimisation

* updating dataflow documentation

* adding latest beam pipeline code for dataflow with group optimisation

* updating download_data script with pp-2020 dataset

* adding temporary notes

* updating dataflow notes

* adding latest beam pipeline code

* updating dataflow notes

* adding latest beam pipeline code for dataflow

* adding debug print

* moving panda-profiling report into docs

* updating report.py

* adding entrypoint command

* adding initial docs

* adding commands.md to notes

* commenting out debug imports

* updating documentation

* updating latest beam pipeline with default inputs

* updating poetry

* adding requirements.txt

* updating documentation
Author: dtomlinson91 · 2021-09-28 00:31:09 +01:00 · committed by GitHub
Parent: 8a22bfebe1 · Commit: 80376a662e
34 changed files with 5667 additions and 1 deletion


@@ -0,0 +1,95 @@
# Approach
The general approach to the pipeline is:
## Loading stage
- Load using `#!python beam.io.ReadFromText()`
- Split the loaded string on `,` as it is a comma-delimited `.csv`.
- Strip the leading/trailing `"` marks.
The result is an array with each element representing a single column in that row.
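
A rough sketch of this stage is shown below. The file path and transform labels are placeholders, not the project's actual configuration.

```python
import apache_beam as beam


def parse_row(line: str) -> list:
    """Split a comma-delimited line and strip the surrounding quote marks."""
    # Note: if quoted fields can contain commas, a csv.reader-style parse would be needed.
    return [column.strip('"') for column in line.split(",")]


with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        | "Read" >> beam.io.ReadFromText("data/input/pp-2020.csv")  # placeholder path
        | "Parse" >> beam.Map(parse_row)
    )
```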
## Cleaning stage
The cleaning steps are covered in detail on the Cleaning page.
## Create a mapping table
The mapping table takes each row and creates a `(key,value)` pair with:
- The key being the id across all columns (`id_all_columns`).
- The value being the raw data as an array.
The mapping table is then condensed to a single dictionary with these key, value pairs (automatically deduplicating repeated rows) and is used as a side input further down the pipeline.
This mapping table is created to keep the `GroupByKey` operation as quick as possible. The more data there is to process in a `GroupByKey`, the longer the operation takes. By performing the `GroupByKey` on just the ids, the pipeline can process the files much more quickly than if the raw data were included in this operation.
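A minimal sketch of this idea, assuming `rows` is the parsed PCollection from the loading stage; the id helper and transform labels are illustrative.

```python
import hashlib

import apache_beam as beam


def id_all_columns(row: list) -> str:
    """md5 hash over every column, giving a deterministic id for the full row."""
    return hashlib.md5("".join(row).encode("utf-8")).hexdigest()


mapping_table = rows | "KeyByAllColumns" >> beam.Map(lambda row: (id_all_columns(row), row))

# Condense to a single dict: duplicate keys collapse, deduplicating repeated rows.
mapping_dict = mapping_table | "Condense" >> beam.combiners.ToDict()

# Further down the pipeline the dict is passed as a side input,
# e.g. with beam.pvalue.AsSingleton(mapping_dict).
```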
## Prepare stage
- Take the mapping table data (before it is condensed) and create a unique id ignoring the price and date (`id_without_price_date`).
This id will not be unique: rows for properties with more than one transaction will share it.
- Create a `(key, value)` pair with:
- The key being `id_without_price_date`.
- The value being `id_all_columns`.
- Group by `id_without_price_date`.
This results in a PCollection that looks like: `(id_without_price_date, [id_all_columns,...])`
- Deduplicate the `id_all_columns` inside this array to eliminate repeated rows that are exactly the same.
- Use the mapping table as a side input to reinsert the raw data using the `id_all_columns`.
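
A hedged sketch of this stage follows; the helper names, transform labels and the price/date column positions are assumptions inferred from the worked example below, and `mapping_table`/`mapping_dict` are taken from the mapping-table sketch above.

```python
import hashlib

import apache_beam as beam

PRICE_AND_DATE_INDEXES = (0, 1)  # inferred from the example rows below


def id_without_price_date(row: list) -> str:
    """md5 hash over every column except the price and date."""
    kept = [col for i, col in enumerate(row) if i not in PRICE_AND_DATE_INDEXES]
    return hashlib.md5("".join(kept).encode("utf-8")).hexdigest()


prepared = (
    mapping_table
    | "KeyByProperty" >> beam.Map(lambda kv: (id_without_price_date(kv[1]), kv[0]))
    | "GroupByProperty" >> beam.GroupByKey()
    | "Deduplicate" >> beam.Map(lambda kv: (kv[0], sorted(set(kv[1]))))
    | "ReinsertRawData" >> beam.Map(
        lambda kv, mapping: (kv[0], [mapping[row_id] for row_id in kv[1]]),
        mapping=beam.pvalue.AsSingleton(mapping_dict),
    )
)
```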
<details>
<summary>Example for No.1 B90 3LA</summary>
Mapping table (pre-condensed):
```json
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
```
Mapping table (condensed):
```json
{'fd4634faec47c29de40bbf7840723b41': ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', '']}
```
Prepared (key, value):
```json
('fe205bfe66bc7f18c50c8f3d77ec3e30', 'fd4634faec47c29de40bbf7840723b41')
('fe205bfe66bc7f18c50c8f3d77ec3e30', 'fd4634faec47c29de40bbf7840723b41')
```
Prepared (GroupByKey):
```json
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41', 'fd4634faec47c29de40bbf7840723b41'])
```
Prepared (Deduplicated):
```json
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41'])
```
Use mapping table as side input:
```json
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
```
</details>
## Format stage
This stage takes the result and constructs a `json` object from the grouped data. The schema for this output is discussed on the following page.
## Save stage
- The PCollection is combined with `#!python beam.combiners.ToList()`.
- Apply `json.dumps()` so strings are written with proper quotation marks.
- Write to text with `#!python beam.io.WriteToText`.
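
A minimal sketch of the save stage above, assuming `formatted` is the PCollection produced by the format stage and using a placeholder output path.

```python
import json

import apache_beam as beam

(
    formatted
    | "ToList" >> beam.combiners.ToList()
    | "Serialise" >> beam.Map(json.dumps)
    | "Write" >> beam.io.WriteToText("data/output/properties", file_name_suffix=".json")
)
```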

docs/discussion/cleaning.md

@@ -0,0 +1,154 @@
# Cleaning
In this page we discuss the cleaning stages and how best to prepare the data.
## Uniquely identify a property
With the data we have, a Postcode together with the PAON (or SAON, or a combination of both) is enough to uniquely identify a property.
### Postcode
Because so few properties are missing a postcode (0.2% of all records), we drop all rows that do not have one. This discards some properties that could be uniquely identified with more work, but the properties missing a postcode tend to be unusual/commercial/industrial (e.g. a power plant).
### PAON/SAON
The PAON has 3 possible formats:
- The street number.
- The building name.
- The building name and street number (comma delimited).
The SAON:
- Identifies the apartment/flat number within the building.
- If the SAON is present (only 11.7% of values) then the PAON will be either:
    - The building name.
    - The building name and street number.
Because of the way the PAON and SAON are defined, we drop any row that is missing **both** of these columns, as a postcode alone is (generally speaking) not enough to uniquely identify a property.
!!! tip
    In a production environment we could send these rows to a sink table (in BigQuery, for example) rather than drop them outright. Collecting these rows over time might reveal patterns in how to uniquely identify properties that are missing these fields.
We split the PAON as part of the cleaning stage. If the PAON contains a comma, it holds both the building name and the street number: we keep the street number in the PAON's position and insert the building name as a new column at the end of the row. If the PAON does not contain a comma, we insert a blank column at the end to keep the number of columns in the PCollection consistent.
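A small sketch of this split; the PAON index is inferred from the example rows on the Approach page, the helper name is illustrative, and the building-name-then-number ordering follows the list above.

```python
PAON_INDEX = 3  # assumed position of the PAON in the parsed row


def split_paon(row: list) -> list:
    """Keep the street number in the PAON position; append the building name (or a blank)."""
    paon = row[PAON_INDEX]
    if "," in paon:
        # Assumed ordering: building name first, then street number.
        building, number = (part.strip() for part in paon.split(",", 1))
        row[PAON_INDEX] = number
        row.append(building)
    else:
        row.append("")  # blank column keeps the column count consistent
    return row
```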
### Unneeded columns
To try to keep computation costs and time down, I decided to drop the categorical columns provided. These include:
- Property Type.
- Old/New.
- Duration.
- PPD Category Type.
- Record Status - monthly file only.
Initially I was attempting to work against the full dataset, so dropping these columns made a real difference to the amount of data that needed processing.
These columns are also not consistent. For example, the property `63` `B16 0AE` has three transactions: two have a property type of `Other` and one has a property type of `Terraced`.
These columns do provide some relevant information (old/new, duration, property type) and could be added back into the pipeline fairly easily; due to time constraints I was unable to make this change.
In addition, I also dropped the transaction unique identifier column. I wanted the IDs calculated in the pipeline to be consistent in format, and hashing a string (md5) isn't expensive to calculate, with complexity $\mathcal{O}(n)$.
### General cleaning
#### Upper case
As all strings in the dataset are upper case, we convert everything in the row to upper case to enforce consistency across the dataset.
#### Strip leading/trailing whitespace
We strip all leading/trailing whitespace from each column to enforce consistency.
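Both steps amount to a simple per-row transform, sketched below with an illustrative function name.

```python
def clean_row(row: list) -> list:
    """Upper-case every column and strip leading/trailing whitespace."""
    # Applied in the pipeline with e.g. beam.Map(clean_row).
    return [column.strip().upper() for column in row]
```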
#### Repeated rows
Some of the data is repeated:
- Some rows are repeated, with the same date + price + address information but with a unique transaction id.
<details>
<summary>Example (PCollection)</summary>
```json
[
{
"fd4634faec47c29de40bbf7840723b41": [
"317500",
"2020-11-13 00:00",
"B90 3LA",
"1",
"",
"VERSTONE ROAD",
"SHIRLEY",
"SOLIHULL",
"SOLIHULL",
"WEST MIDLANDS",
""
]
},
{
"gd4634faec47c29de40bbf7840723b42": [
"317500",
"2020-11-13 00:00",
"B90 3LA",
"1",
"",
"VERSTONE ROAD",
"SHIRLEY",
"SOLIHULL",
"SOLIHULL",
"WEST MIDLANDS",
""
]
}
]
```
</details>
These rows will be deduplicated as part of the pipeline.
- Some rows have the same date + address information, but different prices.
It would be very unusual to see multiple transactions on the same date for the same property. One reason could be a data entry error, resulting in two different transactions with only one being the real price. As the date column does not contain a time (it is fixed at `00:00`), it is impossible to tell.
Another reason could be missing building/flat/apartment information in this entry.
We **keep** these in the data, resulting in some properties having multiple transactions with different prices on the same date. Without a time, or more information to go on, it is difficult to see how these could be filtered out.
<details>
<summary>Example (Output)</summary>
```json
[
{
"property_id": "20d5c335c8d822a40baab0ecd57e92a4",
"readable_address": "53 PAVENHAM DRIVE\nBIRMINGHAM\nWEST MIDLANDS\nB5 7TN",
"flat_appartment": "",
"builing": "",
"number": "53",
"street": "PAVENHAM DRIVE",
"locality": "",
"town": "BIRMINGHAM",
"district": "BIRMINGHAM",
"county": "WEST MIDLANDS",
"postcode": "B5 7TN",
"property_transactions": [
{
"price": 270000,
"transaction_date": "2020-04-23",
"year": 2020
},
{
"price": 364000,
"transaction_date": "2020-04-23",
"year": 2020
}
],
"latest_transaction_year": 2020
}
]
```
</details>


@@ -0,0 +1,30 @@
# Data Exploration Report
A brief exploration was done on the **full** dataset using the `pandas-profiling` module. The module uses `pandas` to load a dataset, automatically produces quantile/descriptive statistics, common values, extreme values, skew, kurtosis, etc., and generates an `.html` report that can be viewed interactively in your browser.
The script used to generate this report is located in `./exploration/report.py` and can be viewed below.
<details>
<summary>report.py</summary>
```python
--8<-- "exploration/report.py"
```
</details>
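The real script is the snippet-included `report.py` above; a minimal equivalent, with placeholder paths, might look like:

```python
import pandas as pd
from pandas_profiling import ProfileReport

# Placeholder paths and column handling; the actual report.py may differ.
df = pd.read_csv("data/input/pp-complete.csv", header=None)
profile = ProfileReport(df, title="Price Paid Data Exploration", minimal=True)
profile.to_file("docs/report.html")
```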
The report can be viewed by clicking the Data Exploration Report tab at the top of the page.
## Interesting observations
When looking at the report we are looking for data quality and missing observations. The statistics are interesting to see but are largely irrelevant for this task.
The data overall looks very good for a dataset of its size (~27 million records). For important fields there are no missing values:
- Every row has a price.
- Every row has a unique transaction ID.
- Every row has a transaction date.
Some fields that we will need are missing data:
- ~42,000 (0.2%) are missing a Postcode.
- ~4,000 (<0.1%) are missing a PAON (primary addressable object name).
- ~412,000 (1.6%) are missing a Street Name.


@@ -0,0 +1,7 @@
# Introduction
This section goes through some discussion of the test, including:
- Data exploration
- Cleaning the data
- Interpreting the results


@@ -0,0 +1,51 @@
# Results
The resulting output `.json` looks like the following (for the earlier example using No. 1 `B90 3LA`):
```json
[
{
"property_id": "fe205bfe66bc7f18c50c8f3d77ec3e30",
"readable_address": "1 VERSTONE ROAD\nSHIRLEY\nSOLIHULL\nWEST MIDLANDS\nB90 3LA",
"flat_appartment": "",
"builing": "",
"number": "1",
"street": "VERSTONE ROAD",
"locality": "SHIRLEY",
"town": "SOLIHULL",
"district": "SOLIHULL",
"county": "WEST MIDLANDS",
"postcode": "B90 3LA",
"property_transactions": [
{
"price": 317500,
"transaction_date": "2020-11-13",
"year": 2020
}
],
"latest_transaction_year": 2020
}
]
```
The standard property information is included; below we briefly discuss the additional fields included in this output file.
## readable_address
The components that make up an address in the dataset are often repetitive, with the locality, town/city, district and county often sharing the same value. This can result in hard-to-read addresses if we simply stacked all the components sequentially.
The `readable_address` provides an easy-to-read address that strips this repetition out by doing pairwise comparisons between the four components and applying a mask. The result is an address that could be served to the end user, or easily displayed on a page.
This saves any user from having to apply the same logic just to display the address somewhere; the full address of a property should be easy to read and easily accessible.
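One way this masking could be sketched; the helper name and argument order are assumptions, and the pipeline's actual implementation may differ.

```python
def readable_address(number: str, street: str, locality: str, town: str,
                     district: str, county: str, postcode: str) -> str:
    """Stack the address components, dropping blank and repeated area names."""
    area_components = [locality, town, district, county]
    kept = []
    for component in area_components:
        if component and component not in kept:
            kept.append(component)
    lines = [f"{number} {street}".strip()] + kept + [postcode]
    return "\n".join(line for line in lines if line)
```

For the `B90 3LA` example this yields `1 VERSTONE ROAD\nSHIRLEY\nSOLIHULL\nWEST MIDLANDS\nB90 3LA`, matching the output above.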
## property_transactions
This array contains an object for each transaction for the property, with the price and year as an `int` and the `00:00` time stripped from the date.
## latest_transaction_year
The year of the latest transaction is extracted from the `property_transactions` array and placed at the top level of the `json` object. This allows any end user to easily search for properties that haven't been sold within a given period of time, without having to write that logic themselves.
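Deriving the field is a one-liner over the transactions array; the helper name below is illustrative.

```python
def latest_transaction_year(property_transactions: list) -> int:
    """Return the most recent transaction year, e.g. 2020 for the example above."""
    return max(transaction["year"] for transaction in property_transactions)
```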
A consumer should be able to use this data to answer questions like:
- Give me all properties in the town of Solihull that haven't been sold in the past 10 years.