updating documentation

2021-09-27 23:16:30 +01:00
parent 4056ca1f32
commit 4561f1a356
10 changed files with 313 additions and 2 deletions

@@ -1 +1,95 @@
# Approach
The general approach to the pipeline is:
## Loading stage
- Load using `#!python beam.io.ReadFromText()`
- Split each loaded line on `,`, as the file is a comma-delimited `.csv`.
- Strip the leading/trailing `"` marks.
The result is an array with each element representing a single column in that row.
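A minimal sketch of this stage, assuming fields do not contain embedded commas (the input path is illustrative):
```python
import apache_beam as beam

def split_row(line: str) -> list:
    """Split a raw line on ',' and strip the leading/trailing '"' marks."""
    return [field.strip('"') for field in line.split(',')]

pipeline = beam.Pipeline()
rows = (
    pipeline
    | 'Read' >> beam.io.ReadFromText('price_paid.csv')  # illustrative path
    | 'Split' >> beam.Map(split_row)
)
```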
## Cleaning stage
Already discussed.
## Create a mapping table
The mapping table takes each row and creates a `(key,value)` pair with:
- The key being the id across all columns (`id_all_columns`).
- The value being the raw data as an array.
The mapping table is then condensed to a single dictionary with these key, value pairs and is used as a side input further down the pipeline.
This mapping table is created to keep the `GroupByKey` operation as quick as possible. `GroupByKey` has to shuffle every value it groups between workers, so the more data you process in it, the longer the operation takes. By doing the `GroupByKey` using just the ids and re-joining the raw data afterwards, the pipeline can process the files much more quickly than if the raw data were carried through this operation.
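A sketch of how this mapping table might be built, continuing from the `rows` PCollection in the loading sketch; joining the columns with `,` before hashing is an assumption:
```python
import hashlib
import apache_beam as beam

def id_all_columns(row: list) -> str:
    """md5 over every column, so identical raw rows share an id (joining on ',' is an assumption)."""
    return hashlib.md5(','.join(row).encode('utf-8')).hexdigest()

# One (key, value) pair per row: the id across all columns, and the raw data.
mapping_table = rows | 'KeyByAllColumns' >> beam.Map(lambda row: (id_all_columns(row), row))

# Condensed into a single dict and used as a side input further down the pipeline.
mapping_dict = beam.pvalue.AsDict(mapping_table)
```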
## Prepare stage
- Take the mapping table data (before it is condensed) and create an id that ignores the price and date (`id_without_price_date`).
This id is not unique per row: all transactions for the same property share it.
- Create a `(key, value)` pair with:
- The key being `id_without_price_date`.
- The value being `id_all_columns`.
- Group by `id_without_price_date`.
This results in a PCollection that looks like: `(id_without_price_date, [id_all_columns,...])`
- Deduplicate the `id_all_columns` inside this array to eliminate repeated rows that are exactly the same.
- Use the mapping table as a side input to reinsert the raw data using the `id_all_columns`.
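A sketch of these steps, continuing from `mapping_table` above; the price and date column positions are taken from the worked example below:
```python
import hashlib
import apache_beam as beam

def id_without_price_date(row: list) -> str:
    """md5 over every column except price and date (columns 0 and 1 in the example below)."""
    return hashlib.md5(','.join(row[2:]).encode('utf-8')).hexdigest()

prepared = (
    mapping_table
    | 'KeyByProperty' >> beam.MapTuple(
        lambda id_all, row: (id_without_price_date(row), id_all))
    | 'GroupByProperty' >> beam.GroupByKey()
    | 'Deduplicate' >> beam.MapTuple(lambda key, ids: (key, sorted(set(ids))))
    | 'ReinsertRawData' >> beam.Map(
        lambda kv, mapping: (kv[0], [mapping[i] for i in kv[1]]),
        mapping=beam.pvalue.AsDict(mapping_table))
)
```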
<details>
<summary>Example for No.1 B90 3LA</summary>
Mapping table (pre-condensed):
```python
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
('fd4634faec47c29de40bbf7840723b41', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
```
Mapping table (condensed):
```python
{'fd4634faec47c29de40bbf7840723b41': ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', '']}
```
Prepared (key, value):
```python
('fe205bfe66bc7f18c50c8f3d77ec3e30', 'fd4634faec47c29de40bbf7840723b41')
('fe205bfe66bc7f18c50c8f3d77ec3e30', 'fd4634faec47c29de40bbf7840723b41')
```
Prepared (GroupByKey):
```python
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41', 'fd4634faec47c29de40bbf7840723b41'])
```
Prepared (Deduplicated):
```python
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['fd4634faec47c29de40bbf7840723b41'])
```
Use mapping table as side input:
```python
('fe205bfe66bc7f18c50c8f3d77ec3e30', ['317500', '2020-11-13 00:00', 'B90 3LA', '1', '', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL', 'WEST MIDLANDS', ''])
```
</details>
## Format stage
This stage takes the result and constructs a `json` object out of the grouped data. The schema for this output is discussed in the following page.
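A sketch of this stage, continuing from `prepared` above; column positions follow the mapping-table example, and only a few of the address fields are shown:
```python
import apache_beam as beam

def format_property(element: tuple) -> dict:
    """Turn (property_id, [raw rows]) into the output object for one property."""
    property_id, rows = element
    transactions = [
        {
            'price': int(row[0]),
            'transaction_date': row[1].replace(' 00:00', ''),  # strip the time
            'year': int(row[1][:4]),
        }
        for row in rows
    ]
    first = rows[0]
    return {
        'property_id': property_id,
        'number': first[3],
        'street': first[5],
        'postcode': first[2],
        # ...the remaining address fields are filled in the same way...
        'property_transactions': transactions,
        'latest_transaction_year': max(t['year'] for t in transactions),
    }

formatted = prepared | 'Format' >> beam.Map(format_property)
```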
## Save stage
- The PCollection is combined with `#!python beam.combiners.ToList()`
- Apply `json.dumps()` so that strings are serialised with proper double quotation marks.
- Write to text with `#!python beam.io.WriteToText`.
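A sketch of the save stage, continuing from the `formatted` PCollection above; the output prefix is illustrative:
```python
import json
import apache_beam as beam

output = (
    formatted
    | 'ToList' >> beam.combiners.ToList()
    | 'Serialise' >> beam.Map(json.dumps)  # gives strings proper double quotation marks
    | 'Write' >> beam.io.WriteToText('output', file_name_suffix='.json')
)

pipeline.run().wait_until_finish()
```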

@@ -44,6 +44,8 @@ To try keep computation costs/time down, I decided to drop the categorical colum
Initially I was attempting to work against the full dataset, so dropping these columns would make a real difference to the amount of data that needed processing.
These columns are also not consistent. For example, the property `63` `B16 0AE` has three transactions: two of these have a property type of `Other` and one has a property type of `Terraced`.
These columns do provide some relevant information (old/new, duration, property type) and could be reintroduced into the pipeline fairly easily. Due to time constraints I was unable to make this change.
I also dropped the transaction unique identifier column. I wanted the IDs calculated in the pipeline to be consistent in format, and hashing a string with md5 isn't that expensive, with complexity $\mathcal{O}(n)$ in the length of the string.
@@ -113,3 +115,40 @@ It would be very unusual to see multiple transactions on the same date for the s
Another reason could be missing building/flat/apartment information in this entry.
We **keep** these in the data, resulting in some properties having multiple transactions with different prices on the same date. Without a timestamp or more information to go on, it is difficult to see how these could be filtered out.
<details>
<summary>Example (Output)</summary>
```json
[
{
"property_id": "20d5c335c8d822a40baab0ecd57e92a4",
"readable_address": "53 PAVENHAM DRIVE\nBIRMINGHAM\nWEST MIDLANDS\nB5 7TN",
"flat_appartment": "",
"builing": "",
"number": "53",
"street": "PAVENHAM DRIVE",
"locality": "",
"town": "BIRMINGHAM",
"district": "BIRMINGHAM",
"county": "WEST MIDLANDS",
"postcode": "B5 7TN",
"property_transactions": [
{
"price": 270000,
"transaction_date": "2020-04-23",
"year": 2020
},
{
"price": 364000,
"transaction_date": "2020-04-23",
"year": 2020
}
],
"latest_transaction_year": 2020
}
]
```
</details>

@@ -5,5 +5,3 @@ This section will go through some discussion of the test including:
- Data exploration
- Cleaning the data
- Interpreting the results
- Deploying on GCP DataFlow
- Improvements

@@ -0,0 +1,51 @@
# Results
The resulting output `.json` looks like (for the previous example using No. 1 `B90 3LA`):
```json
[
{
"property_id": "fe205bfe66bc7f18c50c8f3d77ec3e30",
"readable_address": "1 VERSTONE ROAD\nSHIRLEY\nSOLIHULL\nWEST MIDLANDS\nB90 3LA",
"flat_appartment": "",
"builing": "",
"number": "1",
"street": "VERSTONE ROAD",
"locality": "SHIRLEY",
"town": "SOLIHULL",
"district": "SOLIHULL",
"county": "WEST MIDLANDS",
"postcode": "B90 3LA",
"property_transactions": [
{
"price": 317500,
"transaction_date": "2020-11-13",
"year": 2020
}
],
"latest_transaction_year": 2020
}
]
```
The standard property information is included; below, we briefly discuss the additional fields included in this output file.
## readable_address
The components that make up the address in the dataset are often repetitive, with the locality, town/city, district and county often sharing the same value. This can result in hard-to-read addresses if we simply stacked all the components sequentially.
The `readable_address` provides an easy-to-read address that strips this repetitiveness out, by doing pairwise comparisons across the four components and applying a mask. The result is an address that could be served to the end user, or easily displayed on a page.
This saves any consumer from having to apply the same logic just to display the address somewhere; the full address of a property should be easy to read and easily accessible.
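A sketch of the idea (not necessarily the exact mask logic used): keep each of the four area components only if it has not already appeared.
```python
def readable_address(number: str, street: str, locality: str, town: str,
                     district: str, county: str, postcode: str) -> str:
    """Stack the address components, skipping any of the four area
    components that repeats one already kept."""
    lines = [f'{number} {street}'.strip()]
    seen = set()
    for component in (locality, town, district, county):
        if component and component not in seen:
            lines.append(component)
            seen.add(component)
    return '\n'.join(lines + [postcode])

readable_address('1', 'VERSTONE ROAD', 'SHIRLEY', 'SOLIHULL', 'SOLIHULL',
                 'WEST MIDLANDS', 'B90 3LA')
# -> '1 VERSTONE ROAD\nSHIRLEY\nSOLIHULL\nWEST MIDLANDS\nB90 3LA'
```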
## property_transactions
This array contains an object for each transaction on that property, with the price and year as an `int` and the `00:00` time stripped from the transaction date.
## latest_transaction_year
The date of the latest transaction is extracted from the array of `property_transactions` and placed in the top level of the `json` object. This allows any end user to easily search for properties that haven't been sold in a period of time, without having to write this logic themselves.
A consumer should be able to use this data to answer questions like:
- Give me all properties in the town of Solihull that haven't been sold in the past 10 years.
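For example, a consumer could answer that question with a few lines over the output file; the sharded filename and the cutoff below are illustrative.
```python
import json
from datetime import date

# Load the pipeline output (one JSON list per output shard).
with open('output-00000-of-00001.json') as f:
    properties = json.load(f)

cutoff = date.today().year - 10
unsold_solihull = [
    p for p in properties
    if p['town'] == 'SOLIHULL' and p['latest_transaction_year'] < cutoff
]
```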