updating documentation

2026-02-06 16:05:43 +00:00 · 2021-09-28 00:29:41 +01:00
parent c481c1a976
commit 8a0d8085a2
7 changed files with 28 additions and 23 deletions
--- a/docs/discussion/approach.md
+++ b/docs/discussion/approach.md
@@ -21,7 +21,7 @@ The mapping table takes each row and creates a `(key,value)` pair with:
 - The key being the id across all columns (`id_all_columns`).
 - The value being the raw data as an array.

-The mapping table is then condensed to a single dictionary with these key, value pairs and is used as a side input further down the pipeline.
+The mapping table is then condensed to a single dictionary with these key, value pairs (automatically deduplicating repeated rows) and is used as a side input further down the pipeline.

 This mapping table is created to ensure the `GroupByKey` operation is as quick as possible. The more data you have to process in a `GroupByKey`, the longer the operation takes. By doing the `GroupByKey` using just the ids, the pipeline can process the files much quicker than if we included the raw data in this operation.

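Note on the approach.md hunk above: the idea can be sketched in plain Python, without Apache Beam. This is only an illustration under stated assumptions, not the pipeline's actual code. The helper `row_id`, the use of md5, the second sample row, and grouping on the postcode column are all hypothetical. The dictionary stands in for the side input, and a `defaultdict` stands in for `GroupByKey`.

```python
import hashlib
from collections import defaultdict


def row_id(values):
    """Hash column values into a hex id (md5 is an assumption, not confirmed by the docs)."""
    return hashlib.md5("|".join(values).encode("utf-8")).hexdigest()


# Hypothetical rows of (price, date, postcode); the third row repeats the first.
rows = [
    ["317500", "2020-11-13 00:00", "B90 3LA"],
    ["250000", "2019-05-01 00:00", "B90 3LA"],
    ["317500", "2020-11-13 00:00", "B90 3LA"],
]

# Mapping table: id_all_columns -> raw row.
# Building a dict keeps one entry per key, so repeated rows collapse.
mapping = {row_id(row): row for row in rows}

# Group using ids only (stand-in for GroupByKey): grouping id -> set of row ids.
# Grouping on the postcode column is an assumption made for this sketch.
grouped = defaultdict(set)
for row in rows:
    grouped[row_id(row[2:])].add(row_id(row))

# After grouping, look the raw data back up in the mapping (stand-in for the side input).
result = {key: sorted(mapping[i] for i in ids) for key, ids in grouped.items()}
```

Only short ids pass through the grouping step. The raw rows are fetched from the dictionary afterwards, which is the trade-off the documentation describes.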
--- a/docs/discussion/cleaning.md
+++ b/docs/discussion/cleaning.md
@@ -64,7 +64,7 @@ We strip all leading/trailing whitespace from each column to enforce consistency

 Some of the data is repeated:

- Some rows repeated, with the same date + price + address information but with a unique transaction id.
+- Some rows are repeated, with the same date + price + address information but with a unique transaction id.

 <details>
    <summary>Example (PCollection)</summary>
@@ -87,7 +87,7 @@ Some of the data is repeated:
    ]
  },
  {
-    "fd4634faec47c29de40bbf7840723b41": [
+    "gd4634faec47c29de40bbf7840723b42": [
      "317500",
      "2020-11-13 00:00",
      "B90 3LA",