mirror of https://github.com/dtomlinson91/street_group_tech_test (synced 2025-12-22 11:55:45 +00:00)
updating documentation
@@ -21,7 +21,7 @@ The mapping table takes each row and creates a `(key,value)` pair with:
 - The key being the id across all columns (`id_all_columns`).
 - The value being the raw data as an array.
 
-The mapping table is then condensed to a single dictionary with these key, value pairs and is used as a side input further down the pipeline.
+The mapping table is then condensed to a single dictionary with these key, value pairs (automatically deduplicating repeated rows) and is used as a side input further down the pipeline.
 
 This mapping table is created to ensure the `GroupByKey` operation is as quick as possible. The more data you have to process in a `GroupByKey`, the longer the operation takes. By doing the `GroupByKey` using just the ids, the pipeline can process the files much quicker than if we included the raw data in this operation.
 
@@ -64,7 +64,7 @@ We strip all leading/trailing whitespace from each column to enforce consistency
 
 Some of the data is repeated:
 
-- Some rows repeated, with the same date + price + address information but with a unique transaction id.
+- Some rows are repeated, with the same date + price + address information but with a unique transaction id.
 
 <details>
 <summary>Example (PCollection)</summary>
@@ -87,7 +87,7 @@ Some of the data is repeated:
 ]
 },
 {
-"fd4634faec47c29de40bbf7840723b41": [
+"gd4634faec47c29de40bbf7840723b42": [
 "317500",
 "2020-11-13 00:00",
 "B90 3LA",
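
The mapping table described in the first hunk can be sketched in plain Python (not Beam) to show why condensing the `(key,value)` pairs into one dictionary removes exact repeats. This is an illustrative sketch only: the helper names `id_all_columns` and `build_mapping_table` and the use of an MD5 hex digest for the id are assumptions, not taken from the repository.

```python
import hashlib


def id_all_columns(row):
    """Hypothetical helper: hash every column of a row into one id.

    MD5 is an assumption here; the source only shows 32-character keys.
    """
    joined = ",".join(col.strip() for col in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def build_mapping_table(rows):
    """Condense (key, value) pairs into a single dict.

    Rows that are identical across all columns produce the same key,
    so they collapse into one entry.
    """
    mapping = {}
    for row in rows:
        cleaned = [col.strip() for col in row]  # strip leading/trailing whitespace
        mapping[id_all_columns(cleaned)] = cleaned
    return mapping


rows = [
    ["317500", "2020-11-13 00:00", "B90 3LA"],
    ["317500", "2020-11-13 00:00", "B90 3LA"],  # exact repeat of the first row
    ["250000", "2020-10-01 00:00", "B90 3LA"],  # made-up row for illustration
]
mapping = build_mapping_table(rows)
```

Under these assumptions, a grouping step only needs to move the short ids around; the raw row for any id can be looked up afterwards with `mapping[row_id]`.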