Deployed 8a0d808 with MkDocs version: 1.2.2

This commit is contained in:
2021-09-28 00:31:12 +01:00
parent c76e3c542a
commit 0e17f26631
9 changed files with 41 additions and 28 deletions

View File

@@ -699,10 +699,6 @@
<p>We need to choose a <code>worker_machine_type</code> with sufficient memory to run the pipeline. As the pipeline uses a mapping table, and DataFlow autoscales on CPU and not memory usage, we need a machine with more RAM than usual to ensure sufficient memory when running on one worker. For <code>pp-2020.csv</code> the type <code>n1-highmem-2</code> with 2 vCPUs and 13 GB of RAM was chosen and completed successfully in ~10 minutes using only 1 worker.</p>
</div>
<p>Assuming the <code>pp-2020.csv</code> file has been placed in the <code>./input</code> directory in the bucket you can run a command similar to:</p>
<div class="admonition caution">
<p class="admonition-title">Caution</p>
<p>Use the command <code>python -m analyse_properties.main</code> as the entrypoint to the pipeline and not <code>analyse-properties</code> as the module isn't installed with poetry on the workers with the configuration below.</p>
</div>
<div class="highlight"><pre><span></span><code>python -m analyse_properties.main <span class="se">\</span> <div class="highlight"><pre><span></span><code>python -m analyse_properties.main <span class="se">\</span>
--runner DataflowRunner <span class="se">\</span> --runner DataflowRunner <span class="se">\</span>
--project street-group <span class="se">\</span> --project street-group <span class="se">\</span>

View File

@@ -686,7 +686,7 @@ struct.error: &#39;i&#39; format requires -2147483648 &lt;= number &lt;= 2147483
<h2 id="solution">Solution<a class="headerlink" href="#solution" title="Permanent link">&para;</a></h2> <h2 id="solution">Solution<a class="headerlink" href="#solution" title="Permanent link">&para;</a></h2>
<p>A possible solution would be to leverage BigQuery to store the results of the mapping table in as the pipeline progresses. We can make use of BigQueries array type to literally store the raw array as we process each row.</p> <p>A possible solution would be to leverage BigQuery to store the results of the mapping table in as the pipeline progresses. We can make use of BigQueries array type to literally store the raw array as we process each row.</p>
<p>In addition to creating the mapping table <code>(key, value)</code> pairs, we also save these pairs to BigQuery at this stage. We then yield the element as it is currently written to allow the subsequent stages to make use of this data.</p> <p>In addition to creating the mapping table <code>(key, value)</code> pairs, we also save these pairs to BigQuery at this stage. We then yield the element as it is currently written to allow the subsequent stages to make use of this data.</p>
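<p>The save-and-yield step described above can be sketched as follows. The helper name, the hashing scheme, and the BigQuery field names (<code>id_all_columns</code>, <code>raw_row</code>) are illustrative assumptions rather than the actual schema, and the <code>beam.io.WriteToBigQuery</code> wiring is indicated only in a comment:</p>

```python
import hashlib

# Illustrative sketch only: field names and the hashing scheme are assumed.
def create_mapping_pair(row):
    """Build the (key, value) mapping pair and the matching BigQuery row."""
    # The key is the id across all columns of the raw row.
    key = hashlib.md5("".join(row).encode()).hexdigest()
    # BigQuery's ARRAY type lets us store the raw row as-is.
    bq_row = {"id_all_columns": key, "raw_row": row}
    # In the pipeline this would live in a DoFn that writes bq_row via
    # beam.io.WriteToBigQuery and yields (key, row) for the next stage.
    return (key, row), bq_row
```

<p>For example, <code>create_mapping_pair(["317500", "2020-11-13 00:00", "B90 3LA"])</code> returns the pair and its BigQuery row sharing the same id, so later stages and the stored table stay in sync.</p>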
- <p>Remove the condense mapping table stage as it is no longer needed.</p>
+ <p>Remove the condense mapping table stage as it is no longer needed (which also saves a bit of time).</p>
<p>Instead of using:</p>
<div class="highlight"><pre><span></span><code><span class="n">beam</span><span class="o">.</span><span class="n">FlatMap</span><span class="p">(</span> <div class="highlight"><pre><span></span><code><span class="n">beam</span><span class="o">.</span><span class="n">FlatMap</span><span class="p">(</span>
<span class="n">insert_data_for_id</span><span class="p">,</span> <span class="n">beam</span><span class="o">.</span><span class="n">pvalue</span><span class="o">.</span><span class="n">AsSingleton</span><span class="p">(</span><span class="n">mapping_table_condensed</span><span class="p">)</span> <span class="n">insert_data_for_id</span><span class="p">,</span> <span class="n">beam</span><span class="o">.</span><span class="n">pvalue</span><span class="o">.</span><span class="n">AsSingleton</span><span class="p">(</span><span class="n">mapping_table_condensed</span><span class="p">)</span>

View File

@@ -710,7 +710,7 @@
<li>The key being the id across all columns (<code>id_all_columns</code>).</li>
<li>The value being the raw data as an array.</li>
</ul>
- <p>The mapping table is then condensed to a single dictionary with these key, value pairs and is used as a side input further down the pipeline.</p>
+ <p>The mapping table is then condensed to a single dictionary with these key, value pairs (automatically deduplicating repeated rows) and is used as a side input further down the pipeline.</p>
<p>This mapping table is created to ensure the <code>GroupByKey</code> operation is as quick as possible. The more data you have to process in a <code>GroupByKey</code>, the longer the operation takes. By doing the <code>GroupByKey</code> using just the ids, the pipeline can process the files much quicker than if we included the raw data in this operation.</p>
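<p>The condense step can be sketched as a plain fold into one dictionary; the names here are illustrative, and in the pipeline this dictionary would travel as the <code>beam.pvalue.AsSingleton</code> side input:</p>

```python
# Sketch of condensing the mapping table: later pairs with the same id
# overwrite earlier identical ones, which is what deduplicates repeated rows.
def condense_mapping_table(pairs):
    mapping = {}
    for key, raw_row in pairs:
        mapping[key] = raw_row
    return mapping

# The GroupByKey then only ever shuffles the small ids; the raw data is
# looked up afterwards in this dict via the side input.
```

<p>With two pairs sharing a key, the condensed result holds a single entry for that key, which is exactly the deduplication behaviour relied on here.</p>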
<h2 id="prepare-stage">Prepare stage<a class="headerlink" href="#prepare-stage" title="Permanent link">&para;</a></h2> <h2 id="prepare-stage">Prepare stage<a class="headerlink" href="#prepare-stage" title="Permanent link">&para;</a></h2>
<ul> <ul>

View File

@@ -794,7 +794,7 @@
<h4 id="repeated-rows">Repeated rows<a class="headerlink" href="#repeated-rows" title="Permanent link">&para;</a></h4> <h4 id="repeated-rows">Repeated rows<a class="headerlink" href="#repeated-rows" title="Permanent link">&para;</a></h4>
<p>Some of the data is repeated:</p> <p>Some of the data is repeated:</p>
<ul> <ul>
- <li>Some rows repeated, with the same date + price + address information but with a unique transaction id.</li>
+ <li>Some rows are repeated, with the same date + price + address information but with a unique transaction id.</li>
</ul>
<details>
<summary>Example (PCollection)</summary>
@@ -816,7 +816,7 @@
<span class="p">]</span> <span class="p">]</span>
<span class="p">},</span> <span class="p">},</span>
<span class="p">{</span> <span class="p">{</span>
<span class="nt">&quot;fd4634faec47c29de40bbf7840723b41&quot;</span><span class="p">:</span> <span class="p">[</span> <span class="nt">&quot;gd4634faec47c29de40bbf7840723b42&quot;</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">&quot;317500&quot;</span><span class="p">,</span> <span class="s2">&quot;317500&quot;</span><span class="p">,</span>
<span class="s2">&quot;2020-11-13 00:00&quot;</span><span class="p">,</span> <span class="s2">&quot;2020-11-13 00:00&quot;</span><span class="p">,</span>
<span class="s2">&quot;B90 3LA&quot;</span><span class="p">,</span> <span class="s2">&quot;B90 3LA&quot;</span><span class="p">,</span>

View File

@@ -358,8 +358,15 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#poetry" class="md-nav__link"> <a href="#pip" class="md-nav__link">
Poetry Pip
</a>
</li>
<li class="md-nav__item">
<a href="#poetry-alternative" class="md-nav__link">
Poetry (Alternative)
</a> </a>
</li> </li>
@@ -600,8 +607,15 @@
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix> <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item"> <li class="md-nav__item">
<a href="#poetry" class="md-nav__link"> <a href="#pip" class="md-nav__link">
Poetry Pip
</a>
</li>
<li class="md-nav__item">
<a href="#poetry-alternative" class="md-nav__link">
Poetry (Alternative)
</a> </a>
</li> </li>
@@ -627,18 +641,18 @@
<p>The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested.</p>
<p>The task has been tested on macOS Big Sur and WSL2. The task should run on Windows but this wasn't tested.</p>
<p>For Beam 2.32.0 the supported versions of the Python SDK can be found <a href="https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies#sdk-for-python">here</a>.</p>
<h2 id="poetry">Poetry<a class="headerlink" href="#poetry" title="Permanent link">&para;</a></h2> <h2 id="pip">Pip<a class="headerlink" href="#pip" title="Permanent link">&para;</a></h2>
<p>The test uses <a href="https://python-poetry.org">Poetry</a> for dependency management.</p> <p>In a virtual environment run from the root of the repo:</p>
<div class="admonition info inline end"> <div class="highlight"><pre><span></span><code>pip install -r requirements.txt
<p class="admonition-title">Info</p>
<p>If you already have Poetry installed globally you can go straight to the <code>poetry install</code> step.</p>
</div>
<p>In a virtual environment install poetry:</p>
<div class="highlight"><pre><span></span><code>pip install poetry
</code></pre></div> </code></pre></div>
<h2 id="poetry-alternative">Poetry (Alternative)<a class="headerlink" href="#poetry-alternative" title="Permanent link">&para;</a></h2>
<p>Install <a href="https://python-poetry.org">Poetry</a> <em>globally</em></p>
<p>From the root of the repo install the dependencies with:</p>
<div class="highlight"><pre><span></span><code>poetry install --no-dev
</code></pre></div>
<p>Activate the shell with:</p>
<div class="highlight"><pre><span></span><code>poetry shell
</code></pre></div>

View File

@@ -667,7 +667,7 @@
<h1 id="usage">Usage<a class="headerlink" href="#usage" title="Permanent link">&para;</a></h1> <h1 id="usage">Usage<a class="headerlink" href="#usage" title="Permanent link">&para;</a></h1>
<p>This page documents how to run the pipeline locally to complete the task for the <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1">dataset for 2020</a>.</p> <p>This page documents how to run the pipeline locally to complete the task for the <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#section-1">dataset for 2020</a>.</p>
- <p>The pipeline also runs in GCP using DataFlow and is discussed further on but can be viewed here. We also discuss how to adapt the pipeline so it can run against <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file">the full dataset</a>.</p>
+ <p>The pipeline also runs in GCP using DataFlow and is discussed further on but can be viewed <a href="../dataflow/index.html">here</a>. We also discuss how to adapt the pipeline so it can run against <a href="https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads#single-file">the full dataset</a>.</p>
<h2 id="download-dataset">Download dataset<a class="headerlink" href="#download-dataset" title="Permanent link">&para;</a></h2> <h2 id="download-dataset">Download dataset<a class="headerlink" href="#download-dataset" title="Permanent link">&para;</a></h2>
<p>The input data by default should go in <code>./data/input</code>.</p> <p>The input data by default should go in <code>./data/input</code>.</p>
<p>For convenience the data is available publicly in a GCP Cloud Storage bucket.</p> <p>For convenience the data is available publicly in a GCP Cloud Storage bucket.</p>
@@ -676,13 +676,13 @@
</code></pre></div>
<p>to download the data for 2020 and place in the input directory above.</p>
<h2 id="entrypoint">Entrypoint<a class="headerlink" href="#entrypoint" title="Permanent link">&para;</a></h2>
- <p>The entrypoint to the pipeline is <code>analyse-properties</code>.</p>
+ <p>The entrypoint to the pipeline is <code>analyse_properties.main</code>.</p>
<h2 id="available-options">Available options<a class="headerlink" href="#available-options" title="Permanent link">&para;</a></h2> <h2 id="available-options">Available options<a class="headerlink" href="#available-options" title="Permanent link">&para;</a></h2>
<p>Running</p> <p>Running</p>
<div class="highlight"><pre><span></span><code>analyse-properties --help <div class="highlight"><pre><span></span><code>python -m analyse_properties.main --help
</code></pre></div>
<p>gives the following output:</p>
<div class="highlight"><pre><span></span><code>usage: analyse-properties <span class="o">[</span>-h<span class="o">]</span> <span class="o">[</span>--input INPUT<span class="o">]</span> <span class="o">[</span>--output OUTPUT<span class="o">]</span> <div class="highlight"><pre><span></span><code>usage: analyse_properties.main <span class="o">[</span>-h<span class="o">]</span> <span class="o">[</span>--input INPUT<span class="o">]</span> <span class="o">[</span>--output OUTPUT<span class="o">]</span>
optional arguments:
-h, --help show this <span class="nb">help</span> message and <span class="nb">exit</span>
@@ -690,11 +690,14 @@ optional arguments:
--output OUTPUT Full path to the output file without extension.
</code></pre></div>
<p>The default value for input is <code>./data/input/pp-2020.csv</code> and the default value for output is <code>./data/output/pp-2020</code>.</p>
<p>If passing in values for <code>input</code>/<code>output</code> these should be <strong>full</strong> paths to the files. The test will parse these inputs as a <code>str()</code> and pass this to <code class="highlight"><span class="n">beam</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">ReadFromText</span><span class="p">()</span></code>.</p>
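<p>A minimal sketch of the option handling described here, using <code>argparse</code> with <code>parse_known_args</code> so flags such as <code>--runner</code> fall through to Beam's pipeline options; the actual parser in <code>analyse_properties.main</code> may differ:</p>

```python
import argparse

def parse_pipeline_args(argv=None):
    # Defaults mirror the documented values.
    parser = argparse.ArgumentParser(prog="analyse_properties.main")
    parser.add_argument("--input", default="./data/input/pp-2020.csv",
                        help="Full path to the input file.")
    parser.add_argument("--output", default="./data/output/pp-2020",
                        help="Full path to the output file without extension.")
    # Unknown args (e.g. --runner DirectRunner) are returned for Beam.
    return parser.parse_known_args(argv)

args, beam_args = parse_pipeline_args(["--runner", "DirectRunner"])
# str(args.input) is then passed to beam.io.ReadFromText().
```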
<h2 id="run-the-pipeline">Run the pipeline<a class="headerlink" href="#run-the-pipeline" title="Permanent link">&para;</a></h2> <h2 id="run-the-pipeline">Run the pipeline<a class="headerlink" href="#run-the-pipeline" title="Permanent link">&para;</a></h2>
<p>To run the pipeline and complete the task run:</p> <p>To run the pipeline and complete the task run:</p>
<div class="highlight"><pre><span></span><code>analyse-properties --runner DirectRunner <div class="highlight"><pre><span></span><code>python -m analyse_properties.main <span class="se">\</span>
--runner DirectRunner <span class="se">\</span>
--input ./data/input/pp-2020.csv <span class="se">\</span>
--output ./data/output/pp-2020
</code></pre></div>
<p>from the root of the repo.</p>
<p>The pipeline will use the 2020 dataset located in <code>./data/input</code> and output the resulting <code>.json</code> to <code>./data/output</code>.</p>

View File

@@ -626,7 +626,7 @@
<h1 id="welcome">Welcome<a class="headerlink" href="#welcome" title="Permanent link">&para;</a></h1> <h1 id="welcome">Welcome<a class="headerlink" href="#welcome" title="Permanent link">&para;</a></h1>
<h2 id="introduction">Introduction<a class="headerlink" href="#introduction" title="Permanent link">&para;</a></h2> <h2 id="introduction">Introduction<a class="headerlink" href="#introduction" title="Permanent link">&para;</a></h2>
<p>This documentation accompanies the technical test for the Street Group.</p> <p>This documentation accompanies the technical test for the Street Group.</p>
- <p>The following pages will guide the user through installing the requirements, and running the task to complete the test. In addition, there is some discussion around the approach, and any improvements that could be made.</p>
+ <p>The following pages will guide the user through installing the requirements, and running the task to complete the test. In addition, there is some discussion around the approach, and scaling the pipeline.</p>
<p>Navigate sections using the tabs at the top of the page. Pages in this section can be viewed in order by using the section links in the left menu, or by using the bar at the bottom of the page. The table of contents in the right menu can be used to navigate sections on each page.</p>
<div class="admonition note"> <div class="admonition note">
<p class="admonition-title">Note</p> <p class="admonition-title">Note</p>

File diff suppressed because one or more lines are too long

Binary file not shown.