Deployed 8a0d808 with MkDocs version: 1.2.2

2026-02-06 07:55:45 +00:00 · 2021-09-28 00:31:12 +01:00
parent c76e3c542a
commit 0e17f26631
9 changed files with 41 additions and 28 deletions
--- a/dataflow/index.html
+++ b/dataflow/index.html
@@ -699,10 +699,6 @@
 <p>We need to choose a <code>worker_machine_type</code> with sufficient memory to run the pipeline. Because the pipeline uses a mapping table, and Dataflow autoscales on CPU rather than memory usage, we need a machine with more RAM than usual to ensure sufficient memory when running on one worker. For <code>pp-2020.csv</code> the type <code>n1-highmem-2</code> with 2 vCPUs and 13 GB of RAM was chosen, and the pipeline completed successfully in ~10 minutes using only 1 worker.</p>
 </div>
 <p>Assuming the <code>pp-2020.csv</code> file has been placed in the <code>./input</code> directory in the bucket you can run a command similar to:</p>
-<div class="admonition caution">
-<p class="admonition-title">Caution</p>
-<p>Use the command <code>python -m analyse_properties.main</code> as the entrypoint to the pipeline and not <code>analyse-properties</code> as the module isn't installed with poetry on the workers with the configuration below.</p>
-</div>
 <div class="highlight"><pre><span></span><code>python -m analyse_properties.main <span class="se">\</span>
    --runner DataflowRunner <span class="se">\</span>
    --project street-group <span class="se">\</span>
--- a/dataflow/scaling.html
+++ b/dataflow/scaling.html
@@ -686,7 +686,7 @@ struct.error: &#39;i&#39; format requires -2147483648 &lt;= number &lt;= 2147483
 <h2 id="solution">Solution<a class="headerlink" href="#solution" title="Permanent link">&para;</a></h2>
 <p>A possible solution would be to leverage BigQuery to store the results of the mapping table as the pipeline progresses. We can make use of BigQuery's array type to store the raw array directly as we process each row.</p>
 <p>In addition to creating the mapping table <code>(key, value)</code> pairs, we also save these pairs to BigQuery at this stage. We then yield the element as it is currently written to allow the subsequent stages to make use of this data.</p>
-<p>Remove the condense mapping table stage as it is no longer needed.</p>
+<p>Remove the condense mapping table stage as it is no longer needed (which also saves a bit of time).</p>
 <p>Instead of using:</p>
 <div class="highlight"><pre><span></span><code><span class="n">beam</span><span class="o">.</span><span class="n">FlatMap</span><span class="p">(</span>
    <span class="n">insert_data_for_id</span><span class="p">,</span> <span class="n">beam</span><span class="o">.</span><span class="n">pvalue</span><span class="o">.</span><span class="n">AsSingleton</span><span class="p">(</span><span class="n">mapping_table_condensed</span><span class="p">)</span>