Deployed a53d791 with MkDocs version: 1.2.2

2021-09-27 23:17:07 +01:00
parent 4f331eff92
commit c76e3c542a
18 changed files with 5037 additions and 7 deletions


@@ -207,6 +207,22 @@
<li class="md-tabs__item">
<a href="../dataflow/index.html" class="md-tabs__link">
DataFlow
</a>
</li>
<li class="md-tabs__item">
<a href="../pandas-profiling/report.html" class="md-tabs__link">
Data Exploration Report
@@ -394,10 +410,144 @@
<label class="md-nav__link md-nav__link--active" for="__toc">
Data Exploration Report
<span class="md-nav__icon md-icon"></span>
</label>
<a href="exploration.html" class="md-nav__link md-nav__link--active">
Data Exploration Report
</a>
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#interesting-observations" class="md-nav__link">
Interesting observations
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item">
<a href="cleaning.html" class="md-nav__link">
Cleaning
</a>
</li>
<li class="md-nav__item">
<a href="approach.html" class="md-nav__link">
Approach
</a>
</li>
<li class="md-nav__item">
<a href="results.html" class="md-nav__link">
Results
</a>
</li>
</ul>
</nav>
</li>
<li class="md-nav__item md-nav__item--nested">
<input class="md-nav__toggle md-toggle" data-md-toggle="__nav_3" type="checkbox" id="__nav_3" >
<label class="md-nav__link" for="__nav_3">
DataFlow
<span class="md-nav__icon md-icon"></span>
</label>
<nav class="md-nav" aria-label="DataFlow" data-md-level="1">
<label class="md-nav__title" for="__nav_3">
<span class="md-nav__icon md-icon"></span>
DataFlow
</label>
<ul class="md-nav__list" data-md-scrollfix>
<li class="md-nav__item">
<a href="../dataflow/index.html" class="md-nav__link">
Running on DataFlow
</a>
</li>
<li class="md-nav__item">
<a href="../dataflow/scaling.html" class="md-nav__link">
Scaling to the Full DataSet
</a>
</li>
@@ -443,6 +593,21 @@
<label class="md-nav__title" for="__toc">
<span class="md-nav__icon md-icon"></span>
Table of contents
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#interesting-observations" class="md-nav__link">
Interesting observations
</a>
</li>
</ul>
</nav>
</div>
</div>
@@ -459,9 +624,66 @@
<h1 id="data-exploration-report">Data Exploration Report<a class="headerlink" href="#data-exploration-report" title="Permanent link">&para;</a></h1>
<p>A brief exploration was done on the <strong>full</strong> dataset using the module <code>pandas-profiling</code>. The module uses <code>pandas</code> to load a dataset and automatically produce quantile/descriptive statistics, common values, extreme values, skew, kurtosis etc.</p>
<p>The script used to generate this report is located in <code>./exploration/report.py</code>.</p>
<p>A brief exploration was done on the <strong>full</strong> dataset using the module <code>pandas-profiling</code>. The module uses <code>pandas</code> to load a dataset and automatically produce quantile/descriptive statistics, common values, extreme values, skew, kurtosis, etc., and writes an <code>.html</code> report that can be viewed interactively in your browser.</p>
<p>The script used to generate this report is located in <code>./exploration/report.py</code> and can be viewed below.</p>
<details>
<summary>report.py</summary>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pathlib</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandas_profiling</span> <span class="kn">import</span> <span class="n">ProfileReport</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">input_file</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">pathlib</span><span class="o">.</span><span class="n">Path</span><span class="p">(</span><span class="vm">__file__</span><span class="p">)</span><span class="o">.</span><span class="n">parents</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="s2">&quot;data&quot;</span> <span class="o">/</span> <span class="s2">&quot;input&quot;</span> <span class="o">/</span> <span class="s2">&quot;pp-complete.csv&quot;</span>
    <span class="p">)</span>
    <span class="k">with</span> <span class="n">input_file</span><span class="o">.</span><span class="n">open</span><span class="p">()</span> <span class="k">as</span> <span class="n">csv</span><span class="p">:</span>
        <span class="n">df_report</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span>
            <span class="n">csv</span><span class="p">,</span>
            <span class="n">names</span><span class="o">=</span><span class="p">[</span>
                <span class="s2">&quot;transaction_id&quot;</span><span class="p">,</span>
                <span class="s2">&quot;price&quot;</span><span class="p">,</span>
                <span class="s2">&quot;date_of_transfer&quot;</span><span class="p">,</span>
                <span class="s2">&quot;postcode&quot;</span><span class="p">,</span>
                <span class="s2">&quot;property_type&quot;</span><span class="p">,</span>
                <span class="s2">&quot;old_new&quot;</span><span class="p">,</span>
                <span class="s2">&quot;duration&quot;</span><span class="p">,</span>
                <span class="s2">&quot;paon&quot;</span><span class="p">,</span>
                <span class="s2">&quot;saon&quot;</span><span class="p">,</span>
                <span class="s2">&quot;street&quot;</span><span class="p">,</span>
                <span class="s2">&quot;locality&quot;</span><span class="p">,</span>
                <span class="s2">&quot;town_city&quot;</span><span class="p">,</span>
                <span class="s2">&quot;district&quot;</span><span class="p">,</span>
                <span class="s2">&quot;county&quot;</span><span class="p">,</span>
                <span class="s2">&quot;ppd_category&quot;</span><span class="p">,</span>
                <span class="s2">&quot;record_status&quot;</span><span class="p">,</span>
            <span class="p">],</span>
        <span class="p">)</span>
    <span class="n">profile</span> <span class="o">=</span> <span class="n">ProfileReport</span><span class="p">(</span><span class="n">df_report</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">&quot;Price Paid Data&quot;</span><span class="p">,</span> <span class="n">minimal</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
    <span class="n">profile</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s2">&quot;price_paid_data_report.html&quot;</span><span class="p">)</span>


<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&quot;__main__&quot;</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div>
</details>
<p>The report can be viewed by clicking the Data Exploration Report tab at the top of the page.</p>
<h2 id="interesting-observations">Interesting observations<a class="headerlink" href="#interesting-observations" title="Permanent link">&para;</a></h2>
<p>When reading the report, we are checking data quality and looking for missing observations. The statistics are interesting but largely irrelevant for this task.</p>
<p>The data overall looks very good for a dataset of its size (~27 million records). For important fields there are no missing values:</p>
<ul>
<li>Every row has a price.</li>
<li>Every row has a unique transaction ID.</li>
<li>Every row has a transaction date.</li>
</ul>
<p>Some fields that we will need are missing data:</p>
<ul>
<li>~42,000 rows (0.2%) are missing a postcode.</li>
<li>~4,000 rows (&lt;0.1%) are missing a PAON (primary addressable object name).</li>
<li>~412,000 rows (1.6%) are missing a street name.</li>
</ul>
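<p>The missing-value counts above come from the generated report. As a rough cross-check, the same kind of figures could be computed directly with <code>pandas</code>. The sketch below is illustrative only: the sample rows are made up, only a subset of the columns from <code>report.py</code> is used, and the helper <code>missing_summary</code> is a hypothetical name, not part of the project.</p>

```python
import io

import pandas as pd

# Tiny made-up sample standing in for pp-complete.csv (assumption: the real
# file has 16 columns; only six are used here to keep the example short).
SAMPLE_CSV = """\
id-1,100000,1995-01-01 00:00,AB1 2CD,12,HIGH STREET
id-2,250000,1995-01-02 00:00,,7,
id-3,175000,1995-01-03 00:00,EF3 4GH,,LOW ROAD
id-4,90000,1995-01-04 00:00,IJ5 6KL,3,
"""

COLUMNS = ["transaction_id", "price", "date_of_transfer", "postcode", "paon", "street"]


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = df.isna().sum()
    return pd.DataFrame(
        {"missing": missing, "percent": (100 * missing / len(df)).round(1)}
    )


# Empty CSV fields are read as NaN, so isna() picks them up.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), names=COLUMNS)
summary = missing_summary(df)
print(summary)

# Check that every row has a distinct transaction ID.
print("transaction_id unique:", df["transaction_id"].is_unique)
```

<p>On the full dataset the same function would be applied to the <code>DataFrame</code> loaded in <code>report.py</code>, assuming it fits in memory.</p>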
@@ -497,13 +719,13 @@
<a href="../pandas-profiling/report.html" class="md-footer__link md-footer__link--next" aria-label="Next: Data Exploration Report" rel="next">
<a href="cleaning.html" class="md-footer__link md-footer__link--next" aria-label="Next: Cleaning" rel="next">
<div class="md-footer__title">
<div class="md-ellipsis">
<span class="md-footer__direction">
Next
</span>
Data Exploration Report
Cleaning
</div>
</div>
<div class="md-footer__button md-icon">