mirror of
https://github.com/dtomlinson91/street_group_tech_test
synced 2025-12-22 11:55:45 +00:00
Deployed a53d791 with MkDocs version: 1.2.2
@@ -207,6 +207,22 @@
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-tabs__item">
|
||||
<a href="../dataflow/index.html" class="md-tabs__link">
|
||||
DataFlow
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-tabs__item">
|
||||
<a href="../pandas-profiling/report.html" class="md-tabs__link">
|
||||
Data Exploration Report
|
||||
@@ -394,10 +410,144 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link md-nav__link--active" for="__toc">
|
||||
Data Exploration Report
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<a href="exploration.html" class="md-nav__link md-nav__link--active">
|
||||
Data Exploration Report
|
||||
</a>
|
||||
|
||||
|
||||
<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#interesting-observations" class="md-nav__link">
|
||||
Interesting observations
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="cleaning.html" class="md-nav__link">
|
||||
Cleaning
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="approach.html" class="md-nav__link">
|
||||
Approach
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="results.html" class="md-nav__link">
|
||||
Results
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
</ul>
|
||||
</nav>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item md-nav__item--nested">
|
||||
|
||||
|
||||
<input class="md-nav__toggle md-toggle" data-md-toggle="__nav_3" type="checkbox" id="__nav_3" >
|
||||
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__link" for="__nav_3">
|
||||
DataFlow
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
</label>
|
||||
|
||||
<nav class="md-nav" aria-label="DataFlow" data-md-level="1">
|
||||
<label class="md-nav__title" for="__nav_3">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
DataFlow
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-scrollfix>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../dataflow/index.html" class="md-nav__link">
|
||||
Running on DataFlow
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="../dataflow/scaling.html" class="md-nav__link">
|
||||
Scaling to the Full DataSet
|
||||
</a>
|
||||
</li>
|
||||
|
||||
|
||||
@@ -443,6 +593,21 @@
|
||||
|
||||
|
||||
|
||||
<label class="md-nav__title" for="__toc">
|
||||
<span class="md-nav__icon md-icon"></span>
|
||||
Table of contents
|
||||
</label>
|
||||
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
|
||||
|
||||
<li class="md-nav__item">
|
||||
<a href="#interesting-observations" class="md-nav__link">
|
||||
Interesting observations
|
||||
</a>
|
||||
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
|
||||
</nav>
|
||||
</div>
|
||||
</div>
|
||||
@@ -459,9 +624,66 @@
|
||||
|
||||
|
||||
<h1 id="data-exploration-report">Data Exploration Report<a class="headerlink" href="#data-exploration-report" title="Permanent link">¶</a></h1>
|
||||
<p>A brief exploration was done on the <strong>full</strong> dataset using the module <code>pandas-profiling</code>. The module uses <code>pandas</code> to load a dataset and automatically produce quantile/descriptive statistics, common values, extreme values, skew, kurtosis etc.</p>
|
||||
<p>The script used to generate this report is located in <code>./exploration/report.py</code>.</p>
|
||||
<p>A brief exploration was done on the <strong>full</strong> dataset using the module <code>pandas-profiling</code>. The module uses <code>pandas</code> to load a dataset and automatically produces quantile/descriptive statistics, common values, extreme values, skew, kurtosis, etc. It writes the results to an <code>.html</code> report that can be viewed interactively in your browser.</p>
|
||||
<p>The script used to generate this report is located in <code>./exploration/report.py</code> and can be viewed below.</p>
|
||||
<details>
|
||||
<summary>report.py</summary>
|
||||
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pathlib</span>
|
||||
|
||||
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
|
||||
<span class="kn">from</span> <span class="nn">pandas_profiling</span> <span class="kn">import</span> <span class="n">ProfileReport</span>
|
||||
|
||||
|
||||
<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
|
||||
<span class="n">input_file</span> <span class="o">=</span> <span class="p">(</span>
|
||||
<span class="n">pathlib</span><span class="o">.</span><span class="n">Path</span><span class="p">(</span><span class="vm">__file__</span><span class="p">)</span><span class="o">.</span><span class="n">parents</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">/</span> <span class="s2">"data"</span> <span class="o">/</span> <span class="s2">"input"</span> <span class="o">/</span> <span class="s2">"pp-complete.csv"</span>
|
||||
<span class="p">)</span>
|
||||
<span class="k">with</span> <span class="n">input_file</span><span class="o">.</span><span class="n">open</span><span class="p">()</span> <span class="k">as</span> <span class="n">csv</span><span class="p">:</span>
|
||||
<span class="n">df_report</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span>
|
||||
<span class="n">csv</span><span class="p">,</span>
|
||||
<span class="n">names</span><span class="o">=</span><span class="p">[</span>
|
||||
<span class="s2">"transaction_id"</span><span class="p">,</span>
|
||||
<span class="s2">"price"</span><span class="p">,</span>
|
||||
<span class="s2">"date_of_transfer"</span><span class="p">,</span>
|
||||
<span class="s2">"postcode"</span><span class="p">,</span>
|
||||
<span class="s2">"property_type"</span><span class="p">,</span>
|
||||
<span class="s2">"old_new"</span><span class="p">,</span>
|
||||
<span class="s2">"duration"</span><span class="p">,</span>
|
||||
<span class="s2">"paon"</span><span class="p">,</span>
|
||||
<span class="s2">"saon"</span><span class="p">,</span>
|
||||
<span class="s2">"street"</span><span class="p">,</span>
|
||||
<span class="s2">"locality"</span><span class="p">,</span>
|
||||
<span class="s2">"town_city"</span><span class="p">,</span>
|
||||
<span class="s2">"district"</span><span class="p">,</span>
|
||||
<span class="s2">"county"</span><span class="p">,</span>
|
||||
<span class="s2">"ppd_category"</span><span class="p">,</span>
|
||||
<span class="s2">"record_status"</span><span class="p">,</span>
|
||||
<span class="p">],</span>
|
||||
<span class="p">)</span>
|
||||
<span class="n">profile</span> <span class="o">=</span> <span class="n">ProfileReport</span><span class="p">(</span><span class="n">df_report</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="s2">"Price Paid Data"</span><span class="p">,</span> <span class="n">minimal</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
|
||||
<span class="n">profile</span><span class="o">.</span><span class="n">to_file</span><span class="p">(</span><span class="s2">"price_paid_data_report.html"</span><span class="p">)</span>
|
||||
|
||||
|
||||
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
|
||||
<span class="n">main</span><span class="p">()</span>
|
||||
</code></pre></div>
|
||||
</details>
|
||||
|
||||
<p>The report can be viewed by clicking the Data Exploration Report tab at the top of the page.</p>
|
||||
<h2 id="interesting-observations">Interesting observations<a class="headerlink" href="#interesting-observations" title="Permanent link">¶</a></h2>
|
||||
<p>When reading the report we are mainly checking data quality and looking for missing observations. The descriptive statistics are interesting to see but are largely irrelevant for this task.</p>
|
||||
<p>Overall, the data looks very good for a dataset of its size (~27 million records). The important fields have no missing values:</p>
|
||||
<ul>
|
||||
<li>Every row has a price.</li>
|
||||
<li>Every row has a unique transaction ID.</li>
|
||||
<li>Every row has a transaction date.</li>
|
||||
</ul>
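<p>The three checks above can also be reproduced without loading the whole file into memory. The following is a minimal sketch, not the code used to produce the report: it streams a headerless CSV with the standard library and assumes the column order given in the <code>names</code> list of <code>report.py</code>. The function name <code>check_required_fields</code> and the sample rows are hypothetical.</p>

```python
import csv
import io

# Column positions assumed from the `names` list in report.py.
TRANSACTION_ID, PRICE, DATE_OF_TRANSFER = 0, 1, 2


def check_required_fields(lines):
    """Stream headerless CSV rows; count empty required fields and repeated IDs."""
    seen_ids = set()
    result = {"rows": 0, "missing_price": 0, "missing_date": 0, "duplicate_ids": 0}
    for row in csv.reader(lines):
        result["rows"] += 1
        if not row[PRICE].strip():
            result["missing_price"] += 1
        if not row[DATE_OF_TRANSFER].strip():
            result["missing_date"] += 1
        if row[TRANSACTION_ID] in seen_ids:
            result["duplicate_ids"] += 1
        seen_ids.add(row[TRANSACTION_ID])
    return result


# Made-up sample rows, truncated to the first three columns for brevity.
sample = io.StringIO(
    '"{A1}","250000","2020-01-31 00:00"\n'
    '"{B2}","","2020-02-01 00:00"\n'
    '"{A1}","180000","2020-02-02 00:00"\n'
)
summary = check_required_fields(sample)
print(summary)
```

<p>On the full dataset the set of seen IDs holds one entry per row, so this sketch trades memory for simplicity; on the real file all three counters would be expected to come out as zero, matching the observations above.</p>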
|
||||
<p>Some fields that we will need are missing data:</p>
|
||||
<ul>
|
||||
<li>~42,000 (0.2%) are missing a Postcode.</li>
|
||||
<li>~4,000 (&lt;0.1%) are missing a PAON (primary addressable object name).</li>
|
||||
<li>~412,000 (1.6%) are missing a Street Name.</li>
|
||||
</ul>
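<p>Counts like the ones above can be cross-checked outside of <code>pandas-profiling</code>. The sketch below is an illustration under stated assumptions rather than part of the project: it uses only the standard library, reuses the column names from <code>report.py</code>, and treats an empty field as missing. The helper name <code>missing_by_column</code> and the two sample rows are hypothetical.</p>

```python
import csv
import io
from collections import Counter

# Column names copied from the `names` list in report.py.
COLUMNS = [
    "transaction_id", "price", "date_of_transfer", "postcode",
    "property_type", "old_new", "duration", "paon", "saon", "street",
    "locality", "town_city", "district", "county", "ppd_category",
    "record_status",
]


def missing_by_column(lines, columns=COLUMNS):
    """Return {column: (missing_count, missing_percent)} for columns with gaps."""
    missing = Counter()
    total = 0
    for row in csv.reader(lines):
        total += 1
        for name, value in zip(columns, row):
            if not value.strip():  # an empty field counts as missing
                missing[name] += 1
    return {
        name: (missing[name], 100 * missing[name] / total)
        for name in columns
        if missing[name]
    }


# Two made-up rows written to an in-memory CSV for demonstration.
rows = [
    ["{A1}", "250000", "2020-01-31 00:00", "AB1 2CD", "T", "N", "F", "12",
     "", "HIGH STREET", "", "TOWN", "DISTRICT", "COUNTY", "A", "A"],
    ["{B2}", "180000", "2020-02-01 00:00", "", "F", "N", "L", "FLAT 3",
     "", "", "", "TOWN", "DISTRICT", "COUNTY", "A", "A"],
]
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)

report = missing_by_column(buffer)
for name, (count, percent) in report.items():
    print(f"{name}: {count} missing ({percent:.1f}%)")
```

<p>By default <code>pandas.read_csv</code> also treats strings such as <code>NA</code> or <code>null</code> as missing, which this sketch does not, so small differences from the report's figures are possible.</p>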
|
||||
|
||||
|
||||
|
||||
@@ -497,13 +719,13 @@
|
||||
|
||||
|
||||
|
||||
<a href="../pandas-profiling/report.html" class="md-footer__link md-footer__link--next" aria-label="Next: Data Exploration Report" rel="next">
|
||||
<a href="cleaning.html" class="md-footer__link md-footer__link--next" aria-label="Next: Cleaning" rel="next">
|
||||
<div class="md-footer__title">
|
||||
<div class="md-ellipsis">
|
||||
<span class="md-footer__direction">
|
||||
Next
|
||||
</span>
|
||||
Data Exploration Report
|
||||
Cleaning
|
||||
</div>
|
||||
</div>
|
||||
<div class="md-footer__button md-icon">
|
||||
|
||||