From 94cc22a385319dcfef292374f08ad2814763fdb5 Mon Sep 17 00:00:00 2001
From: Daniel Tomlinson
Date: Sat, 25 Sep 2021 00:48:34 +0100
Subject: [PATCH] adding data exploration report + code

---
 exploration/price_paid_data_report.html | 1439 +++++++++++++++++++++++
 exploration/report.py                   |   22 +-
 2 files changed, 1460 insertions(+), 1 deletion(-)
 create mode 100644 exploration/price_paid_data_report.html

diff --git a/exploration/price_paid_data_report.html b/exploration/price_paid_data_report.html
new file mode 100644
index 0000000..4a641eb
--- /dev/null
+++ b/exploration/price_paid_data_report.html
@@ -0,0 +1,1439 @@
Price Paid Data

Overview

Dataset statistics

Number of variables: 16
Number of observations: 26321785
Missing cells: 32579197
Missing cells (%): 7.7%
Total size in memory: 3.1 GiB
Average record size in memory: 128.0 B
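
As a rough cross-check, the missing-cell percentage can be reproduced from the counts above. This is a minimal sketch that assumes the profiler counts one cell per variable per observation; that definition is an assumption, not something the report states.

```python
# Figures copied from the dataset statistics above.
n_variables = 16
n_observations = 26_321_785
missing_cells = 32_579_197

# Assumption: total cells = variables x observations.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

print(f"{missing_pct:.1f}%")
```

Under that assumption the result rounds to the 7.7% shown in the report.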

Variable types

Categorical: 15
Numeric: 1

Warnings

record_status has constant value "A" Constant
transaction_id has a high cardinality: 26321785 distinct values High cardinality
date_of_transfer has a high cardinality: 9698 distinct values High cardinality
postcode has a high cardinality: 1274429 distinct values High cardinality
paon has a high cardinality: 508466 distinct values High cardinality
saon has a high cardinality: 58202 distinct values High cardinality
street has a high cardinality: 320310 distinct values High cardinality
locality has a high cardinality: 23716 distinct values High cardinality
town_city has a high cardinality: 1171 distinct values High cardinality
district has a high cardinality: 463 distinct values High cardinality
county has a high cardinality: 130 distinct values High cardinality
saon has 23252525 (88.3%) missing values Missing
street has 411893 (1.6%) missing values Missing
locality has 8868564 (33.7%) missing values Missing
price is highly skewed (γ1 = 212.5121542) Skewed
transaction_id has unique values Unique

Reproduction

Analysis started: 2021-09-24 16:09:53.212210
Analysis finished: 2021-09-24 16:15:35.611974
Duration: 5 minutes and 42.4 seconds
Software version: pandas-profiling v3.0.0
Download configuration: config.json

Variables

transaction_id
Categorical

HIGH CARDINALITY
UNIQUE

Distinct: 26321785
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

{A2EF370F-229F-405B-A6EC-3D501915AD67} | 1
{5F9A7C5C-E476-4986-8722-CCD108EB8B6E} | 1
{E31EE92A-128A-4AA3-81B2-E7A8B3A4C90D} | 1
{919FEC05-D788-9A90-E053-6C04A8C0A300} | 1
{6FFBE023-6AE3-451E-ADA4-8D66F231D8F8} | 1
Other values (26321780) | 26321780

Unique

Unique: 26321785
Unique (%): 100.0%

Sample

1st row: {F887F88E-7D15-4415-804E-52EAC2F10958}
2nd row: {40FD4DF2-5362-407C-92BC-566E2CCE89E9}
3rd row: {7A99F89E-7D81-4E45-ABD5-566E49A045EA}
4th row: {28225260-E61C-4E57-8B56-566E5285B1C1}
5th row: {444D34D7-9BA6-43A7-B695-4F48980E0176}

Common Values

Value | Count | Frequency (%)
{A2EF370F-229F-405B-A6EC-3D501915AD67} | 1 | < 0.1%
{5F9A7C5C-E476-4986-8722-CCD108EB8B6E} | 1 | < 0.1%
{E31EE92A-128A-4AA3-81B2-E7A8B3A4C90D} | 1 | < 0.1%
{919FEC05-D788-9A90-E053-6C04A8C0A300} | 1 | < 0.1%
{6FFBE023-6AE3-451E-ADA4-8D66F231D8F8} | 1 | < 0.1%
{174380A1-DCA2-4DB5-A7C3-661C6C5D9798} | 1 | < 0.1%
{1D30B696-41C4-4D78-A474-EE409899F476} | 1 | < 0.1%
{C18420AC-7FE9-4E93-89AC-53697390DE27} | 1 | < 0.1%
{62E23376-BBCE-435C-AE3B-A2CA076580F4} | 1 | < 0.1%
{BC6C63E2-BDA1-498E-9E99-3B4A64C67381} | 1 | < 0.1%
Other values (26321775) | 26321775 | > 99.9%

price
Real number (ℝ≥0)

SKEWED

Distinct: 218688
Distinct (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 205226.2064
Minimum: 1
Maximum: 630000000
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 200.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 36000
Q1: 80500
median: 142500
Q3: 235000
95-th percentile: 500000
Maximum: 630000000
Range: 629999999
Interquartile range (IQR): 154500

Descriptive statistics

Standard deviation: 813350.2288
Coefficient of variation (CV): 3.963188927
Kurtosis: 86409.10552
Mean: 205226.2064
Median Absolute Deviation (MAD): 71290
Skewness: 212.5121542
Sum: 5.401920082 × 10^12
Variance: 6.615385948 × 10^11
Monotonicity: Not monotonic
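
Several of the figures above are derived from others in the same report, so they can be checked against each other. The following is a minimal sketch that assumes the usual definitions (range = max − min, IQR = Q3 − Q1, CV = standard deviation / mean, variance = standard deviation squared); the report itself does not spell these out.

```python
# Figures copied from the quantile and descriptive statistics above.
minimum, maximum = 1, 630_000_000
q1, q3 = 80_500, 235_000
mean = 205_226.2064
std = 813_350.2288

value_range = maximum - minimum  # report shows 629999999
iqr = q3 - q1                    # report shows 154500
cv = std / mean                  # report shows 3.963188927
variance = std ** 2              # report shows 6.615385948 x 10^11

print(value_range, iqr, round(cv, 6), f"{variance:.6e}")
```

A CV near 4 together with a mean (205226) well above the median (142500) is consistent with the heavy right skew flagged in the warnings.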
Histogram with fixed size bins (bins=50)
Value | Count | Frequency (%)
250000 | 283789 | 1.1%
125000 | 246148 | 0.9%
120000 | 216854 | 0.8%
150000 | 205955 | 0.8%
110000 | 192318 | 0.7%
175000 | 185637 | 0.7%
115000 | 182009 | 0.7%
135000 | 181500 | 0.7%
60000 | 180568 | 0.7%
130000 | 178803 | 0.7%
Other values (218678) | 24268204 | 92.2%

Value | Count | Frequency (%)
1 | 101 | < 0.1%
5 | 2 | < 0.1%
10 | 8 | < 0.1%
11 | 3 | < 0.1%
15 | 4 | < 0.1%

Value | Count | Frequency (%)
630000000 | 1 | < 0.1%
594300000 | 1 | < 0.1%
569200000 | 1 | < 0.1%
448500000 | 1 | < 0.1%
448300979 | 1 | < 0.1%

date_of_transfer
Categorical

HIGH CARDINALITY

Distinct: 9698
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

2016-03-31 00:00 | 32378
2001-06-29 00:00 | 26583
2002-05-31 00:00 | 26338
2002-06-28 00:00 | 26320
2007-06-29 00:00 | 24970
Other values (9693) | 26185196

Unique

Unique: 10
Unique (%): < 0.1%

Sample

1st row: 1995-07-07 00:00
2nd row: 1995-02-03 00:00
3rd row: 1995-01-13 00:00
4th row: 1995-07-28 00:00
5th row: 1995-06-28 00:00

Common Values

Value | Count | Frequency (%)
2016-03-31 00:00 | 32378 | 0.1%
2001-06-29 00:00 | 26583 | 0.1%
2002-05-31 00:00 | 26338 | 0.1%
2002-06-28 00:00 | 26320 | 0.1%
2007-06-29 00:00 | 24970 | 0.1%
2000-06-30 00:00 | 24927 | 0.1%
2003-11-28 00:00 | 24802 | 0.1%
1999-05-28 00:00 | 24335 | 0.1%
2006-06-30 00:00 | 24308 | 0.1%
2000-03-31 00:00 | 23428 | 0.1%
Other values (9688) | 26063396 | 99.0%
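
The report types date_of_transfer as Categorical, presumably because the column was loaded as plain strings. A minimal sketch of parsing the sample values with the standard library follows; the format string is inferred from the samples shown above and is an assumption about the rest of the column.

```python
from datetime import datetime

# Sample values copied from the report above.
samples = [
    "1995-07-07 00:00",
    "1995-02-03 00:00",
    "1995-01-13 00:00",
    "1995-07-28 00:00",
    "1995-06-28 00:00",
]

# Assumption: every value follows the pattern seen in the samples.
DATE_FORMAT = "%Y-%m-%d %H:%M"
parsed = [datetime.strptime(s, DATE_FORMAT) for s in samples]

# All samples fall at midnight, so only the date part carries information.
dates = [p.date() for p in parsed]
print(min(dates), max(dates))
```

In pandas, passing `parse_dates=["date_of_transfer"]` to `read_csv` would be one way to get a datetime column, so that the profiler could treat it as a date rather than a category.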

postcode
Categorical

HIGH CARDINALITY

Distinct: 1274429
Distinct (%): 4.8%
Missing: 42019
Missing (%): 0.2%
Memory size: 200.8 MiB

YO10 3FT | 534
LU1 5FT | 523
RH10 3HZ | 387
L7 3AA | 372
TR8 4LX | 355
Other values (1274424) | 26277595

Unique

Unique: 88384
Unique (%): 0.3%

Sample

1st row: MK15 9HP
2nd row: SR6 0AQ
3rd row: CO6 1SQ
4th row: B90 4TG
5th row: DY5 1SA

Common Values

Value | Count | Frequency (%)
YO10 3FT | 534 | < 0.1%
LU1 5FT | 523 | < 0.1%
RH10 3HZ | 387 | < 0.1%
L7 3AA | 372 | < 0.1%
TR8 4LX | 355 | < 0.1%
M1 5GB | 348 | < 0.1%
BS3 3NG | 322 | < 0.1%
L5 3AA | 315 | < 0.1%
L3 8HA | 312 | < 0.1%
CM21 9PF | 305 | < 0.1%
Other values (1274419) | 26275993 | 99.8%
(Missing) | 42019 | 0.2%

property_type
Categorical

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

T | 7940720
S | 7224820
D | 6077417
F | 4730013
O | 348815

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: D
2nd row: T
3rd row: T
4th row: T
5th row: S

Common Values

Value | Count | Frequency (%)
T | 7940720 | 30.2%
S | 7224820 | 27.4%
D | 6077417 | 23.1%
F | 4730013 | 18.0%
O | 348815 | 1.3%

old_new
Categorical

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

N | 23590329
Y | 2731456

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: N
2nd row: N
3rd row: N
4th row: N
5th row: N

Common Values

Value | Count | Frequency (%)
N | 23590329 | 89.6%
Y | 2731456 | 10.4%

duration
Categorical

Distinct: 3
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

F | 20127890
L | 6193361
U | 534

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: F
2nd row: F
3rd row: F
4th row: F
5th row: F

Common Values

Value | Count | Frequency (%)
F | 20127890 | 76.5%
L | 6193361 | 23.5%
U | 534 | < 0.1%

paon
Categorical

HIGH CARDINALITY

Distinct: 508466
Distinct (%): 1.9%
Missing: 4196
Missing (%): < 0.1%
Memory size: 200.8 MiB

1 | 656135
2 | 655728
3 | 649054
4 | 628033
5 | 605766
Other values (508461) | 23122873

Unique

Unique: 236792
Unique (%): 0.9%

Sample

1st row: 31
2nd row: 50
3rd row: 19
4th row: 37
5th row: 59

Common Values

Value | Count | Frequency (%)
1 | 656135 | 2.5%
2 | 655728 | 2.5%
3 | 649054 | 2.5%
4 | 628033 | 2.4%
5 | 605766 | 2.3%
6 | 584416 | 2.2%
7 | 559000 | 2.1%
8 | 547016 | 2.1%
9 | 517669 | 2.0%
10 | 505557 | 1.9%
Other values (508456) | 20409215 | 77.5%

saon
Categorical

HIGH CARDINALITY
MISSING

Distinct: 58202
Distinct (%): 1.9%
Missing: 23252525
Missing (%): 88.3%
Memory size: 200.8 MiB

FLAT 2 | 165778
FLAT 1 | 164781
FLAT 3 | 148061
FLAT 4 | 122764
FLAT 5 | 97065
Other values (58197) | 2370811

Unique

Unique: 33682
Unique (%): 1.1%

Sample

1st row: 28
2nd row: FLAT 21
3rd row: FLAT 7A
4th row: FLAT 1
5th row: FLAT 8

Common Values

Value | Count | Frequency (%)
FLAT 2 | 165778 | 0.6%
FLAT 1 | 164781 | 0.6%
FLAT 3 | 148061 | 0.6%
FLAT 4 | 122764 | 0.5%
FLAT 5 | 97065 | 0.4%
FLAT 6 | 82135 | 0.3%
2 | 79685 | 0.3%
1 | 78655 | 0.3%
FLAT 7 | 66209 | 0.3%
FLAT 8 | 59802 | 0.2%
Other values (58192) | 2004325 | 7.6%
(Missing) | 23252525 | 88.3%

street
Categorical

HIGH CARDINALITY
MISSING

Distinct: 320310
Distinct (%): 1.2%
Missing: 411893
Missing (%): 1.6%
Memory size: 200.8 MiB

HIGH STREET | 169595
STATION ROAD | 87514
LONDON ROAD | 60153
CHURCH ROAD | 49684
CHURCH STREET | 49125
Other values (320305) | 25493821

Unique

Unique: 16555
Unique (%): 0.1%

Sample

1st row: ALDRICH DRIVE
2nd row: HOWICK PARK
3rd row: BRICK KILN CLOSE
4th row: RAINSBROOK DRIVE
5th row: MERRY HILL

Common Values

Value | Count | Frequency (%)
HIGH STREET | 169595 | 0.6%
STATION ROAD | 87514 | 0.3%
LONDON ROAD | 60153 | 0.2%
CHURCH ROAD | 49684 | 0.2%
CHURCH STREET | 49125 | 0.2%
MAIN STREET | 48089 | 0.2%
PARK ROAD | 40175 | 0.2%
VICTORIA ROAD | 34925 | 0.1%
CHURCH LANE | 32130 | 0.1%
MAIN ROAD | 29955 | 0.1%
Other values (320300) | 25308547 | 96.2%
(Missing) | 411893 | 1.6%

locality
Categorical

HIGH CARDINALITY
MISSING

Distinct: 23716
Distinct (%): 0.1%
Missing: 8868564
Missing (%): 33.7%
Memory size: 200.8 MiB

LONDON | 899924
BIRMINGHAM | 111310
MANCHESTER | 100845
LIVERPOOL | 99506
LEEDS | 88968
Other values (23711) | 16152668

Unique

Unique: 836
Unique (%): < 0.1%

Sample

1st row: WILLEN
2nd row: SUNDERLAND
3rd row: COGGESHALL
4th row: SHIRLEY
5th row: BRIERLEY HILL

Common Values

Value | Count | Frequency (%)
LONDON | 899924 | 3.4%
BIRMINGHAM | 111310 | 0.4%
MANCHESTER | 100845 | 0.4%
LIVERPOOL | 99506 | 0.4%
LEEDS | 88968 | 0.3%
BRISTOL | 88659 | 0.3%
SHEFFIELD | 76269 | 0.3%
BOURNEMOUTH | 60354 | 0.2%
SOUTHAMPTON | 56612 | 0.2%
PLYMOUTH | 56250 | 0.2%
Other values (23706) | 15814524 | 60.1%
(Missing) | 8868564 | 33.7%

town_city
Categorical

HIGH CARDINALITY

Distinct: 1171
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

LONDON | 2031341
MANCHESTER | 431016
BRISTOL | 404300
BIRMINGHAM | 386956
NOTTINGHAM | 344685
Other values (1166) | 22723487

Unique

Unique: 2
Unique (%): < 0.1%

Sample

1st row: MILTON KEYNES
2nd row: SUNDERLAND
3rd row: COLCHESTER
4th row: SOLIHULL
5th row: BRIERLEY HILL

Common Values

Value | Count | Frequency (%)
LONDON | 2031341 | 7.7%
MANCHESTER | 431016 | 1.6%
BRISTOL | 404300 | 1.5%
BIRMINGHAM | 386956 | 1.5%
NOTTINGHAM | 344685 | 1.3%
LEEDS | 296653 | 1.1%
LIVERPOOL | 272491 | 1.0%
SHEFFIELD | 250601 | 1.0%
LEICESTER | 231450 | 0.9%
SOUTHAMPTON | 213759 | 0.8%
Other values (1161) | 21458533 | 81.5%

district
Categorical

HIGH CARDINALITY

Distinct: 463
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

BIRMINGHAM | 387632
LEEDS | 351319
BRADFORD | 230830
SHEFFIELD | 213316
MANCHESTER | 210509
Other values (458) | 24928179

Unique

Unique: 1
Unique (%): < 0.1%

Sample

1st row: MILTON KEYNES
2nd row: SUNDERLAND
3rd row: BRAINTREE
4th row: SOLIHULL
5th row: DUDLEY

Common Values

Value | Count | Frequency (%)
BIRMINGHAM | 387632 | 1.5%
LEEDS | 351319 | 1.3%
BRADFORD | 230830 | 0.9%
SHEFFIELD | 213316 | 0.8%
MANCHESTER | 210509 | 0.8%
CITY OF BRISTOL | 205163 | 0.8%
LIVERPOOL | 183632 | 0.7%
KIRKLEES | 178730 | 0.7%
WANDSWORTH | 177641 | 0.7%
EAST RIDING OF YORKSHIRE | 174522 | 0.7%
Other values (453) | 24008491 | 91.2%

county
Categorical

HIGH CARDINALITY

Distinct: 130
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

GREATER LONDON | 3406054
GREATER MANCHESTER | 1165840
WEST MIDLANDS | 1005283
WEST YORKSHIRE | 1000437
KENT | 747469
Other values (125) | 18996702

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: MILTON KEYNES
2nd row: TYNE AND WEAR
3rd row: ESSEX
4th row: WEST MIDLANDS
5th row: WEST MIDLANDS

Common Values

Value | Count | Frequency (%)
GREATER LONDON | 3406054 | 12.9%
GREATER MANCHESTER | 1165840 | 4.4%
WEST MIDLANDS | 1005283 | 3.8%
WEST YORKSHIRE | 1000437 | 3.8%
KENT | 747469 | 2.8%
ESSEX | 732466 | 2.8%
HAMPSHIRE | 691504 | 2.6%
SURREY | 593744 | 2.3%
LANCASHIRE | 593351 | 2.3%
HERTFORDSHIRE | 559718 | 2.1%
Other values (120) | 15825919 | 60.1%

ppd_category
Categorical

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

A | 25365945
B | 955840

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: A
2nd row: A
3rd row: A
4th row: A
5th row: A

Common Values

Value | Count | Frequency (%)
A | 25365945 | 96.4%
B | 955840 | 3.6%

record_status
Categorical

CONSTANT
REJECTED

Distinct: 1
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 200.8 MiB

A | 26321785

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: A
2nd row: A
3rd row: A
4th row: A
5th row: A

Common Values

Value | Count | Frequency (%)
A | 26321785 | 100.0%
\ No newline at end of file
diff --git a/exploration/report.py b/exploration/report.py
index ff6fac8..ed81b47 100644
--- a/exploration/report.py
+++ b/exploration/report.py
@@ -6,7 +6,27 @@ from pandas_profiling import ProfileReport
 
 def main():
     with resources.path("analyse_properties.data", "pp-complete.csv") as csv_file:
-        df_report = pd.read_csv(csv_file)
+        df_report = pd.read_csv(
+            csv_file,
+            names=[
+                "transaction_id",
+                "price",
+                "date_of_transfer",
+                "postcode",
+                "property_type",
+                "old_new",
+                "duration",
+                "paon",
+                "saon",
+                "street",
+                "locality",
+                "town_city",
+                "district",
+                "county",
+                "ppd_category",
+                "record_status",
+            ],
+        )
         profile = ProfileReport(df_report, title="Price Paid Data", minimal=True)
         profile.to_file("price_paid_data_report.html")
 
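
The change above supplies column names explicitly, presumably because pp-complete.csv has no header row; without `names`, `pd.read_csv` would by default take the first data row as the header. A dependency-free sketch of the same idea, assigning names positionally to a headerless row, is shown below using the standard library's `csv.DictReader`. The sample row is hypothetical and made up for illustration; it is not taken from the dataset.

```python
import csv
import io

# Column names as listed in the diff above.
COLUMNS = [
    "transaction_id", "price", "date_of_transfer", "postcode",
    "property_type", "old_new", "duration", "paon", "saon", "street",
    "locality", "town_city", "district", "county", "ppd_category",
    "record_status",
]

# Hypothetical headerless row (illustrative values only).
raw = (
    '"{00000000-0000-0000-0000-000000000000}","100000","1995-01-01 00:00",'
    '"AB1 2CD","T","N","F","1","","EXAMPLE STREET","","EXAMPLE TOWN",'
    '"EXAMPLE DISTRICT","EXAMPLE COUNTY","A","A"\n'
)

# fieldnames= plays the role of names= in pd.read_csv: the first line is
# treated as data, and each field is labelled by its position.
reader = csv.DictReader(io.StringIO(raw), fieldnames=COLUMNS)
rows = list(reader)

print(rows[0]["price"], rows[0]["town_city"])
```

Note that values come back as strings here; empty fields such as saon are empty strings rather than missing values, whereas pandas would by default read them as NaN.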