mirror of
https://github.com/dtomlinson91/street_group_tech_test
synced 2025-12-22 20:05:45 +00:00
1 line
8.1 KiB
JSON
{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"index.html","text":"Welcome \u00b6 Introduction \u00b6 This documentation accompanies the technical test for the Street Group. The following pages will guide the user through installing the requirements, and running the task to complete the test. In addition, there is some discussion around the approach, and any improvements that could be made. Navigate sections using the tabs at the top of the page. Pages in this section can be viewed in order by using the section links in the left menu, or by using bar at the bottom of the page. The table of contents in the right menu can be used to navigate sections on each page. Note All paths in this documentation, e.g ./analyse_properties/data/output refer to the location of the directory/file from the root of the repo.","title":"Welcome"},{"location":"index.html#welcome","text":"","title":"Welcome"},{"location":"index.html#introduction","text":"This documentation accompanies the technical test for the Street Group. The following pages will guide the user through installing the requirements, and running the task to complete the test. In addition, there is some discussion around the approach, and any improvements that could be made. Navigate sections using the tabs at the top of the page. Pages in this section can be viewed in order by using the section links in the left menu, or by using bar at the bottom of the page. The table of contents in the right menu can be used to navigate sections on each page. Note All paths in this documentation, e.g ./analyse_properties/data/output refer to the location of the directory/file from the root of the repo.","title":"Introduction"},{"location":"discussion/exploration.html","text":"Data Exploration Report \u00b6 A brief exploration was done on the full dataset using the module pandas-profiling . 
The module uses pandas to load a dataset and automatically produce quantile/descriptive statistics, common values, extreme values, skew, kurtosis etc. The script used to generate this report is located in ./exploration/report.py . The report can be viewed by clicking the Data Exploration Report tab at the top of the page.","title":"Data Exploration Report"},{"location":"discussion/exploration.html#data-exploration-report","text":"A brief exploration was done on the full dataset using the module pandas-profiling . The module uses pandas to load a dataset and automatically produce quantile/descriptive statistics, common values, extreme values, skew, kurtosis etc. The script used to generate this report is located in ./exploration/report.py . The report can be viewed by clicking the Data Exploration Report tab at the top of the page.","title":"Data Exploration Report"},{"location":"discussion/introduction.html","text":"Introduction \u00b6 This section will go through some discussion of the test including: Data exploration Cleaning the data Interpreting the results Deploying on GCP DataFlow Improvements","title":"Introduction"},{"location":"discussion/introduction.html#introduction","text":"This section will go through some discussion of the test including: Data exploration Cleaning the data Interpreting the results Deploying on GCP DataFlow Improvements","title":"Introduction"},{"location":"documentation/installation.html","text":"Installation \u00b6 The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested. The task has been tested on MacOS Big Sur and WSL2. The task should run on Windows but this wasn't tested. For Beam 2.32.0 the supported versions of the Python SDK can be found here . Poetry \u00b6 The test uses Poetry for dependency management. Info If you already have Poetry installed globally you can go straight to the poetry install step. 
In a virtual environment install poetry: pip install poetry From the root of the repo install the dependencies with: poetry install --no-dev","title":"Installation"},{"location":"documentation/installation.html#installation","text":"The task is written in Python 3.7.9 using Apache Beam 2.32.0. Python versions 3.6.14 and 3.8.11 should also be compatible but have not been tested. The task has been tested on MacOS Big Sur and WSL2. The task should run on Windows but this wasn't tested. For Beam 2.32.0 the supported versions of the Python SDK can be found here .","title":"Installation"},{"location":"documentation/installation.html#poetry","text":"The test uses Poetry for dependency management. Info If you already have Poetry installed globally you can go straight to the poetry install step. In a virtual environment install poetry: pip install poetry From the root of the repo install the dependencies with: poetry install --no-dev","title":"Poetry"},{"location":"documentation/usage.html","text":"Usage \u00b6 This page documents how to run the pipeline locally to complete the task for the dataset for 2020 . The pipeline also runs in GCP using DataFlow and is discussed further on but can be viewed here. We also discuss how to adapt the pipeline so it can run against the full dataset . Download dataset \u00b6 The input data by default should go in ./data/input . For convenience the data is available publicly in a GCP Cloud Storage bucket. Run: wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input to download the data for 2020 and place in the input directory above. Entrypoint \u00b6 The entrypoint to the pipeline is analyse-properties . Available options \u00b6 Running analyse-properties --help gives the following output: usage: analyse-properties [ -h ] [ --input INPUT ] [ --output OUTPUT ] optional arguments: -h, --help show this help message and exit --input INPUT Full path to the input file. 
--output OUTPUT Full path to the output file without extension. The default value for input is ./data/input/pp-2020.csv and the default value for output is ./data/output/pp-2020 . If passing in values for input / output these should be full paths to the files. The test will parse these inputs as a str() and pass this to beam . io . ReadFromText () . Run the pipeline \u00b6 To run the pipeline and complete the task run: analyse-properties --runner DirectRunner The pipeline will use the 2020 dataset located in ./data/input and output the resulting .json to ./data/output .","title":"Usage"},{"location":"documentation/usage.html#usage","text":"This page documents how to run the pipeline locally to complete the task for the dataset for 2020 . The pipeline also runs in GCP using DataFlow and is discussed further on but can be viewed here. We also discuss how to adapt the pipeline so it can run against the full dataset .","title":"Usage"},{"location":"documentation/usage.html#download-dataset","text":"The input data by default should go in ./data/input . For convenience the data is available publicly in a GCP Cloud Storage bucket. Run: wget https://storage.googleapis.com/street-group-technical-test-dmot-euw1/input/pp-2020.csv -P data/input to download the data for 2020 and place in the input directory above.","title":"Download dataset"},{"location":"documentation/usage.html#entrypoint","text":"The entrypoint to the pipeline is analyse-properties .","title":"Entrypoint"},{"location":"documentation/usage.html#available-options","text":"Running analyse-properties --help gives the following output: usage: analyse-properties [ -h ] [ --input INPUT ] [ --output OUTPUT ] optional arguments: -h, --help show this help message and exit --input INPUT Full path to the input file. --output OUTPUT Full path to the output file without extension. The default value for input is ./data/input/pp-2020.csv and the default value for output is ./data/output/pp-2020 . 
If passing in values for input / output these should be full paths to the files. The test will parse these inputs as a str() and pass this to beam . io . ReadFromText () .","title":"Available options"},{"location":"documentation/usage.html#run-the-pipeline","text":"To run the pipeline and complete the task run: analyse-properties --runner DirectRunner The pipeline will use the 2020 dataset located in ./data/input and output the resulting .json to ./data/output .","title":"Run the pipeline"}]}