chore: add dev docs
docs/scraping.md (new file)
<https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get links of countries from the table of sovereign states

XPath to select data from the table.

Country:

```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```
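To sanity-check this XPath outside a spider, a small script along these lines should work. This assumes the `requests` and `parsel` packages are installed (`parsel` is the selector library Scrapy uses internally), and the Wikipedia markup may change over time, so the selector may need adjusting:

```
# quick check of the XPath outside Scrapy
import requests
from parsel import Selector

html = requests.get("https://en.wikipedia.org/wiki/List_of_sovereign_states").text
sel = Selector(text=html)

links = sel.xpath(
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
).getall()
print(links[:5])  # first few relative hrefs
```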
## scrapy

- Pagination guide: <https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
- Response: <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
- Using selectors: <https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
- Download files/images: <https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
### new project

```
scrapy startproject wikipedia_country_scraper
```
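This generates a project skeleton roughly like the following (standard Scrapy template):

```
wikipedia_country_scraper/
├── scrapy.cfg
└── wikipedia_country_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```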
### create spider

```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
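The generated spider (in `wikipedia_country_scraper/spiders/countrydownloader.py`) should look roughly like this; depending on the Scrapy version, `allowed_domains` and `start_urls` are derived from the URL passed to `genspider`, or `start_urls` may contain just the domain:

```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        pass
```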
### using scrapy shell

- Install `ipython`:

```
poetry add ipython
```

- Add to `scrapy.cfg` under `[settings]`:

```
shell = ipython
```

- Run the scrapy shell:

```
scrapy shell
```

- Fetch a URL:

```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```

- Inspect the response object:

```
response
```

- Select the country links using XPath:

```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```

- Extract the value from the first selector:

```
countries[0].get()
```
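`.getall()` returns every match at once, and `response.urljoin()` turns the relative hrefs into absolute URLs:

```
countries.getall()                      # all relative hrefs
response.urljoin(countries[0].get())    # absolute URL of the first link
```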
### Spider

[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.

By default, if no callback is specified on a Request, the response is passed to `parse()`.
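Putting the pieces together, a `parse()` for this spider could look like the sketch below. The XPath is the one from above; `parse_country` and the yielded fields are illustrative placeholders, not part of the generated template:

```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        # hrefs of the country articles linked from the table
        hrefs = response.xpath(
            "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
            "/tbody/tr[not(contains(@style, 'background'))]"
            "/td[1 and contains(@style, 'vertical-align:top;')]"
            "/b/a/@href"
        ).getall()
        for href in hrefs:
            # response.follow() resolves the relative href and yields a new Request
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # placeholder callback: record the visited URL and page title
        yield {
            "url": response.url,
            "title": response.xpath("normalize-space(//h1)").get(),
        }
```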