chore: add dev docs
docs/scraping.md (new file)
<https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get links of countries from the table of sovereign states

XPath to select data from the table.

Country:

```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```
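To sanity-check this XPath outside a spider, a small script along these lines should work. This assumes the `requests` and `parsel` packages are installed (`parsel` is the selector library Scrapy uses internally), and the Wikipedia markup may change over time, so the selector may need adjusting:

```
# quick check of the XPath outside Scrapy
import requests
from parsel import Selector

html = requests.get("https://en.wikipedia.org/wiki/List_of_sovereign_states").text
sel = Selector(text=html)

links = sel.xpath(
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
).getall()
print(links[:5])  # first few relative hrefs
```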
## scrapy

- Pagination guide: <https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
- Response: <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
- Using selectors: <https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
- Download files/images: <https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
### new project

```
scrapy startproject wikipedia_country_scraper
```
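This generates a project skeleton roughly like the following (standard Scrapy template):

```
wikipedia_country_scraper/
├── scrapy.cfg
└── wikipedia_country_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```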
### create spider

```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
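The generated spider (in `wikipedia_country_scraper/spiders/countrydownloader.py`) should look roughly like this; depending on the Scrapy version, `allowed_domains` and `start_urls` are derived from the URL passed to `genspider`, or `start_urls` may contain just the domain:

```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        pass
```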
### using scrapy shell

- Install `ipython`:

```
poetry add ipython
```

- Add to `scrapy.cfg` under `[settings]`:

```
shell = ipython
```

- Run the scrapy shell:

```
scrapy shell
```

- Fetch a URL:

```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```

- Inspect the response object:

```
response
```

- Select the country links using XPath:

```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```

- Extract the value from the first selector:

```
countries[0].get()
```
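`.getall()` returns every match at once, and `response.urljoin()` turns the relative hrefs into absolute URLs:

```
countries.getall()                      # all relative hrefs
response.urljoin(countries[0].get())    # absolute URL of the first link
```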
### Spider

[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.

By default, if no callback is specified on a Request, the response is passed to `parse()`.
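Putting the pieces together, a `parse()` for this spider could look like the sketch below. The XPath is the one from above; `parse_country` and the yielded fields are illustrative placeholders, not part of the generated template:

```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        # hrefs of the country articles linked from the table
        hrefs = response.xpath(
            "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
            "/tbody/tr[not(contains(@style, 'background'))]"
            "/td[1 and contains(@style, 'vertical-align:top;')]"
            "/b/a/@href"
        ).getall()
        for href in hrefs:
            # response.follow() resolves the relative href and yields a new Request
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # placeholder callback: record the visited URL and page title
        yield {
            "url": response.url,
            "title": response.xpath("normalize-space(//h1)").get(),
        }
```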