chore: add dev docs
81
docs/scraping.md
Normal file
@@ -0,0 +1,81 @@
<https://en.wikipedia.org/wiki/List_of_sovereign_states>

## get links of countries from table of sovereign states

XPath to select data from the table

Country:

```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a
```

## scrapy

Pagination guide:
<https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>

Response:
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>

Using selectors:
<https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>

Download files/images:
<https://docs.scrapy.org/en/latest/topics/media-pipeline.html>

### new project

```
scrapy startproject wikipedia_country_scraper
```

### create spider

```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```

### using scrapy shell

- Install `ipython`:
```
poetry add ipython
```

- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```

- Run scrapy shell:
```
scrapy shell
```

- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```

- Print the response:
```
response
```

- Extract data using XPath:
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```

- Extract the data:
```
countries[0].get()
```
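
The `@href` values returned by `.get()` are likely to be site-relative paths such as `/wiki/Afghanistan` (an assumption about the page markup, not captured shell output), so they usually need to be resolved against the page URL before they can be requested. A minimal sketch using only the standard library; `absolute_links` and the sample hrefs are hypothetical:

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from.
PAGE_URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"


def absolute_links(hrefs, base_url=PAGE_URL):
    """Resolve each (possibly relative) href against the page URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Sample values standing in for `countries.getall()`; not real shell output.
sample_hrefs = ["/wiki/Afghanistan", "/wiki/Albania"]
print(absolute_links(sample_hrefs))
# ['https://en.wikipedia.org/wiki/Afghanistan', 'https://en.wikipedia.org/wiki/Albania']
```

Inside the shell or a spider, `response.urljoin(href)` performs the same resolution against the response URL.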

### Spider

[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.

By default, if a request does not specify a callback, the spider's `parse()` method is called with its response.