geography-anki/docs/scraping.md
2022-06-22 20:38:28 +01:00

<https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get links of countries from table of sovereign states
XPath to select the data from the table.
Country:
```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[position()=1 and contains(@style, 'vertical-align:top;')]/b/a
```
## scrapy
Pagination guide:
<https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
Response:
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
Using selectors:
<https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
Download files/images:
<https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
### new project
```
scrapy startproject wikipedia_country_scraper
```
### create spider
```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
### using scrapy shell
- Install `ipython`:
```
poetry add ipython
```
- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```
- Run scrapy shell:
```
scrapy shell
```
- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```
- Print the response:
```
response
```
- Select the country links with XPath:
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[position()=1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```
- Get the matched value as a string:
```
countries[0].get()
```
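The `href` values on Wikipedia are typically relative (for example `/wiki/Afghanistan`), so they need to be resolved against the page URL before they can be requested. The following is a rough sketch using only the standard library; `absolute_links` and the sample list are hypothetical names and data, not part of the project. Inside a spider, `response.urljoin(href)` does the same resolution.

```
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Sample values standing in for `countries.getall()` from the shell session.
sample = ["/wiki/Afghanistan", "/wiki/Albania", "/wiki/Afghanistan"]
links = absolute_links(sample)
print(links)
```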
### Spider
[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.
By default, if no callback is specified in a request, the spider's `parse()` method is called with the response.