Source page: <https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get country links from the table of sovereign states
XPath to select the country links from the table:
```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```
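Sanity check of the XPath outside Scrapy, using `parsel` (the selector library Scrapy uses internally) plus `requests`; a rough sketch, not part of the project code:
```
# Standalone check of the country-link XPath (assumes requests + parsel are installed).
import requests
from parsel import Selector

URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"
XPATH = (
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
)

html = requests.get(URL, headers={"User-Agent": "geography-anki xpath test"}).text
hrefs = Selector(text=html).xpath(XPATH).getall()
print(len(hrefs), hrefs[:3])  # relative links such as /wiki/Afghanistan
```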
## scrapy
- Pagination guide: <https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
- Response: <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
- Using selectors: <https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
- Download files/images: <https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
- Setting pipelines per spider: <https://stackoverflow.com/a/34647090> (see the `custom_settings` sketch after this list)
- Exporting JSON: <https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS>
- Setting exports per spider: <https://stackoverflow.com/a/53322959>
- Processing using item loaders + pipelines:
  - <https://thepythonscrapyplaybook.com/scrapy-beginners-guide-cleaning-data/#pre-processing-data-with-scrapy-item-loaders>
  - <https://docs.scrapy.org/en/latest/topics/loaders.html>
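The two "per spider" answers above come down to the spider's `custom_settings` attribute; a minimal sketch (the `CountryPipeline` class name is hypothetical):
```
# Per-spider overrides for pipelines and feed exports via custom_settings.
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"

    # Applies only to this spider, overriding the project's settings.py.
    custom_settings = {
        "ITEM_PIPELINES": {
            # Hypothetical pipeline class in wikipedia_country_scraper/pipelines.py.
            "wikipedia_country_scraper.pipelines.CountryPipeline": 300,
        },
        "FEEDS": {
            "countries.json": {"format": "json", "overwrite": True},
        },
    }
```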
### new project
```
scrapy startproject wikipedia_country_scraper
```
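This generates the standard Scrapy layout from the default template (shown for orientation):
```
wikipedia_country_scraper/
├── scrapy.cfg
└── wikipedia_country_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```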
### create spider
```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
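This creates `spiders/countrydownloader.py` from the default template, roughly the skeleton below (exact output depends on the Scrapy version; older versions expect a bare domain like `en.wikipedia.org` instead of a full URL):
```
# Roughly what the default genspider template produces (version-dependent).
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        pass
```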
### using scrapy shell
- Install `ipython`:
```
poetry add ipython
```
- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```
- Run scrapy shell:
```
scrapy shell
```
- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```
- Print the response:
```
response
```
- Select the country links using xpath:
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```
- Extract the data from the first selector (`countries.getall()` returns every href at once):
```
countries[0].get()
```
### Spider
[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.
By default, if a request specifies no callback, the response is handled by the spider's `parse()` method.
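Putting the shell experiment into the spider: a sketch of a `parse()` that follows each country link and hands the country pages to a second callback (`parse_country` and the yielded fields are illustrative, not the project's final spider):
```
import scrapy

COUNTRY_LINKS_XPATH = (
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
)


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        # Follow every country link found by the table XPath.
        for href in response.xpath(COUNTRY_LINKS_XPATH).getall():
            # response.follow resolves relative links such as /wiki/... and
            # sends the country page to parse_country instead of parse().
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # Illustrative item: just the page URL and title for now.
        yield {
            "url": response.url,
            "title": response.xpath("normalize-space(//h1)").get(),
        }
```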