<https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get country links from the table of sovereign states
XPath to select data from the table.
Country:
```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```
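As a quick sanity check outside a full crawl, the expression can be run against a saved copy of the page with Scrapy's `Selector`. A minimal sketch; the local file name is a placeholder and the example href in the comment is only illustrative:
```
# Minimal sketch: run the country XPath against a locally saved copy of the
# article. "sovereign_states.html" is a placeholder file name.
from scrapy.selector import Selector

with open("sovereign_states.html", encoding="utf-8") as f:
    sel = Selector(text=f.read())

hrefs = sel.xpath(
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
).getall()

print(hrefs[:5])  # relative links such as /wiki/Afghanistan
```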
## scrapy
Pagination guide:
<https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
Response:
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
Using selectors:
<https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
Download files/images:
<https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
Setting pipelines per spider:
<https://stackoverflow.com/a/34647090>
Exporting JSON:
<https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS>
Setting exports per spider:
<https://stackoverflow.com/a/53322959>
Processing using item loaders + pipelines:
<https://thepythonscrapyplaybook.com/scrapy-beginners-guide-cleaning-data/#pre-processing-data-with-scrapy-item-loaders>
<https://docs.scrapy.org/en/latest/topics/loaders.html>
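Combining the per-spider pipeline and export links above: both can be set from a spider's `custom_settings` dict. A minimal sketch, assuming a `CountryPipeline` class exists in `pipelines.py` (the class name and output file are placeholders):
```
import scrapy


class CountryDownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    # custom_settings overrides the project settings for this spider only.
    custom_settings = {
        # Placeholder pipeline path; point it at a real class in pipelines.py.
        "ITEM_PIPELINES": {
            "wikipedia_country_scraper.pipelines.CountryPipeline": 300,
        },
        # FEEDS exports every yielded item to this JSON file.
        "FEEDS": {
            "countries.json": {"format": "json", "overwrite": True},
        },
    }

    def parse(self, response):
        yield {"url": response.url}
```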
### new project
```
scrapy startproject wikipedia_country_scraper
```
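This creates roughly the standard Scrapy project layout (exact contents can vary by version):
```
wikipedia_country_scraper/
    scrapy.cfg
    wikipedia_country_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```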
### create spider
```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
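This drops a spider skeleton into `spiders/countrydownloader.py`, roughly as below (the exact template varies with the Scrapy version, and older versions expect a bare domain rather than a full URL):
```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        pass
```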
### using scrapy shell
- Install `ipython`:
```
poetry add ipython
```
- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```
- Run scrapy shell:
```
scrapy shell
```
- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```
- Print the response:
```
response
```
- Extract data using xpath (this returns a `SelectorList`):
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```
- Extract the data:
```
countries[0].get()
```
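- Or extract all the hrefs at once; `.getall()` on the `SelectorList` returns plain strings, so no further `.get()` step is needed:
```
countries.getall()
```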
### Spider
[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.
By default, if a request does not specify a callback, its response is passed to the spider's `parse()` method.
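Putting the pieces above together, `parse()` can pull the country hrefs with the table XPath and follow each one. A minimal sketch; `parse_country` and the yielded fields are placeholders, and only the XPath and start URL come from these notes:
```
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        # Responses for start_urls land here because no other callback was set.
        hrefs = response.xpath(
            "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
            "/tbody/tr[not(contains(@style, 'background'))]"
            "/td[1 and contains(@style, 'vertical-align:top;')]"
            "/b/a/@href"
        ).getall()
        for href in hrefs:
            # response.follow resolves the relative /wiki/... links.
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # Placeholder callback: record which country page was fetched.
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
```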