Source page: <https://en.wikipedia.org/wiki/List_of_sovereign_states>
## get country links from the table of sovereign states
XPath to select the country links from the table:
```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```
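Sanity check of the XPath outside Scrapy, using `parsel` (the selector library Scrapy uses internally) plus `requests`; a rough sketch, not part of the project code:
```
# Standalone check of the country-link XPath (assumes requests + parsel are installed).
import requests
from parsel import Selector

URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"
XPATH = (
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
)

html = requests.get(URL, headers={"User-Agent": "geography-anki xpath test"}).text
hrefs = Selector(text=html).xpath(XPATH).getall()
print(len(hrefs), hrefs[:3])  # relative links such as /wiki/Afghanistan
```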
## scrapy
- Pagination guide: <https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>
- Response: <https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>
- Using selectors: <https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>
- Download files/images: <https://docs.scrapy.org/en/latest/topics/media-pipeline.html>
- Setting pipelines per spider: <https://stackoverflow.com/a/34647090> (see the `custom_settings` sketch after this list)
- Exporting JSON: <https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS>
- Setting exports per spider: <https://stackoverflow.com/a/53322959>
- Processing using item loaders + pipelines:
  - <https://thepythonscrapyplaybook.com/scrapy-beginners-guide-cleaning-data/#pre-processing-data-with-scrapy-item-loaders>
  - <https://docs.scrapy.org/en/latest/topics/loaders.html>
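The two "per spider" answers above come down to the spider's `custom_settings` attribute; a minimal sketch (the `CountryPipeline` class name is hypothetical):
```
# Per-spider overrides for pipelines and feed exports via custom_settings.
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"

    # Applies only to this spider, overriding the project's settings.py.
    custom_settings = {
        "ITEM_PIPELINES": {
            # Hypothetical pipeline class in wikipedia_country_scraper/pipelines.py.
            "wikipedia_country_scraper.pipelines.CountryPipeline": 300,
        },
        "FEEDS": {
            "countries.json": {"format": "json", "overwrite": True},
        },
    }
```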
### new project
```
scrapy startproject wikipedia_country_scraper
```
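This generates the standard Scrapy layout from the default template (shown for orientation):
```
wikipedia_country_scraper/
├── scrapy.cfg
└── wikipedia_country_scraper/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```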
### create spider
```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```
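This creates `spiders/countrydownloader.py` from the default template, roughly the skeleton below (exact output depends on the Scrapy version; older versions expect a bare domain like `en.wikipedia.org` instead of a full URL):
```
# Roughly what the default genspider template produces (version-dependent).
import scrapy


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        pass
```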
### using scrapy shell
- Install `ipython`:
```
poetry add ipython
```
- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```
- Run scrapy shell:
```
scrapy shell
```
- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```
- Print the response:
```
response
```
- Select the country links using xpath:
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```
- Extract the data from the first selector (`countries.getall()` returns every href at once):
```
countries[0].get()
```
### Spider
[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.
By default, if a request specifies no callback, the response is handled by the spider's `parse()` method.
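Putting the shell experiment into the spider: a sketch of a `parse()` that follows each country link and hands the country pages to a second callback (`parse_country` and the yielded fields are illustrative, not the project's final spider):
```
import scrapy

COUNTRY_LINKS_XPATH = (
    "//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]"
    "/tbody/tr[not(contains(@style, 'background'))]"
    "/td[1 and contains(@style, 'vertical-align:top;')]"
    "/b/a/@href"
)


class CountrydownloaderSpider(scrapy.Spider):
    name = "countrydownloader"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def parse(self, response):
        # Follow every country link found by the table XPath.
        for href in response.xpath(COUNTRY_LINKS_XPATH).getall():
            # response.follow resolves relative links such as /wiki/... and
            # sends the country page to parse_country instead of parse().
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # Illustrative item: just the page URL and title for now.
        yield {
            "url": response.url,
            "title": response.xpath("normalize-space(//h1)").get(),
        }
```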