<https://en.wikipedia.org/wiki/List_of_sovereign_states>

## get links of countries from table of sovereign states

XPath to select the country links from the table.

Country:

```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a
```

## scrapy

Pagination guide:
<https://thepythonscrapyplaybook.com/scrapy-pagination-guide/>

Response:
<https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response>

Using selectors:
<https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors>

Download files/images:
<https://docs.scrapy.org/en/latest/topics/media-pipeline.html>

Exporting JSON:
<https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS>

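For the JSON export linked above, a minimal sketch of a `FEEDS` entry in the project's `settings.py` could look like the following. The output file name `countries.json` is a hypothetical choice, not something taken from these notes.

```python
# settings.py (fragment)
# Assumption: "countries.json" is an illustrative output path.
FEEDS = {
    "countries.json": {
        "format": "json",     # use the built-in JSON exporter
        "encoding": "utf8",   # write non-ASCII country names as-is
        "indent": 2,          # pretty-print the output
        "overwrite": True,    # replace the file on each run instead of appending
    },
}
```

Something similar can usually be done from the command line with `scrapy crawl countrydownloader -O countries.json`.
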
### new project

```
scrapy startproject wikipedia_country_scraper
```

### create spider

```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```

### using scrapy shell

- Install `ipython`:
```
poetry add ipython
```

- Add to `scrapy.cfg` under `[settings]`:
```
shell = ipython
```

- Run scrapy shell:
```
scrapy shell
```

- Fetch a URL:
```
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
```

- Print the response:
```
response
```

- Select the `href` attributes using XPath (this returns a list of selectors, not strings):
```
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
```

- Extract the data as strings (`.get()` for a single selector, `.getall()` for the whole list):
```
countries[0].get()
countries.getall()
```

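The `href` values extracted this way are relative paths such as `/wiki/Afghanistan`, so they need to be joined with the page URL before they can be requested. The sketch below shows the idea with the standard library only; the helper name `to_absolute` and the sample paths are made up for illustration. Inside a spider, `response.urljoin(href)` does the same job using the response's own URL.

```python
from urllib.parse import urljoin

# URL of the page the relative links were scraped from.
BASE_URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"


def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs against the page they were found on."""
    return [urljoin(base_url, href) for href in hrefs]


# Sample values standing in for the output of .getall() (illustrative only).
sample_hrefs = ["/wiki/Afghanistan", "/wiki/Albania"]
absolute_urls = to_absolute(sample_hrefs)
print(absolute_urls)
```
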
### Spider

[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.

By default, if a request does not specify a callback, the spider's `parse()` method is called with the response.