https://en.wikipedia.org/wiki/List_of_sovereign_states
Goal: get the links to the country pages from the table of sovereign states.
XPath to select data from the table
Country:
`//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a`
## scrapy
- Pagination guide: https://thepythonscrapyplaybook.com/scrapy-pagination-guide/
- Response: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response
- Using selectors: https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors
- Download files/images: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
- Exporting JSON: https://docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEEDS
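
For the JSON export linked above, a minimal settings sketch could look like the following. The output file name `countries.json` is a placeholder and does not come from these notes; `FEEDS` with the `format`, `encoding`, and `overwrite` options is part of Scrapy's feed export settings in recent versions.

```python
# settings.py (sketch): write every scraped item to a JSON file.
FEEDS = {
    "countries.json": {      # hypothetical output path, pick your own
        "format": "json",    # serialize items as one JSON array
        "encoding": "utf8",  # keep non-ASCII country names readable
        "overwrite": True,   # replace the file on each run instead of appending
    },
}
```

With this in the project's `settings.py`, a normal `scrapy crawl` run should write the items to that file without extra command-line options.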
### new project
`scrapy startproject wikipedia_country_scraper`
### create spider
`scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states`
### using scrapy shell
- Install `ipython`: `poetry add ipython`
- Add to `scrapy.cfg` under `[settings]`: `shell = ipython`
- Run scrapy shell: `scrapy shell`
- Fetch a URL: `fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")`
- Print the response: `response`
- Select the country links using XPath: `countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a/@href")`
- Inspect the first match, which is still a selector object: `countries[0]`
- Extract the matched value as a string: `countries[0].get()`
### Spider
`start_requests()` generates a `Request` for each URL in `start_urls`.
By default, if a `Request` does not specify a callback, its response is passed to `parse()`.