geography-anki/docs/scraping.md
2022-06-22 20:38:28 +01:00

https://en.wikipedia.org/wiki/List_of_sovereign_states

XPath to select data from the table

Country:

//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[position() = 1 and contains(@style, 'vertical-align:top;')]/b/a

scrapy

Pagination guide: https://thepythonscrapyplaybook.com/scrapy-pagination-guide/

Response: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response

Using selectors: https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors

Download files/images: https://docs.scrapy.org/en/latest/topics/media-pipeline.html

### new project

scrapy startproject wikipedia_country_scraper

create spider

scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states

### using scrapy shell

  • Install ipython:
poetry add ipython
  • Add to scrapy.cfg under [settings]:
shell = ipython
  • Run scrapy shell:
scrapy shell
  • Fetch a URL:
fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
  • Print the response:
response
  • Select the country links using XPath:
countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[position() = 1 and contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]
  • Get the matched value as a string:
countries[0].get()
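The `@href` values returned by `.get()` are typically site-relative paths, so they need to be joined with the page URL before they can be requested. Inside the Scrapy shell, `response.urljoin(href)` does this against the response URL. The sketch below shows the same idea with the standard library only. The helper name `to_absolute` and the sample path `/wiki/Afghanistan` are illustrative assumptions, not values taken from an actual run.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/List_of_sovereign_states"


def to_absolute(href: str, base: str = PAGE_URL) -> str:
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(base, href)


# Illustrative value; the real first match depends on the live page.
absolute = to_absolute("/wiki/Afghanistan")
print(absolute)  # https://en.wikipedia.org/wiki/Afghanistan
```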

Spider

start_requests() generates a Request for each URL in start_urls.

By default, if a Request does not specify a callback, the spider's parse() method is called with the response.