geography-anki/docs/scraping.md at 3cb4b4ba463bbdf62d542fe575c72916b7225db0

dtomlinson/geography-anki

Daniel Tomlinson 0886223f17 chore: add dev docs

2022-06-22 20:38:28 +01:00

https://en.wikipedia.org/wiki/List_of_sovereign_states

## get links of countries from table of sovereign states

XPath to select data from the table:

Country:

//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a

## scrapy

Pagination guide: https://thepythonscrapyplaybook.com/scrapy-pagination-guide/

Response: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response

Using selectors: https://docs.scrapy.org/en/latest/topics/selectors.html?highlight=xpath#using-selectors

Download files/images: https://docs.scrapy.org/en/latest/topics/media-pipeline.html

### new project

scrapy startproject wikipedia_country_scraper

### create spider

scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states

### using scrapy shell

Install IPython:

poetry add ipython

Add to scrapy.cfg under [settings]:

shell = ipython
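For orientation, the `[settings]` section might look like the string below after the change. This assumes the project name used with `scrapy startproject` above; the `default` line is the one `startproject` generates. The snippet parses the text with `configparser` only to confirm the section and key names, and the `SCRAPY_CFG` variable is purely illustrative.

```python
import configparser

# Assumed contents of scrapy.cfg after adding the shell setting.
SCRAPY_CFG = """\
[settings]
default = wikipedia_country_scraper.settings
shell = ipython
"""

config = configparser.ConfigParser()
config.read_string(SCRAPY_CFG)

# The key lives in the [settings] section, next to the settings module.
shell_setting = config.get("settings", "shell")
print(shell_setting)  # ipython
```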

Run scrapy shell:

scrapy shell

Fetch a URL:

fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")

Print the response:

response

Select the country links using XPath (this returns a list of selectors):

countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a/@href")
countries[0]

Extract the matched value as a string:

countries[0].get()

### Spider

start_requests() generates a Request for each URL in start_urls.

By default, if no callback is specified on a Request, parse() is called with the downloaded response.
