## get links of countries from table of sovereign states

XPath to select data from the table

Country:

```
//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a
```

## scrapy

Relevant topics in the Scrapy documentation:

- Pagination guide
- Response
- Using selectors
- Download files/images

### new project

```
scrapy startproject wikipedia_country_scraper
```

### create spider

```
scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
```

### using scrapy shell

- Install `ipython`:

  ```
  poetry add ipython
  ```

- Add to `scrapy.cfg` under `[settings]`:

  ```
  shell = ipython
  ```

- Run scrapy shell:

  ```
  scrapy shell
  ```

- Fetch a URL:

  ```
  fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
  ```

- Print the response:

  ```
  response
  ```

- Select the country links using XPath (this returns a list of selectors):

  ```
  countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a/@href")
  countries[0]
  ```

- Extract the data from a selector as a string:

  ```
  countries[0].get()
  ```

### Spider

[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`. If a request does not specify a callback, Scrapy calls the spider's `parse()` method with the response.