diff --git a/docs/scraping.md b/docs/scraping.md
new file mode 100644
index 0000000..0a5320f
--- /dev/null
+++ b/docs/scraping.md
@@ -0,0 +1,81 @@
+
+
+## get links of countries from table of sovereign states
+
+XPath to select data from the table
+
+Country:
+
+```
+//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a
+```
+
+## scrapy
+
+Pagination guide:
+
+
+Response:
+
+
+Using selectors:
+
+
+Download files/images:
+
+
+### new project
+
+```
+scrapy startproject wikipedia_country_scraper
+```
+
+### create spider
+
+```
+scrapy genspider countrydownloader https://en.wikipedia.org/wiki/List_of_sovereign_states
+```
+
+### using scrapy shell
+
+- Install `ipython`:
+```
+poetry add ipython
+```
+
+- Add to `scrapy.cfg` under `[settings]`:
+```
+shell = ipython
+```
+
+- Run scrapy shell:
+```
+scrapy shell
+```
+
+- Fetch a URL:
+```
+fetch("https://en.wikipedia.org/wiki/List_of_sovereign_states")
+```
+
+- Print the response:
+```
+response
+```
+
+- Select nodes using XPath:
+```
+countries = response.xpath("//table[contains(@class, 'sortable') and contains(@class, 'wikitable')]/tbody/tr[not(contains(@style, 'background'))]/td[1][contains(@style, 'vertical-align:top;')]/b/a/@href")
+countries[0]
+```
+
+- Extract the matched value:
+```
+countries[0].get()
+```
+
+### Spider
+
+[start_requests](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.Spider.start_requests) generates a [Request](https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html#Request) for each URL in `start_urls`.
+
+By default, if a request does not specify a callback, the spider's `parse()` method is called to handle its response.
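
The default-callback rule in the Spider notes can be pictured with a small toy model. This is only a sketch in plain Python and not Scrapy's actual implementation: `ToyRequest`, `ToySpider`, `dispatch`, `parse_country`, and the second URL are hypothetical names introduced for illustration, and no network request is made.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a Scrapy Request: a URL plus an optional callback."""
    url: str
    callback: Optional[Callable[[str], dict]] = None


class ToySpider:
    """Hypothetical stand-in for a Scrapy spider; not Scrapy's real code."""

    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def start_requests(self) -> Iterator[ToyRequest]:
        # One request per start URL, created without an explicit callback.
        for url in self.start_urls:
            yield ToyRequest(url)

    def parse(self, response: str) -> dict:
        # Default callback: handles responses of requests that name no callback.
        return {"handled_by": "parse", "response": response}

    def parse_country(self, response: str) -> dict:
        # Custom callback: used only when a request names it explicitly.
        return {"handled_by": "parse_country", "response": response}


def dispatch(spider: ToySpider, request: ToyRequest) -> dict:
    """Pretend to download the request, then pass the response to its callback."""
    fake_response = f"<response for {request.url}>"  # placeholder, no download happens
    callback = request.callback or spider.parse  # fall back to parse()
    return callback(fake_response)


spider = ToySpider()
default_results = [dispatch(spider, req) for req in spider.start_requests()]
custom_result = dispatch(
    spider,
    # Hypothetical follow-up URL, chosen only to show an explicit callback.
    ToyRequest("https://en.wikipedia.org/wiki/France", callback=spider.parse_country),
)

print(default_results[0]["handled_by"])  # prints "parse"
print(custom_result["handled_by"])       # prints "parse_country"
```

In a real spider the same idea would presumably look like yielding a request with `callback=self.parse_country` from inside `parse()`, so the list page and the per-country pages are handled by separate methods.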