Manipulate Scrapy Start_Urls before a Request is Made
by Saahyl
In this blog post I am going to discuss two very useful methods provided by the Scrapy framework:
1. start_requests()
2. make_requests_from_url(url)
Later we will see how to use these two methods together to manipulate start_urls: extract useful information from the URLs, reformat them into a valid format, and generate requests that are parsed by the default parse() method.
- start_requests(): This method must return an iterable with the first Requests to crawl for this spider. This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is called only once by Scrapy, so it’s safe to implement it as a generator.
- make_requests_from_url(url): A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct initial requests in the start_requests() method, and is typically used to convert URLs to requests. Unless overridden, this method returns Requests with the parse() method as their callback function.
By default, Scrapy makes requests from start_urls, a list of URLs for the web pages to be scraped. Consider a use case where start_urls is initially empty and is populated later, dynamically, from a Redis database or a text file. The URLs added to start_urls are not in the correct format, so you need to format them before making requests from them.
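What counts as the "correct format" depends on your data. As a minimal sketch, assuming the stored entries may carry stray whitespace or lack a scheme, a formatting helper could look like the one below. The name reformat_url, the default_scheme parameter, and the choice to return None for unusable entries are my assumptions for illustration, not anything Scrapy prescribes.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def reformat_url(raw_url, default_scheme="http"):
    """Return a cleaned-up URL, or None if the entry has no host."""
    url = raw_url.strip()  # drop the trailing newline and stray spaces
    if not url:
        return None
    if url.startswith("//"):
        # Scheme-relative entry: only the scheme is missing.
        url = f"{default_scheme}:{url}"
    elif not _SCHEME_RE.match(url):
        # Bare "example.com/page" style entry.
        url = f"{default_scheme}://{url}"
    if not urlsplit(url).netloc:
        return None
    return url
```

Returning None lets the caller skip bad entries instead of asking Scrapy to fetch them.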
How are you going to do this once the spider has started?
This is where the two methods mentioned above come to your rescue. Let’s see how:
    from scrapy import Spider


    def reformat_url(url):
        '''Manipulate url to structure to proper format or extract some information from it'''
        return url


    class MySpider(Spider):
        name = "spidey"
        start_urls = []
        allowed_domains = ["example.com"]

        # Urls dynamically appended to a text file from where we add them to start urls.
        # strip() removes the trailing newline that each line read from the file carries.
        with open("/path_to/urls.txt") as url_file:
            for url in url_file:
                start_urls.append(url.strip())

        def start_requests(self):
            for url in self.start_urls:
                # call function to manipulate url
                new_url = reformat_url(url)
                yield self.make_requests_from_url(new_url)

        def parse(self, response):
            '''Parse url response page.'''
            print(response.url)
In the above example, I have used a text file to fill start_urls, but the URLs could just as well be loaded into start_urls from a Redis database.
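If the text file is large, you may not want to load it into a list up front. Here is a rough sketch of a helper that reads it lazily and skips blank lines. The name iter_urls and the UTF-8 encoding are my assumptions; the helper is plain Python and does not depend on Scrapy.

```python
def iter_urls(path):
    """Yield each non-empty line of the file at `path`, stripped of whitespace."""
    with open(path, encoding="utf-8") as url_file:
        for line in url_file:
            line = line.strip()
            if line:  # skip blank lines
                yield line
```

Because it is a generator, you can loop over iter_urls("/path_to/urls.txt") inside start_requests() and the file is read one line at a time.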
We override start_requests() here so that Scrapy does not build its initial requests directly from the raw start_urls list. start_requests() is the first method called when the spider opens, so we can use it to call any user-defined function before generating the initial requests; here I call reformat_url(). We then use the second method by yielding its result for each newly created URL, i.e. yield self.make_requests_from_url(new_url).
One point to remember: start_requests() should always yield its requests or return an iterable of them, since Scrapy calls it only once, when the spider starts.