SCRAPY with REDIS – A Distributed Approach
by Saahyl
What is Scrapy ?
Many of us wanting to scrape content from web pages should take a look at this highly comprehensive web scraping framework called Scrapy.
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
(This blog is not a tutorial for learning Scrapy; you can refer to the Scrapy docs for that.)
What is Redis ?
Redis is an open source (BSD licensed), in-memory data structure store, used as database, cache, and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs and geospatial indexes with radius queries.
It’s a “NoSQL” key-value data store. More precisely, it is a data structure server. The closest analogue is probably Memcached, but with built-in persistence. Persistence to disk means you can use Redis as a real database instead of just a volatile cache: the data won’t disappear when you restart, as it does with Memcached.
Like Memcached, the entire data set is stored in memory, so Redis is extremely fast, often even faster than Memcached. Redis used to have virtual memory, where rarely used values would be swapped out to disk so that only the keys had to fit into memory, but this feature has been deprecated. Going forward, the use cases for Redis are those where it is possible (and desirable) for the entire data set to fit in memory.
Redis is a fantastic choice if you want a highly scalable data store shared by multiple processes, multiple applications, or multiple servers.
How to co-ordinate Scrapy and Redis to obtain a distributed framework ?
To obtain this useful distributed architecture you would need your own servers (machines running a Redis-supported operating system), or you could use a paid server provider API such as Linode or DigitalOcean. With the help of these APIs you create as many servers as you need, based on the amount of data you have to scrape and the time available to scrape it. We also need a redis-server whose bind address has been changed to 0.0.0.0 so that it accepts connections from other IP addresses in addition to localhost (127.0.0.1).
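On a typical Linux install, the relevant directives in `/etc/redis/redis.conf` would look roughly like this (the file path and the password are assumptions; exposing Redis on 0.0.0.0 should always be paired with a password and/or firewall rules):

```conf
# /etc/redis/redis.conf
# Listen on all interfaces, not just localhost.
bind 0.0.0.0

# If Redis is reachable from other machines, protect it:
# clients must AUTH with this password before issuing commands.
requirepass your-strong-password-here
```

Restart the service (e.g. `sudo systemctl restart redis-server`) for the change to take effect.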
Steps to accomplish this task:
- We first need a dedicated redis-server (your own physical machine, or one created using the Linode API) to store the URLs of the webpages to scrape data from.
- Run a web app that calls the Linode API to create servers, deploy your Scrapy spiders onto those servers, and start the spiders.
- Create the servers using the Linode API.
- Deploy your Scrapy spider code onto the created servers and execute the spiders.
- The spiders fetch URLs from the Redis database with the help of scrapy-redis and start scraping the webpages.
- The spiders then upload the scraped content from the webpages to the Redis database.
- Finally, we download the scraped data from the Redis server to our local machine.