UPDATE: On June 1, 2012, Washington State began allowing private-sector businesses to sell liquor and closed all state-owned stores. The Liquor Control Board web site I used in this example was closed along with them. I’ll leave this tutorial on the site, but only as a reference. If I get time to post a new Scrapy tutorial, I’ll post an update here as well.
This tutorial will walk you through a web-scraping project from scratch using Scrapy, a Python scraping framework. By the end, you’ll be able to:
- Create a spider from scratch using both GET and POST requests
- Handle the responses via Items and Pipelines
- Parse your data into multiple files (in this case – CSV)
- Use XPath selectors to find specific elements within a HTML document
What you’ll need:
- Scrapy (and required libraries – see Scrapy install guide)
- Tutorial files (or you can just go through the steps below)
- Python (I’m using 2.6, but 2.5 or 2.7 should work)
- A browser that makes viewing page markup easy, like Firefox (with Firebug) or Google Chrome
I won’t post any sort of install tutorials here, there is a good one on the Scrapy site already. If you are having trouble getting libraries to install (especially on a Mac), hit me up in the comments or email and I’ll do my best to help. Let’s get started.
Planning the data format
In Washington State, the Liquor Control Board site allows users to search all of the state-controlled liquor stores for specific brands, liquor types, prices, amounts, etc. This is the site we’ll use for our project. Sounds like some interesting data, no?
Before we start writing any code, let’s think about what data we’ll gather and how to save it. For our purposes we’ll focus on the ‘Product Availability‘ page of the site. This page has a list of brand categories in a dropdown list (Brandy, Vodka, Rum, etc.). When you select a category, the site loads a brands page (Smirnoff, Absolute, Grey Goose, etc). From there, you can click one of the ‘Find a store’ buttons to see where the product is in stock and how many bottles are left.
These three pages will be our focus of our project. We’ll break the data into three CSV files – brandCategoryTable.csv, brandsTable.csv, and inStockTable.csv. Here’s the format for each file:
brandCategoryTable.csv
brandCategoryId, brandCategoryName
brandsTable.csv
brandCategoryId, brandId, brandName, totalPrice, specialNote, size
storeStockTable.csv
brandId, stateStoreNumber, amountInStock
NOTE: This project may seem as if there is a database in its future, but that’s not part of this tutorial, so the data format is not a big focus. For now, we’re just focusing on scraping.
Creating a Scrapy project
Open a console and go to the location where you want to create your new project files. Then type:
1 | scrapy startproject waliquor |
This creates the basic structure for your project in a folder called “waliquor”. Before it will do anything, we need to define the spider, the items and the pipelines.
Building the spider
Spiders are user-written classes used to scrape information you’re after. They return Items.
Our spider will have three major steps:
- Request the brand search page containing the list of brand categories. Once we’ve received the response containing all of the page data, write the brand category data to CSV and then…
- Parse the data for each brand page and create a new request for each brand page. Once we receive the response for a brand page, write the brand data to CSV and then…
- Parse the data for each store page that has this brand in stock and write the store data to CSV.
Create a new file in your /waliquor/spiders/ folder. Name it “waliquor_spider.py” and insert the following code:
1 2 3 4 5 6 7 8 9 10 11 | from scrapy.spider import BaseSpider from scrapy.http import FormRequest from scrapy.http import Request from scrapy.selector import HtmlXPathSelector from waliquor.items import BrandCategoryItem class WaliquorSpider(BaseSpider): name = "liq.wa.gov" allowed_domains = ["liq.wa.gov"] SPIDER = WaliquorSpider() |
This is the basic structure always needed for a Scrapy spider. Now add this function below the allowed domains, but above “SPIDER = WaliquorSpider()”.
1 2 3 4 5 6 7 8 | # # Setup the initial request to begin the spidering process # def start_requests(self): brandCategoriesRequest = Request("http://liq.wa.gov/homepageServices/brandsearch.asp", callback=self.parseBrandCategories) return [brandCategoriesRequest] |
Start_request() is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified with a default start_urls list.
Once the request is made and the server returns a response, the callback function (self.parseBrandCategories()) takes over. Add this to your spider below the start_requests() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | # # Scrape, parse all the brand categories into a list # Create a request for every brand category # def parseBrandCategories(self, response): hxs = HtmlXPathSelector(response) # Gather all brand categories into list brandCategoryList = hxs.select('//form/div/center/table/tr/td/select/option[position()>1]/text()').extract() for i in range(len(brandCategoryList)): # Generate new request for each brand category's page yield FormRequest("http://liq.wa.gov/homepageServices/brandpicklist.asp", method='POST', formdata={'BrandName':'','CatBrand':brandCategoryList[i],'submit1':'Find+Product'}, callback=self.parseBrandPage, meta={'brandCategoryId':i,'brandCategoryName':brandCategoryList[i]}) # Create items for the brand category pipeline item = BrandCategoryItem() item['brandCategoryId'] = str(i) item['brandCategoryName'] = brandCategoryList[i] yield item |
Let’s review what’s happening: The function receives the response object from the initial request and creates a HtmlXPathSelector called “hxs”. HtmlXPathSelector gives us a way to look through the HTML and find the specific elements we’re after.
But before we can write an XPath statement, we have to look at the HTML by hand and figure out the structure. This is where a plugin like Firebug can come in very handy. Remember, in this step we’re after the list of brand categories. By looking at the page markup in Firebug, I can see that the brand categories are in a form, then a div, then a center tag, then a table, etc, etc. Eventually, you get to the actual options, where I’ve used the “[position()>1]” property to avoid grabbing the first item: “(Please select…)”. Next is a text() selector to grab only the text, not the tags around it.
1 | brandCategoryList = hxs.select('//form/div/center/table/tr/td/select/option[position()>1]/text()').extract() |
ROOKIE MISTAKE: “If you are viewing page markup in Firefox, remember that the browser adds its own < tbody > tags to tables, but Scrapy will not receive those in the response. Including < tbody > in your Scrapy XPath will always break it.
Now that we have the full list of brand categories, we must step through them and create a new request for each – to load the brand pages. Unlike the previous request, these pages will require a POST request.

Again, Firefox/Firebug is a great tool. In this case, you can use it to load a page and then look through the response the browser is sending. Using that information, we added a form request. You may notice another callback function there – we’ll get to that soon. For now, don’t worry about it.
The last part of the parseBrandCategories() function in your spider handles sending the brand category data to its corresponding item. The code creates a BrandCategoryItem item, then the two pieces of data we’re after – brandCategoryId and brandCategoryName – are populated. Remember these fields from our “brandCategoryTable.csv” file?
1 2 3 4 5 | # Create items for the brand category pipeline item = BrandCategoryItem() item['brandCategoryId'] = str(i) item['brandCategoryName'] = brandCategoryList[i] yield item |
Defining your items
Items are containers that will be loaded with the scraped data.
Open the items.py file in your project folder and overwrite everything with the following code:
1 2 3 4 5 | from scrapy.item import Item, Field class BrandCategoryItem(Item): brandCategoryId = Field() brandCategoryName = Field() |
That’s it! An item is a simple container for the values you just parsed via the spider. The last part is outputting your data to a CSV file. For this, we need a pipeline.
Creating pipelines
Pipelines receive the scraped data from the items and stores it in the way you specify.
In your project folder, there is a file called pipelines.py. Open the file and overwrite everything with the following:
1 2 3 4 5 6 7 8 9 10 11 12 | import csv import items class WaliquorPipeline(object): def __init__(self): self.brandCategoryCsv = csv.writer(open('brandCategoryTable.csv', 'wb')) self.brandCategoryCsv.writerow(['brandCategoryId', 'brandCategoryName']) def process_item(self, item, spider): self.brandCategoryCsv.writerow([item['brandCategoryId'], item['brandCategoryName'].title()]) return item |
The pipeline is using the built-in CSV abilities of Python to open a new file (or an existing one if it’s already been created) and then write a header row to the top. Next, when all the BrandCategoryItem instances are created by your spider, they will automatically be passed in to this pipeline where the process_item() class will add a row to “brandCategoryTable.csv”.
You’ve now gone from an initial request to outputting CSV file! We also completed the first step of our project – getting the brand category data. Don’t worry, the next two steps are very similar, so we’ll fly through them.
Rinse and repeat for the brand pages
Back in your spider class, add the following code at the top of the file:
1 | from waliquor.items import BrandItem |
Also, add this beneath the parseBrandCategories() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | # # Parse each individual brandCategory page # brandCode (unique key), brandCategoryId, brandName, brandCode, totalPrice, specialNote, size, proof # def parseBrandPage(self, response): hxs = HtmlXPathSelector(response) brandRows = hxs.select('//table[@class=\'tbl\']/tr[position()>1]') for brandRow in brandRows: brandId = brandRow.select('td[position()=2]/strong/text()').extract() # Generate new request for this brand category's page yield FormRequest("http://liq.wa.gov/homepageServices/find_store.asp", method='POST', formdata={'brandCode':brandId,'CityName':'','CountyName':'', 'StoreNo':''}, callback=self.parseBrandInStockPage, meta={'brandId':brandId}) item = BrandItem() item['brandId'] = brandId item['brandCategoryId'] = response.request.meta['brandCategoryId'] item['brandName'] = brandRow.select('td[position()=1]/strong/text()').extract() item['totalPrice'] = brandRow.select('td[position()=5]/strong/text()').extract() item['totalPrice'][0] = item['totalPrice'][0].replace('$','') item['totalPrice'][0] = item['totalPrice'][0].replace(',','') item['specialNote'] = brandRow.select('td[position()=7]/text()').extract() item['size'] = brandRow.select('td[position()=8]/text()').extract() item['proof'] = brandRow.select('td[position()=10]/text()').extract() yield item |
When the parseBrandCategories() function builds its requests for each brand page, this is the callback function that will receive the response. The process is almost identical to the previous function. It receives the response and creates a HtmlXPathSelector, but this time, instead of selecting strings we’re selecting table rows.
As the for loop steps through the list of rows, the various properties we want are again selected from the table rows using XPath and sent into the item – this time a BrandItem.
Want to learn more about XPath and selectors? Of course you do! Here’s a nice tutorial from Mozilla.
Next, open your items.py file and add the following below the BrandCategoryItem:
1 2 3 4 5 6 7 8 | class BrandItem(Item): brandId = Field() brandCategoryId = Field() brandName = Field() totalPrice = Field() specialNote = Field() size = Field() proof = Field() |
Now, we need to wire up the pipeline to take the item and store the data. But wait! We’re going to hit a slight snag here. If you remember, the pipelines.py file uses a function called process_item() to handle each item it receives. But in our case, we have three different items from different functions that need to write to different CSV files. Currently, our pipelines.py class will handle every item the same. Here’s how we’ll get around that:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | import csv import items class WaliquorPipeline(object): def __init__(self): self.brandCategoryCsv = csv.writer(open('brandCategoryTable.csv', 'wb')) self.brandCategoryCsv.writerow(['brandCategoryId', 'brandCategoryName']) self.brandsCsv = csv.writer(open('brandsTable.csv', 'wb')) self.brandsCsv.writerow(['brandCategoryId', 'brandId', 'brandName', 'totalPrice', 'specialNote', 'size', 'proof']) def process_item(self, item, spider): if isinstance(item, items.BrandCategoryItem): self.brandCategoryCsv.writerow([item['brandCategoryId'], item['brandCategoryName'].title()]) return item if isinstance(item, items.BrandItem): self.brandsCsv.writerow([item['brandCategoryId'], item['brandId'][0], item['brandName'][0].title(), item['totalPrice'][0], item['specialNote'][0].title(), item['size'][0], item['proof'][0]]) return item |
With this change, the process item function handles the items differently, depending on their type. Now we have lots of data moving through to our CSVs, only one more step – the store page.
Last step, gather and store data
This process should be starting to make sense now. Let’s go back to the beginning – the spider class – one more time. Add the following code at the very beginning:
1 | from waliquor.items import StockItem |
Then add this function below the parseBrandPage() function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | # # Parse each individual brand's store availability page # brandId ("Brand Code" provided), stateStoreNumber(id), amountInStock # def parseBrandInStockPage(self, response): hxs = HtmlXPathSelector(response) storeRows = hxs.select('//table[@class=\'tbl\']/tr[position()>1]') items = [] for storeRow in storeRows: item = StockItem() item['brandId'] = response.request.meta['brandId'] item['stateStoreNumber'] = storeRow.select('td[position()=1]/text()').extract() item['amountInStock'] = storeRow.select('td[position()=5]/font/text()').extract() items.append(item) return items |
Next, we’ll add our StockItem to the item.py class:
1 2 3 4 | class StockItem(Item): brandId = Field() stateStoreNumber = Field() amountInStock = Field() |
Lastly, here’s what the complete pipelines.py file should look like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | import csv import items class WaliquorPipeline(object): def __init__(self): self.brandCategoryCsv = csv.writer(open('brandCategoryTable.csv', 'wb')) self.brandCategoryCsv.writerow(['brandCategoryId', 'brandCategoryName']) self.brandsCsv = csv.writer(open('brandsTable.csv', 'wb')) self.brandsCsv.writerow(['brandCategoryId', 'brandId', 'brandName', 'totalPrice', 'specialNote', 'size', 'proof']) self.storeStockTable = csv.writer(open('storeStockTable.csv', 'wb')) self.storeStockTable.writerow(['brandId', 'stateStoreNumber', 'amountInStock']) def process_item(self, item, spider): if isinstance(item, items.BrandCategoryItem): self.brandCategoryCsv.writerow([item['brandCategoryId'], item['brandCategoryName'].title()]) return item if isinstance(item, items.BrandItem): # Double check that items in the pipeline exist # Otherwise, an item with a an empty list would # be completely skipped over by Scrapy try: item['brandId'][0] except: item['brandId'].append("") try: item['brandCategoryId'] except: item['brandCategoryId'] = "9999" try: item['brandName'][0] except: item['brandName'].append("") try: item['totalPrice'][0] except: item['totalPrice'].append("") try: item['specialNote'][0] except: item['specialNote'].append("") try: item['size'][0] except: item['size'].append("") try: item['proof'][0] except: item['proof'].append("") self.brandsCsv.writerow([item['brandCategoryId'], item['brandId'][0], item['brandName'][0].title(), item['totalPrice'][0], item['specialNote'][0].title(), item['size'][0], item['proof'][0]]) return item if isinstance(item, items.StockItem): self.storeStockTable.writerow([item['brandId'][0], item['stateStoreNumber'][1], item['amountInStock'][0]]) return item |
Besides adding the StockItem in the pipelines class, you’ll noticed I also added a stack of try/excepts to the part handling the BrandItems. In some cases, the values in those fields are sometimes empty. These check make sure that the value will be returned as an empty string, not null in those cases.
Running your spider
If you haven’t already, you’re probably ready to test your project and scrape some data. To run your scraper, open your console and go to your project folder. Next, type:
1 | scrapy crawl liq.wa.gov |
You should begin to see lots of lines whizzing by as Scrapy makes requests, sends items, etc. If things don’t work the first time, don’t sweat it. Take a close look at any error messages you may be seeing. Python is usually good at pointing you to the file and line number where the error occurred. Also, the link to the working tutorial files is at the beginning of this post. Keep in mind that we’re scraping a lot of data from the site and making thousands of requests. Don’t be surprised if the spider takes 30-40 minutes to finish.
Want to speed up the spider for testing purposes? Go to the parseBrandCategories() function in the spider and change the for loop to read:
1 | for i in range(len(brandCategoryList)-33): |
This will limit the brand categories you parse to just the first two in the list.
Comments (14)
Excellent tutorial, thank you!!
Awesome Tutorial , Thank you.Keep up the good work.
Aditya
Excellent! I just started scraping in python and this tutorial is very helpful! I’ve been scraping using Java and somehow wants to learn also in Python.
Why are you using the “yield” command for the FormRequest()? I don’t see the need for a iterator in this case?
Hi Rivka,
The “yield” used before “FormRequest()” is because the function also returns some items a bit farther down. You’re right that we probably don’t need an iterator there, but you can’t have a return and a yield in the same function in Python. So, instead, two yields do the trick.
thanks
Thanks for the tutorial however some links are dead and therefore the tutorial won’t work if followed to a tee. Specifically the brand search index is now located at, http://www.liq.wa.gov/LCBhomenet/StoreInformation/BrandSearch.aspx
Hi,
It is a great post. I am writing my first scrapy spider. In my case when i send a FormRequest, I receive Json response,instead of html. Which selector I should use? or How can i parse the response to retrieve data.
Best regards,
Hung
Hi, The WA governmentt site has changed and what you write here is not working. Help Help Help please….i want to write a simple scraper but cant use your nice because the site has changed.
Help,,,RC
Your tutorial is very nice but because the WA government site has changed….what you write here is not working. Help Help Help please….i want to write a simple scraper but cant use your nice blog guide now, because the site has changed.
Help,,,RC
Great tutorial!! Thank you!
It would be great if you can post another tutorial about scrapy since the web site in this example was closed. Because I love to follow along.
Thanks
Do you know if the method that you applied in start_requests() is still a valid way of getting the URL?
Thanks for writing this up.
It’s unfortunate that the site was brought down.
Nevertheless a fantastic tutorial.
I found this really useful – thanks!
One trackback
[...] Tutorial: Gathering data with the Scrapy framework. I found this much easier to follow than the official docs and more relevant to my use case. Unfortunately the site he uses as an example no longer exists, but that didn’t affect me. [...]