Not your usual Web-Scraping Tutorial

The problem

Most web-scraping tutorials I see around the internet fall into one of two camps: they either show you how to scrape some data from one specific page/site using the basic functionality of requests or selenium and beautifulsoup4, without going beyond the packages’ documentation, or they promote some crawling/scraping/proxy service and, again, don’t go beyond the documentation that is already there - they just add a new service to use.

Let’s first define data scraping - it is a form of collecting data from human-readable output. In the case of web scraping, the readable output is a web page - HTML-formatted text that the browser knows how to display. So we try to extract structured data from the unstructured, formatted text that the browser renders.

The high-level view of the process is usually simple:

  1. Get the result HTML using some library or tool to perform the request

  2. Parse the resulting HTML extracting the required information

  3. Store that information

So if you follow along, you will see that most tutorials follow the same steps (a minimal sketch follows the list):

  1. Use requests/httpx or selenium to perform the request and get HTML response

  2. Use BeautifulSoup4 to extract the information

  3. Probably store it in a CSV file and call it a day
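
For reference, here is a minimal sketch of that typical flow. The URL and the selectors are placeholders, not a real target:

    import csv

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL and selectors - swap these for your real target.
    URL = "https://example.com/products"

    response = requests.get(URL, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for item in soup.select("div.product"):        # hypothetical container selector
        title = item.select_one("h2")              # hypothetical title element
        price = item.select_one("span.price")      # hypothetical price element
        rows.append({
            "title": title.get_text(strip=True) if title else "",
            "price": price.get_text(strip=True) if price else "",
        })

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)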

What are the problems?

In my view that is highly abstracted and usually doesn’t teach the reader about the obstacles and challenges that appear when you try to do this at scale. The websites that are most interesting to extract data from already prevent, or try to prevent, you from getting that data. Some pages are badly structured (oftentimes exactly to stop you from getting the data). It takes time to understand the structure of a page and to find the correct selectors for specific data points.

Let’s go through the problems one by one, in more detail, with some tips on how to overcome them:

No 2xx response

Well, as it often happens, you don’t always get the desired response. Sometimes you get 404 - Not Found, or 403 - Forbidden, or the page now requires authentication (401), which you need to provide either as a header token or a cookie. Or the server is blocking your requests with HTTP status codes in the 5xx range. Or perhaps the website is malfunctioning and you are constantly redirected in a loop.

What we can do to handle these cases:

Have logic that handles the 4xx errors. If the page has changed or no longer responds on the same URL, perhaps we should remove it from future runs, so we don’t lose time scraping something that doesn’t exist. If we have a problem with authentication, that requires more investigation on our side to understand what has changed in the requirements. But we can safely say that most errors in the 4xx range are some sort of problem we need to fix in our own code logic - or perhaps the server expects some specific parameters to be present in the request headers.
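
To make that concrete, here is a rough sketch of such handling logic; the returned labels and the decision rules are my own illustration, not a fixed recipe:

    import requests

    def fetch(url, session=None, headers=None):
        """Fetch a URL and decide what to do with common 4xx responses.
        The returned labels ("drop", "investigate", ...) are just an illustration."""
        session = session or requests.Session()
        response = session.get(url, headers=headers, timeout=10)

        if response.status_code in (404, 410):
            # The page is gone - mark the URL so we stop scheduling it.
            return "drop", None
        if response.status_code in (401, 403):
            # Auth or access problem - investigate what headers/cookies/tokens changed.
            return "investigate", None
        if response.ok:
            return "ok", response.text
        # Everything else (including 5xx) goes to the retry/back-off path.
        return "retry", None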

For the 5xx errors, there are a lot of possibilities:

  • The server cannot handle that many requests at a time, so we can implement some back-off logic (in simple terms: wait before retrying; see the sketch after this list).

  • The server has banned our IP - in this case, we should consider using a proxy or a proxy pool so we can change proxies, and thus IPs, before retrying the request. Or perhaps use Tor for the connection, but this could make the whole process slower.
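
Below is a minimal sketch that combines both ideas - exponential back-off plus rotation through a proxy list on 5xx responses or connection errors. The proxy addresses are placeholders; in practice they would come from a proxy pool service:

    import itertools
    import time

    import requests

    # Hypothetical proxy pool - in practice this comes from a proxy service.
    PROXIES = [
        {"https": "http://proxy-1.example:8080"},
        {"https": "http://proxy-2.example:8080"},
    ]

    def fetch_with_backoff(url, max_attempts=5):
        proxy_cycle = itertools.cycle(PROXIES)
        for attempt in range(max_attempts):
            try:
                response = requests.get(url, proxies=next(proxy_cycle), timeout=10)
            except requests.RequestException:
                response = None
            if response is not None and response.status_code < 500:
                return response
            # Exponential back-off: wait 1s, 2s, 4s, 8s, ... before retrying.
            time.sleep(2 ** attempt)
        raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")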

Even at this stage, you can see that it is not trivial just to get the response. A lot of preparation goes into the request itself (a sketch follows the list):

  • Headers

  • Cookies

  • Retry policy

  • Backoff strategy

  • Changing IPs - request origin
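
With requests, most of this preparation can be collected on a single Session object. The sketch below uses placeholder header, cookie and proxy values; the retry behaviour comes from urllib3’s Retry class:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()

    # Headers and cookies the target server expects (placeholder values).
    session.headers.update({
        "User-Agent": "my-scraper/1.0 (placeholder)",
        "Accept-Language": "en-US,en;q=0.9",
    })
    session.cookies.set("session_id", "placeholder-value")

    # Retry policy with exponential back-off on typical transient status codes.
    retry = Retry(total=5, backoff_factor=1,
                  status_forcelist=[429, 500, 502, 503, 504])
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))

    # Changing the request origin through a proxy (placeholder address).
    session.proxies.update({"https": "http://proxy-1.example:8080"})

Reusing that one prepared session for every request keeps the headers, cookies, retries and proxy settings in a single place.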

Note

Even if you are getting a 2xx response, that doesn’t necessarily mean you are getting the expected HTML. It could be a captcha, and then you will need logic to solve it and redo the request.
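
A cheap sanity check on the body of a 2xx response can catch this before the parser runs. The marker strings below are hypothetical - each site needs its own:

    def looks_like_real_page(html: str) -> bool:
        """Heuristic check that a 200 response is the page we expect,
        not a captcha or an interstitial. The marker strings are hypothetical."""
        lowered = html.lower()
        if "captcha" in lowered or "verify you are human" in lowered:
            return False
        # Expect a marker that only the real page contains.
        return "product-detail" in lowered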

Unexpected changes in the response

Let’s say we have covered the first part, and 99% of the time we can get the HTML page we request. We are now faced with parsing the HTML, but after doing this for some time you will notice that some sites do A/B testing, change their design, or change the data based on your location. There are quite a few points you need to consider when writing your parsing/extraction logic.

Let’s start with the not-so-obvious but simpler one to solve.

Text/Data changes based on the request origin.

If you request, for example, google.com from the USA, you will get the result in English, but if you request it from another country, the response will most likely be in the native language of the originating country. Again, you have to be careful about which headers and parameters are required to force the server to respond with the expected HTML. Some sites will not let you change the language just by using parameters and headers, or getting the information on how to change the language requires one more request; then you would again need some proxy pool service, so you can be sure that your request originates from the expected location. You could do this with Tor too, by the way, but that would be a topic for a whole new post.
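
In practice this usually means pinning the language everywhere the site looks for it - headers, query parameters and the request origin. All values in this sketch are placeholders (the parameter names differ per site):

    import requests

    response = requests.get(
        "https://example.com/search",
        params={"hl": "en", "gl": "US"},                    # hypothetical language/region parameters
        headers={"Accept-Language": "en-US,en;q=0.9"},      # ask for English explicitly
        proxies={"https": "http://us-proxy.example:8080"},  # placeholder US exit point
        timeout=10,
    )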

Text/Data changes because of A/B Testing or new design

That’s, in my opinion, the most neglected aspect of web-scraping. Just imagine: you have implemented the whole logic, you start your scraper, everything looks normal, and one day you don’t get any data. Your parser is failing, although until yesterday everything worked as expected. You open the page, just to see that everything looks normal. You go to the logs, just to see that it is failing only on some of the requests, and when you open a failing request URL in your browser you see that the title and the price are in different places on the page, and the selector is not working for this specific case. This can lead you to implement some if/else logic in your parser to cover the case.

But if these changes happen often (on bigger websites they often do), your parsing logic will be bloated with if/elif/else statements for every possible scenario, and you will have to keep adding clauses until it becomes unmanageable. You will realise that using a simple parsing library was nice at the beginning, but you will have to wrap its functionality in a custom class, introduce new logic to handle special cases, and preserve backward compatibility. You are not the only one facing this problem, so here are some thoughts on how to solve it:

  • Abstract the parsing functionality into something easily manageable

  • Selectorlib is a perfect example of how you can achieve this - define your selectors in some structure (YAML/JSON/TOML) and use it as input to your parser; this way it is easier to add new cases. I have created an example library, pypapsy, that builds on that idea, allowing you to attach a list of XPath/CSS/Regex selectors to a specific field; the library tries every listed selector until one returns a result (see the sketch after this list).

  • Define the desired output format so you can check your data for consistency and use that information to notify yourself or your team that there is a problem with the page.
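
Here is a minimal sketch of the fallback-selector idea. It is not the actual selectorlib or pypapsy API - just the concept, using BeautifulSoup, CSS selectors and a YAML config, with placeholder selector values:

    import yaml
    from bs4 import BeautifulSoup

    # Selectors live in config, not in code; each field lists fallbacks in priority order.
    SELECTORS = yaml.safe_load("""
    title:
      - h1.product-title
      - h1#title
    price:
      - span.price
      - div.price-block span
    """)

    def extract(html: str) -> dict:
        soup = BeautifulSoup(html, "html.parser")
        result = {}
        for field, selectors in SELECTORS.items():
            result[field] = None
            for selector in selectors:
                node = soup.select_one(selector)
                if node is not None:
                    result[field] = node.get_text(strip=True)
                    break
        return result

Fields that come back as None can then feed the consistency check from the last bullet and trigger a notification.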

Of course, this topic can be extended and looked at more in-depth, but that’s not the point of this post.

Store your result

Say you have thought of everything: you have implemented some of the best possible practices to get a reliable response, and you have thought about parsing strategies so you can handle changes in the HTML without stopping the flow of data. Now you need to decide where and how to store your data. Well, that is your problem, to be fair. I will not give guidelines or my opinion on that, since it depends on your use case.

I will concentrate on the following scenario:

Say you are getting your data, but the data changes very often (hourly, for example) and once it changes you cannot get the previous state. You know how to handle changes in the HTML, but implementing the required code changes and deploying them to the service could result in data holes. Missing data is not cool; especially if you do scraping as a business, missing data could turn your customers off. One way of preventing this (perhaps preventing is a strong word), or rather working around such problems, is to introduce a backup: a place on a server or in block storage, somewhere you can cheaply store a lot of files. You would use the backup to store the raw responses - yes, the HTML, unparsed. So when a page changes, you can fill the holes of missing data by reprocessing your backup with the fixed parsing logic. The exact implementation of such logic and backup is also not within the scope of this post. It is more or less food for thought.
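
As food for thought, here is a sketch of the raw-response backup, using the local filesystem as a stand-in for whatever block storage you would actually pick:

    import gzip
    import hashlib
    from datetime import datetime, timezone
    from pathlib import Path

    BACKUP_DIR = Path("raw_responses")  # stand-in for cheap block/object storage

    def backup_response(url: str, html: str) -> Path:
        """Store the raw, unparsed HTML so it can be re-parsed later,
        once the selectors are fixed, to fill any data holes."""
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
        path = BACKUP_DIR / f"{digest}_{stamp}.html.gz"
        path.write_bytes(gzip.compress(html.encode("utf-8")))
        return path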

Final words

If you have reached the end of this post, you will probably agree that web-scraping is not that trivial when done at scale. To get to my point - most web-scraping tutorials are good for one-off scripts, where, for example, you want to reformat some big table into another format. But usually, implementing the script is only a little faster than doing it manually. Yes, I know some of you can develop something a lot faster, but if you could, you would probably not need to read a web-scraping tutorial for beginners.

I have not covered everything that goes into web-scraping. Some of the topics I have skipped are handling captchas, respecting robots.txt, the whole process of analysing the HTML structure and, of course, finding internal APIs that you could use to get structured data directly. But if there is enough interest, I could do follow-ups on specific topics.