The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder, then, that many new and confusing terms are coined and used every day, such as Data Science, Data Mining, Data Harvesting, Web Scraping, Web Crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at two of these terms to try to answer these questions: Web Scraping and Web Crawling.
Let’s start with the formal definitions:
Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – Extracting specific data from websites.
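To make the two definitions concrete, here is a minimal sketch using only Python's standard library. The sample HTML, the class names, and the helper functions `find_links` and `extract_title` are all invented for illustration. Collecting links is the part a crawler cares about (where to go next), while pulling out the title is a small example of scraping (one specific piece of data).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Crawling side: gather href values so a crawler knows where to go next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class TitleExtractor(HTMLParser):
    """Scraping side: pull out one specific piece of data (the <title> text)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page content, made up for this example.
SAMPLE_HTML = """
<html><head><title>Blue Widget</title></head>
<body><a href="/widgets/red">Red</a> <a href="/widgets/green">Green</a></body></html>
"""


def find_links(html):
    """Return every link target found in the page (what a crawler follows)."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def extract_title(html):
    """Return the page title (what a scraper might keep)."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title.strip()
```

Both functions read the same document; they differ only in what they keep from it.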
As you can see, the terms have quite clear definitions, and some people suggest that it is crucial to understand the minute differences if you want to succeed in the industry. But is that true?
Real World Answer
We are a company that has specialized in Web Scraping services for years. We talk to our present and prospective clients on a daily basis, sometimes several times a day. In these real-world conversations, the terms Web Scraping and Web Crawling are often used interchangeably, with no attempt at precision. The reality is that there are websites out there with valuable data that needs to be extracted in a structured format, and how you label the process matters very little.
What Do We Actually Do?
Looking back at the projects we have done over these years, a simple pattern emerges. The vast majority of our projects are about creating robots that do targeted web crawling (crawling not the entire internet, but only specific websites) and do the web scraping immediately, as each web page is retrieved. So both processes occur together in real time. Most often we discard almost the entire retrieved HTML document and save only the bits of information that our clients need. In some cases we save the entire HTML for traceability or for further analysis. So the line between web crawling and web scraping becomes somewhat blurred as the amount of data extracted varies.
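The pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the function `crawl_and_scrape`, the `fetch` callable, the `keep_html` flag, and the `shop.example` pages are all hypothetical. Fetching is passed in as a function so the sketch runs without a network; a real robot would plug in an HTTP client and add politeness measures such as rate limiting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects links (for crawling) and the <h1> text (for scraping) in one pass."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.heading = ""
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.heading += data


def crawl_and_scrape(start_url, fetch, keep_html=False, max_pages=100):
    """Crawl a single site breadth-first and scrape each page as it arrives.

    fetch is any callable that maps a URL to an HTML string, or None on failure.
    """
    allowed_host = urlparse(start_url).netloc  # targeted: stay on this host only
    queue = deque([start_url])
    seen = {start_url}
    records = []
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Scrape immediately: keep only the fields we need...
        record = {"url": url, "heading": parser.heading.strip()}
        if keep_html:
            record["html"] = html  # ...or keep the raw page for traceability
        records.append(record)
        # Crawl: queue unseen links that stay on the target site.
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return records


# Hypothetical in-memory "website" standing in for real HTTP responses.
FAKE_SITE = {
    "https://shop.example/": '<h1>Home</h1><a href="/a">A</a>'
                             '<a href="https://other.example/x">X</a>',
    "https://shop.example/a": '<h1>Item A</h1><a href="/">Home</a>',
}

records = crawl_and_scrape("https://shop.example/", FAKE_SITE.get)
```

With `keep_html=False` the raw document is dropped as soon as the needed fields are extracted; setting it to `True` keeps the full HTML alongside them, which is the traceability case mentioned above.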
In the end, we found that the essential thing is clear communication about what needs to be done, rather than how to define it. However, this is just our opinion based on our experience. Depending on the project you are working on, or the business model you adopt, you might reach a different conclusion. In any case, we can all agree on one thing: Web Scraping at scale is cool!