Web Scraping

Analysis of different web scraping approaches, best practices and challenges. Overview of Web Robots scraper features.

15 Feb, 2023

New Functions Added

February 15th, 2023 | Web Scraping | 0 Comments

Web Robots scraping framework documentation has been updated with new functions:

  • blockImages() – changes browser settings so that images are not downloaded. This function is useful when bandwidth is a concern, and it sometimes results in faster crawling.
  • allowImages() – reverses the browser settings changes made by blockImages().
  • closeSocket() – closes all idle socket connections in the browser.

Web Robots has been using these functions on the internal platform for over 6 months, and they have proved to be a great help in some scenarios. Now they are available in our public extension.

1 Oct, 2020

Our Chrome extension has been updated

October 1st, 2020 | Web Scraping | 0 Comments

Our public developer extension (IDE) has been untouched since March 2019. It may look like Web Robots was stagnating, but we were actually working constantly on our internal systems such as the portal, cloud workers and cloud orchestration. We also had several internal releases of the IDE for our staff.

So the IDE published to the Chrome webstore is just the tip of the iceberg.

It is now September 2020 and the time has come to release the new version to the Chrome webstore. We are glad that our extension passed the webstore’s permission audit on the first attempt, as our extension requires access to quite a few Chrome APIs in order to work properly and people at Google are getting ever stricter in their review process for […]

28 Apr, 2020

Instant Data Users Group on Facebook

April 28th, 2020 | Datasets, Web Scraping | 0 Comments

We have launched a Facebook group where Instant Data Scraper users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users.

We hope that the new Facebook group will grow into a community where users can support each other.

6 May, 2019

Web Scraping vs Web Crawling

May 6th, 2019 | Web Scraping | 2 Comments

The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at a couple of terms to try and answer these questions: Web Scraping and Web Crawling.

Formal Answer

Let’s start with the formal definitions:

Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – extracting specific data from […]

10 Apr, 2019

Advanced AJAX Techniques for Web Scraping

April 10th, 2019 | Web Scraping | 0 Comments

Basic AJAX usage within Web Robots scraper

The best and simplest way to perform AJAX calls with the scraper is to use jQuery's $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.

// Standard JQuery AJAX call
$.ajax({
    url:'https://webrobots.io',
    method: 'GET'
}).done( function(resp){
    console.log(resp); 
});

// Simplified AJAX call
$.get('https://webrobots.io').done( function(resp){
   console.log(resp); 
});

Since AJAX is asynchronous, the step's done() call should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn’t be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.

Example of incorrect and correct done() placement in AJAX:

INCORRECT

steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
[…]
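A minimal sketch of the placement rule described above, written so that it runs outside the scraper. The `$` object, `done()` and `flushResponses()` defined here are hypothetical stand-ins: in a real robot `$` is jQuery and `done()` is supplied by the framework. The stub delivers responses only when flushed, which mimics the gap between issuing a request and receiving its response.

```javascript
// Hypothetical stand-ins so the sketch runs outside the scraper.
// In a real robot, $ is jQuery and done() is provided by the framework.
const events = [];
const pendingResponses = [];
const $ = {
  get: function (url) {
    return {
      done: function (callback) {
        // Hold the callback until the "response" arrives.
        pendingResponses.push(function () {
          callback('<html>stub response for ' + url + '</html>');
        });
      }
    };
  }
};
function done() { events.push('step finished'); }
function flushResponses() {
  while (pendingResponses.length > 0) {
    pendingResponses.shift()();
  }
}

const steps = {};

// done() sits inside the AJAX callback, so the step is marked as
// finished only after the response has been processed.
steps.start = function () {
  $.get('https://webrobots.io').done(function (resp) {
    events.push('response processed');
    done();
  });
};

steps.start();
events.push('steps.start returned');
flushResponses(); // simulate the response arriving later
console.log(events);
```

Note that steps.start returns before the response is processed; had done() been called at the end of steps.start instead, the step would have been reported as finished before any data was handled.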

25 Mar, 2019

Using Sitemaps in Web Scraping Robots

March 25th, 2019 | Web Scraping | 1 Comment

We often use the technique of spidering through categories, along with pagination or infinite scroll, when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and imitation of dynamic content loading.

After all, sitemaps are designed for robots to find all resources on a particular domain.
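As a rough illustration of why sitemap-based robots can be simple, the sketch below pulls every URL out of a sitemap document. The function name `extractSitemapUrls`, the sample XML and the example.com URLs are assumptions made for this illustration; the `<urlset>`, `<url>` and `<loc>` elements come from the sitemaps.org protocol. A regular expression is used instead of an XML parser only to keep the sketch dependency-free.

```javascript
// Hypothetical helper: returns the text of every <loc> element in a sitemap.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps escape "&" as "&amp;", so undo that for the final URL.
    urls.push(match[1].replace(/&amp;/g, '&'));
  }
  return urls;
}

// Made-up sample in the sitemaps.org format.
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2?ref=a&amp;page=2</loc></url>
</urlset>`;

const sitemapUrls = extractSitemapUrls(sampleSitemap);
console.log(sitemapUrls);
```

In a robot, each returned URL could then be queued as its own step, which replaces the category and pagination logic with a single pass over the list.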

Example of a sitemap:

Finding Sitemaps

4 Mar, 2019

Scraping Dynamic Websites Using The wait() Function

March 4th, 2019 | Web Scraping | 0 Comments

Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM.

For example, on Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears under the selector $('div[class^="ProductPage__details"]').

steps.start = function() {
    console.log($('div[class^="ProductPage__details"]').length);
    done();
};

The logged result is 0, because our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy – use setTimeout()

We can use setTimeout() where we specify the […]
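One way to sketch the idea of waiting for an element is a small polling helper built on setTimeout(). The helper name `pollUntil` and its parameters are hypothetical, and this is not the framework's wait() function; the `schedule` option exists only so the sketch can be exercised without real delays.

```javascript
// Hypothetical helper: re-checks a condition at intervals until it
// holds or the attempts run out. Not the framework's wait() function.
function pollUntil(isReady, onReady, onTimeout, options) {
  const intervalMs = options.intervalMs;
  const schedule = options.schedule || setTimeout;
  let attemptsLeft = options.maxAttempts;

  function attempt() {
    if (isReady()) {
      onReady();
    } else if (--attemptsLeft > 0) {
      schedule(attempt, intervalMs);
    } else {
      onTimeout();
    }
  }
  attempt();
}

// Demo with a fake page whose element "appears" on the third check.
// A synchronous scheduler replaces setTimeout so the demo runs instantly.
let checks = 0;
const outcome = [];
pollUntil(
  function () { checks += 1; return checks >= 3; },
  function () { outcome.push('element found after ' + checks + ' checks'); },
  function () { outcome.push('gave up'); },
  { intervalMs: 500, maxAttempts: 10, schedule: function (fn) { fn(); } }
);
console.log(outcome);
```

In a robot, the isReady check could test that $('div[class^="ProductPage__details"]').length is greater than 0, with the data extraction and done() placed in the onReady callback.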

20 Feb, 2019

Web Scraping Performance Tuning With fastnext()

February 20th, 2019 | Web Scraping | 0 Comments

Scraping websites can be a time-consuming process, and when limited computing resources are combined with the need for frequent, up-to-date data, a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, so making a robot even fractionally more efficient can save a lot of valuable time.

There are a number of ways to optimize your robot to run faster: replacing setTimeout with our internal wait function, careful usage of loops, not using excessive delay timers in the step done function, etc. However, one of the best methods so far has proven to be using AJAX requests instead of visiting a website […]

21 Jun, 2017

New IDE Extension Release

June 21st, 2017 | Web Scraping | 2 Comments

Today we are releasing an update to our main extension – Web Robots Scraper IDE. This release has version number 2017.6.20 and brings several improvements to the UI, proxy settings control, and the handling of hash symbols in URLs.

Version 2017.6.20 RELEASE NOTES

  • UI: Robot run statistics are now displayed in the same place and no longer “jump”.
  • UI: when a robot finishes, its status is a direct link to the robot run list on the portal. The run link is a direct link to data preview and download on the portal.
  • setProxy() functionality has been expanded. See documentation for details.
  • Bugfix: fixed a bug where subsequent steps with URLs having identical address before # symbol were not loading correctly (Example: http://foobar.com#a and after that go to http://foobar.com#b).
  • Other internal engine improvements […]

2 Mar, 2017

Email And Social Media Links Crawling From Websites

March 2nd, 2017 | Web Scraping | 3 Comments

At Web Robots we often get inquiries about projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and salespeople for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the Web Robots Chrome extension on their own computer.

To start, you will need an account on the Web Robots portal and the Chrome extension, and that's it. We placed a robot called leads_crawler in our portal’s Demo space so anyone can use it. In case the robot’s code is changed, the complete source code for this robot is below. You […]
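As a rough sketch of the kind of extraction such a robot performs, the function below pulls email addresses and social media links out of an HTML string. The name `extractLeads`, the list of social domains, the regular expressions and the sample HTML are all assumptions for illustration; this is not the leads_crawler code.

```javascript
// Hypothetical helper, not the leads_crawler source.
const SOCIAL_DOMAINS = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com'];

function extractLeads(html) {
  // Deliberately simple email pattern; real-world addresses can be more exotic.
  const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;

  // Sets remove duplicates, e.g. an address shown both as link and text.
  const emails = new Set(html.match(emailPattern) || []);
  const socialLinks = new Set();

  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[1];
    let host;
    try {
      host = new URL(href).hostname.toLowerCase();
    } catch (e) {
      continue; // relative links cannot be parsed without a base and are skipped
    }
    const isSocial = SOCIAL_DOMAINS.some(function (domain) {
      return host === domain || host.endsWith('.' + domain);
    });
    if (isSocial) {
      socialLinks.add(href);
    }
  }
  return { emails: Array.from(emails), socialLinks: Array.from(socialLinks) };
}

// Made-up sample page fragment.
const sampleHtml =
  '<a href="mailto:info@example.com">info@example.com</a>' +
  '<a href="https://www.facebook.com/examplepage">Facebook</a>' +
  '<a href="/contact">Contact</a>' +
  '<a href="https://example.com/blog">Blog</a>';

const leads = extractLeads(sampleHtml);
console.log(leads);
```

Matching on the parsed hostname rather than on the raw link text avoids counting a link such as https://example.com/facebook.com-review as a social media profile.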