nicerobot – Web Scraping Service

New Functions Added

nicerobot — Wed, 15 Feb 2023 08:39:14 +0000

Web Robots scraping framework documentation has been updated with new functions:

blockImages() – changes browser settings regarding image downloading. This function is useful in scenarios when bandwidth is a concern. Sometimes it results in faster crawling speeds.
allowImages() – reverses browser settings changes made by blockImages().
closeSocket() – closes all idle socket connections in browser.

Web Robots has been using these functions on the internal platform for over 6 months and they have proved to be a great help in some scenarios. Now they are available in our public extension.

Instant Data Scraper v1.0.7 released

nicerobot — Thu, 26 May 2022 12:40:34 +0000

Today we are releasing an update to our Instant Data Scraper. Version 1.0.7 has the following improvements:

Performance improvement for websites with large HTML structure (Google Maps for example).
Improved “Next” page button behaviour.
Migrated to manifest version 3, because manifest version 2 will be phased out by Google soon.

We are also working on some other features for Instant Data. They will be released in the near future!

Instant Data Scraper is now available on Microsoft Edge

nicerobot — Fri, 18 Dec 2020 06:45:10 +0000

We received an invitation from Microsoft to publish our Chrome extensions to the Microsoft Edge webstore. The Microsoft Edge browser is Chromium based, so porting extensions to it should be easy. Actually it was even easier than expected – Edge’s developer dashboard accepted the exact same zip file that we use for the Chrome webstore. The extension just works without any changes.

Anyone can download Instant Data for Microsoft Edge here.

Our Chrome extension has been updated

nicerobot — Thu, 01 Oct 2020 12:24:45 +0000

Our public developer extension (IDE) has been untouched since March 2019. It may look like Web Robots was stagnating, but actually we were constantly working on our internal systems like the portal, cloud workers and cloud orchestration. We also had several internal releases of the IDE for our staff.

So the IDE published to the Chrome webstore is just the tip of the iceberg.

It is now September 2020 and the time has come to release the new version to the Chrome webstore. We are glad that our extension passed the webstore’s permission audit on the first attempt, because it requires access to quite a few Chrome APIs in order to work properly, and people at Google are getting ever stricter in their review process for extension permissions.

All changes and new features are listed in the changelog here.

PS our IDE extension works only for users who have an approved account on Web Robots portal.

Instant Data Users Group on Facebook

nicerobot — Tue, 28 Apr 2020 11:02:25 +0000

We have launched a Facebook group where Instant Data Scraper users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users.

We hope that the new Facebook group will grow into a community where users can support each other.

Instant Data Scraper Update

nicerobot — Tue, 17 Dec 2019 13:58:47 +0000

In October and November of this year we decided to survey Instant Data Scraper extension users to see where the Web Robots team should focus for the next update. We already had some ideas from user emails that we received over the last couple of years, but we needed more systematic evidence to see which features would be most desired. Among the features we considered were infinite scroll support, running jobs in the cloud, processing batches of URLs and proxy support.

Before the end of the survey it became clear that infinite scroll support was by far the most desired feature, and we decided to release it as soon as possible. On December 11th we published version 0.2.0 to the Chrome Webstore. Enjoy it!

Other features will follow as well. We are happy to see that our web scraping tool has grown past 40k users and has excellent reviews!

Installs per day over the lifetime of our extension.

Web Scraping vs Web Crawling

nicerobot — Mon, 06 May 2019 09:37:58 +0000

The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at a couple of terms to try and answer these questions: Web Scraping and Web Crawling.

Formal Answer

Let’s start with the formal definitions:

Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – extracting specific data from the websites.

As you can see the terms have quite clear definitions, and some people suggest that it is crucial to understand the minute differences if you want to succeed in the industry. But is that true?

Real World Answer

We are a company that has been specializing in Web Scraping services for years. We talk to our present and prospective clients on a daily basis, sometimes several times a day. And in these real world conversations the terms Web Scraping and Web Crawling are often used interchangeably, without being precise at all. The reality is that there are websites out there with valuable data that needs to be extracted in a structured format, and how you define the process is not important at all.

What Do We Actually Do?

Looking back at the projects we did during these years, a simple pattern emerges. The vast majority of our projects are about creating robots that do targeted web crawling (crawling not the entire internet, but only specific websites) and immediately do web scraping as the web page is retrieved. So both processes occur simultaneously in real time. Most often we discard almost the entire retrieved HTML document and save only the bits of information that are needed for our clients. In some cases we save the entire HTML for traceability, or for further analysis. So the line between web crawling and web scraping becomes somewhat blurred as the amount of data extracted varies.

In the end we found that the essential thing is clear communication about what needs to be done, rather than how to define it. However, this is just our opinion based on our experience, and depending on the project you might be working on, or the business model you might implement, you might reach a different conclusion. In any case, we can all agree – Web Scraping at scale is cool!

Advanced AJAX Techniques for Web Scraping

nicerobot — Wed, 10 Apr 2019 07:32:21 +0000

Basic AJAX usage within Web Robots scraper

The best and simplest way to perform AJAX calls with the scraper is to use JQuery $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.

// Standard JQuery AJAX call
$.ajax({
    url:'https://webrobots.io',
    method: 'GET'
}).done( function(resp){
    console.log(resp); 
});

// Simplified AJAX call
$.get('https://webrobots.io').done( function(resp){
   console.log(resp); 
});

Since AJAX is asynchronous, the step’s done() should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn’t be made inside a loop. Instead, a new step for the AJAX call should be created and queued up with next() inside the loop.

Example incorrect and correct done() placement in AJAX:

INCORRECT

steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code
    });
    done(); 
}

CORRECT

 
steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code;
        done(); 
    });
}
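
The reason the first version misbehaves can be reproduced outside the scraper. The sketch below is only an illustration: fakeGet() and this done() are stand-ins invented for the demo (they are not the Web Robots or JQuery functions), and the 10 ms delay is arbitrary.

```javascript
// Stand-ins for illustration only (assumptions, not the real framework API):
// fakeGet() mimics an asynchronous request, done() records when the step would end.
const log = [];
const done = () => log.push('done');
const fakeGet = (url) =>
    new Promise((resolve) => setTimeout(() => resolve('<html>' + url + '</html>'), 10));

// Same shape as the INCORRECT example: done() sits outside the callback.
fakeGet('https://webrobots.io').then(() => log.push('response handled'));
done();

// At this point the step has already been marked as done,
// although the response has not been handled yet.
console.log(log);
```

Moving done() into the callback, as in the CORRECT example, is what guarantees that the response is handled before the step ends.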

Example incorrect and correct AJAX looping:

INCORRECT

 
steps.start = function(){
   for( let url of urls){
       $.get(url).done( function(resp){
           // some code 
       });
   }
   done(); 
}

CORRECT

 
steps.start = function(){
    for( let url of urls){
        next('','getUrl',url);
    }
    done(); 
}

steps.getUrl = function(url){
    $.get(url).done( function(resp){
         // some code;
         done(); 
    }); 
}

AJAX timeout

One issue with AJAX requests inside a step function is that the step global retry timeout and the AJAX timeout are independent, and in certain scenarios this can cause problems.

Consider this example. A GET request is performed, and since it is asynchronous, step done() function is placed inside the GET done block. If the GET fails, we can either call a done() function inside the .fail() block and move along with our scraping, or omit the .fail() block and force a step retry after our preset retry timeout.

steps.start = function(){

    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);

}

It works fine when the server returns a failed response (e.g. status code 404) or fails to respond at all. However, depending on how the server is configured, it might return a valid response after a significant delay, sometimes longer than our locally set step retry timeout. This means that even though the step has already finished, the code inside the GET done block will still run and trigger a done(). Depending on the specific code, this can make the robot unstable and cause unnecessary error logging. To avoid such a scenario, a local AJAX timeout should be set just below the step retry timeout (default is 60000 ms). In the example below, if no response is received from the server within 55000 ms, the AJAX call will time out and the code will proceed as normal.

steps.start = function(){

    // default retry timer is 60000ms, AJAX timeout should be a few seconds lower.
    $.ajaxSetup({timeout:55000});
    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);
}
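
A complementary safeguard, sketched here under our own assumptions rather than as a Web Robots feature, is to ignore callbacks that fire after a local deadline. makeDeadlineGuard and its parameters are hypothetical names, and the clock is injectable only so that the behaviour can be checked without real waiting.

```javascript
// Hypothetical helper (not part of Web Robots or JQuery).
// Returns a wrapper that runs a callback only if it fires within limitMs
// of the moment the guard was created.
function makeDeadlineGuard(limitMs, now = Date.now) {
    const startedAt = now();
    return function guard(callback) {
        return function (...args) {
            if (now() - startedAt > limitMs) {
                return; // too late: the step has probably been retried already
            }
            return callback(...args);
        };
    };
}

// Demo with a fake clock, so no real waiting is needed.
let fakeTime = 0;
const guard = makeDeadlineGuard(55000, () => fakeTime);
const calls = [];
const onResponse = guard((label) => calls.push(label));

fakeTime = 1000;
onResponse('fast response'); // handled
fakeTime = 70000;
onResponse('late response'); // ignored
console.log(calls);
```

Inside a step, the function passed to .done() could be wrapped the same way, so that a response arriving after the deadline no longer triggers a stray done().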

Multiple simultaneous AJAX calls using $.when()

Performing several simultaneous AJAX calls is a very efficient way to handle certain scraping situations. One such situation is a website that loads parts of its content as static html, and other parts dynamically through various APIs. Consider an example website that performs a separate AJAX call to get the post content, one to get the post image, and another one for post reviews. A simple approach could be to just stack all three AJAX calls to start as soon as the previous one finishes. We will use jsonplaceholder.typicode.com to construct our example:

steps.start = function(){
    $.get('https://jsonplaceholder.typicode.com/posts/1').done( function(r1){
        console.log( r1 ); 
        $.get('https://jsonplaceholder.typicode.com/photos/1').done( function(r2){
            console.log( r2 ); 
            $.get('https://jsonplaceholder.typicode.com/comments/1').done( function(r3){
                 console.log( r3 ); 
                 done();
            });
        });
    });
};

The downside of this approach is that a new AJAX call cannot start until the previous one ends, wasting valuable time. The solution is to use the JQuery.when() method. It takes multiple Deferred objects as arguments, in this case the results of $.get() calls, and resolves its master Deferred as soon as all the Deferreds resolve, or rejects the master Deferred as soon as one of the Deferreds is rejected. The arguments passed to the doneCallbacks provide the resolved values for each of the Deferreds, and match the order in which the Deferreds were passed to the $.when() method. Our example remade with $.when would look like this:

steps.start = function(){

    let a1 = () => $.get('https://jsonplaceholder.typicode.com/posts/1');
    let a2 = () => $.get('https://jsonplaceholder.typicode.com/photos/1');
    let a3 = () => $.get('https://jsonplaceholder.typicode.com/comments/1');
    $.when( a1(), a2(), a3() ).then(function ( r1, r2, r3 ) {
        // r1, r2 and r3 are arguments resolved for the a1, a2 and a3 ajax requests, respectively.
        // Each argument is an array with the following structure: [ data, statusText, jqXHR ]
        console.log( r1 ); 
        console.log( r2 ); 
        console.log( r3 ); 
        done();
    });
}

This way all AJAX requests are started simultaneously and code proceeds when all responses are resolved. Depending on how many simultaneous requests are made and the response times from the server, this method has potential to significantly increase the speed of a robot.
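
As a rough, back-of-the-envelope illustration of the gain (ignoring browser connection limits and any server-side throttling), chained calls cost about the sum of the individual response times, while simultaneous calls cost about as much as the slowest one. The response times below are made-up numbers.

```javascript
// Hypothetical response times in milliseconds for the three requests.
const responseTimes = [300, 500, 200];

// Chained calls: each request starts only after the previous one finishes.
const sequentialMs = responseTimes.reduce((sum, t) => sum + t, 0);

// Simultaneous calls: everything is ready when the slowest response arrives.
const parallelMs = Math.max(...responseTimes);

console.log(sequentialMs, parallelMs); // 1000 500
```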

Dynamic number of simultaneous AJAX calls

During our web scraping journey, we came across a couple of instances where it is useful to be able to make multiple AJAX calls when the number of calls is not known in advance. One such example would be taking links from multiple sitemaps and distributing them evenly between forks. This is awkward with $.when(), because each Deferred has to be passed as a separate argument and each response comes back as a separate callback argument that has to be named individually. We can solve this by using the ES6 Promise.all() method, which returns a single Promise that resolves when all of the promises passed as an array have resolved or when the array contains no promises. It rejects with the reason of the first promise that rejects. Here is an example using the rottentomatoes.com sitemap:

steps.start = function(){
    $.get('https://www.rottentomatoes.com/sitemap.xml').done( function(response){
        let sitemaps = $('loc', response).map((i, v) => $(v).text() ).get()
        next('','distribute',sitemaps);
        done();
    })
}

// The step name must match the one queued with next() in steps.start
steps.distribute = function( sitemaps ){
    // Creating an array of promises, one per sitemap URL
    let promises = sitemaps.map( url => $.get(url) );

    // Waiting for all AJAX promises to resolve before executing further code
    Promise.all( promises ).then( function( responses ){
        for( let r of responses ){
            // logging the number of links in each sitemap
            console.log( $('loc', r).length );
        }
        done();
    });
}
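
The example above only logs link counts. The "distributing evenly" part is not shown in the original; one possible approach is round-robin assignment. splitEvenly and forkCount are hypothetical names, and how the groups are then handed over to forks depends on the robot.

```javascript
// Hypothetical helper: deal links into forkCount groups, round-robin,
// so that group sizes differ by at most one.
function splitEvenly(links, forkCount) {
    const groups = Array.from({ length: forkCount }, () => []);
    links.forEach((link, i) => groups[i % forkCount].push(link));
    return groups;
}

const links = ['/a', '/b', '/c', '/d', '/e', '/f', '/g'];
console.log(splitEvenly(links, 3));
// groups: ['/a', '/d', '/g'], ['/b', '/e'], ['/c', '/f']
```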

IMPORTANT: This method should only be used when absolutely necessary because excessive amount of constant simultaneous requests could strain the target server or be identified as unwanted traffic and trigger blocking. So always use proper delays and follow robots.txt rules for each website you scrape.
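
One way to keep the number of simultaneous requests bounded is to send them in small batches with a pause in between. This is only a sketch: toBatches, runInBatches, the batch size and the pause length are hypothetical choices, not Web Robots functions or recommended values.

```javascript
// Hypothetical helper: split a list into consecutive batches of at most `size` items.
function toBatches(items, size) {
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Hypothetical helper: run `worker` on one batch at a time,
// pausing for pauseMs between batches. `worker` must return a promise.
async function runInBatches(items, size, worker, pauseMs) {
    const results = [];
    const batches = toBatches(items, size);
    for (let i = 0; i < batches.length; i++) {
        const batchResults = await Promise.all(batches[i].map((item) => worker(item)));
        results.push(...batchResults);
        if (i < batches.length - 1) {
            await new Promise((resolve) => setTimeout(resolve, pauseMs));
        }
    }
    return results;
}

console.log(toBatches([1, 2, 3, 4, 5], 2)); // batches: [1, 2], [3, 4], [5]
```

In a step, the worker could be url => $.get(url), with done() called once the promise returned by runInBatches resolves. The total running time then has to stay below the step retry timeout.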

Vanilla JS AJAX use cases

While JQuery.ajax() is handy, it has one disadvantage: by default it adds the X-Requested-With: XMLHttpRequest header to same-origin requests, and in very rare cases this affects the content of the response sent by the server. To circumvent this, use the Vanilla JS XMLHttpRequest object or the modern fetch API. Refer to their respective documentation pages for more information on how to use them. Here are a couple of simple examples.

Example using XMLHttpRequest:

var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
   if (this.readyState == 4 && this.status == 200) {
       console.log( this.responseText );
   }
};

xhttp.open('GET', 'cookies.php', true);
xhttp.send();

Example using fetch:

fetch('https://webrobots.io/').then(function(response){
    return response.text();
}).then(function(text){
    console.log(text);
});

Using Sitemaps in Web Scraping Robots

nicerobot — Mon, 25 Mar 2019 09:41:15 +0000

We often use the technique of spidering through categories, together with pagination or infinite scroll, when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and imitation of dynamic content loading.

After all, sitemaps are designed for robots to find all resources on a particular domain.

Example of a sitemap:

Finding Sitemaps

The fastest way to find a sitemap URL is to check the robots.txt file. For example: https://www.rottentomatoes.com/robots.txt
We can also probe typical sitemap URLs like domain.com/sitemap or domain.com/sitemap.xml.
Sometimes just going to the homepage and searching for the keyword “sitemap” works.
If all of the above bears no fruit, a Google search can help (example: “target.com sitemap”).
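
The robots.txt check can be automated, since sitemaps are announced there on lines starting with "Sitemap:". The sketch below works on a made-up robots.txt body; findSitemaps is a hypothetical helper name, and in a robot the text would come from a request to the site's robots.txt.

```javascript
// Collect the URLs from "Sitemap:" lines of a robots.txt body.
// The directive name is matched case-insensitively.
function findSitemaps(robotsTxt) {
    return robotsTxt
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => /^sitemap:/i.test(line))
        .map((line) => line.slice('sitemap:'.length).trim());
}

// Made-up robots.txt content for illustration.
const sample = [
    'User-agent: *',
    'Disallow: /search',
    'Sitemap: https://www.example.com/sitemap.xml',
    'sitemap: https://www.example.com/sitemap_news.xml'
].join('\n');

console.log(findSitemaps(sample));
```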

Example of domain/robots.txt:

Working With Large Sitemaps

Sitemaps usually have many thousands of records, and opening them directly will freeze the Chrome browser for several minutes while it renders the XML. Our best practice is to make a $.get request to get a sitemap and process it.

Example of getting a sitemap using an AJAX request and filtering URLs:


$.get('https://www.rottentomatoes.com/sitemap_0.xml').then( function(response){
    $('url loc',response).each( function(i, v){
        var url = $(v).text();

        // filtering: we only need URLs that have no further path after film name
        // we can filter out URLs with longer URL paths than film page has

        if(url.split('/').length < 6) next(url,'getFilmInfo');

    });
    done();
});
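
The split('/') count in the filter is sensitive to details such as a trailing slash. As a hedged alternative, path depth can be counted with the standard URL API. The URLs below are made up and only assume a /m/film-name pattern; pathDepth is a hypothetical helper.

```javascript
// Count non-empty path segments, e.g. "/m/some_film" has 2.
function pathDepth(url) {
    return new URL(url).pathname.split('/').filter(Boolean).length;
}

// Made-up URLs that follow a /m/<film name> pattern (an assumption for this demo).
const urls = [
    'https://www.example.com/m/some_film',
    'https://www.example.com/m/some_film/',
    'https://www.example.com/m/some_film/reviews'
];

for (const url of urls) {
    // first value: the original split-based check, second: the depth-based check
    console.log(url.split('/').length < 6, pathDepth(url) <= 2);
}
// prints: true true, then false true, then false false
```

The second URL shows the difference: the split-based check drops a film page that ends with a slash, while the depth-based check keeps it.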

Downsides of Sitemap Approach

A sitemap can be outdated (old URLs leading to 404 pages) and the site owner might not even notice that their sitemaps are incorrect. It is necessary to do spot checks to see if a URL works.
A sitemap might not have all the items listed in the normal website interface. Best practice is to spot check that items found on the website are present in the sitemap as well.
Sitemaps do not allow filtering items based on certain criteria. For example, if we need only electronics from a large eshop, we still have to crawl all products and do the filtering in the back-end.
Sitemaps do not show how popular an item is – for example, we cannot infer whether a particular item is on the first page in its category or somewhere near the end.

Scraping Dynamic Websites Using The wait() Function

nicerobot — Mon, 04 Mar 2019 07:00:48 +0000

Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM.

For example, on Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in selector $('div[class^="ProductPage__details"]').


steps.start = function() {

    console.log($('div[class^="ProductPage__details"]').length);

    done();

};

The logged result is 0, because our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy – use setTimeout()

We can use setTimeout(), where we specify the number of milliseconds to wait before executing a piece of code. This way the browser has some time to process dynamic data and insert it into the DOM. In this example we introduce a simple 3 second wait:


steps.start = function() {

    setTimeout(function() {

        console.log($('div[class^="ProductPage__details"]').length);

        done();

    }, 3000);

};

The logged result is 1, which indicates that we found the expected data in the DOM. However, this method has drawbacks: the code is delayed by the same amount of time regardless of how long the website actually takes to handle its dynamic requests. This means we waste time when the product data appears sooner and miss data when it loads more slowly.

Dynamic pages tend to load inconsistently, so the exact delay needed for each page load is impossible to know in advance. The maximum observed delay is usually chosen when using setTimeout(). If we wait for 3 seconds, the average time for data to appear is 1.5 seconds, and we have to process 50,000 products, then 20.83 hours are wasted. This is 625 hours per month if we run this robot every day!

Better waiting strategy – use wait()

The Web Robots system wait() function lets the user wait for a particular HTML element to load and then execute code right after the element appears.

wait(string or array selector[], int maxWaitTime)
Default maxWaitTime = 10000;
Usable callbacks: then, always, fail (similar to JQuery Deferred, https://api.jquery.com/jquery.deferred/)

Example:


steps.start = function() {

    wait('div[class^="ProductPage__details"]').then(function() {

        console.log($('div[class^="ProductPage__details"]').length);

        done();

    })

};
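
Conceptually, a function like this can be pictured as polling a condition until it holds or a time limit is reached. The sketch below is an illustration of that idea only, not the actual Web Robots implementation of wait(); waitFor, pollInterval and the counter-based condition are assumptions made for the demo.

```javascript
// Illustration only: NOT the Web Robots implementation of wait().
// Polls a condition until it becomes true or maxWaitTime (ms) has passed.
function waitFor(condition, maxWaitTime = 10000, pollInterval = 100) {
    return new Promise((resolve, reject) => {
        const startedAt = Date.now();
        function poll() {
            if (condition()) {
                resolve();
            } else if (Date.now() - startedAt >= maxWaitTime) {
                reject(new Error('condition not met within ' + maxWaitTime + ' ms'));
            } else {
                setTimeout(poll, pollInterval);
            }
        }
        poll();
    });
}

// Demo: the condition becomes true on the third check.
let checks = 0;
waitFor(() => ++checks >= 3, 1000, 10)
    .then(() => console.log('ready after', checks, 'checks'))
    .catch((err) => console.log(err.message));
```

In a browser, the condition would typically test whether a selector matches at least one element.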

wait() can have multiple callbacks for scenarios when an element appears, does not appear, or always:

wait(selector, time_to_wait*).then(callback) – the callback function is executed as soon as the selector appears. If the selector doesn’t appear, the function is never executed.
wait(selector, time_to_wait*).always(callback) – the callback function is executed when the element appears or when time_to_wait is reached.
wait(selector, time_to_wait*).then(callback).fail(callback2) – the callback function is executed when the element appears. callback2 is executed if the element does not appear.
wait([selector1, selector2, …], time_to_wait*).then(callback) – the callback function is executed only when all of the selectors (selector1, selector2, …) have appeared on the website.

*time_to_wait is an optional parameter that allows the user to choose the number of milliseconds to wait for a specified selector. The default (if not specified in the function call) is 10000 ms.

The wait() function makes scraping of dynamic pages much easier, more efficient and more reliable.
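
To put numbers on the difference, the wasted-time figures from the setTimeout() discussion can be reproduced with a few lines of arithmetic. The 30-day month is our assumption; the other values come from the text.

```javascript
// Numbers taken from the setTimeout() discussion earlier in this post.
const fixedWaitSec = 3;        // fixed delay per page
const averageNeededSec = 1.5;  // average time until the data actually appears
const products = 50000;        // pages processed per run
const runsPerMonth = 30;       // one run per day, 30-day month (assumption)

const wastedSecPerRun = (fixedWaitSec - averageNeededSec) * products;
const wastedHoursPerRun = wastedSecPerRun / 3600;
const wastedHoursPerMonth = wastedHoursPerRun * runsPerMonth;

console.log(wastedHoursPerRun.toFixed(2));   // 20.83
console.log(wastedHoursPerMonth.toFixed(0)); // 625
```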