<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Web Scraping &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/category/web-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Wed, 15 Feb 2023 08:39:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>New Functions Added</title>
		<link>https://webrobots.io/new-functions-added/</link>
					<comments>https://webrobots.io/new-functions-added/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 15 Feb 2023 08:39:14 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6294</guid>

					<description><![CDATA[Web Robots scraping framework documentation has been updated with new functions: blockImages() - changes browser settings regarding image downloading. This function is useful in scenarios where bandwidth is a concern. Sometimes it results in faster crawling speeds. allowImages() - reverses the browser settings changes made by blockImages(). closeSocket() - closes all idle socket connections in the browser. [...]]]></description>
										<content:encoded><![CDATA[<p><a href="https://webrobots.io/werobots-documentation/">Web Robots scraping framework documentation</a> has been updated with new functions:</p>
<ul>
<li><strong>blockImages()</strong> &#8211; changes browser settings regarding image downloading. This function is useful in scenarios where bandwidth is a concern. Sometimes it results in faster crawling speeds.</li>
<li><strong>allowImages()</strong> &#8211; reverses the browser settings changes made by blockImages().</li>
<li><strong>closeSocket()</strong> &#8211; closes all idle socket connections in the browser.</li>
</ul>
<p>Web Robots have been using these functions on the internal platform for over 6 months and they have proved to be a great help in some scenarios. Now they are available in our public extension.</p>
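<p>As an illustration only, here is a minimal sketch of how these functions might be combined in a robot; the step structure follows the usual steps/next()/done() pattern from the framework documentation, and the URL and step names are placeholders:</p>
<pre class="brush: jscript;">
steps.start = function(){
    // block image downloads to save bandwidth before queueing pages
    blockImages();
    next('https://example.com/listing','getItems');
    done();
}

steps.getItems = function(){
    // ... scrape the page here ...

    // re-enable images and close idle sockets once they are no longer needed
    allowImages();
    closeSocket();
    done();
}
</pre>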
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/new-functions-added/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Our Chrome extension has been updated</title>
		<link>https://webrobots.io/our-chrome-extension-has-been-updated/</link>
					<comments>https://webrobots.io/our-chrome-extension-has-been-updated/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 01 Oct 2020 12:24:45 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping service]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6143</guid>

					<description><![CDATA[Our public developer extension (IDE) has been untouched since March 2019. It may look like Web Robots were stagnating, but we were actually constantly working on our internal systems like the portal, cloud workers and cloud orchestration. We also had several internal releases of the IDE for our staff. So the IDE published to the Chrome webstore [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Our public developer extension (IDE) has been untouched since March 2019. It may looks like Web Robots were stagnating, but actually we were constantly working on our internal systems like portal, cloud workers, cloud orchestration. We also has several internal releases of IDE for our staff.</p>
<p>So the IDE published to the Chrome webstore is just the tip of the iceberg.</p>
<p>It is now September 2020 and the time has come to release the <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak?hl=en">new version to Chrome webstore</a>. We are glad that our extension passed the webstore&#8217;s permission audit on the first try, as it requires access to quite a few Chrome APIs in order to work properly and the people at Google are getting ever stricter in their review process for extension permissions.</p>
<p>All changes and new features are listed in the <a href="https://webrobots.io/changelog/">changelog here</a>.</p>
<p><em>PS: our IDE extension works only for users who have an approved account on the Web Robots portal.</em></p>
</div><style type="text/css">.fusion-gallery-1 .fusion-gallery-image {border:0px solid #f6f6f6;}</style><div class="fusion-gallery fusion-gallery-container fusion-grid-3 fusion-columns-total-0 fusion-gallery-layout-grid fusion-gallery-1" style="margin:-5px;"><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Robot editor and debugger" rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img fetchpriority="high" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" width="716" height="564" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6144" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27716%27%20height%3D%27564%27%20viewBox%3D%270%200%20716%20564%27%3E%3Crect%20width%3D%27716%27%20height%3D%273564%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-200x158.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-400x315.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-600x473.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png 716w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Reset Proxy and Allow images rescue buttons." 
rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" width="708" height="562" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6145" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27708%27%20height%3D%27562%27%20viewBox%3D%270%200%20708%20562%27%3E%3Crect%20width%3D%27708%27%20height%3D%273562%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-200x159.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-400x318.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-600x476.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png 708w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Data preview pane - Excel style table and JSON preview." rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" width="748" height="586" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6146" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27748%27%20height%3D%27586%27%20viewBox%3D%270%200%20748%20586%27%3E%3Crect%20width%3D%27748%27%20height%3D%273586%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-200x157.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-400x313.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-600x470.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png 748w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div class="clearfix"></div></div><div class="fusion-clearfix"></div></div></div></div></div><style type="text/css">.fusion-fullwidth.fusion-builder-row-1 
a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link) , .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):before, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):after {color: #03a9f4;}.fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:before, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:after {color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .pagination a.inactive:hover, .fusion-fullwidth.fusion-builder-row-1 .fusion-filters .fusion-filter.fusion-active a {border-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .pagination .current {border-color: #0074a2; background-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .fusion-filters .fusion-filter.fusion-active a, .fusion-fullwidth.fusion-builder-row-1 .fusion-date-and-formats .fusion-format-box, .fusion-fullwidth.fusion-builder-row-1 .fusion-popover, .fusion-fullwidth.fusion-builder-row-1 .tooltip-shortcode {color: #0074a2;}#main .fusion-fullwidth.fusion-builder-row-1 .post .blog-shortcode-post-title a:hover {color: #0074a2;}</style>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/our-chrome-extension-has-been-updated/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Instant Data Users Group on Facebook</title>
		<link>https://webrobots.io/instant-data-users-group-on-facebook/</link>
					<comments>https://webrobots.io/instant-data-users-group-on-facebook/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 28 Apr 2020 11:02:25 +0000</pubDate>
				<category><![CDATA[Datasets]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6110</guid>

					<description><![CDATA[We have launched a Facebook group where Instant Data Scraper users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users. We hope that [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-2 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-1 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>We have launched a <a href="https://www.facebook.com/groups/instantdata/">Facebook group</a> where <a href="https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah">Instant Data Scraper</a> users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users.</p>
<p>We hope that the new Facebook group will grow into a community where users can support each other.</p>
</div><div class="imageframe-align-center"><div class="fusion-image-frame-bottomshadow image-frame-shadow-1"><style>.fusion-image-frame-bottomshadow.image-frame-shadow-1{display:inline-block}.element-bottomshadow.imageframe-1:before, .element-bottomshadow.imageframe-1:after{-webkit-box-shadow: 0 17px 10px rgba(0,0,0,0.4);box-shadow: 0 17px 10px rgba(0,0,0,0.4);}</style><span class="fusion-imageframe imageframe-bottomshadow imageframe-1 element-bottomshadow hover-type-none"><a class="fusion-no-lightbox" href="https://www.facebook.com/groups/instantdata/" target="_blank" aria-label="Community Support Group" rel="noopener noreferrer"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/04/unnamed.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/04/unnamed.png" width="500" height="228" alt="Community Support Group" class="lazyload img-responsive wp-image-6111" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27500%27%20height%3D%27228%27%20viewBox%3D%270%200%20500%20228%27%3E%3Crect%20width%3D%27500%27%20height%3D%273228%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/04/unnamed-200x91.png 200w, https://webrobots.io/wp-content/uploads/2020/04/unnamed-400x182.png 400w, https://webrobots.io/wp-content/uploads/2020/04/unnamed.png 500w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 500px" /></a></span></div></div><div class="fusion-clearfix"></div></div></div></div></div><style type="text/css">.fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link) , .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):before, .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):after {color: #03a9f4;}.fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover, .fusion-fullwidth.fusion-builder-row-2 
a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:before, .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:after {color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .pagination a.inactive:hover, .fusion-fullwidth.fusion-builder-row-2 .fusion-filters .fusion-filter.fusion-active a {border-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .pagination .current {border-color: #0074a2; background-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .fusion-filters .fusion-filter.fusion-active a, .fusion-fullwidth.fusion-builder-row-2 .fusion-date-and-formats .fusion-format-box, .fusion-fullwidth.fusion-builder-row-2 .fusion-popover, .fusion-fullwidth.fusion-builder-row-2 .tooltip-shortcode {color: #0074a2;}#main .fusion-fullwidth.fusion-builder-row-2 .post .blog-shortcode-post-title a:hover {color: #0074a2;}</style>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/instant-data-users-group-on-facebook/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Web Scraping vs Web Crawling</title>
		<link>https://webrobots.io/web-scraping-vs-web-crawling/</link>
					<comments>https://webrobots.io/web-scraping-vs-web-crawling/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 06 May 2019 09:37:58 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6019</guid>

					<description><![CDATA[The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-3 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-2 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>The internet is growing exponentially, and the amount of data available for extraction and analysis is growing along side it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let&#8217;s look at a couple of terms to try and answer these questions: <em><strong>Web Scraping</strong></em> and <em><strong>Web Crawling</strong></em>.</p>
<h2>Formal Answer</h2>
<p>Let&#8217;s start with the formal definitions:</p>
<p><strong>Web crawling</strong> &#8211; a process where a program or automated script browses the World Wide Web in a methodical, automated manner.<br />
<strong>Web scraping</strong> &#8211; extracting specific data from websites.</p>
<p>As you can see, the terms have quite clear definitions, and some people suggest that it is crucial to understand the minute differences if you want to succeed in the industry. But is that true?</p>
<h2>Real World Answer</h2>
<p>We are a company that has been specializing in <strong>Web Scraping</strong> services for years. We talk to our present and prospective clients on a daily basis, sometimes several times a day. And in these real-world conversations the terms Web Scraping and Web Crawling are often used interchangeably without being precise at all. The reality is &#8211; there are websites out there that have valuable data that needs to be extracted in a structured format, and how you define the process is not important at all.</p>
<h2>What Do We Actually Do?</h2>
<p>Looking back at the projects we did during these years, a simple pattern emerges. The vast majority of our projects are about creating robots that do <strong>targeted web crawling</strong> (crawling not the entire internet, but only specific websites) and immediately do <strong>web scraping</strong> as each web page is retrieved. So both processes occur simultaneously in real time. Most often we discard almost the entire retrieved HTML document and save only the bits of information that are needed by our clients. In some cases we will save the entire HTML for traceability, or for further analysis. So the lines between <strong>web crawling</strong> and <strong>web scraping</strong> become somewhat blurred as the amount of data extracted varies.</p>
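<p>As an illustration only, here is a stripped-down robot step in our framework showing this crawl-and-scrape-in-one-pass pattern; the selectors, collection name and URL parameter are placeholders, not taken from a real project:</p>
<pre class="brush: jscript;">
steps.getProduct = function(url){
    // targeted crawling: fetch only this specific page
    $.get(url).done( function(resp){
        // immediate scraping: keep only the bits of information the client needs
        // and discard the rest of the retrieved HTML
        let item = {
            name: $('h1', resp).text(),
            price: $('.price', resp).text()
        };
        emit('items', [item]);
        done();
    });
}
</pre>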
<p>In the end we found that the essential thing is clear communication about what needs to be done, rather than how to define it. However, this is just our opinion based on our experience, and depending on the project you might be working on, or the business model you might implement, you might reach a different conclusion. In any case, we can all agree &#8211; Web Scraping at scale is cool!</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/web-scraping-vs-web-crawling/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Advanced AJAX Techniques for Web Scraping</title>
		<link>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/</link>
					<comments>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 10 Apr 2019 07:32:21 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5974</guid>

					<description><![CDATA[Basic AJAX usage within Web Robots scraper Best and simplest way to perform AJAX calls with the scraper is to use JQuery $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods. [javascript] // Standard JQuery AJAX call $.ajax({ url:'https://webrobots.io', method: 'GET' }).done( function(resp){ console.log(resp); }); // Simplified AJAX call $.get('https://webrobots.io').done( function(resp){ console.log(resp); }); [/javascript] [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-4 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-3 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><h2><strong>Basic AJAX usage within Web Robots scraper</strong></h2>
<p>The best and simplest way to perform AJAX calls with the scraper is to use JQuery <a href="http://api.jquery.com/jquery.ajax/">$.ajax()</a> or the simplified <a href="https://api.jquery.com/jquery.get/">$.get()</a>, <a href="https://api.jquery.com/jquery.post/">$.post()</a> and <a href="https://api.jquery.com/jquery.getjson/">$.getJSON()</a> methods.</p>
<pre class="brush: jscript;">
// Standard JQuery AJAX call
$.ajax({
    url:'https://webrobots.io',
    method: 'GET'
}).done( function(resp){
    console.log(resp); 
});

// Simplified AJAX call
$.get('https://webrobots.io').done( function(resp){
   console.log(resp); 
});

</pre>
<p>Since AJAX is asynchronous, the step done() should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn&#8217;t be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.</p>
<h3><strong>Examples of incorrect and correct done() placement in AJAX:</strong></h3>
<h3><span style="color: #f03030;">INCORRECT</span></h3>
<pre class="brush: jscript; highlight: &#091;5&#093;;">
steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code
    });
    done(); 
}
</pre>
<h3><span style="color: #339966;">CORRECT</span></h3>
<pre class="brush: jscript; highlight: &#091;4&#093;;"> 
steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code;
        done(); 
    });
}
</pre>
<h3><strong>Examples of incorrect and correct AJAX looping:</strong></h3>
<h3><span style="color: #f03030;">INCORRECT</span></h3>
<pre class="brush: jscript;"> 
steps.start = function(){
   for( let url of urls){
       $.get(url).done( function(resp){
           // some code 
       });
   }
   done(); 
}
</pre>
<h3><span style="color: #339966;">CORRECT</span></h3>
<pre class="brush: jscript;">
 
steps.start = function(){
    for( let url of urls){
        next('','getUrl',url);
    }
    done(); 
}

steps.getUrl = function(url){
    $.get(url).done( function(resp){
         // some code;
         done(); 
    }); 
} 
</pre>
<hr />
<h2><strong>AJAX timeout</strong></h2>
<p>One issue with AJAX requests inside a step function is that the step global retry timeout and the AJAX timeout are independent, and in certain scenarios this can cause problems.</p>
<p>Consider this example. A GET request is performed, and since it is asynchronous, the step done() function is placed inside the GET done block. If the GET fails, we can either call done() inside the .fail() block and move along with our scraping, or omit the .fail() block and force a step retry after our preset retry timeout.</p>
<pre class="brush: jscript;">
steps.start = function(){

    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);

}
</pre>
<p>This works fine when the server returns a failed response (e.g. status code 404) or fails to respond at all. However, depending on how the server is configured, it might return a valid response after a significant delay, sometimes above our locally set step retry timeout. This means that even though the step has already finished, the code inside the GET done block will run and trigger a done(). Depending on the specific code, this can cause instability in the robot and unnecessary error logging. To avoid such a scenario, a local AJAX timeout should be set just below the step retry timeout (default is 60000 ms). In the example below, if a response is not received from the server within 55000 ms, the AJAX call will time out and the code will proceed to run as normal.</p>
<pre class="brush: jscript;">
steps.start = function(){

    // default retry timer is 60000ms, AJAX timeout should be a few seconds lower.
    $.ajaxSetup({timeout:55000});
    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);
}
</pre>
<hr />
<h2><strong>Multiple simultaneous AJAX calls using $.when()</strong></h2>
<p>Performing several simultaneous AJAX calls is a very efficient way to handle certain scraping situations. One such situation is a website that loads parts of its content as static html, and other parts dynamically through various APIs. Consider an example website that performs a separate AJAX call to get the post content, one to get the post image, and another one for post reviews.  A simple approach could be to just stack all three AJAX calls to start as soon as the previous one finishes. We will use <a href="https://jsonplaceholder.typicode.com/">jsonplaceholder.typicode.com</a> to construct our example:</p>
<pre class="brush: jscript;">
steps.start = function(){
    $.get('https://jsonplaceholder.typicode.com/posts/1').done( function(r1){
        console.log( r1 ); 
        $.get('https://jsonplaceholder.typicode.com/photos/1').done( function(r2){
            console.log( r2 ); 
            $.get('https://jsonplaceholder.typicode.com/comments/1').done( function(r3){
                 console.log( r3 ); 
                 done();
            });
        });
    });
};
</pre>
<p>The downside of this approach is that a new AJAX call cannot start until the previous one ends, wasting valuable time. The solution is to use the <a href="https://api.jquery.com/jquery.when/">JQuery.when()</a> method. It takes multiple Deferred objects as arguments, in this case $.get() methods, and will resolve its master Deferred as soon as all the Deferreds resolve, or reject the master Deferred as soon as one of the Deferreds is rejected. The arguments passed to the doneCallbacks provide the resolved values for each of the Deferreds, and match the order in which the Deferreds were passed to the $.when() method. Our example remade with $.when() would look like this:</p>
<pre class="brush: jscript;">
steps.start = function(){

    let a1 = () =&gt; $.get('https://jsonplaceholder.typicode.com/posts/1');
    let a2 = () =&gt; $.get('https://jsonplaceholder.typicode.com/photos/1');
    let a3 = () =&gt; $.get('https://jsonplaceholder.typicode.com/comments/1');
    $.when( a1(), a2(), a3() ).then(function ( r1, r2, r3 ) {
        // r1, r2 and r3 are arguments resolved for the a1, a2 and a3 ajax requests, respectively.
        // Each argument is an array with the following structure: [ data, statusText, jqXHR ]
        console.log( r1 ); 
        console.log( r2 ); 
        console.log( r3 ); 
        done();
    });
}
</pre>
<p>This way all AJAX requests are started simultaneously and the code proceeds when all responses are resolved. Depending on how many simultaneous requests are made and the response times from the server, this method has the potential to significantly increase the speed of a robot.</p>
<hr />
<h2><strong>Dynamic number of simultaneous AJAX calls</strong></h2>
<p>During our web scraping journey, we came across a couple of instances where it is useful to be able to make multiple AJAX calls when the number of calls is not known in advance. One such example would be taking links from multiple sitemaps and distributing them evenly between forks. Unfortunately this cannot be accomplished using $.when() because it accepts a fixed number of arguments and returns the same number of responses, each of which has to be specified individually. We can solve this by using the ES6 <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/all">Promise.all()</a> method, which returns a single Promise that resolves when all of the promises passed as an array have resolved or when the array contains no promises. It rejects with the first promise that rejects. Here is an example using the <a href="https://www.rottentomatoes.com">rottentomatoes.com</a> sitemap:</p>
<pre class="brush: jscript;">
steps.start = function(){
    $.get('https://www.rottentomatoes.com/sitemap.xml').done( function(response){
        let sitemaps = $('loc', response).map((i, v) =&gt; $(v).text() ).get()
        next('','sitemaps',sitemaps);
        done();
    })
}

steps.sitemaps = function( sitemaps ){
    // Creating an array of promises
    let promises = sitemaps.map( url =&gt; $.get(url) );

    // Waiting for all AJAX promises to resolve before executing further code
    Promise.all( promises ).then( function( responses ){
        for( let r of responses ){
            // logging the number of links in each sitemap
            console.log( $('loc', r).length );
        }
        done();
    });
}
</pre>
<p><span style="color: #ff0000;">IMPORTANT: </span> This method should only be used when absolutely necessary because excessive amount of constant simultaneous requests could strain the target server or be identified as unwanted traffic and trigger blocking. So always use proper delays and follow robots.txt rules for each website you scrape.</p>
<hr />
<h2><strong>Vanilla JS AJAX use cases</strong></h2>
<p>While JQuery.ajax() is handy, it has one disadvantage in that it always sets the <strong>x-requested-with : XMLHttpRequest</strong> header, and in very rare cases this affects the content of the response that is sent by the server. To circumvent this, use the Vanilla JS <a href="https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest">XMLHttpRequest</a> object or the modern <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">fetch API</a>. Refer to their respective documentation pages for more info on how to use them. Here are a couple of simple examples.</p>
<h3><strong>Example using XMLHttpRequest: </strong></h3>
<pre class="brush: jscript;">
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
   if (this.readyState == 4 &amp;&amp; this.status == 200) {
       console.log( this.responseText );
   }
};

xhttp.open('GET', 'cookies.php', true);
xhttp.send();
</pre>
<h3><strong>Example using fetch: </strong></h3>
<pre class="brush: jscript;">
fetch('https://webrobots.io/').then(function(response){
    return response.text();
}).then(function(text){
    console.log(text);
});
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Using Sitemaps in Web Scraping Robots</title>
		<link>https://webrobots.io/using-sitemaps-in-web-scraping-robots/</link>
					<comments>https://webrobots.io/using-sitemaps-in-web-scraping-robots/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 25 Mar 2019 09:41:15 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[sitemap]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5876</guid>

					<description><![CDATA[We often use the spidering-through-categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach for this - just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-5 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-4 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">We often use spidering through categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However there is a simpler and more straightforward approach for this &#8211;  just using </span><a href="https://en.wikipedia.org/wiki/Sitemaps"><b>sitemaps</b></a><span style="font-weight: 400;">. Sitemap based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.</span></p>
<p><span style="font-weight: 400;">After all, sitemaps are designed for robots to find all resources on a particular domain.</span></p>
<p><b>Example of a sitemap:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-2 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1024x392.png" width="1024" height="392" alt="" title="sitemap-example" class="lazyload img-responsive wp-image-5877" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271384%27%20height%3D%27530%27%20viewBox%3D%270%200%201384%20530%27%3E%3Crect%20width%3D%271384%27%20height%3D%273530%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-200x77.png 200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-400x153.png 400w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-600x230.png 600w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-800x306.png 800w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1200x460.png 1200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png 1384w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Finding Sitemaps</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">The fastest way to find a sitemap URL is to check </span><i><span style="font-weight: 400;">robots.txt</span></i><span style="font-weight: 400;"> file. For example </span><a href="https://www.rottentomatoes.com/robots.txt"><span style="font-weight: 400;">https://www.rottentomatoes.com/robots.txt</span></a></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">We can also probe typical sitemap URLs like </span><i><span style="font-weight: 400;">domain.com/sitemap</span></i><span style="font-weight: 400;"> or </span><i><span style="font-weight: 400;">domain.com/sitemap.xml</span></i></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sometimes just going to the homepage and searching for the keyword “sitemap” works</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If all above bear no fruit, google search can help (example: “target.com sitemap&#8221;).</span></li>
</ul>
<p><b>Example of domain/robots.txt:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-3 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-1024x212.png" width="1024" height="212" alt="" title="sitemap-example-2" class="lazyload img-responsive wp-image-5885" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271024%27%20height%3D%27212%27%20viewBox%3D%270%200%201024%20212%27%3E%3Crect%20width%3D%271024%27%20height%3D%273212%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-200x41.png 200w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-400x83.png 400w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-600x124.png 600w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-800x166.png 800w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png 1024w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Working With Large Sitemaps</span></h1>
<p><span style="font-weight: 400;">Sitemaps usually have many thousands of records and opening them directly will freeze Chrome browser for several minutes while browser renders XML. Our best practice is to make $.get request to get a sitemap and process it.</span></p>
<p><b>example of getting a sitemap using an ajax</b> <b>request and filtering URLs:</b></p>
<pre class="brush: jscript;">

$.get('https://www.rottentomatoes.com/sitemap_0.xml').then( function(response){
    $('url loc',response).each( function(i, v){
        var url = $(v).text();

        // filtering: we only need URLs that have no further path after film name
        // we can filter out URLs with longer URL paths than film page has

        if(url.split('/').length &lt; 6) next(url,'getFilmInfo');

    });
    done();
});

</pre>
<h1><span style="font-weight: 400;">Downsides of Sitemap Approach</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap can be outdated (old URLs leading to 404 pages) and the site owner might not even notice that their sitemaps are incorrect. It is necessary to do spot checks to see if an URL works.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;"> Sitemap might not have all the items listed in the normal website interface. Best practice is to spot check that items found on a website are present in the sitemap as well.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;"> Sitemaps do not allow filtering items based on certain criteria. For example if we need only electronics from a large eshop, we still have to crawl all products and do filtering in the back-end.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do show how popular an item is &#8211; for example we cannot infer if a particular item is on the first page in it’s category or somewhere near the end.</span></li>
</ul>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/using-sitemaps-in-web-scraping-robots/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Scraping Dynamic Websites Using The wait() Function</title>
		<link>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/</link>
					<comments>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 04 Mar 2019 07:00:48 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[dynamic website]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5842</guid>

					<description><![CDATA[Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-6 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-5 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM.For example Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 product data appears in selector $(&#8216;div[class^=&#8221;ProductPage__details&#8221;]&#8217;).</p>
<pre class="brush: jscript;">

steps.start = function() {

    console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

    done();

};

</pre>
<p>Logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.</p>
<h2><strong>Simple waiting strategy &#8211; use setTimeout()</strong></h2>
<p>We can use setTimeout(), where we specify the number of milliseconds to wait before executing a piece of code. This way the browser has some time to process dynamic data and insert it into the DOM. In this example we introduce a simple 3 second wait:</p>
<pre class="brush: jscript;">

steps.start = function() {

    setTimeout(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    }, 3000);

};

</pre>
<p>Logged result is 1, which indicates that we found the expected data in the DOM. However, there are some drawbacks to this method, as the code will be delayed the same amount of time regardless of how long the website actually takes to handle its dynamic requests. This means we are wasting time when the product data appears sooner and missing data when it loads slower. Dynamic pages have a tendency to load inconsistently, therefore the exact timeout duration for each page load is impossible to know in advance. The maximum observed delay time is usually chosen when using setTimeout(). If we are waiting for 3 seconds, the average time for data to appear is 1.5 seconds, and we have to process 50,000 products &#8211; then 20.83 hours are wasted. This is 625 hours per month if we run this robot every day!</p>
<h2><strong>Better waiting strategy &#8211; use wait()</strong></h2>
<p>The Web Robots system wait() function enables the user to wait for a particular HTML element to load and then execute code right after the element appears.</p>
<p>Signature: wait(string or array selector[], int maxWaitTime)<br />
Default maxWaitTime = 10000<br />
Usable callbacks: then, always, fail (similar to JQuery deferred, https://api.jquery.com/jquery.deferred/)</p>
<p>Example:</p>
<pre class="brush: jscript;">

steps.start = function() {

    wait('div[class^=&quot;ProductPage__details&quot;]').then(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    })

};

</pre>
<p>wait() can have multiple callbacks for scenarios when an element appears, does not appear, or always:</p>
<ul>
<li>wait(selector, time_to_wait*).then(callback) &#8211; callback will be executed immediately when the selector appears. If the selector doesn&#8217;t appear, the function will never be executed.</li>
<li>wait(selector, time_to_wait*).always(callback) &#8211; callback is executed when the element appears or when time_to_wait is reached.</li>
<li>wait(selector, time_to_wait*).then(callback).fail(callback2) &#8211; callback will be executed when the element appears. callback2 will be executed if the element does not appear.</li>
<li>wait([selector1, selector2, …], time_to_wait*).then(callback) &#8211; callback is executed only when all of the selectors (selector1, selector2, …) have appeared on the website.</li>
</ul>
<p>*time_to_wait is an optional parameter that allows the user to choose the number of milliseconds to wait for a specified selector. The default (if not specified in the function) is 10000 ms.</p>
<p>The wait() function makes scraping of dynamic pages much easier, more efficient and more reliable.</p>
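<p>For completeness, here is a short sketch of the .then()/.fail() pattern using the same placeholder selector as above; the 15000 ms timeout and the fallback logging are illustrative only:</p>
<pre class="brush: jscript;">

steps.start = function() {

    wait('div[class^=&quot;ProductPage__details&quot;]', 15000).then(function() {

        // element appeared within 15 seconds - scrape it
        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    }).fail(function() {

        // element did not appear in time - finish the step without data
        console.log('product details did not load');

        done();

    });

};

</pre>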
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Web Scraping Performance Tuning With fastnext()</title>
		<link>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/</link>
					<comments>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 20 Feb 2019 09:17:51 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5736</guid>

					<description><![CDATA[Scraping websites can be a time consuming process and when limited computing resources are available, combined with the need for frequent and up to date data, having a fast running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-7 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-6 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">Scraping websites can be a time consuming process and when limited computing resources are available, combined with the need for frequent and up to date data, having a fast running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more efficient could save a lot of valuable time.</span></p>
<p><span style="font-weight: 400;">There are a number of ways to optimize your robot to run faster, replacing <strong>setTimeout</strong> with our internal <strong>wait</strong> function, careful usage of loops, not using excessive delay timers in step done function, etc. However, one of the best methods so far been has proven to be using <strong>ajax</strong> requests instead of visiting a website directly. In a standard scenario using next with a link will open a webpage on your browser, that means that it will download the HTML file, all the listed additional resources like js and css files, images, video and audio files, and then process, render and display it in the window.</span></p>
<h5><strong>EXAMPLE ROBOT WITH NEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){ 
    next(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">All of this might take only a few hundred milliseconds, but when you are dealing with potentially hundreds of thousands of loads, every millisecond adds up. Fortunately, from a data collecting robot standpoint, having the html rendered with all the images, sleek css and fonts is not needed, because it is only interested in the data present in the html file. Therefore, we can get the same results for the fraction of the time by getting just the html file with an <strong>ajax</strong> request. However, this requires reformatting of the code by adding extra parameters to the next step and including the response context in subsequent html data selectors.</span></p>
<h5><strong>EXAMPLE ROBOT WITH AJAX:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    next(&quot;&quot;,&quot;getMovie&quot;,&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;);
    done();
}

steps.getMovie = function(url){
    $.get(url).done(function(resp){
        let movie = {name: $(&quot;h1&quot;,resp).text()};
        emit(&quot;movies&quot;,[movie]);
        done();
    })
}
</pre>
<p><span style="font-weight: 400;">This does make the code a bit more complex, and requires more work and care when reformatting old robots in order to avoid errors, especially if the old code is long and complicated. </span></p>
<p><span style="font-weight: 400;">In order to streamline the reformatting and make writing and reading of new robots easier, we integrated the ajax functionality into our extension in the form of <strong>fastnext</strong> function. It functions just like a regular <strong>next</strong>, requiring an url, step name, and an optional third data parameter, but instead of loading the whole website, it does a get request in the background and automatically uses the response html as context in the specified step, thus there is <strong>no need to reformat the selectors</strong>.</span></p>
<h5><strong>EXAMPLE ROBOT WITH FASTNEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    fastnext(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">While reformatting old robots from next to <strong>fastnext</strong> we found that in practice the <strong>savings average at around 50%</strong> reduction in run time. However this varies between the low of only<strong> 25%</strong>, up to a high of <strong>85%</strong>, and it heavily depends on the structure and technology of the specific scraped website.</span></p>
<p><span style="font-weight: 400;">It should be noted however that fastnext, just like a regular ajax, will only work for static html websites where the required data is present in the html. Dynamic websites built with technologies like React or Angular require a different approach.</span></p>
<p><span style="font-weight: 400;">Another nuance that should be taken into account is that currently fastnext does not handle the fail clause of the ajax request and will trigger a step retry, while usually this behaviour is innocuous, sometimes it is needed to handle an ajax fail. In this case a regular ajax function should be used. </span></p>
<p><span style="font-weight: 400;">As an example we wrote a robot that scrapes a small Amazon category containing 149 products and compared the speed of it using next and fastnext. The run using fastnext finished <strong>36% faster</strong> than it’s counterpart, clocking in at ~ 270s, while the run using next clocked in at ~ 420s.</span></p>
<p><strong>Happy Scraping!</strong></p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>New IDE Extension Release</title>
		<link>https://webrobots.io/new-ide-extension-release/</link>
					<comments>https://webrobots.io/new-ide-extension-release/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 21 Jun 2017 08:14:42 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[changelog]]></category>
		<category><![CDATA[web robots ide]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5579</guid>

					<description><![CDATA[Today we are releasing an update to our main extension - Web Robots Scraper IDE. This release has a version number 2017.6.20 and has several improvements in UI, proxy settings control, handling hash symbols in URLs. Version 2017.6.20 RELEASE NOTES UI: Robot run statistics is displayed in the same place and no longer "jumping" [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-8 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-7 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Today we are releasing an update to our main extension &#8211; <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">Web Robots Scraper IDE</a>. This release, version 2017.6.20, includes several improvements in the UI, proxy settings control, and handling of hash symbols in URLs.</p>
<h2>Version 2017.6.20 RELEASE NOTES</h2>
<ul>
<li>UI: Robot run statistics are displayed in a fixed place and no longer &#8220;jumping&#8221;</li>
<li>UI: when a robot finishes, its status becomes a direct link to the robot run list on the portal. The run link is a direct link to data preview and download on the portal.</li>
<li>setProxy() functionality has been expanded. See <a href="/werobots-documentation/">documentation</a> for details.</li>
<li>Bugfix: fixed a bug where subsequent steps with URLs that were identical before the # symbol did not load correctly (example: going to http://foobar.com#a and after that to http://foobar.com#b).</li>
<li>Other internal engine improvements and bugfixes.</li>
</ul>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/new-ide-extension-release/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Email And Social Media Links Crawling From Websites</title>
		<link>https://webrobots.io/email-and-social-media-links-crawling-from-websites/</link>
					<comments>https://webrobots.io/email-and-social-media-links-crawling-from-websites/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 02 Mar 2017 10:45:38 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[email crawling]]></category>
		<category><![CDATA[social media crawling]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5543</guid>

					<description><![CDATA[At Web Robots we often get inquiries on projects to crawl social media links and emails from specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that and anyone can run such web scraping [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-9 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-8 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>At Web Robots we often get inquiries on projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">Web Robots Chrome extension</a> on their own computer.</p>
<p><img loading="lazy" decoding="async" class="lazyload alignnone size-full wp-image-5546" src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" data-orig-src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" alt="" width="994" height="604" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27994%27%20height%3D%27604%27%20viewBox%3D%270%200%20994%20604%27%3E%3Crect%20width%3D%27994%27%20height%3D%273604%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-200x122.png 200w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-300x182.png 300w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-400x243.png 400w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-600x365.png 600w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-768x467.png 768w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-800x486.png 800w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png 994w" data-sizes="auto" data-orig-sizes="(max-width: 994px) 100vw, 994px" /></p>
<p>To start you will need an account on the Web Robots <a href="http://portal.webrobots.io">portal</a> and the Chrome <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">extension</a> &#8211; that&#8217;s it. We placed a robot called <a href="http://portal.webrobots.io/robots/2236"><strong>leads_crawler</strong></a> in our portal&#8217;s Demo space so anyone can use it. In case the robot&#8217;s code is changed, its complete source code is below. Edit the variable on lines 14-18 to contain the list of target websites to crawl and run the robot. Then preview the data on the Output tab and download it from the portal once the robot is finished. You will get a nice CSV file with data which can be used in your further leads processing workstream.</p>
<p>Robot&#8217;s source code:</p>
<pre class="brush: jscript; highlight: &#091;14&#093;;">
var DEPTH = 2;
var EMAIL_PATTERN = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;
var SOCIAL_MEDIA = [
    'facebook.com',
    'linkedin.com',
    'instagram.com',
    'youtube.com',
    'twitter.com',
    'pinterest.com',
    'plus.google.com',
    'blogspot.com'
];

var websites = [
    &quot;http://dccentre.com/&quot;,
    &quot;http://www.theweddingplanneromaha.com/&quot;,
    &quot;http://www.effortlesseventsidaho.com/&quot;
];


steps.start = function() {
    setSettings({skipVisited:true});
    setRetries(5000, 2, 1000); // 5 sec retry timer to skip bad pages quickly
    websites.forEach(function(v, i) {
        next(v, &quot;crawl&quot;, 0);
    });
    done();
};


steps.crawl = function(depth){
    
    depth++;
    
    var emails = _.uniq(returnEmails());
    var social = returnSocial();
    var urls = returnURLs();
    
    dbg(urls);
    
    if(emails.length || social.length) {
        var data = {
            'email' : emails.join(';'),
        };
        $.extend(data, social);
        emit('Leads', [data]);
    }
    
    if(depth &lt; DEPTH) {
        urls.forEach(function(v) {
            next(v, 'crawl', depth);
        });
    }
    
    done();
};


returnURLs = function() {
    var urls = [];
    $('a:visible').each(function (i,v) {
        var url = $(v).prop('href').split('#').shift();
        if(isValidLink(url)) {
            urls.push(url);
        };
    });
    return(_.uniq(urls));
};


returnSocial = function() {
    var urls = [];
    var social = {};
    
    $('a:visible').each(function (i,v) {
        urls.push($(v).prop('href'));
    });
    
    _.uniq(urls).forEach(function(link) {
        var domain = link.split('://').pop().split('www.').pop().split('/').shift().toLowerCase();
        var pos = _.indexOf( SOCIAL_MEDIA, domain);
        if(pos !== -1) {
            social[SOCIAL_MEDIA[pos].split('.').shift()] = link;
        };
    });
    return(social);
};


returnEmails = function() {
    return $('*').html().match(EMAIL_PATTERN);
};


isValidLink = function(link){
    // here we check for all bad stuff in links
    // type check comes first so the string methods below can be called safely
    if ((link === undefined) || (typeof link !== &quot;string&quot;) || (link.length &lt; 12)) {
        return false;
    }
    
    if(_.indexOf(SOCIAL_MEDIA, link.split('://').pop().split('www.').pop().split('/').shift()) !== -1) {
        return false;
    }
    
    if (
        // positives - must be present
        !(link.includes(document.domain)) ||
        !link.startsWith(&quot;http&quot;) ||
        
        // negatives - must not be present
        link.includes(&quot;.zip&quot;) ||
        link.includes(&quot;.csv&quot;) ||
        link.includes(&quot;.mpg&quot;) ||
        link.includes(&quot;.mpeg&quot;) ||
        link.includes(&quot;.gz&quot;) ||
        link.includes(&quot;.jpg&quot;) ||
        link.includes(&quot;.jpeg&quot;) ||
        link.includes(&quot;.png&quot;) ||
        link.includes(&quot;.pdf&quot;) ||
        link.includes(&quot;.doc&quot;) ||
        link.includes(&quot;.xls&quot;) ||
        link.includes(&quot;.ppt&quot;) ||
        link.includes(&quot;.avi&quot;) ||
        link.includes(&quot;.tif&quot;) ||
        link.includes(&quot;.exe&quot;) ||
        link.includes(&quot;.psd&quot;) ||
        link.includes(&quot;.eps&quot;) ||
        link.includes(&quot;.txt&quot;) ||
        link.includes(&quot;.rtf&quot;) ||
        link.includes(&quot;.wmv&quot;) ||
        link.includes(&quot;.odt&quot;) ||
        link.includes(&quot;.css&quot;) ||
        link.includes(&quot;.js&quot;) ||
        link.includes(&quot;mailto:&quot;) ||
        link.includes(&quot;facebook&quot;) ||
        link.includes(&quot;google&quot;) ||
        link.includes(&quot;twitter&quot;) ||
        link.includes(&quot;youtube&quot;) ||
        link.includes(&quot;linkedin&quot;) ||
        link.includes(&quot;download&quot;) ||
        link.includes(&quot;pinterest&quot;)

        ) {
            return false;
        } else {
            return true;
        }
};
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/email-and-social-media-links-crawling-from-websites/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
