Scraping websites can be a time-consuming process. When computing resources are limited and the data needs to be frequent and up to date, having a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, so making a robot even fractionally more efficient can save a lot of valuable time.

There are a number of ways to make your robot run faster: replacing setTimeout with our internal wait function, using loops carefully, avoiding excessive delay timers in the step's done function, and so on. However, one of the best methods has so far proven to be using ajax requests instead of visiting a website directly. In the standard scenario, calling next with a link opens the webpage in your browser: it downloads the HTML file, all of the additional resources it references such as JS and CSS files, images, video and audio, and then processes, renders and displays the page in the window.
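
As a quick aside, the setTimeout replacement mentioned above might look something like the sketch below. The exact signature of the internal wait helper is an assumption here (we assume wait(milliseconds, callback)), and slowStep is a made-up step name; the point is simply that the pause is managed by the extension rather than a raw browser timer.

// Before: a raw browser timer (hypothetical step)
steps.slowStep = function(){
    setTimeout(function(){
        done();
    }, 5000);
}

// After: the extension's wait helper (signature assumed: wait(ms, callback))
steps.slowStep = function(){
    wait(5000, function(){
        done();
    });
}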

EXAMPLE ROBOT WITH NEXT:
steps.start = function(){ 
    next("https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1","getMovie");
    done();
}

steps.getMovie = function(){
    let movie = {name: $("h1").text()};
    emit("movies",[movie]);
    done();
}

All of this might take only a few hundred milliseconds, but when you are dealing with potentially hundreds of thousands of page loads, every millisecond adds up. Fortunately, a data-collecting robot does not need the HTML rendered with all the images, sleek CSS and fonts; it is only interested in the data present in the HTML file. We can therefore get the same results in a fraction of the time by fetching just the HTML file with an ajax request. However, this requires reformatting the code: passing extra parameters to next and using the response as the context in the subsequent HTML data selectors.

EXAMPLE ROBOT WITH AJAX:
steps.start = function(){
    // pass the url as the data parameter instead of loading the page
    next("","getMovie","https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1");
    done();
}

steps.getMovie = function(url){
    // fetch just the html and use the response as the selector context
    $.get(url).done(function(resp){
        let movie = {name: $("h1",resp).text()};
        emit("movies",[movie]);
        done();
    });
}

This does make the code a bit more complex, and requires more work and care when reformatting old robots in order to avoid errors, especially if the old code is long and complicated.
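
To illustrate the kind of care that is needed: every selector in the converted step has to receive the ajax response as its context, and it is easy to miss one. The sketch below is a hypothetical step that extracts several fields; the selectors and field names are made up for the example, but note that each $() call needs resp as its second argument, otherwise it would silently run against the currently loaded page instead of the fetched HTML.

steps.getMovie = function(url){
    $.get(url).done(function(resp){
        let movie = {
            name:     $("h1", resp).text(),
            // forgetting ", resp" on any of these would query the wrong document
            rating:   $("span[itemprop='ratingValue']", resp).text(),
            director: $("div.credit_summary_item a", resp).first().text()
        };
        emit("movies",[movie]);
        done();
    });
}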

To streamline the reformatting and make new robots easier to write and read, we integrated the ajax functionality into our extension in the form of the fastnext function. It works just like a regular next, taking a URL, a step name, and an optional third data parameter, but instead of loading the whole website it performs a GET request in the background and automatically uses the response HTML as the context in the specified step, so there is no need to reformat the selectors.

EXAMPLE ROBOT WITH FASTNEXT:
steps.start = function(){
    fastnext("https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1","getMovie");
    done();
}

steps.getMovie = function(){
    let movie = {name: $("h1").text()};
    emit("movies",[movie]);
    done();
}
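
The optional third data parameter mentioned above can also be used with fastnext to pass values between steps. The sketch below assumes it behaves like the data parameter of a regular next, i.e. it arrives as the step function's argument; the "western" label and the category field are made up for the example.

steps.start = function(){
    // pass a label along with the request (assumed to arrive as the step argument)
    fastnext("https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1","getMovie","western");
    done();
}

steps.getMovie = function(category){
    // selectors need no explicit context, fastnext supplies the fetched html
    let movie = {name: $("h1").text(), category: category};
    emit("movies",[movie]);
    done();
}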

While reformatting old robots from next to fastnext we found that, in practice, the savings average around a 50% reduction in run time. This varies from a low of only 25% up to a high of 85%, and depends heavily on the structure and technology of the specific scraped website.

It should be noted, however, that fastnext, just like a regular ajax request, will only work for static HTML websites where the required data is present in the HTML. Dynamic websites built with technologies like React or Angular require a different approach.

Another nuance to keep in mind is that fastnext currently does not expose the fail clause of the ajax request and will simply trigger a step retry. This behaviour is usually harmless, but when a failure needs to be handled explicitly, a regular ajax call should be used instead.
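
A hedged sketch of what that might look like, using the same next-with-data pattern as the ajax example above; the error handling shown here (emitting to a hypothetical errors collection) is just an illustration of where the fail logic would go.

steps.getMovie = function(url){
    $.get(url)
        .done(function(resp){
            let movie = {name: $("h1", resp).text()};
            emit("movies",[movie]);
            done();
        })
        .fail(function(xhr){
            // handle the failed request explicitly, e.g. record it and move on
            // instead of letting the step retry
            emit("errors",[{url: url, status: xhr.status}]);
            done();
        });
}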

As an example, we wrote a robot that scrapes a small Amazon category containing 149 products and compared its speed using next and fastnext. The run using fastnext finished 36% faster than its counterpart, clocking in at ~270s, while the run using next clocked in at ~420s.

Happy Scraping!