Basic AJAX usage within Web Robots scraper
The best and simplest way to perform AJAX calls within the scraper is to use jQuery's $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.
// Standard jQuery AJAX call
$.ajax({
  url: 'https://webrobots.io',
  method: 'GET'
}).done(function(resp){
  console.log(resp);
});

// Simplified AJAX call
$.get('https://webrobots.io').done(function(resp){
  console.log(resp);
});
Since AJAX is asynchronous, the step's done() call should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn't be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.
Examples of incorrect and correct done() placement in an AJAX call:
INCORRECT
steps.start = function(){
  $.get('https://webrobots.io').done(function(resp){
    // some code
  });
  done();
}
CORRECT
steps.start = function(){
  $.get('https://webrobots.io').done(function(resp){
    // some code
    done();
  });
}
Examples of incorrect and correct AJAX looping:
INCORRECT
steps.start = function(){
  for(let url of urls){
    $.get(url).done(function(resp){
      // some code
    });
  }
  done();
}
CORRECT
steps.start = function(){
  for(let url of urls){
    next('', 'getUrl', url);
  }
  done();
}

steps.getUrl = function(url){
  $.get(url).done(function(resp){
    // some code
    done();
  });
}
AJAX timeout
One issue with AJAX requests inside a step function is that the step's global retry timeout and the AJAX timeout are independent, and in certain scenarios this can cause problems.
Consider this example. A GET request is performed, and since it is asynchronous, the step's done() call is placed inside the GET done block. If the GET fails, we can either call done() inside the .fail() block and move along with our scraping, or omit the .fail() block and force a step retry after our preset retry timeout.
steps.start = function(){
  $.get('https://webrobots.io').done(function(response){
    // some code
    done();
  })
  //.fail(done);
}
This works fine when the server returns a failed response (e.g. status code 404) or fails to respond at all. However, depending on how the server is configured, it might return a valid response after a significant delay, sometimes longer than our locally set step retry timeout. This means that even though the step has already finished, the code inside the GET done block will still run and trigger a done(). Depending on the specific code, this can destabilize the robot and cause unnecessary error logging. To avoid such a scenario, a local AJAX timeout should be set just below the step retry timeout (default is 60000 ms). In the example below, if no response is received from the server within 55000 ms, the AJAX call will time out and the code will proceed to run as normal.
steps.start = function(){
  // Default retry timer is 60000 ms; AJAX timeout should be a few seconds lower.
  $.ajaxSetup({timeout: 55000});
  $.get('https://webrobots.io').done(function(response){
    // some code
    done();
  })
  //.fail(done);
}
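If changing the global setting with $.ajaxSetup() is too broad, the same timeout can also be passed to a single call through the timeout option of $.ajax(); a minimal sketch:

steps.start = function(){
  $.ajax({
    url: 'https://webrobots.io',
    method: 'GET',
    timeout: 55000 // applies to this request only
  }).done(function(response){
    // some code
    done();
  });
}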
Multiple simultaneous AJAX calls using $.when()
Performing several simultaneous AJAX calls is a very efficient way to handle certain scraping situations. One such situation is a website that loads part of its content as static HTML and the rest dynamically through various APIs. Consider an example website that performs one AJAX call to get the post content, one to get the post image, and another for post reviews. A simple approach would be to chain all three AJAX calls so that each starts as soon as the previous one finishes. We will use jsonplaceholder.typicode.com to construct our example:
steps.start = function(){
  $.get('https://jsonplaceholder.typicode.com/posts/1').done(function(r1){
    console.log(r1);
    $.get('https://jsonplaceholder.typicode.com/photos/1').done(function(r2){
      console.log(r2);
      $.get('https://jsonplaceholder.typicode.com/comments/1').done(function(r3){
        console.log(r3);
        done();
      });
    });
  });
};
The downside of this approach is that a new AJAX call cannot start until the previous one ends, wasting valuable time. The solution is jQuery's $.when() method. It takes multiple Deferred objects as arguments, in this case $.get() calls, and resolves its master Deferred as soon as all the Deferreds resolve, or rejects the master Deferred as soon as one of the Deferreds is rejected. The arguments passed to the doneCallbacks provide the resolved values for each of the Deferreds, in the order the Deferreds were passed to $.when(). Our example remade with $.when() looks like this:
steps.start = function(){
  let a1 = () => $.get('https://jsonplaceholder.typicode.com/posts/1');
  let a2 = () => $.get('https://jsonplaceholder.typicode.com/photos/1');
  let a3 = () => $.get('https://jsonplaceholder.typicode.com/comments/1');
  $.when(a1(), a2(), a3()).then(function(r1, r2, r3){
    // r1, r2 and r3 are the arguments resolved for the a1, a2 and a3 AJAX requests, respectively.
    // Each argument is an array with the following structure: [ data, statusText, jqXHR ]
    console.log(r1);
    console.log(r2);
    console.log(r3);
    done();
  });
}
This way all AJAX requests start simultaneously and the code proceeds once all responses are resolved. Depending on how many simultaneous requests are made and the server's response times, this method can significantly increase the speed of a robot.
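Since the master Deferred is rejected as soon as any one of the requests fails, it is usually worth attaching a .fail() handler as well; a minimal sketch reusing the three requests above:

steps.start = function(){
  let a1 = () => $.get('https://jsonplaceholder.typicode.com/posts/1');
  let a2 = () => $.get('https://jsonplaceholder.typicode.com/photos/1');
  let a3 = () => $.get('https://jsonplaceholder.typicode.com/comments/1');
  $.when(a1(), a2(), a3()).done(function(r1, r2, r3){
    // All three requests succeeded.
    done();
  }).fail(function(jqXHR, statusText){
    // At least one request failed; log it and finish the step,
    // or omit this handler to force a step retry instead.
    console.log('AJAX call failed: ' + statusText);
    done();
  });
}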
Dynamic number of simultaneous AJAX calls
During our web scraping journey, we came across a couple of instances where it is useful to make multiple AJAX calls when the number of calls is not known in advance. One such example is taking links from multiple sitemaps and distributing them evenly between forks. Unfortunately, this cannot be accomplished with $.when() because it accepts a fixed number of arguments and returns the same number of responses, each of which has to be specified individually. We can solve this with the ES6 Promise.all() method, which returns a single Promise that resolves when all of the promises passed in as an array have resolved (or when the array contains no promises), and rejects with the first promise that rejects. Here is an example using the rottentomatoes.com sitemap:
steps.start = function(){
  $.get('https://www.rottentomatoes.com/sitemap.xml').done(function(response){
    let sitemaps = $('loc', response).map((i, v) => $(v).text()).get();
    next('', 'sitemaps', sitemaps);
    done();
  });
}

steps.sitemaps = function(sitemaps){
  // Creating an array of promises
  let promises = sitemaps.map(url => $.get(url));
  // Waiting for all AJAX promises to resolve before executing further code
  Promise.all(promises).then(function(responses){
    for(let r of responses){
      // Logging the number of links in each sitemap
      console.log($('loc', r).length);
    }
    done();
  });
}
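One caveat: because Promise.all() rejects as soon as any single request fails, one bad sitemap URL would abort the whole batch without ever calling done(). Assuming jQuery 3+, where jqXHR objects support the Promises/A+ .catch() method, each request can be made to resolve with null on failure so the remaining responses are still processed; a minimal sketch:

steps.sitemaps = function(sitemaps){
  // Resolve failed requests with null instead of rejecting the whole batch (jQuery 3+).
  let promises = sitemaps.map(url => $.get(url).catch(() => null));
  Promise.all(promises).then(function(responses){
    for(let r of responses){
      if(r === null) continue; // skip sitemaps that failed to load
      console.log($('loc', r).length);
    }
    done();
  });
}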
IMPORTANT: This method should only be used when absolutely necessary, because an excessive number of simultaneous requests could strain the target server or be identified as unwanted traffic and trigger blocking. Always use proper delays and follow the robots.txt rules of each website you scrape.
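One hedged way to keep the request rate polite is to fire the calls in small batches with a pause between them. The sketch below assumes jQuery 3+ (for .catch() on jqXHR) and uses a hypothetical batch size and delay, both of which should be tuned to the target site:

steps.sitemaps = function(sitemaps){
  const BATCH_SIZE = 5;   // hypothetical batch size; tune per target site
  const DELAY_MS = 2000;  // hypothetical pause between batches
  let responses = [];

  function runBatch(start){
    if(start >= sitemaps.length){
      // All batches finished.
      console.log(responses.length + ' sitemaps fetched');
      done();
      return;
    }
    let batch = sitemaps
      .slice(start, start + BATCH_SIZE)
      .map(url => $.get(url).catch(() => null)); // resolve failures with null
    Promise.all(batch).then(function(results){
      responses = responses.concat(results);
      // Pause before starting the next batch to keep the request rate polite.
      setTimeout(() => runBatch(start + BATCH_SIZE), DELAY_MS);
    });
  }

  runBatch(0);
}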
Vanilla JS AJAX use cases
While jQuery's $.ajax() is handy, it has one disadvantage: by default it sets the X-Requested-With: XMLHttpRequest header, and in rare cases this affects the content of the response sent by the server. To circumvent this, use the Vanilla JS XMLHttpRequest object or the modern fetch API. Refer to their respective documentation pages for more information on how to use them. Here are a couple of simple examples.
Example using XMLHttpRequest:
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function(){
  if(this.readyState == 4 && this.status == 200){
    console.log(this.responseText);
  }
};
xhttp.open('GET', 'cookies.php', true);
xhttp.send();
Example using fetch:
fetch('https://webrobots.io/').then(function(response){
  return response.text();
}).then(function(text){
  console.log(text);
});
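Note that fetch() only rejects on network failures, not on HTTP error statuses, so response.ok should be checked explicitly. A minimal sketch of using fetch inside a step, following the same done() placement rules as the jQuery examples above:

steps.start = function(){
  fetch('https://webrobots.io/').then(function(response){
    if(!response.ok){
      throw new Error('HTTP ' + response.status);
    }
    return response.text();
  }).then(function(text){
    // some code
    done();
  }).catch(function(err){
    // Network error or non-2xx status: log it and finish the step,
    // or rethrow to force a step retry after the timeout.
    console.log(err.message);
    done();
  });
}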