Scraping dynamic websites efficiently with the wait function

Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered, so we must wait before extracting data from the DOM. On Walmart's product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in the selector $('div[class^="ProductPage__details"]'). The following step logs how many matching elements exist when the step starts:


steps.start = function() {
    // Runs as soon as the DOM is ready, before the AJAX data has arrived.
    console.log($('div[class^="ProductPage__details"]').length);
    done();
};

The logged result is 0, because our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy: use setTimeout()

We can use setTimeout(), where we specify the number of milliseconds to wait before executing a piece of code. This gives the browser some time to process dynamic data and insert it into the DOM. In this example we introduce a simple 3-second wait:


steps.start = function() {
    // Wait a fixed 3000 ms, then check whether the element is in the DOM.
    setTimeout(function() {
        console.log($('div[class^="ProductPage__details"]').length);
        done();
    }, 3000);
};

The logged result is 1, which indicates that we found the expected data in the DOM. However, this method has drawbacks: the code is delayed by the same amount of time regardless of how long the website actually takes to handle its dynamic requests. This means we waste time when the product data appears sooner and miss data when it loads more slowly.

Dynamic pages tend to load inconsistently, so the exact delay needed for each page load is impossible to know in advance. With setTimeout(), the maximum observed delay is usually chosen. If we wait 3 seconds while the average time for data to appear is 1.5 seconds, and we have to process 50,000 products, then 1.5 seconds × 50,000 = 75,000 seconds, or about 20.83 hours, are wasted per run. That is 625 hours per month if we run this robot every day!

Better waiting strategy: use wait()

The Web Robots wait() function lets the user wait for a particular HTML element to load and then execute code right after the element appears.

wait(string or array selector[], int maxWaitTime)

Default maxWaitTime = 10000

Usable callbacks: then, always, fail (similar to jQuery Deferred, https://api.jquery.com/jquery.deferred/)

Example:


steps.start = function() {
    // Continue as soon as the element appears (default limit: 10000 ms).
    wait('div[class^="ProductPage__details"]').then(function() {
        console.log($('div[class^="ProductPage__details"]').length);
        done();
    });
};
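For readers curious how waiting for an element can work in principle, here is a minimal sketch in plain JavaScript that polls a condition until it becomes true or a time limit passes. It is not the Web Robots implementation of wait(), which this article does not show. The names waitFor, check and intervalMs are hypothetical, and a standard Promise stands in for the then/fail callbacks.

```javascript
// Hypothetical helper, for illustration only: resolves as soon as check()
// returns a truthy value, rejects once maxWaitTime milliseconds have passed.
function waitFor(check, maxWaitTime = 10000, intervalMs = 100) {
    return new Promise(function (resolve, reject) {
        const startedAt = Date.now();

        function poll() {
            const result = check();
            if (result) {
                // Condition met: continue right away instead of sleeping longer.
                resolve(result);
            } else if (Date.now() - startedAt >= maxWaitTime) {
                // Time limit reached without the condition becoming true.
                reject(new Error('Timed out after ' + maxWaitTime + ' ms'));
            } else {
                // Not there yet: look again after a short pause.
                setTimeout(poll, intervalMs);
            }
        }

        poll();
    });
}
```

In a browser, check could be a DOM lookup such as () => document.querySelector('div[class^="ProductPage__details"]'), so the code continues as soon as the element exists rather than after a fixed delay.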

wait() supports several callbacks, covering the cases when an element appears, does not appear, or either:

wait(selector, time_to_wait*).then(callback): callback is executed as soon as the selector appears. If the selector does not appear, callback is never executed.

wait(selector, time_to_wait*).always(callback): callback is executed when the element appears or when time_to_wait is reached.

wait(selector, time_to_wait*).then(callback).fail(callback2): callback is executed when the element appears; callback2 is executed if the element does not appear.

wait([selector1, selector2, …], time_to_wait*).then(callback): callback is executed only when all of the selectors (selector1, selector2, …) have appeared on the page.

* time_to_wait is an optional parameter that sets the number of milliseconds to wait for the specified selector. If it is omitted, the default is 10000 ms.

The wait() function makes scraping dynamic pages much easier, more efficient and more reliable.
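The efficiency difference can be made concrete with the figures quoted earlier: a fixed 3-second wait, data appearing after 1.5 seconds on average, and 50,000 products. The sketch below reproduces that arithmetic. The function wastedHours and its parameter names are made up for illustration, and a month is assumed to be 30 daily runs.

```javascript
// Hours lost when every page waits a fixed timeout although the data
// is, on average, available sooner. Illustrative helper, not part of any API.
function wastedHours(fixedWaitMs, averageLoadMs, pageCount) {
    // A fixed wait shorter than the average load wastes no time (it misses data instead).
    const wastedMsPerPage = Math.max(fixedWaitMs - averageLoadMs, 0);
    return (wastedMsPerPage * pageCount) / (1000 * 60 * 60);
}

const perRun = wastedHours(3000, 1500, 50000);
console.log(perRun.toFixed(2));        // prints "20.83" (hours per run)
console.log((perRun * 30).toFixed(0)); // prints "625" (hours over 30 daily runs)
```

An element-based wait avoids most of this overhead because each page continues as soon as its data is present, up to the configured maximum.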

