In the Web Robots framework, robot scripts are written in JavaScript using simple step functions. Besides our own functions, the following libraries are included and can be used in any step of a robot:

  • jQuery – work with DOM elements
  • underscore.js – helpers for working with objects and arrays (deduplicate, filter, sort, etc.)
  • Moment.js – parse and manipulate times and dates

Robot Building Blocks

steps.stepName = function ([Object passedData]) {}

A step function is a block of code that will be injected into a loaded page and executed.

stepName – the name given to a step function. The first step must always be named “start”; any names can be used after that.
passedData – any value passed to this step from a previous step, optional.

Example 1:

steps.start = function () {
  // do something, emit data, generate next steps, etc
  done();
};

Example 2:

steps.start = function () {
    var some_data ='Hello World!';
    next('http://google.com', 'stepTwo', some_data);
    done();
};

steps.stepTwo = function (passed_data) {
    // Wow, I am scraping Google!
    alert(passed_data);
    done();
};

Functions

done([int delay])

done() notifies the extension that the current step has finished executing, so the extension can execute the next step. IMPORTANT: this function must be called ONLY ONCE during step execution. Typically it is placed at the end of the step. done() must be skipped when the step clicks a DOM element that causes another page to load (example: clicking a button to submit a form); in this case done() is called automatically by the page load event.

delay – delay in milliseconds, optional. Used to slow down the robot’s crawling rate.

Example:

steps.secondStep = function () {
    var data = { name: $('h1').text().trim() };
    emit('Data', [data]);
    done(1000); // done with a delay of 1 second
}
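
Because done() must run exactly once, a step with several callbacks or branches may want to guard against a double call. The sketch below shows one possible way to do that in plain JavaScript. The once() helper and the counting done() stand-in are hypothetical and exist only for this illustration; they are not part of the Web Robots API (the bundled underscore.js offers a similar _.once).

```javascript
// Hypothetical stand-in for the framework's done(), used only to count calls.
let doneCalls = 0;
function done() { doneCalls += 1; }

// Wrap a function so that only the first call has any effect.
function once(fn) {
  let called = false;
  return function (...args) {
    if (called) return undefined;
    called = true;
    return fn.apply(this, args);
  };
}

const finish = once(done);
finish(); // first call reaches done()
finish(); // second call is ignored
console.log(doneCalls);
```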

next(String url, String nextStep[, Object passedData])

next() adds a new step to the execution queue.
url – the URL to load for this step. If no loading is needed, pass an empty string “”. Use “” when the previous step already loads the page this step needs (usually by clicking a button or submitting a form).
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: The Web Robots system uses the LIFO (last in, first out) principle for queue execution. If a step generates many next() statements, the queue manager executes the most recently added one first.

Example:

steps.start = function () {
    next('http://webrobots.io', 'stepTwo');
    done();
}

steps.stepTwo = function () {
    console.log('Hello from stepTwo!');
    done();
}
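
The LIFO ordering described above can be pictured with a simplified model. This is only a sketch of the principle, not the extension's actual queue manager: the next() defined here is a hypothetical stand-in that pushes onto an array, and the example.com URLs are placeholders.

```javascript
// Simplified model of a LIFO step queue (illustration only).
const queue = [];

// Hypothetical stand-in for the framework's next(): just records the request.
function next(url, step) {
  queue.push({ url: url, step: step });
}

next('http://example.com/a', 'stepTwo');
next('http://example.com/b', 'stepTwo');
next('http://example.com/c', 'stepTwo');

// Last in, first out: the most recently queued step is taken first.
const executionOrder = [];
while (queue.length > 0) {
  executionOrder.push(queue.pop().url);
}
console.log(executionOrder);
```

In this model the page queued last (/c) is visited first, so a robot that needs pages processed in listing order would have to queue them in reverse.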

IMPORTANT: When a single step makes more than 5000 next() invocations, use the optional form of next() that accepts an array of next objects.

next(Array [nextObject-1, nextObject-2, …])

next() can also accept an array of next objects with mandatory url and step fields and an optional params field:

let nextObject =  {url: string url, step: string nextStep[, params: object passedData ]} ;

This method is preferred when a robot must queue more than 5000 next steps. Calling next() in a loop more than 5000 times can freeze the browser, because it is a resource-intensive process. Putting all next objects into an array and passing them in a single invocation solves the problem.

Example:

steps.start = function () {
    let nextArray = [];
    $('.item').each(function(i,v){
        let singleNext = {url: $(v).prop('href'), step:'stepTwo',params:{message:'I want to reach stepTwo!'}};
        nextArray.push(singleNext);

    });
    next(nextArray);
    done();
}

steps.stepTwo = function (params) {
    console.log(params.message);
    console.log('Hello from stepTwo!');
    done();
}
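
The same array form can be used when the URLs are generated rather than scraped, for instance for a long paginated listing. The following is a minimal sketch under assumptions: buildPageNextObjects() is a hypothetical helper, the ?page= URL pattern and example.com address are made up, and next() is replaced by a stand-in that only records the call so the snippet can run outside the extension.

```javascript
// Hypothetical stand-in for the framework's next(array), recording the call.
let nextCalls = 0;
let queuedBatch = null;
function next(nextObjects) {
  nextCalls += 1;
  queuedBatch = nextObjects;
}

// Build one next object per page; url and step are mandatory, params is optional.
function buildPageNextObjects(baseUrl, pageCount, step) {
  const nextObjects = [];
  for (let page = 1; page <= pageCount; page++) {
    nextObjects.push({
      url: baseUrl + '?page=' + page,
      step: step,
      params: { page: page }
    });
  }
  return nextObjects;
}

const batch = buildPageNextObjects('http://example.com/list', 6000, 'stepTwo');
next(batch); // a single invocation instead of 6000 separate next() calls
console.log(batch.length);
```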

fastnext(String url, String nextStep[, Object passedData])

fastnext() is functionally identical to next(), but instead of visiting the page it fetches the page’s static HTML via an AJAX GET request. It is a faster alternative to next() that works well with static web pages and can significantly speed up a robot. The extension handles the AJAX response and creates a virtual DOM in the background so, unlike with a regular AJAX call, no context needs to be specified in subsequent data selectors.

url – since this function performs an AJAX call, a valid URL string is required.
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

Example:

steps.start = function () {
    fastnext('http://webrobots.io', 'stepTwo');
    done();
}

steps.stepTwo = function () {
    // Notice how the selector doesn't have any specified context
    let title = $('h1').text();
    console.log(title);
    done();
}

nextsel(String selector, String nextStep[, Object passedData])

nextsel() adds multiple steps to the execution queue, one for each element matched by selector.
selector – a CSS selector string for DOM elements with an href attribute, usually <a> elements.
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

Example:

steps.start = function () {
    
    nextsel('a.product', 'stepTwo');

    // In this example nextsel() is a shorthand for this looped next() code:
    // $('a.product').each(function(i,v) {
    //    next(v.href, 'stepTwo');
    // });

    done();
}

steps.stepTwo = function () {
    console.log('Hello from stepTwo!');
    done();
}

fork(String url, String nextStep[, Object passedData])

fork() is like next(), except that it starts a new robot from that particular step. fork() is useful for large robots when parts of the work can run in parallel. In development mode fork() behaves like next() for development and troubleshooting purposes; actual forking is performed on cloud instances in production mode.

url – the URL to load for this step. Must be a valid URL, not an empty string.
nextStep – a string containing the name of the step to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: fork() should be used only once during a robot run. A newly forked robot starts with default settings, so any settings for retries, proxy, skipVisited, etc. should be applied after fork(). fork() can launch many parallel robots, which can generate significant load on the target website.

Example:

steps.start = function () {
    $('a.city').each(function(i,v) {
        fork(v.href, 'processCity'); // this will kick off a newly forked robot for each a.city element
    });
    done();
}

emit(String tableName, array Data)

emit() sends extracted data to the database. It can be called several times during a single step.
tableName – the name of the table the data will be saved to. One robot can emit data into multiple tables. Example: during execution a robot emits data to the “Users” and “Products” tables.
Data – an array of JSON objects. Even when emitting a single JSON object, put it into an array.

IMPORTANT: Always emit an array. If you need to emit a single value, still wrap it in array notation, for example emit(“Strings”, [myString]).

Example:

steps.emitData = function () {
    var data = [];
    $('#product-table tr').each(function(i, v) {
        var product = {
            name : $('.product-name', v).text(),
            price : $('.product-price', v).text()
            
        };
        data.push(product);
        
    });
    
  emit('Products', data);
  done();
}
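
Since emit() always expects an array, a small normalising helper can make that rule harder to forget. This is a sketch only: toEmitArray() is a hypothetical helper name, not a Web Robots function, and the commented emit() call shows how it might be used inside a step.

```javascript
// Hypothetical helper: wrap a single value in an array, leave arrays untouched.
function toEmitArray(value) {
  return Array.isArray(value) ? value : [value];
}

const single = toEmitArray({ name: 'Widget' });
const many = toEmitArray([{ name: 'A' }, { name: 'B' }]);
console.log(single.length, many.length);

// Inside a step this could be used as:
// emit('Products', toEmitArray(product));
```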

click(var selector)

Performs a click on the DOM element found by selector. click() simulates a real click better than the jQuery .click() function because it fires the series of events that occur when a person clicks (mousedown, click, mouseup, etc.).
selector – can be one of two types: a CSS selector string for the item to click on, or a DOM object to click on.

IMPORTANT: Do not use done() in a step where click() results in a new page loading.

Example:

steps.doSomeClicking = function() {
    click('.button');   // Click through string CSS selector
    
    var button = $('.button')[0];
    click(button);      // Click through DOM element
    
    done();             // Include done() only if click does not load a new page!
}

wait(string or array selector[, int maxWaitTime]).then(action).fail(action).always(action)

wait() waits for specific DOM elements to appear on a page and then performs some actions. Useful when elements appear dynamically some time after the initial page load.
selector – a CSS selector string expected to appear in the DOM. An array of CSS selector strings can be passed as well.
maxWaitTime – optional maximum waiting time in milliseconds. The default value of 10,000 ms is used if this parameter is omitted.
action – functions executed when the selector appeared (then), did not appear (fail), or in either case (always).

Example:

wait('.magic-table', 5000).then(function() {
    // .magic-table appeared, we can scrape it now.
    done();
})
.fail(function() {
    // This will execute if .magic-table doesn't appear.
    done(); // done() must be called here as well
})
.always(function() {
    // This will execute regardless of whether .magic-table appeared or not.
});

clearCookies()

Clears browser cookies and local storage for current page.

Example:

clearCookies();

isNumber(var number)

Returns true if variable number can be parsed as a real number and false if not.

Example:

isNumber('123abc'); // returns false
isNumber('123.4545'); // returns true
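
For readers curious how such a check might behave, here is a minimal sketch in plain JavaScript. It is an assumption about equivalent logic, not the framework's actual implementation; looksLikeNumber() is a hypothetical name chosen to avoid confusion with the built-in isNumber().

```javascript
// Hypothetical equivalent check: the whole value must convert to a finite number.
function looksLikeNumber(value) {
  return !isNaN(parseFloat(value)) && isFinite(value);
}

console.log(looksLikeNumber('123abc'));   // false: trailing letters make the value non-numeric
console.log(looksLikeNumber('123.4545')); // true
```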

setUA(string userAgent)

Sets the browser’s User Agent to the specified userAgent. This function is useful when interacting with mobile sources. The User Agent stays in effect for the duration of the robot run.
userAgent – User Agent string.

Example:

steps.start = function () {
    next('http://webrobots.io', 'stepTwo');
    setUA('Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)');
    done();
}

steps.stepTwo = function () {
    console.log('I visited webrobots.io with custom User-Agent!');
    done();
}

setProxy({ server : string, port : string, username : string, password : string })

IMPORTANT: This is an advanced function. It can leave the browser stuck with the proxy setting if the robot does not finish running cleanly, for example when the developer closes the Web Robots IDE extension during a robot run.

Configures Chrome to use the proxy specified in the server string. If port is omitted it is set to 8888. Proxy settings stay in effect until setProxy() is called again or the robot finishes its run. Calling setProxy() without any parameters resets proxy settings to system defaults without a proxy. Paid customers have the option to use the Web Robots built-in proxy rotator; contact us for details.
server – proxy server
port – proxy server’s port
username – username for proxy authentication
password – password for proxy authentication

Example:

steps.start = function () {
    setProxy({ 
       server: '123.14.1.2',
       port: '8888',
       username: 'johnny',
       password: 'Passw0rd'
    });
    next('http://webrobots.io', 'stepTwo');

    done();
}

steps.stepTwo = function () {
    console.log('I visited webrobots.io through Proxy!');
    done();
}

closeSocket()

Closes all idle sockets in browser’s network internals. This specialised function can be used to force proxy IP address change when residential proxies are in use.

Example:

steps.start = function () {
    setProxy('MY_PROXY_TAG');
    next('https://webrobots.io', 'stepTwo');

    done();
}

steps.stepTwo = function () {
    closeSocket();
    done();
}

resetProxy()

resetProxy() resets any proxy previously set on the robot with the setProxy() function.

Example:

steps.start = function () {
    setProxy({ 
       server: '123.14.1.2',
       port: '8888',
       username: 'johnny',
       password: 'Passw0rd'
    });
    next('http://webrobots.io', 'stepTwo');
    done();
}

steps.stepTwo = function () {
    // Resetting proxy
    resetProxy();
    done();
}

setRetries(int interval, int count, int total)

Modifies the robot’s step retry behaviour from the default values. The defaults are a 60,000 ms retry interval, a retry count of 3, and a total count of 150. Following this logic, if a step is retried 3 times and still fails, the robot proceeds to the next step. If the robot accumulates 150 retries during a run, it stops and marks the run as failed.
interval – retry interval in milliseconds.
count – maximum retry count on a single step.
total – cumulative retry limit for the robot run. If this limit is reached, the robot stops and marks the run as failed.

Best practice is to increase retry parameters only if the source is unreliable and reloading the page can yield results. If the source is reliable and errors are caused by error-prone JavaScript in the robot, more retries will not help.

Example:

steps.start = function () {
    next('http://webrobots.io', 'stepTwo');
    // set the robot to retry after 10 seconds, up to 10 times per step, and stop if a total of 6000 retries is reached.
    setRetries(10000, 10, 6000);
    // All subsequently executed steps will adhere to the new retry policy
    done();
}

steps.stepTwo = function () {
    console.log('This step will be retried in 10 seconds if there is no done() event');
    done();
}
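
The interplay between the per-step count and the run-wide total can be illustrated with a small simulation. This is a simplified model based only on the description above, not the extension's real retry logic: simulateRetries() is a hypothetical function, each array entry is the number of consecutive failed attempts a step produces before it would succeed, and the retry interval is not modelled.

```javascript
// Simplified model of the retry policy (illustration only).
// failuresPerStep[i] = failed attempts of step i before it would succeed
// (use Infinity for a step that never succeeds).
function simulateRetries(failuresPerStep, policy) {
  let totalRetries = 0;
  const skipped = [];
  for (let i = 0; i < failuresPerStep.length; i++) {
    let retries = 0;
    let remainingFailures = failuresPerStep[i];
    while (remainingFailures > 0) {
      if (retries === policy.count) {
        skipped.push(i); // per-step limit reached: move on to the next step
        break;
      }
      retries += 1;
      totalRetries += 1;
      remainingFailures -= 1;
      if (totalRetries >= policy.total) {
        return { status: 'failed', totalRetries: totalRetries, skipped: skipped };
      }
    }
  }
  return { status: 'finished', totalRetries: totalRetries, skipped: skipped };
}

const defaults = simulateRetries([0, 2, Infinity, 1], { count: 3, total: 150 });
const strict = simulateRetries([0, 2, Infinity, 1], { count: 3, total: 4 });
console.log(defaults, strict);
```

With the default limits the always-failing third step is abandoned after 3 retries and the run finishes; with a total of 4 the run is marked as failed while that step is still being retried.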

setSettings(Object settings)

Changes robot running parameters for downstream steps. Currently supported settings are:

skipVisited – boolean setting that controls behaviour when a next() statement wants to open an already visited page. Applies only to next() calls made after the setting is enabled. Default value: false.
respectRobotsTxt – boolean setting that controls the robot’s behaviour with respect to the robots.txt file. If set to true, the robot reads and parses the robots.txt file from the target domain and starts respecting its directives. Applies only to next() calls made after the setting is enabled. Default value: false.

Example:

steps.start = function () {
    setSettings({ skipVisited : true, respectRobotsTxt : true});
    // 1st next()
    next('http://webrobots.io', 'stepTwo');
    // 2nd next() will be skipped because we already have a next() going to http://webrobots.io
    next('http://webrobots.io', 'stepTwo');
    done();
}

steps.stepTwo = function () {
    done();
}

screenshot( string fileName )

Takes a screenshot of the scraped website at the time of the function call. The screenshot is saved as a jpg or png and is accessible on the run page in the Web Robots portal.

The function takes a string argument specifying the screenshot name and format. The default format is png. If several screenshots are taken with the same name, each overwrites the previous one, so it is recommended to generate dynamic names using timestamps.

Example:

steps.start = function () {
    screenshot('nice_screenshot_' + moment().format('DD-HH:mm:ss') + '.jpg');
    done();
}
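
If finer-grained or filesystem-friendly names are wanted, the timestamp can also be built with the standard Date object. The sketch below is one possible approach; screenshotName() is a hypothetical helper, and replacing colons and dots with dashes is a precaution chosen here, not a documented requirement of screenshot().

```javascript
// Hypothetical helper: build a unique screenshot name from a prefix and a date.
function screenshotName(prefix, date, extension) {
  // ISO timestamp with ':' and '.' replaced by '-'.
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return prefix + '_' + stamp + '.' + extension;
}

const example = screenshotName('home', new Date(Date.UTC(2024, 0, 2, 3, 4, 5)), 'png');
console.log(example);

// Inside a step this could be used as:
// screenshot(screenshotName('home', new Date(), 'png'));
```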

blockImages()

Disables image downloading and display in the browser. All subsequent steps are affected. Does not carry over into forks. Common use cases: when the size of downloaded content is a concern and needs to be optimised; it sometimes also improves crawling speed.

Example:

steps.start = function () {
    blockImages();
    done();
}

allowImages()

Enables image downloading and display in the browser. Used to reverse changes made by blockImages().

Example:

steps.start = function () {
    allowImages();
    done();
}

captureRequests ([string urlFilter, string contentFilter, string requestType])

Returns a promise that resolves to an array of captured network requests (fetch and XHR) and their corresponding responses since the last page reload.

To use this function, the interceptNetworkRequests flag must be set to true in the robot config. This puts the active browser tab in debugging mode.

captureRequests() has three optional string parameters that can be combined:

urlFilter – returns only requests whose destination URL contains the specified string.
contentFilter – returns only requests whose response contains the specified string.
requestType – returns only requests of the specified type (fetch, xhr).

Example:

steps.loadPage = async function() {
  await wait('.product'); // wait for a selector indicating that the request we need has already fired
  var requests = await captureRequests('productDataApi');
  console.log(requests[0]); // object containing the request data we need
  // ...
  done();
};

For debugging, you can write an empty robot with the interceptNetworkRequests = true flag in its config. When the starting page loads, the tab will be in debugging mode, and you can run, for example, var r = await captureRequests() in the Chrome developer console even after the robot has finished.