In the Web Robots framework, robot scripts are written in JavaScript using simple step functions. Besides our own step functions, we also include the following libraries, which can be used in any step within a robot:

  • jQuery – work with DOM elements
  • underscore.js – helpers for working with objects and arrays (deduplicate, filter, sort, etc.)
  • Moment.js – parse and manipulate times and dates

Robot Building Blocks

steps.stepName = function ([Object passedData]) {}

A step function is a block of code that will be injected into a loaded page and executed.

stepName – name given to a step function. The first step must always be named “start”; any names can be used after that.
passedData – any variable that can be passed to a step from previous steps.

Example 1:

steps.start = function () {
  // do something, emit data, generate next steps, etc
  done();
};

Example 2:

steps.start = function () {
    var some_data ="Hello World!";
    next("http://google.com", "stepTwo", some_data);
    done();
};

steps.stepTwo = function (passed_data) {
    // Wow, I am scraping Google!
    alert(passed_data);
    done();
};

Functions

done([int delay])

done() notifies the extension that the current step has completed execution so the extension can execute the next step. IMPORTANT: this function must be called ONLY ONCE during step execution. Typically it is placed at the end of the step’s execution. done() must be omitted when the step clicks a DOM element that causes another page to load (example: clicking a button to submit a form); in this case done() is called automatically by the page loading event.

delay – delay in milliseconds, optional. Used to slow down robot’s crawling rate.

Example:

steps.secondStep = function () {
    var data = { name: $("h1").text().trim() };
    emit("Data", [data]);
    done(1000); // done with a delay of 1 second
}
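
Because done() must be called only once per step, a guard can help during development when a step has several exit paths. The following is a minimal sketch, not part of the Web Robots API: once() is a hypothetical helper, and done() is replaced by a local stand-in that only counts calls so the snippet can run outside the extension. The bundled underscore.js offers _.once() with the same behaviour.

```javascript
// Stand-in for the framework's done() (assumption): it only counts calls
// so this snippet can run outside the Web Robots extension.
var doneCalls = 0;
function done(delay) {
    doneCalls += 1;
}

// Hypothetical helper: wraps a function so only the first call goes through.
function once(fn) {
    var called = false;
    return function () {
        if (called) {
            return; // ignore repeated calls
        }
        called = true;
        return fn.apply(this, arguments);
    };
}

var finish = once(done);
finish(1000); // forwarded to done(1000)
finish();     // ignored, done() is not called a second time

console.log(doneCalls); // prints 1
```

Inside a real step, finish() would be called in place of done() on every exit path.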

next(String url, String nextStep[, Object passedData])

next() adds a new step to the execution queue.
url – a URL to load for this particular step. If loading is not necessary, an empty string "" must be passed. Use "" when the previous step loads the page needed by this step (usually by clicking a button or submitting a form).
nextStep – a string containing step’s name to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: The Web Robots system uses the LIFO (last in, first out) principle for queue execution. If many next() statements are generated, the queue manager will execute the latest one first.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    done();
}

steps.stepTwo = function () {
    console.log("Hello from stepTwo!");
    done();
}
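
The LIFO ordering described above can be illustrated with a small standalone model. This is only a sketch: the queue array and this next() are local stand-ins, not the real queue manager, and the example.com URLs are placeholders.

```javascript
// Stand-in queue manager (assumption): models only the LIFO ordering.
var queue = [];
var executed = [];

function next(url, nextStep, passedData) {
    queue.push({ url: url, step: nextStep, data: passedData });
}

// Three next() calls made in this order inside one step.
next("http://example.com/a", "parse");
next("http://example.com/b", "parse");
next("http://example.com/c", "parse");

// LIFO: the most recently queued entry is taken first.
while (queue.length > 0) {
    var job = queue.pop();
    executed.push(job.url);
}

console.log(executed.join(" -> "));
// prints http://example.com/c -> http://example.com/b -> http://example.com/a
```

Under this model, pages that should be processed in listing order would need to be queued in reverse order.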

nextsel(String selector, String nextStep[, Object passedData])

nextsel() adds multiple steps to the execution queue, one for each element matching the selector.
selector – a CSS selector string for DOM elements with an href attribute. Usually an <a> element.
nextStep – a string containing step’s name to execute.
passedData – any data that should be passed to the next step, optional.

Example:

steps.start = function () {
    
    nextsel("a.product", "stepTwo");

    // In this example nextsel() is a shorthand for this looped next() code:
    // $("a.product").each(function(i,v) {
    //    next(v.href, "stepTwo");
    // });

    done();
}

steps.stepTwo = function () {
    console.log("Hello from stepTwo!");
    done();
}

fork(String url, String nextStep[, Object passedData])

fork() is like a next() statement, except that it starts a new robot from that particular step. fork() is useful when running large robots whose parts can run in parallel. In development mode fork() works like next() for development and troubleshooting purposes; actual forking is performed on cloud instances in production mode.

url – a URL to load for this particular step. Must be a valid URL, not an empty string.
nextStep – a string containing step’s name to execute.
passedData – any data that should be passed to the next step, optional.

IMPORTANT: fork() should be used only once during a robot run. A newly forked robot starts with default settings, so any settings regarding retries, proxy, skipVisited, etc. should be applied after fork(). fork() can launch many parallel robots, which can generate significant load on the target website.

Example:

steps.start = function () {
    $("a.city").each(function(i,v) {
        fork(v.href, "processCity"); // this will kick off a newly forked robot for each a.city element
    });
    done();
}

emit(String tableName, array Data)

emit() sends extracted data to the database. It can be called several times during a single step.
tableName – the name of the table to which data will be saved. One robot can emit data into multiple tables. Example: during execution a robot emits data to “Users” and “Products” tables.
Data – an array of JSON objects. Even when emitting a single JSON object, put it into an array.

IMPORTANT: Always emit an array. If a single value needs to be emitted, still wrap it in array notation, for example emit("Strings", [myString]).

Example:

steps.emitData = function () {
    var data = [];
    $("#product-table tr").each(function(i, v) {
        var product = {
            name : $(".product-name", v).text(),
            price : $(".product-price", v).text()
            
        };
        data.push(product);
        
    });
    
    emit("Products", data);
    done();
}
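
Since emit() always expects an array, a small normalising helper can reduce mistakes when a step sometimes produces one object and sometimes many. The sketch below is illustrative only: toRows() is a hypothetical helper, not a Web Robots function, and the sample objects are made up.

```javascript
// Hypothetical helper: returns the value unchanged if it is already an
// array, otherwise wraps it in a one-element array.
function toRows(value) {
    return Array.isArray(value) ? value : [value];
}

var single = { name: "Widget", price: "9.99" };
var many = [{ name: "A" }, { name: "B" }];

console.log(toRows(single).length); // prints 1
console.log(toRows(many).length);   // prints 2

// Inside a step this could be used as: emit("Products", toRows(single));
```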

click(var selector)

Performs a click on a DOM element found by selector. click() simulates a real click better than the jQuery .click() function because it fires the series of events that would occur during a click by a person (mousedown, click, mouseup, etc.).
selector – can be one of two types: a CSS selector string for the item to click on, or a DOM object to click on.

IMPORTANT: Do not use done() in a step where click() results in a new page loading.

Example:

steps.doSomeClicking = function() {
    click(".button");   // Click through string CSS selector
    
    var button = $(".button")[0];
    click(button);      // Click through DOM element
    
    done();             // Include done() only if the click does not load a new page!
}

wait(string or array selector[, int maxWaitTime]).then(action).fail(action).always(action)

wait() is used to wait for specific DOM elements to appear on a page and then perform some actions. Useful when elements appear dynamically some time after the initial page load.
selector – a string CSS selector expected to appear in the DOM. An array of string CSS selectors can be passed as well.
maxWaitTime – optional maximum waiting time in milliseconds. The default value of 10,000 ms is used if this parameter is omitted.
action – functions that will be executed when the selector appeared (then), did not appear (fail), or in either case (always).

Example:

wait(".magic-table", 5000).then(function() {
    // .magic-table appeared, we can scrape it now.
    done();
})
.fail(function() {
    // This will execute if .magic-table doesn't appear.
    done(); // done() must be called here as well
})
.always(function() {
    // This will execute regardless of whether .magic-table appeared or not.
});

clearCookies()

Clears browser cookies and local storage for the current page.

Example:

clearCookies();

isNumber(var number)

Returns true if variable number can be parsed as a real number and false if not.

Example:

isNumber("123abc"); // returns false
isNumber("123.4545"); // returns true
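
For intuition, a check with the same results on the two inputs above can be written in plain JavaScript. This is an approximation under assumptions, not the framework's actual implementation: isNumberSketch() is a hypothetical name, and edge cases may be handled differently by the real isNumber().

```javascript
// Hypothetical approximation of isNumber() (assumption, not the real code).
// parseFloat() rejects values with no leading number; isFinite() rejects
// strings with trailing garbage as well as NaN and Infinity.
function isNumberSketch(value) {
    return !isNaN(parseFloat(value)) && isFinite(value);
}

console.log(isNumberSketch("123abc"));   // prints false
console.log(isNumberSketch("123.4545")); // prints true
console.log(isNumberSketch(""));         // prints false
```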

setUA(string userAgent)

Sets the browser's User Agent to the specified userAgent. This function is useful when interacting with mobile sources. The User Agent stays in effect for the duration of the robot run.
userAgent – User Agent string.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    setUA("Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)");
    done();
}

steps.stepTwo = function () {
    console.log("I visited webrobots.io with custom User-Agent!");
    done();
}

setProxy({ server : string, port : string, username : string, password : string })

IMPORTANT: This is an advanced function. It can cause the browser to be stuck with the proxy setting if the robot does not finish running cleanly, e.g. if the developer closes the Web Robots IDE extension during a robot run.

Configures Chrome to start using the proxy specified in the server string. If port is omitted it will be set to 8888. Proxy settings are in effect until setProxy() is called again or the robot finishes its run. Calling setProxy() without any parameter resets proxy settings to system defaults without a proxy. Paid customers have an option to leverage the Web Robots built-in proxy rotator; contact us for details.
server – proxy server
port – proxy server’s port
username – username for proxy authentication
password – password for proxy authentication

Example:

steps.start = function () {
    setProxy({
        server: "123.14.1.2",
        port: "8888",
        username: "johnny",
        password: "Passw0rd"
    });
    next("http://webrobots.io", "stepTwo");

    done();
}

steps.stepTwo = function () {
    console.log("I visited webrobots.io through Proxy!");
    done();
}

setRetries(int interval, int count, int total)

Modifies robot’s step retry behaviour from default values. Default values are 60,000ms retry interval, 3 retry count, 150 total count. Following this logic, if some step is retried 3 times and still fails – robot proceeds to the next step. If robot encounters 150 retries during run it stops and marks run as failed.
interval – retry interval in milliseconds.
count – maximum retry count on a single step.
total – cumulative allowed retry limit during robot run. If this limit is reached robot will stop and mark run as failed.

Best practice is to increase retry parameters only if the source is not reliable and reloading the page can yield results. If the source is reliable and errors are caused by error-prone JavaScript in the robot, more retries will not help.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    // Set the robot to retry after 10 seconds, up to 10 times per step, and stop if a total of 6000 retries is reached.
    setRetries(10000, 10, 6000);
    // All subsequently executed steps will adhere to the new retry policy
    done();
}

steps.stepTwo = function () {
    console.log("This step will be retried in 10 seconds if there is no done() event");
    done();
}
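
How the per-step count and the run-wide total interact can be modelled outside the framework. The sketch below is a simplified simulation under assumptions: simulateRun() and its input format are hypothetical, retry intervals are ignored, and the exact accounting inside Web Robots may differ.

```javascript
// Standalone model of the retry limits described above (assumption: the
// real accounting inside Web Robots may differ in detail).
// failuresPerStep[i] is how many attempts of step i fail in a row.
function simulateRun(failuresPerStep, count, total) {
    var totalRetries = 0;
    var skipped = [];
    for (var i = 0; i < failuresPerStep.length; i++) {
        var failures = failuresPerStep[i];
        var retries = 0;
        while (failures > 0) {
            failures -= 1;               // the current attempt failed
            if (retries === count) {     // per-step retry budget used up
                skipped.push(i);         // give up on this step, move on
                break;
            }
            retries += 1;
            totalRetries += 1;
            if (totalRetries >= total) { // run-wide limit reached
                return { status: "failed", totalRetries: totalRetries, skipped: skipped };
            }
        }
    }
    return { status: "finished", totalRetries: totalRetries, skipped: skipped };
}

// Default limits: step 2 never succeeds, so it is abandoned after 3 retries.
var withDefaults = simulateRun([0, 2, 10], 3, 150);
console.log(withDefaults.status, withDefaults.totalRetries, withDefaults.skipped);
// prints finished 5 [ 2 ]

// A low total stops the whole run once 5 retries have accumulated.
var tightTotal = simulateRun([10, 10], 3, 5);
console.log(tightTotal.status, tightTotal.totalRetries);
// prints failed 5
```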

setSettings(Object settings)

Changes robot running parameters for the downstream steps. Currently supported settings are:

skipVisited – boolean setting that controls behaviour when a next() statement wants to open an already visited page. Applies only to subsequent next() steps after the setting is enabled. Default value: false.
respectRobotsTxt – boolean setting that controls the robot’s behaviour with respect to the robots.txt file. If set to true, the robot will read and parse the robots.txt file from the target domain and start respecting its directions. Applies only to subsequent next() steps after the setting is enabled. Default value: false.

Example:

steps.start = function () {
    setSettings({ skipVisited : true, respectRobotsTxt : true});
    // 1st next()
    next("http://webrobots.io", "stepTwo");
    // 2nd next() will be skipped because we already have a next() going to http://webrobots.io
    next("http://webrobots.io", "stepTwo");
    done();
}

steps.stepTwo = function () {
    done();
}
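
The effect of skipVisited on queued steps can be sketched with a standalone model. Everything here is a stand-in written for illustration: this setSettings(), this next(), the seenUrls set and the queue are assumptions, and the sketch tracks URLs only while the setting is enabled, which may not match the framework's internal bookkeeping.

```javascript
// Stand-ins for the framework's settings and queue (assumptions).
var settings = { skipVisited: false };
var seenUrls = new Set();
var queue = [];

function setSettings(newSettings) {
    Object.assign(settings, newSettings);
}

function next(url, nextStep, passedData) {
    if (settings.skipVisited) {
        if (seenUrls.has(url)) {
            return; // URL already requested: drop this next()
        }
        seenUrls.add(url);
    }
    queue.push({ url: url, step: nextStep, data: passedData });
}

// Default (skipVisited: false): duplicates are queued.
next("http://webrobots.io", "stepTwo");
next("http://webrobots.io", "stepTwo");

// After enabling the setting, a repeated URL is queued only once.
setSettings({ skipVisited: true });
next("http://webrobots.io/blog", "stepTwo");
next("http://webrobots.io/blog", "stepTwo"); // skipped

console.log(queue.length); // prints 3
```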