Web Robots Documentation

2017-03-16T10:54:12+00:00

Robots are written in JavaScript using step functions. Besides our own step functions, the following libraries are included and can be used in any step within a robot:

  • jQuery – work with DOM elements
  • underscore.js – helpers for working with objects and arrays (deduplicate, filter, sort, etc.)
  • Moment.js – parse and manipulate times and dates

Robot Building Blocks

steps.stepName = function ([Object passedData]) {}

A step function is a block of code that is injected into a loaded page and executed.
stepName – name given to a step function. The first step must always be named “start”; any names can be used after that.
passedData – any value passed to this step from a previous step.

Example 1:

steps.start = function () {
  // do something, emit data, generate next steps, etc
  done();
};

Example 2:

steps.start = function () {
    var some_data = "Hello World!";
    next("http://google.com", "stepTwo", some_data);
    done();
};

steps.stepTwo = function (passed_data) {
    // Wow, I am scraping Google!
    alert(passed_data);
    done();
};

Functions

done([int delay])

done() notifies the extension that the current step has finished executing so that the extension can run the next step. IMPORTANT: this function must be called ONLY ONCE during step execution. Typically it is placed at the end of the step. done() must be omitted when the step clicks a DOM element that causes another page to load (for example, clicking a button to submit a form); in that case done() is called automatically by the page-load event.

delay – Optional delay in milliseconds. Used to slow down robot’s crawling rate.

Example:

steps.secondStep = function () {
    var data = { name: $("h1").text().trim() };
    emit("Data", [data]);
    done(1000); // done with a delay of 1 second
}

next(String url, String nextStep[, Object passedData])

next() adds a new step to the execution queue.
url – a URL to load for this particular step. If no loading is necessary, an empty string “” must be passed. Use “” when the previous step already loads the page needed by this step (usually by clicking a button or submitting a form).
nextStep – a string containing step’s name to execute.
passedData – any data that should be passed to the next step. This parameter is optional.

IMPORTANT: The Web Robots system executes the queue on a LIFO (last in, first out) basis. If many next() calls are made, the queue manager executes the most recent one first.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    done();
}

steps.stepTwo = function () {
    console.log("Hello from stepTwo!");
    done();
}
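
The LIFO ordering can be illustrated with a small standalone simulation. This is only a sketch: queue, next and runAll below are hypothetical stand-ins defined for the example, not the extension's actual queue manager, and no pages are loaded.

```javascript
// Hypothetical stand-in for the execution queue; the real queue manager is internal.
var queue = [];
var executed = [];

// Stand-in for next(): records the step at the end of the queue.
function next(url, nextStep, passedData) {
    queue.push({ url: url, step: nextStep, data: passedData });
}

// Runs queued steps last in, first out, recording the order of URLs.
function runAll() {
    while (queue.length > 0) {
        var item = queue.pop(); // take the most recently added step first
        executed.push(item.url);
    }
}

next("http://example.com/page1", "scrape");
next("http://example.com/page2", "scrape");
next("http://example.com/page3", "scrape");
runAll();

console.log(executed.join(", "));
// http://example.com/page3, http://example.com/page2, http://example.com/page1
```

Under this ordering, pages that must be processed in a particular sequence would need their next() calls issued in reverse.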

fork(String url, String nextStep[, Object passedData])

fork() is like next(), except that it starts a new robot from that particular step. fork() is useful when running large robots whose parts can run in parallel. In development mode fork() behaves like next() to simplify development and troubleshooting; actual forking is performed on cloud instances in production mode.

url – a URL to load for this particular step. Cannot be an empty string.
nextStep – a string containing step’s name to execute.
passedData – any data that should be passed to the next step. This parameter is optional.

IMPORTANT: fork() should be used only once during a robot run. A newly forked robot starts with default settings, so any settings regarding retries, proxy, skipVisited, etc. should be applied after fork(). fork() can launch many parallel robots, which can generate significant load on the target website.

Example:

steps.start = function () {
    $("a.city").each(function(i,v) {
        fork(v.href, "processCity"); // this will kick off a newly forked robot for each a.city element
    });
    done();
}

emit(String tableName, array Data)

emit() sends extracted data to the database. It can be called several times during a single step.
tableName – the name of the table to which data will be saved. One robot can emit data into multiple tables. Example: during execution a robot emits data to the “Users” and “Products” tables.
Data – an array of JSON objects. Even when emitting a single JSON object, put it into an array.

IMPORTANT: Always emit an array. If a single value needs to be emitted, still wrap it in array notation, for example emit("Strings", [myString]).

Example:

steps.emitData = function () {
    var data = [];
    $("#product-table tr").each(function(i, v) {
        // collect one object per table row
        var product = {
            name : $(".product-name", v).text(),
            price : $(".product-price", v).text()
        };
        data.push(product);
    });

    emit("Products", data);
    done();
}
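
Because emit() always expects an array, a small helper can normalize values before they are emitted. The following is a minimal sketch; toEmitArray is a hypothetical helper name, not part of the Web Robots API, and the sample objects are made up.

```javascript
// Hypothetical helper: guarantees the value handed to emit() is an array.
function toEmitArray(value) {
    return Array.isArray(value) ? value : [value];
}

var single = { name: "Widget", price: "9.99" };
var many = [{ name: "A" }, { name: "B" }];

console.log(toEmitArray(single).length); // 1
console.log(toEmitArray(many).length);   // 2

// Inside a step this could be used as:
// emit("Products", toEmitArray(single));
```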

click(var selector)

Performs a click on a DOM element found by selector. click() simulates a real click better than the jQuery .click() function because it fires the series of events that would occur during a click by a person (mousedown, click, mouseup, etc.).
selector – this variable can be one of two types: a string CSS selector of an item to click on, or a DOM object to click on.

IMPORTANT: Do not use done() in a step where click() results in a new page loading.

Example:

steps.doSomeClicking = function() {
    click(".button");   // Click through string CSS selector
    
    var button = $(".button")[0];
    click(button);      // Click through DOM element
    
    done();             // Include done() only if click does not load a new page!
}
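
When a click loads a new page, the follow-up step can be queued with an empty URL, as described for next(). The sketch below assumes the follow-up step is queued before the click is made. The steps, next and click definitions at the top are stubs added only so the snippet runs outside the extension, and the selector "#search-form button" and the step name "results" are made up for illustration.

```javascript
// Stubs so the sketch runs outside the extension; the real functions are provided by Web Robots.
var steps = {};
var calls = [];
function next(url, nextStep, passedData) { calls.push(["next", url, nextStep]); }
function click(selector) { calls.push(["click", selector]); }

steps.start = function () {
    // Queue the step that will handle the page loaded by the click.
    // The URL is "" because the click itself loads the page.
    next("", "results");
    click("#search-form button");
    // No done() here: the page load triggered by the click completes the step.
};

steps.start();
console.log(JSON.stringify(calls));
```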

steps.waitFor(string selector[, int maxWaitTime]).then(function actions)

waitFor is used to wait for a specific DOM element to appear on a page and then perform some actions. Useful when some elements appear dynamically some time after the initial page loading.
selector – a string CSS selector expected to appear in DOM.
maxWaitTime – optional maximum waiting time in milliseconds. The default of 10,000 ms is used if this parameter is omitted.
actions – a function that is executed after selector appears in DOM.

Example:

steps.waitFor(".magic-table", 5000).then(function() {
    // .magic-table appeared, we can scrape it now.
    done();     // done() must be placed inside waitFor
});

setUA(string userAgent)

Sets the browser’s User Agent to the specified userAgent. This function is useful when interacting with a mobile source. The User Agent stays in effect for the duration of the robot run.
userAgent – User Agent string.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    setUA("Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)");
    done();
}

steps.stepTwo = function () {
    console.log("I visited webrobots.io with custom User-Agent!");
    done();
}

setProxy(string server[, int port])

IMPORTANT: This is an experimental function. It can leave the browser stuck with the proxy setting if the robot does not finish running cleanly.

Configures Chrome to start using the proxy specified in the server string. If port is omitted, it is set to 8888. Proxy settings remain in effect until setProxy() is called again, resetProxy() is called, or the robot finishes its run.
server – proxy server.
port – proxy server’s port. This parameter is optional.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    setProxy("10.10.30.199", 8050);
    done();
}

steps.stepTwo = function () {
    console.log("I visited webrobots.io through Proxy!");
    done();
}

clearCookies()

Clears browser cookies for the current page.

Example:

clearCookies();

setRetries(int interval, int count, int total)

Modifies the robot’s step retry behaviour from the default values. The defaults are a 60,000 ms retry interval, a per-step retry count of 3, and a total retry count of 150. Under these defaults, if a step is retried 3 times and still fails, the robot proceeds to the next step. If the robot accumulates 150 retries during a run, it stops and marks the run as failed.
interval – retry interval in milliseconds.
count – maximum retry count on a single step.
total – cumulative allowed retry limit during robot run. If this limit is reached robot will stop and mark run as failed.

Best practice is to increase the retry parameters only if the source is unreliable and reloading a page can yield results. If the source is reliable and errors are caused by error-prone JavaScript in the robot, more retries will not help.

Example:

steps.start = function () {
    next("http://webrobots.io", "stepTwo");
    // set the robot to retry after 10 seconds, up to 10 times per step, and to stop once a total of 6000 retries is reached
    setRetries(10000, 10, 6000);
    // All subsequently executed steps will adhere to the new retry policy
    done();
}

steps.stepTwo = function () {
    console.log("This step will be retried in 10 seconds if there is no done() event");
    done();
}
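
The interaction between the per-step count and the run-wide total can be sketched as a small simulation. This illustrates the rules described above under simplifying assumptions (no waiting between retries, failures known in advance); it is not the extension's implementation, and simulateRun and its inputs are hypothetical.

```javascript
// failuresPerStep[i] = how many times step i fails before it would succeed.
// count = maximum retries on a single step, total = maximum retries for the whole run.
function simulateRun(failuresPerStep, count, total) {
    var totalRetries = 0;
    var skipped = 0;
    for (var i = 0; i < failuresPerStep.length; i++) {
        var retries = 0;
        var remainingFailures = failuresPerStep[i];
        while (remainingFailures > 0) {
            if (retries === count) { // per-step limit reached: move on to the next step
                skipped++;
                break;
            }
            retries++;
            totalRetries++;
            remainingFailures--;
            if (totalRetries >= total) { // run-wide limit reached: the run is marked as failed
                return { status: "failed", totalRetries: totalRetries, skipped: skipped };
            }
        }
    }
    return { status: "finished", totalRetries: totalRetries, skipped: skipped };
}

// Default limits from the text: 3 retries per step, 150 in total.
console.log(JSON.stringify(simulateRun([0, 2, 5], 3, 150)));
// {"status":"finished","totalRetries":5,"skipped":1}
```

In this run the third step exhausts its 3 retries and is skipped, while the run as a whole stays well under the total limit.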

isNumber(var number)

Returns true if variable number can be parsed as a real number and false if not.

Example:

isNumber("123abc"); // returns false
isNumber("123.4545"); // returns true
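
A check with this behaviour is commonly written by combining parseFloat and isFinite. The sketch below is an assumption about how such a function could look, not the actual Web Robots implementation; it is named isNumberSketch to keep that distinction clear.

```javascript
// Assumed equivalent for illustration; not the library's source.
function isNumberSketch(n) {
    // parseFloat rejects values with no leading number; isFinite rejects trailing junk and Infinity.
    return !isNaN(parseFloat(n)) && isFinite(n);
}

console.log(isNumberSketch("123abc"));   // false
console.log(isNumberSketch("123.4545")); // true
console.log(isNumberSketch(42));         // true
console.log(isNumberSketch(""));         // false
```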

setSettings(Object settings)

Changes robot running parameters for the upcoming steps. Currently supported settings are:

skipVisited – boolean setting that controls what happens when a next() call would open an already visited page. When set to true, such calls are skipped. Applies only to next() calls made after the setting is enabled. Default value: false.
respectRobotsTxt – boolean setting that controls the robot’s behaviour with respect to the robots.txt file. If set to true, the robot reads and parses robots.txt from the target domain and follows its directives. Applies only to next() calls made after the setting is enabled. Default value: false.

Example:

steps.start = function () {
    setSettings({ skipVisited : true, respectRobotsTxt : true});
    // 1st next()
    next("http://webrobots.io", "stepTwo");
    // 2nd next() will be skipped because we already have a next() going to http://webrobots.io
    next("http://webrobots.io", "stepTwo");
    done();
}

steps.stepTwo = function () {
    done();
}
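
The effect of skipVisited can be pictured as a set of already requested URLs that next() consults before queueing. The following is a rough simulation under that assumption; makeQueue and its internals are hypothetical and do not reflect how the extension actually tracks visited pages.

```javascript
// Hypothetical model of a queue that can skip already requested URLs.
function makeQueue(settings) {
    var seen = Object.create(null); // URL -> true, with no inherited keys
    var queued = [];
    return {
        next: function (url, nextStep) {
            if (settings.skipVisited && seen[url]) {
                return false; // URL already requested: skip it
            }
            seen[url] = true;
            queued.push({ url: url, step: nextStep });
            return true;
        },
        size: function () { return queued.length; }
    };
}

var withSkip = makeQueue({ skipVisited: true });
withSkip.next("http://webrobots.io", "stepTwo");
withSkip.next("http://webrobots.io", "stepTwo"); // skipped
console.log(withSkip.size()); // 1

var withoutSkip = makeQueue({ skipVisited: false });
withoutSkip.next("http://webrobots.io", "stepTwo");
withoutSkip.next("http://webrobots.io", "stepTwo"); // queued again
console.log(withoutSkip.size()); // 2
```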