Scraping Tutorial

1. Set up tools

Install the Chrome browser. Download it from here.

Install the Web Robots Chrome extension. Get it here.

Sign up on our portal: portal.webrobots.io

2. Create a new robot on the portal

Click ‘Robots’ in the top menu.

Click ‘New robot’ at the bottom of the robots list.

Choose any name and a start URL.

In this example we’ll use demo1 for the name and http://careers.stackoverflow.com/jobs?sort=p as the start URL.

3. Hello world robot

Open the Web Robots extension and select your new robot from the drop-down.

Enter this code:

steps.start = function() {
    dbg("Hello, world!"); // write a message to the robot's log
    done();               // tell the robot this step is finished
};
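This is the robot's basic skeleton: step functions live on the global steps object, and the start step runs on your start URL. Because steps execute inside the loaded page, you can also log live page values. A minimal sketch (the exact values logged here are just an illustration):

steps.start = function() {
    // dbg() messages appear in the [Log] tab; the code runs in the
    // context of the loaded page, so DOM and jQuery calls work here.
    dbg("Page title: " + document.title);
    dbg("Links on page: " + $("a").length);
    done();
};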

 

Click the “Run” button.

Wait until the status message says the robot is finished:

 Run finished in: 1.075: demo_2015-01-09T09_25_44_727Z Done: 2 Que: 0 EmitQue: 0

See the [Log] tab to view log messages.

4. Save data to the server

Modify the code to be:


steps.start = function() {
    dbg("Hello, world!");

    var rows = [];
    var row = {};

    // Build one sample listing row.
    row.price = 1;
    row.address = "100 main street";
    row.city = "New York";
    row.rooms = 5;
    row.area = 50;
    rows.push(row);

    // Send the rows to the server as a dataset named "listings".
    emit("listings", rows);
    done();
};

Run the robot and check the [Output] tab. It shows the data we just published:

 
[
  {
    "price": 1,
    "address": "100 main street",
    "city": "New York",
    "rooms": 5,
    "area": 50,
    "source_url": "http://newyork.craigslist.org/"
  }
]
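Note the source_url field: the code above never sets it, so the platform appears to add the page URL to every emitted row automatically. emit() takes a dataset name and an array of row objects, which means you can accumulate many rows before a single call. A minimal sketch (the loop and field values are purely illustrative):

steps.start = function() {
    var rows = [];
    // Build three sample rows; emit() uploads the whole array at once.
    for (var i = 1; i <= 3; i++) {
        rows.push({ price: i * 100, address: i + " main street" });
    }
    emit("listings", rows);
    done();
};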

5. Extract some data from the page

Modify the robot code as shown below. This robot will extract every job posting from the page.

steps.start = function() {
    var rows = [];

    // Loop over every job card on the page.
    $(".-job").each(function (i, v) {
        var row = {};

        row.company = $(".employer", v).text().trim();
        row.location = $(v).find(".location").text().split("•").pop().trim();
        row.position = $(v).find("a.job-link:first").text().trim();
        row.tags = $(v).find("div.tags").text().trim();

        rows.push(row);
    });

    emit("jobs", rows);
    done();
};

After executing the robot, check that the data appeared on the portal. Find your robot on the portal, open it, and see that there is a new run with your data.
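A convenient way to develop selectors like the ones above is to try them in Chrome’s DevTools console on the target page before putting them into robot code. A quick sketch (this assumes the page itself loads jQuery, as careers.stackoverflow.com did at the time; on pages without jQuery, use document.querySelectorAll instead):

// Paste into the DevTools console on the jobs page:
$(".-job").length;                                   // how many job cards match
$(".-job").first().find(".employer").text().trim();  // company on the first card
$(".-job").first().find("a.job-link:first").text().trim(); // its job title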

6. Iterate over multiple pages

Add code to follow the “Next” link on each page.

steps.start = function() {
    var rows = [];

    // Extract every job card on the current page.
    $(".-job").each(function (i, v) {
        var row = {};

        row.company = $(".employer", v).text().trim();
        row.location = $(v).find(".location").text().split("•").pop().trim();
        row.position = $(v).find("a.job-link:first").text().trim();
        row.tags = $(v).find("div.tags").text().trim();

        rows.push(row);
    });

    // If there is a "Next" link, queue its URL to be processed
    // by this same "start" step.
    if ($(".test-pagination-next").length > 0) {
        next($(".test-pagination-next")[0].href, "start");
    }

    emit("jobs", rows);
    done();
};

Refresh the portal page and you will see a new run with data extracted from all pages.
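The second argument to next() names the step to run once the queued page loads; passing "start" above is what makes the robot loop over result pages. The same mechanism can send a robot into each job’s detail page with a separate step. The sketch below is only an illustration: the detail-page selector and the job_details dataset are assumptions, not part of this tutorial’s verified code.

steps.start = function() {
    // Queue every job posting's detail page for the "job" step.
    $(".-job a.job-link").each(function (i, v) {
        next(v.href, "job"); // v is an anchor element, so v.href is its URL
    });
    done();
};

steps.job = function() {
    var row = {};
    row.position = $("h1").first().text().trim(); // selector is an assumption
    row.url = window.location.href;
    emit("job_details", [row]);
    done();
};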