27 Jan, 2016

New Dataset – UK LPA Search

By nicerobot | January 27th, 2016 | Datasets | 0 Comments

We are excited to announce UK LPA Search, a search engine for all of the UK’s local planning authorities (LPAs). Until now there was no way to search LPA databases from one place: one had to find each LPA’s website and search inside it. Considering there are a few hundred of them, this would not be an easy task for a human. Our robots have no problem indexing all the databases and providing them as a single dataset.

A bonus point: we geocoded all requests and display them on a map, so anyone can see what building permits are being issued around them. Example: Map of building permits in London […]

31 Dec, 2015

New Kickstarter Dataset

By nicerobot | December 31st, 2015 | Datasets | 2 Comments

Recently we updated our Kickstarter robot to crawl project subcategories. This allows us to collect a richer dataset: for example, the 2015-12-17 run collected data on 144,263 projects with a running time of only 2 hours. We also started publishing it in the JSON streaming format, which is simply line-delimited JSON. Previously we put all projects into a single JSON array, and the downside was that a user had to read the entire large JSON file into memory before any processing could start. With JSON streaming it is possible to read one line at a time.
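To illustrate the difference, here is a minimal Python sketch of reading a line-delimited JSON file one record at a time. The function names are made up for this example, and nothing here assumes a particular field layout in the actual dataset.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records while holding only one of them in memory at a time."""
    return sum(1 for _ in iter_records(path))
```

Because `iter_records` is a generator, processing can start as soon as the first line is read, which is the advantage over a single large JSON array.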

Data is posted in the usual place.

3 Dec, 2015

New Features

By nicerobot | December 3rd, 2015 | Web Scraping | 0 Comments

We are happy to announce some new features in our robot writing framework. These features are:

Fork(): splits a robot into many parallel robots and runs them simultaneously. This feature shortens long scraping jobs by parallelising them. Cloud autoscaling provides the necessary instance capacity, so our customers can run hundreds of instances on demand.
skipVisited: allows a robot to intelligently skip steps that lead to links it has already visited. This avoids data duplication and saves robot running time.
respectRobotsTxt: crawls target sources in compliance with their robots.txt files.
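The ideas behind skipVisited and respectRobotsTxt can be sketched in plain Python. This is not our framework's API: `make_link_filter` and `should_visit` are hypothetical names, and the sketch only shows the general technique using the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def make_link_filter(robots_txt: str, user_agent: str = "*"):
    """Return a function that decides whether a URL should still be crawled."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules, no network call
    visited = set()

    def should_visit(url: str) -> bool:
        if url in visited:
            return False  # already crawled: skip to avoid duplicate data
        if not parser.can_fetch(user_agent, url):
            return False  # disallowed by the site's robots.txt
        visited.add(url)
        return True

    return should_visit
```

A crawler would call `should_visit(url)` before following each link and skip the link whenever it returns `False`.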

These features are explained in detail, with examples, on our framework documentation page.

22 Oct, 2015

Fresh Kickstarter Datasets

By nicerobot | October 22nd, 2015 | Datasets | 0 Comments

We have been swamped with work and have not updated our Kickstarter dataset page in a while. To correct this, today we posted new datasets retrieved in June, August and October. They are listed in the usual place: http://webrobots.io/kickstarter-datasets/

Enjoy!

5 Aug, 2015

Scrape Twitter Followers

By nicerobot | August 5th, 2015 | Web Scraping | 5 Comments

Today we released a simple robot which scrapes follower information from any Twitter user. This will be useful for anyone doing competitor analysis or researching who follows particular topics. The robot is available in the Demo space on the Web Robots portal for anyone to use.

Easy Twitter Scraping

How to use it:

Sign in to our portal here.
Download our scraping extension from here.
Find the robot named twitter_followers in the dropdown.
Modify the start URL to your target’s follower list. For example: https://twitter.com/werobots/followers
Click Run.
Let the robot finish its job and download the data from the portal.

In case you want to create your […]

17 Feb, 2015

PostgreSQL 9.4 JSON Queries

By nicerobot | February 17th, 2015 | PostgreSQL | 0 Comments

Intro

Querying JSON with SQL is extremely powerful and convenient. Some great things about it:

Use SQL to query unstructured data
Join relational and JSON tables
Convert between JSON and relational schema

But query writing can be difficult and non-obvious at first. The official documentation doesn’t have many samples. Many useful queries need other great but not widely known features of PostgreSQL, like LATERAL joins and recursive queries.

This tutorial has some real world examples.

Get some data

Let’s use GitHub Archive as a source of large JSON with a complex structure:

wget http://data.githubarchive.org/2015-01-01-15.json.gz
gzip -d 2015-01-01-15.json.gz

Load JSON to PostgreSQL

Super easy:

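-- Control characters \x01 and \x02 cannot appear unescaped in valid JSON, so each line loads as one field.
COPY github FROM 'c:\temp\2015-01-01-15.json'
WITH (format csv, quote e'\x01', delimiter e'\x02', escape e'\x01');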

Query returned successfully: 11351 rows affected, 1025 ms execution time.
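Why does loading JSON with the CSV format work? A rough analogy in Python, assuming its `csv` module behaves similarly enough to PostgreSQL's CSV parser for this purpose: when the quote and delimiter are control characters that never occur in the file, every line is read as a single field, and commas, double quotes and backslashes inside the JSON are left untouched. The function name below is hypothetical.

```python
import csv
import json


def parse_json_lines_as_csv(lines):
    """Read each line as a single CSV field, then decode that field as JSON."""
    # \x02 as delimiter and \x01 as quote never occur in valid JSON text,
    # so the CSV reader never splits or unquotes anything.
    reader = csv.reader(lines, delimiter="\x02", quotechar="\x01")
    return [json.loads(row[0]) for row in reader if row]
```

Each returned element corresponds to one line of the input, just as each line of the file becomes one row in the table.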

Do some simple queries

Still straightforward:

select js->>'type', count(*) from github group by [...]

18 Dec, 2014

How We Validate Data

By nicerobot | December 18th, 2014 | Web Scraping | 1 Comment

Data is only valuable if it can be trusted. At weRobots we spend as much effort on validating data as on collecting it. It is a multi-stage process.

Scraping

Initial checks happen in the scraper robots. A robot crawls the target website and looks for data. Captured data is sent to our staging database. Many abnormal situations can arise at this stage:

- Site may be down. The robot logs warnings and retries pages that do not respond. Usually the outage is temporary and the robot resumes without intervention.
- Site layout changes. If the robot cannot find navigation links or data, it will stop and report an error so […]
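The first situation, retrying pages that do not respond, can be sketched as follows. This is a minimal illustration and not our robots' actual code: `fetch_with_retry`, its parameters and the `fetch` callable are all hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), retrying on connection errors; return None if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as error:  # includes ConnectionError and TimeoutError
            print(f"warning: attempt {attempt} for {url} failed: {error}")
            if attempt < attempts:
                sleep(delay_seconds)  # wait before the next try
    return None
```

Returning `None` after the last attempt leaves the decision to the caller, for example stopping the run and reporting an error.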

28 Nov, 2014

Insights from Google event “Always Ahead”

By nicerobot | November 28th, 2014 | Web Scraping | 0 Comments

Yesterday weRobots attended an event sponsored by Google and Enterprise Lithuania. We went there expecting a small event (maybe a workshop) for geeks, but it was actually huge, with a big crowd, a full main hall at the LITEXPO conference center and three breakout sessions in the afternoon. The organisers were nice enough to let both weRobots co-founders enter with only one ticket, plus we were selected for the “red room” breakout sessions, which were designated for the most advanced crowd.

The most interesting part of the conference was a strong presentation by Pawel Matkowski (Google IE). He teased the audience with a promise that we will see something that […]

16 Oct, 2014

Random Proxy Switcher

By nicerobot | October 16th, 2014 | Web Scraping | 0 Comments

We decided to release a side product of our internal systems. It is a free Chrome extension which allows users to browse the web while randomly changing the proxy every minute. Of course, the catch is that we do not give out the proxy servers themselves, but there are plenty of services that provide them. Or anyone can launch their own for free by following this tutorial.
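The rotation idea itself is simple. Below is a minimal Python sketch of it, not the extension's code (the extension runs in Chrome): the class name, the proxy strings and the injectable random generator are all assumptions made for illustration.

```python
import random
from typing import List, Optional


class RandomProxyRotator:
    """Pick a new random proxy whenever the rotation interval has elapsed."""

    def __init__(self, proxies: List[str], interval_seconds: float = 60.0,
                 rng: Optional[random.Random] = None):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.proxies = list(proxies)
        self.interval_seconds = interval_seconds
        self.rng = rng or random.Random()
        self.current: Optional[str] = None
        self.switched_at: Optional[float] = None

    def proxy_at(self, now: float) -> str:
        """Return the proxy to use at time `now`, given in seconds."""
        expired = (self.switched_at is None
                   or now - self.switched_at >= self.interval_seconds)
        if expired:
            # A random pick may repeat the previous proxy; that is acceptable here.
            self.current = self.rng.choice(self.proxies)
            self.switched_at = now
        return self.current
```

In real use, `now` would come from a clock such as `time.monotonic()`, and the returned address would be applied to outgoing requests.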

13 Oct, 2014

How to Setup a Free Proxy Server on Amazon EC2

By nicerobot | October 13th, 2014 | Web Scraping | 1 Comment

Open Amazon Web Services (AWS) Account

Go to the AWS Portal and sign up. A credit card will be required, but Amazon will not charge anything. Amazon will also ask for your phone number and verify it. Amazon EC2 offers free Micro Instances, which are good enough for a proxy server setup. They remain free for the first year of AWS usage.

Creating an EC2 Instance

Once you have the account, log in to the AWS Management Console and, from the EC2 Dashboard, click Launch Instance. Follow the steps and launch an Ubuntu Server instance that is marked as Free tier eligible. Make sure you download the SSH key file (.pem), as it will be needed to connect to the server.
