Blog

14 Oct, 2016

Migrating PostgreSQL Databases From AWS RDS To Standalone

PostgreSQL | 0 Comments

Intro

AWS RDS is very convenient and takes care of almost all DBA tasks. It just works as long as you stay inside AWS. But if you want to keep a local copy of your database or need to move data to another host, it can be tricky.

TL;DR – for our solution, skip to the last section.

What doesn’t work

AWS daily backups

AWS RDS by default creates daily backups of your data. The first thought would be to grab such a backup and restore it locally. But these are not regular Postgres backups – they are probably VM image copies, though there is no way to know, as you cannot copy or inspect them. The only option is to restore them to another RDS instance.

PostgreSQL replication

AWS uses replication to maintain […]

29 Sep, 2016

Announcement: Robot Naming Change

Web Scraping | 0 Comments

Recently we started enforcing that robot names may contain only alphanumeric, underscore (_) and dash (-) characters and must be at least 3 characters long. The reason for this change is that robot names are used to generate run_id values and, later, file names. Unusual characters in robot names were causing problems when processing files with various ETL tools or storing them in file systems. All existing robots were modified by replacing non-compliant characters with underscores.
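
As a quick illustration, the new rule boils down to a simple pattern. A minimal sketch, assuming "alphanumeric" means ASCII letters and digits (normalizeRobotName is a hypothetical helper, not part of our API):

// Valid robot names: alphanumeric, underscore, dash; at least 3 characters.
const VALID_NAME = /^[A-Za-z0-9_-]{3,}$/;

// How existing names were fixed: non-compliant characters become underscores.
function normalizeRobotName(name) {
  return name.replace(/[^A-Za-z0-9_-]/g, '_');
}

VALID_NAME.test('price-watcher_2'); // true
normalizeRobotName('my robot!');    // "my_robot_"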

-Web Robots Team

8 Jun, 2016

Scrape Instagram Followers

Web Scraping | 29 Comments

Our platform is often used by growth hackers for lead generation in social media networks. One such use case is building a list of Instagram followers from interesting profiles. Today we placed one such robot into our portal's demo space for anyone to use. The robot is only 30 lines of JavaScript code and works quite fast. We tested it on IBM's Instagram account, which has 78k followers, and it took only 14 minutes to scrape them.

How to use this robot:

  1. Log in to the Web Robots portal in the Chrome browser.
  2. Make sure you have the Web Robots Chrome extension installed to run the robot.
  3. Open the robot instagram_followers in our extension.
  4. Make […]

1 Mar, 2016

Scraping Yelp Data

Web Scraping | 3 Comments

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time, we have not seen a good business case for a commercial Yelp scraping project.

We have decided to release a simple example Yelp robot which anyone can run in Chrome on their own computer, tune to their own requirements and use to collect some data. With this robot you can save business contact information such as addresses, postal codes, telephone numbers, website addresses, etc. The robot is available in our Demo space on the Web Robots portal for anyone to use: just sign up, find the robot and run it.

27 Jan, 2016

New Dataset – UK LPA Search

Datasets | 0 Comments

We are excited to announce UK LPA Search – a search engine covering all of the UK's local planning authorities. Until now there was no way to search the LPA databases from one place: one had to find each LPA's website and search inside it. Considering there are a few hundred of them, this would not be an easy task for a human. Our robots have no problem indexing all the databases and providing them as a single dataset.

A bonus: we geocoded all requests and display them on a map, so anyone can see what building permits are being issued around them. Example: Map of building permits in London […]

31 Dec, 2015

New Kickstarter Dataset

Datasets | 2 Comments

Recently we updated our Kickstarter robot to crawl project subcategories. This allows us to collect a richer dataset: for example, on the 2015-12-17 run the robot collected data about 144,263 projects with a running time of only 2 hours! We also started presenting the data in the JSON streaming format, which is simply line-delimited JSON. Previously we stuffed all projects into one JSON array, and the downside was that a user had to read the entire large JSON file into memory before any processing could start. With JSON streaming it is possible to read one line at a time.
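
Reading such a file is straightforward in any language. A minimal Node.js sketch (the file name is hypothetical – substitute the actual dataset file you downloaded):

// Read line-delimited JSON one record at a time instead of
// loading the whole file into memory first.
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('kickstarter.json'), // hypothetical file name
  crlfDelay: Infinity
});

rl.on('line', (line) => {
  const project = JSON.parse(line); // one Kickstarter project per line
  // process the project here; field names depend on the dataset
});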

Data is posted in the usual place.

3 Dec, 2015

New Features

Web Scraping | 0 Comments

We are happy to announce some new features in our robot writing framework. These features are:

  • Fork() – split a robot into many parallel robots and run them simultaneously. This feature shortens long scraping jobs by parallelising them. Cloud autoscaling handles the necessary instance capacity, so our customers can run hundreds of instances on demand.
  • skipVisited – allows the robot to intelligently skip steps for links that were already visited, avoiding data duplication and saving running time.
  • respectRobotsTxt – crawl target sources in compliance with their robots.txt file.

These features are explained in detail, with examples, on our framework documentation page.
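
As a rough sketch of how these options fit together (the option and function names come from the list above; the surrounding syntax is illustrative only – see the documentation page for the real API):

// Hypothetical robot snippet – not the documented syntax.
robot.configure({
  skipVisited: true,      // skip steps for links that were already visited
  respectRobotsTxt: true  // crawl in compliance with each target's robots.txt
});

// Fork(): split the remaining work across parallel robot instances;
// cloud autoscaling provisions the capacity on demand.
Fork(listOfStartUrls);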

5 Aug, 2015

Scrape Twitter Followers

Web Scraping | 5 Comments

Today we released a simple robot which scrapes follower information from any Twitter user. This will be useful for anyone doing competitor analysis or researching who follows particular topics. The robot is available in the Demo space on the Web Robots portal for anyone to use.

How to use it:

  1. Sign in to our portal here.
  2. Download our scraping extension from here.
  3. Find the robot named twitter_followers in the dropdown.
  4. Modify the start URL to point at your target's follower list. For example: https://twitter.com/werobots/followers
  5. Click Run.
  6. Let the robot finish its job and download the data from the portal.
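
Conceptually, a robot like this boils down to collecting follower handles from the followers page. A generic browser-console sketch (the selector is a placeholder, not Twitter's real markup – the actual robot is written in our framework):

// Illustrative only: collect visible follower usernames from the page.
const handles = [];
document.querySelectorAll('.follower-card .username').forEach((el) => {
  handles.push(el.textContent.trim());
});
console.log(handles.join('\n'));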

In case you want to create your […]

17 Feb, 2015

PostgreSQL 9.4 JSON Queries

PostgreSQL | 0 Comments

Intro

Querying JSON with SQL is extremely powerful and convenient. Some great things about it:

  • Use SQL to query unstructured data
  • Join relational and JSON tables
  • Convert between JSON and relational schema

But query writing can be difficult and non-obvious at first. The official documentation doesn't have many samples, and many useful queries need other great but not widely known PostgreSQL features, such as LATERAL joins and recursive queries.

This tutorial has some real world examples.

Get some data

Let's use GitHub Archive as a source of large JSON documents with complex structure:

wget http://data.githubarchive.org/2015-01-01-15.json.gz
gzip -d 2015-01-01-15.json.gz

Load JSON to PostgreSQL

Super easy – the trick is to COPY in CSV mode with quote, delimiter and escape characters that never occur in the data, so each line of the file lands intact in a single JSON column. You need a table with a single json/jsonb column first (the queries below assume it is named js):

CREATE TABLE github (js jsonb);

COPY github FROM 'c:\temp\2015-01-01-15.json'
WITH (format csv, quote e'\x01', delimiter e'\x02', escape e'\x01');

Query returned successfully: 11351 rows affected, 1025 ms execution time.

Do some simple queries

Still straightforward:

select js->>'type', count(*) from github group by [...]