Web Scraping

Analysis of different web scraping approaches, best practices and challenges. Overview of Web Robots scraper features.

24 Feb, 2017

Scraping Extension Update – version 2017.2.23

February 24th, 2017 | Web Scraping | 0 Comments

Recently we rolled out an updated version of our main web scraping extension which contains several important updates and new features. This update allows our users to develop and debug robots even faster than before. So what exactly is new?

  1. jQuery has been upgraded from version 1.10.2 to 2.2.4.
  2. done() can now take a delay parameter in milliseconds. For example, done(1000); delays the end of the step by 1 second.
  3. A new Selectors tab allows testing selectors inline and generates robot code. Selectors are tested immediately on the browser’s active tab, so the developer can see whether they work correctly. The Copy code button copies JavaScript code to the clipboard, ready to be pasted directly into a robot’s step.
  4. […]

10 Nov, 2016

Writing Better Data Collection Robots

November 10th, 2016 | Web Scraping | 2 Comments

At Web Robots we provide fanatical customer support. A large part of this is technical support for robot developers. For this we maintain live chatrooms and often hold screenshares and joint code-writing sessions with each of our customers’ development teams. This helps us solve most of the problems our customers encounter in minutes, and the difficult ones in several hours. This is not an exaggeration.

Writing web scraping robot

Based on this accumulated experience, we identified the most common mistakes that robot writers make. We published a list with specific code examples that illustrate each mistake, followed by an explanation and a code example showing the solution. This list is a highly recommended read for […]

29 Sep, 2016

Announcement: Robot Naming Change

September 29th, 2016 | Web Scraping | 0 Comments

Recently we started enforcing a rule that robot names may contain only alphanumeric, underscore (_) and dash (-) characters and must be at least 3 characters long. The reason for this move is that robot names are used to generate run_id values and, later, file names. Some atypical characters in robot names were causing problems when files were processed with various ETL tools or stored in file systems. All existing robots were renamed by replacing non-compliant characters with underscores.
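
The rule above can be sketched in a few lines of JavaScript. This is only an illustration, not the check used by the portal: the names isValidRobotName and sanitizeRobotName are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```javascript
// Hypothetical helpers illustrating the naming rule; not part of the Web Robots API.

// Allowed: ASCII letters, digits, underscore and dash; at least 3 characters.
const VALID_ROBOT_NAME = /^[A-Za-z0-9_-]{3,}$/;

// Returns true when the name satisfies the rule.
function isValidRobotName(name) {
  return VALID_ROBOT_NAME.test(name);
}

// Replaces every non-compliant character with an underscore,
// in the same spirit as the migration applied to existing robots.
// It does not lengthen names that are shorter than 3 characters.
function sanitizeRobotName(name) {
  return name.replace(/[^A-Za-z0-9_-]/g, "_");
}
```

For example, isValidRobotName("yelp_demo-2") returns true, while sanitizeRobotName("my robot #1") returns "my_robot__1".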

-Web Robots Team

8 Jun, 2016

Scrape Instagram Followers

June 8th, 2016 | Web Scraping | 33 Comments

Our platform is often used by growth hackers for lead generation on social media networks. One such use case is building a list of Instagram followers from interesting profiles. Today we placed one such robot into our portal’s demo space for anyone to use. The robot is only 30 lines of JavaScript code and works quite fast. We tested it with IBM’s Instagram account, which has 78k followers, and it took only 14 minutes to scrape them.

How to use this robot:

  1. Log in to the Web Robots portal in the Chrome browser.
  2. Make sure you have the Web Robots Chrome extension installed to run the robot.
  3. Open the robot instagram_followers in our extension.
  4. Make […]

1 Mar, 2016

Scraping Yelp Data

March 1st, 2016 | Web Scraping | 3 Comments

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time, we have not seen a good business case for a commercial project that scrapes Yelp.

We have decided to release a simple example Yelp robot which anyone can run in Chrome on their own computer, tune to their own requirements and use to collect some data. With this robot you can save business contact information such as address, postal code, telephone numbers and website addresses. The robot is placed in the Demo space on the Web Robots portal for anyone to use: just sign up, find the robot and run it.

3 Dec, 2015

New Features

December 3rd, 2015 | Web Scraping | 0 Comments

We are happy to announce some new features in our robot writing framework. These features are:

  • Fork() – splits a robot into many parallel robots and runs them simultaneously. This feature shortens long scraping jobs by parallelising them. Cloud autoscaling provides the necessary instance capacity, so our customers can run hundreds of instances on demand.
  • skipVisited – allows a robot to skip steps for links that were already visited. This avoids data duplication and saves robot running time.
  • respectRobotsTxt – crawls target sources in compliance with their robots.txt file.

These features are explained in detail, with examples, on our framework documentation page.
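
The idea behind skipping visited links can be illustrated with a small visited-URL filter. This is a minimal sketch of the general technique, not the framework's implementation of skipVisited: createVisitedFilter and shouldVisit are hypothetical names, and treating URLs that differ only by fragment as the same page is an assumption.

```javascript
// Illustrative only: a visited-URL filter in the spirit of skipVisited.
// createVisitedFilter is a hypothetical name, not a framework function.
function createVisitedFilter() {
  const visited = new Set();

  // Returns true the first time a URL is seen and false on every repeat.
  return function shouldVisit(url) {
    const parsed = new URL(url);
    parsed.hash = ""; // assumption: the fragment does not change the page
    const key = parsed.href;
    if (visited.has(key)) {
      return false;
    }
    visited.add(key);
    return true;
  };
}
```

A crawler loop would call shouldVisit(link) before queuing each link and drop the link when the result is false, which avoids both duplicate records and repeated page loads.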

5 Aug, 2015

Scrape Twitter Followers

August 5th, 2015 | Web Scraping | 5 Comments

Today we released a simple robot which scrapes follower information from any Twitter user. It will be useful for anyone doing competitor analysis or researching who follows particular topics. The robot is placed in the Demo space on the Web Robots portal for anyone to use.

Twitter Scraper Easy Twitter Scraping

How to use it:

  1. Sign in to our portal here.
  2. Download our scraping extension from here.
  3. Find the robot named twitter_followers in the dropdown.
  4. Change the start URL to your target’s follower list. For example: https://twitter.com/werobots/followers
  5. Click Run.
  6. Let the robot finish its job and download the data from the portal.

In case you want to create your […]

18 Dec, 2014

How We Validate Data

December 18th, 2014 | Web Scraping | 1 Comment

Data is only valuable if it can be trusted. At weRobots we spend as much effort on validating data as on collecting it. It is a multi-stage process.

weRobots data validation workflow

  • Scraping

Initial checks happen in the scraper robots. A robot crawls the target website and looks for data. Captured data is sent to our staging database. Many abnormal situations can arise at this stage:

      • The site may be down. The robot will log warnings and retry pages that do not respond. Usually the outage is temporary and the robot resumes without intervention.
      • The site layout changes. If the robot cannot find navigation links or data, it will stop and report an error so […]

28 Nov, 2014

Insights from Google event “Always Ahead”

November 28th, 2014 | Web Scraping | 0 Comments

Yesterday weRobots attended an event sponsored by Google and Enterprise Lithuania. We went there expecting a small event (maybe a workshop) for geeks, but it was actually huge, with a big crowd, a full main hall at the LITEXPO conference center and three breakout sessions in the afternoon. The organisers were nice enough to let both weRobots co-founders enter with only one ticket, and we were selected for the “red room” breakout session, which was designated for the most advanced crowd.

Google Data Conference

The most interesting part of the conference was a strong presentation by Pawel Matkowski (Google IE). He teased the audience with a promise that we would see something that […]

16 Oct, 2014

Random Proxy Switcher

October 16th, 2014 | Web Scraping | 0 Comments

We decided to release a side product of our internal systems. It is a free Chrome extension which lets users browse the web while the proxy changes randomly every minute. Of course, the catch is that we do not give out the proxy servers themselves, but there are plenty of services that provide them. Alternatively, anyone can launch their own for free by following this tutorial.

proxy extension
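
The rotation idea described above can be sketched as follows. This is not the extension's source code: pickRandomProxy and startRotation are hypothetical names, the proxy list is assumed to be supplied by the user, and the step that actually applies the chosen proxy to the browser is left to the onSwitch callback.

```javascript
// Hypothetical sketch: choose a proxy at random from a user-supplied list.
// The random source is injectable so the choice can be made deterministic.
function pickRandomProxy(proxies, random = Math.random) {
  if (proxies.length === 0) {
    throw new Error("proxy list is empty");
  }
  const index = Math.floor(random() * proxies.length);
  return proxies[index];
}

// Calls onSwitch immediately and then once per interval with a fresh pick.
// The default interval is one minute. Returns a function that stops the rotation.
function startRotation(proxies, onSwitch, intervalMs = 60000) {
  onSwitch(pickRandomProxy(proxies));
  const timer = setInterval(() => onSwitch(pickRandomProxy(proxies)), intervalMs);
  return () => clearInterval(timer);
}
```

Because each pick is independent, the same proxy can be chosen twice in a row; a variant that excludes the current proxy would avoid that at the cost of slightly less randomness.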