17 Dec, 2019

Instant Data Scraper Update

By nicerobot|2019-12-17T15:59:01+02:00December 17th, 2019|Uncategorized|0 Comments

In October and November of this year we decided to survey Instant Data Scraper extension users to see where Web Robots team should focus for the next update. We already had some ideas from user emails that we received over last couple years, but we needed a more scientific proof to see which features would be most desired. Among features we consider things like infinite scroll support, running jobs on cloud, processing batches of URLs, proxy support, etc.

Before the end of the survey it became clear that infinite scroll support is by far most desired feature and decided to release it as soon as possible. One December 11th we published a 0.2.0 version to Chrome Webstore. Enjoy it!

Other features […]

25 Mar, 2019

Using Sitemaps in Web Scraping Robots

By nicerobot|2019-03-20T13:36:08+02:00March 25th, 2019|Web Scraping|1 Comment

We often use spidering through categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However there is a simpler and more straightforward approach for this – just using sitemaps. Sitemap based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.

After all, sitemaps are designed for robots to find all resources on a particular domain.

Example of a sitemap:

Finding Sitemaps

The fastest way to find a sitemap URL is to check robots.txt file. For example https://www.rottentomatoes.com/robots.txt
We can also […]

4 Mar, 2019

Scraping Dynamic Websites Using The wait() Function

By nicerobot|2019-03-04T15:12:11+02:00March 4th, 2019|Web Scraping|0 Comments

Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM.For example Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 product data appears in selector $(‘div[class^=”ProductPage__details”]’).


steps.start = function() {

    console.log($('div[class^="ProductPage__details"]').length);

    done();

};

Logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.Simple waiting strategy – use setTimeout()We can use setTimeout() where we specify the […]

20 Feb, 2019

Web Scraping Performance Tuning With fastnext()

By nicerobot|2019-03-04T15:14:00+02:00February 20th, 2019|Web Scraping|0 Comments

Scraping websites can be a time consuming process and when limited computing resources are available, combined with the need for frequent and up to date data, having a fast running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more efficient could save a lot of valuable time.

There are a number of ways to optimize your robot to run faster, replacing setTimeout with our internal wait function, careful usage of loops, not using excessive delay timers in step done function, etc. However, one of the best methods so far been has proven to be using ajax requests instead of visiting a website […]

2 Mar, 2017

Email And Social Media Links Crawling From Websites

By nicerobot|2019-03-04T15:29:54+02:00March 2nd, 2017|Web Scraping|3 Comments

At Web Robots we often get inquiries on projects to crawl social media links and emails from specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that and anyone can run such web scraping project using Web Robots Chrome extension on their own computer.

To start you will need account on Web Robots portal, Chrome extension and thats it. We placed a robot called leads_crawler in our portal’s Demo space so anyone can use it. In case robot’s code is changed below is complete source code for this robot. You […]

24 Feb, 2017

Scraping Extension Update – version 2017.2.23

By nicerobot|2019-03-04T15:35:16+02:00February 24th, 2017|Web Scraping|0 Comments

Recently we rolled out an updated version of our main web scraping extension which contains several important updates and new features. This update allows our users to develop and debug robots even faster than before. So what exactly is new?

jQuery has been upgraded from version 1.10.2 to 2.2.4
done() now can take a milliseconds delay parameter. For example done(1000); will delay step finish by 1 second.
New tab Selectors which allows testing selectors inline and generates robot code. Selectors are immediately tested on browser’s active tab so developer can see if they work correctly. Copy code button copies Javascript code to clipboard which can be pasted directly into robot’s step.

[…]

10 Nov, 2016

Writing Better Data Collection Robots

By nicerobot|2019-03-04T15:40:36+02:00November 10th, 2016|Web Scraping|2 Comments

At Web Robots we have a fanatical customer support. Large part of this is doing technical support for robot developers. For this we maintain live chatrooms, often do screenshares, joint code writing sessions with each of our customers’ development teams. This helps us solve most of the problems our customers encounter in minutes, difficult ones in several hours. This is not an exaggeration.

Based on the accumulated experience we were able to identify a list of the most common mistakes that robot writers can make. We published a list with specific code examples that illustrate the mistakes. Then each mistake has an explanation and a code example with solution. This list is highly recommended read for […]

8 Jun, 2016

Scrape Instagram Followers

By nicerobot|2019-03-04T16:02:00+02:00June 8th, 2016|Web Scraping|32 Comments

Our platform is often used by growth hackers for lead generation in social media networks. One such use case is building a list of Instagram followers from interestingprofiles. Today we placed one such robot into our portal‘s demo space for anyone to use. Robot is only 30 lines of Javascript code and works quite fast. We tested it with IBM’s Instagram which has 78k followers and it took only 14 minutes to scrape them.

How to use this robot:

Login to Web Robots portal on Chrome browser.
Make sure you have Web Robots Chrome extension to run the robot.
Open robot instagram_followers in our extension.
Make […]

1 Mar, 2016

Scraping Yelp Data

By nicerobot|2019-03-04T16:06:28+02:00March 1st, 2016|Web Scraping|3 Comments

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time we have not seen a good business case for a commercial project with scraping Yelp.

We have decided to release a simple example Yelp robot which anyone can run on Chrome inside your computer, tune to your own requirements and collect some data. With this robot you can save business contact information like address, postal code, telephone numbers, website addresses etc. Robot is placed in our Demo space on Web Robots portal for anyone to use, just sign up, find the robot and use it.

31 Dec, 2015

New Kickstarter Dataset

By nicerobot|2019-03-04T16:10:59+02:00December 31st, 2015|Datasets|2 Comments

Recently we updated our Kickstarter robot to crawl project subcategories. This allows us to collect a richer dataset, for example on 2015-12-17 run robot collected data about 144,263 projects with a running time only 2 hours! We also started presenting it in the JSON streaming format which is just a line delimited JSON. Previously we used to stuff all projects into JSON array and the downside of it was that user would have to read the entire large JSON file into memory before any kind of processing starts. with JSON streaming it is possible to read one line at a time.

Data is posted in the usual place.