Blog

6 May, 2019

Web Scraping vs Web Crawling


The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at a couple of terms to try and answer these questions: Web Scraping and Web Crawling.

Formal Answer

Let’s start with the formal definitions:

Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – extracting specific data from […]

10 Apr, 2019

Advanced AJAX Techniques for Web Scraping


Basic AJAX usage within Web Robots scraper

The best and simplest way to perform AJAX calls with the scraper is to use jQuery’s $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.

// Standard jQuery AJAX call
$.ajax({
    url: 'https://webrobots.io',
    method: 'GET'
}).done(function(resp) {
    console.log(resp);
});

// Simplified AJAX call
$.get('https://webrobots.io').done(function(resp) {
    console.log(resp);
});

Since AJAX is asynchronous, the step’s done() should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn’t be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.

Example of incorrect and correct done() placement in AJAX:

INCORRECT

steps.start = function() {
    $.get('https://webrobots.io').done(function(resp) {
[…]
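
For comparison, here is a minimal illustrative sketch of the correct placement (our illustration, not the code from the original post): done() is called only inside the AJAX callback, after the response has arrived.

CORRECT (illustrative)

steps.start = function() {
    $.get('https://webrobots.io').done(function(resp) {
        console.log(resp);
        done();    // the step finishes only after the response has arrived
    });
};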

25 Mar, 2019

Using Sitemaps in Web Scraping Robots


We often use the spidering-through-categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach for this – just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.

After all, sitemaps are designed for robots to find all resources on a particular domain.

Example of a sitemap:
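
The original example is not reproduced in this excerpt; purely as an illustration, a typical sitemap follows the standard sitemaps.org XML format, roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/product/1</loc>
    <lastmod>2019-03-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/product/2</loc>
  </url>
</urlset>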

Finding Sitemaps
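
The excerpt ends here, but as a rough sketch of the idea (ours, not the post’s code): sitemaps are commonly listed in a site’s robots.txt or served at /sitemap.xml, and the URLs they contain can be pulled out with jQuery.

// Illustrative sketch only – the sitemap path is an assumption; check robots.txt for the real location.
steps.start = function() {
    $.get('https://example.com/sitemap.xml').done(function(resp) {
        var urls = [];
        $(resp).find('loc').each(function() {
            urls.push($(this).text());    // collect every <loc> URL listed in the sitemap
        });
        console.log(urls.length + ' URLs found in sitemap');
        done();    // done() stays inside the AJAX callback
    });
};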

4 Mar, 2019

Scraping Dynamic Websites Using The wait() Function


Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered, so we must wait and then extract the data from the DOM. On Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in the selector $('div[class^="ProductPage__details"]').

steps.start = function() {
    console.log($('div[class^="ProductPage__details"]').length);
    done();
};

Logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy – use setTimeout()

We can use setTimeout() where we specify […]
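
The excerpt cuts off here; purely as an illustration of the setTimeout() approach it describes, a sketch could look like this (the 5-second delay is an arbitrary assumption):

steps.start = function() {
    // Wait a fixed amount of time before reading the DOM.
    // 5000 ms is a guess at how long the AJAX-loaded data takes to appear.
    setTimeout(function() {
        console.log($('div[class^="ProductPage__details"]').length);
        done();    // finish the step only after the delayed extraction
    }, 5000);
};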

20 Feb, 2019

Web Scraping Performance Tuning With fastnext()


Scraping websites can be a time-consuming process, and when limited computing resources are combined with the need for frequent and up-to-date data, having a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, so making a robot just fractionally more efficient could save a lot of valuable time.

There are a number of ways to optimize your robot to run faster: replacing setTimeout() with our internal wait() function, careful usage of loops, not using excessive delay timers in the step’s done() function, etc. However, one of the best methods so far has proven to be using AJAX requests instead of visiting a website […]
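
As a rough illustration of that idea (our sketch, not the post’s actual example), a step can fetch a page’s HTML with $.get() and parse it in place instead of navigating the browser to it; the URL and selector below are assumptions:

// Illustrative sketch: fetch the HTML over AJAX instead of loading the page,
// then parse it with jQuery. URL and selector are examples only.
steps.start = function() {
    $.get('https://webrobots.io/blog/').done(function(resp) {
        var titles = $('<div>').html(resp).find('h2').map(function() {
            return $(this).text().trim();
        }).get();
        console.log(titles);
        done();
    });
};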

27 Jun, 2017

PostgreSQL as a Service options comparison and benchmark


Background

PostgreSQL is great, but administering it can suck up a lot of time, and for small teams a SaaS service is great value. We use and love Amazon RDS. Until recently it was the only reasonable choice in the market, but in 2017 new options are on the verge of becoming available. Both the Google and Azure clouds have announced support, and Amazon is also launching its Aurora service with PostgreSQL compatibility.

We did a quick comparison of those options.

TL;DR: Google and Aurora are a bit faster for the same money, but it’s no free lunch. Azure tests are not yet done.

The Test

The use case we care about is a “mid-size” database and queries […]

21 Jun, 2017

New IDE Extension Release


Today we are releasing an update to our main extension – Web Robots Scraper IDE. This release has version number 2017.6.20 and includes several improvements in the UI, proxy settings control, and handling of hash symbols in URLs.

Version 2017.6.20 RELEASE NOTES

  • UI: Robot run statistics are displayed in the same place and no longer “jumping” around
  • UI: when a robot finishes, its status is a direct link to the robot run list on the portal. The run link is a direct link to data preview and download on the portal.
  • setProxy() functionality has been expanded. See documentation for details.
  • Bugfix: fixed a bug where subsequent steps with URLs having an identical address before the # symbol were not loading correctly (example: http://foobar.com#a followed by http://foobar.com#b).
  • Other internal engine improvements […]

2 Mar, 2017

Email And Social Media Links Crawling From Websites


At Web Robots we often get inquiries about projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and salespeople for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the Web Robots Chrome extension on their own computer.

To start you will need an account on the Web Robots portal and the Chrome extension – that’s it. We placed a robot called leads_crawler in our portal’s Demo space so anyone can use it. In case the robot’s code is changed, below is the complete source code for this robot. You […]
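
The full leads_crawler source lives on the portal; as a rough sketch of the idea (ours, not the robot’s actual code), extracting email addresses and social profile links from the currently loaded page with jQuery might look like this:

// Illustrative sketch only – selectors and output format are assumptions.
steps.start = function() {
    var emails = $('a[href^="mailto:"]').map(function() {
        return $(this).attr('href').replace('mailto:', '');    // strip the mailto: prefix
    }).get();

    var socialLinks = $('a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"]').map(function() {
        return $(this).attr('href');
    }).get();

    console.log({ emails: emails, social: socialLinks });
    done();
};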

24 Feb, 2017

Scraping Extension Update – version 2017.2.23


Recently we rolled out an updated version of our main web scraping extension which contains several important updates and new features. This update allows our users to develop and debug robots even faster than before. So what exactly is new?

  1. jQuery has been upgraded from version 1.10.2 to 2.2.4
  2. done() can now take a delay parameter in milliseconds. For example, done(1000); will delay the step finish by 1 second.
  3. New Selectors tab which allows testing selectors inline and generates robot code. Selectors are immediately tested on the browser’s active tab so the developer can see if they work correctly. The Copy code button copies JavaScript code to the clipboard, which can be pasted directly into a robot’s step.
  4. […]

10 Nov, 2016

Writing Better Data Collection Robots


At Web Robots we have fanatical customer support. A large part of this is doing technical support for robot developers. For this we maintain live chatrooms and often do screen shares and joint code writing sessions with each of our customers’ development teams. This helps us solve most of the problems our customers encounter in minutes, and the difficult ones in several hours. This is not an exaggeration.

Writing a web scraping robot

Based on this accumulated experience we were able to identify a list of the most common mistakes that robot writers make. We published the list with specific code examples that illustrate the mistakes; each mistake then has an explanation and a code example with a solution. This list is a highly recommended read for […]