nicerobot

About nicerobot

So far nicerobot has created 28 blog entries.
28 Apr, 2020

Instant Data Users Group on Facebook

April 28th, 2020 | Datasets, Web Scraping | 0 Comments

We have launched a Facebook group where Instant Data Scraper users can find support for the extension, which currently has 65k users. The extension is wildly popular but completely free, so Web Robots has limited capacity to answer users’ questions.

We hope that the new Facebook group will grow into a community where users can support each other.

17 Dec, 2019

Instant Data Scraper Update

December 17th, 2019 | Uncategorized | 1 Comment

In October and November of this year we surveyed Instant Data Scraper extension users to see where the Web Robots team should focus for the next update. We already had some ideas from user emails received over the last couple of years, but we needed more systematic evidence of which features were most desired. The features we considered included infinite scroll support, running jobs in the cloud, processing batches of URLs, proxy support, etc.

Before the end of the survey it became clear that infinite scroll support was by far the most desired feature, and we decided to release it as soon as possible. On December 11th we published version 0.2.0 to the Chrome Web Store. Enjoy it!

Other features […]

6 May, 2019

Web Scraping vs Web Crawling

May 6th, 2019 | Web Scraping | 2 Comments

The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder, then, that many new and confusing terms are created and used every day, such as data science, data mining, data harvesting, web scraping, web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at a couple of terms to try and answer these questions: Web Scraping and Web Crawling.

Formal Answer

Let’s start with the formal definitions:

Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – extracting specific data from […]

10 Apr, 2019

Advanced AJAX Techniques for Web Scraping

April 10th, 2019 | Web Scraping | 0 Comments

Basic AJAX usage within Web Robots scraper

The best and simplest way to perform AJAX calls with the scraper is to use jQuery $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.

// Standard jQuery AJAX call
$.ajax({
    url: 'https://webrobots.io',
    method: 'GET'
}).done(function(resp) {
    console.log(resp);
});

// Simplified AJAX call
$.get('https://webrobots.io').done(function(resp) {
    console.log(resp);
});

Since AJAX is asynchronous, the step’s done() call should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn’t be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.
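To make the ordering problem concrete, here is a minimal sketch that runs outside the scraper. The functions fakeGet, next, done and deliverResponses below are hypothetical stand-ins written for this illustration, not the real Web Robots API, and the "wrong" step is only an assumption about what incorrect placement looks like (done() called outside the callback).

```javascript
// Hypothetical stand-ins for the scraper runtime; NOT the real Web Robots API.
const log = [];
const inFlight = [];   // simulated AJAX responses that have not arrived yet
const stepQueue = [];  // steps queued with next()

// Stands in for $.get(url).done(callback): the callback fires only later.
function fakeGet(url, callback) {
  inFlight.push(function () { callback('response from ' + url); });
}
function next(url, stepName) { stepQueue.push({ url: url, stepName: stepName }); }
function done() { log.push('done'); }
// Simulates the network delivering every outstanding response.
function deliverResponses() {
  while (inFlight.length > 0) inFlight.shift()();
}

const steps = {};

// Assumed incorrect pattern: done() runs before the response is handled.
steps.wrong = function () {
  fakeGet('https://webrobots.io', function (resp) { log.push('handled'); });
  done();
};

// done() inside the callback: the step finishes only after the response.
steps.right = function () {
  fakeGet('https://webrobots.io', function (resp) {
    log.push('handled');
    done();
  });
};

// Inside a loop, queue one step per URL instead of firing AJAX calls directly.
steps.list = function (urls) {
  urls.forEach(function (url) { next(url, 'fetchOne'); });
  done();
};

steps.wrong();
deliverResponses();
const wrongOrder = log.splice(0);  // take the entries and clear the log

steps.right();
deliverResponses();
const rightOrder = log.splice(0);

steps.list(['https://webrobots.io/a', 'https://webrobots.io/b']);

console.log(wrongOrder, rightOrder, stepQueue.length);
```

In the first run the step reports completion before its response has been handled; in the second the order is reversed, which is the behaviour the rule above is meant to guarantee.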

Example of incorrect and correct done() placement in AJAX:

INCORRECT

steps.start = function(){
$.get('https://webrobots.io').done( function(resp){
[…]

25 Mar, 2019

Using Sitemaps in Web Scraping Robots

March 25th, 2019 | Web Scraping | 1 Comment

We often spider through categories and follow pagination or infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and imitation of dynamic content loading.

After all, sitemaps are designed for robots to find all resources on a particular domain.
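As a rough illustration of why this is simple, the sketch below pulls every <loc> entry out of a sitemap document and keeps only the URLs of interest. The sitemap content, the example.com URLs and the extractSitemapUrls helper are all made up for this example; a regular expression is used for brevity, and it does not decode XML entities such as &amp;.

```javascript
// Hypothetical sitemap content in the standard sitemaps.org format.
const sitemapXml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

// Collects the text of every <loc> element, trimmed of surrounding whitespace.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Keep only the item pages; each one could then be queued as a scraping step.
const productUrls = extractSitemapUrls(sitemapXml).filter(function (url) {
  return url.includes('/products/');
});

console.log(productUrls);
```

Large sites often publish a sitemap index that points to further sitemap files, so a real robot would likely apply the same extraction once more to each listed file.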

Example of a sitemap:

Finding Sitemaps

4 Mar, 2019

Scraping Dynamic Websites Using The wait() Function

March 4th, 2019 | Web Scraping | 0 Comments

Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM. For example, on Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in the selector $('div[class^="ProductPage__details"]').

steps.start = function() {
    console.log($('div[class^="ProductPage__details"]').length);
    done();
};

The logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy – use setTimeout()

We can use setTimeout() where we specify […]
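One general way to wait for late-loading content, sketched here under stated assumptions, is to re-check a condition at a fixed interval until it holds or a retry budget runs out. The pollUntil helper is hypothetical and is not the scraper’s wait() function; its schedule option exists only so the demo can run synchronously, and by default it falls back to setTimeout.

```javascript
// Hypothetical helper (not Web Robots' wait()): re-check a condition at a
// fixed interval until it holds or the retry budget is exhausted.
function pollUntil(condition, onReady, onTimeout, options) {
  const opts = options || {};
  const intervalMs = opts.intervalMs || 250;
  const maxTries = opts.maxTries || 40;
  const schedule = opts.schedule || setTimeout;  // injectable for testing
  let tries = 0;

  function check() {
    tries += 1;
    if (condition()) { onReady(tries); return; }
    if (tries >= maxTries) { onTimeout(tries); return; }
    schedule(check, intervalMs);
  }
  check();
}

// Simulated page: the element "appears" on the third check.
let checks = 0;
const elementExists = function () {
  checks += 1;
  return checks >= 3;
};

const outcome = [];
pollUntil(
  elementExists,
  function (tries) { outcome.push('ready after ' + tries + ' checks'); },
  function (tries) { outcome.push('gave up after ' + tries + ' checks'); },
  // Run each re-check immediately so the demo finishes synchronously.
  { intervalMs: 10, maxTries: 5, schedule: function (fn) { fn(); } }
);

console.log(outcome);
```

In a robot step the condition would typically test the selector from the example, for instance $('div[class^="ProductPage__details"]').length > 0, with done() called from the ready or timeout callback.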

20 Feb, 2019

Web Scraping Performance Tuning With fastnext()

February 20th, 2019 | Web Scraping | 0 Comments

Scraping websites can be a time-consuming process, and when computing resources are limited and data must be frequent and up to date, a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, so making a robot even fractionally more efficient can save a lot of valuable time.

There are a number of ways to optimize your robot to run faster: replacing setTimeout with our internal wait function, careful usage of loops, not using excessive delay timers in the step done function, etc. However, one of the best methods so far has proven to be using AJAX requests instead of visiting a website […]

27 Jun, 2017

PostgreSQL as a Service options comparison and benchmark

June 27th, 2017 | PostgreSQL | 0 Comments

Background

PostgreSQL is great. But administering it can suck up a lot of time, and for small teams a managed service is great value. We use and love Amazon RDS. Until recently it was the only reasonable choice in the market. But in 2017 new options are on the verge of becoming available. Both Google and Azure clouds have announced support, and Amazon is also launching its Aurora service with PostgreSQL compatibility.

We did a quick comparison of those options.

TLDR: Google and Aurora are a bit faster for the same money but it’s no free lunch. Azure tests are not yet done.

The Test

The use case we care about is a “mid size” database and queries […]

21 Jun, 2017

New IDE Extension Release

June 21st, 2017 | Web Scraping | 2 Comments

Today we are releasing an update to our main extension – Web Robots Scraper IDE. This release has version number 2017.6.20 and includes several improvements to the UI, proxy settings control, and handling of hash symbols in URLs.

Version 2017.6.20 RELEASE NOTES

  • UI: Robot run statistics are displayed in the same place and are no longer “jumping”.
  • UI: When a robot finishes, its status is a direct link to the robot run list on the portal. The run link is a direct link to data preview and download on the portal.
  • setProxy() functionality has been expanded. See documentation for details.
  • Bugfix: fixed a bug where subsequent steps with URLs having an identical address before the # symbol were not loading correctly (Example: http://foobar.com#a and after that go to http://foobar.com#b).
  • Other internal engine improvements […]

2 Mar, 2017

Email And Social Media Links Crawling From Websites

March 2nd, 2017 | Web Scraping | 11 Comments

At Web Robots we often get inquiries about projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and salespeople for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the Web Robots Chrome extension on their own computer.
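To give a feel for the extraction involved, here is a minimal sketch that pulls email addresses and social media links out of an HTML string. It is not the source of the robot discussed in this post: the sample HTML, the deliberately simple email pattern, the SOCIAL_HOSTS list and both helper functions are assumptions made for illustration.

```javascript
// Hypothetical page fragment used only for this illustration.
const html = `
  <a href="mailto:sales@example.com">Email us</a>
  <p>Support: support@example.com</p>
  <a href="https://www.facebook.com/examplepage">Facebook</a>
  <a href="https://twitter.com/example">Twitter</a>
  <a href="https://example.com/blog">Blog</a>
`;

// Deliberately simple pattern; it will miss some valid addresses.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// Assumed list of social networks to look for.
const SOCIAL_HOSTS = ['facebook.com', 'twitter.com', 'linkedin.com', 'instagram.com'];

// Returns the unique email addresses found anywhere in the text.
function extractEmails(text) {
  return Array.from(new Set(text.match(EMAIL_PATTERN) || []));
}

// Returns the unique absolute links whose host belongs to a known social network.
function extractSocialLinks(text) {
  const links = [];
  const hrefPattern = /href="(https?:\/\/[^"]+)"/g;
  let match;
  while ((match = hrefPattern.exec(text)) !== null) {
    let host;
    try {
      host = new URL(match[1]).hostname.replace(/^www\./, '');
    } catch (e) {
      continue;  // skip malformed URLs
    }
    const isSocial = SOCIAL_HOSTS.some(function (h) {
      return host === h || host.endsWith('.' + h);
    });
    if (isSocial) links.push(match[1]);
  }
  return Array.from(new Set(links));
}

console.log(extractEmails(html));
console.log(extractSocialLinks(html));
```

Inside a browser-based robot the same idea would more naturally be expressed with DOM selectors over the page’s links; the string-based version is used here only so the sketch runs on its own.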

To start, you will need an account on the Web Robots portal and the Chrome extension, and that’s it. We placed a robot called leads_crawler in our portal’s Demo space so anyone can use it. In case the robot’s code is changed, below is the complete source code for this robot. You […]