About nicerobot

So far nicerobot has created 32 blog entries.

15 Feb, 2023

New Functions Added

By nicerobot | February 15th, 2023 | Web Scraping | 0 Comments

Web Robots scraping framework documentation has been updated with new functions:

blockImages() – changes browser settings to block image downloading. This function is useful when bandwidth is a concern, and it sometimes results in faster crawling.
allowImages() – reverses the browser settings changes made by blockImages().
closeSocket() – closes all idle socket connections in the browser.

Web Robots has been using these functions on the internal platform for over 6 months, and they have proved to be a great help in some scenarios. Now they are available in our public extension.
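A minimal sketch of how these functions might fit into a robot step. Only the three function names come from the list above. The stub definitions at the top are hypothetical stand-ins so the snippet runs outside the extension, and the call order is just one plausible usage, not documented behaviour.

```javascript
// Hypothetical stand-ins so this sketch runs outside the Web Robots extension.
// Inside the extension the framework provides these functions itself.
var calls = [];
function blockImages() { calls.push('blockImages'); }
function allowImages() { calls.push('allowImages'); }
function closeSocket() { calls.push('closeSocket'); }
function done() { calls.push('done'); }

var steps = {};

steps.start = function () {
    blockImages();   // ask the browser to stop downloading images
    // ... crawl and extract data here ...
    closeSocket();   // close idle socket connections
    allowImages();   // reverse the change made by blockImages()
    done();          // finish the step
};

steps.start();
console.log(calls.join(' -> '));
```

Calling allowImages() before the step finishes is an assumption made here so that the browser is left in its original state.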

26 May, 2022

Instant Data Scraper v1.0.7 released

By nicerobot | May 26th, 2022 | Uncategorized | 0 Comments

Today we are releasing an update to our Instant Data Scraper. Version 1.0.7 has the following improvements:

Performance improvement for websites with a large HTML structure (Google Maps, for example).
Improved “Next” page button behaviour.
Migrated to manifest version 3, as manifest version 2 will be phased out by Google soon.

We are also working on some other features for Instant Data. They will be released in the near future!

18 Dec, 2020

Instant Data Scraper is now available on Microsoft Edge

By nicerobot | December 18th, 2020 | Uncategorized | 0 Comments

We received an invitation from Microsoft to publish our Chrome extensions to the Microsoft Edge webstore. The Microsoft Edge browser is Chromium-based, so porting extensions to it should be easy. It was actually even easier than expected: Edge’s developer dashboard accepted the exact same zip file that we use for the Chrome webstore. The extension just works without any changes.

Anyone can download Instant Data for Microsoft Edge here.

1 Oct, 2020

Our Chrome extension has been updated

By nicerobot | October 1st, 2020 | Web Scraping | 0 Comments

Our public developer extension (IDE) has been untouched since March 2019. It may look like Web Robots was stagnating, but we were actually working constantly on our internal systems such as the portal, cloud workers and cloud orchestration. We also had several internal releases of the IDE for our staff.

So the IDE published to the Chrome webstore is just the tip of the iceberg.

It is now September 2020 and the time has come to release the new version to the Chrome webstore. We are glad that our extension passed the webstore’s permission audit on the first attempt, as our extension requires access to quite a few Chrome APIs in order to work properly and people at Google are getting ever stricter in their review process for […]

28 Apr, 2020

Instant Data Users Group on Facebook

By nicerobot | April 28th, 2020 | Datasets, Web Scraping | 0 Comments

We have launched a Facebook group where Instant Data Scraper users can find support for the extension, which currently has 65k users. The extension is wildly popular, but it is also completely free, so Web Robots has limited capacity to answer users' questions.

We hope that the new Facebook group will grow into a community where users can support each other.

17 Dec, 2019

Instant Data Scraper Update

By nicerobot | December 17th, 2019 | Uncategorized | 1 Comment

In October and November of this year we surveyed Instant Data Scraper extension users to see where the Web Robots team should focus for the next update. We already had some ideas from user emails received over the last couple of years, but we needed more rigorous evidence of which features were most desired. The features we considered included infinite scroll support, running jobs in the cloud, processing batches of URLs, proxy support, etc.

Before the end of the survey it became clear that infinite scroll support was by far the most desired feature, and we decided to release it as soon as possible. On December 11th we published version 0.2.0 to the Chrome webstore. Enjoy it!

Other features […]

6 May, 2019

Web Scraping vs Web Crawling

By nicerobot | May 6th, 2019 | Web Scraping | 2 Comments

The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let’s look at a couple of terms to try and answer these questions: Web Scraping and Web Crawling.

Formal Answer

Let’s start with the formal definitions:

Web crawling – A process where a program or automated script browses the World Wide Web in a methodical, automated manner.
Web scraping – extracting specific data from […]

10 Apr, 2019

Advanced AJAX Techniques for Web Scraping

By nicerobot | April 10th, 2019 | Web Scraping | 0 Comments

Basic AJAX usage within Web Robots scraper

The best and simplest way to perform AJAX calls with the scraper is to use jQuery's $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods.

// Standard jQuery AJAX call
$.ajax({
    url:'https://webrobots.io',
    method: 'GET'
}).done( function(resp){
    console.log(resp); 
});

// Simplified AJAX call
$.get('https://webrobots.io').done( function(resp){
   console.log(resp); 
});

Since AJAX is asynchronous, done() should always be called inside the AJAX callback function, so that the step finishes only after the response has arrived. Also, multiple AJAX calls shouldn’t be made inside a loop. Instead, a new step for the AJAX call should be created and queued up with next() inside the loop.

Example incorrect and correct done() placement in AJAX:

INCORRECT

steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
[…]
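The two rules above (call done() inside the AJAX callback, and queue a step per request instead of firing requests in a loop) can be sketched as follows. This is a simplified illustration, not the framework's implementation: the fake $.get(), the queue, and the next()/done() definitions are hypothetical stand-ins so the snippet runs outside the extension, and the step name fetchOne and the two URLs are made up.

```javascript
// Hypothetical stand-ins so the sketch runs outside the Web Robots extension:
// a fake $.get(), a step queue, and simplified next()/done().
var queue = [];
var log = [];
var $ = {
    get: function (url) {
        return {
            done: function (callback) {
                // Respond asynchronously, like a real network request would.
                setTimeout(function () { callback('response from ' + url); }, 0);
            }
        };
    }
};
function next(url, stepName, params) {
    queue.push({ step: stepName, params: params });
}
function done() {
    var job = queue.shift();
    if (job) { steps[job.step](job.params); }
}

var steps = {};

steps.start = function () {
    var urls = ['https://webrobots.io/a', 'https://webrobots.io/b'];
    // Do not call $.get() in this loop; queue one step per request instead.
    urls.forEach(function (url) {
        next('', 'fetchOne', { url: url });
    });
    done();
};

steps.fetchOne = function (params) {
    $.get(params.url).done(function (resp) {
        log.push(resp);
        done(); // called inside the callback, once the response has arrived
    });
};

steps.start();
```

Because each request runs in its own step, the second request starts only after the first callback has called done().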

25 Mar, 2019

Using Sitemaps in Web Scraping Robots

By nicerobot | March 25th, 2019 | Web Scraping | 1 Comment

We often spider through categories and follow pagination or infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just use sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.

After all, sitemaps are designed for robots to find all resources on a particular domain.

Example of a sitemap:

Finding Sitemaps

The fastest way to find a sitemap URL is to check the robots.txt file, for example https://www.rottentomatoes.com/robots.txt
We can also […]
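As an illustration of the robots.txt check, here is a minimal sketch that pulls the Sitemap entries out of the text of a robots.txt file. The function name findSitemaps and the sample content are made up for this example; fetching the file is left out.

```javascript
// Return every sitemap URL declared in the text of a robots.txt file.
function findSitemaps(robotsTxt) {
    return robotsTxt
        .split(/\r?\n/)
        .map(function (line) { return line.trim(); })
        // The directive name is matched case-insensitively.
        .filter(function (line) { return /^sitemap:/i.test(line); })
        .map(function (line) { return line.slice('sitemap:'.length).trim(); });
}

// Made-up robots.txt content, for illustration only.
var sample = [
    'User-agent: *',
    'Disallow: /private/',
    'Sitemap: https://example.com/sitemap.xml',
    'sitemap: https://example.com/news-sitemap.xml'
].join('\n');

console.log(findSitemaps(sample));
```

A robots.txt file may list several sitemaps, which is why the function returns an array rather than a single URL.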

4 Mar, 2019

Scraping Dynamic Websites Using The wait() Function

By nicerobot | March 4th, 2019 | Web Scraping | 0 Comments

Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM. For example, on Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in the selector $('div[class^="ProductPage__details"]').


steps.start = function() {

    console.log($('div[class^="ProductPage__details"]').length);

    done();

};

Logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.

Simple waiting strategy – use setTimeout()

We can use setTimeout() where we specify the […]
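The fixed-delay idea can be sketched as follows. This is a simplified illustration under stated assumptions: countDetails() is a hypothetical stand-in for $('div[class^="ProductPage__details"]').length, the simulated page and done() are stubs so the snippet runs outside the extension, and the 20 ms and 100 ms values are arbitrary.

```javascript
// Hypothetical stand-ins so the sketch runs outside the Web Robots extension.
var page = { detailsCount: 0 };
// Simulate the product data arriving via AJAX shortly after the DOM is ready.
setTimeout(function () { page.detailsCount = 1; }, 20);
// Stands in for $('div[class^="ProductPage__details"]').length
function countDetails() { return page.detailsCount; }
var results = [];
function done() { results.push('done'); }

var steps = {};

steps.start = function () {
    results.push(countDetails()); // element has not appeared yet
    // Wait a fixed amount of time, then read the DOM again.
    setTimeout(function () {
        results.push(countDetails());
        done(); // finish the step only after the delayed read
    }, 100);
};

steps.start();
```

A fixed delay is a guess: if it is too short the element is still missing, and if it is too long the crawl slows down.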
