<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>web scraping &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/tag/web-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Tue, 17 Dec 2019 13:59:01 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>Instant Data Scraper Update</title>
		<link>https://webrobots.io/instant-data-scraper-update/</link>
					<comments>https://webrobots.io/instant-data-scraper-update/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 17 Dec 2019 13:58:47 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[dynamic website]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6089</guid>

					<description><![CDATA[In October and November of this year we decided to survey Instant Data Scraper extension users to see where the Web Robots team should focus for the next update. We already had some ideas from user emails that we received over the last couple of years, but we needed more scientific evidence to see which features [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>In October and November of this year we decided to survey <a href="https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah"><strong>Instant Data Scraper</strong></a> extension users to see where the Web Robots team should focus for the next update. We already had some ideas from user emails that we received over the last couple of years, but we needed more scientific evidence to see which features would be most desired. The features we considered included infinite scroll support, running jobs in the cloud, processing batches of URLs, proxy support, etc.</p>
<p>Before the survey even ended it became clear that infinite scroll support was by far the most desired feature, so we decided to release it as soon as possible. On December 11th we published version 0.2.0 to the Chrome Web Store. Enjoy it!</p>
<p>Other features will follow as well. We are happy to see that our web scraping tool has grown past 40k users and has excellent reviews!</p>
<div id="attachment_6090" style="width: 1034px" class="wp-caption aligncenter"><img fetchpriority="high" decoding="async" aria-describedby="caption-attachment-6090" class="lazyload size-large wp-image-6090" src="https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-1024x149.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-1024x149.png" alt="Instant Data Scraper Installs" width="1024" height="149" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271024%27%20height%3D%27149%27%20viewBox%3D%270%200%201024%20149%27%3E%3Crect%20width%3D%271024%27%20height%3D%273149%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-200x29.png 200w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-300x44.png 300w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-400x58.png 400w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-600x88.png 600w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-768x112.png 768w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-800x117.png 800w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-1024x149.png 1024w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11-1200x175.png 1200w, https://webrobots.io/wp-content/uploads/2019/12/Screenshot-2019-12-17-at-15.55.11.png 1350w" data-sizes="auto" data-orig-sizes="(max-width: 1024px) 100vw, 1024px" /><p id="caption-attachment-6090" class="wp-caption-text">Installs per day over the lifetime of our extension.</p></div>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/instant-data-scraper-update/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Using Sitemaps in Web Scraping Robots</title>
		<link>https://webrobots.io/using-sitemaps-in-web-scraping-robots/</link>
					<comments>https://webrobots.io/using-sitemaps-in-web-scraping-robots/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 25 Mar 2019 09:41:15 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[sitemap]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5876</guid>

					<description><![CDATA[We often use the spidering-through-categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-2 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-1 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">We often use the spidering-through-categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach &#8211; just using </span><a href="https://en.wikipedia.org/wiki/Sitemaps"><b>sitemaps</b></a><span style="font-weight: 400;">. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.</span></p>
<p><span style="font-weight: 400;">After all, sitemaps are designed for robots to find all resources on a particular domain.</span></p>
<p><b>Example of a sitemap:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-1 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1024x392.png" width="1024" height="392" alt="" title="sitemap-example" class="lazyload img-responsive wp-image-5877" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271384%27%20height%3D%27530%27%20viewBox%3D%270%200%201384%20530%27%3E%3Crect%20width%3D%271384%27%20height%3D%273530%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-200x77.png 200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-400x153.png 400w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-600x230.png 600w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-800x306.png 800w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1200x460.png 1200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png 1384w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Finding Sitemaps</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">The fastest way to find a sitemap URL is to check the </span><i><span style="font-weight: 400;">robots.txt</span></i><span style="font-weight: 400;"> file. For example: </span><a href="https://www.rottentomatoes.com/robots.txt"><span style="font-weight: 400;">https://www.rottentomatoes.com/robots.txt</span></a></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">We can also probe typical sitemap URLs like </span><i><span style="font-weight: 400;">domain.com/sitemap</span></i><span style="font-weight: 400;"> or </span><i><span style="font-weight: 400;">domain.com/sitemap.xml</span></i></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sometimes just going to the homepage and searching for the keyword “sitemap” works</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If all of the above bear no fruit, a Google search can help (example: &#8220;target.com sitemap&#8221;).</span></li>
</ul>
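<p>As a rough sketch (not part of any official tooling), the sitemap URLs advertised in robots.txt can be extracted with a few lines of JavaScript. The extractSitemapUrls helper below is our own illustrative name and only parses the robots.txt text &#8211; fetching the file (e.g. with $.get) is left out:</p>
<pre class="brush: jscript;">

// Collect every "Sitemap:" directive from the contents of a robots.txt file.
// The directive name is matched case-insensitively, following robots.txt convention.
function extractSitemapUrls(robotsTxt) {
    var urls = [];
    robotsTxt.split('\n').forEach(function(line) {
        var match = line.match(/^\s*sitemap\s*:\s*(\S+)/i);
        if (match) urls.push(match[1]);
    });
    return urls;
}

</pre>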
<p><b>Example of domain/robots.txt:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-2 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-1024x212.png" width="1024" height="212" alt="" title="sitemap-example-2" class="lazyload img-responsive wp-image-5885" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271024%27%20height%3D%27212%27%20viewBox%3D%270%200%201024%20212%27%3E%3Crect%20width%3D%271024%27%20height%3D%273212%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-200x41.png 200w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-400x83.png 400w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-600x124.png 600w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-800x166.png 800w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png 1024w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Working With Large Sitemaps</span></h1>
<p><span style="font-weight: 400;">Sitemaps usually have many thousands of records, and opening them directly will freeze the Chrome browser for several minutes while it renders the XML. Our best practice is to make a $.get request to fetch the sitemap and process it.</span></p>
<p><b>Example of getting a sitemap using an ajax request and filtering URLs:</b></p>
<pre class="brush: jscript;">

$.get('https://www.rottentomatoes.com/sitemap_0.xml').then( function(response){
    $('url loc',response).each( function(i, v){
        var url = $(v).text();

        // filtering: we only need URLs that have no further path after film name
        // we can filter out URLs with longer URL paths than film page has

        if(url.split('/').length &lt; 6) next(url,'getFilmInfo');

    });
    done();
});

</pre>
<h1><span style="font-weight: 400;">Downsides of Sitemap Approach</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap can be outdated (old URLs leading to 404 pages) and the site owner might not even notice that their sitemaps are incorrect. It is necessary to do spot checks to see if a URL works.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap might not have all the items listed in the normal website interface. Best practice is to spot check that items found on the website are present in the sitemap as well.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do not allow filtering items based on certain criteria. For example, if we need only electronics from a large e-shop, we still have to crawl all products and do the filtering in the back-end.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do not show how popular an item is &#8211; for example, we cannot infer whether a particular item is on the first page of its category or somewhere near the end.</span></li>
</ul>
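<p>For the spot checks mentioned above, it helps to pick a random handful of sitemap URLs rather than always checking the first few. A minimal sketch (sampleUrls is our own illustrative helper, not part of any scraping framework):</p>
<pre class="brush: jscript;">

// Pick n distinct random URLs from a list for a manual spot check.
// Splicing from a copy guarantees no URL is picked twice.
function sampleUrls(urls, n) {
    var copy = urls.slice();
    var sample = [];
    while (copy.length > 0 && sample.length !== n) {
        var i = Math.floor(Math.random() * copy.length);
        sample.push(copy.splice(i, 1)[0]);
    }
    return sample;
}

</pre>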
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/using-sitemaps-in-web-scraping-robots/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Scraping Dynamic Websites Using The wait() Function</title>
		<link>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/</link>
					<comments>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 04 Mar 2019 07:00:48 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[dynamic website]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5842</guid>

					<description><![CDATA[Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-3 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-2 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Dynamic websites are one of the biggest headaches for every developer who works with web scraping robots. Data extraction becomes complicated when the data cannot be found in the initial HTML of the website. For example, walmart.com loads product data via an AJAX call after the initial DOM is rendered, therefore we must wait and then extract the data from the DOM. On Walmart&#8217;s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 the product data appears in the selector $(&#8216;div[class^=&#8221;ProductPage__details&#8221;]&#8217;).</p>
<pre class="brush: jscript;">

steps.start = function() {

    console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

    done();

};

</pre>
<p>The logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.</p>
<h1><span style="font-weight: 400;">Simple waiting strategy &#8211; use setTimeout()</span></h1>
<p>We can use setTimeout(), where we specify the number of milliseconds to wait before executing a piece of code. This way the browser has some time to process dynamic data and insert it into the DOM. In this example we introduce a simple 3 second wait:</p>
<pre class="brush: jscript;">

steps.start = function() {

    setTimeout(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    }, 3000);

};

</pre>
<p>The logged result is 1, which indicates that we found the expected data in the DOM. However, there are some drawbacks to this method, as the code will be delayed by the same amount of time regardless of how long the website actually takes to handle its dynamic requests. This means we are wasting time when the product data appears sooner and missing data when it loads slower.</p>
<p>Dynamic pages have a tendency to load inconsistently, therefore the exact timeout duration for each page load is impossible to know in advance. The maximum observed delay time is usually chosen when using setTimeout(). If we wait for 3 seconds, the average time for data to appear is 1.5 seconds, and we have to process 50,000 products &#8211; then 20.83 hours are wasted. This is 625 hours per month if we run this robot every day!</p>
<h1><span style="font-weight: 400;">Better waiting strategy &#8211; use wait()</span></h1>
<p>The Web Robots system wait() function enables the user to wait for a particular HTML element to load and then execute the code right after the element appears:</p>
<p><b>wait(string or array selector[], int maxWaitTime)</b></p>
<p>Default maxWaitTime = 10000. Usable callbacks: then, always, fail (similar to jQuery deferred, https://api.jquery.com/jquery.deferred/). Example:</p>
<pre class="brush: jscript;">

steps.start = function() {

    wait('div[class^=&quot;ProductPage__details&quot;]').then(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    })

};

</pre>
<p>wait() can have multiple callbacks for scenarios when an element appears, does not appear, or always:</p>
<ul>
<li style="font-weight: 400;">wait(selector, time_to_wait*).then(callback) &#8211; callback will be executed immediately when the selector appears. If the selector doesn&#8217;t appear, the function will never be executed.</li>
<li style="font-weight: 400;">wait(selector, time_to_wait*).always(callback) &#8211; callback is executed when the element appears or when time_to_wait is reached.</li>
<li style="font-weight: 400;">wait(selector, time_to_wait*).then(callback).fail(callback2) &#8211; callback will be executed when the element appears. callback2 will be executed if the element does not appear.</li>
<li style="font-weight: 400;">wait([selector1, selector2, &#8230;], time_to_wait*).then(callback) &#8211; callback is executed only when all of the selectors (selector1, selector2, &#8230;) have appeared on the website.</li>
</ul>
<p>*time_to_wait is an optional parameter that allows the user to choose the number of milliseconds to wait for a specified selector. The default (if not specified in the function) is 10000 ms.</p>
<p>The wait() function makes scraping of dynamic pages much easier, more efficient and more reliable.</p>
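<p>The internals of wait() are not published, but to illustrate the idea, a similar element poller can be sketched in plain JavaScript (poll, predicate and the timing values below are our own illustrative names, not the Web Robots API):</p>
<pre class="brush: jscript;">

// Resolve once predicate() returns true, checking every `interval` ms;
// reject with a timeout error after maxWaitTime ms. In a real scraper the
// predicate would be something like: function() { return $(selector).length > 0; }
function poll(predicate, maxWaitTime, interval) {
    return new Promise(function(resolve, reject) {
        var waited = 0;
        var timer = setInterval(function() {
            if (predicate()) {
                clearInterval(timer);
                resolve();
            } else {
                waited += interval;
                if (waited >= maxWaitTime) {
                    clearInterval(timer);
                    reject(new Error('timeout'));
                }
            }
        }, interval);
    });
}

</pre>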
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Web Scraping Performance Tuning With fastnext()</title>
		<link>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/</link>
					<comments>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 20 Feb 2019 09:17:51 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5736</guid>

					<description><![CDATA[Scraping websites can be a time-consuming process, and when computing resources are limited while frequent, up-to-date data is needed, having a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-4 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-3 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">Scraping websites can be a time-consuming process, and when computing resources are limited while frequent, up-to-date data is needed, having a fast-running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, so making a robot even fractionally more efficient can save a lot of valuable time.</span></p>
<p><span style="font-weight: 400;">There are a number of ways to optimize your robot to run faster: replacing <strong>setTimeout</strong> with our internal <strong>wait</strong> function, careful usage of loops, not using excessive delay timers in the step done function, etc. However, one of the best methods has so far proven to be using <strong>ajax</strong> requests instead of visiting a website directly. In a standard scenario, using next with a link will open the webpage in your browser, which means it will download the HTML file and all the listed additional resources like js and css files, images, video and audio files, and then process, render and display them in the window.</span></p>
<h5><strong>EXAMPLE ROBOT WITH NEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){ 
    next(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">All of this might take only a few hundred milliseconds, but when you are dealing with potentially hundreds of thousands of loads, every millisecond adds up. Fortunately, from a data-collecting robot&#8217;s standpoint, having the html rendered with all the images, sleek css and fonts is not needed, because the robot is only interested in the data present in the html file. Therefore, we can get the same results in a fraction of the time by getting just the html file with an <strong>ajax</strong> request. However, this requires reformatting of the code by adding extra parameters to the next step and including the response context in subsequent html data selectors.</span></p>
<h5><strong>EXAMPLE ROBOT WITH AJAX:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    next(&quot;&quot;,&quot;getMovie&quot;,&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;);
    done();
}

steps.getMovie = function(url){
    $.get(url).done(function(resp){
        let movie = {name: $(&quot;h1&quot;,resp).text()};
        emit(&quot;movies&quot;,[movie]);
        done();
    })
}
</pre>
<p><span style="font-weight: 400;">This does make the code a bit more complex, and requires more work and care when reformatting old robots in order to avoid errors, especially if the old code is long and complicated. </span></p>
<p><span style="font-weight: 400;">In order to streamline the reformatting and make writing and reading of new robots easier, we integrated the ajax functionality into our extension in the form of the <strong>fastnext</strong> function. It functions just like a regular <strong>next</strong>, requiring a URL, a step name, and an optional third data parameter, but instead of loading the whole website, it does a get request in the background and automatically uses the response html as context in the specified step, so there is <strong>no need to reformat the selectors</strong>.</span></p>
<h5><strong>EXAMPLE ROBOT WITH FASTNEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    fastnext(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">While reformatting old robots from next to <strong>fastnext</strong> we found that in practice the <strong>savings average around a 50%</strong> reduction in run time. However, this varies from a low of only <strong>25%</strong> up to a high of <strong>85%</strong>, and heavily depends on the structure and technology of the specific scraped website.</span></p>
<p><span style="font-weight: 400;">It should be noted however that fastnext, just like a regular ajax, will only work for static html websites where the required data is present in the html. Dynamic websites built with technologies like React or Angular require a different approach.</span></p>
<p><span style="font-weight: 400;">Another nuance to take into account is that currently fastnext does not handle the fail clause of the ajax request and will instead trigger a step retry. Usually this behaviour is innocuous, but sometimes the ajax fail needs to be handled; in that case a regular ajax function should be used.</span></p>
<p><span style="font-weight: 400;">As an example we wrote a robot that scrapes a small Amazon category containing 149 products and compared its speed using next and fastnext. The run using fastnext finished <strong>36% faster</strong> than its counterpart, clocking in at ~270s, while the run using next clocked in at ~420s.</span></p>
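<p>The quoted figure is easy to sanity-check: the relative saving is simply the run time difference divided by the baseline (savingsPercent is our own illustrative helper):</p>
<pre class="brush: jscript;">

// Percentage of run time saved by the faster variant, rounded to a whole percent.
function savingsPercent(baseSeconds, fastSeconds) {
    return Math.round(100 * (baseSeconds - fastSeconds) / baseSeconds);
}

// savingsPercent(420, 270) gives 36, matching the Amazon category test above.

</pre>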
<p><strong>Happy Scraping!</strong></p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Email And Social Media Links Crawling From Websites</title>
		<link>https://webrobots.io/email-and-social-media-links-crawling-from-websites/</link>
					<comments>https://webrobots.io/email-and-social-media-links-crawling-from-websites/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 02 Mar 2017 10:45:38 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[email crawling]]></category>
		<category><![CDATA[social media crawling]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5543</guid>

					<description><![CDATA[At Web Robots we often get inquiries on projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-5 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-4 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>At Web Robots we often get inquiries on projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">Web Robots Chrome extension</a> on their own computer.</p>
<p><img loading="lazy" decoding="async" class="lazyload alignnone size-full wp-image-5546" src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" data-orig-src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" alt="" width="994" height="604" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27994%27%20height%3D%27604%27%20viewBox%3D%270%200%20994%20604%27%3E%3Crect%20width%3D%27994%27%20height%3D%273604%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-200x122.png 200w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-300x182.png 300w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-400x243.png 400w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-600x365.png 600w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-768x467.png 768w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-800x486.png 800w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png 994w" data-sizes="auto" data-orig-sizes="(max-width: 994px) 100vw, 994px" /></p>
<p>To start you will need an account on the Web Robots <a href="http://portal.webrobots.io">portal</a> and the Chrome <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">extension</a>, and that&#8217;s it. We placed a robot called <a href="http://portal.webrobots.io/robots/2236"><strong>leads_crawler</strong></a> in our portal&#8217;s Demo space so anyone can use it. In case the robot&#8217;s code has been changed, the complete source code for this robot is below. Edit the variable on lines 14-18 to contain the list of target websites to crawl and run the robot. Then preview the data on the Output tab and download it from the portal once the robot is finished. You will get a nice CSV file which can be used in your further leads processing workstream.</p>
<p>Robot&#8217;s source code:</p>
<pre class="brush: jscript; highlight: &#091;14&#093;;">
var DEPTH = 2;
var EMAIL_PATTERN = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;
var SOCIAL_MEDIA = [
    'facebook.com',
    'linkedin.com',
    'instagram.com',
    'youtube.com',
    'twitter.com',
    'pinterest.com',
    'plus.google.com',
    'blogspot.com'
];

var websites = [
    &quot;http://dccentre.com/&quot;,
    &quot;http://www.theweddingplanneromaha.com/&quot;,
    &quot;http://www.effortlesseventsidaho.com/&quot;
];


steps.start = function() {
    setSettings({skipVisited:true});
    setRetries(5000, 2, 1000); // 5 sec retry timer to skip bad pages quickly
    websites.forEach(function(v, i) {
            next(v, &quot;crawl&quot;, 0);
    });
    done();
};


steps.crawl = function(depth){
    
    depth++;
    
    var emails = _.uniq(returnEmails());
    var social = returnSocial();
    var urls = returnURLs();
    
    dbg(urls);
    
    if(emails.length || social.length) {
        var data = {
            'email' : emails.join(';'),
        };
        $.extend(data, social);
        emit('Leads', [data]);
    }
    
    if(depth &lt; DEPTH) {
        urls.forEach(function(v) {
            next(v, 'crawl', depth);
        });
    }
    
    done();
};


returnURLs = function() {
    var urls = [];
    $('a:visible').each(function (i,v) {
        var href = $(v).prop('href');
        if (!href) { return; } // skip anchors without an href
        var url = href.split('#').shift();
        if(isValidLink(url)) {
            urls.push(url);
        };
    });
    return(_.uniq(urls));
};


returnSocial = function() {
    var urls = [];
    var social = {};
    
    $('a:visible').each(function (i,v) {
        var href = $(v).prop('href');
        if (href) { urls.push(href); } // skip anchors without an href
    });
    
    _.uniq(urls).forEach(function(link) {
        var domain = link.split('://').pop().split('www.').pop().split('/').shift().toLowerCase();
        var pos = _.indexOf( SOCIAL_MEDIA, domain);
        if(pos !== -1) {
            social[SOCIAL_MEDIA[pos].split('.').shift()] = link;
        };
    });
    return(social);
};


returnEmails = function() {
    // match() returns null when nothing is found, so fall back to an empty array
    return $('*').html().match(EMAIL_PATTERN) || [];
};


isValidLink = function(link){
    // check the type first: link may be undefined, and calling split() on it would throw
    if ((link === undefined) || (typeof link !== &quot;string&quot;) || (link.length &lt; 12)) {
        return false;
    }
    
    // here we check for all bad stuff in links
    if(_.indexOf(SOCIAL_MEDIA, link.split('://').pop().split('www.').pop().split('/').shift()) !== -1) {
        return false;
    }
    
    if (
        // positives - must be present
        !(link.includes(document.domain)) ||
        !link.startsWith(&quot;http&quot;) ||
        
        // negatives - must not be present
        link.includes(&quot;.zip&quot;) ||
        link.includes(&quot;.csv&quot;) ||
        link.includes(&quot;.mpg&quot;) ||
        link.includes(&quot;.mpeg&quot;) ||
        link.includes(&quot;.gz&quot;) ||
        link.includes(&quot;.jpg&quot;) ||
        link.includes(&quot;.jpeg&quot;) ||
        link.includes(&quot;.png&quot;) ||
        link.includes(&quot;.pdf&quot;) ||
        link.includes(&quot;.doc&quot;) ||
        link.includes(&quot;.xls&quot;) ||
        link.includes(&quot;.ppt&quot;) ||
        link.includes(&quot;.avi&quot;) ||
        link.includes(&quot;.tif&quot;) ||
        link.includes(&quot;.exe&quot;) ||
        link.includes(&quot;.psd&quot;) ||
        link.includes(&quot;.eps&quot;) ||
        link.includes(&quot;.txt&quot;) ||
        link.includes(&quot;.rtf&quot;) ||
        link.includes(&quot;.wmv&quot;) ||
        link.includes(&quot;.odt&quot;) ||
        link.includes(&quot;.css&quot;) ||
        link.includes(&quot;.js&quot;) ||
        link.includes(&quot;mailto:&quot;) ||
        link.includes(&quot;facebook&quot;) ||
        link.includes(&quot;google&quot;) ||
        link.includes(&quot;twitter&quot;) ||
        link.includes(&quot;youtube&quot;) ||
        link.includes(&quot;linkedin&quot;) ||
        link.includes(&quot;download&quot;) ||
        link.includes(&quot;pinterest&quot;)
        ) {
            return false;
        } else {
            return true;
        }
};
</pre>
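<p>The social-link matching above can be tried on its own. Below is a minimal standalone sketch of the domain extraction used in <strong>returnSocial</strong>; the example links are hypothetical:</p>

```javascript
// Standalone sketch of the domain matching used in returnSocial above.
// The split() chain mirrors the robot's code; the example links are hypothetical.
var SOCIAL_MEDIA = ['facebook.com', 'linkedin.com', 'twitter.com'];

function extractDomain(link) {
    // 'https://www.facebook.com/acme' -> 'facebook.com'
    return link.split('://').pop().split('www.').pop().split('/').shift().toLowerCase();
}

var social = {};
['https://www.facebook.com/acme', 'https://twitter.com/acme'].forEach(function(link) {
    var pos = SOCIAL_MEDIA.indexOf(extractDomain(link));
    if (pos !== -1) {
        // key the result by network name without the TLD, e.g. 'facebook'
        social[SOCIAL_MEDIA[pos].split('.').shift()] = link;
    }
});
```

The same `extractDomain` logic is what `isValidLink` uses to exclude social-media links from further crawling.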
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/email-and-social-media-links-crawling-from-websites/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Scraping Extension Update &#8211; version 2017.2.23</title>
		<link>https://webrobots.io/scraping-extension-update-version-2017-2-23/</link>
					<comments>https://webrobots.io/scraping-extension-update-version-2017-2-23/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Fri, 24 Feb 2017 13:54:44 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[changelog]]></category>
		<category><![CDATA[web robots ide]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5530</guid>

					<description><![CDATA[Recently we rolled out an updated version of our main web scraping extension which contains several important updates and new features. This update allows our users to develop and debug robots even faster than before. So what exactly is new? jQuery has been upgraded from version 1.10.2 to 2.2.4 done() now can take a milliseconds [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-6 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:20px;padding-right:0px;padding-bottom:20px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-5 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Recently we rolled out an updated version of our main <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak?hl=en">web scraping extension</a> which contains several important updates and new features. This update allows our users to develop and debug robots even faster than before. So what exactly is new?</p>
<ol>
<li><strong>jQuery</strong> has been upgraded from version 1.10.2 to 2.2.4</li>
<li><strong>done()</strong> can now take a delay parameter in milliseconds. For example, done(1000); will delay the step&#8217;s finish by 1 second.</li>
<li>A new <strong>Selectors</strong> tab allows testing selectors inline and generates robot code. Selectors are tested immediately on the browser&#8217;s active tab, so the developer can see whether they work correctly. The <strong>Copy code</strong> button copies JavaScript code to the clipboard, which can be pasted directly into a robot&#8217;s step.<a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak/related?hl=en" target="_blank" rel="noopener noreferrer"><br />
<img loading="lazy" decoding="async" class="lazyload wp-image-5533 size-full alignnone" src="https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab.png" data-orig-src="https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab.png" width="600" height="342" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27600%27%20height%3D%27342%27%20viewBox%3D%270%200%20600%20342%27%3E%3Crect%20width%3D%27600%27%20height%3D%273342%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab-200x114.png 200w, https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab-300x171.png 300w, https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab-400x228.png 400w, https://webrobots.io/wp-content/uploads/2017/02/Selectors-tab.png 600w" data-sizes="auto" data-orig-sizes="(max-width: 600px) 100vw, 600px" /></a></li>
<li>The <strong>Output</strong> tab now displays data in table format. This makes it easy to monitor the quality of the emitted data. While the robot is running, the table displays the 200 most recently emitted rows.<a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak/related?hl=en" target="_blank" rel="noopener noreferrer"><br />
</a><a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak/related?hl=en" target="_blank" rel="noopener noreferrer"><img loading="lazy" decoding="async" class="lazyload wp-image-5534 size-full alignnone" src="https://webrobots.io/wp-content/uploads/2017/02/Output-Tab.png" data-orig-src="https://webrobots.io/wp-content/uploads/2017/02/Output-Tab.png" width="600" height="324" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27600%27%20height%3D%27324%27%20viewBox%3D%270%200%20600%20324%27%3E%3Crect%20width%3D%27600%27%20height%3D%273324%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2017/02/Output-Tab-200x108.png 200w, https://webrobots.io/wp-content/uploads/2017/02/Output-Tab-300x162.png 300w, https://webrobots.io/wp-content/uploads/2017/02/Output-Tab-400x216.png 400w, https://webrobots.io/wp-content/uploads/2017/02/Output-Tab.png 600w" data-sizes="auto" data-orig-sizes="(max-width: 600px) 100vw, 600px" /></a></li>
</ol>
<p>&nbsp;</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scraping-extension-update-version-2017-2-23/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Writing Better Data Collection Robots</title>
		<link>https://webrobots.io/writing-better-data-collection-robots/</link>
					<comments>https://webrobots.io/writing-better-data-collection-robots/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 10 Nov 2016 13:32:38 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5439</guid>

					<description><![CDATA[At Web Robots we have a fanatical customer support. Large part of this is doing technical support for robot developers. For this we maintain live chatrooms, often do screenshares, joint code writing sessions with each of our customers' development teams. This helps us solve most of the problems our customers encounter in minutes, difficult [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-7 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-6 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>At Web Robots we provide fanatical customer support. A large part of this is technical support for robot developers: we maintain live chat rooms and often do screen shares and joint code-writing sessions with each of our customers&#8217; development teams. This helps us solve most of the problems our customers encounter in minutes, and the difficult ones in several hours. This is not an exaggeration.</p>
<p><img loading="lazy" decoding="async" class="lazyload  wp-image-5440 aligncenter" src="https://webrobots.io/wp-content/uploads/2016/11/javascript-coding-1000x576.jpg" data-orig-src="https://webrobots.io/wp-content/uploads/2016/11/javascript-coding-1000x576.jpg" alt="Writing web scraping robot" width="512" height="295" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27512%27%20height%3D%27295%27%20viewBox%3D%270%200%20512%20295%27%3E%3Crect%20width%3D%27512%27%20height%3D%273295%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2016/11/javascript-coding-1000x576-300x173.jpg 300w, https://webrobots.io/wp-content/uploads/2016/11/javascript-coding-1000x576-768x442.jpg 768w, https://webrobots.io/wp-content/uploads/2016/11/javascript-coding-1000x576.jpg 1000w" data-sizes="auto" data-orig-sizes="(max-width: 512px) 100vw, 512px" /></p>
<p>Based on this accumulated experience we identified the most common mistakes that robot writers make. We published the list with specific code examples that illustrate each mistake, along with an explanation and a code example showing the solution. It is a highly recommended read both for those new to robot writing and for experienced coders. Here it is: <a href="https://webrobots.io/most-common-robot-writing-mistakes/">Most common robot writing mistakes</a>. Enjoy!</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/writing-better-data-collection-robots/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Scrape Instagram Followers</title>
		<link>https://webrobots.io/scrape-instagram-followers/</link>
					<comments>https://webrobots.io/scrape-instagram-followers/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 08 Jun 2016 11:18:18 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[social media crawling]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5359</guid>

					<description><![CDATA[Our platform is often used by growth hackers for lead generation in social media networks. One such use case is building a list of Instagram followers from interesting profiles. Today we placed one such robot into our portal's demo space for anyone to use. Robot is only 30 lines of Javascript code and works quite fast. We [...]]]></description>
										<content:encoded><![CDATA[<p><div class="fusion-fullwidth fullwidth-box fusion-builder-row-8 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-7 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Our platform is often used by growth hackers for lead generation in social media networks. One such use case is building a list of Instagram followers from interesting profiles. Today we placed one such robot into our <a href="http://portal.webrobots.io" target="_blank" rel="noopener noreferrer">portal</a>&#8216;s demo space for anyone to use. The robot is only 30 lines of JavaScript code and works quite fast. We tested it on IBM&#8217;s Instagram profile, which has 78k followers, and it took only 14 minutes to scrape them.</p>
<p><img loading="lazy" decoding="async" class="lazyload size-full wp-image-5362 aligncenter" src="https://webrobots.io/wp-content/uploads/2016/06/instagram_robot.png" data-orig-src="https://webrobots.io/wp-content/uploads/2016/06/instagram_robot.png" alt="instagram_robot" width="674" height="615" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27674%27%20height%3D%27615%27%20viewBox%3D%270%200%20674%20615%27%3E%3Crect%20width%3D%27674%27%20height%3D%273615%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2016/06/instagram_robot-300x274.png 300w, https://webrobots.io/wp-content/uploads/2016/06/instagram_robot.png 674w" data-sizes="auto" data-orig-sizes="(max-width: 674px) 100vw, 674px" /></p>
<p>How to use this robot:</p>
<ol>
<li>Login to <a href="http://portal.webrobots.io" target="_blank" rel="noopener noreferrer">Web Robots portal</a> on Chrome browser.</li>
<li>Make sure you have <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak" target="_blank" rel="noopener noreferrer">Web Robots Chrome extension</a> to run the robot.</li>
<li>Open the robot <strong>instagram_followers</strong> in our extension.</li>
<li>Make sure you are logged in on the Instagram website.</li>
<li>Modify the start URL to the desired Instagram profile (example: https://www.instagram.com/ibm) and click Run.</li>
<li>When the robot is finished, the data will be available on the portal in CSV and JSON formats.</li>
</ol>
<p>Remember, this robot is placed in the Demo space, which means it can be modified by anyone. If someone messes up the code, you can restore it from the code below. Just paste it into the extension&#8217;s editor:</p>
</div><div class="fusion-clearfix"></div></div></div></div></div><div class="fusion-fullwidth fullwidth-box fusion-builder-row-9 hundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-8 fusion-one-full fusion-column-first fusion-column-last fusion-column-no-min-height 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><pre class="brush: jscript;">
// Must be logged in
// Start URL above must be target Instagram profile. Example: https://www.instagram.com/ibm/
 
steps.start = function(req) {
 
    var user_id = $(&quot;script:contains(profilePage_)&quot;).text().split('profilePage_')[1].split('&quot;')[0];
 
    if (!req) {
        req = &quot;q=ig_user(&quot; + user_id + &quot;)+%7B%0A++followed_by.first(20)+%7B%0A++++count%2C%0A++++page_info+%7B%0A++++++end_cursor%2C%0A++++++has_next_page%0A++++%7D%2C%0A++++nodes+%7B%0A++++++id%2C%0A++++++is_verified%2C%0A++++++followed_by_viewer%2C%0A++++++requested_by_viewer%2C%0A++++++full_name%2C%0A++++++profile_pic_url%2C%0A++++++username%0A++++%7D%0A++%7D%0A%7D%0A&amp;ref=relationships%3A%3Afollow_list&quot;;
    }
 
    var token = $(&quot;script:contains(csrf_token)&quot;).text().split('&quot;csrf_token&quot;: &quot;').pop().split('&quot;').shift();
 
    $.ajax({
        url: &quot;https://www.instagram.com/query/&quot;,
        headers: {
            'x-instagram-ajax': '1',
            &quot;x-csrftoken&quot;: token
        },
        method: 'POST',
        data: req,
        success: function(data) {
 
            emit(&quot;Followers&quot;, data.followed_by.nodes);
 
            if (data.followed_by.page_info.has_next_page) {
                var next_req = &quot;q=ig_user(&quot; + user_id + &quot;)+%7B%0A++followed_by.after(&quot; + data.followed_by.page_info.end_cursor + &quot;%2C+20)+%7B%0A++++count%2C%0A++++page_info+%7B%0A++++++end_cursor%2C%0A++++++has_next_page%0A++++%7D%2C%0A++++nodes+%7B%0A++++++id%2C%0A++++++is_verified%2C%0A++++++followed_by_viewer%2C%0A++++++requested_by_viewer%2C%0A++++++full_name%2C%0A++++++profile_pic_url%2C%0A++++++username%0A++++%7D%0A++%7D%0A%7D%0A&amp;ref=relationships%3A%3Afollow_list&quot;;
                next(&quot;&quot;, &quot;start&quot;, next_req);
            }
 
            done(1000);
        }
    });
};
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div></p>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scrape-instagram-followers/feed/</wfw:commentRss>
			<slash:comments>90</slash:comments>
		
		
			</item>
		<item>
		<title>Scraping Yelp Data</title>
		<link>https://webrobots.io/scraping-yelp-data/</link>
					<comments>https://webrobots.io/scraping-yelp-data/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 01 Mar 2016 14:41:54 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5321</guid>

					<description><![CDATA[We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time we have not seen a good business case for a commercial project with scraping Yelp. We have decided to release a simple example Yelp robot which [...]]]></description>
										<content:encoded><![CDATA[<p><div class="fusion-fullwidth fullwidth-box fusion-builder-row-10 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-9 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time, we have not seen a good business case for a commercial Yelp scraping project.</p>
<p>We have decided to release a simple example Yelp robot which anyone can run in Chrome on their own computer, tune to their own requirements, and collect some data. With this robot you can save business contact information like addresses, postal codes, telephone numbers, website addresses, etc.  The robot is placed in our Demo space on the Web Robots <a href="http://portal.webrobots.io/" target="_blank" rel="noopener noreferrer">portal</a> for anyone to use; just sign up, find the robot, and use it.</p>
<p><img loading="lazy" decoding="async" class="lazyload size-large wp-image-5328 aligncenter" src="https://webrobots.io/wp-content/uploads/2016/03/Screen-Shot-2016-03-01-at-3.22.41-PM-1024x849.png" data-orig-src="https://webrobots.io/wp-content/uploads/2016/03/Screen-Shot-2016-03-01-at-3.22.41-PM-1024x849.png" alt="Screen Shot 2016-03-01 at 3.22.41 PM" width="669" height="555" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27669%27%20height%3D%27555%27%20viewBox%3D%270%200%20669%20555%27%3E%3Crect%20width%3D%27669%27%20height%3D%273555%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2016/03/Screen-Shot-2016-03-01-at-3.22.41-PM-300x249.png 300w, https://webrobots.io/wp-content/uploads/2016/03/Screen-Shot-2016-03-01-at-3.22.41-PM-768x637.png 768w, https://webrobots.io/wp-content/uploads/2016/03/Screen-Shot-2016-03-01-at-3.22.41-PM-1024x849.png 1024w" data-sizes="auto" data-orig-sizes="(max-width: 669px) 100vw, 669px" /></p>
<h3>How to use it:</h3>
<ol>
<li>Sign in to our portal <a href="http://portal.webrobots.io/" target="_blank" rel="noopener noreferrer">here</a>.</li>
<li>Download our scraping extension from <a href="https://chrome.google.com/webstore/detail/pmagfjeddlknbohojnepcplpgjlincak" target="_blank" rel="noopener noreferrer">here</a>.</li>
<li>Find the robot named Yelp_us_demo in the dropdown.</li>
<li>Modify the start URL to the first page of your search results. For example: http://www.yelp.com/search?find_desc=Restaurants&amp;find_loc=Arlington,+VA,+USA</li>
<li>Click Run.</li>
<li>Let the robot finish its job, then download the data from the portal.</li>
</ol>
<h3><strong>Some things to consider:</strong></h3>
<p>This robot is placed in our Demo space &#8211; therefore it is accessible to anyone: anyone can modify and run it, and anyone can download the collected data. The robot&#8217;s code may be edited by someone else, but you can always restore it from the sample code below. Yelp limits the number of search results, so do not expect to scrape more results than you would normally see by searching.</p>
<p>In case you want to create your own version of such a robot, here is its full code:</p>
</div><div class="fusion-clearfix"></div></div></div></div></div><div class="fusion-fullwidth fullwidth-box fusion-builder-row-11 hundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-10 fusion-one-full fusion-column-first fusion-column-last fusion-column-no-min-height 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><pre class="brush: jscript;">
// starting URL above must be the first page of search results.
// Example: http://www.yelp.com/search?find_desc=Restaurants&amp;find_loc=Arlington,+VA,+USA

steps.start = function ()  {
    
    var rows = [];
    
    // listings
    $(&quot;.biz-listing-large&quot;).each (function (i,v) {
        if ($(&quot;h3 a&quot;, v).length &gt; 0)
        {
            var row = {};
            row.company = $(&quot;.biz-name&quot;, v).text().trim();
            row.reviews =$(&quot;.review-count&quot;, v).text().trim();
            row.companyLink =  $(&quot;.biz-name&quot;, v)[0].href;
            row.location = $(&quot;.secondary-attributes address&quot;, v).text().trim();
            row.phone = $(&quot;.biz-phone&quot;, v).text().trim();
            rows.push (row);
        }
    });
    
    emit (&quot;yelp&quot;, rows);
    
    // paging
    if ($(&quot;.next&quot;).length === 1) {
        next ($(&quot;.next&quot;)[0].href, &quot;start&quot;);
    }
    
    done();
};
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div></p>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scraping-yelp-data/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>New Kickstarter Dataset</title>
		<link>https://webrobots.io/new-kickstarter-dataset/</link>
					<comments>https://webrobots.io/new-kickstarter-dataset/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 31 Dec 2015 08:16:20 +0000</pubDate>
				<category><![CDATA[Datasets]]></category>
		<category><![CDATA[kickstarter datasets]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5289</guid>

					<description><![CDATA[Recently we updated our Kickstarter robot to crawl project subcategories. This allows us to collect a richer dataset, for example on 2015-12-17 run robot collected data about 144,263 projects with a running time only 2 hours! We also started presenting it in the JSON streaming format which is just a line delimited JSON. Previously we [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-12 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-11 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Recently we updated our Kickstarter robot to crawl project subcategories. This allows us to collect a richer dataset: for example, on the 2015-12-17 run the robot collected data on 144,263 projects with a running time of only 2 hours! We also started presenting the data in the JSON streaming format, which is simply line-delimited JSON. Previously we used to put all projects into one JSON array, and the downside was that the user had to read the entire large JSON file into memory before any processing could start. With JSON streaming it is possible to read one line at a time.</p>
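<p>Consuming line-delimited JSON is simple in any language. Below is a minimal JavaScript sketch (the sample rows are hypothetical, not from the actual dataset) that parses one project per line:</p>

```javascript
// Hypothetical sample of line-delimited JSON: one complete JSON document per line.
var ndjson = [
  '{"id": 1, "name": "Project A", "pledged": 1200}',
  '{"id": 2, "name": "Project B", "pledged": 540}'
].join('\n');

// Because each line stands alone, lines can be parsed one at a time
// instead of loading one huge JSON array into memory.
var projects = ndjson
  .split('\n')
  .filter(function(line) { return line.trim().length > 0; })
  .map(function(line) { return JSON.parse(line); });
```

When reading from a file, the same parsing can be applied to a line-by-line stream so memory use stays constant regardless of dataset size.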
<p>Data is posted in the usual <a href="https://webrobots.io/kickstarter-datasets/">place</a>.</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/new-kickstarter-dataset/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
