<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Web Scraping &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/category/web-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Wed, 15 Feb 2023 08:39:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>New Functions Added</title>
		<link>https://webrobots.io/new-functions-added/</link>
					<comments>https://webrobots.io/new-functions-added/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 15 Feb 2023 08:39:14 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6294</guid>

					<description><![CDATA[Web Robots scraping framework documentation has been updated with new functions: blockImages() - changes browser settings regarding image downloading. This function is useful in scenarios where bandwidth is a concern. Sometimes it results in faster crawling speeds. allowImages() - reverses the browser settings changes made by blockImages(). closeSocket() - closes all idle socket connections in the browser. [...]]]></description>
										<content:encoded><![CDATA[<p><a href="https://webrobots.io/werobots-documentation/">Web Robots scraping framework documentation</a> has been updated with new functions:</p>
<ul>
<li><strong>blockImages()</strong> &#8211; changes browser settings regarding image downloading. This function is useful in scenarios where bandwidth is a concern. Sometimes it results in faster crawling speeds.</li>
<li><strong>allowImages()</strong> &#8211; reverses the browser settings changes made by blockImages().</li>
<li><strong>closeSocket()</strong> &#8211; closes all idle socket connections in the browser.</li>
</ul>
<p>Web Robots have been using these functions on the internal platform for over 6 months and they have proved to be a great help in some scenarios. Now they are available in our public extension.</p>
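<p>As an illustration only, here is a minimal sketch of how these functions might be combined in a robot; the step structure follows the usual steps/next()/done() pattern from the framework documentation, and the URL and step names are placeholders:</p>
<pre class="brush: jscript;">
steps.start = function(){
    // block image downloads to save bandwidth before queueing pages
    blockImages();
    next('https://example.com/listing','getItems');
    done();
}

steps.getItems = function(){
    // ... scrape the page here ...

    // re-enable images and close idle sockets once they are no longer needed
    allowImages();
    closeSocket();
    done();
}
</pre>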
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/new-functions-added/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Our Chrome extension has been updated</title>
		<link>https://webrobots.io/our-chrome-extension-has-been-updated/</link>
					<comments>https://webrobots.io/our-chrome-extension-has-been-updated/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 01 Oct 2020 12:24:45 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping service]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6143</guid>

					<description><![CDATA[Our public developer extension (IDE) has been untouched since March 2019. It may look like Web Robots were stagnating, but we were actually constantly working on our internal systems like the portal, cloud workers and cloud orchestration. We also had several internal releases of the IDE for our staff. So the IDE published to the Chrome webstore [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Our public developer extension (IDE) has been untouched since March 2019. It may looks like Web Robots were stagnating, but actually we were constantly working on our internal systems like portal, cloud workers, cloud orchestration. We also has several internal releases of IDE for our staff.</p>
<p>So the IDE published to the Chrome webstore is just the tip of the iceberg.</p>
<p>It is now September 2020 and the time has come to release the <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak?hl=en">new version to Chrome webstore</a>. We are glad that our extension passed the webstore&#8217;s permission audit on the first try, as it requires access to quite a few Chrome APIs in order to work properly and the people at Google are getting ever stricter in their review process for extension permissions.</p>
<p>All changes and new features are listed in the <a href="https://webrobots.io/changelog/">changelog here</a>.</p>
<p><em>PS: our IDE extension works only for users who have an approved account on the Web Robots portal.</em></p>
</div><style type="text/css">.fusion-gallery-1 .fusion-gallery-image {border:0px solid #f6f6f6;}</style><div class="fusion-gallery fusion-gallery-container fusion-grid-3 fusion-columns-total-0 fusion-gallery-layout-grid fusion-gallery-1" style="margin:-5px;"><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Robot editor and debugger" rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img fetchpriority="high" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png" width="716" height="564" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6144" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27716%27%20height%3D%27564%27%20viewBox%3D%270%200%20716%20564%27%3E%3Crect%20width%3D%27716%27%20height%3D%273564%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-200x158.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-400x315.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32-600x473.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.32.png 716w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Reset Proxy and Allow images rescue buttons." 
rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png" width="708" height="562" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6145" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27708%27%20height%3D%27562%27%20viewBox%3D%270%200%20708%20562%27%3E%3Crect%20width%3D%27708%27%20height%3D%273562%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-200x159.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-400x318.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46-600x476.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.18.46.png 708w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div style="padding:5px;" class="fusion-grid-column fusion-gallery-column fusion-gallery-column-3 hover-type-zoomin"><div class="fusion-gallery-image"><a href="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" data-title="Web scraping extension" title="Web scraping extension" data-caption="Data preview pane - Excel style table and JSON preview." rel="noreferrer" data-rel="iLightbox[gallery_image_1]" class="fusion-lightbox" target="_self"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png" width="748" height="586" alt="Web scraping extension" title="Web scraping extension" aria-label="Web scraping extension" class="lazyload img-responsive wp-image-6146" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27748%27%20height%3D%27586%27%20viewBox%3D%270%200%20748%20586%27%3E%3Crect%20width%3D%27748%27%20height%3D%273586%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-200x157.png 200w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-400x313.png 400w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27-600x470.png 600w, https://webrobots.io/wp-content/uploads/2020/10/Screenshot-2020-10-01-at-15.19.27.png 748w" data-sizes="auto" data-orig-sizes="(min-width: 2200px) 100vw, (min-width: 784px) 541px, (min-width: 712px) 784px, (min-width: 640px) 712px, " /></a></div></div><div class="clearfix"></div></div><div class="fusion-clearfix"></div></div></div></div></div><style type="text/css">.fusion-fullwidth.fusion-builder-row-1 
a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link) , .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):before, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):after {color: #03a9f4;}.fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:before, .fusion-fullwidth.fusion-builder-row-1 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:after {color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .pagination a.inactive:hover, .fusion-fullwidth.fusion-builder-row-1 .fusion-filters .fusion-filter.fusion-active a {border-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .pagination .current {border-color: #0074a2; background-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-1 .fusion-filters .fusion-filter.fusion-active a, .fusion-fullwidth.fusion-builder-row-1 .fusion-date-and-formats .fusion-format-box, .fusion-fullwidth.fusion-builder-row-1 .fusion-popover, .fusion-fullwidth.fusion-builder-row-1 .tooltip-shortcode {color: #0074a2;}#main .fusion-fullwidth.fusion-builder-row-1 .post .blog-shortcode-post-title a:hover {color: #0074a2;}</style>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/our-chrome-extension-has-been-updated/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Instant Data Users Group on Facebook</title>
		<link>https://webrobots.io/instant-data-users-group-on-facebook/</link>
					<comments>https://webrobots.io/instant-data-users-group-on-facebook/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 28 Apr 2020 11:02:25 +0000</pubDate>
				<category><![CDATA[Datasets]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6110</guid>

					<description><![CDATA[We have launched a Facebook group where Instant Data Scraper users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users. We hope that [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-2 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-1 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>We have launched a <a href="https://www.facebook.com/groups/instantdata/">Facebook group</a> where <a href="https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah">Instant Data Scraper</a> users will be able to find support for the extension which currently has 65k users. This extension is wildly popular, but at the same time it is completely free, hence Web Robots has limited capacity to answer questions arising from users.</p>
<p>We hope that the new Facebook group will grow into a community where users can support each other.</p>
</div><div class="imageframe-align-center"><div class="fusion-image-frame-bottomshadow image-frame-shadow-1"><style>.fusion-image-frame-bottomshadow.image-frame-shadow-1{display:inline-block}.element-bottomshadow.imageframe-1:before, .element-bottomshadow.imageframe-1:after{-webkit-box-shadow: 0 17px 10px rgba(0,0,0,0.4);box-shadow: 0 17px 10px rgba(0,0,0,0.4);}</style><span class="fusion-imageframe imageframe-bottomshadow imageframe-1 element-bottomshadow hover-type-none"><a class="fusion-no-lightbox" href="https://www.facebook.com/groups/instantdata/" target="_blank" aria-label="Community Support Group" rel="noopener noreferrer"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2020/04/unnamed.png" data-orig-src="https://webrobots.io/wp-content/uploads/2020/04/unnamed.png" width="500" height="228" alt="Community Support Group" class="lazyload img-responsive wp-image-6111" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27500%27%20height%3D%27228%27%20viewBox%3D%270%200%20500%20228%27%3E%3Crect%20width%3D%27500%27%20height%3D%273228%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2020/04/unnamed-200x91.png 200w, https://webrobots.io/wp-content/uploads/2020/04/unnamed-400x182.png 400w, https://webrobots.io/wp-content/uploads/2020/04/unnamed.png 500w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 500px" /></a></span></div></div><div class="fusion-clearfix"></div></div></div></div></div><style type="text/css">.fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link) , .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):before, .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):after {color: #03a9f4;}.fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover, .fusion-fullwidth.fusion-builder-row-2 
a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:before, .fusion-fullwidth.fusion-builder-row-2 a:not(.fusion-button):not(.fusion-builder-module-control):not(.fusion-social-network-icon):not(.fb-icon-element):not(.fusion-countdown-link):not(.fusion-rollover-link):not(.fusion-rollover-gallery):not(.fusion-button-bar):not(.add_to_cart_button):not(.show_details_button):not(.product_type_external):not(.fusion-quick-view):not(.fusion-rollover-title-link):not(.fusion-breadcrumb-link):hover:after {color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .pagination a.inactive:hover, .fusion-fullwidth.fusion-builder-row-2 .fusion-filters .fusion-filter.fusion-active a {border-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .pagination .current {border-color: #0074a2; background-color: #0074a2;}.fusion-fullwidth.fusion-builder-row-2 .fusion-filters .fusion-filter.fusion-active a, .fusion-fullwidth.fusion-builder-row-2 .fusion-date-and-formats .fusion-format-box, .fusion-fullwidth.fusion-builder-row-2 .fusion-popover, .fusion-fullwidth.fusion-builder-row-2 .tooltip-shortcode {color: #0074a2;}#main .fusion-fullwidth.fusion-builder-row-2 .post .blog-shortcode-post-title a:hover {color: #0074a2;}</style>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/instant-data-users-group-on-facebook/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Web Scraping vs Web Crawling</title>
		<link>https://webrobots.io/web-scraping-vs-web-crawling/</link>
					<comments>https://webrobots.io/web-scraping-vs-web-crawling/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 06 May 2019 09:37:58 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=6019</guid>

					<description><![CDATA[The internet is growing exponentially, and the amount of data available for extraction and analysis is growing alongside it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-3 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-2 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>The internet is growing exponentially, and the amount of data available for extraction and analysis is growing along side it. It is no wonder then that many new and confusing terms are created and used every day, such as Data Science, Data mining, Data harvesting, Web scraping, Web crawling, etc. But what do they mean? Is it important to understand the subtle differences, or is it all just fancy lingo? Let&#8217;s look at a couple of terms to try and answer these questions: <em><strong>Web Scraping</strong></em> and <em><strong>Web Crawling</strong></em>.</p>
<h2>Formal Answer</h2>
<p>Let&#8217;s start with the formal definitions:</p>
<p><strong>Web crawling</strong> &#8211; a process where a program or automated script browses the World Wide Web in a methodical, automated manner.<br />
<strong>Web scraping</strong> &#8211; extracting specific data from websites.</p>
<p>As you can see, the terms have quite clear definitions, and some people suggest that it is crucial to understand the minute differences if you want to succeed in the industry. But is that true?</p>
<h2>Real World Answer</h2>
<p>We are a company that has been specializing in <strong>Web Scraping</strong> services for years. We talk to our present and prospective clients on a daily basis, sometimes several times a day. And in these real-world conversations the terms Web Scraping and Web Crawling are often used interchangeably without being precise at all. The reality is &#8211; there are websites out there that have valuable data that needs to be extracted in a structured format, and how you define the process is not important at all.</p>
<h2>What Do We Actually Do?</h2>
<p>Looking back at the projects we did during these years, a simple pattern emerges. The vast majority of our projects are about creating robots that do <strong>targeted web crawling</strong> (crawling not the entire internet, but only specific websites) and immediately do <strong>web scraping</strong> as each web page is retrieved. So both processes occur simultaneously in real time. Most often we discard almost the entire retrieved HTML document and save only the bits of information that are needed by our clients. In some cases we will save the entire HTML for traceability, or for further analysis. So the lines between <strong>web crawling</strong> and <strong>web scraping</strong> become somewhat blurred as the amount of data extracted varies.</p>
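<p>As an illustration only, here is a stripped-down robot step in our framework showing this crawl-and-scrape-in-one-pass pattern; the selectors, collection name and URL parameter are placeholders, not taken from a real project:</p>
<pre class="brush: jscript;">
steps.getProduct = function(url){
    // targeted crawling: fetch only this specific page
    $.get(url).done( function(resp){
        // immediate scraping: keep only the bits of information the client needs
        // and discard the rest of the retrieved HTML
        let item = {
            name: $('h1', resp).text(),
            price: $('.price', resp).text()
        };
        emit('items', [item]);
        done();
    });
}
</pre>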
<p>In the end we found that the essential thing is clear communication about what needs to be done, rather than how to define it. However, this is just our opinion based on our experience, and depending on the project you might be working on, or the business model you might implement, you might reach a different conclusion. In any case, we can all agree &#8211; Web Scraping at scale is cool!</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/web-scraping-vs-web-crawling/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>Advanced AJAX Techniques for Web Scraping</title>
		<link>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/</link>
					<comments>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 10 Apr 2019 07:32:21 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5974</guid>

					<description><![CDATA[Basic AJAX usage within Web Robots scraper Best and simplest way to perform AJAX calls with the scraper is to use JQuery $.ajax() or the simplified $.get(), $.post() and $.getJSON() methods. [javascript] // Standard JQuery AJAX call $.ajax({ url:'https://webrobots.io', method: 'GET' }).done( function(resp){ console.log(resp); }); // Simplified AJAX call $.get('https://webrobots.io').done( function(resp){ console.log(resp); }); [/javascript] [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-4 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-3 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><h2><strong>Basic AJAX usage within Web Robots scraper</strong></h2>
<p>The best and simplest way to perform AJAX calls with the scraper is to use JQuery <a href="http://api.jquery.com/jquery.ajax/">$.ajax()</a> or the simplified <a href="https://api.jquery.com/jquery.get/">$.get()</a>, <a href="https://api.jquery.com/jquery.post/">$.post()</a> and <a href="https://api.jquery.com/jquery.getjson/">$.getJSON()</a> methods.</p>
<pre class="brush: jscript;">
// Standard JQuery AJAX call
$.ajax({
    url:'https://webrobots.io',
    method: 'GET'
}).done( function(resp){
    console.log(resp); 
});

// Simplified AJAX call
$.get('https://webrobots.io').done( function(resp){
   console.log(resp); 
});

</pre>
<p>Since AJAX is asynchronous, the step done() should always be placed inside the AJAX callback function. Also, multiple AJAX calls shouldn&#8217;t be made inside a loop; instead, a new step for the AJAX call should be created and queued up with next() inside the loop.</p>
<h3><strong>Examples of incorrect and correct done() placement in AJAX:</strong></h3>
<h3><span style="color: #f03030;">INCORRECT</span></h3>
<pre class="brush: jscript; highlight: &#091;5&#093;;">
steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code
    });
    done(); 
}
</pre>
<h3><span style="color: #339966;">CORRECT</span></h3>
<pre class="brush: jscript; highlight: &#091;4&#093;;"> 
steps.start = function(){
    $.get('https://webrobots.io').done( function(resp){
        // some code;
        done(); 
    });
}
</pre>
<h3><strong>Examples of incorrect and correct AJAX looping:</strong></h3>
<h3><span style="color: #f03030;">INCORRECT</span></h3>
<pre class="brush: jscript;"> 
steps.start = function(){
   for( let url of urls){
       $.get(url).done( function(resp){
           // some code 
       });
   }
   done(); 
}
</pre>
<h3><span style="color: #339966;">CORRECT</span></h3>
<pre class="brush: jscript;">
 
steps.start = function(){
    for( let url of urls){
        next('','getUrl',url);
    }
    done(); 
}

steps.getUrl = function(url){
    $.get(url).done( function(resp){
         // some code;
         done(); 
    }); 
} 
</pre>
<hr />
<h2><strong>AJAX timeout</strong></h2>
<p>One issue with AJAX requests inside a step function is that the step global retry timeout and the AJAX timeout are independent, and in certain scenarios this can cause problems.</p>
<p>Consider this example. A GET request is performed, and since it is asynchronous, the step done() function is placed inside the GET done block. If the GET fails, we can either call done() inside the .fail() block and move along with our scraping, or omit the .fail() block and force a step retry after our preset retry timeout.</p>
<pre class="brush: jscript;">
steps.start = function(){

    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);

}
</pre>
<p>This works fine when the server returns a failed response (e.g. status code 404) or fails to respond at all. However, depending on how the server is configured, it might return a valid response after a significant delay, sometimes above our locally set step retry timeout. This means that even though the step has already finished, the code inside the GET done block will run and trigger a done(). Depending on the specific code, this can cause instability in the robot and unnecessary error logging. To avoid such a scenario, a local AJAX timeout should be set just below the step retry timeout (default is 60000 ms). In the example below, if a response is not received from the server within 55000 ms, the AJAX call will time out and the code will proceed to run as normal.</p>
<pre class="brush: jscript;">
steps.start = function(){

    // default retry timer is 60000ms, AJAX timeout should be a few seconds lower.
    $.ajaxSetup({timeout:55000});
    $.get('https://webrobots.io').done( function(response){
        // some code;
        done();
    })
    //.fail(done);
}
</pre>
<hr />
<h2><strong>Multiple simultaneous AJAX calls using $.when()</strong></h2>
<p>Performing several simultaneous AJAX calls is a very efficient way to handle certain scraping situations. One such situation is a website that loads parts of its content as static html, and other parts dynamically through various APIs. Consider an example website that performs a separate AJAX call to get the post content, one to get the post image, and another one for post reviews.  A simple approach could be to just stack all three AJAX calls to start as soon as the previous one finishes. We will use <a href="https://jsonplaceholder.typicode.com/">jsonplaceholder.typicode.com</a> to construct our example:</p>
<pre class="brush: jscript;">
steps.start = function(){
    $.get('https://jsonplaceholder.typicode.com/posts/1').done( function(r1){
        console.log( r1 ); 
        $.get('https://jsonplaceholder.typicode.com/photos/1').done( function(r2){
            console.log( r2 ); 
            $.get('https://jsonplaceholder.typicode.com/comments/1').done( function(r3){
                 console.log( r3 ); 
                 done();
            });
        });
    });
};
</pre>
<p>The downside of this approach is that a new AJAX call cannot start until the previous one ends, wasting valuable time. The solution is to use the <a href="https://api.jquery.com/jquery.when/">JQuery.when()</a> method. It takes multiple Deferred objects as arguments, in this case $.get() methods, and will resolve its master Deferred as soon as all the Deferreds resolve, or reject the master Deferred as soon as one of the Deferreds is rejected. The arguments passed to the doneCallbacks provide the resolved values for each of the Deferreds, and match the order in which the Deferreds were passed to the $.when() method. Our example remade with $.when() would look like this:</p>
<pre class="brush: jscript;">
steps.start = function(){

    let a1 = () =&gt; $.get('https://jsonplaceholder.typicode.com/posts/1');
    let a2 = () =&gt; $.get('https://jsonplaceholder.typicode.com/photos/1');
    let a3 = () =&gt; $.get('https://jsonplaceholder.typicode.com/comments/1');
    $.when( a1(), a2(), a3() ).then(function ( r1, r2, r3 ) {
        // r1, r2 and r3 are arguments resolved for the a1, a2 and a3 ajax requests, respectively.
        // Each argument is an array with the following structure: [ data, statusText, jqXHR ]
        console.log( r1 ); 
        console.log( r2 ); 
        console.log( r3 ); 
        done();
    });
}
</pre>
<p>This way all AJAX requests are started simultaneously and the code proceeds when all responses are resolved. Depending on how many simultaneous requests are made and the response times from the server, this method has the potential to significantly increase the speed of a robot.</p>
<hr />
<h2><strong>Dynamic number of simultaneous AJAX calls</strong></h2>
<p>During our web scraping journey, we came across a couple of instances where it is useful to be able to make multiple AJAX calls when the number of calls is not known in advance. One such example would be taking links from multiple sitemaps and distributing them evenly between forks. Unfortunately this cannot be accomplished using $.when() because it accepts a fixed number of arguments and returns the same number of responses, each of which has to be specified individually. We can solve this by using the ES6 <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/all">Promise.all()</a> method, which returns a single Promise that resolves when all of the promises passed as an array have resolved or when the array contains no promises. It rejects with the first promise that rejects. Here is an example using the <a href="https://www.rottentomatoes.com">rottentomatoes.com</a> sitemap:</p>
<pre class="brush: jscript;">
steps.start = function(){
    $.get('https://www.rottentomatoes.com/sitemap.xml').done( function(response){
        let sitemaps = $('loc', response).map((i, v) =&gt; $(v).text() ).get()
        next('','sitemaps',sitemaps);
        done();
    })
}

steps.sitemaps = function( sitemaps ){
    // Creating an array of promises
    let promises = sitemaps.map( url =&gt; $.get(url) );

    // Waiting for all AJAX promises to resolve before executing further code
    Promise.all( promises ).then( function( responses ){
        for( let r of responses ){
            // logging the number of links in each sitemap
            console.log( $('loc', r).length );
        }
        done();
    });
}
</pre>
<p><span style="color: #ff0000;">IMPORTANT: </span> This method should only be used when absolutely necessary because excessive amount of constant simultaneous requests could strain the target server or be identified as unwanted traffic and trigger blocking. So always use proper delays and follow robots.txt rules for each website you scrape.</p>
<hr />
<h2><strong>Vanilla JS AJAX use cases</strong></h2>
<p>While JQuery.ajax() is handy, it has one disadvantage in that it always sets the <strong>x-requested-with : XMLHttpRequest</strong> header, and in very rare cases this affects the content of the response that is sent by the server. To circumvent this, use the Vanilla JS <a href="https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest">XMLHttpRequest</a> object or the modern <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API">fetch API</a>. Refer to their respective documentation pages for more info on how to use them. Here are a couple of simple examples.</p>
<h3><strong>Example using XMLHttpRequest: </strong></h3>
<pre class="brush: jscript;">
var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
   if (this.readyState == 4 &amp;&amp; this.status == 200) {
       console.log( this.responseText );
   }
};

xhttp.open('GET', 'cookies.php', true);
xhttp.send();
</pre>
<h3><strong>Example using fetch: </strong></h3>
<pre class="brush: jscript;">
fetch('https://webrobots.io/').then(function(response){
    return response.text();
}).then(function(text){
    console.log(text);
});
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/advanced-ajax-techniques-for-web-scraping/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Using Sitemaps in Web Scraping Robots</title>
		<link>https://webrobots.io/using-sitemaps-in-web-scraping-robots/</link>
					<comments>https://webrobots.io/using-sitemaps-in-web-scraping-robots/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 25 Mar 2019 09:41:15 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[sitemap]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5876</guid>

					<description><![CDATA[We often use the spidering-through-categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach for this - just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-5 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-4 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">We often use spidering through categories technique and pagination/infinite scroll when we need to discover and crawl all items of interest on a website. However there is a simpler and more straightforward approach for this &#8211;  just using </span><a href="https://en.wikipedia.org/wiki/Sitemaps"><b>sitemaps</b></a><span style="font-weight: 400;">. Sitemap based robots are easier to maintain than a mix of category drilling, pagination and dynamic content loading imitation.</span></p>
<p><span style="font-weight: 400;">After all, sitemaps are designed for robots to find all resources on a particular domain.</span></p>
<p><b>Example of a sitemap:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-2 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1024x392.png" width="1024" height="392" alt="" title="sitemap-example" class="lazyload img-responsive wp-image-5877" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271384%27%20height%3D%27530%27%20viewBox%3D%270%200%201384%20530%27%3E%3Crect%20width%3D%271384%27%20height%3D%273530%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-200x77.png 200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-400x153.png 400w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-600x230.png 600w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-800x306.png 800w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1200x460.png 1200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png 1384w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Finding Sitemaps</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">The fastest way to find a sitemap URL is to check </span><i><span style="font-weight: 400;">robots.txt</span></i><span style="font-weight: 400;"> file. For example </span><a href="https://www.rottentomatoes.com/robots.txt"><span style="font-weight: 400;">https://www.rottentomatoes.com/robots.txt</span></a></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">We can also probe typical sitemap URLs like </span><i><span style="font-weight: 400;">domain.com/sitemap</span></i><span style="font-weight: 400;"> or </span><i><span style="font-weight: 400;">domain.com/sitemap.xml</span></i></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sometimes just going to the homepage and searching for the keyword “sitemap” works</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If all above bear no fruit, google search can help (example: “target.com sitemap&#8221;).</span></li>
</ul>
<p><b>Example of domain/robots.txt:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-3 hover-type-none"><img loading="lazy" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-1024x212.png" width="1024" height="212" alt="" title="sitemap-example-2" class="lazyload img-responsive wp-image-5885" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271024%27%20height%3D%27212%27%20viewBox%3D%270%200%201024%20212%27%3E%3Crect%20width%3D%271024%27%20height%3D%273212%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-200x41.png 200w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-400x83.png 400w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-600x124.png 600w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-800x166.png 800w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png 1024w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Working With Large Sitemaps</span></h1>
<p><span style="font-weight: 400;">Sitemaps usually have many thousands of records and opening them directly will freeze Chrome browser for several minutes while browser renders XML. Our best practice is to make $.get request to get a sitemap and process it.</span></p>
<p><b>example of getting a sitemap using an ajax</b> <b>request and filtering URLs:</b></p>
<pre class="brush: jscript;">

$.get('https://www.rottentomatoes.com/sitemap_0.xml').then( function(response){
    $('url loc',response).each( function(i, v){
        var url = $(v).text();

        // filtering: we only need URLs that have no further path after film name
        // we can filter out URLs with longer URL paths than film page has

        if(url.split('/').length &lt; 6) next(url,'getFilmInfo');

    });
    done();
});

</pre>
<h1><span style="font-weight: 400;">Downsides of Sitemap Approach</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap can be outdated (old URLs leading to 404 pages) and the site owner might not even notice that their sitemaps are incorrect. It is necessary to do spot checks to see if an URL works.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;"> Sitemap might not have all the items listed in the normal website interface. Best practice is to spot check that items found on a website are present in the sitemap as well.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;"> Sitemaps do not allow filtering items based on certain criteria. For example if we need only electronics from a large eshop, we still have to crawl all products and do filtering in the back-end.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do show how popular an item is &#8211; for example we cannot infer if a particular item is on the first page in it’s category or somewhere near the end.</span></li>
</ul>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/using-sitemaps-in-web-scraping-robots/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Scraping Dynamic Websites Using The wait() Function</title>
		<link>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/</link>
					<comments>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 04 Mar 2019 07:00:48 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[dynamic website]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5842</guid>

					<description><![CDATA[Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-6 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-5 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Dynamic websites are one of the biggest headaches of every developer who works with web scraping robots. Data extraction becomes complicated when it cannot be found in the initial HTML of the website. For example, walmart.com loads product data via AJAX call after the initial DOM is rendered. Therefore we must wait and then extract data from the DOM.For example Walmart’s product page https://grocery.walmart.com/ip/Viva-Paper-Towels-Choose-A-Sheet-1-Big-Roll/52291575 product data appears in selector $(&#8216;div[class^=&#8221;ProductPage__details&#8221;]&#8217;).</p>
<pre class="brush: jscript;">

steps.start = function() {

    console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

    done();

};

</pre>
<p>Logged result is 0, as our code executes as soon as the DOM is ready, but before the element appears. There are several ways we can fix this.</p>
<h2><strong>Simple waiting strategy &#8211; use setTimeout()</strong></h2>
<p>We can use setTimeout(), where we specify the number of milliseconds to wait before executing a piece of code. This way the browser has some time to process dynamic data and insert it into the DOM. In this example we introduce a simple 3 second wait:</p>
<pre class="brush: jscript;">

steps.start = function() {

    setTimeout(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    }, 3000);

};

</pre>
<p>Logged result is 1, which indicates that we found the expected data in the DOM. However, there are some drawbacks to this method, as the code will be delayed the same amount of time regardless of how long the website actually takes to handle its dynamic requests. This means we are wasting time when the product data appears sooner and missing data when it loads slower. Dynamic pages have a tendency to load inconsistently, therefore the exact timeout duration for each page load is impossible to know in advance. The maximum observed delay time is usually chosen when using setTimeout(). If we are waiting for 3 seconds, the average time for data to appear is 1.5 seconds, and we have to process 50,000 products &#8211; then 20.83 hours are wasted. This is 625 hours per month if we run this robot every day!</p>
<h2><strong>Better waiting strategy &#8211; use wait()</strong></h2>
<p>The Web Robots system wait() function enables the user to wait for a particular HTML element to load and then execute code right after the element appears.</p>
<p>Signature: wait(string or array selector[], int maxWaitTime)<br />
Default maxWaitTime = 10000<br />
Usable callbacks: then, always, fail (similar to JQuery deferred, https://api.jquery.com/jquery.deferred/)</p>
<p>Example:</p>
<pre class="brush: jscript;">

steps.start = function() {

    wait('div[class^=&quot;ProductPage__details&quot;]').then(function() {

        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    })

};

</pre>
<p>wait() can have multiple callbacks for scenarios when an element appears, does not appear, or always:</p>
<ul>
<li>wait(selector, time_to_wait*).then(callback) &#8211; callback will be executed immediately when the selector appears. If the selector doesn&#8217;t appear, the function will never be executed.</li>
<li>wait(selector, time_to_wait*).always(callback) &#8211; callback is executed when the element appears or when time_to_wait is reached.</li>
<li>wait(selector, time_to_wait*).then(callback).fail(callback2) &#8211; callback will be executed when the element appears. callback2 will be executed if the element does not appear.</li>
<li>wait([selector1, selector2, …], time_to_wait*).then(callback) &#8211; callback is executed only when all of the selectors (selector1, selector2, …) have appeared on the website.</li>
</ul>
<p>*time_to_wait is an optional parameter that allows the user to choose the number of milliseconds to wait for a specified selector. The default (if not specified in the function) is 10000 ms.</p>
<p>The wait() function makes scraping of dynamic pages much easier, more efficient and more reliable.</p>
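<p>For completeness, here is a short sketch of the .then()/.fail() pattern using the same placeholder selector as above; the 15000 ms timeout and the fallback logging are illustrative only:</p>
<pre class="brush: jscript;">

steps.start = function() {

    wait('div[class^=&quot;ProductPage__details&quot;]', 15000).then(function() {

        // element appeared within 15 seconds - scrape it
        console.log($('div[class^=&quot;ProductPage__details&quot;]').length);

        done();

    }).fail(function() {

        // element did not appear in time - finish the step without data
        console.log('product details did not load');

        done();

    });

};

</pre>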
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/scraping-dynamic-websites-using-the-wait-function/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Web Scraping Performance Tuning With fastnext()</title>
		<link>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/</link>
					<comments>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 20 Feb 2019 09:17:51 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5736</guid>

					<description><![CDATA[Scraping websites can be a time consuming process and when limited computing resources are available, combined with the need for frequent and up to date data, having a fast running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-7 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-6 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">Scraping websites can be a time consuming process and when limited computing resources are available, combined with the need for frequent and up to date data, having a fast running robot is essential. A single robot can take anywhere from hours to weeks to complete a run, thus making a robot just fractionally more efficient could save a lot of valuable time.</span></p>
<p><span style="font-weight: 400;">There are a number of ways to optimize your robot to run faster, replacing <strong>setTimeout</strong> with our internal <strong>wait</strong> function, careful usage of loops, not using excessive delay timers in step done function, etc. However, one of the best methods so far been has proven to be using <strong>ajax</strong> requests instead of visiting a website directly. In a standard scenario using next with a link will open a webpage on your browser, that means that it will download the HTML file, all the listed additional resources like js and css files, images, video and audio files, and then process, render and display it in the window.</span></p>
<h5><strong>EXAMPLE ROBOT WITH NEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){ 
    next(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">All of this might take only a few hundred milliseconds, but when you are dealing with potentially hundreds of thousands of loads, every millisecond adds up. Fortunately, from a data collecting robot standpoint, having the html rendered with all the images, sleek css and fonts is not needed, because it is only interested in the data present in the html file. Therefore, we can get the same results for the fraction of the time by getting just the html file with an <strong>ajax</strong> request. However, this requires reformatting of the code by adding extra parameters to the next step and including the response context in subsequent html data selectors.</span></p>
<h5><strong>EXAMPLE ROBOT WITH AJAX:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    next(&quot;&quot;,&quot;getMovie&quot;,&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;);
    done();
}

steps.getMovie = function(url){
    $.get(url).done(function(resp){
        let movie = {name: $(&quot;h1&quot;,resp).text()};
        emit(&quot;movies&quot;,[movie]);
        done();
    })
}
</pre>
<p><span style="font-weight: 400;">This does make the code a bit more complex, and requires more work and care when reformatting old robots in order to avoid errors, especially if the old code is long and complicated. </span></p>
<p><span style="font-weight: 400;">In order to streamline the reformatting and make writing and reading of new robots easier, we integrated the ajax functionality into our extension in the form of <strong>fastnext</strong> function. It functions just like a regular <strong>next</strong>, requiring an url, step name, and an optional third data parameter, but instead of loading the whole website, it does a get request in the background and automatically uses the response html as context in the specified step, thus there is <strong>no need to reformat the selectors</strong>.</span></p>
<h5><strong>EXAMPLE ROBOT WITH FASTNEXT:</strong></h5>
<pre class="brush: jscript;">
steps.start = function(){
    fastnext(&quot;https://www.imdb.com/title/tt0060196/?ref_=nv_sr_1&quot;,&quot;getMovie&quot;);
    done();
}

steps.getMovie = function(){
    let movie = {name: $(&quot;h1&quot;).text()};
    emit(&quot;movies&quot;,[movie]);
    done();
}
</pre>
<p><span style="font-weight: 400;">While reformatting old robots from next to <strong>fastnext</strong> we found that in practice the <strong>savings average at around 50%</strong> reduction in run time. However this varies between the low of only<strong> 25%</strong>, up to a high of <strong>85%</strong>, and it heavily depends on the structure and technology of the specific scraped website.</span></p>
<p><span style="font-weight: 400;">It should be noted however that fastnext, just like a regular ajax, will only work for static html websites where the required data is present in the html. Dynamic websites built with technologies like React or Angular require a different approach.</span></p>
<p><span style="font-weight: 400;">Another nuance that should be taken into account is that currently fastnext does not handle the fail clause of the ajax request and will trigger a step retry, while usually this behaviour is innocuous, sometimes it is needed to handle an ajax fail. In this case a regular ajax function should be used. </span></p>
<p><span style="font-weight: 400;">As an example we wrote a robot that scrapes a small Amazon category containing 149 products and compared the speed of it using next and fastnext. The run using fastnext finished <strong>36% faster</strong> than it’s counterpart, clocking in at ~ 270s, while the run using next clocked in at ~ 420s.</span></p>
<p><strong>Happy Scraping!</strong></p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/web-scraping-performance-tuning-with-fastnext/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>New IDE Extension Release</title>
		<link>https://webrobots.io/new-ide-extension-release/</link>
					<comments>https://webrobots.io/new-ide-extension-release/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Wed, 21 Jun 2017 08:14:42 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[changelog]]></category>
		<category><![CDATA[web robots ide]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5579</guid>

					<description><![CDATA[Today we are releasing an update to our main extension - Web Robots Scraper IDE. This release has a version number 2017.6.20 and has several improvements in UI, proxy settings control, handling hash symbols in URLs. Version 2017.6.20 RELEASE NOTES UI: Robot run statistics is displayed in the same place and no longer "jumping" [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-8 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-7 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Today we are releasing an update to our main extension &#8211; <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">Web Robots Scraper IDE</a>. This release, version 2017.6.20, includes several improvements in the UI, proxy settings control, and handling of hash symbols in URLs.</p>
<h2>Version 2017.6.20 RELEASE NOTES</h2>
<ul>
<li>UI: Robot run statistics are displayed in a fixed place and no longer &#8220;jumping&#8221;</li>
<li>UI: when a robot finishes, its status becomes a direct link to the robot run list on the portal. The run link is a direct link to data preview and download on the portal.</li>
<li>setProxy() functionality has been expanded. See <a href="/werobots-documentation/">documentation</a> for details.</li>
<li>Bugfix: fixed a bug where subsequent steps with URLs that were identical before the # symbol did not load correctly (example: going to http://foobar.com#a and after that to http://foobar.com#b).</li>
<li>Other internal engine improvements and bugfixes.</li>
</ul>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/new-ide-extension-release/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Email And Social Media Links Crawling From Websites</title>
		<link>https://webrobots.io/email-and-social-media-links-crawling-from-websites/</link>
					<comments>https://webrobots.io/email-and-social-media-links-crawling-from-websites/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 02 Mar 2017 10:45:38 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[email crawling]]></category>
		<category><![CDATA[social media crawling]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5543</guid>

					<description><![CDATA[At Web Robots we often get inquiries on projects to crawl social media links and emails from specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that and anyone can run such web scraping [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-9 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-8 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>At Web Robots we often get inquiries on projects to crawl social media links and emails from a specific list of small websites. Such data is sought after by growth hackers and sales people for lead generation purposes. In this blog post we show an example robot which does exactly that, and anyone can run such a web scraping project using the <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">Web Robots Chrome extension</a> on their own computer.</p>
<p><img loading="lazy" decoding="async" class="lazyload alignnone size-full wp-image-5546" src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" data-orig-src="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png" alt="" width="994" height="604" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27994%27%20height%3D%27604%27%20viewBox%3D%270%200%20994%20604%27%3E%3Crect%20width%3D%27994%27%20height%3D%273604%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-200x122.png 200w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-300x182.png 300w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-400x243.png 400w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-600x365.png 600w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-768x467.png 768w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads-800x486.png 800w, https://webrobots.io/wp-content/uploads/2017/03/social-media-leads.png 994w" data-sizes="auto" data-orig-sizes="(max-width: 994px) 100vw, 994px" /></p>
<p>To start you will need an account on the Web Robots <a href="http://portal.webrobots.io">portal</a> and the Chrome <a href="https://chrome.google.com/webstore/detail/web-robots-scraper/pmagfjeddlknbohojnepcplpgjlincak">extension</a> &#8211; that&#8217;s it. We placed a robot called <a href="http://portal.webrobots.io/robots/2236"><strong>leads_crawler</strong></a> in our portal&#8217;s Demo space so anyone can use it. In case the robot&#8217;s code is changed, its complete source code is below. Edit the variable on lines 14-18 to contain the list of target websites to crawl and run the robot. Then preview the data on the Output tab and download it from the portal once the robot is finished. You will get a nice CSV file with data which can be used in your further leads processing workstream.</p>
<p>Robot&#8217;s source code:</p>
<pre class="brush: jscript; highlight: &#091;14&#093;;">
var DEPTH = 2;
var EMAIL_PATTERN = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;
var SOCIAL_MEDIA = [
    'facebook.com',
    'linkedin.com',
    'instagram.com',
    'youtube.com',
    'twitter.com',
    'pinterest.com',
    'plus.google.com',
    'blogspot.com'
];

var websites = [
    &quot;http://dccentre.com/&quot;,
    &quot;http://www.theweddingplanneromaha.com/&quot;,
    &quot;http://www.effortlesseventsidaho.com/&quot;
];


steps.start = function() {
    setSettings({skipVisited:true});
    setRetries(5000, 2, 1000); // 5 sec retry timer to skip bad pages quickly
    websites.forEach(function(v, i) {
        next(v, &quot;crawl&quot;, 0);
    });
    done();
};


steps.crawl = function(depth){
    
    depth++;
    
    var emails = _.uniq(returnEmails());
    var social = returnSocial();
    var urls = returnURLs();
    
    dbg(urls);
    
    if(emails.length || social.length) {
        var data = {
            'email' : emails.join(';'),
        };
        $.extend(data, social);
        emit('Leads', [data]);
    }
    
    if(depth &lt; DEPTH) {
        urls.forEach(function(v) {
            next(v, 'crawl', depth);
        });
    }
    
    done();
};


returnURLs = function() {
    var urls = [];
    $('a:visible').each(function (i,v) {
        var url = $(v).prop('href').split('#').shift();
        if(isValidLink(url)) {
            urls.push(url);
        };
    });
    return(_.uniq(urls));
};


returnSocial = function() {
    var urls = [];
    var social = {};
    
    $('a:visible').each(function (i,v) {
        urls.push($(v).prop('href'));
    });
    
    _.uniq(urls).forEach(function(link) {
        var domain = link.split('://').pop().split('www.').pop().split('/').shift().toLowerCase();
        var pos = _.indexOf( SOCIAL_MEDIA, domain);
        if(pos !== -1) {
            social[SOCIAL_MEDIA[pos].split('.').shift()] = link;
        };
    });
    return(social);
};


returnEmails = function() {
    return $('*').html().match(EMAIL_PATTERN);
};


isValidLink = function(link){
    // here we check for all bad stuff in links
    // type check comes first so the string methods below can be called safely
    if ((link === undefined) || (typeof link !== &quot;string&quot;) || (link.length &lt; 12)) {
        return false;
    }
    
    if(_.indexOf(SOCIAL_MEDIA, link.split('://').pop().split('www.').pop().split('/').shift()) !== -1) {
        return false;
    }
    
    if (
        // positives - must be present
        !(link.includes(document.domain)) ||
        !link.startsWith(&quot;http&quot;) ||
        
        // negatives - must not be present
        link.includes(&quot;.zip&quot;) ||
        link.includes(&quot;.csv&quot;) ||
        link.includes(&quot;.mpg&quot;) ||
        link.includes(&quot;.mpeg&quot;) ||
        link.includes(&quot;.gz&quot;) ||
        link.includes(&quot;.jpg&quot;) ||
        link.includes(&quot;.jpeg&quot;) ||
        link.includes(&quot;.png&quot;) ||
        link.includes(&quot;.pdf&quot;) ||
        link.includes(&quot;.doc&quot;) ||
        link.includes(&quot;.xls&quot;) ||
        link.includes(&quot;.ppt&quot;) ||
        link.includes(&quot;.avi&quot;) ||
        link.includes(&quot;.tif&quot;) ||
        link.includes(&quot;.exe&quot;) ||
        link.includes(&quot;.psd&quot;) ||
        link.includes(&quot;.eps&quot;) ||
        link.includes(&quot;.txt&quot;) ||
        link.includes(&quot;.rtf&quot;) ||
        link.includes(&quot;.wmv&quot;) ||
        link.includes(&quot;.odt&quot;) ||
        link.includes(&quot;.css&quot;) ||
        link.includes(&quot;.js&quot;) ||
        link.includes(&quot;mailto:&quot;) ||
        link.includes(&quot;facebook&quot;) ||
        link.includes(&quot;google&quot;) ||
        link.includes(&quot;twitter&quot;) ||
        link.includes(&quot;youtube&quot;) ||
        link.includes(&quot;linkedin&quot;) ||
        link.includes(&quot;download&quot;) ||
        link.includes(&quot;pinterest&quot;)

        ) {
            return false;
        } else {
            return true;
        }
};
</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/email-and-social-media-links-crawling-from-websites/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
	</channel>
</rss>
