<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>sitemap &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/tag/sitemap/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Wed, 20 Mar 2019 11:36:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>Using Sitemaps in Web Scraping Robots</title>
		<link>https://webrobots.io/using-sitemaps-in-web-scraping-robots/</link>
					<comments>https://webrobots.io/using-sitemaps-in-web-scraping-robots/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Mon, 25 Mar 2019 09:41:15 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[sitemap]]></category>
		<category><![CDATA[web scraping]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5876</guid>

					<description><![CDATA[We often spider through categories and follow pagination or infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:20px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p><span style="font-weight: 400;">We often spider through categories and follow pagination or infinite scroll when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: just using </span><a href="https://en.wikipedia.org/wiki/Sitemaps"><b>sitemaps</b></a><span style="font-weight: 400;">. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination and imitation of dynamic content loading.</span></p>
<p><span style="font-weight: 400;">After all, sitemaps are designed for robots to find all resources on a particular domain.</span></p>
<p><b>Example of a sitemap:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-1 hover-type-none"><img fetchpriority="high" decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1024x392.png" width="1024" height="392" alt="" title="sitemap-example" class="lazyload img-responsive wp-image-5877" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271384%27%20height%3D%27530%27%20viewBox%3D%270%200%201384%20530%27%3E%3Crect%20width%3D%271384%27%20height%3D%273530%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-200x77.png 200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-400x153.png 400w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-600x230.png 600w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-800x306.png 800w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54-1200x460.png 1200w, https://webrobots.io/wp-content/uploads/2019/03/Screenshot-2019-02-22-at-15.37.54.png 1384w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Finding Sitemaps</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">The fastest way to find a sitemap URL is to check the </span><i><span style="font-weight: 400;">robots.txt</span></i><span style="font-weight: 400;"> file, for example </span><a href="https://www.rottentomatoes.com/robots.txt"><span style="font-weight: 400;">https://www.rottentomatoes.com/robots.txt</span></a></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">We can also probe typical sitemap URLs like </span><i><span style="font-weight: 400;">domain.com/sitemap</span></i><span style="font-weight: 400;"> or </span><i><span style="font-weight: 400;">domain.com/sitemap.xml</span></i></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sometimes just going to the homepage and searching for the keyword “sitemap” works</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">If none of the above works, a Google search can help (example: “target.com sitemap”).</span></li>
</ul>
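<p><span style="font-weight: 400;">As a rough sketch of the first approach, the snippet below pulls sitemap URLs out of the text of a robots.txt file by looking for its "Sitemap:" lines. The function name </span><i><span style="font-weight: 400;">extractSitemapUrls</span></i><span style="font-weight: 400;"> and the sample text are our own assumptions for illustration; they are not part of any library and do not reproduce a real site's robots.txt.</span></p>
<pre class="brush: jscript;">

// Hypothetical helper (an assumption for illustration): collects the URLs
// declared in "Sitemap:" lines of a robots.txt body.
function extractSitemapUrls(robotsTxt) {
    var urls = [];
    robotsTxt.split(/\r?\n/).forEach(function (line) {
        // match the directive name in any letter case, then capture the URL
        var match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
        if (match) urls.push(match[1]);
    });
    return urls;
}

// Illustrative input only, not the contents of any real robots.txt
var sample = [
    'User-agent: *',
    'Disallow: /search',
    'Sitemap: https://example.com/sitemap_0.xml',
    'sitemap: https://example.com/sitemap_1.xml'
].join('\n');

console.log(extractSitemapUrls(sample));

</pre>
<p><span style="font-weight: 400;">In a robot, the robots.txt text would presumably be fetched first and then passed to a helper like this one, so that every sitemap the site declares is picked up rather than only the first.</span></p>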
<p><b>Example of domain/robots.txt:</b></p>
</div><span style="width:100%;max-width:600px;" class="fusion-imageframe imageframe-none imageframe-2 hover-type-none"><img decoding="async" src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png" data-orig-src="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-1024x212.png" width="1024" height="212" alt="" title="sitemap-example-2" class="lazyload img-responsive wp-image-5885" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%271024%27%20height%3D%27212%27%20viewBox%3D%270%200%201024%20212%27%3E%3Crect%20width%3D%271024%27%20height%3D%273212%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-200x41.png 200w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-400x83.png 400w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-600x124.png 600w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1-800x166.png 800w, https://webrobots.io/wp-content/uploads/2019/03/sitemap-example-2-1.png 1024w" data-sizes="auto" data-orig-sizes="(max-width: 800px) 100vw, 1024px" /></span><div class="fusion-text"><h1><span style="font-weight: 400;">Working With Large Sitemaps</span></h1>
<p><span style="font-weight: 400;">Sitemaps often contain many thousands of records, and opening one directly can freeze the Chrome browser for several minutes while it renders the XML. Our best practice is to fetch the sitemap with a $.get request and process the response in code.</span></p>
<p><b>Example of getting a sitemap using an AJAX request and filtering URLs:</b></p>
<pre class="brush: jscript;">

$.get('https://www.rottentomatoes.com/sitemap_0.xml').then( function(response){
    // each loc element nested in a url element holds one page URL
    $('url loc',response).each( function(i, v){
        var url = $(v).text();

        // filtering: we only need URLs that have no further path after the film name,
        // so we skip URLs whose path is longer than a film page's path
        // (splitting a film page URL on '/' yields fewer than 6 parts)

        if(url.split('/').length &lt; 6) next(url,'getFilmInfo');

    });
    done();
});

</pre>
<h1><span style="font-weight: 400;">Downsides of Sitemap Approach</span></h1>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap can be outdated (old URLs leading to 404 pages), and the site owner might not even notice that their sitemaps are incorrect. It is necessary to do spot checks to see whether the URLs work.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">A sitemap might not list all the items shown in the normal website interface. Best practice is to spot check that items found on the website are present in the sitemap as well.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do not allow filtering items by criteria. For example, if we need only electronics from a large e-shop, we still have to crawl all products and do the filtering in the back-end.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Sitemaps do not show how popular an item is. For example, we cannot infer whether a particular item is on the first page in its category or somewhere near the end.</span></li>
</ul>
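<p><span style="font-weight: 400;">The two spot checks mentioned in the list above could be prepared with helpers along these lines. This is only a sketch: the function names </span><i><span style="font-weight: 400;">pickSpotCheckUrls</span></i><span style="font-weight: 400;"> and </span><i><span style="font-weight: 400;">missingFromSitemap</span></i><span style="font-weight: 400;"> and the sample URLs are assumptions made for illustration, and actually requesting the picked URLs is left out.</span></p>
<pre class="brush: jscript;">

// Hypothetical helper: returns up to `count` URLs spread evenly across the
// list, so the check is not limited to the start of the sitemap.
function pickSpotCheckUrls(urls, count) {
    if (count >= urls.length) return urls.slice();
    var step = urls.length / count;
    var picked = [];
    for (var i = 0; count > i; i++) picked.push(urls[Math.floor(i * step)]);
    return picked;
}

// Hypothetical helper: returns the URLs seen while browsing the site
// that do not appear in the sitemap.
function missingFromSitemap(seenOnSite, sitemapUrls) {
    var known = new Set(sitemapUrls);
    return seenOnSite.filter(function (url) { return !known.has(url); });
}

// Illustrative data only
var sitemapUrls = [
    'https://example.com/m/a', 'https://example.com/m/b',
    'https://example.com/m/c', 'https://example.com/m/d',
    'https://example.com/m/e', 'https://example.com/m/f'
];
var seenOnSite = ['https://example.com/m/a', 'https://example.com/m/z'];

console.log(pickSpotCheckUrls(sitemapUrls, 3));
console.log(missingFromSitemap(seenOnSite, sitemapUrls));

</pre>
<p><span style="font-weight: 400;">The picked URLs can then be opened to confirm they do not lead to 404 pages, and a non-empty result from the second helper suggests the sitemap is incomplete.</span></p>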
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/using-sitemaps-in-web-scraping-robots/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
	</channel>
</rss>
