<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data validation &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/tag/data-validation/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Mon, 04 Mar 2019 14:24:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>How We Validate Data</title>
		<link>https://webrobots.io/how-we-validate-data/</link>
					<comments>https://webrobots.io/how-we-validate-data/#comments</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Thu, 18 Dec 2014 12:29:00 +0000</pubDate>
				<category><![CDATA[Web Scraping]]></category>
		<category><![CDATA[data validation]]></category>
		<guid isPermaLink="false">http://webrobots.io/?p=5004</guid>

					<description><![CDATA[Data is only valuable if it can be trusted. At weRobots we spend as much effort on validating data as on collecting it. It is a multi-stage process. Scraping: initial checks happen in the scraper robots. A robot crawls the target website and looks for data. Captured data is sent to our staging database. Many abnormal [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><p>Data is only valuable if it can be trusted. At weRobots we spend as much effort on validating data as on collecting it. It is a multi-stage process.</p>
<p><a href="http://webrobots.io/wp-content/uploads/2014/12/Robot-worklow.png"><img fetchpriority="high" decoding="async" class="lazyload aligncenter wp-image-5011 size-full" src="http://webrobots.io/wp-content/uploads/2014/12/Robot-worklow.png" data-orig-src="http://webrobots.io/wp-content/uploads/2014/12/Robot-worklow.png" alt="weRobots data validation workflow" width="815" height="473" srcset="data:image/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27815%27%20height%3D%27473%27%20viewBox%3D%270%200%20815%20473%27%3E%3Crect%20width%3D%27815%27%20height%3D%273473%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E" data-srcset="https://webrobots.io/wp-content/uploads/2014/12/Robot-worklow-300x174.png 300w, https://webrobots.io/wp-content/uploads/2014/12/Robot-worklow.png 815w" data-sizes="auto" data-orig-sizes="(max-width: 815px) 100vw, 815px" /></a></p>
<ul>
<li>Scraping</li>
</ul>
<p>Initial checks happen in the scraper robots. A robot crawls the target website and looks for data, and the captured data is sent to our staging database. Several abnormal situations can arise at this stage:</p>
<ol>
<li style="list-style-type: none;">
<ol>
<li style="list-style-type: none;">
<ul>
<li>The site may be down. The robot logs warnings and retries pages that do not respond. Usually the outage is temporary and the robot resumes without intervention.</li>
<li>The site layout may change. If the robot cannot find navigation links or data, it stops and reports an error so that our team can review the situation and take the appropriate action.</li>
</ul>
</li>
</ol>
</li>
</ol>
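<p>The retry behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the robots' actual code: the function name <code>fetch_with_retry</code>, the <code>fetch</code> callable, and the default attempt count and delay are all assumptions made for the example.</p>

```python
import logging
import time

logger = logging.getLogger("robot")


def fetch_with_retry(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url), logging a warning and retrying when the page does not respond.

    `fetch` is a hypothetical callable that returns page content or raises
    OSError (which covers connection errors and timeouts) on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            logger.warning(
                "attempt %d/%d failed for %s: %s", attempt, attempts, url, error
            )
            if attempt == attempts:
                # The outage was not temporary: give up and surface the error.
                raise
            time.sleep(delay_seconds)
```

<p>Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library and makes the retry logic easy to test with a stub.</p>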
<p>For diagnostics and traceability, the robot logs every action it performs and can take screenshots and content snapshots of the source website. This matters when we need to know exactly what the source website displayed at the time of the scrape.</p>
<ul>
<li>Schema Validation</li>
</ul>
<p>Scraped data is validated against a predefined schema. For example, data about a product scraped from an e-shop may have to pass the following validations:</p>
<ol>
<li style="list-style-type: none;">
<ol>
<li style="list-style-type: none;">
<ul>
<li>Product ID is numeric</li>
<li>Price is a decimal number with two decimal places. Price must be greater than 0.</li>
<li>Product has a description that is at most 200 characters long</li>
<li>Optional field &#8220;Availability date&#8221; is a valid date when present</li>
</ul>
</li>
</ol>
</li>
</ol>
<p>Records that fail schema validation are stored in a &#8220;Bad records&#8221; table with an explanation of why validation failed.</p>
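<p>As a rough sketch, the example rules above could be expressed in Python as follows. The field names (<code>product_id</code>, <code>price</code>, <code>description</code>, <code>availability_date</code>) and both function names are hypothetical, and the sketch assumes that values arrive as strings and that dates use ISO format (YYYY-MM-DD).</p>

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def validate_product(record):
    """Return the reasons a record fails the schema (empty list if it passes)."""
    errors = []

    # Product ID must be numeric.
    product_id = str(record.get("product_id", ""))
    if not (product_id.isascii() and product_id.isdigit()):
        errors.append("product_id is not numeric")

    # Price must be a decimal with two decimal places and greater than 0.
    try:
        price = Decimal(str(record.get("price")))
        if price.as_tuple().exponent != -2:
            errors.append("price does not have two decimal places")
        elif price <= 0:
            errors.append("price is not greater than 0")
    except InvalidOperation:
        errors.append("price is not a decimal number")

    # Description is required and limited to 200 characters.
    description = record.get("description")
    if not isinstance(description, str) or not description:
        errors.append("description is missing")
    elif len(description) > 200:
        errors.append("description is longer than 200 characters")

    # Availability date is optional, but must parse as a date when present.
    availability = record.get("availability_date")
    if availability is not None:
        try:
            date.fromisoformat(availability)
        except (TypeError, ValueError):
            errors.append("availability_date is not a date")

    return errors


def split_records(records):
    """Separate passing records from bad ones, keeping the failure reasons."""
    good, bad = [], []
    for record in records:
        errors = validate_product(record)
        if errors:
            bad.append({"record": record, "reasons": errors})
        else:
            good.append(record)
    return good, bad
```

<p>Collecting every failure reason, rather than stopping at the first one, mirrors the idea of storing an explanation alongside each bad record.</p>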
<ul>
<li>Mapping</li>
</ul>
<p>The mapping step maps source identifiers to customer identifiers. For example, an e-shop product ID might be matched to the customer’s internal product ID. Unmapped records are reported. In some workflows we automatically create new records in the customer’s system (for example, when we find new products on the source website, we create new product IDs in the customer’s database). This step is a great advantage when scraped data is integrated with data that the customer already has.</p>
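<p>In its simplest form, a mapping step like this is a dictionary lookup. The sketch below is illustrative only: <code>map_records</code>, <code>id_map</code> and the <code>customer_product_id</code> field are hypothetical names, and a real mapping table would normally live in a database rather than in memory.</p>

```python
def map_records(records, id_map):
    """Attach customer IDs to records and collect the ones that have no mapping.

    `id_map` is a hypothetical dict from source product ID to customer product ID.
    """
    mapped, unmapped = [], []
    for record in records:
        customer_id = id_map.get(record["product_id"])
        if customer_id is None:
            # No known customer identifier: keep the record aside for reporting.
            unmapped.append(record)
        else:
            mapped.append({**record, "customer_product_id": customer_id})
    return mapped, unmapped
```

<p>The unmapped list is what would feed a report, or an automatic record-creation workflow where one exists.</p>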
<ul>
<li>Statistical Validation</li>
</ul>
<p>The statistical validation step checks whether newly scraped data is similar to previous good scrapes. Within allowable tolerances, we check that:</p>
<ul>
<li>The robot collected a similar number of records</li>
<li>Numeric fields have similar averages</li>
<li>Text fields have similar lengths</li>
<li>The robot run took a similar amount of time</li>
<li>Additional custom calculation checks, specific to the dataset, pass</li>
</ul>
<p>We flag suspicious records and suspicious scraping runs for our staff or the customer to review.</p>
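<p>One simple way to express run-level checks of this kind is to compare each metric of the new run against a baseline from a previous good scrape. The Python sketch below assumes a relative tolerance of 20% and hypothetical field and function names; the actual tolerances and metrics depend on the dataset. It also assumes the run collected at least one record.</p>

```python
from statistics import mean


def within_tolerance(current, baseline, tolerance):
    """True when current deviates from baseline by at most the given fraction."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= tolerance


def summarize_run(records, duration_seconds):
    """Collect the run-level statistics that are compared between scrapes."""
    return {
        "record_count": len(records),
        "avg_price": mean(r["price"] for r in records),
        "avg_description_length": mean(len(r["description"]) for r in records),
        "duration_seconds": duration_seconds,
    }


def suspicious_metrics(current, baseline, tolerance=0.2):
    """Return the names of metrics that drifted beyond the tolerance."""
    return [
        name
        for name, value in current.items()
        if not within_tolerance(value, baseline[name], tolerance)
    ]
```

<p>A run for which <code>suspicious_metrics</code> returns a non-empty list would be a candidate for manual review rather than an automatic rejection.</p>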
<ul>
<li>Export</li>
</ul>
<p>The export step moves data to the customer’s database. We support export to all relational databases, as well as dumps to CSV, JSON and XML. Export can run as a batch or as real-time replication.</p>
<ul>
<li>Reports</li>
</ul>
<p>A final summary report is produced after all steps have finished and lets us and the customer review all information in one place.</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/how-we-validate-data/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
	</channel>
</rss>
