<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PostgreSQL &#8211; Web Scraping Service</title>
	<atom:link href="https://webrobots.io/category/postgresql/feed/" rel="self" type="application/rss+xml" />
	<link>https://webrobots.io</link>
	<description>We do web scraping service better!</description>
	<lastBuildDate>Mon, 04 Mar 2019 14:19:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.8</generator>
	<item>
		<title>PostgreSQL as a Service options comparison and benchmark</title>
		<link>https://webrobots.io/postgresql-as-a-service-comparison/</link>
					<comments>https://webrobots.io/postgresql-as-a-service-comparison/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 27 Jun 2017 11:31:23 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[postgreSQL]]></category>
		<category><![CDATA[PostgreSQL AWS]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5583</guid>

					<description><![CDATA[Background PostgreSQL is great. But administering it can suck up a lot of time, and for small teams using a SaaS service is great value. We use and love Amazon RDS.  Until recently it was the only reasonable choice in the market. But in 2017 new options are on the verge of becoming available.  Both Google [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-1 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-0 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><h1><span style="font-weight: 400;">Background</span></h1>
<p><span style="font-weight: 400;">PostgreSQL is great. But administering it can suck up a lot of time, and for small teams using a SaaS service is great value. We use and love Amazon RDS.  Until recently it was the </span>only reasonable choice in the market. But in 2017 new options are on the verge of becoming available.  Both Google and Azure clouds announced support, and Amazon is launching its Aurora service with PostgreSQL compatibility.</p>
<p><span style="font-weight: 400;">We did a quick comparison of those options.</span></p>
<p><span style="font-weight: 400;"><strong>TLDR:</strong>  Google and Aurora are a bit faster for the same money, but there is no free lunch. Azure tests are not yet done.</span></p>
<h1><span style="font-weight: 400;">The Test</span></h1>
<p><span style="font-weight: 400;">The use case we care about is a &#8220;mid-size&#8221; database with queries that process gigabytes of data in seconds to minutes. More specifically, we test a 45GB table containing JSON documents </span><span style="font-weight: 400;">(PostgreSQL is great for working with JSON).   </span></p>
<p><span style="font-weight: 400;">We won&#8217;t test CRUD/web app backend performance; pgbench is fine for that. In reality, for small and medium apps any of these services will be fine and the cost negligible.</span></p>
<p><span style="font-weight: 400;">We won&#8217;t test &#8220;big data&#8221; scenarios &#8211; Google BigQuery, AWS Redshift and Athena are great for that.</span></p>
<p><span style="font-weight: 400;">To run this test we provisioned High Availability setups with 16GB RAM and the minimum possible number of CPUs.  We also ran the tests on a desktop machine with a modern SSD.</span></p>
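<p>The exact schema isn&#8217;t shown in this post; a minimal sketch of the kind of table being tested, with one JSON document per row (the table name <em>js</em> and column <em>content</em> are taken from the queries at the end of this post; the id column is our assumption), would be:</p>
<pre>-- hypothetical setup matching the queries below
create table js (
  id bigserial primary key,
  content text  -- raw JSON, cast to json at query time
);</pre>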
<h1><span style="font-weight: 400;">Instance sizes and costs</span></h1>
<p>All amounts are in US dollars per month, using on-demand pricing.</p>
<table style="border: 2px solid;" width="75%">
<tbody style="border: 2px solid;">
<tr style="border: 2px solid;">
<th width="20%"></th>
<td><strong>AWS RDS</strong></td>
<td><strong>AWS Aurora</strong></td>
<td><strong>Google</strong></td>
</tr>
<tr style="border: 2px solid;">
<th><strong>Notes</strong></th>
<td></td>
<td>PostgreSQL pricing was unavailable at the time, so we are using MySQL pricing.</td>
<td>High availability for PostgreSQL is not yet available; we assume 2x costs.</td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Cost items</strong></td>
<td>db.r3.large multi AZ (2 cpu, 15GB ram) $376<br />
1TB General Purpose (SSD) Storage $230</td>
<td>db.r3.large at $0.290/hr &#8211; $215<br />
1TB storage at $0.100/GB &#8211; $100</td>
<td>CPU Master 2 vCPU $94<br />
CPU Slave 2 vCPU $94<br />
RAM Master 13GB $103<br />
RAM Slave 13GB $103<br />
SSD Storage Master 1TB $187<br />
SSD Storage Slave 1TB $187</td>
</tr>
<tr style="border: 2px solid;">
<th><strong>Total</strong></th>
<td><strong>$606</strong></td>
<td><strong>$315</strong></td>
<td><strong>$758</strong></td>
</tr>
</tbody>
</table>
<h1><span style="font-weight: 400;">Results</span></h1>
<p><span style="font-weight: 400;">First, the times (mm:ss) to restore and parse a 45GB table containing JSON.</span></p>
<table style="border: 2px solid;" width="75%">
<tbody style="border: 2px solid;">
<tr style="border: 2px solid;">
<td></td>
<td><strong>AWS RDS</strong></td>
<td><strong>AWS Aurora</strong></td>
<td><strong>Google </strong></td>
<td><strong>Desktop</strong></td>
</tr>
<tr>
<td><strong>Restore (seq write)</strong></td>
<td><span style="font-weight: 400;"> 50:28</span></td>
<td><span style="font-weight: 400;"> 41:44</span></td>
<td><span style="font-weight: 400;">21:15</span></td>
<td><span style="font-weight: 400;">24:00</span></td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Scan (seq read)</strong></td>
<td><span style="font-weight: 400;"> 32:30</span></td>
<td><span style="font-weight: 400;">5:02</span></td>
<td><span style="font-weight: 400;">15:58</span></td>
<td><span style="font-weight: 400;">2:57</span></td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Parse JSON (CPU)</strong></td>
<td><span style="font-weight: 400;">1:46:50</span></td>
<td><span style="font-weight: 400;">46:54</span></td>
<td><span style="font-weight: 400;">51:55</span></td>
<td><span style="font-weight: 400;">31:51</span></td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Dividing data size by elapsed time gives processing speed in MB/s:</span></p>
<table style="border: 2px solid;" width="75%">
<tbody style="border: 2px solid;">
<tr style="border: 2px solid;">
<td></td>
<td><strong>AWS RDS</strong></td>
<td><strong>AWS Aurora</strong></td>
<td><strong>Google PG</strong></td>
<td><strong>Desktop</strong></td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Restore (seq write)</strong></td>
<td><span style="font-weight: 400;">14.53</span></td>
<td><span style="font-weight: 400;">17.57</span></td>
<td><span style="font-weight: 400;">34.50</span></td>
<td><span style="font-weight: 400;">30.55</span></td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Scan (seq read)</strong></td>
<td><span style="font-weight: 400;">22.56</span></td>
<td><span style="font-weight: 400;">145.69</span></td>
<td><span style="font-weight: 400;">45.92</span></td>
<td><span style="font-weight: 400;">248.58</span></td>
</tr>
<tr style="border: 2px solid;">
<td><strong>Parse JSON (CPU)</strong></td>
<td><span style="font-weight: 400;">6.86</span></td>
<td><span style="font-weight: 400;">15.64</span></td>
<td><span style="font-weight: 400;">14.12</span></td>
<td><span style="font-weight: 400;">23.02</span></td>
</tr>
</tbody>
</table>
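<p>As a sanity check on these figures (the &#8220;45GB&#8221; is approximate; back-computing from the table above puts the raw size at roughly 44,000 MB):</p>
<pre>-- AWS RDS sequential read: 32:30 = 1,950 s
-- 44,000 MB / 1,950 s &#8776; 22.6 MB/s</pre>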
<p>&nbsp;</p>
<h1><span style="font-weight: 400;">Observations</span></h1>
<p><span style="font-weight: 400;">JSON parsing is CPU bound and runs in a single thread, so it is a tight bottleneck on all platforms. Still, Google appears to provide faster virtual CPUs than RDS.</span></p>
<p><span style="font-weight: 400;">Writing to Aurora is slow and reading is fast, possibly because of its six-way replication.</span></p>
<p><span style="font-weight: 400;">Google claims to provide many more IOPS than Amazon RDS (15,000 vs 3,000), but our test shows only about a 2x speedup.</span></p>
<p>As of June 2017, Google is the best value but has no HA support yet.  Aurora is not ready yet but might mature into the best option.  RDS is still the winner for production work.</p>
<h1><span style="font-weight: 400;">SQL queries used</span></h1>
<p>Sequential scan:</p>
<pre><span style="font-weight: 400;">select </span><span style="font-weight: 400;">count(*) row_count, sum(length(content)) text_length </span>from js</pre>
<p>JSON parse:</p>
<pre><span style="font-weight: 400;">with jp as (</span><span style="font-weight: 400;">select json_array_elements(content::json) x from js</span><span style="font-weight: 400;">) </span>select distinct x-&gt;&gt;'ID' id, x-&gt;&gt;'Name' name from  jp</pre>
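<p>We measured wall-clock time on the client; if you want server-side timings plus I/O statistics, EXPLAIN with the ANALYZE and BUFFERS options provides them (note that it actually executes the query):</p>
<pre>explain (analyze, buffers)
select count(*) row_count, sum(length(content)) text_length from js;</pre>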
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/postgresql-as-a-service-comparison/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Migrating PostgreSQL Databases From AWS RDS To Standalone</title>
		<link>https://webrobots.io/migrating-postgresql-databases-from-aws-rds-service-to-standalone/</link>
					<comments>https://webrobots.io/migrating-postgresql-databases-from-aws-rds-service-to-standalone/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Fri, 14 Oct 2016 12:20:20 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[postgreSQL]]></category>
		<category><![CDATA[PostgreSQL AWS]]></category>
		<guid isPermaLink="false">https://webrobots.io/?p=5389</guid>

					<description><![CDATA[Intro AWS RDS is very convenient and takes care of almost all DBA tasks. It just works as long as you stay inside AWS. But if you want to have a local copy of your database or need to move data to another host it can be tricky. TL;DR - For our solution skip to [...]]]></description>
										<content:encoded><![CDATA[<div class="fusion-fullwidth fullwidth-box fusion-builder-row-2 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-1 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><h3>Intro</h3>
<p>AWS RDS is very convenient and takes care of almost all DBA tasks. It just works as long as you stay inside AWS. But if you want to have a local copy of your database or need to move data to another host it can be tricky.</p>
<p>TL;DR &#8211; for our solution skip to the last section.</p>
<h3>What doesn&#8217;t work</h3>
<h4>AWS daily backups</h4>
<p>AWS RDS by default creates daily backups of your data. The first thought would be to grab such a backup and restore it locally. But these are not regular Postgres backups. They are probably VM image copies, but there is no way to know, as you cannot copy or inspect them.  The only option is to restore them to another RDS instance.</p>
<h3>Postgresql replication</h3>
<p>AWS uses replication to maintain the hot standby, but does not give you access to it.</p>
<h3>Cold backups</h3>
<p>There is no access to the RDS filesystem, so that&#8217;s not an option either.</p>
<h3>Amazon Database Migration Service (DMS)</h3>
<p>AWS DMS sounds too good to be true.  The features that would be killer if they worked:</p>
<ul>
<li>Any to any DB connections (RDS PG to local MySQL for example is possible)</li>
<li>Change data capture (CDC) and continuous replication without modifications to database schema</li>
<li>DDL capture</li>
</ul>
<p>In reality we couldn&#8217;t get it to do even basic data copying.  It corrupts and truncates data, throws errors, and there&#8217;s very little discussion of it on Stack Overflow or the AWS forums.  Not ready for prime time, but worth keeping an eye on.</p>
<h3>Foreign data wrappers (FDW) &#8211; pushing from RDS to remote</h3>
<p>It works and can be used to push incremental changes like so:</p>
<pre>ON RDS: 
insert into remote_bigtable 
select * from bigtable where id&gt;(select max(id) from remote_bigtable)</pre>
<p>But throughput is only a few rows per second, so it can&#8217;t be used for anything substantial.</p>
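<p>For reference, the postgres_fdw plumbing behind queries like the one above looks roughly like this (host, credentials and column list are placeholders; run on whichever side initiates the transfer):</p>
<pre>create extension if not exists postgres_fdw;
create server rds foreign data wrapper postgres_fdw
  options (host 'xxx.rds.amazonaws.com', dbname 'mydb');
create user mapping for current_user server rds
  options (user 'xxx', password 'xxx');
-- column list must match the remote table definition
create foreign table remote_bigtable (id bigint)
  server rds options (table_name 'bigtable');</pre>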
<h3>Foreign data wrappers (FDW) &#8211; pulling data</h3>
<p>It&#8217;s faster than pushing, but more difficult.  If you run a query on the local database like this:</p>
<pre>ON Local:</pre>
<pre>select * from rds_bigtable where id&gt;(select max(id) from bigtable)</pre>
<p>It will fetch the entire bigtable and do the filtering locally (9.6 fixes this).</p>
<p>So for full data pulls this method is still too slow, and for incremental pulls it requires extra work, like so:</p>
<pre>ON local: select max(id) from bigtable
ON RDS: create tmp_bigtable as select * from bigtable where id&gt;maxid_on_local
ON local: insert into local_bigtable select * from rds_tmp_bigtable
ON RDS: drop table tmp_bigtable</pre>
<p>We use this approach for incremental data pulls.   It&#8217;s a pain.  Hoping better alternatives will come along.</p>
<h3>pg_dump running on local server</h3>
<p>It works, but is much slower than what&#8217;s possible.</p>
<h3>pg_dump running on a temporary EC2 instance</h3>
<p>Finally a solution that works and is fast!</p>
<p>Steps to create it</p>
<ol>
<li>Create EC2 instance with enough space to store full db dump in the same datacenter/zone as RDS</li>
<li>Install postgres tools on it: <span style="font-weight: 400;">sudo yum install postgresql95</span></li>
<li>Install aws command line tools (optional, need for file copying)</li>
<li>Open firewall in RDS to allow connections from temporary instance</li>
<li>Run <span style="font-weight: 400;">pg_dump --host=xxx.rds.amazonaws.com --username=xxx --file=x.dmp --format=c</span></li>
<li>Copy file to s3 (aws s3 cp)</li>
<li>Delete instance</li>
<li>On the local host fetch and restore the file:  <span style="font-weight: 400;">pg_restore --host=localhost --username=x --dbname=x x.dmp</span></li>
</ol>
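<p>The steps above can be sketched as a short script (host names, database names and the S3 bucket are placeholders):</p>
<pre># on the temporary EC2 instance, in the same zone as RDS
sudo yum install -y postgresql95
pg_dump --host=xxx.rds.amazonaws.com --username=xxx \
        --format=c --file=x.dmp mydb
aws s3 cp x.dmp s3://my-bucket/x.dmp

# later, on the local host
aws s3 cp s3://my-bucket/x.dmp .
pg_restore --host=localhost --username=x --dbname=mydb x.dmp</pre>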
<p>If your db is large enough, pg_dump dies in the middle of the export.  It&#8217;s an obscure SSL bug: AWS forces an SSL socket renegotiation after a preset amount of data has been transferred (it seems to be about 6GB), and PostgreSQL doesn&#8217;t renegotiate, killing the connection instead.</p>
<p>The solution is to temporarily disable SSL on RDS (with all the security implications that brings, and it requires a db restart): in the RDS Parameter Group, set the flag ssl=0.</p>
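<p>If you prefer the CLI to the console, something along these lines should flip the flag (the parameter group name is a placeholder; the change takes effect at the next reboot):</p>
<pre>aws rds modify-db-parameter-group \
  --db-parameter-group-name my-pg-params \
  --parameters "ParameterName=ssl,ParameterValue=0,ApplyMethod=pending-reboot"</pre>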
<h3>Conclusion</h3>
<p>Copying PostgreSQL RDS data to a local server is harder than it should be. Hopefully my long adventures, misadventures and discoveries will save you some time.</p>
</div><div class="fusion-clearfix"></div></div></div></div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/migrating-postgresql-databases-from-aws-rds-service-to-standalone/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>PostgreSQL 9.4 JSON Queries</title>
		<link>https://webrobots.io/postgresql-json-queries/</link>
					<comments>https://webrobots.io/postgresql-json-queries/#respond</comments>
		
		<dc:creator><![CDATA[nicerobot]]></dc:creator>
		<pubDate>Tue, 17 Feb 2015 09:24:43 +0000</pubDate>
				<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[postgreSQL]]></category>
		<category><![CDATA[PostgreSQL JSON Lateral Recursive]]></category>
		<guid isPermaLink="false">http://webrobots.io/?p=5106</guid>

					<description><![CDATA[Intro Querying JSON with SQL is extremely powerful and convenient.  Some great things about it: Use SQL to query unstructured data Join relational and JSON tables Convert between JSON and relational schema But query writing can be difficult and non-obvious at first.  Official documentation doesn't have many samples. Many useful queries need other great but [...]]]></description>
										<content:encoded><![CDATA[<p><div class="fusion-fullwidth fullwidth-box fusion-builder-row-3 nonhundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-2 fusion-one-full fusion-column-first fusion-column-last 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text"><h1>Intro</h1>
<p>Querying JSON with SQL is extremely powerful and convenient.  Some great things about it:</p>
<ul>
<li>Use SQL to query unstructured data</li>
<li>Join relational and JSON tables</li>
<li>Convert between JSON and relational schema</li>
</ul>
<p>But query writing can be difficult and non-obvious at first.  The official documentation doesn&#8217;t have many samples, and many useful queries need other great but not widely known features of PostgreSQL like LATERAL joins and recursive queries.</p>
<p>This tutorial has some real world examples.</p>
<h1>Get some data</h1>
<p>Let&#8217;s use the GitHub Archive as a source of large JSON documents with complex structure:</p>
<pre>wget http://data.githubarchive.org/2015-01-01-15.json.gz
gzip -d 2015-01-01-15.json.gz</pre>
<h1>Load JSON to PostgreSQL</h1>
<p>Super easy:</p>
<pre>COPY github FROM 'c:\temp\2015-01-01-15.json'
WITH (format csv, quote e'\x01', delimiter e'\x02', escape e'\x01')</pre>
<p>Query returned successfully: 11351 rows affected, 1025 ms execution time.</p>
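<p>The COPY above assumes a single-column jsonb table created beforehand (the table isn&#8217;t shown in this post, so this is our reconstruction from the queries below):</p>
<pre>create table github (js jsonb);</pre>
<p>The trick is the quote/delimiter/escape choice: bytes \x01 and \x02 never occur in the JSON, so COPY loads each line as one unparsed field, which PostgreSQL then casts to jsonb.</p>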
<h1>Do some simple queries</h1>
<p>Still straightforward:</p>
<pre>select js-&gt;&gt;'type', count(*) from github group by 1;</pre>
<pre>select js-&gt;'actor'-&gt;'login' from github where js-&gt;&gt;'type' = 'IssuesEvent';</pre>
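<p>A side note not covered by the queries above: jsonb columns can carry a GIN index (new in 9.4), which speeds up containment lookups considerably. A sketch (the index name is ours):</p>
<pre>create index github_js_gin on github using gin (js);
-- the @&gt; containment operator can use this index:
select count(*) from github where js @&gt; '{"type": "PushEvent"}';</pre>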
<h1>Explore structure with LATERAL joins</h1>
<p>Many PostgreSQL functions return row sets and are invoked like:</p>
<pre>select * from jsonb_each(jsonb)</pre>
<p>If your json is in many rows like in our GitHub sample then we need LATERAL joins:</p>
<p>Get the top-level keys from all documents:</p>
<pre>select
 key,
 max(length(value::text)),
 json_agg(distinct jsonb_typeof(value))
from github a, lateral jsonb_each(a.js) kv
group by key</pre>
<h1>Recursive queries</h1>
<p>JSON is tree like and recursive queries are often necessary.<br />
Here we enumerate all possible paths that exist in the documents:</p>
<pre>with recursive tree(lvl, key, path, jstype) as (
select distinct 
 0 as lvl,
 kv.key as key, 
 array[kv.key] as path, 
 jsonb_typeof(value) as jstype 
from github a, lateral jsonb_each(a.js) kv
union all 
select distinct 
 tree.lvl + 1 as lvl,
 kv.key,
 array_append(tree.path, kv.key) as path, 
 jsonb_typeof(value) as jstype 
from tree, github a, lateral jsonb_each(a.js #&gt; tree.path) kv
where jsonb_typeof(a.js #&gt; tree.path) = 'object'
)
select path, jstype from tree
order by path;</pre>
</div><div class="fusion-clearfix"></div></div></div></div></div><div class="fusion-fullwidth fullwidth-box fusion-builder-row-4 hundred-percent-fullwidth non-hundred-percent-height-scrolling"  style='background-color: #ffffff;background-position: center center;background-repeat: no-repeat;padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;border-top-width:0px;border-bottom-width:0px;border-color:#eae9e9;border-top-style:solid;border-bottom-style:solid;'><div class="fusion-builder-row fusion-row "><div  class="fusion-layout-column fusion_builder_column fusion_builder_column_1_1 fusion-builder-column-3 fusion-one-full fusion-column-first fusion-column-last fusion-column-no-min-height 1_1"  style='margin-top:0px;margin-bottom:0px;'><div class="fusion-column-wrapper" style="padding: 0px 0px 0px 0px;background-position:left top;background-repeat:no-repeat;-webkit-background-size:cover;-moz-background-size:cover;-o-background-size:cover;background-size:cover;"   data-bg-url=""><div class="fusion-text">
<h1>Further reading and references</h1>
<p>Others did tests showing that PostgresSQL is about 3x faster and files occupy about 3x less space than MongoDB.</p>
<p><a href="https://vibhorkumar.wordpress.com/2014/05/15/write-operation-mongodb-vs-postgresql-9-3-json/">https://vibhorkumar.wordpress.com/2014/05/15/write-operation-mongodb-vs-postgresql-9-3-json/<br />
</a><a href="http://blogs.enterprisedb.com/2014/09/24/postgres-outperforms-mongodb-and-ushers-in-new-developer-reality/">http://blogs.enterprisedb.com/2014/09/24/postgres-outperforms-mongodb-and-ushers-in-new-developer-reality/</a></p>
<p>Official operator and function reference:<br />
<a href="http://www.postgresql.org/docs/9.4/static/functions-json.html">http://www.postgresql.org/docs/9.4/static/functions-json.html</a><br />
<a href="http://www.postgresql.org/docs/9.4/static/functions-aggregate.html">http://www.postgresql.org/docs/9.4/static/functions-aggregate.html</a></p>
</div><div class="fusion-clearfix"></div></div></div></div></div></p>
]]></content:encoded>
					
					<wfw:commentRss>https://webrobots.io/postgresql-json-queries/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
