Web Scraping

Analysis of different web scraping approaches, best practices and challenges. Overview of Web Robots scraper features.

13 Oct, 2014

How to Setup a Free Proxy Server on Amazon EC2

October 13th, 2014 | Web Scraping | 1 Comment

Open an Amazon Web Services (AWS) Account

Go to the AWS Portal and sign up. A credit card is required, but Amazon will not charge anything. Amazon will also ask for your phone number and verify it. Amazon EC2 offers free Micro Instances, which are good enough for a proxy server setup. They remain free for the first year of AWS usage.

Creating an EC2 Instance

Once you have an account, log in to the AWS Management Console and click Launch Instance on the EC2 Dashboard. Follow the steps and launch an instance of Ubuntu Server, which is marked as Free tier eligible. Make sure you download the SSH key file (.pem), as it will be needed to connect to the server.

Free tier eligible Ubuntu Server
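As a rough sketch of how the downloaded .pem file is used afterwards, the Python snippet below restricts the key's permissions and assembles an ssh command line. The key file name and host name are hypothetical placeholders, and the default login user "ubuntu" is an assumption that holds for the official Ubuntu images; adjust all three for your own instance.

```python
import os
import shlex
import stat


def restrict_key_permissions(key_path):
    """Make the key readable by its owner only (same as chmod 400).

    ssh refuses to use a private key that other users can read.
    """
    os.chmod(key_path, stat.S_IRUSR)


def build_ssh_command(key_path, host, user="ubuntu"):
    """Return the argument list for connecting to the instance with the key."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


# Hypothetical values: use your own key file and the public DNS name
# shown for your instance in the EC2 console.
command = build_ssh_command(
    "my-proxy-key.pem",
    "ec2-203-0-113-10.compute-1.amazonaws.com",
)
print(shlex.join(command))
```

Call restrict_key_permissions on the key file once after downloading it, then run the printed command in a terminal (or pass the list to subprocess.run).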

13 Oct, 2014

Smart CSS Selectors

October 13th, 2014 | Web Scraping | 0 Comments

Our web scraping business requires that we develop scraper robots quickly and efficiently. We can offer competitive pricing only if we create robots for each source as efficiently as possible. The old saying “time is money” means a lot here, and we always look for ways to do things better and faster.

In the scraper development process, everyone uses either XPath expressions or CSS selectors to parse the DOM for data to extract or links to crawl. One can inspect DOM elements (for example, in Google Chrome) for classes, IDs, or other attributes, and then solve a small or big puzzle to write a selector. This requires knowing the powerful CSS selector syntax, doing detective work inside the DOM, and some trial and error.
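To make the idea concrete, here is a minimal sketch of selector-based extraction using only Python's standard library. The markup, class names, and the extract_products helper are invented for illustration. Note that xml.etree.ElementTree supports only a limited subset of XPath and requires well-formed markup, so real-world HTML usually needs a more lenient parser and a full selector engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <div id="products">
      <div class="product">
        <a class="title" href="/item/1">Blue Widget</a>
        <span class="price">9.99</span>
      </div>
      <div class="product">
        <a class="title" href="/item/2">Red Widget</a>
        <span class="price">14.50</span>
      </div>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return one dict per product block found in the markup."""
    root = ET.fromstring(markup.strip())
    products = []
    # Roughly what the CSS selector "div.product" would match, except that
    # this XPath predicate compares the whole class attribute exactly.
    for node in root.findall(".//div[@class='product']"):
        link = node.find("a[@class='title']")
        price = node.find("span[@class='price']")
        products.append({
            "title": link.text,
            "url": link.get("href"),
            "price": float(price.text),
        })
    return products


products = extract_products(PAGE)
for product in products:
    print(product)
```

The outer expression finds the repeating container and the inner ones pull fields relative to it, which is the usual shape of the "puzzle": first locate the element that repeats once per record, then write short selectors inside it.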

There is a nifty tool […]