A company offering directions and real-time information to commuters across all modes of public transportation. The service is delivered through mobile and web applications. The company has global reach, operating in multiple markets around the world and entering new markets on a daily basis.
webRobots was tasked with creating scraper robots to gather data from official public transportation sources in the new markets into which the client was expanding. The data includes bus, train, metro, and ferry timetables, route information, and station and bus stop geolocations. The data sources were in various languages and used a wide array of web technologies, ranging from collections of HTML files exported from MS Word to modern websites with fully dynamic content. In addition, the collected data had to be cleaned and transformed into a standardised data schema that fits directly into the client's system.
webRobots evaluated the list of data sources provided by the client. Only one source was deemed impractical to scrape, because the necessary data was presented within JPEG images. webRobots created the scraper robots in a development environment and received sign-off from the customer that the extracted data and the data delivery schema were valid. In this case, data cleaning and validation were implemented as a two-stage process: 1) the scraper robots performed basic data transformations in real time while scraping, saving the data into separate tables according to the data schema; 2) more complicated transformations that required manipulating groups of data records were performed in the database while loading data between the webRobots and customer databases.
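As an illustration of the first stage, a scraper robot can normalise each raw timetable record into the agreed schema as it goes. The sketch below is hypothetical: the field names, input time formats, and target schema are assumptions for illustration, not the client's actual schema.

```python
import re

def normalise_record(raw):
    """Stage 1 (hypothetical): basic per-record transformations done
    in real time while scraping, before the record is saved to its table."""
    # Collapse runs of whitespace in station names and use title case.
    station = re.sub(r"\s+", " ", raw["station"]).strip().title()
    # Accept departure times like "7.05", "07:05", or "7:05 pm"
    # and emit a uniform 24-hour "HH:MM" string.
    m = re.match(r"\s*(\d{1,2})[.:](\d{2})\s*(am|pm)?\s*$",
                 raw["departure"], re.IGNORECASE)
    if not m:
        raise ValueError(f"unparseable departure time: {raw['departure']!r}")
    hour, minute, meridiem = int(m.group(1)), int(m.group(2)), m.group(3)
    if meridiem and meridiem.lower() == "pm" and hour < 12:
        hour += 12
    if meridiem and meridiem.lower() == "am" and hour == 12:
        hour = 0
    return {
        "route": raw["route"].strip(),
        "station": station,
        "departure": f"{hour:02d}:{minute:02d}",
    }

record = normalise_record(
    {"route": " 24A ", "station": "central  station", "departure": "7.05 pm"}
)
print(record)  # {'route': '24A', 'station': 'Central Station', 'departure': '19:05'}
```

Group-level work such as deduplicating stops across sources or joining routes to stations belongs to the second stage, since it needs many records at once and is more naturally done in the database.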
A production environment was set up in the Microsoft Azure cloud on a single large instance. A dedicated environment was chosen because the client required a nightly data refresh cycle. The client received the ability to launch scraper robot runs through a RESTful API. Data loading (and the associated post-processing transformations) was configured to run automatically after each successful scrape.
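Launching a run through a RESTful API typically amounts to an authenticated POST. The following sketch only constructs such a request; the base URL, endpoint path, token header, and payload shape are all illustrative assumptions, not the actual webRobots API.

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # placeholder; the real base URL is assumed
API_TOKEN = "SECRET"                  # placeholder credential

def build_run_request(robot_id):
    """Construct (but do not send) a POST that would launch one scraper run."""
    payload = json.dumps({"robot": robot_id}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/robots/{robot_id}/runs",  # hypothetical endpoint
        data=payload,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_run_request("metro-timetables")
print(req.get_method(), req.full_url)
# POST https://api.example.com/v1/robots/metro-timetables/runs
```

In production the nightly refresh would call such an endpoint on a schedule, with the post-scrape data load triggered automatically once the run reports success.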
Once the customer started using the platform and found that webRobots' scraping technology could handle just about any web source, they retired their legacy data scraping tools and migrated fully to the webRobots platform.