Using Sitemaps in Web Scraping Robots
We often rely on spidering through categories and on pagination or infinite-scroll handling when we need to discover and crawl all items of interest on a website. However, there is a simpler and more straightforward approach: using sitemaps. Sitemap-based robots are easier to maintain than a mix of category drilling, pagination, and imitation of dynamic content loading.
After all, sitemaps are designed for robots to find all resources on a particular domain.
Example of a sitemap:
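A minimal URL sitemap following the sitemaps.org protocol looks like this (the domain and paths below are illustrative, not taken from a real site):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/movies/some-title</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/movies/another-title</loc>
  </url>
</urlset>
```

Each `<loc>` entry is a crawlable resource URL; optional fields such as `<lastmod>` can help a robot skip pages that have not changed. Large sites often publish a sitemap index file that points to several such sitemaps.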
Finding Sitemaps
- The fastest way to find a sitemap URL is to check the site's robots.txt file, which may declare sitemaps via `Sitemap:` directives. For example: https://www.rottentomatoes.com/robots.txt
- We can also […]
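Extracting the declared sitemap URLs from a robots.txt body can be sketched as follows (the sample robots.txt content is hypothetical):

```python
def extract_sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL declared on a `Sitemap:` line in a robots.txt body.

    The directive name is matched case-insensitively, since sites use
    both `Sitemap:` and `sitemap:` in the wild.
    """
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so URLs (which contain ':') stay intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """User-agent: *
Disallow: /search
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml
"""
print(extract_sitemap_urls(sample))
# ['https://example.com/sitemap.xml', 'https://example.com/news-sitemap.xml']
```

For fetching and parsing in one step, Python's standard-library `urllib.robotparser.RobotFileParser` also exposes a `site_maps()` method (Python 3.8+) that returns the declared sitemap URLs.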