Introducing the robotstxt table

The robots exclusion protocol is the de facto standard for communicating rules and boundaries between websites and their non-human visitors. Also known as robots.txt, the robots exclusion standard lets webmasters specify which parts of their websites are accessible (or inaccessible) to which crawlers.
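
For example, a simple robots.txt file might look like this (a minimal illustration, not taken from any particular site):

    # Allow Googlebot everywhere except /private/
    User-agent: Googlebot
    Disallow: /private/

    # Block all other crawlers from the entire site
    User-agent: *
    Disallow: /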

As part of our caching process, we download and process the robots.txt file of every website we visit to make sure we are allowed to access the content we are about to cache. Since we routinely visit millions of websites every day, this results in a large number of robots.txt lookups; the all-new robotstxt table lets you run SQL queries against these millions of robots.txt files from all around the web.
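
For instance, a query along these lines could surface sites that block all crawlers outright (a sketch only; the url and content column names are assumptions for illustration, not the confirmed schema of the robotstxt table):

    -- Hypothetical example: column names (url, content) are assumed,
    -- not taken from the actual robotstxt table schema.
    -- Find robots.txt files that disallow all paths for all user agents.
    SELECT url
    FROM robotstxt
    WHERE content LIKE '%User-agent: *%'
      AND content LIKE '%Disallow: /%'
    LIMIT 100;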

Give it a try

As always, we would love to hear from you! Give the new robotstxt table a try and contact us at hi@mixnode.com if you have any questions or comments.

Turn the web into a database!

Mixnode is a fast, flexible, and massively scalable platform for extracting and analyzing data from the web.