Turn the web into a database: An alternative to web crawling/scraping

What is Mixnode?

Mixnode turns the web into a giant database!

In other words, Mixnode allows you to think of all the web pages, images, videos, PDF files, and other resources on the web as rows in a database table; a giant database table with trillions of rows that you can query using the standard Structured Query Language (SQL). So, rather than running web crawlers/scrapers you can write simple queries in a familiar language to retrieve all sorts of interesting information from this table of live data.

| url | content_type | content_language | content | headers | url_protocol | url_host | url_domain | url_etld | url_abs_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| https://news.ycombinator.com/ | text/html; charset=utf-8 | en | <html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=de... | HTTP/1.1 200 OK Server: nginx Date: Mon, 24 Sep 2018 19:36:30 GMT Content-Type: text/html; charse... | https | news.ycombinator.com | ycombinator.com | com | / |
| https://fr.wikipedia.org/wiki/Base_de_donn%C3%A9es | text/html; charset=UTF-8 | fr | <!DOCTYPE html> <html class="client-nojs" lang="fr" dir="ltr"> <head> <meta charset="UTF-8"/> <title... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:39:49 GMT Content-Type: text/html; charset=UTF-8 Connec... | https | fr.wikipedia.org | wikipedia.org | org | /wiki/Base_de_donn%C3%A9es |
| https://www.reddit.com/sitemaps/subreddit-sitemaps.xml | text/xml | NULL | <?xml version='1.0' encoding='UTF-8'?> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/... | HTTP/1.1 200 OK Last-Modified: Mon, 24 Sep 2018 06:13:14 GMT ETag: "aeae350d08f76f005e2fe8098a4713... | https | www.reddit.com | reddit.com | com | /sitemaps/subreddit-sitemaps.xml |
| http://www.diarioelpuerto.com.mx/ | text/html | es | <!DOCTYPE HTML> <html> <head> <meta name="google-site-verification" content="SzDRrSxL_mhLV_bCAnR_s8e... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:13:26 GMT Server: Apache X-Powered-By: PHP/5.2.17 Keep... | http | www.diarioelpuerto.com.mx | diarioelpuerto.com.mx | com.mx | / |
| http://www.wfnmc.org/mc20101.pdf | application/pdf | en | %PDF-1.6 206 0 obj <</Linearized 1/L 213940/O 208/E 89344/N 12/T 209772/H [ 1196 788]... | HTTP/1.1 200 OK ETag: "343b4-53e2b129-5cf784d6aa98c961" Last-Modified: Wed, 06 Aug 2014 22:50:17 G... | http | www.wfnmc.org | wfnmc.org | org | /mc20101.pdf |
| https://code.jquery.com/jquery-1.11.3.js | application/javascript; charset=utf-8 | NULL | /*! * jQuery JavaScript Library v1.11.3 * http://jquery.com/ * * Includes Sizzle.js * http://si... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:55:14 GMT Connection: Keep-Alive Accept-Ranges: bytes ... | https | code.jquery.com | jquery.com | com | /jquery-1.11.3.js |
| ... | | | | | | | | | |

Mixnode turns the web into a giant database table with multiple columns.

Just like a regular database table, you get several columns (a.k.a. fields) that represent different attributes of web resources, such as URL, content, content type, content language, and domain name. Mixnode also comes with hundreds of built-in functions that you can use directly in your queries to analyze the data further, from parsing HTML/XML and JSON to handling date/time and processing text.
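
To make the URL-related columns more concrete, here is a rough Python sketch of how values like url_protocol, url_host, url_domain, url_etld and url_abs_path relate to a URL. It is only an illustration and not how Mixnode computes them: the url_columns helper and the tiny KNOWN_SUFFIXES set are assumptions made for this example, and a real implementation would need the full Public Suffix List to find effective TLDs such as com.mx.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption for
# this sketch only; the real list has thousands of entries).
KNOWN_SUFFIXES = {"com", "org", "net", "com.mx"}


def url_columns(url: str) -> dict:
    """Derive url_* style values from a URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")

    etld = ""
    domain = ""
    # Try the longest candidate suffix first: drop labels from the left
    # until what remains is a known suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            etld = candidate
            # The registrable domain is the suffix plus one label to its left.
            domain = ".".join(labels[max(i - 1, 0):])
            break

    return {
        "url_protocol": parts.scheme,
        "url_host": host,
        "url_domain": domain,
        "url_etld": etld,
        "url_abs_path": parts.path or "/",
    }


if __name__ == "__main__":
    for key, value in url_columns("http://www.diarioelpuerto.com.mx/").items():
        print(f"{key}: {value}")
```

The longest-suffix-first loop is what lets com.mx win over a shorter suffix, which matches the diarioelpuerto.com.mx row in the table above.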

As a simple example, using Mixnode, getting the URL and title of every web page boils down to a single SQL query:

select 
    url,
    string_between(content, '<title>', '</title>') as title
from
    resources
where
    content_type like 'text/html%'

The results will look similar to:

| url | title |
| --- | --- |
| https://stackoverflow.com/questions/8318911/why-does-html-think-chucknorris-is-a-color | [Why does HTML think “chucknorris” is a color? - Stack Overflow] |
| https://en.wikipedia.org/wiki/List_of_animals_with_fraudulent_diplomas | [List of animals with fraudulent diplomas - Wikipedia] |
| https://www.amazon.co.jp/dp/B06XXQD54H/ | [Amazon \| アクータメンツ フィンガーリス 指人形 フィンガーパペット 指人形 \| おもちゃ雑貨 \| おもちゃ] |
| https://www.reddit.com/r/funny/comments/5yhipb/its_a_bit_breezy_out_there_today/ | [It's a bit breezy out there today : funny] |
| https://imgur.com/gallery/cJO834B | [Just cause you pelican doesn't mean you pelishould - Album on Imgur] |
| ... | |
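
If you want a feel for this query without touching the live web, the sketch below runs the same SQL against a small local mock of the resources table using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the sample rows are made up, and string_between is a hypothetical re-implementation registered as a SQLite function that returns only the first match as a plain string (the bracketed titles above suggest Mixnode's own function may return a list).

```python
import sqlite3

# Made-up sample rows standing in for the real resources table.
ROWS = [
    ("https://example.com/", "text/html; charset=utf-8", "en",
     "<html><head><title>Example Domain</title></head>"
     "<body><p>Hello.</p></body></html>"),
    ("https://example.org/fr", "text/html", "fr",
     "<html><head><title>Exemple</title></head></html>"),
    ("https://example.com/data.json", "application/json", None,
     '{"title": "not html"}'),
]

QUERY = """
select
    url,
    string_between(content, '<title>', '</title>') as title
from
    resources
where
    content_type like 'text/html%'
"""


def string_between(text, start, end):
    """Hypothetical stand-in: text between the first `start` and next `end`."""
    if text is None:
        return None
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


def build_mock_db():
    """Create an in-memory database with a tiny resources table."""
    conn = sqlite3.connect(":memory:")
    # Make the Python helper callable from SQL under the same name.
    conn.create_function("string_between", 3, string_between)
    conn.execute(
        "create table resources "
        "(url text, content_type text, content_language text, content text)"
    )
    conn.executemany("insert into resources values (?, ?, ?, ?)", ROWS)
    return conn


if __name__ == "__main__":
    for url, title in build_mock_db().execute(QUERY):
        print(url, title)
```

The JSON row is filtered out by the content_type condition, so only the two HTML pages come back.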

You can expand this query in any number of ways by utilizing the built-in columns and functions of Mixnode. For example, if you wanted to get the title of every English web page you could simply use a condition on the content_language column:

select 
    url,
    string_between(content, '<title>', '</title>') as title
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en'

Did you want the title and first paragraph of every English web page? The css_text_first function has you covered:

select 
    url,
    string_between(content, '<title>', '</title>') as title,
    css_text_first(content, 'p') as first_paragraph
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en'
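
For intuition about what a function like css_text_first does with the 'p' selector, here is a minimal Python sketch built on the standard html.parser module. It is an assumption-laden approximation, not Mixnode's implementation: the first_text helper is hypothetical, it only understands a bare tag name rather than full CSS selectors, and it does not handle elements of the same tag nested inside each other.

```python
from html.parser import HTMLParser


class _FirstElementText(HTMLParser):
    """Collect the text inside the first element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False   # currently inside the first matching element
        self.found = False    # a matching start tag has been seen
        self.done = False     # the first matching element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.found:
            self.inside = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        # Text of child elements (e.g. <b>) is collected too.
        if self.inside:
            self.chunks.append(data)


def first_text(html, tag):
    """Return the whitespace-normalized text of the first <tag>, or None."""
    parser = _FirstElementText(tag)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>web</b>!</p><p>Second</p></body></html>"
    print(first_text(page, "p"))
```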

Same query, but only on .net domains? You only need to use the url_etld column:

select 
    url,
    string_between(content, '<title>', '</title>') as title,
    css_text_first(content, 'p') as first_paragraph
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en' and
    url_etld = 'net'

Consider the task "Sort the English Wikipedia articles by length". Measuring length as the number of words in the content column, all you need is the order by clause:

select
    url,
    cardinality(words(content)) as article_length
from 
    resources
where 
    url_host = 'en.wikipedia.org' and
    url_abs_path like '/wiki/%'
order by article_length desc
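
The same idea can be tried locally with sqlite3 and a handful of made-up rows. In this sketch the word_count function is a hypothetical stand-in for cardinality(words(content)): SQLite has no array type, so the two steps are folded into one function, and splitting on whitespace is an assumption about how words are counted.

```python
import sqlite3

# Made-up rows: (url, url_host, url_abs_path, content).
ROWS = [
    ("https://en.wikipedia.org/wiki/A", "en.wikipedia.org", "/wiki/A",
     "one two three"),
    ("https://en.wikipedia.org/wiki/B", "en.wikipedia.org", "/wiki/B",
     "one two three four five"),
    ("https://en.wikipedia.org/w/index.php", "en.wikipedia.org",
     "/w/index.php", "one two three four five six seven"),
    ("https://fr.wikipedia.org/wiki/C", "fr.wikipedia.org", "/wiki/C",
     "un deux"),
]

QUERY = """
select
    url,
    word_count(content) as article_length
from
    resources
where
    url_host = 'en.wikipedia.org' and
    url_abs_path like '/wiki/%'
order by article_length desc
"""


def word_count(text):
    """Hypothetical stand-in for cardinality(words(content))."""
    return len(text.split()) if text else 0


def longest_articles():
    """Run the query against an in-memory mock of the resources table."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("word_count", 1, word_count)
    conn.execute(
        "create table resources "
        "(url text, url_host text, url_abs_path text, content text)"
    )
    conn.executemany("insert into resources values (?, ?, ?, ?)", ROWS)
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for url, length in longest_articles():
        print(length, url)
```

Only the two /wiki/ pages on en.wikipedia.org pass the filter, and the longer one comes first.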

By combining table columns and built-in functions, you can analyze the web in a practically unlimited number of ways. Additionally, you can integrate Mixnode with external data sources (e.g. sending data to and receiving data from Amazon S3) to create even more flexible queries.

Give it a try!

Mixnode allows you to focus only on what you need to get from the web and not how to get it. It is an end-to-end solution that takes you from question to answer with a simple query; you don't need to deploy web crawlers or run scrapers, you don't need to process raw data, and there are no "intermediate results".

Turn the web into a database!

Mixnode is a fast, flexible and massively scalable platform to extract and analyze data from the web.

Contact us at hi@mixnode.com