How to Extract and Analyze Data from Wikipedia


Introduction

From finding trends and statistics to marketing and keyword research, the wealth of data on Wikipedia can benefit commercial and non-commercial research in almost every domain. Companies of all sizes, researchers, data scientists, data hobbyists, and simply curious individuals all have reasons to extract and analyze the data on Wikipedia.

With hundreds of millions of web pages and millions of well-edited, multilingual articles, Wikipedia is a treasure trove of structured and unstructured data and a favorite destination of web crawlers. In fact, a quick search on GitHub reveals more than 300 web crawlers and similar projects developed specifically to extract data from Wikipedia.

Web crawling is not the only way to extract and analyze data from Wikipedia. For example, Wikimedia provides regular data dumps in a variety of formats. There is also the MediaWiki API, which lets you not only retrieve data from different wikis but also create bots and contribute to articles programmatically.
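
For instance, a single request to the English Wikipedia's API endpoint returns basic page metadata as JSON (a minimal illustration; the API supports many more actions and parameters than shown here):

    https://en.wikipedia.org/w/api.php?action=query&titles=SQL&format=json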

In this tutorial, we focus on how Mixnode can help you easily extract and analyze data from Wikipedia using SQL queries.

How Mixnode Works

Mixnode allows you to think of the web as a database. You are provided with a single table, pages, that represents the entire web and that you can write standard SQL queries against. Once you run a query, Mixnode automatically finds and analyzes the web pages needed to answer it. In the examples below, url is a page's address, content is its raw content (its HTML, in the case of a web page), and url_domain is the domain portion of its URL.

The following examples illustrate some of the ways you can use Mixnode and standard SQL queries to extract and analyze data from Wikipedia.

Example 1: Get the URL of every page from Wikipedia

select 
    url
from 
    pages
where 
    url_domain = 'wikipedia.org' 
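
This query assumes that Mixnode's url_domain function returns the registered domain of a page's URL, in which case the filter covers every language edition (en.wikipedia.org, fr.wikipedia.org, and so on). While exploring, you can also cap the result set with a standard limit clause:

select 
    url
from 
    pages
where 
    url_domain = 'wikipedia.org'
limit 100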

Example 2: Get the URL and title of every Wikipedia article

select 
    url, 
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
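
Here, h1#firstHeading is the selector for the heading element in which Wikipedia renders an article's title, and css_text_first presumably returns the text of the first element in content matching the given CSS selector. Note also that the /wiki/ path is shared by non-article namespaces such as Special: and Talk: pages; a sketch that filters out two of the most common ones (the list is not exhaustive):

select 
    url, 
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
    and
    url not like '%/wiki/Special:%'
    and
    url not like '%/wiki/Talk:%'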

Example 3: Get the URL and title of every Wikipedia article that contains the substring 'Elon Musk'

select 
    url, 
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
    and
    contains(content, 'Elon Musk')
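
Because contains is applied to content, the query matches the substring anywhere in the raw page, markup and navigation included. To match only the visible article body, one option (a sketch, assuming contains accepts any string expression) is to search the extracted text of Wikipedia's #content element instead:

select 
    url, 
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
    and
    contains(css_text_first(content, '#content'), 'Elon Musk')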

Example 4: Rank Wikipedia articles by number of references

select 
    url, 
    css_text_first(content, 'h1#firstHeading') as title,
    cardinality(css_text(content, 'ol.references li')) as reference_count
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
order by reference_count desc
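
This works because each footnote in a Wikipedia article is rendered as an li element inside an ol.references list. Unlike css_text_first, css_text presumably returns the text of every matching element as an array, and cardinality (the array-size function familiar from Presto-style SQL) turns that array into a count. Appending limit 10 would give you the ten most heavily referenced articles.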

Example 5: Rank Wikipedia articles by length

select
    url,
    css_text_first(content, 'h1#firstHeading') as title,
    cardinality(words(css_text_first(content, '#content'))) as article_length
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
order by article_length desc
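
Here, words presumably splits the extracted text into an array of words, so article_length is the word count of the #content element, which holds the article body.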

Example 6: What is the average length, in words, of a Wikipedia article?

select
    avg(cardinality(words(css_text_first(content, '#content')))) as average_article_length
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
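
Note that the average is computed over every page under /wiki/, which includes some non-article namespaces, so the result should be read as an approximation of the true average article length.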

Example 7: Rank Wikipedia articles by the length of their discussions

select
    url,
    remove_left(css_text_first(content, 'h1#firstHeading'), 'Talk:') as title,
    cardinality(words(css_text_first(content, '#content'))) as discussion_length
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/Talk:%'
order by discussion_length desc
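
Discussion about an article happens on its Talk: page (under /wiki/Talk:), so the word count of a talk page's #content element is a reasonable proxy for how much discussion the article has generated. remove_left presumably strips the given prefix from the start of a string, turning a heading like Talk:Elon Musk back into the plain article title.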

Example 8: Find every Wikipedia article that has a link to bbc.com

select
    url,
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
    and
    contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="www.bbc.com', 'href="bbc.com'])
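
contains_any presumably returns true if content includes at least one of the given substrings; the array enumerates common ways a link to bbc.com can appear in an href attribute. One case it misses is protocol-relative links (href="//..."); an expanded sketch:

select
    url,
    css_text_first(content, 'h1#firstHeading') as title
from 
    pages
where 
    url_domain = 'wikipedia.org' 
    and
    url like '%/wiki/%'
    and
    contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="//www.bbc.com', 'href="//bbc.com', 'href="www.bbc.com', 'href="bbc.com'])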

Turn the web into a database!

Mixnode is a fast, flexible, and massively scalable platform for extracting and analyzing data from the web.

To learn more, contact us at hi@mixnode.com.