How to Extract and Analyze Data from Wikipedia
Introduction
From finding trends and statistics to marketing and keyword research, the wealth of data on Wikipedia can benefit commercial and non-commercial research in almost every domain. Companies of all sizes, researchers, data scientists, data hobbyists, and simply curious individuals are all examples along the spectrum of people and organizations trying to extract and analyze the data on Wikipedia.
With hundreds of millions of web pages and millions of multilingual, well-edited articles, Wikipedia is a treasure trove of structured and unstructured data and a favorite destination of web crawlers. In fact, a quick search on GitHub reveals more than 300 web crawlers and similar projects developed specifically to extract data from Wikipedia.
Web crawling is not the only way to extract and analyze data from Wikipedia. For example, Wikimedia provides regular data dumps in a variety of formats. There is also the Wikimedia API, which allows you not only to retrieve data from different wikis but also to create bots and contribute to articles programmatically.
In this tutorial, we are going to focus on how Mixnode can help you easily extract and analyze data from Wikipedia using SQL queries.
How Mixnode Works
Mixnode allows you to think of the web as a database. Using Mixnode, you are provided with a database table representing the entire web that you can write SQL queries against. Once you run a query, Mixnode automatically finds and analyzes the web pages needed to answer your query.
The following examples illustrate some of the ways you can use Mixnode and standard SQL queries to extract and analyze data from Wikipedia.
Example 1: Get the URL of every page from Wikipedia
select
url
from
pages
where
url_domain = 'wikipedia.org'
- url represents the URL of the web page.
- pages is the table that represents the web; every row in the pages table corresponds to a unique page on the web.
- url_domain = 'wikipedia.org' makes sure only pages from wikipedia.org and its subdomains (e.g. en.wikipedia.org, fr.wikipedia.org, ...) are considered.
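If you only need the number of Wikipedia pages rather than the full list of URLs, a standard aggregate should do; the following is a minimal sketch that assumes Mixnode supports the standard SQL count() function, which does not appear in the examples in this tutorial:
select
count(*) as page_count
from
pages
where
url_domain = 'wikipedia.org'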
Example 2: Get the URL and title of every Wikipedia article
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
- css_text_first(content, 'h1#firstHeading') returns the title of the Wikipedia article. By taking a look at the HTML source of a Wikipedia article we can tell that h1#firstHeading is the CSS path to the title of the article. css_text_first is a built-in function that allows us to extract the text of the first match of a CSS selector on an HTML string. content represents the HTTP response body of the page, in this case the complete HTML source of the web page.
- In this query we are looking to get the title of every Wikipedia article. We know that the path to an article on Wikipedia always starts with /wiki/ (e.g. https://en.wikipedia.org/wiki/Bitcoin). Using url like '%/wiki/%' we make sure only URLs in the form of .../wiki/... are returned.
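If you only care about one language edition, the same url filter can be narrowed further. A sketch restricted to the English Wikipedia, assuming its article URLs always contain en.wikipedia.org/wiki/:
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%en.wikipedia.org/wiki/%'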
Example 3: Get the title of every Wikipedia article that contains the substring Elon Musk
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains(content, 'Elon Musk')
- contains() is a built-in function that allows us to check whether a string contains a certain substring. Using contains(content, 'Elon Musk') we make sure that only pages whose content contains Elon Musk as a substring are considered.
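Several conditions can be chained in the same way; for example, a sketch that keeps only articles mentioning both Elon Musk and SpaceX by adding a second contains() call:
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains(content, 'Elon Musk')
and
contains(content, 'SpaceX')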
Example 4: Rank Wikipedia articles by number of references
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(css_text(content, 'ol.references li')) as reference_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
order by reference_count desc
- The source code of a random article on the English Wikipedia reveals that all references can be selected using the ol.references li CSS selector. css_text(content, 'ol.references li') will select the text of every reference of an article, and since we are only interested in the number of references, we use the cardinality() function to return only the size of the array output by css_text(content, 'ol.references li'), i.e. the number of references for each article.
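Because reference_count is just an expression, the same cardinality(css_text(content, 'ol.references li')) call can be repeated in the where clause to keep only heavily referenced articles. A sketch with an arbitrary threshold of 100 references:
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(css_text(content, 'ol.references li')) as reference_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
cardinality(css_text(content, 'ol.references li')) > 100
order by reference_count desc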
Example 5: Rank Wikipedia articles by length
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(words(css_text_first(content, '#content'))) as article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
order by article_length desc
- words() returns an array of all the words in a text. Using cardinality(words(css_text_first(content, '#content'))) as article_length we extract the number of words for an article and call that number article_length. And finally, using order by article_length desc we rank articles based on the number of words in descending order.
Example 6: What is the average length of a Wikipedia article?
select
avg(cardinality(words(css_text_first(content, '#content')))) as average_article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
- The avg() function will calculate the average of its input values, in this case the number of words in every article from Wikipedia.
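Other standard aggregates should work the same way; for instance, a sketch that reports the longest article instead of the average, assuming max() is available alongside avg():
select
max(cardinality(words(css_text_first(content, '#content')))) as longest_article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'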
Example 7: Rank Wikipedia articles by the length of their discussions
select
url,
remove_left(css_text_first(content, 'h1#firstHeading'), 'Talk:') as title,
cardinality(words(css_text_first(content, '#content'))) as discussion_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/Talk:%'
order by discussion_length desc
- Discussion pages on Wikipedia live in the Talk: namespace, so their paths look like /wiki/Talk:Bitcoin; url like '%/wiki/Talk:%' makes sure only discussion pages are considered. remove_left(css_text_first(content, 'h1#firstHeading'), 'Talk:') strips the leading Talk: prefix from the page title, leaving just the article's name. As before, cardinality(words(css_text_first(content, '#content'))) as discussion_length gives us the number of words on each discussion page, and order by discussion_length desc ranks discussions from longest to shortest.
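The aggregate approach from Example 6 carries over to discussion pages as well; a sketch of the average discussion length:
select
avg(cardinality(words(css_text_first(content, '#content')))) as average_discussion_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/Talk:%'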
Example 8: Find every Wikipedia article that has a link to bbc.com
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="www.bbc.com', 'href="bbc.com'])
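- contains_any() works like contains(), except that it takes an array of substrings and matches when the page's content contains any of them. Here it covers the common ways a link to bbc.com can appear in an href attribute.
If you only want to know how many such articles exist, counting them is a small variation on the same query (again assuming count() is supported):
select
count(*) as article_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="www.bbc.com', 'href="bbc.com'])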