How to Extract and Analyze Data from Wikipedia
Introduction
From finding trends and statistics to marketing and keyword research, the wealth of data on Wikipedia can benefit commercial and non-commercial research in almost every domain. Companies of all sizes, researchers, data scientists, data hobbyists, and simply curious individuals are all examples along the spectrum of people and organizations trying to extract and analyze the data on Wikipedia.
With hundreds of millions of web pages and millions of multilingual, well-edited articles, Wikipedia is a treasure trove of structured and unstructured data and a favorite destination of web crawlers. In fact, a quick search on GitHub reveals more than 300 web crawlers and similar projects developed specifically to extract data from Wikipedia.
Web crawling is not the only way to extract and analyze data from Wikipedia. For example, Wikimedia provides regular data dumps in a variety of formats. There is also the Wikimedia API, which allows you not only to retrieve data from different wikis but also to create bots and contribute to articles programmatically.
In this tutorial, we are going to focus on how Mixnode can help you easily extract and analyze data from Wikipedia using SQL queries.
How Mixnode Works
Mixnode allows you to think of the web as a database. Using Mixnode, you are provided with a database table representing the entire web that you can write SQL queries against. Once you run a query, Mixnode automatically finds and analyzes the web pages needed to answer your query.
The following examples illustrate some of the ways you can use Mixnode and standard SQL queries to extract and analyze data from Wikipedia.
Example 1: Get the URL of every page from Wikipedia
select
url
from
pages
where
url_domain = 'wikipedia.org'
- url represents the URL of the web page.
- pages is the table that represents the web; every row in the pages table corresponds to a unique page on the web.
- url_domain = 'wikipedia.org' makes sure only pages from wikipedia.org and its subdomains (e.g. en.wikipedia.org, fr.wikipedia.org, ...) are considered.
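If you only need the number of Wikipedia pages rather than the full list of URLs, a standard aggregate should do; the following is a minimal sketch that assumes Mixnode supports the standard SQL count() function, which does not appear in the examples in this tutorial:
select
count(*) as page_count
from
pages
where
url_domain = 'wikipedia.org'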
Example 2: Get the URL and title of every Wikipedia article
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
- css_text_first(content, 'h1#firstHeading') returns the title of the Wikipedia article. By taking a look at the HTML source of a Wikipedia article we can tell that h1#firstHeading is the CSS path to the title of the article. css_text_first is a built-in function that allows us to extract the text of the first match of a CSS selector on an HTML string. content represents the HTTP response body of the page, in this case the complete HTML source of the web page.
- In this query we are looking to get the title of every Wikipedia article. We know that the path to an article on Wikipedia always starts with /wiki/ (e.g. https://en.wikipedia.org/wiki/Bitcoin). Using url like '%/wiki/%' we make sure only URLs in the form of .../wiki/... are returned.
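If you only care about one language edition, the same url filter can be narrowed further. A sketch restricted to the English Wikipedia, assuming its article URLs always contain en.wikipedia.org/wiki/:
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%en.wikipedia.org/wiki/%'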
Example 3: Get the title of every Wikipedia article that contains the substring Elon Musk
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains(content, 'Elon Musk')
- contains() is a built-in function that allows us to check whether a string contains a certain substring. Using contains(content, 'Elon Musk') we make sure that only pages whose content contains Elon Musk as a substring are considered.
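Several conditions can be chained in the same way; for example, a sketch that keeps only articles mentioning both Elon Musk and SpaceX by adding a second contains() call:
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains(content, 'Elon Musk')
and
contains(content, 'SpaceX')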
Example 4: Rank Wikipedia articles by number of references
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(css_text(content, 'ol.references li')) as reference_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
order by reference_count desc
- The source code of a random article on the English Wikipedia reveals that all references can be selected using the ol.references li CSS selector. css_text(content, 'ol.references li') will select the text of every reference of an article, and since we are only interested in the number of references, we use the cardinality() function to return only the size of the array output by css_text(content, 'ol.references li'), i.e. the number of references for each article.
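Because reference_count is just an expression, the same cardinality(css_text(content, 'ol.references li')) call can be repeated in the where clause to keep only heavily referenced articles. A sketch with an arbitrary threshold of 100 references:
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(css_text(content, 'ol.references li')) as reference_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
cardinality(css_text(content, 'ol.references li')) > 100
order by reference_count desc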
Example 5: Rank Wikipedia articles by length
select
url,
css_text_first(content, 'h1#firstHeading') as title,
cardinality(words(css_text_first(content, '#content'))) as article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
order by article_length desc
- words() returns an array of all the words in a text. Using cardinality(words(css_text_first(content, '#content'))) as article_length we extract the number of words for an article and call that number article_length. And finally, using order by article_length desc we rank articles based on the number of words in descending order.
Example 6: What is the average length of a Wikipedia article?
select
avg(cardinality(words(css_text_first(content, '#content')))) as average_article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
- The avg() function will calculate the average of its input values, in this case the number of words in every article from Wikipedia.
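Other standard aggregates should work the same way; for instance, a sketch that reports the longest article instead of the average, assuming max() is available alongside avg():
select
max(cardinality(words(css_text_first(content, '#content')))) as longest_article_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'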
Example 7: Rank Wikipedia articles by the length of their discussions
select
url,
remove_left(css_text_first(content, 'h1#firstHeading'), 'Talk:') as title,
cardinality(words(css_text_first(content, '#content'))) as discussion_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/Talk:%'
order by discussion_length desc
- Discussion pages on Wikipedia live in the Talk: namespace, so their paths look like /wiki/Talk:Bitcoin; url like '%/wiki/Talk:%' makes sure only discussion pages are considered. remove_left(css_text_first(content, 'h1#firstHeading'), 'Talk:') strips the leading Talk: prefix from the page title, leaving just the article's name. As before, cardinality(words(css_text_first(content, '#content'))) as discussion_length gives us the number of words on each discussion page, and order by discussion_length desc ranks discussions from longest to shortest.
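The aggregate approach from Example 6 carries over to discussion pages as well; a sketch of the average discussion length:
select
avg(cardinality(words(css_text_first(content, '#content')))) as average_discussion_length
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/Talk:%'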
Example 8: Find every Wikipedia article that has a link to bbc.com
select
url,
css_text_first(content, 'h1#firstHeading') as title
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="www.bbc.com', 'href="bbc.com'])
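- contains_any() works like contains(), except that it takes an array of substrings and matches when the page's content contains any of them. Here it covers the common ways a link to bbc.com can appear in an href attribute.
If you only want to know how many such articles exist, counting them is a small variation on the same query (again assuming count() is supported):
select
count(*) as article_count
from
pages
where
url_domain = 'wikipedia.org'
and
url like '%/wiki/%'
and
contains_any(content, array['href="https://www.bbc.com', 'href="http://www.bbc.com', 'href="https://bbc.com', 'href="http://bbc.com', 'href="www.bbc.com', 'href="bbc.com'])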