What is Mixnode?
Mixnode turns the web into a giant database!
In other words, Mixnode lets you think of all the web pages, images, videos, PDF files, and other resources on the web as rows in a single database table: a giant table with trillions of rows that you can query using standard Structured Query Language (SQL). Rather than running web crawlers or scrapers, you can write simple queries in a familiar language to retrieve all sorts of interesting information from this table of live data.
url | content_type | content_language | content | headers | url_protocol | url_host | url_domain | url_etld | url_abs_path |
---|---|---|---|---|---|---|---|---|---|
https://news.ycombinator.com/ | text/html; charset=utf-8 | en | <html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=de... | HTTP/1.1 200 OK Server: nginx Date: Mon, 24 Sep 2018 19:36:30 GMT Content-Type: text/html; charse... | https | news.ycombinator.com | ycombinator.com | com | / |
https://fr.wikipedia.org/wiki/Base_de_donn%C3%A9es | text/html; charset=UTF-8 | fr | <!DOCTYPE html> <html class="client-nojs" lang="fr" dir="ltr"> <head> <meta charset="UTF-8"/> <title... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:39:49 GMT Content-Type: text/html; charset=UTF-8 Connec... | https | fr.wikipedia.org | wikipedia.org | org | /wiki/Base_de_donn%C3%A9es |
https://www.reddit.com/sitemaps/subreddit-sitemaps.xml | text/xml | NULL | <?xml version='1.0' encoding='UTF-8'?> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/... | HTTP/1.1 200 OK Last-Modified: Mon, 24 Sep 2018 06:13:14 GMT ETag: "aeae350d08f76f005e2fe8098a4713... | https | www.reddit.com | reddit.com | com | /sitemaps/subreddit-sitemaps.xml |
http://www.diarioelpuerto.com.mx/ | text/html | es | <!DOCTYPE HTML> <html> <head> <meta name="google-site-verification" content="SzDRrSxL_mhLV_bCAnR_s8e... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:13:26 GMT Server: Apache X-Powered-By: PHP/5.2.17 Keep... | http | www.diarioelpuerto.com.mx | diarioelpuerto.com.mx | com.mx | / |
http://www.wfnmc.org/mc20101.pdf | application/pdf | en | %PDF-1.6 206 0 obj <</Linearized 1/L 213940/O 208/E 89344/N 12/T 209772/H [ 1196 788]... | HTTP/1.1 200 OK ETag: "343b4-53e2b129-5cf784d6aa98c961" Last-Modified: Wed, 06 Aug 2014 22:50:17 G... | http | www.wfnmc.org | wfnmc.org | org | /mc20101.pdf |
https://code.jquery.com/jquery-1.11.3.js | application/javascript; charset=utf-8 | NULL | /*! * jQuery JavaScript Library v1.11.3 * http://jquery.com/ * * Includes Sizzle.js * http://si... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:55:14 GMT Connection: Keep-Alive Accept-Ranges: bytes ... | https | code.jquery.com | jquery.com | com | /jquery-1.11.3.js |
... |
Mixnode turns the web into a giant database table with multiple columns.
Just like a regular database table, this table provides several columns (a.k.a. fields) that represent different attributes of web resources, such as URL, content, content type, content language, and domain name. Additionally, Mixnode comes with hundreds of functions that you can use to analyze the data further. From parsing HTML/XML and JSON to handling date/time and processing text, there are numerous built-in functions to use directly in your queries.
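To make the relationship between the url column and the simpler URL-derived columns concrete, here is a minimal Python sketch that rebuilds url_protocol, url_host, and url_abs_path from a URL taken from the table above. The helper name split_url_columns is hypothetical, and this is only an illustration of what the columns appear to contain, not Mixnode's implementation. The url_domain and url_etld columns are left out because deriving them requires a public-suffix list.

```python
from urllib.parse import urlsplit


def split_url_columns(url: str) -> dict:
    """Derive the simpler URL columns from a full URL (illustrative only)."""
    parts = urlsplit(url)
    return {
        "url": url,
        "url_protocol": parts.scheme,      # e.g. "https"
        "url_host": parts.hostname,        # e.g. "fr.wikipedia.org"
        "url_abs_path": parts.path or "/", # percent-encoding is preserved
    }


row = split_url_columns("https://fr.wikipedia.org/wiki/Base_de_donn%C3%A9es")
print(row)
```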
As a simple example, getting the URL and title of every page on the web boils down to a single SQL query in Mixnode:
select
    url,
    string_between(content, '<title>', '</title>') as title
from
    resources
where
    content_type like 'text/html%'
The results will look similar to:
url | title |
---|---|
https://stackoverflow.com/questions/8318911/why-does-html-think-chucknorris-is-a-color | [Why does HTML think “chucknorris” is a color? - Stack Overflow] |
https://en.wikipedia.org/wiki/List_of_animals_with_fraudulent_diplomas | [List of animals with fraudulent diplomas - Wikipedia] |
https://www.amazon.co.jp/dp/B06XXQD54H/ | [Amazon \| アクータメンツ フィンガーリス 指人形 フィンガーパペット 指人形 \| おもちゃ雑貨 \| おもちゃ] |
https://www.reddit.com/r/funny/comments/5yhipb/its_a_bit_breezy_out_there_today/ | [It's a bit breezy out there today : funny] |
https://imgur.com/gallery/cJO834B | [Just cause you pelican doesn't mean you pelishould - Album on Imgur] |
... |
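The title query can be tried out locally with a small emulation. The sketch below uses Python's built-in sqlite3 module, a two-row stand-in for the resources table (the example.com rows are made up), and a user-defined string_between function. The semantics of string_between are an assumption here: it returns the first substring between the two delimiters, whereas the bracketed titles in the results above suggest the real function may return a list of matches.

```python
import sqlite3


def string_between(text, start, end):
    """Assumed semantics: first substring between `start` and the next `end`."""
    if text is None:
        return None
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it with three arguments.
conn.create_function("string_between", 3, string_between)
conn.execute("create table resources (url text, content_type text, content text)")
conn.executemany(
    "insert into resources values (?, ?, ?)",
    [
        # Made-up rows standing in for the live web table.
        ("https://example.com/", "text/html; charset=utf-8",
         "<html><head><title>Example Domain</title></head></html>"),
        ("https://example.com/logo.png", "image/png", None),
    ],
)

rows = conn.execute(
    """
    select
        url,
        string_between(content, '<title>', '</title>') as title
    from
        resources
    where
        content_type like 'text/html%'
    """
).fetchall()
print(rows)
```

Only the HTML row passes the content_type filter, so the image row never reaches string_between.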
You can expand this query in any number of ways by utilizing the built-in columns and functions of Mixnode.
For example, if you wanted to get the title of every English web page you could simply use a condition on the content_language column:
select
    url,
    string_between(content, '<title>', '</title>') as title
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en'
Did you want the title and first paragraph of every English web page? The css_text_first function has you covered:
select
    url,
    string_between(content, '<title>', '</title>') as title,
    css_text_first(content, 'p') as first_paragraph
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en'
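For intuition about what css_text_first(content, 'p') presumably returns, the following Python sketch extracts the text of the first p element using only the standard library. It is an approximation under stated assumptions: the names first_paragraph_text and _FirstParagraph are hypothetical, only the plain tag selector 'p' is handled (not general CSS selectors), and whitespace is collapsed, which the real function may or may not do.

```python
from html.parser import HTMLParser
from typing import Optional


class _FirstParagraph(HTMLParser):
    """Collects the text inside the first <p> element it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.in_p = False   # currently inside the first <p>
        self.done = False   # first <p> has been fully read
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "p" or self.done:
            return
        if self.in_p:
            # A new <p> implicitly closes the one being collected.
            self.in_p, self.done = False, True
        else:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p, self.done = False, True

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def first_paragraph_text(html: str) -> Optional[str]:
    """Return the whitespace-normalized text of the first <p>, or None."""
    parser = _FirstParagraph()
    parser.feed(html)
    parser.close()
    if not (parser.done or parser.in_p):
        return None
    return " ".join("".join(parser.chunks).split())


page = "<html><body><h1>T</h1><p>Hello <b>web</b>!</p><p>Second</p></body></html>"
print(first_paragraph_text(page))
```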
Same query, but only on .net domains? You only need to use the url_etld column:
select
    url,
    string_between(content, '<title>', '</title>') as title,
    css_text_first(content, 'p') as first_paragraph
from
    resources
where
    content_type like 'text/html%' and
    content_language = 'en' and
    url_etld = 'net'
Consider the task "Sort the English Wikipedia articles by length". All you need to accomplish it is the order by clause:
select
    url,
    cardinality(words(content)) as article_length
from
    resources
where
    url_host = 'en.wikipedia.org' and
    url_abs_path like '/wiki/%'
order by article_length desc
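The same ranking logic can be sketched in a few lines of Python. This assumes that words(content) splits text on whitespace and that cardinality counts the resulting items; both are guesses about the built-in functions, and the three article URLs and their contents are made-up placeholders.

```python
def article_length(content: str) -> int:
    """Assumed equivalent of cardinality(words(content)): whitespace word count."""
    return len(content.split())


# Made-up stand-ins for rows where url_host = 'en.wikipedia.org'.
articles = {
    "https://en.wikipedia.org/wiki/A": "one two three",
    "https://en.wikipedia.org/wiki/B": "one two three four five",
    "https://en.wikipedia.org/wiki/C": "one",
}

# Equivalent of "order by article_length desc".
ranked = sorted(
    ((url, article_length(text)) for url, text in articles.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Because the content column holds the raw response body, a count like this would include markup tokens as well as visible text.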
By combining table columns and built-in functions, you can analyze the web in a practically unlimited number of ways. Additionally, you can integrate Mixnode with external data sources (e.g. sending data to and receiving data from Amazon S3) to create even more flexible queries.
Give it a try!
Mixnode allows you to focus only on what you need to get from the web and not how to get it. It is an end-to-end solution that takes you from question to answer with a simple query; you don't need to deploy web crawlers or run scrapers, you don't need to process raw data, and there are no "intermediate results".