Introducing the news table: a simple alternative to news/blog crawling

Nariman Jelveh June 4, 2019

From media and brand monitoring to machine learning and security applications, up-to-date web data plays an integral role in any modern information pipeline that depends on public data. Whether you need to find the latest trends, measure sentiment for your recent product release, understand the general reaction to certain news, or monitor the growth and popularity of your brand, you will have to go through millions of relevant articles and posts from all around the web quickly, and without breaking the bank!

The news table is our answer to the incredibly difficult and complex problem of finding, aggregating, and processing the latest articles, posts, and pages from all around the web. With the news table, you will never have to run another news crawler, blog scraper, or RSS feed aggregator. Every day we visit hundreds of millions of web pages, find and cache the latest published posts and articles, and provide them to you as a simple database table that you can run SQL queries against.

Example

Similar to the other tables in the Mixnode ecosystem, every row in the news table corresponds to a page from the web, identified by its url value. Many other columns are provided to help you write flexible queries and process the data. For example, publication_date is the date on which the page was published, so to retrieve the URLs of all pages published on May 25, 2019, you could simply run the following query:

select 
    url 
from 
    news
where
    cast (publication_date as varchar) = '2019-05-25'

content_language is another useful column that lets you narrow your queries further based on the language of the page. If you wanted to find all the English news pages published on May 25, 2019, you could simply modify the previous query as follows:

select 
    url 
from 
    news
where
    cast (publication_date as varchar) = '2019-05-25'
    and
    content_language = 'en'

Did you only need English articles from May 25, 2019, that mention 'bitcoin' in the title? No problem!

select 
    url 
from 
    news
where
    cast (publication_date as varchar) = '2019-05-25'
    and
    content_language = 'en'
    and
    lower(title) like '%bitcoin%'
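
The same filters can also be combined with an OR condition and a wider select list. The sketch below uses only columns mentioned in this post (url, title, publication_date, content_language) and assumes the same SQL dialect as the queries above; the second keyword, 'ethereum', is just an illustrative choice.

```sql
-- Sketch: English pages from May 25, 2019 whose title mentions
-- either keyword. Returns the title alongside the URL.
select
    url,
    title
from
    news
where
    cast (publication_date as varchar) = '2019-05-25'
    and
    content_language = 'en'
    and
    (
        lower(title) like '%bitcoin%'
        or
        lower(title) like '%ethereum%'  -- illustrative second keyword
    )
```

The parentheses around the two like conditions matter: without them, the or would bind more loosely than the and conditions and the date and language filters would no longer apply to the second keyword.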

The news table provides many more columns, such as author_meta_tag, description_meta_tag, url_host, ... Using these columns, you can write and execute queries with a variety of conditions to extract data from millions of news pages and blog posts. Additionally, if you prefer other processing methods, you can always request firehose access to the news table and process the data with your own tools.
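
As one illustration of what the additional columns allow, the following sketch counts matching pages per host using url_host. It assumes that the query engine supports standard aggregates together with group by, order by, and limit, which this post does not confirm; the alias page_count is a name chosen here for illustration.

```sql
-- Sketch (assumes count(*), group by, order by, and limit are supported):
-- which hosts published the most English pages mentioning 'bitcoin'
-- in the title on May 25, 2019?
select
    url_host,
    count(*) as page_count  -- hypothetical alias
from
    news
where
    cast (publication_date as varchar) = '2019-05-25'
    and
    content_language = 'en'
    and
    lower(title) like '%bitcoin%'
group by
    url_host
order by
    page_count desc
limit 20
```

If ordering by the alias is not accepted, ordering by count(*) desc is the usual equivalent.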

Give it a try!

We are incredibly excited to share the new news table with our users and look forward to all the innovation that will be unleashed by simple, affordable access to large-scale news data. Give it a try and contact us at hi@mixnode.com if you have any questions or comments.
