This article is more than 1 year old

The net's a sprawling data mire – Webhose.io sprays away the gunk

Elasticsearch and Lucene underlie freemium web text search and analysis tool

Webhose.io turns the undrinkable torrent of web text data into sippable glasses filtered just for you.

Suppose you want to find all mentions on the web of Oracle and Mark Hurd over the past three days. You can't do this on Google. Searching "Oracle" + "Mark Hurd" gets you 229,000 results in 0.47 seconds, but they're not ordered by time.

Switch to a news search on Google and order it by time instead of relevance. You get no total number of results up front, only pages of results, nine at a time, ordered by time and date, and only from recognised Google news sources.
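To make the wanted query concrete, here is a minimal sketch of it over an in-memory list of documents: keep only those that mention every term and were published within the last three days, then sort newest first. The `Mention` record and the `recent_mentions` function are hypothetical names invented for illustration; this is not Webhose.io's API or Google's.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Mention:
    """Hypothetical record for one crawled piece of text."""
    url: str
    text: str
    published: datetime


def recent_mentions(docs, terms, now, days=3):
    """Return docs containing every term, published in the last `days` days, newest first."""
    cutoff = now - timedelta(days=days)
    lowered = [t.lower() for t in terms]
    hits = [
        d for d in docs
        if cutoff <= d.published <= now
        and all(t in d.text.lower() for t in lowered)  # case-insensitive substring match
    ]
    return sorted(hits, key=lambda d: d.published, reverse=True)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    docs = [
        Mention("https://example.com/1", "Oracle co-CEO Mark Hurd said...", now - timedelta(days=1)),
        Mention("https://example.com/2", "Oracle and Mark Hurd, last week", now - timedelta(days=6)),
    ]
    for hit in recent_mentions(docs, ["Oracle", "Mark Hurd"], now):
        print(hit.published.isoformat(), hit.url)
```

Unlike a relevance-ranked web search, this gives both a total count (`len` of the result) and a strict time ordering, which is the combination the example query needs.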

The chaps who founded Webhose.io in 2014, CEO Ran Geva and CMO Guy Mor, have come up with a way of getting this kind of information through a freemium model.

Ran Geva (left) and Guy Mor (right)

They funded Webhose.io with money from previous ventures. The aim was to write a web crawler that collects timestamped textual information from multiple web sources, including the Tor dark web, tags it with metadata, and stores it in a searchable form.

It finds the data on websites in machine-readable form, handles multi-language date formats along the way, and copies the data across to its own system.
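Handling multi-language date formats is the fiddly part of that step. The sketch below shows one simple way it could be done, assuming dates are written with a day number, a month name, and a four-digit year in some order. The `MONTH_NAMES` table (English, French, German, Spanish only) and the `parse_date` function are illustrative assumptions, not a description of how Webhose.io does it.

```python
import re
import unicodedata
from datetime import date

# Illustrative month-name table; a real crawler would need many more
# languages, abbreviations, and numeric-only formats.
MONTH_NAMES = [
    ("january", "janvier", "januar", "enero"),
    ("february", "février", "februar", "febrero"),
    ("march", "mars", "märz", "marzo"),
    ("april", "avril", "abril"),
    ("may", "mai", "mayo"),
    ("june", "juin", "juni", "junio"),
    ("july", "juillet", "juli", "julio"),
    ("august", "août", "agosto"),
    ("september", "septembre", "septiembre"),
    ("october", "octobre", "oktober", "octubre"),
    ("november", "novembre", "noviembre"),
    ("december", "décembre", "dezember", "diciembre"),
]
MONTHS = {name: i for i, names in enumerate(MONTH_NAMES, start=1) for name in names}


def parse_date(text):
    """Return a date from text such as '3 mars 2017', or None if no date is recognised."""
    normalised = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\d+|[^\W\d_]+", normalised)  # runs of digits or letters
    day = month = year = None
    for tok in tokens:
        if tok.isdecimal():
            if len(tok) == 4 and year is None:
                year = int(tok)
            elif len(tok) <= 2 and day is None:
                day = int(tok)
        elif tok in MONTHS and month is None:
            month = MONTHS[tok]
    if None in (day, month, year):
        return None
    try:
        return date(year, month, day)
    except ValueError:  # e.g. 31 February
        return None


if __name__ == "__main__":
    for sample in ["3 March 2017", "3. März 2017", "3 de marzo de 2017", "March 3, 2017"]:
        print(sample, "->", parse_date(sample))
```

Normalising every source to one date type at crawl time is what makes a later "past three days" filter a plain comparison rather than a per-language special case.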

The infrastructure is surprisingly small: 120 servers, some of them doing the web-crawling, with local storage using 8TB disk drives, in an Israeli data centre. The servers run Ubuntu, so no cash goes to Microsoft or Red Hat.

They provide a repository based on Elasticsearch and Lucene. The Webhosers actually started out with NoSQL but migrated to Lucene.

Mor says storage is cheap – it's CPU and memory that are the bottlenecks.

Nothing is analysed in the repository and, after 30 days' residence, content is pumped out to an archive.
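The 30-day rule amounts to a cutoff comparison on each document's timestamp. A minimal sketch, assuming each document is a hypothetical `(doc_id, crawled_at)` pair; how Webhose.io actually moves content to its archive is not described here.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # residence period quoted in the article


def split_for_archive(docs, now):
    """Split (doc_id, crawled_at) pairs into (keep, archive) lists by age."""
    cutoff = now - RETENTION
    keep = [d for d in docs if d[1] >= cutoff]
    archive = [d for d in docs if d[1] < cutoff]
    return keep, archive


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    docs = [("fresh", now - timedelta(days=2)), ("stale", now - timedelta(days=45))]
    keep, archive = split_for_archive(docs, now)
    print("keep:", [d[0] for d in keep])
    print("archive:", [d[0] for d in archive])
```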

Webhose has search-facility deals with customers IBM, Salesforce and Rabobank. IBM feeds Webhose-collected news data to Watson for training. The startup is two and a half years old, has 215 employees, 36,000 registered users, and says it's profitable.

You can sign up for the free Webhose facility here and try it out. You might like it. ®
