Powerset's semantic obsession is already working its way into Bing's primary search engine, helping to suss out the meaning behind end-user queries, generate captions for query results, and suggest related queries.
Microsoft acquired the San Francisco-based Powerset last summer in a deal worth a reported $100m, nearly a year before Bing's much-ballyhooed debut. At the time, the startup offered a semantic search engine that indexed nothing but Wikipedia, and this Wikicontraption was eventually bolted to the side of Bing's primary search engine and rechristened as a "Reference" vertical.
But the ultimate goal is to meld Powerset's semantic indexing with Bing proper, and according to Scott Prevost, who oversees Powerset's interplay with Redmond, the melding is well underway.
"We're taking pieces of our technology and integrating it throughout the Bing stack," Prevost tells The Reg. "So things like helping on some of the query processing. And we're now working on some of the caption generation - the text that occurs under the blue link on search results. This is part of a longer-term, deeper integration of our technologies throughout all of Bing."
He also says the outfit is "doing some work with related searches" - i.e. helping to suggest additional queries the user may be interested in.
Still based in San Francisco - several hundred miles away from Bing's Redmond base - the 65-person-strong Powerset is "diving very deeply" into the task of caption generation. "It's one of the things that helps users understand the relevance of a particular search result to their query," Prevost says. "If you have good captions, it helps users not waste time looking through pages.
"One of the challenges in developing captions is finding the right pieces of text on a page to represent that link, so semantic processing really helps. It helps pick the right sentences, sentences that may have the right concepts but not necessarily the keywords from [the user's query]. It helps us pick the piece of the sentence that's most relevant and not chop it off in places that makes it unreadable...
"You see things in Powerset captions such as whole phrases being highlighted, phrases where the words don't match all the keywords but the meaning of the words matches. Sometimes, you get a great sentence in an article and it doesn't have all the keywords but it's really the thing that best explains what the sentence is about."
Using its own back-end infrastructure, Powerset works to build a semantic index for at least a portion of the web. "When we index a document, we do much heavier processing," Prevost explains. "We do deep linguistic processing, everything from morphological analysis - scanning the words for our speech patterns - to full-on syntactic parsing of sentences.
"Then we have a component that extracts semantic relationships from those parses." For instance, the outfit's proprietary tech works to recognize synonyms or associated generic pronouns with particular names. Then, after doing a similar analysis on an end-user query, Powerset can match semantic data between query and index.
Yes, Powerset's back-end runs on Hadoop, the open-source distributed-computing platform based on Google's proprietary infrastructure. Powerset originated Hadoop's Hbase project, a mirror of Google's distributed database, BigTable. And yes, that means open-source code is juicing at least a portion of Bing proper. "What we provide Bing with is data, and data can be produced using various open-source tools in Powerset's data center," Prevost says.
Famously, Microsoft spent years treating open source like a pariah, and even now it seems that relatively few of the company's shipping products embrace open code. But according to Prevost, Microsoft was always open to the idea of retaining Powerset's Hbase base.
"We obviously had a lot of conversations [with Microsoft] about what we were doing and why it was important," Prevost says. "Microsoft was very open to the idea of open source. Obviously, Microsoft has a lot of IP concerns with software in so many different domains, so they want to be very careful about these things...but it was really just a matter of working out the details."
After the acquisition, while these conversations played out, Powerset's two full-time Hbase committers took leave from the project. But by October, they were approved to resume contributing patches.
As you might expect, Microsoft has no plans to migrate Bing proper onto the platform. "We haven't done anything to the Bing code base that explicitly uses Hbase," Prevost says.
But whether it's underpinned by Hadoop or not, Powerset intends to build a semantic index for the entire web. It just needs some time - and some cheaper, faster processing power. "Where we are right now is that it's still very expensive. We spend a lot more time indexing a page and that takes a lot more processing power. And that creates a much larger index, which is more expensive to serve. It wouldn't make sense for us to index the entire web, because it would be highly expensive, and for certain kinds of pages, we might not see the value."
So, for the moment, Powerset is indexing Wikipedia. But there's more to come. It may add other, contained datasets to Bing's Reference vertical, before attempting to embrace the web as a whole. And yes, it will take that Reference tab out of hiding. As it stands, Powerset's Wikisearch is limited to a relatively small number of queries, including the search for "Albert Einstein."
How does Powerset avoid Wikinonsense? According to Prevost, it re-indexes the "free encyclopedia anyone can edit" every two hours or so. "We look for changes and re-index those articles," Prevost explains. "That helps to make sure we don't have pages that are vandalized...the vandalized pages get fixed pretty quickly." Or so it seems.
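Prevost doesn't say how Powerset spots changed articles, but one simple way to re-index only what has changed is to compare content fingerprints between passes. The Python sketch below assumes exactly that. The `reindex_changed` function, its `seen` dictionary, and the `index_fn` callback are hypothetical names, and the two-hour scheduling is left to whatever runs the function.

```python
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of the article text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reindex_changed(articles, seen, index_fn):
    """Re-index only the articles whose text changed since the last pass.

    articles: dict mapping title -> current text
    seen:     dict mapping title -> fingerprint from the last pass (updated in place)
    index_fn: callable(title, text) that performs the actual, expensive indexing
    Returns the list of titles that were re-indexed.
    """
    changed = []
    for title, text in articles.items():
        fp = fingerprint(text)
        if seen.get(title) != fp:
            index_fn(title, text)
            seen[title] = fp
            changed.append(title)
    return changed
```

On the first pass every article counts as changed. On later passes, only edited articles, including vandalized ones and their subsequent fixes, reach `index_fn`, which keeps the costly semantic processing confined to pages that actually need it.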