HPC

How to build your own Watson Jeopardy! supermachine

To rule humanity, download the following open source code...


Ask mom for her credit card...

However, unless you're a geek sheik or a billionaire bit-basher, you're obviously not going to buy all this iron. But you could use your mom's credit card if you're working from your basement bedroom – or your wife's if you're working from your man cave in the garage – to reserve some server instances on Amazon's EC2 compute cloud.

We would go for the Cluster Compute Instances that Amazon announced last July. The Cluster Compute Instances deliver 33.5 EC2 compute units of power, running in 64-bit mode. They present 23GB of virtual memory to the operating system (that's not much) and the processors used in the physical hardware underneath the CCI slices are in a two-socket x64 server based on Intel's 2.93GHz Xeon X5570s.

That means each slice has 8 cores, 16 threads, and 23GB of memory. The nodes are interconnected with 10 Gigabit Ethernet switches. To match the core count of the Power-based Watson machine, you'd need 360 of these slices. To match the thread count, you'd need 720 slices. And to match the aggregate main memory, you'd need 712 boxes. So it looks like 720 boxes will do the trick, provided that the overhead of the Xen-based Amazon EC2 hypervisor is not too high. At $1.60 per hour for the CCI slices, you are in for $1,152 per hour. Trust me, your ma or your wife won't mind. It's all for the benefit of science.

The thing that makes Watson a question-and-answer machine and not just a cluster running Linux is a mountain of code that IBM has developed called DeepQA. You can see what little IBM has to say about the DeepQA stack here. Two key elements of the DeepQA stack are open source programs available through the Apache Software Foundation.

The first is Apache Hadoop, the open source distributed data-crunching system created by Doug Cutting after he read about Google's back-end infrastructure. Hadoop joined the Apache incubator program in 2005 and was a workable system by around 2008 or so.

The other key piece of code in the DeepQA stack that Watson ran is Apache UIMA – Unstructured Information Management Architecture – which is an information-management framework created by IBM database gurus back in 2005 to help them cope with unstructured information such as text, audio, and video streams. The UIMA code performs the natural-language processing (NLP is the term of art in AI) that parses text and helps Watson figure out what a Jeopardy! clue is about.

IBM has embedded UIMA functions in various systems programs it sells, the first being the OmniFind semantic search engine that Big Blue put into its DB2 data warehouses. IBM has proposed UIMA as an OASIS standard, and took it open source to get people on board with its way of creating frameworks for managing unstructured data. UIMA has frameworks for Java and C++, but could no doubt be extended to whatever language you wanted to code your Watson QA machine in.

Gondek tells El Reg that IBM used Prolog to do question analysis. Some Watson algorithms are written in C or C++, particularly where the speed of the processing is important. But Gondek says that most of the hundreds of algorithms that do question analysis, passage scoring, and confidence estimation are written in Java. So maybe you want to use a RHEL-JBoss stack for your Watson.

Now here is the real problem with a DIY Watson: the algorithms that IBM's DeepQA team created to teach Watson how to play Jeopardy! consist of about a million lines of code. That's going to take you and your friends a bit more than a few weekends to create. But, if you do it, you can launch a deep analytics startup and sell it to HP or Microsoft for ba-zillions.

Let me offer you a few pointers from Gondek for when you build your machine. First, don't stuff it full of anything you can find on the Internet. In creating Watson, IBM's researchers figured out that authoritative texts like the Oxford English Dictionary, Bartlett's Familiar Quotations, Wikipedia – yes, Wikipedia – and various encyclopedias were the best data sets suited to playing Jeopardy!. You want precise data, to be sure, but you don't want to surround it with so much extraneous text that the machine will be churning through tons o' text to find an answer.

For example, you don't put in Moby Dick, but instead lots of authoritative texts that talk about Moby Dick and pull out the important passages. As it turns out, Watson needed about 200 million pages of text, or about the equivalent of 1 million books, to play Jeopardy!.

The other key insight that Gondek offers is to really focus on the question-parsing algorithms. By finding out what the key words are in any sentence and dispensing with the noise, you can not only get to the answer faster, but do a better job of coming up with the correct answer.

These two insights are what turned Watson from a crap Jeopardy! player into a champion. Good luck building your own. And dominating the world. ®

Similar topics


Other stories you might like

  • Why Cloud First should not have to mean Cloud Everywhere

    HPE urges 'consciously hybrid' strategy for UK public sector

    Sponsored In 2013, the UK government heralded Cloud First, a ground-breaking strategy to drive cloud adoption across the public sector. Eight years on, and much of UK public sector IT still runs on-premises - and all too often - on obsolete technologies.

    Today the government‘s message boils down to “cloud first, if you can” - perhaps in recognition that modernising complex legacy systems is hard. But in the private sector today, enterprises are typically mixing and matching cloud and on-premises infrastructure, according to the best business fit for their needs.

    The UK government should also adopt a “consciously hybrid” approach, according to HPE, The global technology company is calling for the entire IT industry to step up so that the public sector can modernise where needed and keep up with innovation: “We’re calling for a collective IT industry response to the problem,” says Russell MacDonald, HPE strategic advisor to the public sector.

    Continue reading
  • A Raspberry Pi HAT for the Lego Technic fan

    Sneaking in programming under the guise of plastic bricks

    There is good news for the intersection of Lego and Raspberry Pi fans today, as a new HAT (the delightfully named Hardware Attached on Top) will be unveiled for the diminutive computer to control Technic motors and sensors.

    Using a Pi to process sensor readings and manage motors has been a thing since the inception of the computer, and users (including ourselves) have long made use of the General Purpose Input / Output (GPIO) pins that have been a feature of the hardware for all manner of projects.

    However, not all users are entirely happy with breadboards and jumpers. Lego, familiar to many a builder thanks to lines such as its Mindstorms range, recently introduced the Education SPIKE Prime set, aimed at the classroom.

    Continue reading
  • Reg scribe spends week being watched by government Bluetooth wristband, emerges to more surveillance

    Home quarantine week was the price for an overseas trip, ongoing observation is the price of COVID-19

    Feature My family and I recently returned to Singapore after an overseas trip that, for the first time in over a year, did not require the ordeal of two weeks of quarantine in a hotel room.

    Instead, returning travelers are required to stay at home, wear a government-issued tracking device, and stay within range of a government-issued Bluetooth beacon at all times for a week … or else. No visitors are allowed and only a medical emergency is a ticket out. But that sounded easy compared to the hotel quarantine we endured in 2020.

    Continue reading

Biting the hand that feeds IT © 1998–2021