Winning at chess, losing at language
This approach is much like computerized chess: make a statistical model of the domain and optimize the hell out of it, ultimately winning by sheer computational horsepower. Like chess (but unlike vision), language is a source of pride, something both complex and uniquely human. For chess, computational optimization worked brilliantly; the best chess-playing computers, like Deep Blue, are better than the best human players. But score-based optimization, in its current form, won't work for language, even though it does get two really important things right.
The first good thing about statistical machine translation is the statistics. Human brains are statistical-inference engines, and our senses routinely make up for noisy data by interpolating and extrapolating from whatever pixels or phonemes we can rely on. Statistical analysis makes better sense of more data than strict rules do, and statistical rules produce more robust outputs. So any ultimate human-quality translation engine must use statistics at its core.
The other good thing is the optimization. As I've argued earlier, the key to understanding and duplicating brain-like behavior lies in optimization, the evolutionary ratchet which lets an accumulation of small, even accidental adjustments slowly converge on a good result. Optimization doesn't need an Einstein, just the right quality metric and an army of engineers.
So Och's team (and their competitors) have the overall structure right: they converted text translation into an engineering problem, and have a software architecture allowing iterative improvement. They can improve their Black Box - but what's inside it? Och hinted at various trendy algorithms (Discriminative Learning and Expectation Maximization; I'll bet Bayesian Inference too), although our ever-vigilant chaperon from Google Communications wouldn't let him speak in detail. But so what? The optimization architecture lets you swap out this month's algorithm for a better one, so algorithms will change as performance improves.
Or maybe not. The Achilles' Heel of optimization is that everything depends on the performance metric, which in this case clearly misses a lot. That's not a problem for winning contests - the NIST competition used the same "BLEU" (Bilingual Evaluation Understudy) metric as Google practiced on, so Google's dramatic win mostly proved that Google gamed the scoring system better than IBM did. But the worse the metric, the less likely the translations will make sense.
The gist of the problem is that because machines don't yet understand language - that's the original problem, right? - they can't be too good at automatically evaluating language translations either. So researchers have to bootstrap the scoring: they take a scheme like BLEU (which merely compares the similarity of two same-language documents) and verify that, on average, humans prefer reading outputs with high scores. (They compare candidate translations against gold-standard human translations.)
But all BLEU really measures is word-by-word similarity: are the same words present in both documents, somewhere? The same word-pairs, triplets, quadruplets? In obviously extreme cases, BLEU works well; it gives a low score if the documents are completely different, and a perfect score if they're identical. But in between, it can produce some very screwy results.
The most obvious problem is that paraphrases and synonyms score zero; to get any credit with BLEU, you need to produce the exact same words as the reference translation. "Wander" doesn't get partial credit for "stroll," nor "sofa" for "couch."
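A minimal Python sketch of the kind of n-gram matching BLEU relies on makes the synonym problem concrete. It is an illustration, not the NIST scoring code: the function names and example sentences are invented here, and the full BLEU score also combines several n-gram lengths and penalizes outputs that are too short.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference.

    Counts are clipped, so a repeated n-gram earns credit only as
    often as it occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical sentences, invented for illustration only.
reference = "we stroll to the couch"
paraphrase = "we wander to the sofa"

exact_unigram = ngram_precision(reference, reference, 1)
para_unigram = ngram_precision(paraphrase, reference, 1)
para_bigram = ngram_precision(paraphrase, reference, 2)
synonym_credit = ngram_precision("wander", "stroll", 1)

print(exact_unigram, para_unigram, para_bigram, synonym_credit)
```

In this toy case the paraphrase keeps only three of its five words and one of its four word-pairs, while "wander" scored against "stroll" earns nothing at all, even though a human reader would call the two sentences equivalent.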
The complementary problem is that BLEU can give a high similarity score to nonsensical language which contains the right phrases in the wrong order. Consider first this typical, sensible output from a NIST contest:
"Appeared calm when he was taken to the American plane, which will to Miami, Florida"
Now here is a possible garbled output which would get the very same score:
"was being led to the calm as he was would take carry him seemed quite when taken"
The core problem is that word-counting scores like BLEU - the linchpin of these machine-translation competitions - don't even recognize well-formed language, much less real translated meaning. (A stinging academic critique of BLEU can be found here.)
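This order-blindness can be sketched in a few lines of Python. The sentences below are made up for illustration (they are not the NIST outputs quoted above), and the code only tallies clipped n-gram matches, the raw ingredients that BLEU combines, rather than computing a full BLEU score.

```python
from collections import Counter


def match_tallies(candidate, reference, max_n=4):
    """Return (matched, total) n-gram counts for n = 1..max_n.

    Matches are clipped: an n-gram is credited at most as many
    times as it occurs in the reference.
    """
    cand, ref = candidate.split(), reference.split()
    tallies = []
    for n in range(1, max_n + 1):
        # zip over n staggered copies of the token list yields its n-grams
        cand_counts = Counter(zip(*(cand[i:] for i in range(n))))
        ref_counts = Counter(zip(*(ref[i:] for i in range(n))))
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        tallies.append((matched, sum(cand_counts.values())))
    return tallies


# Hypothetical reference and candidates, invented for illustration.
reference = "the man seemed calm as he was led to the plane"
sensible = "the man appeared calm when he was taken to the plane"
# Same words as `sensible`, with the matching chunks shuffled around.
garbled = "to the plane when calm taken he was appeared the man"

sensible_tallies = match_tallies(sensible, reference)
garbled_tallies = match_tallies(garbled, reference)

print(sensible_tallies)
print(garbled_tallies)
```

Both candidates produce identical tallies for one-, two-, three- and four-word sequences, and they are the same length, so any score built only from these counts cannot tell the readable sentence from the scrambled one.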
A classic example of how the word-by-word translation approach fails comes from German, a language still too "tough" for Och's team to translate (although Och himself is a native speaker). German's problem is its relative-to-English-tangled Wordorder; take this example from Mark Twain's essay "The Awful German Language":
"But when he, upon the street, the (in-satin-and-silk-covered-now-very-unconstrained-after-the-newest-fashioned-dressed) government counselor's wife met, etc"
Until computers deal with the actual language structure (the hyphens and parentheses above), they will have no hope of translating even as well as Mark Twain did here.
So why are computers so much worse at language than at chess? Chess has properties that computers like: a well-defined state and well-defined rules for play. Computers win at chess, as they do at calculation, because they are so exact and fussy about rules. Language, on the other hand, needs approximation and inference to extract "meaning" (whatever that is) jointly from text, context, subject matter, tone, expectations, and so on - and the computer needs yet more approximation to produce a translated version of that meaning with all the right interlocking features. Unlike chess, the game of language is played on the human home-turf of multivariate inference and approximation, so we will continue to beat the machines.
But for Google's purposes, perfect translation may not even be necessary. Google succeeded in web-search partly by avoiding the exact search language of AltaVista in favor of a tool which was fast, easy to use, and displayed most of the right results in mostly the right order. Perhaps it will also be enough for Google to machine-translate most of the right words in mostly the right order, leaving to users the much harder task of extracting meaning from them. ®
Bill Softky has written a neat utility for Excel power users called FlowSheet: it turns cryptic formulae like "SUM(A4:A7)/D5" into pretty, intuitive diagrams. It's free, for now. Check it out.