# Do you Word2Vec? Google's neural-network bookworm

## Making machines eat our words


Several years back, the Google "Brain Team" that was behind TensorFlow hatched another novel neural tool: Word2Vec.

Word2Vec is a two-layer neural net for processing text. It swallows a body of text and returns it as a set of vectors, one per word, turning the words into a numerical form that computers can work with.

Word2Vec therefore has a potential role in the business and research worlds, at least, in scanning documents and picking out meanings and associations.

The TensorFlow site describes Word2Vec in glowing terms as “a particularly computationally efficient predictive model for learning word embeddings from raw text”.

How does that compare to the reality?

I was introduced to Word2Vec by data science student Alastair Firrell and when I first came across it I was – frankly – incredulous.

What can it do? Consider the following:

“King” – Maleness + femaleness = ?

To a human, the result of this “calculation” is obviously the word “queen”. This is just the sort of calculation that Word2Vec can do, and do reliably.

The system can take this further: it can learn relationships and make “guesses” when given something that it doesn’t really know about. For example, if we give it the relationship that Paris is the capital of France, we can ask Word2Vec for the capital of Italy. In their original papers, the Google team gave several examples for cities: if Paris is to France then Rome is to Italy, Tokyo to Japan and Tallahassee to Florida.
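The arithmetic behind these analogies can be sketched in a few lines of Python. This is a toy illustration, not Word2Vec itself: the vectors and their dimensions are invented by hand so that the sums work out, whereas a trained model learns its vectors from text. The `analogy` helper and the two dictionaries are hypothetical names introduced here.

```python
import math

# Hand-made toy vectors (assumption: invented for illustration, not learned).
# Dimensions here: royalty, maleness, femaleness.
GENDER = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

# Dimensions here: is-a-city, French, Italian, Japanese.
CAPITALS = {
    "France": [0, 1, 0, 0], "Paris": [1, 1, 0, 0],
    "Italy":  [0, 0, 1, 0], "Rome":  [1, 0, 1, 0],
    "Japan":  [0, 0, 0, 1], "Tokyo": [1, 0, 0, 1],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the three input words so the answer is not one of the questions.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy(GENDER, "man", "king", "woman"))       # queen
print(analogy(CAPITALS, "France", "Paris", "Italy"))  # Rome
```

The answer is whichever remaining word lies closest to the computed point, so with real, learned vectors the result is a best guess rather than an exact match.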

Other examples were presented in that 2013 paper: if Cu is the short form of copper then Au is the short form of gold; if Einstein is a scientist, Mozart is a violin player and Picasso is an artist. That said, it was only about 60 per cent accurate, as it had been trained on a limited data set.

How does it work? Essentially, the system gives each word a vector whose components denote how strongly that word expresses each of a set of features. A simple example (and not a realistic one) might be a rock band scored on how fast they are, how heavy, how complicated the drum solos are, and so on. In a real model the vectors have hundreds of dimensions, and what each one represents is determined by the machine learning process itself.
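As a rough sketch of that idea, each band below is reduced to three hand-picked scores and compared with its neighbours by cosine similarity. Everything here is an assumption made for illustration: the band names are placeholders, the scores are invented, and the named features stand in for dimensions that a real model would learn without labels.

```python
import math

# Hypothetical, human-readable features; a trained model's dimensions
# are learned and carry no such labels.
FEATURES = ("speed", "heaviness", "drum_solo_complexity")

# Placeholder bands with invented scores in the range 0..1.
BANDS = {
    "band_a": (0.9, 0.8, 0.3),
    "band_b": (0.8, 0.9, 0.2),
    "band_c": (0.2, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(name):
    """Return the other band whose vector points most nearly the same way."""
    others = [b for b in BANDS if b != name]
    return max(others, key=lambda b: cosine(BANDS[b], BANDS[name]))


print(most_similar("band_a"))  # band_b
```

Two fast, heavy bands end up close together because their score vectors point in nearly the same direction, which is the property a trained model relies on when it groups related words.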

None of this is new (something the original authors readily admit), but what they did was speed up the training when given a large set of data. As with any machine learning algorithm, it’s the model training that takes the time.

I’ve been working with a fun model that figures out the relationship between rock bands and allows you to subtract one band from another. So, for instance, Black Sabbath minus Deep Purple gives (amongst others) Slayer and Bathory. You can download the code here, but be warned that the code may need updating to run with the latest libraries. The point is that even once you have the code and the data set (107MB of text scraped from Wikipedia, which you need to train the model), training took over an hour on my MacBook Pro.