

Do you Word2Vec? Google's neural-network bookworm

Making machines eat our words

Fri 13 Oct 2017 // 09:06 UTC

Several years back, the Google "Brain Team" that was behind TensorFlow hatched another novel neural tool: Word2Vec.

Word2Vec is a two-layer neural net for processing text. It swallows a given set of text that it then returns as a set of vectors – turning the words into a numerical form that computers can understand.

Word2Vec therefore has a potential role, in the business and research worlds at least, in scanning documents and picking out meanings and associations.


The TensorFlow site describes Word2Vec in glowing terms as “a particularly computationally efficient predictive model for learning word embeddings from raw text”.

How does that compare to the reality?

I was introduced to Word2Vec by data science student Alastair Firrell and when I first came across it I was – frankly – incredulous.

How does it work? Consider the following:

“King” – Maleness + femaleness = ?

The result of this “calculation” is obviously the word “queen”, and this is just the sort of calculation that Word2Vec can do, and do reliably.
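To make the arithmetic concrete, here is a minimal sketch in plain Python. It is not Word2Vec itself: the three-dimensional vectors are invented for illustration (with hypothetical axes for royalty, maleness and femaleness), whereas a trained model learns its own unlabelled dimensions from text. The `nearest` helper is likewise an assumption of this sketch, and "man" and "woman" stand in for maleness and femaleness.

```python
import math

# Hand-made toy embeddings. Hypothetical axes: royalty, maleness, femaleness.
# A real model learns its values from text; these numbers are made up.
VECTORS = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude=()):
    """Return the known word whose vector is closest to `target`."""
    candidates = (w for w in VECTORS if w not in exclude)
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

# "king" - "man" + "woman", computed component by component.
target = [k - m + w for k, m, w in
          zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])]

# The input words are excluded so the query cannot simply return itself.
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)
```

With these toy numbers the closest remaining word is "queen". The answer is the nearest word to the computed point rather than an exact match, which is why a real model's result depends on what it was trained on.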

The system can take this further: it can learn relationships and make “guesses” when given something that it doesn’t really know about. For example, if we give it the relationship that Paris is the capital of France, we can ask Word2Vec for the capital of Italy. In their original papers, the Google team gave several examples for cities: if Paris is to France then Rome is to Italy, Tokyo to Japan and Tallahassee to Florida.

Other examples were also presented in the 2013 paper: if Cu is the short form of copper then Au is the short form of gold; if Einstein is a scientist, then Mozart is a violin player and Picasso is an artist. That said, it was only about 60 per cent accurate, as it had been trained on a limited data set.

How does it work? Essentially, the system creates a vector for each word, and each number in that vector denotes how strongly the word exhibits a particular feature. A simple example (and not a realistic one) might be rock bands: how fast they are, how heavy, how complicated the drum solos are, and so on. In the real code there are far more of these dimensions, typically hundreds, and they are determined by the machine learning process itself rather than chosen by hand.
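The rock band illustration can be sketched in a few lines of Python. Everything here is hypothetical: the band names are placeholders, the three features are hand-picked and the scores are invented. It only shows the idea that items with similar feature vectors count as "close"; a real model learns the features itself.

```python
import math

# Hypothetical hand-scored features, each between 0 and 1:
# (speed, heaviness, drum-solo complexity). All values are invented.
BANDS = {
    "band_a": [0.9, 0.8, 0.2],  # fast and heavy
    "band_b": [0.8, 0.9, 0.3],  # similar profile to band_a
    "band_c": [0.2, 0.1, 0.9],  # slow, light, elaborate drumming
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(name):
    """Return the other band whose vector points most nearly the same way."""
    others = (b for b in BANDS if b != name)
    return max(others, key=lambda b: cosine(BANDS[name], BANDS[b]))

print(most_similar("band_a"))
```

Cosine similarity is used here because it compares the direction of two vectors rather than their length, a common choice for comparing word vectors.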

None of this is new (something the original authors readily admit), but what they did was speed up the training when given a large set of data. As with any machine learning algorithm, it’s the model training that takes the time.
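To give a feel for where the training time goes, here is a minimal sketch of how raw text can be turned into training examples. It assumes the skip-gram formulation, one of the architectures described in the original papers, in which the network learns to predict the words near each word. The function name `skipgram_pairs`, the `window` parameter and the whitespace tokenisation are assumptions of this sketch, and the refinements real implementations use to speed things up are not shown.

```python
def skipgram_pairs(text, window=2):
    """Build (centre word, context word) pairs from a string of text.

    Each word is paired with every word at most `window` positions
    to its left or right.
    """
    tokens = text.lower().split()  # naive whitespace tokenisation
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs("making machines eat our words", window=1)
for centre, context in pairs:
    print(centre, "->", context)
```

Every pair is one training example, so the number of examples grows with both the size of the text and the width of the window.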

I’ve been working with a fun model that figures out the relationship between rock bands and allows you to subtract one band from another. So, for instance, Black Sabbath minus Deep Purple gives (amongst others) Slayer and Bathory. The code is available to download, but be warned that it may need updating to run with the latest libraries. The point is that even once you have the code and the data set (a massive 107MB scraped from Wikipedia), you still need to train the model, and that took over an hour on my MacBook Pro.
