OpenAI claims GPT-4 will beat 90% of you in an exam
Weirdly coy about the AI model's size and what went into making it, tho
OpenAI on Tuesday announced the qualified arrival of GPT-4, its latest milestone in the making of call-and-response deep learning models and one that can seemingly outperform its fleshy creators in important exams.
According to OpenAI, the model exhibits "human-level performance on various professional and academic benchmarks." GPT-4 can pass a simulated bar exam in the top 10 percent of test takers, whereas its predecessor, GPT-3.5 (the basis of ChatGPT), scored around the bottom 10 percent.
GPT-4 also performed well on various other exams, like SAT Math (700 out of 800). It's not universally capable, however, scoring only 2 on the AP English Language and Composition exam (14th to 44th percentile).
One thing to consider: OpenAI's GPT series by its very nature is a family of regurgitation engines, drawing upon material it was trained on and reassembling it to address your query. Sometimes it's right, and sometimes it's wrong. That it can recall details for exams may not seem that impressive to you, or it may be more of a comment on the kinds of tests we humans have to take.
"It is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it," OpenAI CEO Sam Altman acknowledged, referring to GPT-4.
GPT-4 is a large multimodal model, as opposed to a large language model. It is designed to accept queries via text and image inputs, with answers returned in text. It's being made available initially via the wait-listed GPT-4 API and to ChatGPT Plus subscribers in a text-only capacity. Image-based input is still being refined.
Despite the addition of a visual input mechanism, OpenAI is not being open about or providing visibility into the making of its model. The upstart has chosen not to release details about its size, how it was trained, or what data went into the process.
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar," the company said in its technical paper [PDF].
In a live stream on YouTube, Greg Brockman, president and co-founder of OpenAI, demonstrated the difference between GPT-4 and GPT-3.5 by asking the models to summarize the OpenAI GPT-4 blog post in a single sentence where every word begins with the letter "G."
GPT-3.5 simply didn't try. GPT-4 returned "GPT-4 generates ground-breaking, grandiose gains, greatly galvanizing generalized AI goals." And when Brockman told the model that the inclusion of "AI" in the sentence doesn't count, GPT-4 revised its response in another G-laden sentence without "AI" in it.
He then went on to have GPT-4 generate the Python code for a Discord bot. More impressively, he took a picture of a hand-drawn mockup of a jokes website, sent the image to Discord, and the associated GPT-4 model responded with HTML and JavaScript code to realize the mockup site.
Finally, Brockman set up GPT-4 to analyze 16 pages of US tax code to return the standard deduction for a couple, Alice and Bob, with specific financial circumstances. OpenAI's model responded with the correct answer, along with an explanation of the calculations involved.
Beyond better reasoning, evident in its improved test scores, GPT-4 is intended to be more collaborative (iterating as directed to improve previous output), better able to handle lots of text (analyzing or outputting novella-length chunks of around 25,000 words), and capable of accepting image-based input (for object recognition, though that capability isn't yet publicly available).
What's more, GPT-4, according to OpenAI, should be less likely to go off the rails than its predecessors.
"We’ve spent six months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails," the org says.
People may already be familiar with this "far from perfect" level of safety from the rocky debut of Microsoft Bing's question-answering capabilities. Bing, it turns out, uses GPT-4 as the basis for its Prometheus model.
OpenAI acknowledges that GPT-4 "hallucinates facts and makes reasoning errors" like its ancestors, but the org insists the model does so to a lesser extent.
"While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration)," the company explains. "GPT-4 scores 40 percent higher than our latest GPT-3.5 on our internal adversarial factuality evaluations."
Pricing for GPT-4 is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens, where a token is about four characters. There's also a default rate limit of 40,000 tokens per minute and 200 requests per minute.
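To put those prices in perspective, here is a rough back-of-the-envelope sketch in Python. The function names are our own, invented for illustration, and the token estimate simply applies the four-characters-per-token rule of thumb quoted above rather than a real tokenizer, so treat the output as a ballpark figure only.

```python
import math

# Prices quoted above, in US dollars per 1,000 tokens.
PROMPT_PRICE_PER_1K = 0.03
COMPLETION_PRICE_PER_1K = 0.06

# Rule of thumb quoted above: one token is about four characters.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length (hypothetical helper, not a tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Approximate the dollar cost of one request at the quoted prices (hypothetical helper)."""
    prompt_cost = prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
    completion_cost = completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
    return prompt_cost + completion_cost


if __name__ == "__main__":
    # Example: a 1,500-token prompt that yields a 500-token answer.
    print(f"${estimate_cost(1500, 500):.3f}")  # prints "$0.075"
```

By this estimate, a request of that size costs seven and a half cents, and completion tokens cost twice as much as prompt tokens, so long answers add up faster than long questions.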
Also, OpenAI open-sourced Evals, a program for evaluating and benchmarking machine-learning models including its own.
Despite ongoing concern about AI risks, there's a rush to bring AI models to market. On the same day GPT-4 arrived, Anthropic, a startup formed by former OpenAI employees, introduced its own chat-based helper called Claude for handling text summarization and generation, search, Q&A, coding, and more. That's also available via a limited preview.
And Google, worried about falling behind in the marketing of AI models, teased a rollout of an API called PaLM for interacting with various large language models, along with a prototyping environment called MakerSuite.
A few weeks earlier, Facebook launched its LLaMA large language model, which Stanford researchers have since turned into the Alpaca model. The Register will be covering that in more detail later.
"There’s still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model," OpenAI concluded. ®