David Holz, founder of AI art generator Midjourney, on the future of imaging
Optimizing for beauty while trying to suppress sensationalism
Interview In 2008, David Holz co-founded a hardware peripheral firm called Leap Motion. He ran it until last year, when he left to create Midjourney.
Midjourney in its present form is a social network for creating AI-generated art from a text prompt – type a word or phrase at the input prompt and you'll receive an interesting or perhaps wonderful image on screen after about a minute of computation. It's similar in some respects to OpenAI's DALL-E 2.
Midjourney image of the sky and clouds, using the text prompt "All this useless beauty." Source: generated by Midjourney
Both are the result of large AI models trained on vast numbers of images. But Midjourney has its own distinctive style, as can be seen from this Twitter thread. Both have entered public beta testing in recent days (though DALL-E 2 access is being expanded slowly).
The ability to create high-quality images from AI models using text input became a popular activity last year following the release of OpenAI's CLIP (Contrastive Language–Image Pre-training), which was designed to evaluate how well generated images align with text descriptions. After its release, artist Ryan Murdock (@advadnoun on Twitter) found the process could be reversed – by providing text input, you could get image output with the help of other AI models.
After that, the generative art community embarked on a period of feverish exploration, publishing Python code to create images using a variety of models and techniques.
"Sometime last year, we saw that there were certain areas of AI that were progressing in really interesting ways," Holz explained in an interview with The Register. "One of them was AI's ability to understand language."
Holz pointed to developments like transformers, a deep learning model that informs CLIP, and diffusion models, an alternative to GANs. "The one that really struck my eye personally was the CLIP-guided diffusion," he said, referring to the technique developed by Katherine Crowson (known on Twitter as @RiversHaveWings).
Not the stereotyped Florida man
Holz grew up in Florida and had a design business in high school where he studied math and physics. He was working on an applied mathematics PhD and took a leave of absence in 2008 to start Leap Motion. The following year, he spent a year as a student researcher at the Max Planck Institute, followed by two years at NASA Langley Research Center as a graduate student researcher working on LiDAR, Mars missions, and atmospheric science.
"I was like, why am I working on all this stuff?" he explained. "I just wanna work on one cool thing that I care about."
So he focused on Leap Motion, which developed a hardware device to track hand motion and use it for device input. He ran the company for twelve years, and when he left it employed about 100 people.
Midjourney, he said, is pretty small right now. "We're like about 10 people," he explained. "We're self-funded. We have no investors. We're not really financially motivated. We're just sort of here to work on things we're passionate about and have fun. And we were working on a lot of different projects."
Holz said the technological aspect of AI and the extent to which it will improve is fairly easy to foresee. "But the human ramifications of that are so hard to imagine," he said. "There's something here that's at the intersection of humanity and technology. In order to really figure out what this is and what it should be, we really need to do a lot of experiments."
The road ahead
The unsettled nature of AI image technology is evident in the difference between tools like Midjourney and a downloadable open source graphics application like Blender, or a locally installed commercial application like Adobe Photoshop (before it became a cloud service).
Midjourney exists in a social context. Its front-end is the chat service Discord. New users log in to Discord's Midjourney server and can then submit text prompts to generate images alongside numerous other users in any of the various newbie channels.
The resulting images for all the users in that channel surface in about a minute, which helps reinforce the notion of community. Those who decide to upgrade to a $10/month or $30/month subscription can submit text to the Midjourney bot in the Discord app as a private Direct Message and receive images in response without the screen-scrolling waterfall of interaction from other users in a public channel. Generated images, however, remain publicly viewable by default.
As a social app, Midjourney is subject to rules about allowable content – something users of Blender or other locally installed apps do not have to worry about. Midjourney's Terms of Service state: "No adult content or gore. Please avoid making visually shocking or disturbing content. We will block some text inputs automatically."
DALL-E 2 is subject to similar though more extensive limitations, as described in its Content Policy.
"I think if we lived in a world that didn't have social media, then we wouldn't need to have any restrictions," said Holz. "...When Photoshop was invented, there was actually press about it, where it's like, 'oh, you could fake anything and it's a little scary.' [But now], it's a lot more lucrative to be sensationalist than it was before."
"Nowadays, anybody can be sensationalist, and basically profit off of that, you know," said Holz. "And so what it does is it creates a market for drama and sensationalism. That's why I think we have to be a little more careful, because at some point, what people will do is they'll say, 'okay, I can make pictures of this, what is the most dramatic and offensive and horrifying stuff that I can make?'"
No easy answers
Holz allows that there are things social platforms can do to mitigate these problems but says there are no simple answers. "Unfortunately, there isn't a clear way to address it, except as a society, to reward sensationalism less," he said. "However, my impression is that no one really is trying to change social platforms to reduce sensationalism, because that makes them money right now."
What's more, he said, because Midjourney aims to be a social space for anyone over the age of 13, it's necessary to have rules against extreme or graphic content.
"We don't really want to have segmented spaces for people who like making corpses or like nude photos," Holz explained. "We just don't want to have to deal with that. We don't think that we have a moral obligation to do that at this stage. We want one beautiful social space for people to make stuff together and not be offended, basically, and to feel safe."
Toward that end, the company has about 40 moderators keeping an eye on the images that users create.
The social aspect of Midjourney recently began enhancing image quality. Holz said company engineers recently introduced version three of its software, which for the first time incorporated a feedback loop based on user activity and response.
"If you look at the v3 stuff, there's this huge improvement," he said. "It's mind-bogglingly better and we didn't actually put any more art into it. We just took the data about what images the users liked, and how they were using it. And that actually made it better."
Asked about the Midjourney tech stack, Holz demurred. "At some point, we're probably going to do a press release specifically around which vendors we're using," he said. "What I can say is that we have these big AI models with billions of parameters. They're trained over billions of images."
Holz says users are making millions and millions of images every day, and doing so using green energy compute providers – which doesn't really narrow down the field of major cloud computing providers as they all claim to be at least carbon neutral.
"Every image is taking petaops," he said, a term that means 10^15 operations per second. "So 1000s of trillions of operations. I don't know exactly whether it's five or 10 or 50. But it's 1000s of trillions of operations to make an image. It's probably the most expensive … if you call Midjourney, a service – like you'd call it a service or a product – without a doubt, there has never been a service before where a regular person is using this much compute."
Keeping us in food and clothes
Yet Midjourney isn't on the path toward upselling customers brought in by a free service to paid tiers and then attracting well-paying enterprise clients before going public or getting acquired.
"We're not like a startup that raises a lot of money and then isn't sure what their business or product is and loses money for a long time," said Holz. "We're like a self-funded research lab. We can lose some amount of money. We don't have like $100 million of somebody else's money to lose. To be honest, we're already profitable, and we're fine."
"It's a pretty simple business model, which is, do people enjoy using it? Then if they do, they have to pay the cost of using it because the raw cost is actually quite expensive. And then we add a percentage on top of that, which is hopefully enough to feed and house us. And so that's what we're doing."
As for the future, scaling could be a problem. Holz said Midjourney presently has hundreds of thousands of people using the service, which requires something like 10,000 servers.
"If there were 10 million people trying to use technology like this," he said, "there actually aren't enough computers. There aren't a million free servers to do AI in the world. I think the world will run out of computers before the technology actually gets to everybody who wants to use it."
What are people using it for? Well, if you are signed in to a Midjourney account you can see what people are creating via the Community Feed page. It's a constant flow of interesting, often startlingly good, images.
"The majority of people are just having fun," said Holz. "I think that's the biggest thing because it's not actually about art, it's about imagination."
But for about 30 percent of users, it's professional. Holz said a lot of graphic artists use Midjourney as part of their concept development workflow. They generate a few variations on an idea and present it to clients to see which direction they should pursue.
"The professionals are using it to supercharge their creative or communication process," Holz explained. "And then a lot of people were just playing with it."
Maybe 20 percent of people use Midjourney for what Holz describes as art therapy. For example, creating dog images after their dog has died. "They're using it as an emotional and intellectual reflective tool," he said. "And that's really cool."
Holz dislikes the idea of using Midjourney to create fake photographs. "Using it editorially to create fake photos is extremely dangerous," he said. "No one should do that." But he's more open to Midjourney as a source of commercial illustration, noting that The Economist ran a Midjourney graphic on its cover in June.
"We only recently allowed people to use it commercially," said Holz. "For a long time, it was non-commercial only. And so one of the things we're doing is we're just watching it, what people are doing, and we might decide that we're not comfortable with some of that and then we're going to put in a rule saying you can no longer use it just for those things."
Holz said he sees AI tools like Midjourney making artists better at what they do rather than making everyone a professional artist. "An artist using these tools is always better than a regular person using these tools. At some point, might there be pressure to use these tools because you can make things that are so great? I think yes. But right now, I don't think it's quite there yet. But it will get shockingly better over the next two years."
Midjourney and DALL-E 2 have drawn more attention to longstanding concerns about whether large AI models, created from work under copyright or specific licenses, can be reconciled with copyright law and with content creators' own sense of how their work should be treated.
America, land of the lawsuit
In terms of Midjourney output, current US jurisprudence denies the possibility of granting copyright to AI-generated images. In February, the US Copyright Office Review Board rejected [PDF] a second request to grant copyright to a computer-generated landscape titled "A Recent Entrance to Paradise" because it was created without human authorship.
In a phone interview, Tyler Ochoa, a professor in the Law department at Santa Clara University, told The Register, "The US Copyright Office has said it's [acceptable] if an artist uses AI to assist them in creating a work as long as there's some human creativity involved. If it's simply you typing text, and the AI generates a work, that pretty clearly is not subject to copyright protection under current law."
Midjourney's Terms of Service state "you own all Assets you create with the Services," but the company requires a copyright license from users to reproduce content created with the service – a necessary precaution to host users' images, even if it looks doubtful that those making Midjourney images simply through text input have any copyrights to convey or enforce.
That may not always be the case. Ochoa said that he believes Stephen Thaler, who created "A Recent Entrance to Paradise," may want to challenge the Copyright Office's rejection of AI-based authorship in court, though that hasn't happened yet.
There are also potential copyright concerns arising from AI models trained on copyrighted material. "The question is whether or not it would be a fair use to use those images for training an AI," said Ochoa. "And I think the case for fair use in that context is fairly strong."
Additionally, there's potential liability for those who generate images that are substantially similar to existing copyrighted material. "If your training set isn't large enough, what the AI spits out might look an awful lot like what it ingested," Ochoa explained, noting that the issue then is whether that's a copyright violation. "Indirectly, I think it very likely could be."
As for potential legal risk to clients using Midjourney-generated assets, Ochoa said he thinks it's fairly low. If the training of an AI model infringed copyright, that was done before the client was involved, he explained. "So unless the client sponsored the creation of the AI in some way, I don't think [the client] would be liable for any infringement of the training set," he said. "And that's the strongest claim here. So I think clients are on pretty solid ground in using these images, assuming it was well done."
Holz acknowledges that the legal situation lacks clarity.
"At the moment, the law doesn't really have anything about this kind of thing," he said. "To my knowledge, every single large AI model is basically trained on stuff that's on the internet. And that's okay, right now. There are no laws specifically about that. Maybe in the future, there will be. But it's sort of a novel area, like the GPL was sort of a novel legal thing around programming code. And it took like 20 or 30 years for it to really become something that the legal system is starting to figure out."
Holz said he believes it's more important at the moment to understand how concerned parties feel about this technology. "We have a lot of artists who use our stuff, and we're constantly checking with them like, 'do you feel okay about this?'" he said.
Holz said if there's enough dissatisfaction with the status quo, it may be worth thinking about some sort of payment structure in the future for artists whose work goes into training models. But he observed that assessing the extent of contributions is difficult presently. "The challenge for anything like that right now is that it's not actually clear what is making the AI models work well," he said. "If I put a picture of a dog in there, how much does it actually help [the AI model] make dog pictures? It's not actually clear what parts of the data are actually giving [the model] what abilities."
Asked what gives Midjourney its distinctive aesthetic, Holz said he couldn't really compare what Midjourney is doing to DALL-E 2, but that in general AI researchers tend to get what they optimize for. If they put in the word "dog" then they probably want a picture of a dog.
"For us, we were when we were optimizing it, we wanted it to kind of look beautiful, and beautiful doesn't necessarily mean realistic. … If anything, actually we do bias it a little bit away from photos. … I know this technology can be used as a deep fake super machine. And I don't think the world needs more fake photos. I don't really want to be a source of fake photos in the world."
"I actually kind of feel uncomfortable if our stuff makes something that looks like a photo. And that's not to say that we'll never let people make things that are more realistic. There are legitimate use cases for trying to make things that look more realistic. However, I feel strongly that, by default, when somebody uses our system, it shouldn't make a fake photo."
"But I do think the world needs more beauty. Basically, if I create something that allows people to make beautiful things, and there are more beautiful things in the world, that's what I want by default." ®