Interview Ocarina, the deduplication startup, is making waves with its partnerships with storage vendors, due to its unique lossless image compression technology. Yet the Ocarina founders were not wedded to image deduplication when they started up the company. How did it come about?
The way Murli Thirumale, Ocarina's CEO, tells it, the three founders had three ideas for a startup which they tested with potential customers and with consultants in a proof of concept exercise. They asked which of the three would have long-lasting and true value in the eyes of customers. The one dealing with ever-growing data storage received overwhelming customer support.
Much of this growth was due to rich media: images and videos. These file types, such as JPEGs, TIFFs and MPEGs, were previously thought to be uncompressible if there was to be no loss of image or video resolution. Ocarina's chief technology officer and co-founder, Goutham Rao, came up with ways of doing this, of compressing the uncompressible. Thirumale says: "He didn't know you're not supposed to be able to do that."
Image and video files list picture elements and their characteristics. An Ocarina technology brief states: "Visual information is typically complex and includes large numbers of values to represent pixels, chrominance, luminance and other information for both recreating an image for the human eye to see, and storing information about the image for computer programs to be able to manipulate."
You can compress such files by getting rid of pixels, but this means losing image quality. What Rao invented was a way of recoding image and video files to store the same information in fewer bytes, with no lost pixels and with "bit-for-bit losslessness". This is Ocarina's secret technology, and the company is not revealing much about it.
The technology brief says Ocarina's technology: "extracts the full rich image data from an existing image file in to a Discrete Cosine Matrix (DCT space), correlates related image information like chrominance and luminance boundaries around like areas in an image, and then applies Ocarina’s patented image optimization compressor to the grouped areas. The ECO process is able to compress already-compressed JPEGs up to 40 percent, and sets of scaled images - common on web sites - up to 80 percent. Results on medical and life sciences image formats range from 40 percent to 70 percent, and results on grouped studies are better than on individual images."
Intriguingly, the DCT concept is often used in signal and image processing for lossy compression. Intriguingly again, an academic work on DCT has been written by a Dr K. R. Rao and P. Yip, entitled "Discrete Cosine Transform: Algorithms, Advantages, Applications" (Academic Press, Boston, 1990). Dr Rao is credited with being the co-inventor of the Discrete Cosine Transform but he is not related to Ocarina's Goutham Rao.
It looks as if Goutham Rao has found a way to take a method of encoding data that is normally used in lossy compression algorithms and turn it to the opposite end: lossless compression.
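Why a transform associated with lossy compression can serve a lossless scheme is easier to see with a small example. The sketch below is not Ocarina's method, which is undisclosed. It is a plain one-dimensional DCT in Python, with hypothetical function names and made-up sample values, showing that the transform itself can be inverted exactly and that the loss in a lossy codec comes from the separate step of rounding the coefficients.

```python
import math

def dct_ii(x):
    """Orthonormal one-dimensional DCT-II of a list of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(c):
    """Inverse of dct_ii (the orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

# Illustrative pixel values, not taken from any real image.
samples = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct_ii(samples)

# Transform and inverse alone: the original values come back.
restored = [round(v) for v in idct_ii(coeffs)]

# The lossy step is coarse rounding of the coefficients (step size 10 here).
quantised = [round(c / 10) * 10 for c in coeffs]
lossy = [round(v) for v in idct_ii(quantised)]

print(restored == samples)  # True
print(lossy == samples)     # False
```

Once coefficients have been rounded, as they are inside an existing JPEG, they are just integers, and integers can be re-encoded more compactly without changing them. That is presumably the room a lossless optimiser has to work in, although Ocarina has not said so.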
Producing an Ocarina-encoded image or video file is much, much harder than displaying it. Production is carried out by using an Ocarina hardware appliance, an Optimizer, which comes in two models, a 2400 and a 3400. The 2400 has two quad-core Xeon 5400 processors, 16GB of RAM and two 500GB SATA disk drives.
The 3400 has the same Xeon processors but paired with 32GB of RAM and four 15,000rpm SAS drives. There is heavyweight processing going on here, with a maximum bandwidth of 2TB per 24-hour day quoted by Ocarina. It can be down to 1TB a day if pure JPEGs are involved. Thirumale said: "We are CPU-bound in our optimization," and "We are in the process of benchmarking both the Nehalems and also processors with more cores."
He says that Ocarina does things to ensure it does not overwhelm the storage filer. For example, optimisation can be scheduled for off-peak times, and it can be throttled back if the filer is really busy.
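A policy of that kind can be pictured as a simple gate in front of the optimisation job. The following Python sketch is purely illustrative: the function name, the 22:00 to 06:00 window and the 70 percent load threshold are all assumptions, not Ocarina settings.

```python
from datetime import time

def may_optimise(now, filer_load, start=time(22, 0), end=time(6, 0), max_load=0.7):
    """Return True if optimisation may run at clock time `now`.

    `filer_load` is a hypothetical utilisation figure between 0.0 and 1.0.
    The default window and threshold are assumed values for illustration.
    """
    if start <= end:
        in_window = start <= now < end
    else:
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        in_window = now >= start or now < end
    # Run only off-peak, and hold back if the filer is busy anyway.
    return in_window and filer_load < max_load

print(may_optimise(time(23, 30), 0.2))  # True: off-peak and filer idle
print(may_optimise(time(12, 0), 0.2))   # False: outside the window
print(may_optimise(time(2, 0), 0.9))    # False: filer too busy
```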
During optimisation an existing image or video file is read in by an Optimizer and recoded, using the DCT mathematical method, and processed with other techniques to produce a smaller output file.
This can be read by Ocarina's ECOreader, a piece of software which sits in-line between the storage and the application needing to access the file. It can be deployed on web servers, application servers, proxy appliances, or in some cases, directly on file servers.
Each Ocarina-encoded file is self-contained and holds all the data and metadata required by any ECOreader to access it and send it on to the requesting application in real time.
Other Ocarina compression techniques
Thirumale says Ocarina's Optimizer looks inside files that can contain various objects, such as images in Word documents, PowerPoint decks and PDFs. Once it finds these it can compress them. Also, once the Optimizer has a DCT version of the data, it can compare this with already processed images and use any correlations to improve the optimisation. Ocarina doesn't specify how it does this, but it sounds like a form of deduplication.
The Optimizer also looks for sets of images sharing common information, such as a sequence of CT scans. The common data is dealt with once, single-instanced in effect, and stored only once. Ocarina's brief describes this as "an example of deduplication applied at the visual information level, rather than at the block storage level."
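Single-instancing in general can be sketched in a few lines. The Python below is a generic illustration, not Ocarina's implementation: the class name is hypothetical, and it assumes each image has somehow already been split into "regions" represented as byte strings, which is the hard, content-aware part that Ocarina does not describe.

```python
import hashlib

class SingleInstanceStore:
    """Stores each distinct region once; images hold references to regions."""

    def __init__(self):
        self.regions = {}  # digest -> region bytes, one copy per unique region
        self.images = {}   # image name -> ordered list of digests

    def add(self, name, regions):
        refs = []
        for region in regions:
            digest = hashlib.sha256(region).hexdigest()
            # Keep the bytes only the first time this region is seen.
            self.regions.setdefault(digest, region)
            refs.append(digest)
        self.images[name] = refs

    def read(self, name):
        """Rebuild an image's regions from the shared store."""
        return [self.regions[d] for d in self.images[name]]

    def stored_bytes(self):
        return sum(len(r) for r in self.regions.values())

# Two hypothetical scans sharing the same 1,000-byte background region.
background = b"\x00" * 1000
scan_a = [background, b"slice-A"]
scan_b = [background, b"slice-B"]

store = SingleInstanceStore()
store.add("scan_a", scan_a)
store.add("scan_b", scan_b)

print(store.stored_bytes())           # 1014, rather than 2014 stored separately
print(store.read("scan_b") == scan_b) # True
```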
The idea here is that sub-file-level deduplication cannot process such files or objects because it doesn't know they exist; it is not application data-aware and only sees raw blocks.
The Optimizer can deal with sets of scaled images by only storing data from the largest one and using it to recreate the smaller ones on the fly through the ECOreader. Small thumbnail images may be stored inefficiently as separate files (Ocarina's example is 2KB of data stored in an 8KB block), in which case the thumbnails can be grouped together to use storage more efficiently. In other words, the small thumbnails can be grouped to fill up the minimum block size in a file server.
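The arithmetic behind the thumbnail example is straightforward. This Python sketch uses Ocarina's 2KB and 8KB figures; the function names are hypothetical, and it ignores whatever metadata a real container for grouped thumbnails would need.

```python
import math

BLOCK = 8 * 1024  # minimum allocation unit on the file server, in bytes

def blocks_separate(sizes, block=BLOCK):
    """Blocks used when every file occupies its own whole blocks."""
    return sum(math.ceil(size / block) for size in sizes)

def blocks_packed(sizes, block=BLOCK):
    """Blocks used when the files are grouped into one container."""
    return math.ceil(sum(sizes) / block)

thumbnails = [2 * 1024] * 4  # four 2KB thumbnails

print(blocks_separate(thumbnails))  # 4 blocks, 32KB allocated for 8KB of data
print(blocks_packed(thumbnails))    # 1 block, 8KB allocated
```

On these figures, grouping cuts the space allocated for the four thumbnails by three quarters before any compression is applied.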
Thirumale says 61 percent of Ocarina deployments have been associated with an increase in disk purchases by the customer. This is because customers are using Ocarina-compressed files on disk to do work that was not possible in real time before. They'll have active data for use in creative work on videos or images, and older data that had been consigned to tape. Restoring this from tape for real-time editing work is not practical.
Talking of a movie studio customer, Thirumale said: "We were able to give them real-time access to this archive data." Having it stored on disk in an Ocarina-encoded format means they can use it in real time and thus the creatives are more productive.
He says: "It's about tape replacement. Tape's rightful place is in a deep archive," and talks of Ocarina enabling "cheap and deep" disk storage and of it being no threat to disk storage sales.
According to him, Ocarina's products are resold by BlueArc, and have been certified by Hitachi Data Systems, HP, and Isilon. Ocarina is working with DataDirect and pursuing certification with EMC, where the focus is on Celerra, and Ocarina is being integrated using EMC's file mover API. Thirumale said: "We could work very well with Atmos."
Cloud storage provider Nirvanix is also working with Ocarina. However, there is no certification with IBM or NetApp.
There have been two Ocarina funding rounds, the last for $20m in February this year, with a total of $31m having been raised. Thirumale is especially pleased about the second round, which took place in the middle of the recession. The company must have been able to demonstrate good potential, but no customer numbers or revenue numbers have been made public.
Thirumale says: "There are several billion files under Ocarina optimisation (and) we're demonstrating great customer traction (with) customers in production and making repeat purchases. ... We're an add-on to your storage. We're not a rip-and-replace company. The applications won't recognise that we're there." ®