After we wrote about Microsoft's plans to put a database into each copy of Windows as the native file store back in January, we were delighted to hear from two system architects for whom this news was really old hat.
You see, it's been done before. Benoit Schillings was one of Be Inc's first employees, and authored the original user space database server. This was later superseded by a more conventional approach: BFS, a fast, 64bit journaled file system written by Dominic Giampaolo, which had many database-like properties.
Between them they have more practical experience in making such an ambitious scheme work on a PC than anyone else. So last month we reunited Benoit and Dominic at Menlo Park's Applewood Pizza for reminiscences about Be, and some low-down on file systems and databases.
(Applewood was an unconsciously appropriate choice - Benoit was a Mac developer before joining Be Inc, and is now a director of engineering at OpenWave; while Dominic, we are delighted to learn, has subsequently joined Apple as a file system engineer. He started last week).
Now normally we give you the he-said, she-said quotes padded, but this conversation flowed so sweetly anything other than a transcript couldn't do it justice. We set the tape rolling, and tried to stay out of the way.
And although there's plenty of nitty gritty implementation stuff discussed here, it never strays far from the question "can't computers work better?" - the usability goal that underpins the idea of putting a database at the core of an OS. That's the goal Be had, to make a PC work much more intuitively as an information server, and presumably that's a goal Microsoft has for Longhorn, too. Let's roll -
The first cut
Benoit: I arrived at Be in March 1991… and there was no file system... the first file system was trying to be just a flat directory with a good database to index the files in that single directory.
So after that I started writing the database - a user-space database server called Zookeeper. And the file system in the kernel was sending messages to Zookeeper.
There were two problems with that. The first one is that if Zookeeper was too slow, the ports would fill and so the file system would block on the database. It was actually a very difficult issue. How do you handle the file if the port is full? Do you just wait? Well if you just wait there's a big danger, which is that if you have a loop you have a very subtle deadlock. And you have lots of those. So it's a very hard thing to deal with in the OS.
The other problem is that there was no transactional integrity between the state of the file system and the state of the database.
So if the machine was going down you wouldn't know what state the database was in, because there was no journaling in the file system, and also it was so out of sync.
And I made one big mistake there, mostly by ignorance, which is that by not understanding the notion of journaling, I tried to fix the file system, like Microsoft does, if there was a bad restart. And it turns out to be an impossible problem. Because the number of failures you can have on a file system when you don't have transactional integrity is pretty bad. So ultimately I got the system to be stable, I don't think I was losing files. Even that was painful.
Sync or swim
Reg: But that's going to be a problem for anyone who tries to put a database in there...?
Dominic: If you do it in user space, yes. If you're trying to do things with the file system and split it apart across the kernel and user boundaries, then integrity is an issue. And lately I've been terming it that it's hard to make it a 100 per cent solution. Things get out of sync, and if you allow access to files in the kernel, then things can change under foot, the database doesn't get notified, and it gets really messy.
Benoit: It's a bad solution. And the Be database was doing updates asynchronously, because it was faster - that was the way I solved the problem of how do you log the file system: you just update your index asynchronously. Which makes the synchronization of the file system and the database even worse. So if you had a cold start, you had to check the integrity of all that. Which is OK if you have a thousand files. But if you have 500,000 files, it doesn't work.
And I didn't use B-Trees. I thought B-Trees were evil, and still do - reading them is slow and super-expensive. My indexing and search were really very fast, but it was not a stable solution, long-term. And do you remember who invented B-Trees? [grins]
Dominic: Two-level B-Trees but really wide with a lot of nodes. A friend of mine used the same approach for a project at grad school.
Reg: what's that?
Dominic: B-Trees use nodes for data and build up a pretty big hierarchy. And the reason for that is that the transfer rate at the time was pretty slow, so they wanted to fetch lots of small chunks: they optimized for doing a few small reads rather than for big transfers.
Benoit: Reading big blocks is actually more efficient than using B-Trees on modern hardware, and that's the kind of index I did. We were big, but not a B-Tree.
Dominic: Benoit's right. This is getting into nitty gritty implementation details but that's a huge difference.
Benoit: Rotational latency kills you.
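Benoit's point can be put in numbers. A toy cost model - the disk figures below are assumptions for illustration (roughly circa-2003 hardware), not measurements - shows why, once seeks dominate and sequential transfer is cheap, one fat read of a wide two-level index beats several dependent reads down a deep B-Tree:

```cpp
#include <cassert>

// Toy I/O cost model. The figures are illustrative assumptions
// (circa-2003 disk), not measurements from any real hardware.
const double seek_ms      = 8.0;    // average seek + rotational latency
const double transfer_mbs = 50.0;   // sustained sequential transfer, MB/s

// Milliseconds to read `bytes` in total after `seeks` head movements.
double io_cost_ms(int seeks, double bytes) {
    return seeks * seek_ms + (bytes / (transfer_mbs * 1e6)) * 1000.0;
}

// Deep B-Tree lookup: four dependent 4KB node reads, each a fresh seek.
double btree_lookup_ms() { return io_cost_ms(4, 4 * 4096.0); }   // ~32ms

// Wide two-level index: one seek, then one fat 256KB sequential read.
double wide_index_ms()   { return io_cost_ms(1, 256 * 1024.0); } // ~13ms
```

With these assumed numbers the deep tree pays four rotational latencies against one. On a 1991-era disk with a far slower transfer rate the trade goes the other way - which is exactly Dominic's point above about why B-Trees were designed around many small reads.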
Reg: Back in 1991 the fashion was very much for OpenDoc and OLE and the talk was that "oh, no one will have applications in the future", we'll all manipulate parts - was that an influence on you?
Benoit: Either you make the data shared, or you make the application components shared. And sharing the data was easier. (Dominic agrees).
Reg: you need locking in there...
Benoit: OpenDoc was something that was thought out with the idea of a word processor. Word processors are just one type of document: it's easier to get plug-ins, to modularize the tool, but, you know, it only works for a very restricted set of applications.
Reg: So you started off with this idea that there'd be separate views on to the data - with the 'Model View Controller'. Can you describe that?
Benoit: OK, if you can get some sharing of data between applications so that an application can manipulate the data without knowing anything about the consumer of that information, then you have a combinatory phenomenon.
You can have a table, say, and you can have two types of application that can go and modify that data. And you can have a different viewer that subscribes to that data, and represents it in a dynamic way. That's a nice way to program many things.
Dominic: If I have an MP3 file and I want to mail it to somebody, I shouldn't have to context switch and start up iTunes, I should be able to find MP3 files based on whatever criteria I want from any application. Then apps can become more transparent, work more seamlessly together, where data isn't in these private file formats.
Database vs Database-like
Dominic: Well let's talk more about the database aspect because there were some subtle differences between what we did… you actually had explicit tables and columns in the tables, remember?
And I was a little less structured in that there was no schema, or tables per se, there was just a convention. So people had attributes, and if you wanted to add more you could add more. That's where the power of the idea is: it's flexible - not fixed format.
Everything is malleable.
Benoit: I think the other interesting aspect is the way queries were made. Queries were done in reverse polish notation. You would say, New query, Testing, Add field 'Person', Add name 'Dominic', Add operand 'Equal'. OK? Run. And you would create the tree.
query = new BQuery();
Now the real beauty of that is that I did not have to do the parsing!
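The shipping Be class was spelled BQuery, but the idea is language-independent. As a self-contained sketch of the reverse-polish scheme - with invented names, not the real Be Storage Kit API - here is a toy query object where each push is just a stack operation, so there is literally nothing to parse:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Toy reverse-polish query builder in the spirit of Zookeeper and the
// later BQuery class. All names here are illustrative, not the Be API.
using Record = std::map<std::string, std::string>;
using Pred   = std::function<bool(const Record &)>;

class Query {
    // Pushed tokens sit on stacks; an operator pops its operands and
    // pushes a combined predicate. The caller hands us the expression
    // tree in postfix order, so no parsing is ever needed.
    std::stack<std::string> fields_;
    std::stack<std::string> values_;
    std::stack<Pred> preds_;
public:
    void PushField(const std::string &f) { fields_.push(f); }
    void PushValue(const std::string &v) { values_.push(v); }
    void PushOpEqual() {
        std::string f = fields_.top(); fields_.pop();
        std::string v = values_.top(); values_.pop();
        preds_.push([f, v](const Record &r) {
            auto it = r.find(f);
            return it != r.end() && it->second == v;
        });
    }
    void PushOpAnd() {
        Pred b = preds_.top(); preds_.pop();
        Pred a = preds_.top(); preds_.pop();
        preds_.push([a, b](const Record &r) { return a(r) && b(r); });
    }
    // "Run": evaluate the finished predicate against every record.
    std::vector<Record> Run(const std::vector<Record> &all) {
        Pred p = preds_.top();
        std::vector<Record> hits;
        for (const auto &r : all)
            if (p(r)) hits.push_back(r);
        return hits;
    }
};
```

Building the query from Benoit's example is then: push field 'Person', push value 'Dominic', push the Equal operator, run - the tree assembles itself on the stack.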
Reg: What kind of queries did you show off in the demos?
Benoit: "Show me a natural language parser!"
File and forget
Dominic: The demos stayed the same even from before I did BFS, but it was: show me all files that have an 'e' in their name, and a size greater than some value; then you would change one of the file names and it would drop out of the query.
Benoit: Also email - that was really useful. The other one is find all email sent by different people.
Dominic: I had a little window for my fiancée, and whenever she'd send me email it would pop up. Doug Fulton, actually, he would put everything in a big flat directory and use queries to find the stuff.
Reg: Oh I do that with my MP3s on BeOS. There are subdirectories there but I can't even remember what they are
Dominic: It's funny, it's a very tempting way to use a machine.
Reg: Because it doesn't matter where you put stuff. Nothing else can do that.
Dominic: That's what I mean. Some people are very anal about organizing things in rigid hierarchies and others are 'I know what I want to find'.
Benoit: Google - I don't keep bookmarks, now. I keep very few bookmarks. I know that with Google I'm going to find it because I know I've got the name. And there's so much data we receive there's no way we can organize the data, it takes too much time.
Benoit (left) illustrates the stationary gravity accelerator, known only to Applewood regulars
Dominic: It's distracting and frustrating, but you know it's not hard to get stuff done right and people have done it for years. It's not rocket science at all.
Reg: You know when Inferno was written they said we're doing Unix properly, using the same semantics for files and devices. Is that what you mean by a clean implementation too? The API being cleaned up?
Dominic: No, when I was talking about the device file system, traditionally you have a major and minor number, which is OK when you have fifteen or twenty devices on a machine. But when you have USB you have 65,000 maximum nodes so you want something that's more dynamic.
Benoit: And you know the reason people do that ultimately is that they don't care about the user experience.
Dominic: I'm a Unix guy and my main mode of using the machine is the shell … but forcibly or not, and I don't know how it happened, I've come to appreciate all the UI stuff.
Reg: One of the things that will bother Microsoft is latency, will they be able to do it efficiently? Is that an issue?
Dominic: When you say latency, be careful - if you're talking about thread scheduling latency or interrupt response time, that's a completely different thing. The time from when I drag a file to the trash until it actually gets there is more a function of what their architecture is internally - the updates are either batched or sent every few seconds, something like that. It's a perceived latency. That's different from the OS scheduling.
The problem now is, and I'll be frank here, that it can take the Unix community years to solve such problems. The idea of having device nodes in the filesystem is an antiquated idea.
The problem shouldn't be that it's hard to do these things - it isn't - I mean I did it in ten months and I worked basically by myself, and I'm no genius. It's natural to me that you should have an inode monitor sending updates when files change - the UI needs it, we've got to do this.
When you delete a file in Windows 2000 it takes a good two or three seconds before the fuckin' Trash icon's updated.
Benoit: Or I'm writing a 1.5GB file and halfway through it still shows "zero bytes". It's totally stupid.
I think at Be from the beginning we were aware that a lot of what users perceive as performance is the details, and not a function of the real performance of the system. It's how you present things. The idea of having many threads: that's the reason for that. With Be, even if the machine was slow, you still had a thread that could respond.
A good deal of the perceived performance of BeOS really came from threading. It was really the thing that made the difference there. It's not magic: you just give the impression.
Dominic: Can Microsoft make a database file system fast? It depends on what they're trying to do, I think, is going to be the answer. It's very tricky.
It's got to be appropriate for the things that you do. Benoit is right about B-Trees, really, in that they're a stupid data structure for most cases, because 99 per cent of the time when you do a Find, you do a case insensitive substring match and a B-Tree is useless for that type of query.
Benoit: It's not going to help you there.
Dominic: So you wind up having to seek to find all the trees and it's really slow. And that's a really poor way to organize things. So you start moving away from a traditional database and looking for more efficient methods. A database is really useful when you know what you're looking for and you can't really see it.
When you think about how people work it's not like that.
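Dominic's point about substring matches can be made concrete with a short sketch (plain C++, nothing Be-specific). The comparison a Find box actually runs has no relationship to the sort order a B-Tree maintains, so the tree can't prune anything and every key must be visited:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Case-insensitive substring test -- the query a "Find" box really runs.
bool contains_icase(const std::string &hay, const std::string &needle) {
    auto it = std::search(hay.begin(), hay.end(),
                          needle.begin(), needle.end(),
                          [](char a, char b) {
                              return std::tolower((unsigned char)a) ==
                                     std::tolower((unsigned char)b);
                          });
    return it != hay.end();
}

// Even with the keys in sorted (B-Tree) order, a substring match cannot
// prune: the only option is to visit every key -- a full scan.
std::vector<std::string> find_substring(const std::vector<std::string> &keys,
                                        const std::string &needle) {
    std::vector<std::string> hits;
    for (const auto &k : keys)        // linear scan, no tree descent
        if (contains_icase(k, needle)) hits.push_back(k);
    return hits;
}
```

That full scan is the "seek to find all the trees" cost Dominic describes: the sorted structure buys you nothing for this class of query.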
Small is beautiful
Benoit: Look at Hypercard. Hypercard did it more right than many people accepted. Do you know Hypercard? There's a lot of smart work in there. They really blew it by trying to make it into something it was not supposed to be, which was a programming environment. And it was NOT a programming environment.
Reg: Sure, but wasn't it just used for stacking and linking simple things…
Benoit: Yes but there's an abstraction model. Hypercard was a product that came ten years before it should have. They did the right thing there.
Reg: Well one of the things Microsoft is selling to its partners as a reason for why you need a database in every Windows machine, is that you'll be able to go to some branch office in Birmingham and drill down, drill down through the database until you get to a named range in a spreadsheet, say, using a network name. So it's one big namespace.
Dominic: Yes, you can do it. It's not going to be fast. There's just going to be so much stuff to get through, and services you have to talk to and co-ordinate between, and you can make something like that work. It's a question of how painful is it. If something's not fast, you shouldn't do it.
Benoit: I see that in the industry. For a thousand records, they use an SQL database. So basically they pay Oracle to put the data in the RAM cache for them.
Dominic: That's pretty funny.
Benoit: … and fields that are never changing.
Dominic: A lot of things of that magnitude are not appropriate for a big database. You can keep it in memory. If you're going to do that anyway you can use something else.
Benoit: Like a binary search on memory! That's a proven technology!
Reg: Would a hybrid make sense? Where they keep the system files on a conventional file system, and this raw, unformatted area which is a database?
Dominic: Another way to think of it is to separate out the notion of storage from naming. They're really separate policies. If you have something like a ZIP file, why shouldn't we be able to use a file system to navigate our way through it? So you let people take over part of the name space. So applications can see a path or a query, but it might be answered by someone else. Or there's something else that's interpreting that and trying to make it look like a file system. So yes, that makes sense - that's even richer, pseudo-file systems.
Benoit: You can say give me an image, "gif.small", and it gives you a small version of the image.
Dominic: And you can have all kinds of filters. And computed attributes, which are not known until run-time, like the load average of a machine on the network.
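One way to sketch a computed attribute (the class and method names here are invented for illustration - this is not the Be or Longhorn API): an attribute is either stored bytes or a function evaluated at read time, and the reader can't tell the difference.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Sketch of a node whose attributes can be stored bytes or computed on
// demand -- e.g. a machine's load average. Invented names, not an API.
class Node {
    std::map<std::string, std::string> stored_;
    std::map<std::string, std::function<std::string()>> computed_;
public:
    void SetAttr(const std::string &name, const std::string &value) {
        stored_[name] = value;
    }
    void SetComputedAttr(const std::string &name,
                         std::function<std::string()> fn) {
        computed_[name] = std::move(fn);
    }
    // Readers can't distinguish the two kinds: a computed attribute is
    // evaluated at read time, so its value is never stale.
    std::string ReadAttr(const std::string &name) {
        auto c = computed_.find(name);
        if (c != computed_.end()) return c->second();
        auto s = stored_.find(name);
        return s != stored_.end() ? s->second : "";
    }
};
```

A query engine built on ReadAttr would then handle "load average > 2" the same way it handles "name = foo", which is what makes pseudo-file systems and filters composable.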
Reg: It doesn't have to be SQL it can be anything…there are many ways of querying a database.
Dominic: But then people don't think in SQL. They think in terms of the contents of the file, and name, and stuff like that.
Benoit: Now I want to hear your version of the story…
Dominic: No no no no! I had the same ideas. I was a little bit more relaxed, I didn't have tables, and 64bit journaling…
Benoit: Well journaling was the key to what you did
Dominic: It's a big part of it. From the user experience it's really nice when you can just reboot and everything's OK, it just comes up and all that.
Benoit: More than that, in BFS you could do attributes which were a gigabyte wide.
Dominic: It was kinda like a railroad: there was not a whole lot of choice, really. I thought the same way as you did about a lot of the things. But it's really the implementation details that were different. 64bit gives you the size of your data structures, you can't really do it any other way. It's reality.
A lot of people think journaling is a really difficult, complicated thing. But that was actually the easiest part by far. BFS journaling is maybe a thousand, maybe twelve hundred lines of code - it was really not difficult. And people make it into this monstrous complicated thing. But again, we did things like change the disk buffer cache so that the 64bit features that were needed to do journaling were supported.
In SGI's XFS, their journaling is bigger than all of BFS!
Dominic: To me the most interesting thing about file systems today is how you'd provide those find-by-content features … the bigger thing is information management.
Reg: You know Sherlock only does the first two thousand words of any file?
Dominic: Oh really?
Reg: And that's useless for anything serious. I mean, why bother at all? I've been through a ton - I used to use Lotus Magellan on DOS, and the best one on Windows I found is something called dtSearch, but it's been an afterthought for OS guys… Is there a latent demand or just that users don't want to use this?
Dominic: They don't know what's possible. If someone shows you it - they get it.
Benoit: I think attributes and queries were really underutilized. Well, most of BeOS was underutilized ;-)
Many thanks to Benoit and Dominic for such a fun conversation.®