The Lurker

Learning from the web: don't believe everything you read there

posted by ajf on 2005-11-11 at 01:18 am

Today I read Dare Obasanjo's summary of an article by Adam Bosworth on lessons the web offers about distributed computing.

It is, in short, a load of crap.

That's a little unfair. Bosworth's article begins by outlining what he sees as the strengths of the web. (He characterises them as "unintuitive lessons", though I don't think they're all quite so profound. For example, one of his lessons is that, due to server performance and communication latency constraints, it is necessary to avoid a fine-grained model of code on the server responding to events at the level of mouse moves or keys typed or each time the scrollbar is dragged three millimeters because if you do, you'll be generating about 10 events per second or two orders of magnitude too many. Since the 3270 terminal is older than I am, I think we knew this already. Overall, though, most of what Bosworth says about why the web works makes sense.)

Where the article goes off the rails is in the attempt to apply these lessons to other topics. First up: the Semantic Web.

There are, according to Bosworth, three weaknesses of XML which are a problem for the semantic web:

It doesn't handle binary data well.
It doesn't handle links.
XML documents tend to be monolithic.

It is true that it is unwise to try inserting binary data in an XML document. One of the lessons from the web Bosworth identified earlier — People understand a graph composed of tree-like documents (HTML) related by links (URLs) — is relevant here. Web pages don't contain the binary data for the images they require; they contain a URL, a link to each image. There is no weakness of the Semantic Web here. Far from it: in fact, as far as I understand it, the use of URIs is a fundamental building block of the Semantic Web.

It is also true that XML does not prescribe how to express links. Applications of XML incorporate their own understanding of links. For example, XHTML has a number of elements which contain URLs: <a>, <img>, <object>, <iframe>, and <link>. Though each of those elements contains a link to another resource on the web, the intent of each different type of link varies.

(There is, in fact, in XLink a W3C standard for generic links in XML. The conventional wisdom is that it is over-engineered and generally unwieldy. The XHTML working group — the people with the task of adapting precisely the benefits of the existing HTML web that Bosworth praises to XML — essentially refused to use it. XLink was finalised three years after XML. If anything, we are fortunate that the designers of XML allowed people building on top of XML to choose for themselves how best to express links, rather than holding off until they had developed an all-encompassing notion of links that nobody likes.)

It was at the "monolithic" claim where it became clear to me that this was not just an article I disagreed with in parts, but one that just doesn't make sense:

"Given a purchase order, for example, and the desire to insert a line item or replace the address, it is very hard to know how, since the items don't contain IDs or the date they were last created/updated. By contrast, the database world breaks things down into rows, each of which typically has a unique ID and a date created, although the last varies from database to database. This lets people "chunk" the information while they still have the ability to assemble it. By default in XML, the order of child elements matters. This encourages errors, however. It violates the rule of letting sloppy people author. A lot of XML errors are simple differences in the order of child elements. In most cases (except documents intended for printing) this does not and should not matter."

It amazes me that there could be so much wrong in so short a paragraph. For a start, there's a lot of sloppy thinking: "By default in XML, the order of child elements matters"? (This is a point Bosworth returns to in his next, even more bewildering, attempt to apply these lessons of the web.) This is nonsense. In an XML document, there is an ordering over two or more elements with the same parent element. But that does not mean that "order matters"! What "matters" is defined not by XML but by the language built on top of XML.
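To make the distinction concrete, here is a minimal sketch in Python. The <address> fragments and both "interpretations" are invented for illustration; the point is only that the same parser output can be read as ordered or unordered, depending on what the language built on top of XML decides.

```python
import xml.etree.ElementTree as ET

# Two hypothetical fragments that differ only in the order of their children.
DOC_A = "<address><city>Melbourne</city><street>1 Example St</street></address>"
DOC_B = "<address><street>1 Example St</street><city>Melbourne</city></address>"


def as_ordered(xml_text):
    """Interpretation 1: the language treats child order as significant."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


def as_unordered(xml_text):
    """Interpretation 2: the language treats child order as irrelevant."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


# XML preserves the order either way; the application decides whether to care.
ordered_equal = as_ordered(DOC_A) == as_ordered(DOC_B)
unordered_equal = as_unordered(DOC_A) == as_unordered(DOC_B)
print(ordered_equal, unordered_equal)  # False True
```

Nothing about the parser changes between the two readings; only the code consuming the parsed document does.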

The X stands for "extensible". One important design goal with XML was that the number of optional features in XML is to be kept to the absolute minimum, ideally zero — XML defines, based on years of experience from SGML and from the web, what every conceivable user might need, and no more.

(We could interpret the subsequent statements as a plea to people who build languages on XML, particularly in the context of schema languages, not to impose order between elements unless it is necessary. I'll come back to why I don't think Bosworth means anything so pragmatic.)

More broadly, it is far from obvious what this purchase order example is trying to illustrate. What I can discern is an underlying assumption that it is important to be able to insert a line item or replace the address without prior understanding of the purchase order language in use by the document's author.

There is a particular focus here on two components of this hypothetical purchase order document: line item identifiers and line item timestamps. Why? OK, the date of an order is a significant piece of data, but so are things like quantity, and the id of the item itself being ordered (as opposed to the reference to that item from the purchase order, which is the id Bosworth refers to), and of course price. Why doesn't Bosworth mention those?

What is so special about inserting a new line item into the purchase order's set of line items (as opposed, say, to modifying the unit price of an existing line item)?

"Recently, an opportunity has arisen to transcend these limitations. RSS 2.0 has become an extremely popular format on the Web. RSS 2.0 and Atom (which is essentially isomorphic) both support a base schema that provides a model for sets. Atom's general model is a container (a <feed>) of <entry> elements in which each <entry> may contain any namespace scoped elements it chooses (thus any XML), must contain a small number of required elements (<id>, <updated>, and <title>), and may contain some other well-known ones in the Atom namespace such as <link>s."

Atom, as a language primarily intended for (but not limited to) transporting blog posts, defines a way to express links and dates and identifiers.

And that is why the line item ids and timestamps, but not the quantity or the price, are interesting to Bosworth: he's discovered a shiny new toy that supports ids and timestamps.

Atom looks like a good application of XML, but there just isn't any reason to adopt its <entry> element for any element which is a member of a set. (He goes on to say "Even better, Atom clearly says that the order doesn't matter", for reasons which will become obvious.)

Finally, Bosworth turns his attention to databases — for some reason — arguing that databases don't respect these lessons from the web. But, in another example of less than rigorous thinking, he only applies five of his eight lessons, and adds a sixth observation that isn't really related to any of them.

To start with his least muddled objection, that databases don't handle trees or graphs effectively (number 5), that can be true. Retrieving or manipulating a tree stored in a relational database is messy. But the web lesson he's relating this to is "People understand a graph composed of tree-like documents (HTML) related by links (URLs)". People. Not "the web". As for the web infrastructure: web browsers and servers pretty much deal with one resource at a time. It's actually very easy to work with a single node-to-node relationship in a relational database; it's only when attempting to operate on numerous such relationships in a single operation that the complexity arises. The core components of the web, web servers and web browsers, generally don't try to do that anyway. (They support access to multiple resources in parallel, of course, but as discrete operations; there is no need to comprehend the graph as a whole. There are applications, most obviously search engines, which genuinely do deal with the web as an interconnected graph of resources, but they are decidedly atypical web applications.)
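The difference between one relationship and a whole subtree can be sketched with Python's sqlite3 module. The node table and its contents are hypothetical, and the recursive query relies on the WITH RECURSIVE support found in current SQLite versions; other databases spell this differently or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"), (5, 4, "a1x")],
)

# A single node-to-node relationship: the direct children of node 1.
children = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM node WHERE parent_id = ? ORDER BY id", (1,)
    )
]

# Every descendant of node 2: this is where the query gets messier.
descendants = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE subtree(id, name) AS (
            SELECT id, name FROM node WHERE parent_id = ?
            UNION ALL
            SELECT node.id, node.name
            FROM node JOIN subtree ON node.parent_id = subtree.id
        )
        SELECT name FROM subtree ORDER BY id
        """,
        (2,),
    )
]

print(children)     # ['a', 'b']
print(descendants)  # ['a1', 'a1x']
```

Following one link is a plain lookup; only the second query has to reason about the shape of the tree.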

Do databases optimize caching when it is OK to be stale? No, they don't. Like XML, databases are designed to solve a broadly applicable problem, and questions like this are best handled by the layers constructed above. I simply can't imagine why Bosworth seems to think this is a bad thing. (In fact, the solution he sketches — a way to mark what the TTL of fields is and a way in the query language to describe the acceptable staleness — is actually substantially less flexible than what the web provides today. If databases supported this, to take full advantage of HTTP you would have to work around that support and develop your own alternative anyway!)

Do databases let schemas evolve for a set of items using a bottom-up consensus/tipping point? This fourth question, presumably based on his observation that "The wisdom of crowds works amazingly well", rails against the structure that a database schema provides. It is misleading to say that a database schema cannot evolve; while it is not possible to store data which violates the schema, it is possible (though not necessarily trivial) to modify an existing database schema to suit changing needs. In fact it is a defining characteristic of relational databases to support such evolution in a manner that does not require wholesale changes to how an application interacts with its data.
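As a small illustration of that kind of evolution, here is a sketch using Python's sqlite3 module. The post table, the summary column and the list_titles helper are all made up for the example; a real migration on a large production database would need more care than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO post (title) VALUES ('First post')")


def list_titles(connection):
    """Application code written against the original schema."""
    rows = connection.execute("SELECT title FROM post ORDER BY id")
    return [row[0] for row in rows]


before = list_titles(conn)

# The schema evolves: a new column, with a default value for existing rows.
conn.execute("ALTER TABLE post ADD COLUMN summary TEXT DEFAULT ''")
conn.execute(
    "INSERT INTO post (title, summary) VALUES ('Second post', 'A follow-up')"
)

# The old query still works, untouched, against the new schema.
after = list_titles(conn)
print(before, after)  # ['First post'] ['First post', 'Second post']
```

Because the query names the columns it wants, adding a column does not disturb it.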

I just don't see what Flickr/del.icio.us style tags have to do with a database. They're a user construct, not one that necessarily belongs as a fundamental feature of the database. (OK, so maybe I want to be able to tag content items, but should I really be able to tag the row that contains a user's email address and password? Tags are an application feature, and, as valuable a feature as they can be, they don't require native database-level support any more than a phone number does.)
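To show how little the database needs to know about tags, here is a minimal sketch in which they are just another table. The item and item_tag tables and the items_tagged helper are hypothetical names, not anything Flickr or del.icio.us is known to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE item_tag (
        item_id INTEGER REFERENCES item(id),
        tag TEXT,
        PRIMARY KEY (item_id, tag)
    );
    INSERT INTO item (id, title) VALUES (1, 'Sunset'), (2, 'Harbour');
    INSERT INTO item_tag (item_id, tag)
        VALUES (1, 'sky'), (1, 'melbourne'), (2, 'melbourne');
    """
)


def items_tagged(connection, tag):
    """Return the titles of all items carrying the given tag."""
    rows = connection.execute(
        "SELECT item.title FROM item "
        "JOIN item_tag ON item_tag.item_id = item.id "
        "WHERE item_tag.tag = ? ORDER BY item.id",
        (tag,),
    )
    return [row[0] for row in rows]


print(items_tagged(conn, "melbourne"))  # ['Sunset', 'Harbour']
```

The application chooses which tables are taggable simply by deciding which ones item_tag refers to.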

Are simple relaxed text formats and protocols supported? Well, yeah: SQL.

Oh. He's not talking about that.

"We're still in the CORBA world. One still requires custom code to read a database. It is as though the browser required a driver to talk to each site (or at least type of site)."

When it comes to relational databases, we've got SQL, a language for expressing queries about our content, and a straightforward concept of rows and columns which contain the result of a query. The custom code is written by the good folks at Oracle, Microsoft, the PostgreSQL project, and any other database vendor you care to name — they write that esoteric communication code once, and programmers everywhere use the same query language and programmatic interfaces to interact with whichever database provider they happen to be connecting to.

Bosworth is suggesting here that all database vendors could use the same communication protocol. They could, it is true; but as long as I have a driver that works, the protocol they choose to use has absolutely no impact on me at all. I already have a simple, uniform database interface.

Have databases learned from the Web and made their queries simple and flexible? The only "simple and flexible" queries I've ever seen on the web involve search engines — a tool which typically gives you the answer you're looking for, plus several hundred more that you didn't want for good measure. It's a tremendous tool for humans, but the concept just doesn't apply to databases. Except —

"just ask a database if it has anyone who, if they have an age, are older than 40; and if they have a city, live in New York; and if they have an income, earn more than $100,000."

Well, I haven't seen a search engine that can do that. So what is he talking about?

The Semantic Web? It fits the bill, but Bosworth has shown that he's no fan of it. And its query languages aren't all that far removed from SQL anyway.
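For what it's worth, the quoted request can be phrased in plain SQL too, provided absent values are stored as NULL. The sketch below assumes a hypothetical person table and is only one possible reading of "if they have an age".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT, age INTEGER, city TEXT, income INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [
        ("Ann", 45, "New York", 150000),  # satisfies all three conditions
        ("Bob", None, "New York", None),  # no age or income recorded
        ("Cat", 30, "New York", 200000),  # has an age, but not over 40
        ("Dan", 50, "Boston", 120000),    # has a city, but not New York
    ],
)

# Each condition only applies when the corresponding value is present.
matches = [
    row[0]
    for row in conn.execute(
        """
        SELECT name FROM person
        WHERE (age IS NULL OR age > 40)
          AND (city IS NULL OR city = 'New York')
          AND (income IS NULL OR income > 100000)
        ORDER BY name
        """
    )
]
print(matches)  # ['Ann', 'Bob']
```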

What, then?

"Oracle has done a remarkable job of adding XML to its database in the various ways that customers might want. In so doing, it has added a lot of these capabilities. Its ROWID type allows some forms of flexible linkage. But none really shows that they have learned from the Web."

OK, so I can store XML in the database. And I can write queries that use XPath instead of SQL. I like XPath, and while some queries are easier to express in XPath than they would be in SQL, I've personally had to express a given query in both forms, and I can tell you that there are times where the opposite is true.

Apparently this isn't it either, because nothing really shows that they have learned from the Web — but Bosworth offers absolutely no support for this statement, so there's no way of knowing what in the Web he's talking about.

The most staggering question, though, is: Have databases enabled people to harness Moore's law in parallel?

Moore's law is a bit of a red herring here. Bosworth is actually talking about dividing the execution of the query into smaller tasks in a distributed environment. (How does this relate to the web, where essentially every single scrap of code runs on a distant server, and the client only handles presentation? I don't know, and he isn't telling.)

So, if we did want to divide the labour of query evaluation into parallel units of execution, what would we need to do? Bosworth's answer: limit all predicates to ones that can be computed against a single row at a time.

Uh, what?

Does Bosworth really understand the consequences of that idea? Well yes he does, because he lists what we would have to throw out from our query language: things like ORDER BY, joins, subqueries.

Oh, is that all?

So what happens without ORDER BY? Well, I don't post that often to my blog; it has a little under 200 posts at the time of writing. Right now, to display the latest ten, I have a database query that uses ORDER BY to sort by date, and to limit the query to ten results.

Without ORDER BY, I have to fetch all 200 into memory, then write more code to eliminate the 190 I could have told the database I didn't want.
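The two approaches look roughly like this with Python's sqlite3 module. The post table and its 200 generated rows are stand-ins for the blog's real schema, which isn't shown here.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, posted_at TEXT, title TEXT)"
)
# 200 made-up posts, one per day, so every timestamp is distinct.
conn.executemany(
    "INSERT INTO post (id, posted_at, title) VALUES (?, ?, ?)",
    [
        (n, (date(2005, 1, 1) + timedelta(days=n)).isoformat(), f"Post {n}")
        for n in range(1, 201)
    ],
)

# With ORDER BY and LIMIT, the database hands back only the ten newest rows.
latest_db = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM post ORDER BY posted_at DESC LIMIT 10"
    )
]

# Without them, the client fetches every row and does the sorting itself.
all_rows = conn.execute("SELECT id, posted_at FROM post").fetchall()
newest_first = sorted(all_rows, key=lambda row: row[1], reverse=True)
latest_client = [row[0] for row in newest_first[:10]]

print(len(all_rows), latest_db == latest_client)  # 200 True
```

Both paths produce the same ten ids, but the second one drags all 200 rows across the connection to get there.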

You can eliminate joins in the same way: instead of asking the database to identify the relevant parts of two tables, just fetch both tables entirely into memory and write your own code to match things up. Easy!
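Here is what that trade looks like in a minimal sketch, again with sqlite3 and hypothetical author and post tables: first the join the database would do for you, then the hand-rolled equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'ajf'), (2, 'guest');
    INSERT INTO post VALUES (1, 1, 'On XML'), (2, 2, 'Hello'), (3, 1, 'On SQL');
    """
)

# Letting the database match posts to authors.
joined_db = conn.execute(
    "SELECT post.title, author.name FROM post "
    "JOIN author ON author.id = post.author_id ORDER BY post.id"
).fetchall()

# Doing it by hand: fetch both tables whole, then match them up in memory.
authors = {
    author_id: name
    for author_id, name in conn.execute("SELECT id, name FROM author")
}
posts = conn.execute("SELECT id, author_id, title FROM post").fetchall()
joined_client = [
    (title, authors[author_id]) for _post_id, author_id, title in sorted(posts)
]

print(joined_db == joined_client)  # True
```

The results match, but the second version reads both tables in full and reimplements matching and ordering that the database already knows how to do.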

For some reason, when it comes to database communication protocols, something that a handful of database vendors have to implement, it's bad to have to write "custom code". But when it comes to sorting and searching, every database user on the planet can join in the fun! (I hear Knuth has a good book on the subject.)

This is idiocy.

(In fairness, I should note that what he actually said was limit all predicates to ones that can be computed against a single row at a time, at least where efficiency and scale are paramount (my emphasis). But even there, he's wrong: you're not making the database more efficient, you're making it less flexible — in fact, undermining Bosworth's own desire for flexible querying. Database vendors are very good at writing efficient searching and sorting code — it's why they exist. Bosworth's proposal is almost always going to be less efficient.)

Because I thought this desire to cripple SQL was the stupidest idea in the whole article, I left it until last. But I glossed over one detail regarding protocol, because he saved the best part for his conclusion:

"It is time that the database vendors stepped up to the plate and started to support a native RSS 2.0/Atom protocol and wire format"

So that's why Bosworth is excited about the order of <entry> elements being insignificant in Atom — it's because he wants to drop ORDER BY.

Related topics: Rants Web Mindless Link Propagation

All timestamps are Melbourne time.