Elliotte Rusty Harold on native XML data servers

16 August 2007 » DB2, IBM, MySQL, PHP, Web development

Soon after a New York PHP mailing list exchange debating the merits of storing information in hierarchical XML format versus traditional relational tables, XML guru Elliotte Rusty Harold posted a summary of the State of Native XML Databases to his blog.

Like the thread that inspired it, the post has generated a lot of comments showing that it’s an emerging technology whose potential is not well understood and that the products which implement the technology aren’t well known.

Why use an XML database?
Before considering an investment in a data server which offers native XML storage (one which doesn’t decompose it, nor store it as an unstructured chunk; and which allows the user to query its arbitrary individual elements), it’s necessary to take a step back and see what XML as a storage method offers the Web developer.

  • What sort of information should be stored as XML?
    The examples cited by Elliotte include large documents where the document itself is composed of related data, yet which it would be inefficient to break down into related tables and columns. A book can be broken down logically into a title and an abstract but what about the individual paragraphs in each chapter? What of the table of contents and index which are derived dynamically from data which exists elsewhere in the document?
  • Why can’t this data be stored in another format?
    It can be stored, but how do you make use of it? You might shred it, but this requires time to decompose and then recompose, assuming you can get back the data in the form you require. For example, what if you needed the first paragraph and figure of every chapter to compose a detailed table of contents? How would you write that query? What would you do if you needed to add, remove, or reorder a paragraph in an encyclopedia?
  • Why is data stored in XML format increasingly becoming valuable?
    According to Elliotte Rusty Harold and Anant Jhingran, most existing data isn’t traditional relational data at all. There is a ton of information that can not be queried currently with traditional SQL nor stored efficiently in relational tables. Think about the Web itself, it’s a collection of documents that have individual (ideally semantic) structure.

Sound bites
Here are a few of the insightful nuggets which sum up Elliotte’s point of view. For some of his thoughts on the future of XML in general, have a look at his Ten predictions for XML in 2007.

From http://lists.nyphp.org/pipermail/talk/2007-August/022724.html

Roughly 80% of the world’s data cannot plausibly be stored in a
relational database. The 20% that does fit there is important enough
that we’ve spent the last 20 years stuffing it into relational databases
and doing interesting things with it. I’m still doing a lot of that.

But there’s a lot more data out there that doesn’t look like tables than
does. Much of this data fits very nicely in a native XML database like
Mark Logic or eXist. There’s also data that has some tabular parts and
some non-tabular parts. This may work well in a hybrid XML-relational
database like DB2 9.

If your only place to put pegs is a table with square holes, then you’re
going to try pound every peg you find into a square hole. However, some
of us have noticed that a lot of the pegs we encounter aren’t shaped
like squares, and sometimes we need to buy a different table with
different shaped holes. :-)

Relational databases didn’t take the world by storm overnight. XML
databases won’t either. But they will be adopted because they do let
people solve problems they have today that they cannot solve with any
other tools.

From http://lists.nyphp.org/pipermail/talk/2007-August/022788.html

XML is not a file format. We’ve been down this road before. A native XML
database is no more based on a file format than MySQL is based on tab
delimited text.

From http://lists.nyphp.org/pipermail/talk/2007-August/022789.html

Storing books, web pages, and the like in a relational database has only
two basic approaches: make it a blob or cut it into tiny little pieces.
The first eliminates search capabilities; the second performs like a dog.

Also from http://lists.nyphp.org/pipermail/talk/2007-August/022788.html

>> I’m glad we have multiple tools to bring to bear on this kind of
>> problem, because I worry about the performance implications of
>> querying an XML database for the average price of those books, or
>> performing an operation that adds another field (tag?) to each book’s
>> “record”.

Average prices, or adding a field, can be done pretty fast. I don’t know
if it’s as fast as oracle or MySQL. I don’t much care. Sales systems are
exactly the sort of apps that relational databases fit well. But
actually publishing the books? That’s a very different story.

>> If it’s not too much trouble, could you give us some other use cases
>> for an XML database? Because title and first paragraph, if that’s
>> something a system “routinely does” could easily be stored as
>> relational data at the time of import.

Just surf around Safari sometime. Think about what it’s doing. Then try
to imagine doing that on top of a relational database.

Think about combining individual chapters, sections, and even smaller
divisions to make new on-off books like Safari U does. Consider the
generation of tables of contents and indexes for these books.

Closer to home, think about a blogging system or a content management
system. Now imagine what you could do if the page structure were
actually queryable, and not just an opaque blob in MySQL somewhere.

And the takeaway from the State of Native XML Databases:

If you’re working in publishing, including web publishing, you owe it to yourself to take a serious look at the available XML databases. If they already meet your needs, use them. If not, check back again again in a year or two when there’ll be more and better choices.

The relational revolution didn’t happen overnight, and the XQuery revolution isn’t going to happen overnight either. However it will happen because for many applications the benefits are just too compelling to ignore.

Conclusion
This is interesting stuff, and I’m glad Elliotte was able to put forward some of the reasons one might use an XML database and describe the maturity level of the data server products out there now.
FLWOR
We’ve asked Elliotte to present at one of the upcoming New York PHP meetings in October or November. If he can’t make it, it would be interesting to hear from other folks doing PHP work with XML databases, such as the XML Content Store / Zend_Db_Xml in the Zend Framework.

  

3 Responses to 'Elliotte Rusty Harold on native XML data servers'

Subscribe to comments with RSS

  1. Chris Gruber said,

    16 August 2007 at 10:30 pm

    Hi Daniel,

    Did you see the Web 2.0 Starter Toolkit for IBM DB2 on alphaworks.

    http://www.alphaworks.ibm.com/tech/web2db2

    This leverages the pureXML your are speaking of!

  2. naisioxerloro said,

    28 November 2007 at 10:35 am

    Hi.
    Good design, who make it?


  3. 18 May 2011 at 7:45 am

    Great info thanks! What language is close to php as far as functionality and ease of use?

Leave a Reply