Elliotte Rusty Harold on native XML data servers

16 August 2007 » DB2, IBM, MySQL, PHP, Web development

Soon after a New York PHP mailing list exchange debating the merits of storing information in hierarchical XML format versus traditional relational tables, XML guru Elliotte Rusty Harold posted a summary of the State of Native XML Databases to his blog.

Like the thread that inspired it, the post has generated a lot of comments, which suggests that native XML storage is an emerging technology whose potential is not well understood and that the products implementing it aren’t well known.

Why use an XML database?
Before considering an investment in a data server which offers native XML storage (one which neither decomposes the XML nor stores it as an unstructured chunk, and which lets the user query arbitrary individual elements), it’s necessary to take a step back and see what XML as a storage method offers the Web developer.

  • What sort of information should be stored as XML?
    The examples cited by Elliotte include large documents composed of related data that would be inefficient to break down into related tables and columns. A book can be broken down logically into a title and an abstract, but what about the individual paragraphs in each chapter? What of the table of contents and index, which are derived dynamically from data that exists elsewhere in the document?
  • Why can’t this data be stored in another format?
    It can be stored, but how do you make use of it? You might shred it, but this requires time to decompose and then recompose, assuming you can get back the data in the form you require. For example, what if you needed the first paragraph and figure of every chapter to compose a detailed table of contents? How would you write that query? What would you do if you needed to add, remove, or reorder a paragraph in an encyclopedia?
  • Why is data stored in XML format increasingly becoming valuable?
    According to Elliotte Rusty Harold and Anant Jhingran, most existing data isn’t traditional relational data at all. There is a ton of information that cannot currently be queried with traditional SQL or stored efficiently in relational tables. Think about the Web itself: it’s a collection of documents that have individual (ideally semantic) structure.
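
To make the “first paragraph and figure of every chapter” question a little more concrete, here is a minimal Python sketch using only the standard library. It is not how a native XML data server works (there the query would run inside the server, typically as XQuery); it only shows how naturally the question maps onto the document’s own structure. The element names (book, chapter, title, para, figure) and the sample content are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup; the element names are assumptions, not a real schema.
BOOK_XML = """
<book>
  <title>Sample Book</title>
  <chapter>
    <title>Getting Started</title>
    <para>First paragraph of chapter one.</para>
    <para>Second paragraph of chapter one.</para>
    <figure src="fig1.png"/>
  </chapter>
  <chapter>
    <title>Going Further</title>
    <figure src="fig2.png"/>
    <para>First paragraph of chapter two.</para>
    <figure src="fig3.png"/>
  </chapter>
</book>
"""


def detailed_toc(xml_text):
    """Return (title, first paragraph, first figure src) for each chapter."""
    root = ET.fromstring(xml_text)
    entries = []
    for chapter in root.findall("chapter"):
        title = chapter.findtext("title")
        # findtext/find return the first matching child in document order.
        first_para = chapter.findtext("para")
        figure = chapter.find("figure")
        first_figure = figure.get("src") if figure is not None else None
        entries.append((title, first_para, first_figure))
    return entries


toc = detailed_toc(BOOK_XML)
for title, para, fig in toc:
    print(f"{title}: {para} [{fig}]")
```

Doing the same against shredded relational tables would need ordering columns and joins across chapter, paragraph, and figure tables, which is the recomposition cost described above.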

Sound bites
Here are a few of the insightful nuggets which sum up Elliotte’s point of view. For some of his thoughts on the future of XML in general, have a look at his Ten predictions for XML in 2007.

From http://lists.nyphp.org/pipermail/talk/2007-August/022724.html

Roughly 80% of the world’s data cannot plausibly be stored in a
relational database. The 20% that does fit there is important enough
that we’ve spent the last 20 years stuffing it into relational databases
and doing interesting things with it. I’m still doing a lot of that.

But there’s a lot more data out there that doesn’t look like tables than
does. Much of this data fits very nicely in a native XML database like
Mark Logic or eXist. There’s also data that has some tabular parts and
some non-tabular parts. This may work well in a hybrid XML-relational
database like DB2 9.

If your only place to put pegs is a table with square holes, then you’re
going to try pound every peg you find into a square hole. However, some
of us have noticed that a lot of the pegs we encounter aren’t shaped
like squares, and sometimes we need to buy a different table with
different shaped holes. :-)

Relational databases didn’t take the world by storm overnight. XML
databases won’t either. But they will be adopted because they do let
people solve problems they have today that they cannot solve with any
other tools.

From http://lists.nyphp.org/pipermail/talk/2007-August/022788.html

XML is not a file format. We’ve been down this road before. A native XML
database is no more based on a file format than MySQL is based on tab
delimited text.

From http://lists.nyphp.org/pipermail/talk/2007-August/022789.html

Storing books, web pages, and the like in a relational database has only
two basic approaches: make it a blob or cut it into tiny little pieces.
The first eliminates search capabilities; the second performs like a dog.

Also from http://lists.nyphp.org/pipermail/talk/2007-August/022788.html

>> I’m glad we have multiple tools to bring to bear on this kind of
>> problem, because I worry about the performance implications of
>> querying an XML database for the average price of those books, or
>> performing an operation that adds another field (tag?) to each book’s
>> “record”.

Average prices, or adding a field, can be done pretty fast. I don’t know
if it’s as fast as oracle or MySQL. I don’t much care. Sales systems are
exactly the sort of apps that relational databases fit well. But
actually publishing the books? That’s a very different story.

>> If it’s not too much trouble, could you give us some other use cases
>> for an XML database? Because title and first paragraph, if that’s
>> something a system “routinely does” could easily be stored as
>> relational data at the time of import.

Just surf around Safari sometime. Think about what it’s doing. Then try
to imagine doing that on top of a relational database.

Think about combining individual chapters, sections, and even smaller
divisions to make new on-off books like Safari U does. Consider the
generation of tables of contents and indexes for these books.

Closer to home, think about a blogging system or a content management
system. Now imagine what you could do if the page structure were
actually queryable, and not just an opaque blob in MySQL somewhere.

And the takeaway from the State of Native XML Databases:

If you’re working in publishing, including web publishing, you owe it to yourself to take a serious look at the available XML databases. If they already meet your needs, use them. If not, check back again in a year or two when there’ll be more and better choices.

The relational revolution didn’t happen overnight, and the XQuery revolution isn’t going to happen overnight either. However it will happen because for many applications the benefits are just too compelling to ignore.

Conclusion
This is interesting stuff, and I’m glad Elliotte was able to put forward some of the reasons one might use an XML database and describe the maturity level of the data server products out there now.
We’ve asked Elliotte to present at one of the upcoming New York PHP meetings in October or November. If he can’t make it, it would be interesting to hear from other folks doing PHP work with XML databases, such as the XML Content Store / Zend_Db_Xml in the Zend Framework.

A new tonneau cover for the Frontier

14 August 2007 » Photos, Potpourri, The truck

This past weekend I installed a hard tonneau cover on my Nissan Frontier pick-up truck. I had been meaning to do it for quite some time, but choosing the right cover has been a challenge.

Nissan sells both a factory hard tonneau and a soft cover tonneau but neither is cheap nor easy to install and remove. The many independent hardware vendors sell hard and soft covers which never quite fit my needs.

After several fits of research over the years, I found a company called Lazer Lite that makes a nice hard aluminum cover which opens with the help of a pair of hydraulic struts. It also removes easily and doesn’t take up much room in the bed itself.

Cat ordered one for me as a surprise for my birthday, and although it took a while for Lazer Lite to build and ship, I’m pretty damn happy with the cover and the customer service we received.

If you’re in the market for a tonneau, I highly recommend Scott and the folks at Lazer Lite.

Here are some before and after pics of the cover on my truck.

Thoughts on the DB2 9 Fundamentals exam

03 August 2007 » DB2, IBM

This past Tuesday I took the DB2 9 Fundamentals certification exam that I set my sights on a couple of months back. The exam covers the basic topics in DB2 installation, administration and database usage. Successful candidates earn “IBM Certified Database Associate” status.

I hadn’t planned to have a go at it so soon, but I discovered the hard way that I only had until the end of July to both redeem my particular exam voucher and take the test. When buying or receiving vouchers in the past, the two dates have been separate. For example, one normally must redeem a voucher by a certain date, but can slot the exam itself for some period after that deadline.

No matter, I did pass, but I have to admit the test was more challenging than I expected.

Study materials
I split my preparation between Roger Sanders’ DB2 9 Fundamentals Certification Study Guide and the DB2 9 Fundamentals certification 730 prep series on IBM developerWorks written by various DB2 subject matter experts.

Both resources overlap in their coverage of the exam objectives, but the developerWorks tutorials and other articles on the site cover XQuery and the XML topics in more detail whereas the book provides sample questions and a comprehensive mock exam.

Besides being a solid guide to the material, the study guide’s binding held up very well during the course of several trips to the beach, repeated cat bites, and rough rides in the back of my truck. :)

What I gained
Despite the rush to study before I had to take the exam, I retained a lot of knowledge and plan to apply what I learned to some existing database applications.

I would liken the preparation for this type of exam to something I did earlier this summer: spending a couple of hours with my vehicle’s owner’s manual. I’ve been driving that truck for 2 years, but I discovered a few things I never knew before that made general usage better. For example, how to use cruise control properly, monitor tire pressure, and fine-tune the alarm.

In particular, I shored up my knowledge of the following DB2 topics, and plan to take advantage of them in the coming weeks.

  • Isolation levels and lock characteristics. I’m looking forward to tuning our existing applications by specifying appropriate concurrency tweaks where I can to improve performance.
  • UNION, INTERSECT, EXCEPT set operators. I vaguely recall these from a generic introduction to SQL course I took a few years ago. With more experience under my belt, I think I can really take advantage of these types of queries now.
  • User defined types (UDTs). This could be an interesting way to apply Object-Oriented Analysis and Design concepts to the database, since one can map a business concept like MONEY to the native type DECIMAL(6, 2).
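
As a quick refresher on the set operators, here is a small Python sketch. It uses SQLite from the standard library as a stand-in, not DB2, so treat it only as an illustration of what UNION, INTERSECT, and EXCEPT return; the table names and rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in; table names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE print_titles (title TEXT);
    CREATE TABLE ebook_titles (title TEXT);
    INSERT INTO print_titles VALUES ('A'), ('B'), ('C');
    INSERT INTO ebook_titles VALUES ('B'), ('C'), ('D');
""")


def combine(operator):
    """Combine the two title lists with the given set operator."""
    sql = (
        f"SELECT title FROM print_titles {operator} "
        "SELECT title FROM ebook_titles ORDER BY title"
    )
    return [row[0] for row in conn.execute(sql)]


in_either = combine("UNION")       # titles in at least one list, duplicates removed
in_both = combine("INTERSECT")     # titles present in both lists
print_only = combine("EXCEPT")     # titles in print_titles but not in ebook_titles

print(in_either, in_both, print_only)
```

The part worth remembering is that EXCEPT is not symmetric: swapping the two SELECT statements gives the titles that exist only as ebooks.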

The Iowa Choral Directors Association
One gripe I do have that doesn’t have anything to do with the test per se is that the short form of the title attained via certification is hard to pin down.

“ICDA” is ambiguous with the more advanced administrator certification – IBM Certified Database Administrator – which is particularly odd given how thoroughly IBM embraces acronyms. Googling the term to find folks who hold the qualification leads to some interesting results…

My next steps
I intend to follow through with the IBM Certified Application Developer (PDF) path, which means preparing for the DB2 9 Application Developer exam.

As IBM’s Program Manager for Information Management points out, there isn’t a dedicated book for this exam, but she does offer some suggestions for other study materials.

Along with those, I’m planning to crack open the DB2 8 Application Development Certification Guide that I’ve been hanging onto for about 3 years. As before, developerWorks tutorials should come in handy.