Welcome!

Linux Containers Authors: Liz McMillan, Pat Romanski, Elizabeth White, Yeshim Deniz, Stackify Blog

Related Topics: Linux Containers

Linux Containers: Article

Practical XML with Linux, Part 3: XML database tools for Linux

Hierarchical, relational, and object databases

In previous articles (see Resources for links), I have discussed XML's value as a data format for Linux users. I have also discussed a wide variety of Linux tools for processing XML. But the data that XML usually represents often comes in forms that need to be managed on hard storage. Luckily, XML storage techniques and products have developed as much as other aspects of XML. We lucky Linux users have a wealth of choices available to us as we prepare to manage our XML data persistently.

Just a little theory

Although I used the word "practical" in the title of this article, a little background might be useful in choosing the best tools and techniques for managing your XML databases in Linux.

There are almost as many uses of XML as there are XML users, but there are only two ways of looking at how XML documents are organized. XML's roots lie in SGML, which was originally conceived as a way of structuring documents for machine preparation and human consumption. XML has inherited much of that bias toward documents, and is often used for presentation-oriented publishing (POP). Examples include books, slide presentations, and company Websites. POP formats tend to have elements and text that flow in a flexible and free-form manner.

XML has also gained popularity as the basis for data formats suitable for exchange between computer programs: consumed by machines but able to be inspected by humans. This is known as messaging-oriented middleware (MOM) because of its role in the infrastructure of applications. Examples include serialized objects, automated purchase orders, and Mozilla bookmark files. MOM formats tend to be highly regular, with elements making up well-defined fields with content according to strict data typing.

MOM and POP formats often impose different needs on XML databases, based on the differences in usage patterns and format. We will decide whether certain Linux database technologies are more appropriate for MOM or POP documents.

There are many ways of structuring databases. The relational model, used by well-known DBMSs like PostgreSQL and Oracle, is probably the most popular for new systems, but there are many other approaches. Databases can be:

  • Hash-based systems that store data as key/value pairs. That includes popular Unix database tools like Berkeley DB, DBM, and GDBM.
  • Hierarchical databases, once an especially popular format on mainframe and minicomputer systems, are represented in Linux by Adabas, which uses a very modern form of a hierarchical database.
  • Relational and object/relational databases store data in related tables that represent classes of entities. There are many examples for Linux, such as PostgreSQL, MySQL, Oracle, DB2, Informix, and Sybase.
  • Object databases store data as collections of object instances, using relationships based on object-oriented theory. Linux is poorly represented in this area.
  • Multi-dimensional databases generally model data relationships more richly than relational databases do. M3's Pick database for Linux is one example.
  • Semistructured databases are not as well-known as the above paradigms, and existed mostly in academia before XML's emergence. Semistructured databases support the notion that it is impractical to have a rigid schema for data that models the real world, given the fluidity of the real world. Many of its concepts are a natural fit for XML and related technologies like the Resource Description Framework (RDF).

There is a growing body of work on how to effectively manage XML data in hierarchical, relational, and object databases.

Hierarchical databases are often considered the most natural representation because XML itself is a hierarchy of elements and character data. Tamino, a hierarchical XML data store based on Adabas, offers some of the best performance for XML stored in native database format.

Developers sometimes store XML in relational DBMS by defining a column for each element type. There are other design patterns for that purpose as well. (See Resources for links.) The main problem with most RDBMS approaches is that they are so difficult to extend that they hamper flexibility. One can also just store the XML untouched as character large objects (CLOBs), but in this case, the RDBMS "degenerates" to a simple hashed database.

Object databases tend to store XML in a manner that approximates the document object model (DOM). That creates a persistent implementation of the various nodes that make up an XML document. This approach, as well as the DOM itself, is often considered to have excessive overhead. The eXcelon database is an example of an ODBMS XML store.

For more approaches and examples of XML databases, consult Ronald Bourret's excellent compilation on the subject. (See Resources.)

Toward an XML database standard

Databases must often span time, platforms, and application languages. Because data frequently must outlive code, standards are especially important in this area. As I have previously pointed out in this series, XML is built on standards; recently, there has been an effort to standardize XML databases as well. The XML:DB group is an initiative formed by organizations and individuals interested in XML database standardization.

XML:DB is working on a database connectivity API, a protocol for access control of XML document content, and XUpdate, a language for expressing updates to XML documents. The group has open discussion lists and the specification drafts are freely available. If you are interested in standardized XML database management, take a look at the group's work and encourage your favorite XML DBMS project to consider the specifications as they emerge.

Hacking from scratch

So, how do you begin storing and managing your XML documents persistently?

You can roll your own XML data store from many of the existing Linux tools. You can use DB, DBM, or GDBM, one or more of which should come with your Linux distribution. You could use the file's CRC or some other hash value as the key, or you could extract an element field from the XML document using a quick Perl or Python script.

PostgreSQL and MySQL can be turned into XML data stores using the techniques in the articles in the Resources section, or as simple large-object types.

Or you can just dump the XML files to disk, naming them with a hash or a given field you know will be unique. However, most people won't want to roll their own XML data store when so many specialized tools are available. In the following sections, we will look at some of the available tools for Linux.

Specialized XML data stores: Open source

First, the open source offerings -- you can find links to each product in the Resources section.

dbXML is an XML DBMS for Java that incorporates technologies like Simple Object Access Protocol (SOAP) for messaging middleware, XPath querying, and a rich Java API. dbXML's developers are founders of the XML:DB initiative.

hOpla is a rough-and-ready tool for defining mappings between XML data and PostgreSQL DBMS. An official GNU project, hOpla can set up table structures according to an XML file, and import and export data accordingly.

XML-DBMS was developed by Ronald Bourret, who also maintains the comprehensive list of XML database products that I mentioned earlier. XML-DBMS provides an object-relational mapping between XML and a relational database. The mapping is highly simplified from the DOM -- element types are classes (represented by relational tables), and attributes and character data are object attributes (columns). XML-DBMS provides modules for Java and Perl, but is not designed for standalone use. The main mapping object can be generated from a DTD or XSchema (the new XML schema format proposed by the W3C).

DB2XML maps JDBC ResultSets to DOM objects, and can trigger an XSLT transform for final presentation. It can be used standalone (including a GUI interface) or from other code using a simple API.

DBIx::XML_RDB, developed by Matt Sergeant, who develops many XML tools for Perl, is a Perl module that extracts XML from the results of database interface (DBI) queries. The output can be piped straight to XML::Parser, then to the many other XML modules that feed from XML::Parser.

My company, Fourthought, develops 4Suite, which contains DbDOM, a persistent DOM implementation for Python. DbDOM documents can be stored natively to the filesystem, or to a variety of backends, through the 4ODS object database, including PostgreSQL, Oracle, or even DB and DBM. DbDOM is integrated with other 4Suite technologies, including XPath and XSLT, XPointer, XLink, and RDF.

We also develop 4Suite Server, an XML dataserver that provides integrated services on the core 4Suite tools. That includes document management and XML-based Web services based on HTTP, SOAP, or CORBA, distributed transactions and concurrency, users, and access control. It supports experimental WebDAV support and the draft of the XML:DB standard. The 4Suite integration allows -- among other things -- metadata using RDF, content integration using XLink, and transforms and presentation using XSLT, indexing, and rules.

Specialized XML data stores: Closed source

Rogue Wave Software's XML-DB Link is quite an interesting product. It is basically a network API to databases using SOAP. You design a set of SQL queries and register them with the server, then invoke those queries remotely through SOAP messages and get the result tuples in a simple XML format. XML-DB Link is currently in beta; you can download a 30-day free evaluation.

Since SOAP is considered a leading communications protocol in the emerging field of Web services, XML-DB Link is a good early platform for experimenting with a possible major growth area. Its practicality is questionable, though, given SOAP's recent emergence.

Birdstep is a DBMS with some unique features. First, it is designed to have a very small footprint. Also, Birdstep data can be viewed in relational, object-oriented, and hierarchical modes. It also has a built-in hashing system for stored text, which creates interesting possibilities for indexing and efficiency. Birdstep is very suitable for XML storage; indeed, it comes with tools that allow stored XML to be accessed using DOM or SAX APIs. Birdstep is available for free download.

Lore, a Stanford University project, originated as a semistructured DBMS, but has since been tailored into an XML data store. It has proprietary XML query, indexing, and application-support features. It is geared toward deployments of document-management systems that support a wide variety of interactive searches on evolving data.

GMD-IPSI XQL Engine, formerly PDOM, is a persistent implementation of Java DOM for storing XML documents of arbitrary size in ready form for development, query, and extraction. It supports the draft of the XML Query Language (XQL). Of course, XQL is rather shaky -- the working group has had difficulty making progress toward their deliverable specification -- but it's good to see some experimental adoption. The package is free for noncommercial use.

Software AG's Adabas broke new ground in Linux DBMS. The company has also proven to be a pioneer in XML DBMS with Tamino, an XML data store that manages XML in a hierarchical DBMS for improved efficiency over the relational and object-oriented models. The Linux version is supported under SuSE 7.0 for IBM 390 mainframes. An Intel/Linux version is expected soon. This is a first for me: seeing Linux software come out on the mainframe port before Intel.

OpenLink Software's Virtuoso is an impressive array of XML middleware technologies. It includes an XML data store, either native or through SQL mappings, XPath query, XSLT transforms, and text indexing. It can be accessed using SOAP or WebDAV, and supports Internet protocols like HTTP, SMTP, and NNTP. Virtuoso is a well-integrated device for deploying Web services. A free evaluation download for Linux and other platforms is available.

IBM offerings

If you use IBM DB2 Universal Database Version 7 on Linux (or other platforms), you have the XML Extender, which allows you to generate specially formatted XML documents from the results of SQL queries. It also supports general storage of XML documents and DTDs of any form. XML Extender uses DB2's object-relational features to manage XML in database columns. IBM provides a proprietary mechanism to index the documents based on the values of certain elements. Text-based indexing is available using the DB2 text extender, a related tool.

One neat feature of the XML Extender is that it can extract XML from the query results of any Open Database Connectivity (ODBC)-compliant DBMS. This can be a heterogeneous database integration tool for XML middleware. It includes XSLT and XPath transformation and result query tools for postprocessing after the SQL query. This allows one to use XSLT as a stored procedure language, to some extent.

IBM's XML Extender comes with administration tools similar to those provided for DB2 itself. Of course, it's hard to imagine finding wizards that can truly help manage XML data formats efficiently.

IBM seems to have something for every aspect of XML in its alphaWorks project. DatabaseDom lets you use a sample XML file to define a relational database structure, generate similar XML format from the database, and update or insert a database from XML. The functionality is encapsulated in a JavaBean that processes DOM objects, making it easy to integrate into Java, XML, and database projects.

XML Lightweight Extractor (XLE) is an alphaWorks project that accepts DTDs with special annotations that indicate the mapping of XML content model to relational tables, generating a mapping that can be used to map relational databases to XML instances using JDBC.

Oracle utilities

Oracle's XML SQL Utility (XSU) is a great tool if you already have an Oracle database, and you want to extract MOM data from normalized relational tables and perform table updates in the other direction. XSU maps the result of regular SQL queries into XML documents and inserts data from documents into the appropriate relational columns. This could be used for anything from importing data into other applications and DBMS to creating transaction logs suitable for human inspection. XSU also has an XSLT engine that can automatically apply transforms to the results of XML document generation. That could be the basis for a very flexible database reporting mechanism.

XSU is written in Java and works with Oracle on Linux. It can be used from Java and PL/SQL stored procedures, or directly from Java applets or server-side code. The XML output and input can be in string form or DOM form.

Oracle's utility is a handy tool for those using Oracle and XML in their applications. It's easy to set up (much easier than Oracle itself, to be sure) and is built into Oracle 8.1.7 and the coming 9i. The only problem is that it is strongly tied to the relational model. The elements in the XML documents are generated from table or view column names; similarly, element-type names are used to match columns for insertion. Some hierarchy can be contrived, but overall, the setup is somewhat alien to XML -- it might as well be a simple hierarchy file. The XSLT integration helps, but sometimes requires extra work in XSLT on what should be set up in the source document.

Oracle XSU is a free download if you have registered for the Oracle Technology Network. As usual, you can get the book you need from O'Reilly & Associates. Users of Oracle's XML tools should certainly get Steve Muench's well-written book on the subject. (See Resources.)

Conclusion

As you can see, there is no lack of options. Whether you're a poor college-student hacker or a well-funded organization that has made a sensible decision to develop software on Linux, there are plenty of open source and commercial tools for your XML storage needs. In fact, there are very few XML tools that do not have some kind of Linux port.

You can store documents as native XML, build XML from other forms of data, or even develop hybrid systems with POP data stored as native XML and MOM data extracted from database structure.

But XML is all about standards; be sure to demand standards support from your vendors or to contribute to the standards support of your favorite open source tool. The only downside of having all those choices is the potential difficulty of changing your mind once you've chosen.

More Stories By Uche Ogbuji

Uche Ogbuji is cofounder and CEO of Fourthought, Inc.,
a software vendor and consultancy specializing in XML
solutions. A computer engineer, he has worked with XML for several
years, codeveloping 4Suite, an open-source platform for XML
processing. A frequent conference speaker, Uche has authored a number
of articles on the practical use of XML.

Comments (2) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Jeff Mathews 09/09/03 11:59:28 AM EDT

By the way, Sleepycat has an XML store built on top of the Berkeley DB. I haven't used it (I'm only aware of C++ and Java bindings for it, so far), but it looks kind of interesting.

http://www.sleepycat.com/products/xml.shtml

alvaro cervantes 09/06/03 09:02:18 PM EDT

Very good articles, I enjoy them very much. I could not find article number one, Practical XML with Linux part 1.

Thanks, and good luck on everything

@ThingsExpo Stories
JETRO showcased Japan Digital Transformation Pavilion at SYS-CON's 21st International Cloud Expo® at the Santa Clara Convention Center in Santa Clara, CA. The Japan External Trade Organization (JETRO) is a non-profit organization that provides business support services to companies expanding to Japan. With the support of JETRO's dedicated staff, clients can incorporate their business; receive visa, immigration, and HR support; find dedicated office space; identify local government subsidies; get...
Dion Hinchcliffe is an internationally recognized digital expert, bestselling book author, frequent keynote speaker, analyst, futurist, and transformation expert based in Washington, DC. He is currently Chief Strategy Officer at the industry-leading digital strategy and online community solutions firm, 7Summits.
Bill Schmarzo, author of "Big Data: Understanding How Data Powers Big Business" and "Big Data MBA: Driving Business Strategies with Data Science," is responsible for setting the strategy and defining the Big Data service offerings and capabilities for EMC Global Services Big Data Practice. As the CTO for the Big Data Practice, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He's written several white papers, is an avid blogge...
22nd International Cloud Expo, taking place June 5-7, 2018, at the Javits Center in New York City, NY, and co-located with the 1st DXWorld Expo will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud ...
In past @ThingsExpo presentations, Joseph di Paolantonio has explored how various Internet of Things (IoT) and data management and analytics (DMA) solution spaces will come together as sensor analytics ecosystems. This year, in his session at @ThingsExpo, Joseph di Paolantonio from DataArchon, added the numerous Transportation areas, from autonomous vehicles to “Uber for containers.” While IoT data in any one area of Transportation will have a huge impact in that area, combining sensor analytic...
Bill Schmarzo, author of "Big Data: Understanding How Data Powers Big Business" and "Big Data MBA: Driving Business Strategies with Data Science," is responsible for setting the strategy and defining the Big Data service offerings and capabilities for EMC Global Services Big Data Practice. As the CTO for the Big Data Practice, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He's written several white papers, is an avid blogge...
Charles Araujo is an industry analyst, internationally recognized authority on the Digital Enterprise and author of The Quantum Age of IT: Why Everything You Know About IT is About to Change. As Principal Analyst with Intellyx, he writes, speaks and advises organizations on how to navigate through this time of disruption. He is also the founder of The Institute for Digital Transformation and a sought after keynote speaker. He has been a regular contributor to both InformationWeek and CIO Insight...
Michael Maximilien, better known as max or Dr. Max, is a computer scientist with IBM. At IBM Research Triangle Park, he was a principal engineer for the worldwide industry point-of-sale standard: JavaPOS. At IBM Research, some highlights include pioneering research on semantic Web services, mashups, and cloud computing, and platform-as-a-service. He joined the IBM Cloud Labs in 2014 and works closely with Pivotal Inc., to help make the Cloud Found the best PaaS.
It is of utmost importance for the future success of WebRTC to ensure that interoperability is operational between web browsers and any WebRTC-compliant client. To be guaranteed as operational and effective, interoperability must be tested extensively by establishing WebRTC data and media connections between different web browsers running on different devices and operating systems. In his session at WebRTC Summit at @ThingsExpo, Dr. Alex Gouaillard, CEO and Founder of CoSMo Software, presented ...
@DevOpsSummit at Cloud Expo, taking place November 12-13 in New York City, NY, is co-located with 22nd international CloudEXPO | first international DXWorldEXPO and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
I think DevOps is now a rambunctious teenager - it's starting to get a mind of its own, wanting to get its own things but it still needs some adult supervision," explained Thomas Hooker, VP of marketing at CollabNet, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
CloudEXPO New York 2018, colocated with DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City and will bring together Cloud Computing, FinTech and Blockchain, Digital Transformation, Big Data, Internet of Things, DevOps, AI, Machine Learning and WebRTC to one location.
DevOpsSummit New York 2018, colocated with CloudEXPO | DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City. Digital Transformation (DX) is a major focus with the introduction of DXWorldEXPO within the program. Successful transformation requires a laser focus on being data-driven and on using all the tools available that enable transformation if they plan to survive over the long term. A total of 88% of Fortune 500 companies from a generation ago are now out of bus...
Everything run by electricity will eventually be connected to the Internet. Get ahead of the Internet of Things revolution. In his session at @ThingsExpo, Akvelon expert and IoT industry leader Sergey Grebnov provided an educational dive into the world of managing your home, workplace and all the devices they contain with the power of machine-based AI and intelligent Bot services for a completely streamlined experience.
DXWorldEXPO | CloudEXPO are the world's most influential, independent events where Cloud Computing was coined and where technology buyers and vendors meet to experience and discuss the big picture of Digital Transformation and all of the strategies, tactics, and tools they need to realize their goals. Sponsors of DXWorldEXPO | CloudEXPO benefit from unmatched branding, profile building and lead generation opportunities.
In his keynote at 19th Cloud Expo, Sheng Liang, co-founder and CEO of Rancher Labs, discussed the technological advances and new business opportunities created by the rapid adoption of containers. With the success of Amazon Web Services (AWS) and various open source technologies used to build private clouds, cloud computing has become an essential component of IT strategy. However, users continue to face challenges in implementing clouds, as older technologies evolve and newer ones like Docker c...
"Evatronix provides design services to companies that need to integrate the IoT technology in their products but they don't necessarily have the expertise, knowledge and design team to do so," explained Adam Morawiec, VP of Business Development at Evatronix, in this SYS-CON.tv interview at @ThingsExpo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"MobiDev is a software development company and we do complex, custom software development for everybody from entrepreneurs to large enterprises," explained Alan Winters, U.S. Head of Business Development at MobiDev, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
To get the most out of their data, successful companies are not focusing on queries and data lakes, they are actively integrating analytics into their operations with a data-first application development approach. Real-time adjustments to improve revenues, reduce costs, or mitigate risk rely on applications that minimize latency on a variety of data sources. In his session at @BigDataExpo, Jack Norris, Senior Vice President, Data and Applications at MapR Technologies, reviewed best practices to ...
Data is the fuel that drives the machine learning algorithmic engines and ultimately provides the business value. In his session at Cloud Expo, Ed Featherston, a director and senior enterprise architect at Collaborative Consulting, discussed the key considerations around quality, volume, timeliness, and pedigree that must be dealt with in order to properly fuel that engine.