Welcome!

Linux Containers Authors: Olivier Huynh Van, Derek Weeks, Stackify Blog, Automic Blog, Elizabeth White

Related Topics: Linux Containers

Linux Containers: Article

Practical XML with Linux, Part 3: XML database tools for Linux

Hierarchical, relational, and object databases

In previous articles (see Resources for links), I have discussed XML's value as a data format for Linux users. I have also discussed a wide variety of Linux tools for processing XML. But the data that XML usually represents often comes in forms that need to be managed on hard storage. Luckily, XML storage techniques and products have developed as much as other aspects of XML. We lucky Linux users have a wealth of choices available to us as we prepare to manage our XML data persistently.

Just a little theory

Although I used the word "practical" in the title of this article, a little background might be useful in choosing the best tools and techniques for managing your XML databases in Linux.

There are almost as many uses of XML as there are XML users, but there are only two ways of looking at how XML documents are organized. XML's roots lie in SGML, which was originally conceived as a way of structuring documents for machine preparation and human consumption. XML has inherited much of that bias toward documents, and is often used for presentation-oriented publishing (POP). Examples include books, slide presentations, and company Websites. POP formats tend to have elements and text that flow in a flexible and free-form manner.

XML has also gained popularity as the basis for data formats suitable for exchange between computer programs: consumed by machines but able to be inspected by humans. This is known as messaging-oriented middleware (MOM) because of its role in the infrastructure of applications. Examples include serialized objects, automated purchase orders, and Mozilla bookmark files. MOM formats tend to be highly regular, with elements making up well-defined fields with content according to strict data typing.

MOM and POP formats often impose different needs on XML databases, based on the differences in usage patterns and format. We will decide whether certain Linux database technologies are more appropriate for MOM or POP documents.

There are many ways of structuring databases. The relational model, used by well-known DBMSs like PostgreSQL and Oracle, is probably the most popular for new systems, but there are many other approaches. Databases can be:

  • Hash-based systems that store data as key/value pairs. That includes popular Unix database tools like Berkeley DB, DBM, and GDBM.
  • Hierarchical databases, once an especially popular format on mainframe and minicomputer systems, are represented in Linux by Adabas, which uses a very modern form of a hierarchical database.
  • Relational and object/relational databases store data in related tables that represent classes of entities. There are many examples for Linux, such as PostgreSQL, MySQL, Oracle, DB2, Informix, and Sybase.
  • Object databases store data as collections of object instances, using relationships based on object-oriented theory. Linux is poorly represented in this area.
  • Multi-dimensional databases generally model data relationships more richly than relational databases do. M3's Pick database for Linux is one example.
  • Semistructured databases are not as well-known as the above paradigms, and existed mostly in academia before XML's emergence. Semistructured databases support the notion that it is impractical to have a rigid schema for data that models the real world, given the fluidity of the real world. Many of its concepts are a natural fit for XML and related technologies like the Resource Description Framework (RDF).

There is a growing body of work on how to effectively manage XML data in hierarchical, relational, and object databases.

Hierarchical databases are often considered the most natural representation because XML itself is a hierarchy of elements and character data. Tamino, a hierarchical XML data store based on Adabas, offers some of the best performance for XML stored in native database format.

Developers sometimes store XML in relational DBMS by defining a column for each element type. There are other design patterns for that purpose as well. (See Resources for links.) The main problem with most RDBMS approaches is that they are so difficult to extend that they hamper flexibility. One can also just store the XML untouched as character large objects (CLOBs), but in this case, the RDBMS "degenerates" to a simple hashed database.

Object databases tend to store XML in a manner that approximates the document object model (DOM). That creates a persistent implementation of the various nodes that make up an XML document. This approach, as well as the DOM itself, is often considered to have excessive overhead. The eXcelon database is an example of an ODBMS XML store.

For more approaches and examples of XML databases, consult Ronald Bourret's excellent compilation on the subject. (See Resources.)

Toward an XML database standard

Databases must often span time, platforms, and application languages. Because data frequently must outlive code, standards are especially important in this area. As I have previously pointed out in this series, XML is built on standards; recently, there has been an effort to standardize XML databases as well. The XML:DB group is an initiative formed by organizations and individuals interested in XML database standardization.

XML:DB is working on a database connectivity API, a protocol for access control of XML document content, and XUpdate, a language for expressing updates to XML documents. The group has open discussion lists and the specification drafts are freely available. If you are interested in standardized XML database management, take a look at the group's work and encourage your favorite XML DBMS project to consider the specifications as they emerge.

Hacking from scratch

So, how do you begin storing and managing your XML documents persistently?

You can roll your own XML data store from many of the existing Linux tools. You can use DB, DBM, or GDBM, one or more of which should come with your Linux distribution. You could use the file's CRC or some other hash value as the key, or you could extract an element field from the XML document using a quick Perl or Python script.

PostgreSQL and MySQL can be turned into XML data stores using the techniques in the articles in the Resources section, or as simple large-object types.

Or you can just dump the XML files to disk, naming them with a hash or a given field you know will be unique. However, most people won't want to roll their own XML data store when so many specialized tools are available. In the following sections, we will look at some of the available tools for Linux.

Specialized XML data stores: Open source

First, the open source offerings -- you can find links to each product in the Resources section.

dbXML is an XML DBMS for Java that incorporates technologies like Simple Object Access Protocol (SOAP) for messaging middleware, XPath querying, and a rich Java API. dbXML's developers are founders of the XML:DB initiative.

hOpla is a rough-and-ready tool for defining mappings between XML data and PostgreSQL DBMS. An official GNU project, hOpla can set up table structures according to an XML file, and import and export data accordingly.

XML-DBMS was developed by Ronald Bourret, who also maintains the comprehensive list of XML database products that I mentioned earlier. XML-DBMS provides an object-relational mapping between XML and a relational database. The mapping is highly simplified from the DOM -- element types are classes (represented by relational tables), and attributes and character data are object attributes (columns). XML-DBMS provides modules for Java and Perl, but is not designed for standalone use. The main mapping object can be generated from a DTD or XSchema (the new XML schema format proposed by the W3C).

DB2XML maps JDBC ResultSets to DOM objects, and can trigger an XSLT transform for final presentation. It can be used standalone (including a GUI interface) or from other code using a simple API.

DBIx::XML_RDB, developed by Matt Sergeant, who develops many XML tools for Perl, is a Perl module that extracts XML from the results of database interface (DBI) queries. The output can be piped straight to XML::Parser, then to the many other XML modules that feed from XML::Parser.

My company, Fourthought, develops 4Suite, which contains DbDOM, a persistent DOM implementation for Python. DbDOM documents can be stored natively to the filesystem, or to a variety of backends, through the 4ODS object database, including PostgreSQL, Oracle, or even DB and DBM. DbDOM is integrated with other 4Suite technologies, including XPath and XSLT, XPointer, XLink, and RDF.

We also develop 4Suite Server, an XML dataserver that provides integrated services on the core 4Suite tools. That includes document management and XML-based Web services based on HTTP, SOAP, or CORBA, distributed transactions and concurrency, users, and access control. It supports experimental WebDAV support and the draft of the XML:DB standard. The 4Suite integration allows -- among other things -- metadata using RDF, content integration using XLink, and transforms and presentation using XSLT, indexing, and rules.

Specialized XML data stores: Closed source

Rogue Wave Software's XML-DB Link is quite an interesting product. It is basically a network API to databases using SOAP. You design a set of SQL queries and register them with the server, then invoke those queries remotely through SOAP messages and get the result tuples in a simple XML format. XML-DB Link is currently in beta; you can download a 30-day free evaluation.

Since SOAP is considered a leading communications protocol in the emerging field of Web services, XML-DB Link is a good early platform for experimenting with a possible major growth area. Its practicality is questionable, though, given SOAP's recent emergence.

Birdstep is a DBMS with some unique features. First, it is designed to have a very small footprint. Also, Birdstep data can be viewed in relational, object-oriented, and hierarchical modes. It also has a built-in hashing system for stored text, which creates interesting possibilities for indexing and efficiency. Birdstep is very suitable for XML storage; indeed, it comes with tools that allow stored XML to be accessed using DOM or SAX APIs. Birdstep is available for free download.

Lore, a Stanford University project, originated as a semistructured DBMS, but has since been tailored into an XML data store. It has proprietary XML query, indexing, and application-support features. It is geared toward deployments of document-management systems that support a wide variety of interactive searches on evolving data.

GMD-IPSI XQL Engine, formerly PDOM, is a persistent implementation of Java DOM for storing XML documents of arbitrary size in ready form for development, query, and extraction. It supports the draft of the XML Query Language (XQL). Of course, XQL is rather shaky -- the working group has had difficulty making progress toward their deliverable specification -- but it's good to see some experimental adoption. The package is free for noncommercial use.

Software AG's Adabas broke new ground in Linux DBMS. The company has also proven to be a pioneer in XML DBMS with Tamino, an XML data store that manages XML in a hierarchical DBMS for improved efficiency over the relational and object-oriented models. The Linux version is supported under SuSE 7.0 for IBM 390 mainframes. An Intel/Linux version is expected soon. This is a first for me: seeing Linux software come out on the mainframe port before Intel.

OpenLink Software's Virtuoso is an impressive array of XML middleware technologies. It includes an XML data store, either native or through SQL mappings, XPath query, XSLT transforms, and text indexing. It can be accessed using SOAP or WebDAV, and supports Internet protocols like HTTP, SMTP, and NNTP. Virtuoso is a well-integrated device for deploying Web services. A free evaluation download for Linux and other platforms is available.

IBM offerings

If you use IBM DB2 Universal Database Version 7 on Linux (or other platforms), you have the XML Extender, which allows you to generate specially formatted XML documents from the results of SQL queries. It also supports general storage of XML documents and DTDs of any form. XML Extender uses DB2's object-relational features to manage XML in database columns. IBM provides a proprietary mechanism to index the documents based on the values of certain elements. Text-based indexing is available using the DB2 text extender, a related tool.

One neat feature of the XML Extender is that it can extract XML from the query results of any Open Database Connectivity (ODBC)-compliant DBMS. This can be a heterogeneous database integration tool for XML middleware. It includes XSLT and XPath transformation and result query tools for postprocessing after the SQL query. This allows one to use XSLT as a stored procedure language, to some extent.

IBM's XML Extender comes with administration tools similar to those provided for DB2 itself. Of course, it's hard to imagine finding wizards that can truly help manage XML data formats efficiently.

IBM seems to have something for every aspect of XML in its alphaWorks project. DatabaseDom lets you use a sample XML file to define a relational database structure, generate similar XML format from the database, and update or insert a database from XML. The functionality is encapsulated in a JavaBean that processes DOM objects, making it easy to integrate into Java, XML, and database projects.

XML Lightweight Extractor (XLE) is an alphaWorks project that accepts DTDs with special annotations that indicate the mapping of XML content model to relational tables, generating a mapping that can be used to map relational databases to XML instances using JDBC.

Oracle utilities

Oracle's XML SQL Utility (XSU) is a great tool if you already have an Oracle database, and you want to extract MOM data from normalized relational tables and perform table updates in the other direction. XSU maps the result of regular SQL queries into XML documents and inserts data from documents into the appropriate relational columns. This could be used for anything from importing data into other applications and DBMS to creating transaction logs suitable for human inspection. XSU also has an XSLT engine that can automatically apply transforms to the results of XML document generation. That could be the basis for a very flexible database reporting mechanism.

XSU is written in Java and works with Oracle on Linux. It can be used from Java and PL/SQL stored procedures, or directly from Java applets or server-side code. The XML output and input can be in string form or DOM form.

Oracle's utility is a handy tool for those using Oracle and XML in their applications. It's easy to set up (much easier than Oracle itself, to be sure) and is built into Oracle 8.1.7 and the coming 9i. The only problem is that it is strongly tied to the relational model. The elements in the XML documents are generated from table or view column names; similarly, element-type names are used to match columns for insertion. Some hierarchy can be contrived, but overall, the setup is somewhat alien to XML -- it might as well be a simple hierarchy file. The XSLT integration helps, but sometimes requires extra work in XSLT on what should be set up in the source document.

Oracle XSU is a free download if you have registered for the Oracle Technology Network. As usual, you can get the book you need from O'Reilly & Associates. Users of Oracle's XML tools should certainly get Steve Muench's well-written book on the subject. (See Resources.)

Conclusion

As you can see, there is no lack of options. Whether you're a poor college-student hacker or a well-funded organization that has made a sensible decision to develop software on Linux, there are plenty of open source and commercial tools for your XML storage needs. In fact, there are very few XML tools that do not have some kind of Linux port.

You can store documents as native XML, build XML from other forms of data, or even develop hybrid systems with POP data stored as native XML and MOM data extracted from database structure.

But XML is all about standards; be sure to demand standards support from your vendors or to contribute to the standards support of your favorite open source tool. The only downside of having all those choices is the potential difficulty of changing your mind once you've chosen.

More Stories By Uche Ogbuji

Uche Ogbuji is cofounder and CEO of Fourthought, Inc.,
a software vendor and consultancy specializing in XML
solutions. A computer engineer, he has worked with XML for several
years, codeveloping 4Suite, an open-source platform for XML
processing. A frequent conference speaker, Uche has authored a number
of articles on the practical use of XML.

Comments (2) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Jeff Mathews 09/09/03 11:59:28 AM EDT

By the way, Sleepycat has an XML store built on top of the Berkeley DB. I haven't used it (I'm only aware of C++ and Java bindings for it, so far), but it looks kind of interesting.

http://www.sleepycat.com/products/xml.shtml

alvaro cervantes 09/06/03 09:02:18 PM EDT

Very good articles, I enjoy them very much. I could not find article number one, Practical XML with Linux part 1.

Thanks, and good luck on everything

@ThingsExpo Stories
Things are changing so quickly in IoT that it would take a wizard to predict which ecosystem will gain the most traction. In order for IoT to reach its potential, smart devices must be able to work together. Today, there are a slew of interoperability standards being promoted by big names to make this happen: HomeKit, Brillo and Alljoyn. In his session at @ThingsExpo, Adam Justice, vice president and general manager of Grid Connect, will review what happens when smart devices don’t work togethe...
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
SYS-CON Events announced today that Technologic Systems Inc., an embedded systems solutions company, will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Technologic Systems is an embedded systems company with headquarters in Fountain Hills, Arizona. They have been in business for 32 years, helping more than 8,000 OEM customers and building over a hundred COTS products that have never been discontinued. Technologic Systems’ pr...
SYS-CON Events announced today that Auditwerx will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Auditwerx specializes in SOC 1, SOC 2, and SOC 3 attestation services throughout the U.S. and Canada. As a division of Carr, Riggs & Ingram (CRI), one of the top 20 largest CPA firms nationally, you can expect the resources, skills, and experience of a much larger firm combined with the accessibility and attent...
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
SYS-CON Events announced today that HTBase will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. HTBase (Gartner 2016 Cool Vendor) delivers a Composable IT infrastructure solution architected for agility and increased efficiency. It turns compute, storage, and fabric into fluid pools of resources that are easily composed and re-composed to meet each application’s needs. With HTBase, companies can quickly prov...
SYS-CON Events announced today that Loom Systems will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2015, Loom Systems delivers an advanced AI solution to predict and prevent problems in the digital business. Loom stands alone in the industry as an AI analysis platform requiring no prior math knowledge from operators, leveraging the existing staff to succeed in the digital era. With offices in S...
Buzzword alert: Microservices and IoT at a DevOps conference? What could possibly go wrong? In this Power Panel at DevOps Summit, moderated by Jason Bloomberg, the leading expert on architecting agility for the enterprise and president of Intellyx, panelists peeled away the buzz and discuss the important architectural principles behind implementing IoT solutions for the enterprise. As remote IoT devices and sensors become increasingly intelligent, they become part of our distributed cloud enviro...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
SYS-CON Events announced today that Infranics will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Since 2000, Infranics has developed SysMaster Suite, which is required for the stable and efficient management of ICT infrastructure. The ICT management solution developed and provided by Infranics continues to add intelligence to the ICT infrastructure through the IMC (Infra Management Cycle) based on mathemat...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
In his General Session at 16th Cloud Expo, David Shacochis, host of The Hybrid IT Files podcast and Vice President at CenturyLink, investigated three key trends of the “gigabit economy" though the story of a Fortune 500 communications company in transformation. Narrating how multi-modal hybrid IT, service automation, and agile delivery all intersect, he will cover the role of storytelling and empathy in achieving strategic alignment between the enterprise and its information technology.
Microservices are a very exciting architectural approach that many organizations are looking to as a way to accelerate innovation. Microservices promise to allow teams to move away from monolithic "ball of mud" systems, but the reality is that, in the vast majority of organizations, different projects and technologies will continue to be developed at different speeds. How to handle the dependencies between these disparate systems with different iteration cycles? Consider the "canoncial problem" ...
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound e...
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Web Real-Time Communication APIs have quickly revolutionized what browsers are capable of. In addition to video and audio streams, we can now bi-directionally send arbitrary data over WebRTC's PeerConnection Data Channels. With the advent of Progressive Web Apps and new hardware APIs such as WebBluetooh and WebUSB, we can finally enable users to stitch together the Internet of Things directly from their browsers while communicating privately and securely in a decentralized way.
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.