Welcome!

Linux Containers Authors: Ashish Nanjiani, Yeshim Deniz, Derek Weeks, Liz McMillan, Pat Romanski

Related Topics: Linux Containers

Linux Containers: Article

Practical XML with Linux, Part 3: XML database tools for Linux

Hierarchical, relational, and object databases

In previous articles (see Resources for links), I have discussed XML's value as a data format for Linux users. I have also discussed a wide variety of Linux tools for processing XML. But the data that XML usually represents often comes in forms that need to be managed on hard storage. Luckily, XML storage techniques and products have developed as much as other aspects of XML. We lucky Linux users have a wealth of choices available to us as we prepare to manage our XML data persistently.

Just a little theory

Although I used the word "practical" in the title of this article, a little background might be useful in choosing the best tools and techniques for managing your XML databases in Linux.

There are almost as many uses of XML as there are XML users, but there are only two ways of looking at how XML documents are organized. XML's roots lie in SGML, which was originally conceived as a way of structuring documents for machine preparation and human consumption. XML has inherited much of that bias toward documents, and is often used for presentation-oriented publishing (POP). Examples include books, slide presentations, and company Websites. POP formats tend to have elements and text that flow in a flexible and free-form manner.

XML has also gained popularity as the basis for data formats suitable for exchange between computer programs: consumed by machines but able to be inspected by humans. This is known as messaging-oriented middleware (MOM) because of its role in the infrastructure of applications. Examples include serialized objects, automated purchase orders, and Mozilla bookmark files. MOM formats tend to be highly regular, with elements making up well-defined fields with content according to strict data typing.

MOM and POP formats often impose different needs on XML databases, based on the differences in usage patterns and format. We will decide whether certain Linux database technologies are more appropriate for MOM or POP documents.

There are many ways of structuring databases. The relational model, used by well-known DBMSs like PostgreSQL and Oracle, is probably the most popular for new systems, but there are many other approaches. Databases can be:

  • Hash-based systems that store data as key/value pairs. That includes popular Unix database tools like Berkeley DB, DBM, and GDBM.
  • Hierarchical databases, once an especially popular format on mainframe and minicomputer systems, are represented in Linux by Adabas, which uses a very modern form of a hierarchical database.
  • Relational and object/relational databases store data in related tables that represent classes of entities. There are many examples for Linux, such as PostgreSQL, MySQL, Oracle, DB2, Informix, and Sybase.
  • Object databases store data as collections of object instances, using relationships based on object-oriented theory. Linux is poorly represented in this area.
  • Multi-dimensional databases generally model data relationships more richly than relational databases do. M3's Pick database for Linux is one example.
  • Semistructured databases are not as well-known as the above paradigms, and existed mostly in academia before XML's emergence. Semistructured databases support the notion that it is impractical to have a rigid schema for data that models the real world, given the fluidity of the real world. Many of its concepts are a natural fit for XML and related technologies like the Resource Description Framework (RDF).

There is a growing body of work on how to effectively manage XML data in hierarchical, relational, and object databases.

Hierarchical databases are often considered the most natural representation because XML itself is a hierarchy of elements and character data. Tamino, a hierarchical XML data store based on Adabas, offers some of the best performance for XML stored in native database format.

Developers sometimes store XML in relational DBMS by defining a column for each element type. There are other design patterns for that purpose as well. (See Resources for links.) The main problem with most RDBMS approaches is that they are so difficult to extend that they hamper flexibility. One can also just store the XML untouched as character large objects (CLOBs), but in this case, the RDBMS "degenerates" to a simple hashed database.

Object databases tend to store XML in a manner that approximates the document object model (DOM). That creates a persistent implementation of the various nodes that make up an XML document. This approach, as well as the DOM itself, is often considered to have excessive overhead. The eXcelon database is an example of an ODBMS XML store.

For more approaches and examples of XML databases, consult Ronald Bourret's excellent compilation on the subject. (See Resources.)

Toward an XML database standard

Databases must often span time, platforms, and application languages. Because data frequently must outlive code, standards are especially important in this area. As I have previously pointed out in this series, XML is built on standards; recently, there has been an effort to standardize XML databases as well. The XML:DB group is an initiative formed by organizations and individuals interested in XML database standardization.

XML:DB is working on a database connectivity API, a protocol for access control of XML document content, and XUpdate, a language for expressing updates to XML documents. The group has open discussion lists and the specification drafts are freely available. If you are interested in standardized XML database management, take a look at the group's work and encourage your favorite XML DBMS project to consider the specifications as they emerge.

Hacking from scratch

So, how do you begin storing and managing your XML documents persistently?

You can roll your own XML data store from many of the existing Linux tools. You can use DB, DBM, or GDBM, one or more of which should come with your Linux distribution. You could use the file's CRC or some other hash value as the key, or you could extract an element field from the XML document using a quick Perl or Python script.

PostgreSQL and MySQL can be turned into XML data stores using the techniques in the articles in the Resources section, or as simple large-object types.

Or you can just dump the XML files to disk, naming them with a hash or a given field you know will be unique. However, most people won't want to roll their own XML data store when so many specialized tools are available. In the following sections, we will look at some of the available tools for Linux.

Specialized XML data stores: Open source

First, the open source offerings -- you can find links to each product in the Resources section.

dbXML is an XML DBMS for Java that incorporates technologies like Simple Object Access Protocol (SOAP) for messaging middleware, XPath querying, and a rich Java API. dbXML's developers are founders of the XML:DB initiative.

hOpla is a rough-and-ready tool for defining mappings between XML data and PostgreSQL DBMS. An official GNU project, hOpla can set up table structures according to an XML file, and import and export data accordingly.

XML-DBMS was developed by Ronald Bourret, who also maintains the comprehensive list of XML database products that I mentioned earlier. XML-DBMS provides an object-relational mapping between XML and a relational database. The mapping is highly simplified from the DOM -- element types are classes (represented by relational tables), and attributes and character data are object attributes (columns). XML-DBMS provides modules for Java and Perl, but is not designed for standalone use. The main mapping object can be generated from a DTD or XSchema (the new XML schema format proposed by the W3C).

DB2XML maps JDBC ResultSets to DOM objects, and can trigger an XSLT transform for final presentation. It can be used standalone (including a GUI interface) or from other code using a simple API.

DBIx::XML_RDB, developed by Matt Sergeant, who develops many XML tools for Perl, is a Perl module that extracts XML from the results of database interface (DBI) queries. The output can be piped straight to XML::Parser, then to the many other XML modules that feed from XML::Parser.

My company, Fourthought, develops 4Suite, which contains DbDOM, a persistent DOM implementation for Python. DbDOM documents can be stored natively to the filesystem, or to a variety of backends, through the 4ODS object database, including PostgreSQL, Oracle, or even DB and DBM. DbDOM is integrated with other 4Suite technologies, including XPath and XSLT, XPointer, XLink, and RDF.

We also develop 4Suite Server, an XML dataserver that provides integrated services on the core 4Suite tools. That includes document management and XML-based Web services based on HTTP, SOAP, or CORBA, distributed transactions and concurrency, users, and access control. It supports experimental WebDAV support and the draft of the XML:DB standard. The 4Suite integration allows -- among other things -- metadata using RDF, content integration using XLink, and transforms and presentation using XSLT, indexing, and rules.

Specialized XML data stores: Closed source

Rogue Wave Software's XML-DB Link is quite an interesting product. It is basically a network API to databases using SOAP. You design a set of SQL queries and register them with the server, then invoke those queries remotely through SOAP messages and get the result tuples in a simple XML format. XML-DB Link is currently in beta; you can download a 30-day free evaluation.

Since SOAP is considered a leading communications protocol in the emerging field of Web services, XML-DB Link is a good early platform for experimenting with a possible major growth area. Its practicality is questionable, though, given SOAP's recent emergence.

Birdstep is a DBMS with some unique features. First, it is designed to have a very small footprint. Also, Birdstep data can be viewed in relational, object-oriented, and hierarchical modes. It also has a built-in hashing system for stored text, which creates interesting possibilities for indexing and efficiency. Birdstep is very suitable for XML storage; indeed, it comes with tools that allow stored XML to be accessed using DOM or SAX APIs. Birdstep is available for free download.

Lore, a Stanford University project, originated as a semistructured DBMS, but has since been tailored into an XML data store. It has proprietary XML query, indexing, and application-support features. It is geared toward deployments of document-management systems that support a wide variety of interactive searches on evolving data.

GMD-IPSI XQL Engine, formerly PDOM, is a persistent implementation of Java DOM for storing XML documents of arbitrary size in ready form for development, query, and extraction. It supports the draft of the XML Query Language (XQL). Of course, XQL is rather shaky -- the working group has had difficulty making progress toward their deliverable specification -- but it's good to see some experimental adoption. The package is free for noncommercial use.

Software AG's Adabas broke new ground in Linux DBMS. The company has also proven to be a pioneer in XML DBMS with Tamino, an XML data store that manages XML in a hierarchical DBMS for improved efficiency over the relational and object-oriented models. The Linux version is supported under SuSE 7.0 for IBM 390 mainframes. An Intel/Linux version is expected soon. This is a first for me: seeing Linux software come out on the mainframe port before Intel.

OpenLink Software's Virtuoso is an impressive array of XML middleware technologies. It includes an XML data store, either native or through SQL mappings, XPath query, XSLT transforms, and text indexing. It can be accessed using SOAP or WebDAV, and supports Internet protocols like HTTP, SMTP, and NNTP. Virtuoso is a well-integrated device for deploying Web services. A free evaluation download for Linux and other platforms is available.

IBM offerings

If you use IBM DB2 Universal Database Version 7 on Linux (or other platforms), you have the XML Extender, which allows you to generate specially formatted XML documents from the results of SQL queries. It also supports general storage of XML documents and DTDs of any form. XML Extender uses DB2's object-relational features to manage XML in database columns. IBM provides a proprietary mechanism to index the documents based on the values of certain elements. Text-based indexing is available using the DB2 text extender, a related tool.

One neat feature of the XML Extender is that it can extract XML from the query results of any Open Database Connectivity (ODBC)-compliant DBMS. This can be a heterogeneous database integration tool for XML middleware. It includes XSLT and XPath transformation and result query tools for postprocessing after the SQL query. This allows one to use XSLT as a stored procedure language, to some extent.

IBM's XML Extender comes with administration tools similar to those provided for DB2 itself. Of course, it's hard to imagine finding wizards that can truly help manage XML data formats efficiently.

IBM seems to have something for every aspect of XML in its alphaWorks project. DatabaseDom lets you use a sample XML file to define a relational database structure, generate similar XML format from the database, and update or insert a database from XML. The functionality is encapsulated in a JavaBean that processes DOM objects, making it easy to integrate into Java, XML, and database projects.

XML Lightweight Extractor (XLE) is an alphaWorks project that accepts DTDs with special annotations that indicate the mapping of XML content model to relational tables, generating a mapping that can be used to map relational databases to XML instances using JDBC.

Oracle utilities

Oracle's XML SQL Utility (XSU) is a great tool if you already have an Oracle database, and you want to extract MOM data from normalized relational tables and perform table updates in the other direction. XSU maps the result of regular SQL queries into XML documents and inserts data from documents into the appropriate relational columns. This could be used for anything from importing data into other applications and DBMS to creating transaction logs suitable for human inspection. XSU also has an XSLT engine that can automatically apply transforms to the results of XML document generation. That could be the basis for a very flexible database reporting mechanism.

XSU is written in Java and works with Oracle on Linux. It can be used from Java and PL/SQL stored procedures, or directly from Java applets or server-side code. The XML output and input can be in string form or DOM form.

Oracle's utility is a handy tool for those using Oracle and XML in their applications. It's easy to set up (much easier than Oracle itself, to be sure) and is built into Oracle 8.1.7 and the coming 9i. The only problem is that it is strongly tied to the relational model. The elements in the XML documents are generated from table or view column names; similarly, element-type names are used to match columns for insertion. Some hierarchy can be contrived, but overall, the setup is somewhat alien to XML -- it might as well be a simple hierarchy file. The XSLT integration helps, but sometimes requires extra work in XSLT on what should be set up in the source document.

Oracle XSU is a free download if you have registered for the Oracle Technology Network. As usual, you can get the book you need from O'Reilly & Associates. Users of Oracle's XML tools should certainly get Steve Muench's well-written book on the subject. (See Resources.)

Conclusion

As you can see, there is no lack of options. Whether you're a poor college-student hacker or a well-funded organization that has made a sensible decision to develop software on Linux, there are plenty of open source and commercial tools for your XML storage needs. In fact, there are very few XML tools that do not have some kind of Linux port.

You can store documents as native XML, build XML from other forms of data, or even develop hybrid systems with POP data stored as native XML and MOM data extracted from database structure.

But XML is all about standards; be sure to demand standards support from your vendors or to contribute to the standards support of your favorite open source tool. The only downside of having all those choices is the potential difficulty of changing your mind once you've chosen.

More Stories By Uche Ogbuji

Uche Ogbuji is cofounder and CEO of Fourthought, Inc.,
a software vendor and consultancy specializing in XML
solutions. A computer engineer, he has worked with XML for several
years, codeveloping 4Suite, an open-source platform for XML
processing. A frequent conference speaker, Uche has authored a number
of articles on the practical use of XML.

Comments (2) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Jeff Mathews 09/09/03 11:59:28 AM EDT

By the way, Sleepycat has an XML store built on top of the Berkeley DB. I haven't used it (I'm only aware of C++ and Java bindings for it, so far), but it looks kind of interesting.

http://www.sleepycat.com/products/xml.shtml

alvaro cervantes 09/06/03 09:02:18 PM EDT

Very good articles, I enjoy them very much. I could not find article number one, Practical XML with Linux part 1.

Thanks, and good luck on everything

@ThingsExpo Stories
SYS-CON Events announced today that MobiDev, a client-oriented software development company, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex software systems for startups and enterprises. Since 2009 it has grown from a small group of passionate engineers and business...
SYS-CON Events announced today that GrapeUp, the leading provider of rapid product development at the speed of business, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Grape Up is a software company, specialized in cloud native application development and professional services related to Cloud Foundry PaaS. With five expert teams that operate in various sectors of the market acr...
SYS-CON Events announced today that Ayehu will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara California. Ayehu provides IT Process Automation & Orchestration solutions for IT and Security professionals to identify and resolve critical incidents and enable rapid containment, eradication, and recovery from cyber security breaches. Ayehu provides customers greater control over IT infras...
SYS-CON Events announced today that Datanami has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datanami is a communication channel dedicated to providing insight, analysis and up-to-the-minute information about emerging trends and solutions in Big Data. The publication sheds light on all cutting-edge technologies including networking, storage and applications, and the...
Artificial intelligence, machine learning, neural networks. We’re in the midst of a wave of excitement around AI such as hasn’t been seen for a few decades. But those previous periods of inflated expectations led to troughs of disappointment. Will this time be different? Most likely. Applications of AI such as predictive analytics are already decreasing costs and improving reliability of industrial machinery. Furthermore, the funding and research going into AI now comes from a wide range of com...
SYS-CON Events announced today that EnterpriseTech has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. EnterpriseTech is a professional resource for news and intelligence covering the migration of high-end technologies into the enterprise and business-IT industry, with a special focus on high-tech solutions in new product development, workload management, increased effi...
In this presentation, Striim CTO and founder Steve Wilkes will discuss practical strategies for counteracting fraud and cyberattacks by leveraging real-time streaming analytics. In his session at @ThingsExpo, Steve Wilkes, Founder and Chief Technology Officer at Striim, will provide a detailed look into leveraging streaming data management to correlate events in real time, and identify potential breaches across IoT and non-IoT systems throughout the enterprise. Strategies for processing massive ...
SYS-CON Events announced today that Conference Guru has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. A valuable conference experience generates new contacts, sales leads, potential strategic partners and potential investors; helps gather competitive intelligence and even provides inspiration for new products and services. Conference Guru works with conference organi...
SYS-CON Events announced today that Cloud Academy named "Bronze Sponsor" of 21st International Cloud Expo which will take place October 31 - November 2, 2017 at the Santa Clara Convention Center in Santa Clara, CA. Cloud Academy is the industry’s most innovative, vendor-neutral cloud technology training platform. Cloud Academy provides continuous learning solutions for individuals and enterprise teams for Amazon Web Services, Microsoft Azure, Google Cloud Platform, and the most popular cloud com...
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
SYS-CON Events announced today that TMC has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo and Big Data at Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Global buyers rely on TMC’s content-driven marketplaces to make purchase decisions and navigate markets. Learn how we can help you reach your marketing goals.
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
"MobiDev is a Ukraine-based software development company. We do mobile development, and we're specialists in that. But we do full stack software development for entrepreneurs, for emerging companies, and for enterprise ventures," explained Alan Winters, U.S. Head of Business Development at MobiDev, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
SYS-CON Events announced today that Telecom Reseller has been named “Media Sponsor” of SYS-CON's 21st International Cloud Expo, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.
We build IoT infrastructure products - when you have to integrate different devices, different systems and cloud you have to build an application to do that but we eliminate the need to build an application. Our products can integrate any device, any system, any cloud regardless of protocol," explained Peter Jung, Chief Product Officer at Pulzze Systems, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA
SYS-CON Events announced today that IBM has been named “Diamond Sponsor” of SYS-CON's 21st Cloud Expo, which will take place on October 31 through November 2nd 2017 at the Santa Clara Convention Center in Santa Clara, California.
Multiple data types are pouring into IoT deployments. Data is coming in small packages as well as enormous files and data streams of many sizes. Widespread use of mobile devices adds to the total. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists looked at the tools and environments that are being put to use in IoT deployments, as well as the team skills a modern enterprise IT shop needs to keep things running, get a handle on all this data, and deliver...
The current age of digital transformation means that IT organizations must adapt their toolset to cover all digital experiences, beyond just the end users’. Today’s businesses can no longer focus solely on the digital interactions they manage with employees or customers; they must now contend with non-traditional factors. Whether it's the power of brand to make or break a company, the need to monitor across all locations 24/7, or the ability to proactively resolve issues, companies must adapt to...
SYS-CON Events announced today that Enzu will exhibit at SYS-CON's 21st Int\ernational Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive advantage. By offering a suite of proven hosting and management services, Enzu wants companies to focus on the core of their ...
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), provided an overview of various initiatives to certify the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldwide re...