Data Warehouse Adoption of the Linux-Based Platform
A Study of Trends and Challenges
Data warehouse implementations represent one of the most challenging types of deployments for the enterprise. Several factors contribute to the challenge of deploying a successful data warehouse. Among these are large-scale and complex system configurations, sophisticated data modeling and analysis tools, and high visibility in a broad range of important business functions within the company.
Data warehouse workloads can serve as a litmus test to determine the enterprise readiness of a given deployment platform. For this reason it's interesting to determine how well Linux can support such challenging workloads. To that end I began a study, examining two interrelated aspects of enterprise readiness for a data warehouse on Linux:
- Is the solution stack supported on Linux?
- Are end-user companies actively deploying the stack to support their business needs?
To investigate this issue, I chose to work in cooperation with the Data Center Linux initiative at OSDL. Building on personal, practical experience with data warehouse deployments, I conducted an informal survey of the readiness of the Linux platform for this workload. This article is a summary of the findings of that survey.
Data Warehouse Solution Participants
The survey examined three types of participants in the data warehouse solution or ecosystem:
- Independent software vendors (ISV)
- Independent hardware vendors (IHV)
- End-user company deployments
Enough studies have been published showing that Linux is well accepted on a variety of industry-standard vendor platforms that my study took this base acceptance as an assumption. The focus of my study was instead on Linux readiness within the ISV and end-user communities.
I used Ralph Kimball's "High Level Warehouse Technical Architecture" as a reference for analysis and to provide common terminology for analysis of the solution stack. I broke down the list of vendors into "front room" and "back room" categories, based upon Kimball's architecture.
The study involved a total of 18 vendors. It's important to note that this roster was not hand-picked to showcase Linux usage. Rather, the list represented the dominant vendors, chosen based on experience with deployments at a number of large companies.
Study Results - Data Warehouse Trends
The study found reasonable overall support for Linux among the ISVs that make up the data warehouse solution, with 14 of 18 vendors offering some level of support for the open source OS. Within Kimball's technical architecture, the vendors supplying "front room" products predominantly hosted their offerings on client platforms, and they had weaker overall support for Linux than the "back room" vendors with products in areas such as extract, transform, and load (ETL) and database. Specifically, the ETL vendors tended to support one particular Linux distribution very well, while database vendors tended to support multiple Linux distributions.
The study further examined motivators and other issues driving (and inhibiting) Linux adoption and support by ISVs, with the following findings.
Motivators
- Market demand for the Linux platform
Issues
- How many and which distributions to support
- Differences in packages across distributions
- Lack of standardization among maintenance tools and lack of usability features
While issues exist with regard to supporting the Linux platform, a clear majority of ISVs within the data warehouse ecosystem felt that market demand was sufficiently compelling to deliver products for that platform.
In examining end-user company deployments, my study focused on companies whose data warehouse and/or data mart implementations would be considered medium-sized to large (i.e., total implementation data size of at least one terabyte), with a typical configuration of around 60 terabytes. These configurations shared some common themes:
- Overall configuration elements - medium to large data warehouse:
  - SAN disk - use of failover
  - Employ NFS
  - Use multiple file systems as well as raw disk partitions
  - Employ large file systems
  - Multi-CPU large servers dominant - use of partitioning
The study then surveyed a subset of the companies with data warehouse implementations in the target size range. Initially, a small sample of companies was chosen to limit the scope of the study and get a first picture of the deployments. In the future, the set of companies studied will be expanded.
Of the seven companies surveyed, the responses broke down as shown in Table 1.
The following is a summary of the issues and motivators for the three groups above.
Group 1
- While cost consolidation offers some potential motivation, there are significant inhibitors in the form of weak internal infrastructure to support Linux and the perceived immaturity of the platform.
Group 2
- Flexibility in choice of hardware platforms drove decisions to build a development environment as a first step toward evolving a mature support infrastructure for Linux.
- The primary inhibitor to moving to production was the lack of adequate support infrastructure within key ISVs for solutions on Linux.
Group 3
- Migration to Linux represented a strategic move to take advantage of the flexibility of deploying the hardware and software solutions that Linux provides.
- The primary production issue for IT infrastructure teams was providing the systems integration services needed to make such a demanding workload succeed, for example building customized monitoring scripts for the environment.
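To make the idea of a customized monitoring script concrete, here is a minimal sketch of the kind of check an infrastructure team might write for the file systems behind a warehouse. It is an illustration only, not something reported by the surveyed companies: the 85% threshold, the function names, and the mount point in the usage section are all assumptions.

```python
import shutil
from typing import Dict, List

# Hypothetical threshold: flag a file system once it is 85% full.
DEFAULT_THRESHOLD = 0.85


def usage_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo file systems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def check_mounts(paths: List[str], threshold: float = DEFAULT_THRESHOLD) -> Dict[str, float]:
    """Return {path: fraction_used} for every path at or above the threshold."""
    alerts = {}
    for path in paths:
        fraction = usage_fraction(path)
        if fraction >= threshold:
            alerts[path] = fraction
    return alerts


if __name__ == "__main__":
    # Hypothetical mount point; replace with the warehouse's own file systems.
    for path, fraction in check_mounts(["/"]).items():
        print(f"WARNING: {path} is {fraction:.0%} full")
```

A production script would typically run on a schedule and feed an alerting system rather than print to the console; those pieces are left out here to keep the sketch short.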
Based on the data above, the most important group to analyze in more detail was Group 1 because it was the dominant group. Moreover, I wanted to provide input to the OSDL Data Center Linux group regarding the strategic focus areas to drive acceptance of the data warehouse on Linux.
Group 1 reported the following motivations and issues in detail.
Motivators
- Cost consolidation
- H/W platform flexibility
- Low-cost clustering
- Consolidation of system administration skills
Issues
- Weak internal support for Linux infrastructure
- Lack of maturity of data warehouse solutions on Linux
  - Maturity defined: Referenceable and in production for at least one year
- Lack of acceptance of Linux within DW
  - Acceptance defined: Deployments within Fortune 100 companies
Conclusion
The overall conclusion drawn from this survey of the data warehouse and Linux was that the solution stack is sufficient to support the workload on Linux. However, the Linux support infrastructure is often not mature enough to handle the large, complex configurations and demanding workloads of data warehouses.
End-User Highlights
Some very specific findings emerged from the study with regard to end-user deployment:
- The majority of companies in Group 1 (no plans in the near future to migrate to Linux) will eventually move into Group 2 (development on Linux with a longer-term move to production). They fell into Group 1 because complexity, reliability, and scalability requirements proved too demanding for current deployments on Linux. Staffing and support issues were key inhibitors as well.
- Groups 2 and 3 featured early adopters who leveraged the availability of H/W, database, and ETL server solutions to enable successful deployment.
ISV Highlights
Similarly, salient ISV data emerged from the study:
- Market adoption of Linux in "back room" solutions is healthy and growing.
- Market adoption of Linux in "front room" solutions is measured, due to limitations in current ISV offerings and challenges for ISVs to support multiple Linux distributions.
- Opportunities exist for standardization across distributions (e.g., tools and packages) to support the ISV community.
The information from this study has been incorporated into the prioritization of requirements for the OSDL Data Center Linux initiative, especially within the database and data warehouse tier. The OSDL intends to expand upon this informal study in the future to continue bringing visibility to the needs of this and other critical Data Center workloads.
About Lynn de la Torre
Lynn de la Torre is a member of OSDL and coordinates the activities of the DCL Working Group. Lynn has thirty years of experience in the data center, and has worked in operations, system administration, database administration, and software development. Prior to joining OSDL, Lynn was a project manager for a large data warehouse implementation.