By Takashi Ikebe, Masahiko Uchiyama
March 18, 2006 02:00 PM EST
Some computing systems require high availability, and telecommunication systems are a good example. They must provide service 24 hours a day, 365 days a year, and their downtime should not exceed about five minutes per year, including hardware and software upgrades. These systems require carrier-grade reliability that guarantees high service availability: 99.999% uptime or higher.
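The link between "99.999% uptime" and "five minutes per year" is simple arithmetic. The following is a minimal sketch of that calculation; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
# Downtime budget implied by an availability target (illustrative arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(target):.2f} minutes per year")
```

At 99.999% the budget comes to roughly 5.26 minutes per year, which leaves very little room for a restart on every patch.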
To satisfy high-availability requirements, telecom systems have traditionally used special-purpose operating systems, some of them proprietary or developed in-house. As the telecommunication world moves towards Linux for mission-critical systems, demanding new requirements are being imposed on the operating system. However, Linux was designed to work best on desktop and enterprise systems, and it lacks mechanisms and capabilities needed for mission-critical systems that run intense, complex workloads and must also handle highly confidential information. The OSDL Carrier-Grade Linux (CGL) working group is addressing these gaps by creating the CGL requirement definition documents and supporting the creation of Open Source projects that implement them.
Software developers usually provide patches to fix bugs, enhance existing capabilities, or add new capabilities. The intervals between software updates are getting shorter as software grows in size and complexity, and the number of patches is on the rise. Normally, the process (or the service) has to be restarted to apply these patches, and sometimes the operating system has to be rebooted. The program itself can't be modified without being stopped because it's loaded in the process memory space, which only the process can access. Restarting a process or service takes a few seconds in some cases and a few minutes in others, and the services it offers aren't available during the restart. There are special software programs that can modify themselves and their functions via a defined interface. However, most software can't.
What Is Live Patching?
Live patching is one of the capabilities in version 3.1 of the CGL requirement definition document released in June 2005. This feature enables a process to modify its functions without restarting, a much-needed capability for telecommunication systems that are expected to be continuously in service.
One approach to achieving live patching is to overwrite the entry point of a function with a "jmp" instruction, which is the method applied by the PANNUS project. PANNUS enables the replacement of a function without restarting a process. This approach is very practical because most software programs are implemented as a collection of functions.
Live Patching Requirements
This section describes the requirements of live patching from the viewpoints of carrier service, software structure, and operating environment.
Real-Time Capability
Live patching has been used in the telecommunication industry for a long time. Customers expect that their voice and data services will always be available. To ensure service 24 hours a day, 365 days a year, it must be possible to maintain and expand service on running telecom systems without disruption. Typical telecommunication systems are constructed as redundant systems following, for example, the 1+1 redundancy model, where one server is active and handles service requests and the second is a hot standby for the first. Each server in such a configuration knows the status of the other through a "heartbeat" mechanism that sends keep-alive messages between the two servers. If both redundant servers fail, services are stopped and many customers are affected. Furthermore, once the service is no longer available, it takes a long time to recover and resume service because telecommunication switches are complex systems that consist of multiple components. So patches have to be applied without disrupting the service to the end users who subscribe to telephony services.
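The heartbeat mechanism in a 1+1 pair can be sketched as follows. This is a simplified illustration rather than how any particular telecom system implements it: the class name, the timeout value, and the decision to pass timestamps in explicitly are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Hot-standby side of a hypothetical 1+1 pair; all names are illustrative."""

    timeout_s: float          # silence longer than this means the peer is presumed down
    last_heartbeat_s: float   # time at which the last keep-alive arrived
    active: bool = False      # True once this server has taken over service

    def on_heartbeat(self, now_s: float) -> None:
        """Record a keep-alive message received from the active server."""
        self.last_heartbeat_s = now_s

    def poll(self, now_s: float) -> bool:
        """Take over if the active server has been silent for too long."""
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True
        return self.active


if __name__ == "__main__":
    monitor = StandbyMonitor(timeout_s=3.0, last_heartbeat_s=0.0)
    monitor.on_heartbeat(1.0)
    print(monitor.poll(2.0))   # peer heard from 1 second ago
    print(monitor.poll(4.5))   # peer silent for 3.5 seconds
```

A restart of the active server that outlasts the timeout looks the same to the standby as a failure, which is one reason restart-based patching is awkward in this model.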
The Limitations of Target Software
Developers can release several hundred software patches for each piece of software, including patches to the system's base software, e.g., the fundamental system software and libraries. If these patches are applied as source patches rather than through a live-patching mechanism, the processes have to be restarted frequently to enable the new patches and bug fixes. If these patches aren't applied quickly, the servers will encounter fatal errors or delays in adding features necessary to the service. So live patching should be applicable both to a customer's original software and to the generic, widely used fundamental system software. If the approach requires the target software to have a specific feature or to link to a particular library, achieving live patching is expensive, especially on large, complex systems.
Easy Operation
After a patch module is applied by live patching, the modified system must be monitored for a certain period of time to confirm that the patch is working properly. If a fatal problem occurs during that "trial" period, the activated patch must be deactivated immediately, re-checked, fixed, and re-applied. So to make operations easier, patch modules should have explicitly defined state transitions that can be cancelled.
In a typical operating environment, the person who applies the patch modules is a maintenance engineer, not the original developer of the patches. Maintenance engineers don't know much about the bug fixes themselves. If applying a live patch is too complicated, mistakes are unavoidable. So the act of live patching should be easy to do.
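One way to picture cancellable state transitions is as a small state machine. The sketch below is hypothetical: the state names and commands are invented for illustration and are not taken from PANNUS or the CGL documents.

```python
from enum import Enum


class PatchState(Enum):
    UNLOADED = "unloaded"    # patch module not present in the process
    LOADED = "loaded"        # module loaded, original functions still in use
    ACTIVE = "active"        # patched functions in use, trial period running
    COMMITTED = "committed"  # trial passed, patch accepted as permanent


# Allowed transitions; every state before COMMITTED has a path back to UNLOADED.
TRANSITIONS = {
    (PatchState.UNLOADED, "load"): PatchState.LOADED,
    (PatchState.LOADED, "activate"): PatchState.ACTIVE,
    (PatchState.ACTIVE, "deactivate"): PatchState.LOADED,   # cancel the trial
    (PatchState.LOADED, "unload"): PatchState.UNLOADED,
    (PatchState.ACTIVE, "commit"): PatchState.COMMITTED,
}


def apply(state: PatchState, command: str) -> PatchState:
    """Return the next state, rejecting commands that are invalid in `state`."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command!r} a patch in state {state.value!r}") from None


if __name__ == "__main__":
    state = PatchState.UNLOADED
    for command in ["load", "activate", "deactivate", "unload"]:
        state = apply(state, command)
        print(command, "->", state.value)
```

Keeping the table explicit means an operator tool can refuse an out-of-order command instead of leaving the process in an undefined state.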
The PANNUS Approach
PANNUS is a live-patching implementation that patches a running process by overwriting the entry point of a function with a "jmp" instruction that transfers control to the replacement function (see Figure 1).
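To make the idea concrete, the sketch below builds the bytes of such a jump. It assumes a 32-bit x86 near jump (opcode 0xE9 followed by a signed 32-bit offset measured from the end of the instruction); the article does not say which instruction form PANNUS emits, and the addresses and function name here are made up. Writing the bytes into the target process is not shown.

```python
import struct

JMP_REL32_OPCODE = 0xE9   # x86 "jmp rel32": one opcode byte plus a 32-bit offset
JMP_REL32_LENGTH = 5      # total size of the instruction in bytes


def encode_jmp_rel32(old_entry: int, new_entry: int) -> bytes:
    """Build the 5 bytes that redirect `old_entry` to `new_entry`.

    The offset is relative to the address of the instruction following the jmp.
    """
    offset = new_entry - (old_entry + JMP_REL32_LENGTH)
    if not -(2**31) <= offset < 2**31:
        raise ValueError("target is out of range for a rel32 jump")
    return bytes([JMP_REL32_OPCODE]) + struct.pack("<i", offset)


if __name__ == "__main__":
    # Hypothetical addresses of the original function and its replacement.
    patch = encode_jmp_rel32(0x08048400, 0x08049000)
    print(patch.hex(" "))
```

Because the jump occupies the first five bytes of the old function, a caller that enters the function after the patch is written lands in the replacement, while the rest of the old body is simply never reached.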
Outline of Processing
PANNUS uses a slightly different method for a process that handles C++ exceptions. The binary of such a process has sections such as ".eh_frame" or ".gcc_except_table" that hold information used for exception handling. With "jmp" overwriting alone, these sections aren't processed correctly. If an exception occurs, the process aborts because it can't find the information it needs to handle the exception. In this case, PANNUS makes the target process execute "dlopen", which is normally used to load additional shared libraries and which initializes the patch module as it is loaded. This approach is much more expensive than normal "mmap" loading because the process itself has to load the patch module through "dlopen" before executing its initialization.
Copyright © 2006 SYS-CON Media, Inc. — All Rights Reserved.
Takashi Ikebe is a senior open source development engineer with NTT Network Service Systems Laboratories. Within CGL, he participates in the Specifications Group.
Masahiko Uchiyama is a software developer in the PANNUS project. He is a chief system engineer as well as an assistant manager for NTT COMWARE. He's worked with IP telephony switches and related systems for six years. He currently lives in Chiba, Japan.