Welcome!

Linux Containers Authors: Gerardo A Dada, JP Morgenthal, Liz McMillan, Sematext Blog, Carmen Gonzalez

Related Topics: @DevOpsSummit, Java IoT, Linux Containers, Agile Computing, @BigDataExpo

@DevOpsSummit: Blog Post

Application Failures in Production | @DevOpsSummit [DevOps]

The wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message

How to Approach Application Failures in Production

In my recent article, "Software Quality Metrics for your Continuous Delivery Pipeline - Part III - Logging," I wrote about the good parts and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: as a mechanism for analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programing, I want to demonstrate the wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message if you only are equipped with the magic ingredient; full transaction context:

Examples of insights you could obtain from full transaction context on a single log message

Bear with me until I get to explain what this actually means and how it helps you get almost immediate answers to the most urgent questions when your users are struck by an application failure:

  • "How many users are affected and who are they?"
  • "Which tiers are affected by which errors and what is the root cause?"

Operator: I'm here because you broke something. (courtesy of ThinkGeek.com)

When All You Have Is a Lousy Log Message
Does this story sound familiar to you? It's a Friday afternoon and you just received the release artifacts from the development team belatedly, which need to be released by Monday morning. After spending the night and another day in operations to get this release out into production timely, you notice the Monday after that everything you have achieved in the end was some lousy log message:

08:55:26 SEVERE com.company.product.login.LoginLogic - LoginException occurred when processing Login transaction

While this scenario hopefully does not reflect a common case for you, it still shows an important aspect in the life of development and operations: working as an operator involves monitoring the production environment and providing assistance in troubleshooting application failures mainly with the help of log messages - things that developers have baked into their code. While certainly not all log messages need to be as poor as this one, getting down to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly who sometimes needs weeks to get his "Eureka moment") - if at all possible.

Why There Is No Such Thing as a 100% Error-Free Code
Production failures can become a major pain for your business with long-term effects: they will not only make your visitors buy elsewhere, but depending on the level of frustrations, your customers may choose to stay at your competition instead of giving you another chance.

As we all know, we just cannot get rid of application failures in production entirely. Agile methodologies, such as Extreme Programming or Scrum, aim to build quality into our processes; however, there is still no such thing as a 100% error-free application. "We need to write more tests!" you may argue and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since they, if applied correctly, help you produce better code and fewer bugs. Still, it is simply impossible to test each and every corner of your application for all possible combinations of input parameters and application state. Essentially, we can run only a limited subset of all possible test scenarios. The common goal of developers and test automation engineers, hence, must be to implement a testing strategy, which allows them to deliver code of sufficient quality. Consequently, there is always a chance that something can go wrong, and, as a serious business, you will want to be prepared for the unpredictable and, additionally, have as much control over it as possible:

Why you cannot get rid of application failures in production: remaining failure probability

Without further ado, let's examine some precious out-of-the-box insights you could obtain if you are equipped with full transaction context and are able to capture all transactions.

Why this is important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users at all times whenever a failure occurred, solve problems faster, and additionally use the presented information such as unexpected input parameters to further improve your testing strategy.

Initial Situation: Aggregated Log Messages
Instead of crawling a bunch of possibly distributed log files to determine the count of particular log messages, we may, first of all, want to have this done automatically for us just as they happen. This gives a good overview on the respective message frequencies and facilitates prioritization:

Aggregated log events: severity, logger name, message and count

What we see here (analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We could also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).

Question #1: How many users are affected and who are they?

Failed Business Transactions: "Logins" and "Logins by Username"

Having the full transactional context and not just the log message allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that "Logins" and "Logins by Username" have failed: we see that 61 users attempted the 104 logins and who these users were by their username.

For questions 2 and 3, and for further insight, click here for the full article.

More Stories By Martin Etmajer

Leveraging his outstanding technical skills as a lead software engineer, Martin Etmajer has been a key contributor to a number of large-scale systems across a range of industries. He is as passionate about great software as he is about applying Lean Startup principles to the development of products that customers love.

Martin is a life-long learner who frequently speaks at international conferences and meet-ups. When not spending time with family, he enjoys swimming and Yoga. He holds a master's degree in Computer Engineering from the Vienna University of Technology, Austria, with a focus on dependable distributed real-time systems.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@ThingsExpo Stories
"We're a cybersecurity firm that specializes in engineering security solutions both at the software and hardware level. Security cannot be an after-the-fact afterthought, which is what it's become," stated Richard Blech, Chief Executive Officer at Secure Channels, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lead...
Video experiences should be unique and exciting! But that doesn’t mean you need to patch all the pieces yourself. Users demand rich and engaging experiences and new ways to connect with you. But creating robust video applications at scale can be complicated, time-consuming and expensive. In his session at @ThingsExpo, Zohar Babin, Vice President of Platform, Ecosystem and Community at Kaltura, discussed how VPaaS enables you to move fast, creating scalable video experiences that reach your aud...
"Once customers get a year into their IoT deployments, they start to realize that they may have been shortsighted in the ways they built out their deployment and the key thing I see a lot of people looking at is - how can I take equipment data, pull it back in an IoT solution and show it in a dashboard," stated Dave McCarthy, Director of Products at Bsquare Corporation, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
What happens when the different parts of a vehicle become smarter than the vehicle itself? As we move toward the era of smart everything, hundreds of entities in a vehicle that communicate with each other, the vehicle and external systems create a need for identity orchestration so that all entities work as a conglomerate. Much like an orchestra without a conductor, without the ability to secure, control, and connect the link between a vehicle’s head unit, devices, and systems and to manage the ...
An IoT product’s log files speak volumes about what’s happening with your products in the field, pinpointing current and potential issues, and enabling you to predict failures and save millions of dollars in inventory. But until recently, no one knew how to listen. In his session at @ThingsExpo, Dan Gettens, Chief Research Officer at OnProcess, discussed recent research by Massachusetts Institute of Technology and OnProcess Technology, where MIT created a new, breakthrough analytics model for ...
IoT is rapidly changing the way enterprises are using data to improve business decision-making. In order to derive business value, organizations must unlock insights from the data gathered and then act on these. In their session at @ThingsExpo, Eric Hoffman, Vice President at EastBanc Technologies, and Peter Shashkin, Head of Development Department at EastBanc Technologies, discussed how one organization leveraged IoT, cloud technology and data analysis to improve customer experiences and effici...
Everyone knows that truly innovative companies learn as they go along, pushing boundaries in response to market changes and demands. What's more of a mystery is how to balance innovation on a fresh platform built from scratch with the legacy tech stack, product suite and customers that continue to serve as the business' foundation. In his General Session at 19th Cloud Expo, Michael Chambliss, Head of Engineering at ReadyTalk, discussed why and how ReadyTalk diverted from healthy revenue and mor...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
The Internet of Things (IoT) promises to simplify and streamline our lives by automating routine tasks that distract us from our goals. This promise is based on the ubiquitous deployment of smart, connected devices that link everything from industrial control systems to automobiles to refrigerators. Unfortunately, comparatively few of the devices currently deployed have been developed with an eye toward security, and as the DDoS attacks of late October 2016 have demonstrated, this oversight can ...
As data explodes in quantity, importance and from new sources, the need for managing and protecting data residing across physical, virtual, and cloud environments grow with it. Managing data includes protecting it, indexing and classifying it for true, long-term management, compliance and E-Discovery. Commvault can ensure this with a single pane of glass solution – whether in a private cloud, a Service Provider delivered public cloud or a hybrid cloud environment – across the heterogeneous enter...
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
"Dice has been around for the last 20 years. We have been helping tech professionals find new jobs and career opportunities," explained Manish Dixit, VP of Product and Engineering at Dice, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Extracting business value from Internet of Things (IoT) data doesn’t happen overnight. There are several requirements that must be satisfied, including IoT device enablement, data analysis, real-time detection of complex events and automated orchestration of actions. Unfortunately, too many companies fall short in achieving their business goals by implementing incomplete solutions or not focusing on tangible use cases. In his general session at @ThingsExpo, Dave McCarthy, Director of Products...
"ReadyTalk is an audio and web video conferencing provider. We've really come to embrace WebRTC as the platform for our future of technology," explained Dan Cunningham, CTO of ReadyTalk, in this SYS-CON.tv interview at WebRTC Summit at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
The many IoT deployments around the world are busy integrating smart devices and sensors into their enterprise IT infrastructures. Yet all of this technology – and there are an amazing number of choices – is of no use without the software to gather, communicate, and analyze the new data flows. Without software, there is no IT. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, Dave McCarthy, Director of Products at Bsquare Corporation; Alan Williamson, Principal...
Businesses and business units of all sizes can benefit from cloud computing, but many don't want the cost, performance and security concerns of public cloud nor the complexity of building their own private clouds. Today, some cloud vendors are using artificial intelligence (AI) to simplify cloud deployment and management. In his session at 20th Cloud Expo, Ajay Gulati, Co-founder and CEO of ZeroStack, will discuss how AI can simplify cloud operations. He will cover the following topics: why clou...
WebRTC is the future of browser-to-browser communications, and continues to make inroads into the traditional, difficult, plug-in web communications world. The 6th WebRTC Summit continues our tradition of delivering the latest and greatest presentations within the world of WebRTC. Topics include voice calling, video chat, P2P file sharing, and use cases that have already leveraged the power and convenience of WebRTC.
"At ROHA we develop an app called Catcha. It was developed after we spent a year meeting with, talking to, interacting with senior citizens watching them use their smartphones and talking to them about how they use their smartphones so we could get to know their smartphone behavior," explained Dave Woods, Chief Innovation Officer at ROHA, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.