One of the recent trends in software development is DevOps. The DevOps methodology combines software coding and deployment to achieve simple, frequent and safe upgrades while allowing fast and efficient analysis of defects in the running system. One of the best ways to determine the cause of an application failure, or even to proactively spot symptoms of an approaching failure, is to monitor the whole system at all times. However, this requires choosing the right infrastructure and tools so that the state of the system can be easily observed in real time. These issues were addressed during the Enterprise Tech Solutions group meet-up that took place on September 26th at the office of ING Innovation Lab on Sokolska Street in Katowice.
The meeting started with a presentation given by Mateusz Beczek, titled “When the monitoring system is sleeping, failures are awoken: a case study”. The speaker named code bugs, insufficient test coverage, work on the production infrastructure, human errors and momentary system load surges as the most common causes of failures. Failures can take different forms: the system can slow down, some functionality can stop working, and eventually the whole system can stop responding. It may not be obvious, but even planned unavailability is an unwanted situation that can be considered a failure. Users of the system should therefore be notified before every maintenance break.
There is a difference between the terms “incident” and “problem”. An incident is any disruption in the services provided by an application. Restoring the service resolves an incident even if its root cause remains unknown. A problem is the root cause itself, which in most cases is initially unknown. Once enough similar incidents have been diagnosed and resolved, a support team can determine the root cause, remove it and prevent subsequent incidents of the same type from happening.
An incident can be raised by an automatic system or by a user. Regardless of the source, opening a new incident starts a race against time. First, the priority of the incident is agreed with the system's end user and the product owner. Then a team is formed to resolve the incident. If the issue seems to have a broader scope, the team is extended with members of other application teams. Once the issue is resolved, a post-mortem analysis is performed to determine its root cause.
With the issue successfully resolved, it is key to document all actions taken and all suspected causes of the problem. This documentation will help the team should a similar issue emerge. Additionally, suggestions for future improvements should be noted down and forwarded to the relevant teams so that such incidents do not happen again or are at least easier to debug.
With this broad theoretical introduction delivered, the speaker presented some real-life examples illustrating the application of these techniques. The first was a case of implementing the wrong design pattern. Instead of being a prototype with multiple instances created and destroyed, an object was created as a singleton. This prevented the object from freeing the resources it allocated during its lifetime and caused the application to slow down over time. Emergency measures such as shutting down some application modules helped for a moment, but the resources they freed were almost immediately used up by the leak. Only after a thorough analysis was the root cause of the issue found, a fix applied and deployed and, most importantly, new monitoring metrics configured so that no resource leak would go unnoticed in future. The conclusion was that a lack of proper resource monitoring can let a slight slowdown escalate into a failure while preventing the team from quickly locating the cause of the issue.
A lack of a proper monitoring mechanism was also the cause of the second case presented. An application update performed in one of two datacenters ended in a “silent failure”, with the updated instance being down and all requests being processed by the second datacenter. However, the second datacenter was taken offline shortly afterwards for maintenance and the whole service became unresponsive. Had the number of active application instances been monitored, the missing instance would have been noticed before the service became completely unavailable.
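A check of this kind can be very small. The sketch below is an illustration only: `InstanceStatus`, `check_instance_count` and the datacenter names are hypothetical, and a real deployment would obtain the health flags from its own health endpoints or orchestration layer.

```python
from dataclasses import dataclass


@dataclass
class InstanceStatus:
    datacenter: str
    healthy: bool


def check_instance_count(statuses, expected):
    """Return an alert message when fewer instances than expected are healthy."""
    healthy = [s.datacenter for s in statuses if s.healthy]
    if len(healthy) < expected:
        names = ", ".join(healthy) or "none"
        return f"ALERT: {len(healthy)}/{expected} instances healthy ({names})"
    return None


# State after the failed update: dc1 is silently down, dc2 still serves traffic.
alert = check_instance_count(
    [InstanceStatus("dc1", False), InstanceStatus("dc2", True)],
    expected=2,
)
```

Raising the alert while one instance is still serving traffic leaves time to react before the remaining datacenter goes into maintenance.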
In some cases, however, the monitoring system itself can start malfunctioning. A case was presented in which an increase in throughput overloaded the Splunk monitoring cluster responsible for data acquisition, aggregation and analysis. The cluster started skipping some events, impairing monitoring effectiveness and preventing subsequent application failures from being discovered and reported. Team members dispatched to fix these issues could not rely on monitoring information and were forced to report incidents and analyse monitoring data manually. The conclusion was that the monitoring system may itself need proper capacity monitoring.
The second presentation, given by Arkadiusz Szewczyk and titled “How to build a log analysing rocket”, continued the theme. The speaker described the architecture of the previously mentioned Splunk monitoring system. Events reported by applications contain information on the source of the event, the state of the monitored module and a suggested action. For instance, an HTTP server experiencing a prolonged increase in CPU usage may report an event containing its IP address, its average CPU usage over the past few minutes and a request for an incident to be reported, so that an administrator notices and investigates the issue. Subsequent events can update or revoke incidents, and each incident can be associated with multiple source events. Even the monitoring system itself can report events on its state.
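The event and incident relationship described above can be sketched as a small data model. This is not Splunk's schema; the names `Event`, `Incident`, `Action`, `apply_event` and the field `incident_key` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    OPEN = "open"
    UPDATE = "update"
    REVOKE = "revoke"


@dataclass
class Event:
    source: str        # e.g. the IP address of the reporting server
    state: dict        # e.g. {"avg_cpu_5m": 97.0}
    action: Action     # what the reporter suggests doing
    incident_key: str  # hypothetical key tying events to one incident


@dataclass
class Incident:
    key: str
    is_open: bool = True
    events: list = field(default_factory=list)


def apply_event(incidents, event):
    """Attach an event to its incident and apply the suggested action."""
    incident = incidents.get(event.incident_key)
    if incident is None:
        if event.action is Action.REVOKE:
            return  # nothing to revoke
        incident = incidents[event.incident_key] = Incident(event.incident_key)
    incident.events.append(event)
    if event.action is Action.REVOKE:
        incident.is_open = False
    elif event.action is Action.OPEN:
        incident.is_open = True


incidents = {}
apply_event(incidents, Event("10.0.0.5", {"avg_cpu_5m": 97.0}, Action.OPEN, "cpu-10.0.0.5"))
apply_event(incidents, Event("10.0.0.5", {"avg_cpu_5m": 99.0}, Action.UPDATE, "cpu-10.0.0.5"))
apply_event(incidents, Event("10.0.0.5", {"avg_cpu_5m": 35.0}, Action.REVOKE, "cpu-10.0.0.5"))
```

After these three events the single incident holds all of its source events and is closed again, mirroring the open, update and revoke flow from the talk.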
A serious difficulty lies in properly configuring the criteria that trigger actions. The system has to cope with both “false positives”, when a failure is reported even though the system state is still within bounds, and “false negatives”, when some functionality stops working but the monitoring system detects no issues. Additionally, a switch must be present to prevent maintenance work from causing a cascade of events triggering failure incidents.
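One common way to balance the two error types is to require a threshold to be exceeded for several consecutive samples, and to suppress alerts entirely during maintenance. The sketch below assumes this approach; the function name, its parameters and the sample values are illustrative and were not taken from the talk.

```python
def should_raise_incident(samples, threshold, min_consecutive, maintenance=False):
    """Raise only when `threshold` is exceeded `min_consecutive` times in a row."""
    if maintenance:
        # Maintenance switch: planned work must not open failure incidents.
        return False
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


spike = should_raise_incident([50, 99, 40, 60], threshold=90, min_consecutive=3)
sustained = should_raise_incident([50, 95, 96, 97], threshold=90, min_consecutive=3)
muted = should_raise_incident([50, 95, 96, 97], threshold=90, min_consecutive=3,
                              maintenance=True)
```

A lower `min_consecutive` reacts faster but produces more false positives, while a higher value delays detection and risks false negatives for short outages.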
To detect failures efficiently, the system has to process lots of information. However, as was mentioned, Splunk may have issues with properly processing events during extreme loads. These issues may be remedied by configuring two separate monitoring clusters, one that collects and stores information and one that processes queries. However, the most effective measure is to educate users not to create overcomplicated, frequently run queries that access large sets of data. Additionally, only data relevant to application state monitoring should be recorded, for even though Splunk is able to collect really large information sets, storing that much data in its database may prompt users to perform queries on everything that is available.
The last presentation, titled “How to use Elasticsearch and Kafka for monitoring purposes”, was given by Paweł Dąbek. The speaker presented an open-source alternative to Splunk consisting of Elasticsearch and Apache Kafka. Elasticsearch is a NoSQL database that supports distributed storage and processing. Though it needs enormous resources, mostly RAM, it is able to perform blindingly fast data analysis, including full-text searches. Due to its architecture it also provides high availability and resistance to failures. Kafka, in turn, provides an efficient mechanism for data distribution and can split, aggregate and process data streams in a way that is transparent to software endpoints.
In most cases, data streams coming from multiple sources are redirected to a Kafka server that preprocesses and buffers them so that the target database is not overloaded. Input data can be collected by the Elastic Stack's own acquisition modules, called Beats, or sent by the monitored applications themselves using additional libraries such as an APM agent. The database content can then be analysed either manually or with dashboard tools such as Kibana or Grafana. It is also possible to access the data remotely using Elasticsearch's REST interface.
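The buffering idea can be illustrated without any of the real components. The sketch below uses an in-memory queue as a stand-in for Kafka and a plain list as a stand-in for the database; `EventBuffer`, `drain` and the batch size are hypothetical, and none of this is the Kafka or Elasticsearch client API. It only shows how a buffer lets the consumer write in bounded batches regardless of how quickly producers publish.

```python
from collections import deque


class EventBuffer:
    """In-memory stand-in for a message broker topic."""

    def __init__(self):
        self._queue = deque()

    def publish(self, event):
        self._queue.append(event)

    def poll(self, max_records):
        # Hand out at most `max_records` events, oldest first.
        batch = []
        while self._queue and len(batch) < max_records:
            batch.append(self._queue.popleft())
        return batch


def drain(buffer, store, batch_size):
    """Move buffered events into `store` in batches; return the number of writes."""
    writes = 0
    while True:
        batch = buffer.poll(batch_size)
        if not batch:
            return writes
        store.extend(batch)  # one bulk write per batch in a real system
        writes += 1


buffer = EventBuffer()
for i in range(10):
    buffer.publish({"seq": i})

store = []
write_count = drain(buffer, store, batch_size=4)
```

Ten published events reach the store in three bulk writes instead of ten individual ones, which is the load-smoothing effect the buffer is there for.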
Elasticsearch's primary advantages are its performance and completeness. With no custom configuration needed, it can store and process data sets up to 100 TB in size. There are also ready-made Beats available that can automatically forward operating system and application container logs and key metrics to Elasticsearch. Also, unlike simple solutions such as MRTG, Elasticsearch always processes the whole data set with no averaging. This prevents extreme values from being lost, so long-term charts show all peaks.
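Why averaging hides peaks is easy to show with a few numbers. The following sketch uses made-up CPU samples and a hypothetical helper, `downsample_average`, that mimics the kind of bucket averaging a simple tool applies to older data.

```python
def downsample_average(samples, bucket):
    """Replace each run of `bucket` samples with its mean."""
    averaged = []
    for start in range(0, len(samples), bucket):
        chunk = samples[start:start + bucket]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Made-up CPU usage samples in percent, containing one short spike.
raw = [10, 12, 11, 95, 10, 11, 12, 10]
averaged = downsample_average(raw, bucket=4)

peak_in_raw = max(raw)
peak_in_averaged = max(averaged)
```

The spike of 95 in the raw data shrinks to 32.0 once the samples are averaged in buckets of four, so a chart built from the averaged series would not show the peak at all.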
Elasticsearch can also be customized to monitor the performance of a development team. Information on commits merged to repository branches, application deployment errors or successfully resolved incidents can be stored in the database and analysed. The results of these analyses can then be used to identify insufficient employee training or a lack of human or hardware resources.
The presentation concluded with the suggestion that a fully functional monitoring system could help with the transition from reactive to proactive issue management. With enough data collected, not only can in-progress failures be detected, but some future incidents can also be predicted. For instance, a steady increase in RAM and CPU usage or in application request processing time may suggest that the code should be optimized or the application should be migrated to more capable hardware.
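As a rough illustration of such a prediction, a straight line can be fitted to recent usage samples and extrapolated to a limit. This is a minimal sketch assuming linear growth; the function names, the sample data and the 90% limit are all made up, and real usage patterns are rarely this regular.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(days, usage, limit):
    """Estimate days from the last sample until `usage` reaches `limit`."""
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None  # usage is flat or falling; no crossing predicted
    return max(0.0, (limit - intercept) / slope - days[-1])


days = [0, 1, 2, 3, 4]
ram_percent = [60, 62, 64, 66, 68]  # made-up daily RAM usage
remaining = days_until_limit(days, ram_percent, limit=90)
```

With usage growing by two percentage points per day, the sketch estimates about 11 days before the 90% limit is reached, which is the sort of early warning that allows an incident to be prevented rather than resolved.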
The meeting concluded with a short discussion, mostly revolving around the topic of maintaining high performance of an application monitoring system. The organizers mentioned that the system deployed at ING can already deal well with system capability monitoring and analysis but still needs some tuning when it comes to automated defect detection. An application availability level of 99.8% (which translates to roughly 17.5 hours of downtime annually) definitely needs some improvement.
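The downtime figure follows directly from the availability percentage. The short sketch below shows the arithmetic, assuming a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Hours per year during which the service is unavailable."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


downtime_998 = annual_downtime_hours(99.8)
downtime_999 = annual_downtime_hours(99.9)
```

For 99.8% availability this gives about 17.5 hours per year, and raising availability to 99.9% would halve it to under 9 hours.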
The meeting was very well organized. The presentations were well coordinated in time, and when the discussion was closed all guests were offered fresh hot pizza, so that further talks and networking could take place.