Enterprise Tech Solutions Meetup: setting up an efficient software monitoring system

One of the recent trends in software development is DevOps. The DevOps methodology combines software development and deployment to achieve simple, frequent, and safe upgrades, while at the same time allowing fast and efficient analysis of defects in the running system.

One of the best ways to determine the cause of an application failure, or even proactively detect symptoms of an approaching failure, is to monitor the whole system at all times. However, this requires choosing the right infrastructure and tools so that the state of the system can be easily observed in real time. These issues were addressed during the Enterprise Tech Solutions group meetup that took place on September 26th at the ING Innovation Lab office on Sokolska street in Katowice.

The meeting started with a presentation given by Mateusz Beczek, titled "When the monitoring system is sleeping, failures are awoken: a case study". As the most common causes of failures the speaker mentioned code bugs, suboptimal test coverage, working directly on production infrastructure, human error, and momentary surges in system load. Failures can take different forms: a slowdown can occur, some system functionality can stop working, and eventually the whole system can stop responding. It may not be obvious, but even planned unavailability is an unwanted situation that can be considered a failure. Users of the system should therefore be notified prior to all maintenance breaks.

There is a difference between the terms "incident" and "problem". An incident is any disruption in the services provided by an application; restoring service resolves an incident even if its root cause remains unknown. A problem is the root cause, which in most cases is initially unknown. Once enough similar incidents are diagnosed and resolved, a support team can determine the root cause, remove the issue, and prevent subsequent incidents of the same type from happening.

An incident can be raised by an automatic system or by a user. Regardless of the source, opening a new incident starts a race against time. First, the priority of the incident is agreed upon with the system's end users and the product owner. Then a team is formed to resolve the incident. If the issue appears to have a broader scope, the team is extended with members of other application teams. When the issue is resolved, a post-mortem analysis is performed to determine its root cause.

With the issue successfully resolved, it is key to document all actions taken and all suspected causes of the problem. This documentation will definitely help the team should a similar issue emerge. Additionally, suggestions for future improvements should be noted down and forwarded to the relevant teams so that such incidents do not happen again, or at least become easier to debug.

With this broad theoretical introduction delivered, the speaker presented some real-life examples illustrating the application of these techniques. The first was a case of implementing the wrong design pattern: instead of being a prototype, with multiple instances created and destroyed, an object was created as a singleton. This prevented the object from freeing the resources it allocated during its lifetime and caused the application to slow down over time. The emergency measure of shutting down some application modules would help for a moment, but the resources it freed were almost immediately consumed by the leak. Only after a thorough analysis was the root cause found, a fix applied and deployed, and, most importantly, new monitoring metrics configured so that no resource leak would go unnoticed in the future. The conclusion was that a lack of proper resource monitoring can let a slight slowdown escalate into a failure while preventing the team from quickly locating the cause of the issue.
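The failure mode described in the talk can be illustrated with a minimal, hypothetical sketch (the class and attribute names are invented for illustration): a singleton keeps accumulating per-request state for the lifetime of the process, whereas a fresh instance per request is reclaimed by the garbage collector after each use.

```python
class ReportBuilder:
    """Hypothetical service object that allocates a buffer per request."""
    _instance = None  # singleton storage

    def __init__(self):
        self._buffers = []  # grows forever if the object is never destroyed

    @classmethod
    def instance(cls):
        # Singleton accessor: the same object serves every request.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def handle_request(self, payload):
        self._buffers.append(bytearray(len(payload)))  # leak: never freed
        return len(self._buffers)

# With the singleton, per-request state accumulates across requests:
leaky_count = [ReportBuilder.instance().handle_request(b"data") for _ in range(3)]

# With a prototype-style fresh instance per request, state is short-lived
# and reclaimed after each call:
fresh_count = [ReportBuilder().handle_request(b"data") for _ in range(3)]

print(leaky_count)  # [1, 2, 3] — buffers pile up
print(fresh_count)  # [1, 1, 1] — each instance starts clean
```

A per-resource metric (live buffer count, RSS growth per hour) exposed to the monitoring system would make this kind of leak visible long before it degrades the whole application.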

A lack of a proper monitoring mechanism was also the cause of the second case presented. An application update performed in one of two datacenters ended in a "silent failure", with the instance being down and all requests being processed by the second datacenter. However, the second datacenter was taken offline shortly afterwards for maintenance, and the whole service became unresponsive. Had the number of active application instances been monitored, the loss of one of them would have been noticed before complete service unavailability.

In some cases, however, the monitoring system itself can start malfunctioning. A case was presented in which an increase in throughput overloaded the Splunk monitoring cluster responsible for data acquisition, aggregation, and analysis. The cluster started skipping some events, impairing monitoring effectiveness and preventing subsequent application failures from being discovered and reported. Team members dispatched to fix these issues could not rely on monitoring information and were forced to report incidents and analyse monitoring data manually. The conclusion was that the monitoring system may need proper capacity monitoring itself.

The second presentation, given by Arkadiusz Szewczyk and titled "How to build a log analysing rocket", continued the theme. The speaker described the architecture of the previously mentioned Splunk monitoring system. Events reported by applications contain information on the source of the event, the state of the monitored module, and a suggested action. For instance, an HTTP server experiencing a prolonged increase in CPU usage may report an event containing its IP address, the average CPU usage over the past few minutes, and a request for an incident to be opened, so that an administrator will notice and investigate the issue. Subsequent events can update or revoke incidents, and each incident can be associated with multiple source events. Even the monitoring system itself can report events on its own state.
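The CPU-usage event described above might look roughly like this. This is a hypothetical JSON shape sketched for illustration, not Splunk's actual event schema:

```python
import json

# Hypothetical event reported by an overloaded HTTP server:
# source, state of the monitored module, and a suggested action.
event = {
    "source": {"host": "10.0.12.7", "module": "http-server"},
    "state": {"cpu_avg_5min_pct": 93.5},
    "suggested_action": "open_incident",  # ask an administrator to investigate
}

# Serialized for shipment to the monitoring cluster.
payload = json.dumps(event)
print(payload)
```

A later event from the same host could carry `"suggested_action": "revoke_incident"` once the load drops, updating the incident it originally opened.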

A serious difficulty lies in the proper configuration of alert activation criteria. The system has to cope both with "false positives", when a failure is reported even though the system state is still within bounds, and with "false negatives", when some functionality stops working but the monitoring system detects no issue. Additionally, a switch must be present to prevent maintenance work from causing a cascade of events triggering failure incidents.
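An activation rule with such a maintenance switch can be sketched as follows, assuming a simple threshold criterion (all names are invented for illustration):

```python
def should_alert(metric: float, threshold: float, maintenance: bool) -> bool:
    """Raise an alert only when the metric is out of bounds and no
    maintenance window is in progress."""
    if maintenance:
        return False  # suppress alerts during planned maintenance
    return metric > threshold

print(should_alert(95.0, 90.0, maintenance=False))  # True — genuine alert
print(should_alert(95.0, 90.0, maintenance=True))   # False — cascade suppressed
```

Tuning the threshold is the hard part: too tight and false positives flood the support team, too loose and real degradations slip through as false negatives.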

To detect failures efficiently, the system has to process a lot of information. However, as was mentioned, Splunk may have trouble properly processing events under extreme load. These issues may be remedied by configuring two separate monitoring clusters: one that collects and stores information, and one that processes queries. However, the most effective measure is to educate users not to create overcomplicated, frequently run queries that access large data sets. Additionally, only data relevant to application state monitoring should be recorded, for even though Splunk is able to collect very large information sets, storing that much data in its database may tempt users to run queries on everything that is available.

The last presentation, titled "How to use Elasticsearch and Kafka for monitoring purposes", was given by Paweł Dąbek. The speaker presented an open-source alternative to Splunk consisting of Elasticsearch and Apache Kafka. Elasticsearch is a NoSQL database that supports distributed storage and processing. Though it needs substantial resources, mostly RAM, it performs blindingly fast data analysis, including full-text searches. Due to its architecture it also provides high availability and resistance to failures. Kafka, in turn, provides an efficient mechanism for data distribution and can split, aggregate, and process data streams in a way that is transparent to software endpoints.

In most cases, data streams coming from multiple sources are redirected to a Kafka server that preprocesses and buffers them so that the target database is not overloaded. Input data can be collected by Elasticsearch's own acquisition modules, called Beats, or sent by the monitored applications themselves using additional libraries such as APM agents. Database content can then be analysed either manually or with dashboard tools such as Kibana or Grafana. It is also possible to access the data remotely using Elasticsearch's REST interface.
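Querying over the REST interface amounts to POSTing a JSON search body to the `_search` endpoint. The sketch below only constructs the request rather than contacting a live cluster; the index name, field names, and host are invented for illustration:

```python
import json

# Hypothetical search: error-level log events from the last 15 minutes.
index = "app-logs"
search_body = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "error"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "size": 50,
}

# Elasticsearch exposes search as POST /<index>/_search.
url = f"http://localhost:9200/{index}/_search"
payload = json.dumps(search_body)
print(url)
print(payload)
```

Sending this payload with any HTTP client returns matching documents as JSON, which is exactly what dashboard tools like Kibana and Grafana do under the hood.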

Elasticsearch's primary advantages are its performance and completeness. With no custom configuration needed, it can store and process data sets up to 100 TB in size. There are also ready-made Beats available that can automatically redirect the operating system's and application containers' logs and key metrics to Elasticsearch. Also, unlike simple solutions such as mrtg, Elasticsearch always processes the whole data set with no averaging. This prevents extreme values from being lost and makes long-term charts show all peaks.

Elasticsearch can also be customized to monitor the performance of a development team. Information on commits merged to repository branches, application deployment errors, or successfully resolved incidents can be stored in the database and analysed. The results of these analyses can then be used to identify insufficient employee training or a lack of human or hardware resources.

The presentation concluded with the suggestion that a fully functional monitoring system could help with the transition from reactive to proactive issue management. With enough data collected, not only can in-progress failures be detected, but some future incidents can also be predicted. For instance, a steady increase in RAM or CPU usage, or in application request processing time, may suggest that the code should be optimized or the application migrated to more capable hardware.
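Such a prediction can be approximated with something as simple as a least-squares trend line over recent samples. A minimal sketch, with invented sample data:

```python
def predict_crossing(samples, limit):
    """Fit a straight line to (hour, usage) samples and estimate
    the hour at which usage will cross the given limit."""
    n = len(samples)
    sx = sum(t for t, _ in samples)
    sy = sum(v for _, v in samples)
    sxx = sum(t * t for t, _ in samples)
    sxy = sum(t * v for t, v in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # usage is flat or falling — no crossing predicted
    return (limit - intercept) / slope

# Hypothetical RAM usage (GB) sampled every hour, climbing steadily.
ram = [(0, 4.0), (1, 4.5), (2, 5.0), (3, 5.5)]
print(predict_crossing(ram, limit=8.0))  # → 8.0 (hour at which 8 GB is reached)
```

An alert raised when the predicted crossing falls within, say, the next 24 hours turns a future outage into a planned optimization or migration task.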

The meeting concluded with a short discussion, mostly revolving around maintaining the high performance of an application monitoring system. The organizers mentioned that the system deployed at ING already deals well with system capacity monitoring and analysis but still needs some tuning in the automated defect detection department. An application availability level of 99.8% (which translates to around 17.5 hours of downtime annually) definitely leaves room for improvement.
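The downtime figure follows directly from the availability level:

```python
# Annual downtime implied by a given availability level.
availability = 0.998       # 99.8%
hours_per_year = 365 * 24  # 8760, ignoring leap years
downtime_hours = (1 - availability) * hours_per_year
print(round(downtime_hours, 1))  # → 17.5
```

For comparison, the common "three nines" target of 99.9% would cut that to under 9 hours per year.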

The meeting was very well organized. The presentations were well timed, and when the discussion closed all guests were offered fresh hot pizza, so that further talks and networking could take place.