Monitoring | DevOps

Definition

Monitoring is continuously gathering feedback, with the help of tools, on the indicators of the IT system throughout the IT delivery cycle, and using that information to forecast the behavior of the IT system.

Indicators used for Monitoring

During the realization of the IT system, the different disciplines involved should implement monitoring to check progress and to verify whether quality indicators are met. Later in the lifecycle, monitoring is used to measure the indicators of the IT system and to prevent failures by forecasting the future quality and, when necessary, taking preventive or corrective actions.

In the building block “Indicators of the VOICE model” we distinguish four groups of indicators.

The figure below shows the relation between the four groups of indicators from the VOICE model and the elaborated indicators for monitoring.

Indicators

In this building block about monitoring we go into more depth on the indicators used for monitoring. Throughout the complete lifecycle of the IT system, four kinds of indicators for monitoring should be considered.

  1. Quality indicators 
  2. Team performance indicators 
  3. Functional system indicators 
  4. Non-functional system indicators 

Quality indicators 

Quality indicators are checks to measure whether the IT system meets the technical quality set by the team and the functional quality set by the business. An important part of quality control is validation, that is, checking whether the team is building the right IT system. An indicator for this is customer satisfaction, which is related to the pursued business value.

Another part is verification, checking whether the team is building the IT system right, using controls such as test coverage on different levels, the number of test failures, code quality, et cetera.

For more information see examples of indicators.  
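
As an illustration, a minimal sketch in Python of how two common verification indicators, test coverage and test pass rate, could be calculated and compared against team-defined targets; all numbers and thresholds are hypothetical examples:

    # Minimal sketch of two verification indicators: coverage and pass rate.
    # All numbers and thresholds below are hypothetical examples, not prescribed values.

    def coverage_percentage(covered_lines: int, total_lines: int) -> float:
        """Statement coverage as a percentage of the code base."""
        return 100.0 * covered_lines / total_lines if total_lines else 0.0

    def pass_rate(passed_tests: int, total_tests: int) -> float:
        """Share of test cases that passed in the last run."""
        return 100.0 * passed_tests / total_tests if total_tests else 0.0

    coverage = coverage_percentage(covered_lines=8_500, total_lines=10_000)  # 85.0
    rate = pass_rate(passed_tests=472, total_tests=480)                      # ~98.3

    # Compare against the quality targets the team agreed on.
    print(f"coverage {coverage:.1f}% (target >= 80%), pass rate {rate:.1f}% (target >= 95%)")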

Team performance indicators 

Most IT projects consist of multiple releases and have a duration of months or even years. During the project's lifecycle the stakeholders want to be in control of the project. For example, they want to know how much the team can build in a certain period, so they can adjust priorities when needed.

The team performance indicators measure the performance of the team (or multiple teams) and give the stakeholders feedback and information to adjust the priorities when necessary; a minimal sketch of two such indicators follows the list below.

Examples of team performance indicators are: 

  • Burndown charts
  • Velocity (story points per sprint)
  • Changes in quality over time
  • Changes in test coverage over time
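
A minimal sketch in Python of how two of these indicators, velocity and a burndown, could be derived from completed story points per sprint; the sprint data and scope are hypothetical examples:

    # Minimal sketch: velocity and burndown derived from story points per sprint.
    # The sprint data and total scope are hypothetical examples.

    completed_per_sprint = [21, 18, 25, 22]   # story points finished in sprints 1..4
    total_scope = 200                         # story points currently in scope

    velocity = sum(completed_per_sprint) / len(completed_per_sprint)   # 21.5 points/sprint

    # Burndown: remaining work after each sprint, suitable for plotting over time.
    burndown = []
    remaining = total_scope
    for done in completed_per_sprint:
        remaining -= done
        burndown.append(remaining)

    print(f"average velocity: {velocity:.1f} points/sprint, remaining: {remaining} points")
    print("burndown:", burndown)   # [179, 161, 136, 114]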

Functional system indicators 

At an early stage of the development of the IT system, the team and the business need to define on which functional metrics the IT system is going to be monitored. These functional metrics should give the team sufficient feedback to be able to monitor the IT system and, if needed, take adaptive measures to ensure that it works properly.

Functional checks can be high-level or in-depth. High-level checks, such as a liveness check or heartbeat check, only monitor whether the system is responding to high-level requests. In-depth checks, such as a readiness check or health check, monitor whether the inner workings of the IT system are behaving correctly and whether the IT system is ready for its purpose.

Examples of functional system indicators are: 

  • Ping request / ping response times
  • Number of HTTP requests with a fault status (see the sketch after this list)
  • Number of records successfully processed during a batch run
  • Number of unique visitors on a website
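
As an illustration of the second item, a minimal sketch in Python that counts HTTP responses with a fault status (5xx) in a set of log records; the record format and content are hypothetical examples of what an access log could provide:

    # Minimal sketch: count HTTP requests that ended in a fault status (5xx).
    # The records below are hypothetical; real ones would come from the access log.

    requests = [
        {"path": "/orders", "status": 200},
        {"path": "/orders", "status": 503},
        {"path": "/login",  "status": 500},
        {"path": "/home",   "status": 404},
    ]

    fault_count = sum(1 for r in requests if 500 <= r["status"] < 600)
    print(f"{fault_count} of {len(requests)} requests returned a fault status")   # 2 of 4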

In addition to the functional checks, the team should also pay attention to the logging, tracing and auditing of the IT system. Extra information can be derived from the logging data. Logging, tracing and auditing are also very useful for the analysis of anomalies and unexpected behavior of the IT system.

The functional indicators which are implemented during the realization phase come into play during the next phases of the lifecycle of an IT system, especially during operation.  

Heartbeat versus health check 

The heartbeat of an IT system is an elementary, high-level check of whether the IT system is available. Complementary to the heartbeat check, a more in-depth check is the health check. A health check also verifies the internal functioning of an IT system, e.g.: is the database accessible by the system, are third-party interfaces available, are the submodules available?
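
A minimal sketch of both checks as HTTP endpoints, assuming a Python/Flask service; check_database and check_partner_api are hypothetical placeholders for the real dependency checks:

    # Minimal sketch of a heartbeat and a health check endpoint (Python/Flask assumed).
    from flask import Flask, jsonify

    app = Flask(__name__)

    def check_database() -> bool:        # hypothetical placeholder
        return True                      # e.g. run a trivial query against the database

    def check_partner_api() -> bool:     # hypothetical placeholder
        return True                      # e.g. call the third-party status endpoint

    @app.route("/heartbeat")
    def heartbeat():
        # Elementary check: the process is up and answering requests.
        return jsonify(status="alive"), 200

    @app.route("/health")
    def health():
        # In-depth check: also verify the internal functioning and dependencies.
        checks = {"database": check_database(), "partner_api": check_partner_api()}
        healthy = all(checks.values())
        status_code = 200 if healthy else 503
        return jsonify(status="healthy" if healthy else "unhealthy", checks=checks), status_code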

Logging and tracing 

Logging and tracing are key elements of an application. The purpose of logging is to send sufficient data out of the IT system to a central logging collection system to provide information for an analysis. This analysis can be done to find the cause of a fault or failure. Logging data can also be a source of data for a dashboard.  

In general, logging data is sent with a severity level. The configured severity level is especially relevant during the realization phase, when it is typically set lower (more verbose) than when the IT system is in operation.

Tracing information shows the state changes of a request, or records its entire journey through the complete IT system. Besides the state changes, throughput or processing times can also be traced.

When producing logging and tracing data, keep GDPR regulations in mind: logging data should not refer to any privacy-sensitive information.
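
A minimal sketch of logging with severity levels and a trace identifier, using Python's standard logging module; the logger name, fields and messages are illustrative, and no privacy-sensitive data is written:

    # Minimal sketch: logging with severity levels and a trace id (Python standard library).
    import logging
    import uuid

    logging.basicConfig(
        level=logging.DEBUG,   # realization phase: verbose; in operation typically WARNING or higher
        format="%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s",
    )
    log = logging.getLogger("orders")

    trace_id = uuid.uuid4().hex          # illustrative id that follows one request through the system
    extra = {"trace_id": trace_id}

    log.debug("request received", extra=extra)
    log.info("order processed in 120 ms", extra=extra)        # processing time, no personal data
    log.error("payment provider unreachable", extra=extra)    # candidate for alerting and the dashboard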

Auditing 

The auditing of an IT system is the check, based on logged events, of whether the data within the IT system complies with the integrity, confidentiality and availability regulations of the organization. Auditing will also reveal inappropriate attempts to access data.
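
A minimal sketch, assuming access events are already logged as structured records, of how an audit could flag inappropriate attempts to access data; the record format and content are hypothetical examples:

    # Minimal sketch: flag inappropriate data-access attempts in logged audit events.
    # The event records are hypothetical; real ones would come from the audit log.

    audit_events = [
        {"user": "alice",   "action": "read",  "resource": "invoices", "allowed": True},
        {"user": "mallory", "action": "read",  "resource": "salaries", "allowed": False},
        {"user": "bob",     "action": "write", "resource": "salaries", "allowed": False},
    ]

    violations = [e for e in audit_events if not e["allowed"]]
    for event in violations:
        print(f"AUDIT: {event['user']} attempted to {event['action']} {event['resource']}")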

Non-functional system indicators 

Next to functional indicators, non-functional indicators can also tell us about the current state of the IT system and help predict its future behavior. These non-functional indicators focus on the lower layers of the IT system.

Examples of non-functional system indicators are: 

  • CPU utilization
  • Memory consumption
  • Disk space usage
  • HTTP response time

CPU, memory, disk space and response time can be measured at different levels, e.g. at IT system level and subsystem level, but also in the end-to-end chain of systems. Thresholds can be configured upon which measures can be taken to stay in control and keep the system healthy.
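
A minimal sketch of such a threshold check on two of these indicators, assuming the psutil package is available; the threshold values are illustrative, not prescribed:

    # Minimal sketch: check CPU and memory usage against thresholds (assumes psutil is installed).
    import psutil

    THRESHOLDS = {"cpu_percent": 80.0, "memory_percent": 90.0}   # illustrative values

    measurements = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
    }

    for name, value in measurements.items():
        if value > THRESHOLDS[name]:
            print(f"ALERT: {name} at {value:.1f}% exceeds threshold {THRESHOLDS[name]:.1f}%")
        else:
            print(f"OK: {name} at {value:.1f}%")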

Dashboarding and reporting  

The continuous stream of output from monitoring these indicators is gathered in datastores and can be used to perform analysis and reporting. To be able to forecast the system's behavior, the datastore should have the characteristics of a time-series database. A dashboard shows the status and highlights anomalies that occurred in the system.

The quality assurance dashboard provides various insights; it can, for example, support predicting the end date of the project. There are various parameters on which the prediction is based. Suppose there is an improvement in the right-first-time percentage, what does this do to the end date? What does a reduction of the lead time do to the end date? Or what is the expected end date of the project if only critical problems are solved? These questions are answered with the dashboard. The dashboard mainly uses descriptive and predictive analytics.
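
As an illustration of such what-if questions, a minimal sketch of a simple scenario calculation; the model (remaining items, lead time per item, rework driven by the right-first-time percentage) and all numbers are assumptions for illustration only, not the prediction method of any particular dashboard:

    # Minimal sketch of a what-if scenario: effect of right-first-time and lead time on the end date.
    # The model and all numbers are illustrative assumptions.

    def expected_days(remaining_items: int, lead_time_days: float, right_first_time: float) -> float:
        """Remaining calendar days, assuming items that are not right first time are done twice."""
        rework_factor = 1 + (1 - right_first_time)
        return remaining_items * lead_time_days * rework_factor

    baseline     = expected_days(remaining_items=40, lead_time_days=1.5, right_first_time=0.80)  # 72 days
    better_rft   = expected_days(remaining_items=40, lead_time_days=1.5, right_first_time=0.90)  # 66 days
    shorter_lead = expected_days(remaining_items=40, lead_time_days=1.2, right_first_time=0.80)  # ~58 days

    print(f"baseline: {baseline:.0f} days, better right-first-time: {better_rft:.0f} days, "
          f"shorter lead time: {shorter_lead:.0f} days")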

Quality forecasting 

The monitoring information is used to prevent failures by forecasting the evolution of the quality level. If the quality trend indicates that it will drop below a specified threshold, corrective measures can be taken.
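
A minimal sketch of such a forecast, fitting a linear trend to a quality indicator over time (assuming numpy is available); the measurements, threshold and forecast horizon are illustrative:

    # Minimal sketch: forecast a quality indicator with a linear trend (assumes numpy is installed).
    import numpy as np

    weeks = np.array([1, 2, 3, 4, 5, 6])
    pass_rate = np.array([98.0, 97.5, 96.8, 96.0, 95.4, 94.9])   # illustrative measurements (%)
    threshold = 93.0                                             # illustrative quality threshold (%)

    slope, intercept = np.polyfit(weeks, pass_rate, deg=1)       # simple linear trend
    forecast_week = 10
    forecast = slope * forecast_week + intercept                 # roughly 92%

    print(f"trend: {slope:.2f} %/week, forecast for week {forecast_week}: {forecast:.1f}%")
    if forecast < threshold:
        print("Forecast drops below the threshold: plan corrective measures now.")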

Cognitive QA is an example of an artificial intelligence platform that allows QA Professionals to improve decision-making by providing real-time descriptive, predictive and prescriptive software quality analytics:  

  • Descriptive analytics uses data aggregation and data mining to provide more insight into the past and provides insight into the question: "What happened and what is happening?"  
    For example:  
    • Test coverage percentage
    • Orphan test cases
    • Open critical anomalies
    • Orphan tests
  • Predictive analytics uses statistical models and forecasting techniques to understand the future and answers the question: "What could happen?"  
    For example:
    • Fault prediction
    • Test case failure prediction
    • System integration quality
  • Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and offers an answer to the question: "What should we do?"
    For example:
    • What to test
    • What to automate
    • Test case selection
    • Anomalies to testcase mapping