Peter Drucker is often quoted as saying, "You can't manage what you can't measure." This is especially true in the world of technology and software development.

In DevOps, collecting, analyzing, and acting on metrics is a critical practice.

A study by Atlassian in 2020 found that DevOps success directly correlates with important metrics such as MTTR and Lead Time, as well as business metrics like revenue and profitability.

On the other hand, a contrasting opinion resonated with roughly half of the respondents:

"It is difficult to measure the impact of DevOps progress, and organizations do not have a clear way to measure success."

In this post, we will be discussing how organizations can gain valuable insights into their development and deployment processes, and make data-driven decisions to improve efficiency using DevOps metrics.

Why measure DevOps Metrics?

Effective measurement of DevOps metrics gives teams a bird's-eye view of the development and deployment process, enabling them to identify bottlenecks and areas for improvement.

Identifying bottlenecks: It is important to track where delays occur in the development and deployment process and work to eliminate them. Metrics such as Lead Time and Cycle Time can be used to track these delays (see the sketch after this list).

Improving collaboration: Metrics such as Change Failure Rate and Mean Time to Recovery (MTTR) can help teams identify areas where collaboration between teams can be improved.

Increasing visibility: Providing company-wide visibility into the entire development and deployment process can offer insights into how long it takes to move from one stage to the next.

Identifying cost-saving opportunities: By reducing operational inefficiencies, teams can identify areas where they can save money. Metrics such as Deployment Frequency and Lead Time can point teams to where their DevOps processes need optimizing.
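
As a concrete illustration of bottleneck hunting, here is a minimal Python sketch that computes how long one work item spent in each stage. The stage names and timestamps are hypothetical; in practice they would come from your issue tracker and CI/CD system:

```python
from datetime import datetime

# Hypothetical timestamps recording when one work item entered each stage.
stages = [
    ("planning",    datetime(2023, 3, 1, 9, 0)),
    ("development", datetime(2023, 3, 1, 11, 0)),
    ("code review", datetime(2023, 3, 3, 15, 0)),
    ("testing",     datetime(2023, 3, 6, 10, 0)),
    ("deployed",    datetime(2023, 3, 6, 16, 0)),
]

# Time spent in each stage is the gap between consecutive stage entries.
for (stage, entered), (_, left) in zip(stages, stages[1:]):
    hours = (left - entered).total_seconds() / 3600
    print(f"{stage:<12} {hours:6.1f} h")
# The largest gap (here, code review) is the first bottleneck to investigate.
```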

4 Key DevOps Metrics according to DORA

DORA (DevOps Research and Assessment) defines four key metrics that measure performance by breaking down abstract processes in software development and making them visible through data. These data points then guide stakeholders to take the necessary steps to streamline processes and increase software throughput and stability.

Additionally, we will reference DORA's 2022 State of DevOps report to compare low, medium, and high performers and see what success and efficiency look like.

Mean Time to Recovery (MTTR): MTTR measures the average time required to recover from product and system failures, known as incidents.

For instance, if an organization pushes improvements in the form of better automation or enhanced collaboration between teams, it may experience a decline in MTTR. This suggests that the enhancements are having a positive impact, leading to faster incident resolution and better service availability.

According to the DORA report, teams can evaluate their performance based on the time required to restore service during an outage. High-performing teams take less than one day to restore service after an unplanned outage or impairment.
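
As a rough sketch, MTTR can be computed from incident records, assuming each incident's open and resolution timestamps can be exported from your incident-management tool (the records and field names below are hypothetical):

```python
from datetime import datetime

# Hypothetical incident records exported from an incident-management tool.
incidents = [
    {"opened": datetime(2023, 3, 1, 14, 0), "resolved": datetime(2023, 3, 1, 16, 30)},
    {"opened": datetime(2023, 3, 5, 9, 15), "resolved": datetime(2023, 3, 5, 10, 0)},
    {"opened": datetime(2023, 3, 9, 22, 0), "resolved": datetime(2023, 3, 10, 1, 0)},
]

# MTTR = average of (resolved - opened) across incidents, in hours.
durations = [(i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents]
mttr_hours = sum(durations) / len(durations)
print(f"MTTR: {mttr_hours:.2f} hours")  # -> MTTR: 2.08 hours
```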

Lead Time: Lead time refers to the time it takes for a feature or user story to move from the initial request to the point of release to production. It includes the time taken for planning, development, testing, code review, and deployment, as well as any delays that occur during the process, such as waiting for approvals or dependencies.

For example, if lead time is consistently high, it may indicate bottlenecks in the development process, such as slow code reviews or inefficient testing practices. DORA has consistently found that high performers have shorter lead times than low performers.
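
Here is a minimal sketch of the calculation, assuming you can pull request and release timestamps for each change from your tracker (the data below is hypothetical). Reporting the median alongside the mean helps, since a few stuck items can skew the average:

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical (requested, released) timestamp pairs for recent changes.
changes = [
    (datetime(2023, 3, 1), datetime(2023, 3, 4)),
    (datetime(2023, 3, 2), datetime(2023, 3, 10)),
    (datetime(2023, 3, 6), datetime(2023, 3, 8)),
    (datetime(2023, 3, 7), datetime(2023, 3, 9)),
]

lead_times_days = [(released - requested).days for requested, released in changes]
print(f"mean lead time:   {mean(lead_times_days):.1f} days")    # -> 3.8 days
print(f"median lead time: {median(lead_times_days):.1f} days")  # -> 2.5 days
```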

Deployment Frequency: Deployment frequency measures how often changes are deployed to production. A higher deployment frequency means a quicker release of new features and updates, a faster response to customer feedback, and a reduced risk of disruptive deployments.

In contrast, organizations that deploy infrequently may struggle to keep up with customer demands and market changes. This can result in frustrated customers, missed opportunities, and, ultimately, lost revenue.

Overall, a high deployment frequency is a key indicator of a successful DevOps culture. According to DORA, high-performing organizations deploy code to production for end users on demand, whereas low performers deploy between once per month and once every six months.
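
As a simple approximation, deployment frequency can be derived by counting production deployments per week from a deploy log; here is a sketch with hypothetical dates:

```python
from collections import Counter
from datetime import date

# Hypothetical production deployment dates pulled from a CI/CD deploy log.
deploys = [
    date(2023, 3, 1), date(2023, 3, 2), date(2023, 3, 2),
    date(2023, 3, 8), date(2023, 3, 9), date(2023, 3, 15),
]

# Group deployments by ISO week number.
per_week = Counter(d.isocalendar()[1] for d in deploys)
for week, count in sorted(per_week.items()):
    print(f"week {week}: {count} deploys")
```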

Change Failure Rate: Change Failure Rate (CFR) measures the percentage of changes that fail during deployment. A high CFR indicates issues with the development or deployment process, such as poor testing or insufficient automation.
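
Computed naively, CFR is simply failed changes divided by total changes over a window. A minimal sketch, assuming you can flag which deployments caused failures (the flags below are hypothetical):

```python
# Hypothetical deployment records: True means the deploy caused a failure
# in production (e.g. triggered a rollback, hotfix, or incident).
deployments = [False, False, True, False, False, False, True, False, False, False]

cfr = sum(deployments) / len(deployments) * 100
print(f"Change Failure Rate: {cfr:.0f}%")  # -> Change Failure Rate: 20%
```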

To improve CFR, focus on improving testing and quality assurance processes, as well as investing in infrastructure that can support frequent code deployments.

Overall, a low CFR is a key indicator of a successful DevOps culture that values quality and reliability. Organizations that prioritize this metric are likely to see improved customer satisfaction and faster time to market.

DevOps Metrics beyond DORA

While the DORA metrics are widely used and highly effective, there are many other metrics that can provide valuable insights.

I am also providing a link to a Google Sheet that includes several other metrics, including the ones listed below. The sheet outlines their basic definitions, how to measure them, and the stage at which to measure them.

Cycle Time: Cycle time refers to the time it takes for a feature or a user story to move from the start of the development process (such as planning or design) through to deployment and release. It includes all the steps involved in the development process, including coding, testing, code review, and deployment. This metric offers valuable insights into the speed and efficiency of the software delivery process, enabling teams to identify bottlenecks and streamline operations.

The goal is to reduce cycle time by automating as many tasks as possible, removing bottlenecks, and improving collaboration and communication between teams.

While a good cycle time will vary based on the complexity of the application and the specific needs of the business, high-performing DevOps organizations typically aim for a cycle time of just a few hours or less.
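
Because cycle times tend to have long tails, it helps to look at a high percentile alongside the average; here is a minimal sketch with hypothetical cycle times:

```python
# Hypothetical cycle times (hours) for recently completed work items.
cycle_times = [3, 4, 2, 5, 30, 3, 4, 6, 2, 41]

cycle_times.sort()
avg = sum(cycle_times) / len(cycle_times)
p90 = cycle_times[int(0.9 * (len(cycle_times) - 1))]  # simple p90 approximation
print(f"average: {avg:.1f} h, p90: {p90} h")
# A low average can hide a long tail: here most items finish in hours,
# but a few take more than a day.
```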

Defect Escape Time: Defect Escape Time measures how long a defect goes undetected, from the point it is introduced into the code until it is discovered in production.

A high Defect Escape Time indicates that defects are not being detected early in the development and testing process, which can lead to a poor user experience. Generally, a defect escape time of a few hours or less is considered good in a high-performing DevOps organization.

However, the acceptable level of defect escape time can vary depending on the criticality of the application, the business requirements, and the level of risk associated with defects.

Defect Escape Rate: Defect escape rate measures the percentage of defects that are not detected during testing and are discovered after the software is deployed in production.

Defect escape rate can be calculated by dividing the number of defects discovered in production by the total number of defects found in both testing and production.
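
A minimal sketch of that calculation, with hypothetical defect counts:

```python
# Hypothetical defect counts for one release cycle.
found_in_testing = 38
found_in_production = 2

total_defects = found_in_testing + found_in_production
escape_rate = found_in_production / total_defects * 100
print(f"Defect escape rate: {escape_rate:.1f}%")  # -> Defect escape rate: 5.0%
```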

A good defect escape rate is a low one. Generally, a rate of less than 5% is considered good in a high-performing DevOps organization.

Automated Test Pass Percentage: The automated test pass percentage is defined as the percentage of automated tests that pass successfully during a specific period.

For instance, if a software team runs 100 automated tests and 90 of them pass, the automated test pass percentage would be 90%.

This metric indicates the reliability of the software being developed; a high pass percentage suggests the build is stable, while a low one suggests there are issues in the software that need to be addressed.
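
If your CI system exports per-run results, a short script can track the pass percentage across runs so that dips stand out; here is a sketch with hypothetical run data:

```python
# Hypothetical (passed, total) counts from recent CI runs, oldest first.
runs = [(96, 100), (97, 100), (88, 100), (95, 100)]

for i, (passed, total) in enumerate(runs, start=1):
    pct = passed / total * 100
    flag = "  <- investigate" if pct < 90 else ""
    print(f"run {i}: {pct:.0f}% pass{flag}")
```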

Application Performance Metrics: Application performance metrics gauge how well an application performs in terms of responsiveness, reliability, and scalability, covering speed, stability, and resource utilization under various workloads and conditions.

These metrics help teams monitor and optimize the application's performance throughout the development lifecycle, from development and testing to deployment and production.

For instance, response time measures the time it takes for the application to respond to a user request, while throughput measures the number of transactions or requests the application can handle in a given period. Error rate indicates the frequency of errors and failures in the application, and resource utilization measures the amount of resources, such as CPU, memory, and disk space, that the application uses.
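
One way to collect these numbers is to instrument the application itself. Below is a minimal sketch using Python's prometheus_client library; the metric names, simulated latency, and 2% error rate are made-up illustrations, not a recommended configuration:

```python
import random
import time

# Requires: pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; choose names that match your own conventions.
REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_errors_total", "Total failed requests")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    REQUESTS.inc()
    with LATENCY.time():                       # records response time
        time.sleep(random.uniform(0.01, 0.1))  # simulated work
        if random.random() < 0.02:             # simulated 2% error rate
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:              # simulate steady traffic
        handle_request()
```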

By tracking these metrics, teams can pinpoint performance issues and bottlenecks and optimize the application to ensure it satisfies performance requirements and provides a positive user experience.

Challenges in measuring DevOps Metrics Correctly

Measuring DevOps metrics can be a tricky affair, as there are several challenges that organizations must overcome.

Lack of standardization: One of the main challenges in measuring DevOps success is the lack of standardization. Different organizations have their own unique approaches to DevOps implementation, making it difficult to develop a standard set of metrics that can be used to measure success. This makes it challenging to benchmark progress and can lead to confusion and frustration among stakeholders.

Limited visibility: Another challenge is limited visibility. DevOps requires effective collaboration and communication between different teams, but tracking progress is challenging when visibility is limited. Putting the right tools and processes in place to capture and analyze data across every stage of the DevOps pipeline helps address this.

Difficulty in defining and measuring success: Perhaps the most significant challenge in measuring DevOps success is defining what success actually means.

Success can mean different things to different organizations, and even within the same organization, there may be different opinions on what constitutes success. Some organizations may focus on speed of delivery, while others may prioritize stability and reliability. Defining success requires careful consideration of organizational goals, as well as the needs of different stakeholders, including customers, developers, and operations teams.

Tools for effective DevOps Metrics tracking

To effectively track DevOps metrics, it is important to use a combination of tools to gather data and insights from every stage of the DevOps pipeline. Here are four types of tools that can help you effectively track key metrics:

CI/CD Tools: CI/CD tools help automate the process of building, testing, and deploying software, enabling you to measure metrics such as build success rates, test coverage, and deployment frequency. Some popular CI/CD tools are Jenkins, CircleCI, and Travis CI.

Monitoring Tools: Monitoring tools help you track the performance of your software in production environments. By monitoring key metrics such as server response times, error rates, and user activity, you can identify potential issues and make data-driven decisions to improve your application's performance. Some popular monitoring tools are New Relic, Datadog, and Prometheus.

Collaboration Tools: Effective collaboration is critical to the success of any DevOps team. Collaboration tools such as Jira, Trello, and Asana can help you track progress on tasks, assign responsibilities, and communicate with team members. By tracking collaboration metrics such as task completion rates and response times, you can identify areas where collaboration can be improved.

Customer Support Ticketing Tools: All the metrics in the world are irrelevant if your efforts do not reduce customer support tickets. Tools such as Zendesk, Freshdesk, Jira Service Management, and Salesforce Service Cloud help teams manage customer support tickets effectively by providing a centralized platform for ticket management, communication, and knowledge base creation.

In addition, these tools let you automate features like canned responses and workflows, helping to reduce response time and improve ticket resolution time.

Let’s Conclude

While standard metrics can guide you towards a well-run DevOps practice, it is important to design your own metrics based on your current development and business context.

We perform regular studies with engineering leaders to learn how they use metrics to transform their tech. In one such learning session, Kaushik Mukherjee, a seasoned engineering leader, gave us an example of how he used a metric to guide a couple of quarters of development sprints.

Kaushik observed that his team needed to reduce the defect rate, but not at the cost of reduced velocity. Hence, they created a metric that measured the rate of new defects introduced per deployment in a given period, and worked to drive this metric down.
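
As a sketch, such a ratio is simple to compute once both counts are available (the numbers below are hypothetical):

```python
# Hypothetical quarterly numbers: new defects attributed to releases,
# and the number of production deployments in the same period.
new_defects = 14
deployments = 70

defects_per_deploy = new_defects / deployments
print(f"New defects per deployment: {defects_per_deploy:.2f}")  # -> 0.20
# Driving this ratio down improves quality without penalizing deploy velocity.
```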

That's just one example of how designing your own metrics can help you achieve success and drive continuous improvement in your development process.