The DORA metrics are four key measurements that help team leaders to understand the effectiveness of their DevOps working practices. The DevOps Research and Assessment (DORA) group developed the metrics after six years of research into successful DevOps adoption.

Measurement is the best way to gauge the effect that DevOps is having on your organization. Focusing on the aspects identified by DORA can uncover opportunities to optimize your processes and improve efficiency. In this article, we'll explain how each of the four metrics contributes to DevOps success.

Deployment Frequency

Deployment frequency measures how often you ship new code into your production environment. As the overriding objective of DevOps is to deliver functioning code more efficiently, deployment frequency is a great starting point when you're evaluating success.

You can collect this data simply by counting how many times new code has been deployed over a particular time period. You can then look for opportunities to increase your release rate without sacrificing the guard rails that maintain quality standards. Using continuous delivery to automatically deploy code as you merge it is one way to accelerate your workflow.
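As a rough sketch of how you might compute this, the Python snippet below counts deployments per ISO week from a list of timestamps. The timestamps and their format are illustrative; in practice you'd export them from your CI/CD tool's deployment history.

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of production deployment timestamps, e.g. pulled
# from your CI/CD tool's API as ISO 8601 strings.
deployments = [
    "2023-03-01T09:15:00",
    "2023-03-01T16:40:00",
    "2023-03-03T11:05:00",
    "2023-03-08T10:30:00",
]

# Group deployments by ISO year and week to get a weekly frequency.
per_week = Counter(
    datetime.fromisoformat(ts).isocalendar()[:2] for ts in deployments
)

for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```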

The ideal deployment frequency depends on the type of system you're building. While it's now common for web apps to be delivered multiple times a day, this cadence isn't suitable for game developers producing multi-gigabyte builds.

In some situations it can be helpful to acknowledge this difference by thinking of deployment frequency slightly differently. You can approach it as the frequency with which you could have deployed code, if you'd wanted to cut a new release at a particular point in time. This can be a more effective way to gauge throughput when true continuous delivery isn't viable for your project.

Change Lead Time

A change's lead time is the interval between a code revision being committed and that commit entering the production environment. This metric reveals delays that occur during code review and iteration, after developers have completed their initial work.

Measuring this value is straightforward. Find the time at which the developer committed the change, then the time at which that code was delivered to users. The lead time is the number of hours and minutes between the two values.

As an example, consider a simple change to send a security alert email after users log in. The developer completes the task at 11am and commits their work to the source repository. At 12pm, a reviewer reads the code and passes it to QA. By 2pm, the QA team's tester has noticed there's a typo in the email's copy. The developer commits a fix at 3pm and QA merges the final change into production at 4pm. The lead time of this change was 5 hours.
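Here's a minimal sketch of that calculation in Python, using the example's commit and deployment times (the dates are illustrative):

```python
from datetime import datetime

# Timestamps from the example above: the developer commits at 11am and
# the change reaches production at 4pm the same day.
committed_at = datetime(2023, 3, 1, 11, 0)
deployed_at = datetime(2023, 3, 1, 16, 0)

lead_time_hours = (deployed_at - committed_at).total_seconds() / 3600
print(f"Change lead time: {lead_time_hours:.0f} hours")  # Change lead time: 5 hours
```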

Lead time is used to uncover inefficiencies as work moves between stages. Although standards vary widely by industry and organization, a high average lead time can indicate internal friction and a poorly considered workflow. Extended lead times can also be caused by developers producing low-quality work on their first iteration of a task.

Some organizations use different measurements for lead time. Many select the time that elapses between a developer beginning a feature and the final code entering production. Others may look back even further and use the time at which a change was requested - by a customer, client, or product manager - as the starting point.

These methods can produce information that's more broadly useful within the business, outside engineering teams. DORA's interpretation using commit timestamps has one big advantage though: the data is captured automatically by your source control tool, so developers don't need to manually record start times for each new task.

Change Failure Rate

The change failure rate is the percentage of deployments to production that cause an incident. An incident is any bug or unexpected behavior that causes an outage or disruption for customers and requires developers or operators to spend time resolving it.

You can calculate your change failure rate by dividing the number of deployments that have led to an error by the total number of deployments you've made, then multiplying by 100 to express it as a percentage. The failure count is usually acquired by labeling bug reports in your project management software with the deployment that introduced them.
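As a simple sketch, assuming you've already counted your deployments and labeled the subset that introduced incidents, the calculation looks like this:

```python
# Illustrative counts; in practice these come from your deployment history
# and labeled bug reports.
total_deployments = 120
failed_deployments = 9  # deployments linked to at least one incident

change_failure_rate = failed_deployments / total_deployments * 100
print(f"Change failure rate: {change_failure_rate:.1f}%")  # 7.5%
```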

Accurately attributing incidents to the change that caused them can sometimes be tricky, especially if you have a high deployment frequency, but in many cases developers and triage teams can determine the most probable trigger. Another challenge is agreeing on what constitutes a failure: should minor bugs increase your failure rate, or are you only interested in major outages? Both kinds of issue affect how customers perceive your service, so it can be useful to maintain several different values for this metric, each looking at a different class of problem.
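If you do decide to track several classes of problem, a small extension of the same calculation maintains a separate rate per severity. The incident log format below is purely hypothetical:

```python
from collections import Counter

# Hypothetical incident log where each entry is tagged with the deployment
# that caused it and a severity class.
incidents = [
    {"deployment": "v1.4.2", "severity": "minor"},
    {"deployment": "v1.4.5", "severity": "major"},
    {"deployment": "v1.5.0", "severity": "minor"},
]
total_deployments = 120

# Count failures in each severity class, then report one rate per class.
failures_by_severity = Counter(i["severity"] for i in incidents)

for severity, count in sorted(failures_by_severity.items()):
    rate = count / total_deployments * 100
    print(f"{severity} change failure rate: {rate:.1f}%")
```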

You should always aim to drive the change failure rate as low as possible. Automated testing, static analysis, and continuous integration can help prevent broken code from reaching production. Reinforcing your processes with new tools and working methods will gradually reduce the failure rate over time.

Time to Restore Service

Unfortunately, failures can't be eradicated altogether. Eventually you're going to run into an issue that causes pain for your customers. The fourth DORA metric, Time to Restore Service, analyzes how effectively you can respond to these events.

As with change lead time, the measured duration can vary between organizations. Some teams start from the time at which the bug was deployed, others from the first customer report, and some from the time at which the incident response team was paged. Whichever trigger point you adopt, use it consistently and keep measuring until the incident is marked as resolved.
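As a sketch, assuming you record a consistent trigger timestamp and a resolution timestamp for each incident, the mean recovery time falls out directly:

```python
from datetime import datetime

# Hypothetical incident records using one consistent trigger point (here,
# the time the incident was first reported) and the resolution time.
incidents = [
    ("2023-03-02T09:10:00", "2023-03-02T11:40:00"),
    ("2023-03-10T22:05:00", "2023-03-11T01:05:00"),
]

# Duration of each incident in hours, from trigger to resolution.
durations_hours = [
    (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600
    for start, end in incidents
]

mttr = sum(durations_hours) / len(durations_hours)
print(f"Mean time to restore service: {mttr:.1f} hours")  # 2.8 hours
```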

A high average recovery time is a signal that your incident response processes need fine-tuning. Effective responses depend on the right people being available to identify the fault, develop a patch, and communicate with affected customers. You can reduce the time to restoration by developing agreed response procedures, keeping critical information centrally accessible in your organization, and introducing automated monitoring to alert you to problems as soon as they occur.

Optimizing this metric is often neglected because too many teams assume a major outage will never happen. You may also have relatively few data points to work with if your service is generally stable. Running incident response rehearsals using techniques such as chaos testing can provide more meaningful data that's representative of your current recovery time.

Summary

The four DORA metrics provide DevOps team leaders with data that uncovers improvement opportunities. Regularly measuring and analyzing your Deployment Frequency, Change Lead Time, Change Failure Rate, and Time to Restore Service helps you understand your performance and make informed decisions about how to enhance it.

DORA metrics can be calculated manually using the information in your project management system. There are also tools like Google Cloud's Four Keys that will generate them automatically from commit information. Some ecosystem tools like GitLab are beginning to include integrated support too.

The best DevOps implementations will facilitate quick changes and regular deployments that very rarely introduce new errors. Any regressions that do occur will be dealt with promptly, minimizing downtime so customers have the best impression of your service. Tracking DORA trends over time lets you check whether you're achieving these ideals, giving you the best chance of DevOps success.