Whitepaper · Updated April 2026 · 11 min read

Metrics for Software Testing, Part 1: The Why and How of Metrics

The framework that anchors a test-metrics program — why you need metrics at all, how to derive them top-down from testing objectives, and the four tests (simple, effective, efficient, elegant) every good metric has to pass. Part 1 of a four-part series.

Test Metrics · Test Management · Measurement · Goal-Question-Metric · Test Dashboards

Series · Part 1 of 4 · Managing with facts

When you use metrics to track, control, and manage testing and quality, you're managing with facts and reality instead of opinion and guesswork. This paper lays the framework for the series: why metrics matter, the top-down process that produces metrics worth collecting, and the four properties that separate a useful metric from a vanity one.

Four-part series: Part 1 (this paper) — Why & how · Part 2 — Process metrics · Part 3 — Project metrics · Part 4 — Product metrics

Why metrics at all

Sometimes engineers dismiss metrics with variations on the line "not everything that matters can be measured, and not everything that can be measured matters." It's a clever line. It's also a bad excuse. Metrics let you measure attributes, understand what's happening, make decisions you can defend, and — maybe most importantly — check whether the decisions you made were the right ones. Testing without metrics is testing by opinion, and opinion is not a management basis.

The everyday world runs on metrics. When you drive, you track speed. When you shop, you compare price. When you fly, altitude, fuel, and heading are on the glass. Strip those metrics away and you'd feel lost in ninety seconds. In most cases we've simply stopped noticing the measurements because they're so reliably there. Software testing has the inverse problem: the metrics aren't reliably there, and the ones that are — the ones built into your test management tool out of the box — are usually the wrong ones.

Reasonable-sounding opinions are the most dangerous kind of mistake. Two thousand years of European thought accepted Aristotle's claim that heavier objects fall faster, because it sounded reasonable, until Galileo dropped two cannonballs off the Leaning Tower of Pisa and overturned the whole thing with a single thud. In consulting engagements we see the same pattern constantly. A client with a 20% bug-report rejection rate — four times the industry-healthy rate — had a firm, unanimous internal opinion about why: their testers didn't have enough end-user experience. A scatterplot of rejection rate against years of plant experience produced an R² of effectively zero. The reasonable-sounding opinion was simply wrong, and for two years it had masked a real process problem.

The core claim

Testing produces information. Information has no value unless it is generated and communicated effectively. Effective communication requires metrics. Without metrics, the test function is operating without instruments.

Three kinds of communication metrics enable

Testing communicates for three reasons:

  1. Notification. Making people aware of a status. "We have 24 bugs remaining to close" is more useful than "there are still bugs in the backlog."
  2. Enlightenment. Explaining an impact. "Bug-fix failures have cost us 212 person-hours this cycle — about 9% of planned test effort" is more useful than "it's frustrating to deal with all these bad fixes."
  3. Influence. Driving a decision. A breakdown of the backlog by severity supports a proposal for a bug-triage meeting: defer the unimportant reports in order to focus on the critical ones.

In practice, a dashboard is the set of metrics reported regularly — process-, project-, or product-focused — whose job is ongoing notification, enlightenment, or influence. An ad-hoc metric is the one you produce to explain a specific situation that just came up. Both matter. Conflating them usually results in a dashboard full of one-time analyses, which gets ignored, and no ad-hoc analysis when the moment calls for one.

The top-down development process

The most common failure mode of a metrics program is bottom-up metric selection: picking metrics because the tool produces them. Test management tools generate huge volumes of tactical metrics that may be useful to a test manager but are overwhelming or misleading to non-testers. That's backwards. Use the tool to collect the raw data; decide for yourself what to report.

The top-down process starts with objectives and reverse-engineers metrics from there:

  1. Define objectives. What is the test program actually trying to achieve?
  2. Ask E/E/E questions. How effective, efficient, and elegant are we in pursuing each objective?
  3. Devise metrics. Direct or surrogate, concrete and measurable.
  4. Set realistic goals. Baseline against yourself or benchmark against industry norms.
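The four steps amount to a traceability chain from objective to question to metric to goal, which can be sketched as a plain data structure before any tooling gets involved. A minimal Python sketch; the objective, question, and metric shown are illustrative examples, not a prescribed set:

```python
# Top-down metric derivation: objective -> E/E/E question -> metric -> goal.
# All names and text below are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str   # concrete, measurable quantity
    kind: str   # "direct" or "surrogate"
    goal: str   # baselined or benchmarked target, never an arbitrary extreme

@dataclass
class Question:
    dimension: str  # "effectiveness", "efficiency", or "elegance"
    text: str
    metric: Metric

@dataclass
class Objective:
    text: str
    questions: list = field(default_factory=list)

program = [
    Objective(
        text="Find bugs, especially important ones",
        questions=[
            Question(
                dimension="effectiveness",
                text="Have we finished finding new bugs?",
                metric=Metric(
                    name="Cumulative bug open/close trend",
                    kind="direct",
                    goal="Open curve flattens before release; close curve converges",
                ),
            ),
        ],
    ),
]

# Every reported metric should trace back to an objective this way.
for obj in program:
    for q in obj.questions:
        print(f"{obj.text} -> [{q.dimension}] {q.text} -> {q.metric.name}")
```

The point of the structure is the audit it enables: any metric that cannot be placed at the end of such a chain is a candidate for deletion from the dashboard.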

Define the objectives. When we start engagements, more than half the time the client has no clear, documented, realistic, agreed-upon test objectives. Typical high-level objectives for a test program: find bugs (especially important ones); build confidence in the product; reduce risk of post-release failures; provide useful, timely information about testing and quality. Your set might differ. Write them down.

Ask effectiveness, efficiency, and elegance questions. For each objective, three natural questions emerge:

  • Effectiveness. To what extent are we producing the desired result at all?
  • Efficiency. To what extent are we producing it without waste?
  • Elegance. To what extent is the work graceful, well-executed, and credible to outsiders?

Elegance isn't a vanity concern. Consider an espresso bar that serves a perfectly good cappuccino in 90 seconds for a low price — but the cashier overcharges you by mistake, the bar is filthy, and the barista's hair is shedding into the drink. The objective was met effectively and efficiently, and the experience is unacceptable. Test teams are judged the same way. A test team that produces accurate, timely information but presents it in an incomprehensible pile of numbers loses credibility.

Devise measurable metrics. For each E/E/E question, come up with a metric you can actually measure. Direct metrics measure the thing itself. Surrogate metrics measure something closely related when direct measurement isn't practical — the way you can use vehicle volume and a density assumption to estimate vehicle weight when you don't have a scale. Confidence in a system is a state of mind, so we measure it through surrogate metrics of coverage. We'll return to that in Measuring Confidence Along the Dimensions of Test Coverage.

Set realistic goals. Two legitimate ways: baselining (measuring where you stand now) and benchmarking (comparing yourself to industry norms or best practices). One illegitimate way: picking an arbitrary extreme and writing it into the metric. We've seen organizations where testers had "find 100% of bugs" in their annual reviews while developers had "ship code with zero defects" in theirs. Both goals are impossible, both were used as individual-performance weapons, and both violated the non-negotiable rule that process metrics must not be used for individual performance appraisal.

Two worked examples

Example 1 — Are we done finding bugs?

Objective: Find bugs — especially important ones. Effectiveness question: Have we finished finding new bugs? Metric: Cumulative bug-open and bug-close trend over the test window. The goal: the open curve flattens before release; the close curve converges with it.

Figure: Bug open and resolution trends (direct metric, effectiveness). Cumulative opened and closed bug reports, 0 to 250, over weeks W1 through W12, with an annotation marking bulk deferrals at the scope review. Healthy shape — the cumulative opened curve flattens near release, and the closed curve converges.

The vertical gap between the curves is the open-bug backlog. The chart also hints at two obvious improvements: shift both curves left (find and fix earlier in the lifecycle), and push the final opened value down (introduce fewer defects in the first place). Fewer bugs, found and resolved earlier, is always the smart direction to drive.
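The opened/closed curves are easy to derive from raw bug records exported from any test management tool. A minimal sketch; the record format (week opened, week closed or None) is an assumption for illustration, not any particular tool's schema:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical bug records: (week_opened, week_closed_or_None).
bugs = [(1, 2), (1, 3), (2, 2), (2, None), (3, 4), (3, None), (4, 5)]

weeks = range(1, 6)
opened_per_week = Counter(o for o, _ in bugs)
closed_per_week = Counter(c for _, c in bugs if c is not None)

# Running totals give the two cumulative curves.
cum_opened = list(accumulate(opened_per_week.get(w, 0) for w in weeks))
cum_closed = list(accumulate(closed_per_week.get(w, 0) for w in weeks))

# The vertical gap between the curves is the open-bug backlog.
backlog = [o - c for o, c in zip(cum_opened, cum_closed)]

print(cum_opened)  # [2, 4, 6, 7, 7]
print(cum_closed)  # [0, 2, 3, 4, 5]
print(backlog)     # [2, 2, 3, 3, 2]
```

The healthy shape from the chart falls out directly: the opened totals stop growing in the final weeks while the closed totals catch up, and the backlog list shrinks toward zero.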

Example 2 — How confident can we be in what we built?

Objective: Build confidence in the product. Effectiveness question: Are all the requirements fully covered? Metric: Requirements coverage — per area, what percentage is currently untested, tested-and-failed, and tested-and-passed. Target: 100% tested with no remaining must-fix failures at release.

Figure: Requirements coverage by area (surrogate metric, effectiveness). Surrogate metric for confidence — "tested and passed" is the only green state.

| Area           | Tested & passed | Tested & failed | Untested |
|----------------|-----------------|-----------------|----------|
| Functionality  | 90%             | 7%              | 3%       |
| Usability      | 58%             | 17%             | 25%      |
| Reliability    | 83%             | 17%             | 0%       |
| Performance    | 85%             | 10%             | 5%       |
| Installability | 80%             | 13%             | 7%       |

Usability has 25% still untested and 17% failing — a hole worth an escalation. Reliability is 100% tested but 17% failing — a hole worth a fix. Same format, different problem.

This is a surrogate metric — we're measuring coverage as a proxy for confidence — but it's a good surrogate, because the relationship between "all requirements tested and passing" and "stakeholders can trust the system" is strong and well-understood.
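Computing the breakdown from per-requirement test status is straightforward. A sketch with hypothetical statuses, sized so that two of the areas reproduce the Usability and Reliability rows above:

```python
from collections import Counter

# Hypothetical latest test status per requirement, grouped by quality area.
statuses = {
    "Usability":   ["passed"] * 7 + ["failed"] * 2 + ["untested"] * 3,
    "Reliability": ["passed"] * 10 + ["failed"] * 2,
}

def coverage(results):
    """Percent of requirements in each state, rounded to whole percent."""
    counts = Counter(results)
    total = len(results)
    return {s: round(100 * counts[s] / total)
            for s in ("passed", "failed", "untested")}

for area in statuses:
    print(area, coverage(statuses[area]))
# Usability   -> 58% passed, 17% failed, 25% untested
# Reliability -> 83% passed, 17% failed,  0% untested
```

The only subtlety is "latest status": a requirement that failed last week and passed yesterday counts as passed, which is exactly why the metric must be recomputed from current data rather than accumulated.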

Two examples, two forms of balance

The two metrics above balance each other. The opened/closed chart can look healthy even if you've stopped finding bugs for the wrong reason — blocked test environments, scope starvation, exhaustion — as long as nothing is being filed. The requirements coverage table won't let that pass: "Tested and passed" won't reach 100% if you're not actually running tests. Every metric in a dashboard should be balanced by at least one other metric that would catch the way it can be gamed or misinterpreted.
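The balancing idea can even be automated as a dashboard sanity check. A hedged sketch: flag the suspicious combination of a flat opened curve and substantial untested coverage. The function name and thresholds are invented for illustration, not industry standards:

```python
def backlog_looks_healthy(cum_opened, untested_pct,
                          flat_weeks=3, max_untested=10):
    """True only if the opened curve has flattened AND coverage backs it up.

    cum_opened   -- cumulative opened-bug counts per week
    untested_pct -- percent of requirements still untested
    Thresholds (flat_weeks, max_untested) are illustrative defaults.
    """
    tail = cum_opened[-flat_weeks:]
    opened_flat = len(tail) == flat_weeks and max(tail) == min(tail)
    # A flat opened curve means little if tests aren't actually being run.
    return opened_flat and untested_pct <= max_untested

print(backlog_looks_healthy([200, 240, 248, 250, 250, 250], untested_pct=5))   # True
print(backlog_looks_healthy([200, 240, 248, 250, 250, 250], untested_pct=25))  # False
```

The second call is the trap the coverage metric exists to catch: the same flat curve, rejected because a quarter of the requirements have never been exercised.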

The four tests a metric has to pass

Most bad metrics programs aren't bad because the team skipped the top-down process. They're bad because the metrics fail one of four tests.

  • Simple. To calculate and to understand. A test manager shouldn't need to re-explain the metric at every status meeting.
  • Effective. Tied to real action. The "so what?" test: if the number moves, what do you do differently?
  • Efficient. Cheap to produce. The effort to produce it is repaid by the value it creates.
  • Elegant. Clear to the audience. Looks professional, reads cleanly, holds up in front of executives.

Good metrics programs are also concise and balanced. Concise means you settle on a short, diverse set after the exploratory phase — it's tempting to collect everything, and it's destructive. Balanced means no single metric can paint a rosy picture by itself — every metric has a counter-metric that would expose the gaming, the blind spot, or the perverse incentive. The opened/closed curve plus requirements coverage above is the smallest balanced set; a production quality-risk metric completes the triangle.

Presentation forms

Three presentation options cover most use cases:

  • Snapshots. A table or chart of status at a moment in time. The requirements coverage table above is a snapshot.
  • Trends. A metric graphed against time. The opened/closed curve is a trend.
  • Relationship charts. Scatter plots and correlation analyses that test hypotheses. The "rejected reports vs. years of plant experience" example at the start of this paper is a relationship chart.

When in doubt, try more than one form and see which produces the cleanest conversation. Edward Tufte's work on chart design is a 40-year-old investment that's still paying dividends; it belongs on every test manager's shelf.

People are not machines — plan for psychology

Three psychological dynamics will distort how people read and react to metrics. Every test manager will meet them.

Confirmation bias. People accept data that confirms what they already believe and reject data that contradicts it. The project manager whose bonus depends on on-time release will have significant confirmation bias when reading the backlog curve.

Cognitive dissonance. The feelings of confusion, anxiety, and anger that come from trying to hold two incompatible beliefs at the same time. The project manager who begins to understand what the backlog means will experience it in real time.

Transference. Emotions attached to the situation get displaced onto someone else. The project manager may end up angry with the test manager who reported the numbers, even though the test manager did not cause them.

You cannot fix human nature. You can recognize when it's operating and design around it: deliver bad news early so there's time to act on it, co-brief with stakeholders whose interests are aligned with the data, and separate the report of the facts from the recommendation about what to do next so debate over the latter doesn't poison acceptance of the former.

Avoid the performance-appraisal trap

The single most destructive thing an organization can do with test metrics is use them for individual performance appraisal. If defect detection effectiveness (DDE) feeds into a tester's annual review, they will find ways to inflate it. If defect counts feed into a developer's review, they will find ways to suppress them. The process metric stops measuring the process and starts measuring the tug-of-war between the metric and the people being measured by it. Process metrics measure process capability. Project metrics measure project status. Product metrics measure product quality. None of them measure individual performance, and trying to use them that way destroys their usefulness and the honesty of the team.

Where this goes next

You now have a framework for generating metrics worth generating. The rest of the series applies it to the three levels at which test metrics operate:

  • Part 2 — Process metrics measure the capability of the test process and the surrounding software process: defect-detection effectiveness, defect closure period, reopen count. These are the least-used and least-understood of the three types.
  • Part 3 — Project metrics measure progress and status on a single project: multi-series bug trends, test-case fulfillment, test-execution hours. These are the most commonly used — and the most commonly misused — test metrics.
  • Part 4 — Product metrics measure the quality of the thing you're shipping: requirements coverage, residual quality risk, risk-category breakdowns. Often forgotten, and without them you don't actually know what you're about to ship.

Rex Black, Inc. · Enterprise technology consulting · Dallas, Texas
