Whitepaper · Updated April 2026 · 16 min read

Test Estimation: Realistic, Actionable, Flexible

How to produce test estimates that survive contact with reality — the WBS technique, three team estimation methods (Delphic Oracle, Three-Point, Wideband), execution estimation with defect-removal modeling, schedule-compression options that work and ones that don't, and the four categories of factors that turn good estimates into bad ones.

Test Estimation · Test Planning · Project Management · Test Management · Schedule Risk


At the start of most projects, the test manager (or test lead, or delivery lead) gets a version of the same question: how long will it take, and what resources do you need, to test this system? Maybe the boss asked. Maybe the boss already gave you an end date and you have to decide whether you can hit it. Maybe you're bidding on a project or putting a proposal together for an important stakeholder. Whatever the impetus, you need to know how to estimate a test effort — and produce a number you can defend.

Pairs with the Test Estimation Process checklist in the QA Library — a printable one-pager that walks through the estimation steps and the factor checklist at the end of this paper.

What a good estimate is

An estimate should accurately predict and guide the project's future. A useful estimate has three properties:

  1. Realistic. It includes all the tasks you can reasonably anticipate. It forecasts the most likely outcome given current information. It surfaces the risks so they can be mitigated.
  2. Actionable. It has clear ownership — tasks assigned to committed individuals. It shows assigned resources and known dependencies.
  3. Flexible. When deadlines and resources are constrained (they always are), the estimate has to accommodate reality without collapsing into wishful thinking.

An estimate without all three is a forecast that will surprise everyone the first time something doesn't go to plan — which is usually week one.

Terminology

A quick vocabulary list so we're all referring to the same things:

  • Project. Temporary endeavor to create or provide a product or service.
  • Test subproject. The subset of the project performed by the test team to provide test services to the larger project.
  • Work breakdown structure (WBS). A hierarchical decomposition of a project into phases, activities, and tasks — with resources, durations, and dependencies attached.

Divide and conquer: the work breakdown structure

A WBS is the estimator's primary tool. It's a hierarchical decomposition of the project — in this case, the test effort — into stages, activities, and tasks. For a test subproject, the usual starting stages are:

  • Planning — test strategy, plan, risk analysis, schedule.
  • Staffing (if applicable) — hiring, contracting, onboarding.
  • Test environment acquisition and configuration.
  • Test development — scripted and automated cases, test data, tooling.
  • Test execution — including find / fix / retest cycles.

Once you have the stages, divide them into ever-smaller chunks of work, ultimately down to the level of one person over a short period (one to five business days). Then conquer the estimation problem by asking, for each task, how long it will take (duration) and how much effort it will take (person-hours). The overall effort and duration estimates derive from the lowest-level constituent tasks. These are bottom-up estimates.
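The bottom-up roll-up can be sketched as a small tree of tasks whose effort sums upward from the leaves. Everything here is illustrative: the `Task` class, task names, and hour figures are assumptions for the sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A WBS node: leaf tasks carry effort; parent nodes roll it up."""
    name: str
    effort_hours: float = 0.0                     # person-hours (leaf tasks only)
    children: list["Task"] = field(default_factory=list)

    def total_effort(self) -> float:
        # Bottom-up estimate: a parent's effort is the sum of its children's.
        if self.children:
            return sum(c.total_effort() for c in self.children)
        return self.effort_hours

# A fragment of a test-subproject WBS (names and numbers are illustrative).
wbs = Task("Test subproject", children=[
    Task("Planning", children=[
        Task("Draft test strategy", effort_hours=16),
        Task("Risk analysis workshop", effort_hours=24),
    ]),
    Task("Test development", children=[
        Task("Write scripted cases", effort_hours=80),
        Task("Build test data generator", effort_hours=40),
    ]),
])

print(wbs.total_effort())  # 160.0 person-hours, derived from the leaf tasks
```

The same roll-up works for duration once dependencies are attached, which is where the critical-path discussion later in the paper picks up.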

To know when tasks are genuinely done, each task should produce a deliverable — or at least a first-draft or measurable piece of one. Deliverables may be internal to the test team (test cases, test data, automation harnesses, CI pipelines), inbound to the test team (the first feature-complete build, unit test results, a configured environment), or outbound to the project team (test plans, bug-tracking configuration, results reports). Deliverables often become inputs to subsequent tasks.

For a modern project, the WBS usually lives in the same place the project lives — Jira / Linear / Asana / Monday / ClickUp, or Microsoft Project for traditional waterfall shops. For greenfield test subprojects where the WBS hasn't been established, some teams still find that index cards on a whiteboard (or virtual equivalents in Miro / FigJam / Mural) work well for the first pass; the results go into the real project management tool after.

Unite and estimate: three team techniques

Decomposing the work into tasks can be done alone. Accurate predictions of how long those tasks will take require the collective wisdom of the people who have done them before. Gathering the team to estimate has three additional benefits: it leverages varied experience, it signals trust, and it builds commitment to the number.

Three techniques to structure team estimation:

Delphic Oracle

Each team member individually estimates each task. During review, the lowest and highest estimators for each task explain their reasoning. The low estimator may know an optimization or a reusable component — "we can generate that data with a script instead of typing it in." The high estimator may know a hidden risk — "imported hardware prototypes will sit in customs for two weeks." The process is repeated two more times, incorporating what everyone just heard. The average at the end of the final iteration is the estimate.

Three-Point

Each team member gives three numbers per task: best case (everything goes well), worst case (fears realized), and expected case. The average of the expected cases is the estimate. The best-case and worst-case numbers are retained to feed the risk register and the contingency buffer.

Wideband (Delphic Oracle + Three-Point)

Team members give three numbers per task. Low and high estimators for each number explain their reasoning. Repeat twice. The average of the expected cases becomes the estimate; the average best- and worst-case numbers form the confidence range.
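The Wideband roll-up for a single task is just column-wise averaging of the final-round numbers. The estimator names and figures below are invented for illustration.

```python
from statistics import mean

# Each estimator's (best, expected, worst) hours for one task,
# collected after the final Wideband discussion round.
estimates = {
    "alice": (8, 12, 24),
    "bala":  (10, 14, 30),
    "chen":  (6, 10, 18),
}

# Average each column: best cases, expected cases, worst cases.
best, expected, worst = (mean(col) for col in zip(*estimates.values()))

print(f"estimate: {expected:.1f} h")       # average of expected cases
print(f"range: {best:.1f}-{worst:.1f} h")  # feeds risk register and contingency
```

With these numbers the task estimate is 12.0 hours with an 8–24 hour confidence range; the spread between expected and worst case is a direct input to the contingency buffer discussed next.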

Names like Delphic Oracle are a reminder: you're trying to foretell the future. Risky business. As the project proceeds you will learn things that would have shifted the estimate. New tech won't work the first time. Things will change. Build in contingency — some slack — especially on the riskiest tasks. A rule of thumb that's held up is ~20% on the riskiest tasks; your actual number should come from looking back at previous projects' initial-vs-final estimate ratios.

In agile shops these techniques show up as planning poker (a Delphic Oracle variant on a Fibonacci scale), and three-point estimation is baked into PERT and reference-class forecasting. The underlying ideas are the same.

It depends: dependencies and the critical path

A WBS with durations alone is not yet a schedule. Tasks have dependencies on other tasks. Predecessor tasks have to complete (or reach some defined state) before successor tasks can start (or complete).

The two most common dependency types in test subprojects:

  • Finish-to-start. You want a completed, approved test plan before test development starts.
  • Finish-to-finish. You want system test to continue for a defined window after feature completion and after the last change is delivered, even if feature work happened to finish early.

For small teams the whole group can plug dependencies directly into the project management tool. For larger projects (or greenfield test subprojects), mapping dependencies on a whiteboard first is often cleaner: stick the no-dependency tasks at the left, then add tasks that depend only on the tasks already on the board, drawing lines for each dependency. Repeat until everything is placed. Then move the result into the tool.

Once dependencies are in, you can identify the critical paths — sequences where any day of slippage pushes the project end date day-for-day. Near-critical paths are sequences where a day or two of slippage is absorbed but larger slippages become critical. Tasks on the critical path demand disproportionate attention: they are where most external dependencies converge (phase entry / exit, environment readiness, vendor deliverables) and where most projects lose time.
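The critical-path calculation is a forward pass (earliest finish per task) plus a check for zero slack. This is a minimal sketch over an assumed five-task fragment with finish-to-start dependencies; real tools do the same computation with richer dependency types.

```python
from functools import lru_cache

# Durations (business days) and finish-to-start predecessors for an
# illustrative test-subproject fragment.
duration = {"plan": 5, "env": 10, "dev": 15, "exec": 10, "report": 2}
preds = {"plan": [], "env": ["plan"], "dev": ["plan"],
         "exec": ["env", "dev"], "report": ["exec"]}
succs = {t: [u for u, ps in preds.items() if t in ps] for t in duration}

@lru_cache(maxsize=None)
def earliest_finish(task: str) -> int:
    """Forward pass: a task starts when its last predecessor finishes."""
    start = max((earliest_finish(p) for p in preds[task]), default=0)
    return start + duration[task]

@lru_cache(maxsize=None)
def longest_tail(task: str) -> int:
    """Longest path from this task's start to the project end."""
    return duration[task] + max((longest_tail(s) for s in succs[task]), default=0)

project_end = max(earliest_finish(t) for t in duration)

# A task is critical when it has zero slack: its earliest start plus the
# longest remaining path exactly reaches the project end date.
critical = [t for t in sorted(duration, key=earliest_finish)
            if earliest_finish(t) - duration[t] + longest_tail(t) == project_end]

print(project_end, critical)  # 32 ['plan', 'dev', 'exec', 'report']
```

Here "env" has five days of slack (it finishes on day 15 while "dev" runs to day 20), so it sits on a near-critical path: a slip of up to five days is absorbed, anything larger pushes the end date.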

No free lunch: resources

Resources fall into three categories:

  • People. Engineers and technicians, employees and contractors, plus any outside test resources (labs, vendor QA teams, crowd-testing services). Remember that a less-skilled person assigned to a task takes longer and produces lower-quality output — be ready to revise when you see the actual skill mix. For every task on the schedule, at least one person on the team should know how to do it; otherwise that task is a schedule risk.
  • Test environments. Compute, storage, networks, test data, lab space, mobile-device farms, cloud accounts, third-party service sandboxes, LLM API budgets. Long-lead items (hardware, vendor integrations, regulated data) deserve their own line items.
  • Test tools and testware. Custom test data, cases, scripts, harnesses, fixtures, commercial test-management tools. Many of these are deliverables from the early stages of the test project — estimate the cost of building them before depending on them.

Three common estimation pitfalls to avoid:

  • Assuming two people finish a task in half the time of one. (Brooks's Law still applies. Collaboration overhead, knowledge transfer, and environment contention eat much of the gain, and some tasks are inherently sequential.)
  • Overloading tools or environments. Too few licenses; running performance and functional testing simultaneously on shared infrastructure; a mobile-device farm with too few concurrent sessions.
  • Forgetting time and resources to set up and support the environment and tools. Environment work is consistently the single most under-estimated line item in a test subproject.

Estimating test execution

Test execution is the stage that most resists estimation. How long it takes depends on the answers to two questions: how long will it take to run each planned test at least once, and when will the team be done finding and confirming fixes for bugs?

The planned-execution time

You need three inputs:

  1. Total person-hours of planned testing. Sum your case-level effort estimates. Suppose a team projects 280 hours of planned test effort for a given cycle.
  2. Raw person-hours per week on the team. For seven testers at 40 hours each, 280 person-hours per week.
  3. Percentage of tester time spent actually running tests. Testers attend meetings, confirm bug closures, update scripts, read email, and do other legitimate work. If 50% of time goes to actual test execution, that's 140 person-hours of testing per week.

In this example: 280 hours of planned work ÷ 140 hours of effective capacity per week = two weeks to run each test once. This is a floor, not a ceiling.
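The worked example above reduces to three lines of arithmetic (numbers taken from the text):

```python
planned_hours = 280          # summed case-level effort estimates for the cycle
testers, week_hours = 7, 40  # raw team capacity
execution_pct = 0.50         # share of tester time spent actually running tests

effective_per_week = testers * week_hours * execution_pct   # 140 person-hours
weeks_for_one_pass = planned_hours / effective_per_week

print(weeks_for_one_pass)  # 2.0 weeks — a floor, not a ceiling
```

The execution percentage is the number most worth validating against your own time-tracking data; teams that assume 80% and deliver 50% blow the floor estimate by more than half.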

The find-and-fix time

Now the harder question: how long will it take to find (and confirm) the bugs? The technique is a defect-removal model — a simple forecast built from historical data.

First, predict the total number of bugs. Function points and lines of code are classical inputs; if your process supports them, use them. If not, use whatever historical data you have — defects per story point, defects per feature, defects per tester-week, defects per commit, or a rough "past releases of this size had roughly N bugs." The idea is to build a simple model (a spreadsheet is fine) that projects total bug count from one or two metrics you can measure during estimation.

Second, predict the find rate. With historical data, calculate what percentage of remaining bugs are typically found each week during system test. Then predict the fix-and-confirm rate: what percentage of open bugs get closed each week, given the development team's observed velocity?

Sanity-check the absolute numbers against team capability. If the model predicts a peak find rate of 200 bugs per week for a team of seven, ask whether testers really can produce five well-researched bug reports per person per day. If the model predicts a fix rate the development team has never achieved, it's a wish, not a forecast.

Organizations with good historical data can project total bug counts within ±10% on projects spanning multiple years. Organizations without that history should build the model anyway and tighten it over subsequent releases.
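A defect-removal model of the kind described above can live in a spreadsheet or a dozen lines of code. This sketch simulates weekly find and fix counts until 95% of the predicted bugs are found and closed; the total bug count and both rates are illustrative and should be calibrated from your own release history.

```python
total_bugs = 400        # predicted from historical defects-per-feature data
find_rate = 0.30        # share of remaining (undiscovered) bugs found per week
fix_rate = 0.50         # share of open (found, unfixed) bugs closed per week

remaining, open_bugs, week = float(total_bugs), 0.0, 0
while remaining + open_bugs > total_bugs * 0.05:   # stop at 95% found-and-fixed
    found = remaining * find_rate
    remaining -= found
    open_bugs += found
    fixed = open_bugs * fix_rate
    open_bugs -= fixed
    week += 1
    print(f"week {week:2d}: found {found:5.1f}, open {open_bugs:5.1f}, "
          f"undiscovered {remaining:5.1f}")

print(f"~{week} weeks to reach 95% found-and-fixed")  # ~10 weeks here
```

With these rates the cycle runs about ten weeks, against the two-week floor from the planned-execution calculation; that gap is exactly why find-and-fix time, not planned-pass time, dominates the execution estimate. The week-one output (120 bugs found) is also where the sanity check bites: ask whether seven testers can really file 120 well-researched reports in a week.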

But we don't have until April 30th

Suppose you bring the realistic, actionable schedule to management and the response is "make it shorter." Now what? Reluctantly or petulantly accepting an imposed date is not a strategy. Better options exist.

Relax entry criteria (pull forward)

Many entry criteria say "the previous phase must be complete before the next begins." Relax "feature complete" to "almost complete" and you can start system test earlier. This overlap can pull in the end date — in a typical project, by one or two weeks. It also increases quality risk: can developers really finish features and fix bugs simultaneously? Testing an unready system is less efficient, which may mean less and less thorough testing. The overlap needs eyes-open sign-off from the product and engineering sides.

Add staff

Suppose you increase the test and development teams enough to run passes in a week and fix bugs twice as fast. This can pull the end date in substantially. Real cost: headcount budget may double, and new hires take weeks or months to contribute. If the project is short, you pay the ramp cost without getting the benefit. Use this for long projects with true skill gaps, not short projects under deadline pressure.

Cut scope (cut features)

Drop significant chunks of functionality — the multiplayer mode, the secondary platform, the non-primary market — and you reduce both development and test effort. Typical savings on a well-scoped cut: ~25%. This tends to be the most effective lever because it cuts both sides of the equation. The risk: sequencing those features into subsequent releases can introduce regression surface that's expensive to test a second time. Bundle them with the next regular release, not as a quick point-release.

Cut coverage deliberately

Rather than an across-the-board reduction in test execution time (which is almost always the worst option), deliberately reduce coverage where the risk is acceptable:

  • Eliminate whole areas of coverage — identify lowest-risk areas and drop them, or test them only as side effects of other tests.
  • Reduce extent across the board — identify the lowest-risk subset of the highest-risk areas and adopt a broader, shallower approach.
  • Postpone automation of non-regression-critical paths.
  • Use outsourced test labs or crowd-testing for device matrix coverage.

Whatever technique you pick, the objective is a proactive, explicit reduction driven by a risk analysis — not an arbitrary haircut.

What not to propose

A few approaches that show up in schedule-crunch meetings and should be refused:

Sustained overtime. If people were machines, seven-day weeks would deliver 40% more output and shrink schedules proportionally. People aren't machines. Occasional overtime is fine; sustained overtime produces burnout, degraded productivity, and a spike in escapes. Extended overtime as a plan is usually a way for the schedule's owner to look blameless when the schedule is missed.

Tight schedules as a "stretch goal." For a 50/50 chance of on-time completion, each task (especially on the critical path) needs a 50/50 chance of finishing on time. Tight schedules that expect heroics only work in "Theory X" management — the assumption that people only do their best work under pressure. In practice, tight schedules cause people to skip the parts they don't get rewarded for (peer review, test data hygiene, documentation), and the technical debt shows up a release later.

Silent coverage cuts. Deliberate, documented cuts are fine. Unplanned cuts because you ran out of time are how escape rates climb. If the team can't finish the planned coverage, the cut needs to be called before the end of the cycle, not discovered afterward.

Modern additions

Some estimation patterns that are now standard and weren't when this framework was first written:

  • AI-assisted code generation changes dev velocity estimates more than test velocity estimates. Developers using good AI assistants ship features faster, but the test work to cover those features doesn't scale the same way. The estimation gap between dev and test velocity has widened, not narrowed.
  • LLM-output evaluation harnesses are their own estimation category. If the product has LLM-backed features, someone has to build the evaluation harness, maintain the golden-answer datasets, and run the regression against them. Budget for this as a first-class line item — it's not "a bit more automation."
  • Cloud, container, and IaC environments are cheap to provision, expensive to maintain. Spinning up a test environment is a command; keeping it healthy, seeded, and in sync with production topology is a persistent line item, not a one-time cost.
  • Mobile release cadence compresses cycles. Weekly or biweekly app-store cadences mean the estimation unit is usually a sprint or a release train, not a project. Use per-sprint velocity projections instead of phase-level estimation, and keep a separate estimate for the larger pieces (platform migrations, SDK upgrades).
  • DORA metrics give you an external reality check. Lead time for changes, deployment frequency, change failure rate, and mean time to recovery tell you whether your estimate is feasible in your team's current operating envelope. A plan that implies the team needs 3× its recent deployment frequency is a plan that probably won't land on time.

Realistic, actionable estimates

In a successful project, schedule, budget, features, and quality — the four moving parts — converge as the release date approaches. Realistic, actionable estimates lay the foundation for that convergence.

The best practices of project estimation can help you produce a good estimate. A good estimate is complete and accurate; it captures and balances risk; it has committed team and individual ownership; it accounts for dependencies and the critical path. It gives executives and the project management team options that let them balance competing risks. Working together, through deliberate trade-offs in the context of a good estimate, you can guide a project to a successful outcome.


Appendix — Factors that influence test estimation

Estimation techniques by themselves aren't enough. System engineering — including the testing — is a complex, high-risk, human endeavor. Many factors can influence effort, time, dependencies, and resources. Some can speed things up or slow them down; others, when present, can only slow things down.

When preparing a test estimate, go through these four categories and ask, for each factor, whether it applies and how it affects the current project. Forgetting just one can turn a realistic estimate into an unrealistic one.

Process factors

  • The extent to which testing pervades the project (or is tacked on at the end).
  • Clearly defined hand-offs between the test team and the rest of the organization.
  • Well-managed change control for project and test plans, product requirements, design, implementation, and testing.
  • The chosen system development or maintenance lifecycle, including the maturity of testing and project processes within it.
  • Timely and reliable bug fixes.
  • Realistic and actionable project and testing schedules and budgets.
  • Timely arrival of high-quality test deliverables.
  • Proper execution of early test phases (unit, component, integration).

Material factors

  • Existing, assimilated, high-quality test and process automation and tools.
  • The quality of the test system — environment, process, cases, tools, data.
  • An adequate, dedicated, secure test environment.
  • A separate, adequate development debugging environment.
  • The availability of a reliable test oracle (so a bug can be recognized as a bug).
  • Available, high-quality project documentation — requirements, designs, plans.
  • Reusable test systems and documentation from previous, similar projects.
  • The similarity of the project and testing to previous efforts.
  • Availability of realistic, representative test data (including privacy-compliant production-derived data or synthetic data of adequate fidelity — see A Few Thoughts on Test Data).

People factors

Often the most important.

  • Inspired and inspiring managers and technical leaders.
  • An enlightened management team committed to appropriate levels of quality and sufficient testing.
  • Realistic expectations across all participants — individual contributors, managers, and stakeholders.
  • Proper skills, experience, and attitudes on the project team, especially in the managers and key players.
  • Stability of the team, especially the absence of turnover.
  • Established, positive project-team relationships across contributors, managers, and stakeholders.
  • Competent, responsive test-environment support.
  • Project-wide appreciation of testing, release engineering, system administration, and other "unglamorous but essential" roles. (Put another way: not an individual-heroics culture.)
  • Use of skilled contractors and consultants to fill gaps.
  • Honesty, commitment, transparency, and open, shared agendas across contributors, managers, and stakeholders.

Complicating factors

When present, these only slow things down — never speed them up.

  • High complexity of the process, project, technology, organization, or test environment.
  • Many stakeholders in the testing, quality of the system, or the project.
  • Many subteams, especially when they're geographically separated.
  • The need to ramp up, train, and orient a growing test or project team.
  • The need to assimilate or develop new tools, techniques, or technologies at the testing or project levels.
  • The presence of custom hardware.
  • Any requirement for new test systems, especially automated testware, as part of the test effort.
  • Any requirement to develop highly detailed, unambiguous test cases, especially to an unfamiliar standard of documentation.
  • Tricky timing of component arrival, especially for integration testing and test development.
  • Fragile test data — for example, data that is time-sensitive, expires quickly, or depends on third-party availability.
  • Compliance regimes with specific evidence requirements (regulated industries, SOC 2, PCI DSS, HIPAA, GDPR, EU AI Act).

Experience is often the ultimate teacher for these factors — but a smart test manager can learn to ask smart questions, of herself and of the project team, about how each factor will affect the current effort.


Working on this?

Rex Black, Inc. has been running test-estimation workshops with enterprise engineering teams since 1994. If you want help producing an estimate that stakeholders will trust, or coaching your test leads on the estimation techniques — talk to us.
