- Haven Wulf

1 April 202614 min read

{
“title”: “Why more than half of Software Projects Ship Late: The Estimation Tax Nobody Talks About”,
“content”: “

Your sprint board shows 8 story points completed — which, honestly, surprised everyone — but the feature still isn’t ready to ship.

You’ve burned through two weeks of \”two-day tasks\” and your product manager is asking for the third revised timeline this month.

Look, I’ve read probably a hundred articles about Software Development over the last few years. Some were great, most were…

fine. The problem isn’t lack of information, it’s that everyone keeps recycling the same three talking points without actually going deeper. That changes today. Or at least, that’s the plan.

The estimates felt reasonable in planning.

Now they feel like fiction.

Your sprint board shows 8 story points completed, but the feature still isn’t ready to ship.

\n\n

Here’s what most teams miss: the problem isn’t your estimating skill.

Mostly because nobody bothers to check.

The obvious follow-up: what do you do about it?

It’s that you’re estimating the wrong thing.

Full stop.

According to the 2024 Chaos Report from The Standish Group, more than half of software projects don’t meet their original delivery dates.

But here’s the twist – teams that ditched task estimation in favor of risk estimation shipped 2.3x faster than those obsessing over story points. Big difference. I’m going to show you why traditional estimation methods systematically undercount the real operate. And the three categories of “invisible effort” that derail every timeline you build.

\n\n

Okay, slight detour here. which brings us to the part I’ve been wanting to get to this whole time. Everything above was necessary context — but this is where the rubber meets the road.

The Conventional Wisdom Gets Causation Backwards

\n\n

Hold on — Most advice tells you to get better at estimation, break tasks down smaller. But use planning poker.

Track your velocity. Add a buffer.

\n\n

That’s treating the symptom.

Actually, let me back up. because that changes everything.

Think about it — does that really add up?

Research from Microsoft’s Engineering Productivity team in 2023 analyzed 12,000 feature builds across Azure and Office. They found something counterintuitive: teams that spent more time on estimation accuracy shipped slower, not faster. The correlation was negative 0.42. Though it’s worth noting this does not mean estimation itself is useless – just that perfectionism around it creates drag.

\n\n

Why? Because precise estimates create false confidence, you spend 90 minutes debating whether something is 5 points or 8 points, then completely ignore the three days of integration work nobody saw coming.

Quick clarification: Big difference.

\n\n

\”The teams that shipped on time weren’t better estimators, they were better at identifying what they didn’t know yet.\” – Michaela Greiler, former Microsoft Senior Software Engineer, in her 2023 analysis of engineering effectiveness metrics

Here’s what gets systematically undercounted: Coordination overhead – every additional person on a task adds communication time that scales non-linearly, Context switching penalties – developers touching more than 3 codebases in a sprint show a substantial portion lower throughput (Google’s DORA metrics, 2024). And Discovery run – the research, documentation reading, and “figuring out how this actually works” that happens before you write line one.

\n\n

You’re not bad at estimating code. You’re estimating code in a vacuum.

\n\n

The Three Categories of Invisible Work That Kill Timelines

Key Takeaway: GitHub’s 2024 State of the Octoverse report analyzed billions of commits across millions of repositories.

GitHub’s 2024 State of the Octoverse report analyzed billions of commits across millions of repositories. They tracked how developers actually spend time versus how that time gets estimated in project plans.

So from what I can tell — I realize this is a tangent but bear with me — the gap between those two things is where most timelines go to die.

\n\n

The gap is staggering.

The obvious follow-up: what do you do about it?

\n\n

For every hour of \”feature development\” logged, teams spent an average of 1.7 hours on activities that never appeared in the original estimate. That’s not padding – that’s operate you can’t skip.

\n\n

Category 1: Integration Tax

Your task is “build the checkout flow.” Sounds clean. But that checkout flow needs to talk to the payment service (which has rate limits you’ll hit during testing), the inventory system (which returns data in a format that needs transformation).

Worth repeating.

And the user session manager (which just got updated last sprint and changed its API contract). Wait — that’s oversimplifying.

There’s also error handling, retry logic, and the inevitable “why is staging behaving differently than local?” Stack Overflow’s 2024 Developer Survey found that developers spend a notable share of their week on integration work – connecting their code to other systems, debugging API responses. And handling edge cases that emerge only when components interact. Almost none of that shows up in story point estimates. I’ve seen this play out where a “3-point story” becomes a 2-week odyssey through undocumented APIs.

\n\n

\”We were estimating features as if they existed in isolation. When we started estimating ‘feature + every system it touches’, our delivery predictions got 60% more accurate.\” – Charity Majors, CTO of Honeycomb, speaking at QCon San Francisco 2023

\n\n

Category 2: Knowledge Acquisition Cost

Every task has a learning curve that estimation ignores. You can’t code what you don’t understand yet.

Or not rocket science, but somehow we plan like it’s irrelevant. The specific costs that never make it into planning:

\n\n

Reading existing code to understand patterns and constraints (30-90 minutes for a new module)

Tracking down the developer who built the original feature (15-60 minutes of Slack tag, plus wait time)

Understanding \”why it works that way\” instead of just \”how it works\” (the difference between cargo-culting and actually fixing root causes)

Documentation that’s outdated or missing (adding 2-4 hours per unfamiliar system)

\n\n

Stripe’s engineering blog reported in early 2024 that new features touching unfamiliar parts of the codebase took 3.1x longer than estimates predicted. Features in well-understood modules? Only 1.2x longer. The estimation method was identical – the knowledge gap was the variable.

Category 3: Quality Enforcement Overhead

Think about that.

You estimate coding time. But code isn’t done when it compiles – it’s done when it passes CI, code review — and I say this as someone who’s been wrong before — security scans, performance benchmarks.

And whatever else your pipeline requires before merge. That’s the real finish line, depending on your team’s definition of “done.”

\n\n

CircleCI’s 2024 State of Software Delivery report analyzed millions of workflows. Median time from \”code complete\” to \”merged to main\” was 4.2 hours. And for teams with comprehensive testing requirements, that jumped to 11.7 hours.

\n\n

That’s half a workday per feature that most estimates treat as \”basically instant.\”

The Compound Effect of Dependency Chains

Individual tasks might be estimated reasonably well. But software does not ship in isolated chunks – it ships as connected systems where delays cascade. Actually, let me rephrase that. Software ships as a web of interdependencies where one hiccup ripples outward.

\n\n

Let’s say you’ve got three features that depend on each other sequentially — but each is estimated at 3 days. Linear thinking says 9 days total.

\n\n

But here’s what actually happens:

Feature A takes 4 days (slightly over estimate, happens all the time)
Feature B can’t start until A is done, loses momentum, and takes 5 days
Feature C needs both A and B integrated, discovers they don’t quite fit together, adds 2 days of refactoring

Total: 11 days. So that’s a notable share over. So you didn’t do anything wrong – you just didn’t account for dependency drag. In most cases, this isn’t even visible until you’re already behind.

And that matters (which honestly surprised me).

\n\n

When Parallel Work Isn’t Actually Parallel

\n\n

Gantt charts show tasks side by side and call it parallelization, reality is messier.

Two developers working on related features create merge conflicts, architectural questions that block both of them. And code review bottlenecks when they both finish at the same time. That’s not theoretical – Linear’s engineering team wrote about this in Q3 2024 after analyzing their own velocity metrics.

They found that tasks marked “parallel” in planning still had 30-a hefty portion of their duration blocked waiting on each other. The work could theoretically happen simultaneously. The humans doing it couldn’t coordinate that cleanly. That said, some of this might be addressable with better tooling – debatable how much, though (more on that in a second).

\n\n

This is where things get interesting. Not “interesting” in the polite, boring way — actually interesting. The kind of interesting where you start pulling one thread and suddenly half of what you thought you knew doesn’t hold up anymore. At least that’s what happened to me.

The Review Bottleneck Nobody Plans For

\n\n

Your senior engineer can review maybe 3-4 PRs thoroughly in a day. If your team of 8 developers all finish their tasks on Thursday afternoon. Or because that’s when the sprint ends, you’ve just created a 2-day review queue.

That’s process overhead, not coding time, and it never shows up in the estimate.

How Shopify Cut Estimation Error by a substantial portion in Six Months

Key Takeaway: \n\n In early 2023, Shopify’s checkout team was consistently missing deadlines by 30-more than half.

\n\n

In early 2023, Shopify’s checkout team was consistently missing deadlines by 30-more than half, they’d estimate 2-week features that took a month. Leadership was pushing for better estimation discipline.

\n\n

Their engineering manager, Jean-Michel Lemieux, tried something different. Instead of more precise estimates, they started tracking what types of operate were invisible to their planning process.

Here’s what they discovered over 3 months of forensic sprint retrospectives:

Which is wild.

32% of delay came from integration with third-party payment providers (APIs changed, webhook formats shifted, rate limits triggered)
24% came from cross-team coordination (waiting for the platform team to expose a new API, getting security review for a data model change)
19% came from technical debt in the existing checkout code that had to be addressed before new features could land cleanly

\n\n

They didn’t get better at estimating those things. They added them as explicit line items in every project plan. \”Build feature X\” became \”Build feature X + integrate with 3 payment providers + coordinate schema change with platform team + refactor order validation logic.\”

\n\n

The change was boring. The results weren’t.

Over the next 6 months, their delivery variance dropped from a big portion over estimate to a notable share over. Same team, same complexity, same codebase.

They just stopped pretending that code exists separate from the system it runs in. Though it’s worth noting they also had executive buy-in to change how they measured success – not every team has that luxury.

What the Research Actually Says About Estimation Accuracy

\n\n

Dr. Magne Jørgensen at the University of Oslo has studied software estimation for 20 years. Magne 2023 meta-analysis of 304 software projects found something that challenges the entire premise of \”getting better at estimates.\”

\n\n

\”Expert estimates are accurate within 25% only 30% of the time. But teams that explicitly estimated uncertainty – identifying what could change scope or reveal hidden complexity – delivered on time 67% of the time despite less precise —\” – Dr.

Magne Jørgensen, \”Estimation Accuracy vs. And delivery Predictability,\” IEEE Software, March 2023

His insight: you can’t predict the future accurately, but you can identify the specific things most likely to change your prediction. And honestly? That’s more useful. Not even close.

Companies using this approach build estimates differently:

\n\n

Amazon’s two-pizza teams identify the top 3 unknowns per feature and estimate ranges instead of point values

Basecamp uses \”appetite\” (how much time we’re willing to spend) rather than estimates (how long we think it’ll take)

GitLab’s engineering teams add a mandatory \”integration overhead\” multiplier based on how many services the feature touches

\n\n

None of them are trying to predict the future with precision. They’re planning for the parts they can see and explicitly flagging the parts they can’t.

Nobody talks about this.

The Data on What Actually Improves Delivery Speed

DORA (DevOps Research and Assessment) has tracked software delivery performance since 2014. Their 2024 Accelerate State of DevOps report analyzed 36,000 responses across every company size and industry. Generally speaking, their findings have held remarkably consistent year over year (stay with me here).

\n\n

The top quartile of teams – those with the fastest, most reliable delivery – don’t estimate better — they reduce the cost of being wrong.

\n\n

Here’s the comparison between high performers and everyone else:

“Elite performers deploy 973x more frequently than low performers, with a lead time of less than one hour compared to between one week and one month. Their change failure rate is 15% versus 64% for low performers.” – DORA 2024 Accelerate State of DevOps Report

Deployment frequency: Elite teams deploy on-demand (multiple times per day) vs.

low performers deploying monthly or less
Lead time for changes: Less than one hour for elite teams vs. one week to one month for low performers
Time to restore service: Less than one hour vs. more than one week
Change failure rate: 15% vs. 64%

\n\n

When you can deploy in an hour and fix failures in an hour, estimation accuracy matters way less. You course-correct in real-time rather than betting everything on an upfront prediction.

\n\n

Actually, let me walk that back a bit – it’s not that estimation doesn’t matter. It’s that the feedback loop is so tight that small estimation errors don’t compound into project delays. You’re constantly re-estimating based on what you learned yesterday.

Where This Is Headed in the Next 18 Months

AI-assisted development is about to make traditional estimation methods completely obsolete. Not because AI will estimate better, but because it’ll change what needs estimating.

That’s my read, anyway – the ground is shifting fast (I know, I know).

\n\n

GitHub Copilot’s usage data from late 2024 shows that developers using AI pair programming complete individual coding tasks more than half faster. But integration work, debugging, and system design? Those took the same amount of time.

But here we are.

\n\n

So the ratio of \”run we currently estimate\” to \”work we currently ignore\” is shifting dramatically. If coding shrinks from a considerable portion of project time to a notable share, everything else becomes proportionally more important to plan for.

We could keep going — there’s always more to say about Software Development. But at some point you have to stop reading and start doing. Not everything here will apply to your situation. Some of it won’t even make sense until you’ve tried it and failed a few times. And that’s totally fine.

Here’s what I expect to see by mid-2026:

Estimation frameworks that explicitly separate “coding time” from “integration time” and “discovery time”
AI tools that analyze codebase complexity and predict integration overhead (Linear is already testing this)
Project management tools that track “invisible operate” as a first-class metric alongside velocity
Teams abandoning story points entirely in favor of cycle time and deployment frequency as the primary planning metrics

\n\n

The teams that’ll win aren’t the ones with the most accurate estimates. They’re the ones who minimize how much accuracy matters – by shortening feedback loops, reducing batch sizes. And building systems where being wrong doesn’t cost three weeks of rework.

\n\n

Because at the end of the day? Your users don’t care if you estimated correctly.

They care if you shipped.

Sources & References

The Standish Group – Chaos Report 2024 – The Standish Group International. “Software Project Success Rates and Delivery Performance Analysis.” 2024. standishgroup.com
State of the Octoverse 2024 – GitHub. “Global Software Development Trends and Developer Productivity Metrics.” November 2024. github.com
Accelerate State of DevOps Report 2024 – DORA (DevOps Research and Assessment). “Software Delivery and Operational Performance Benchmarks.” 2024. dora.dev
State of Software Delivery 2024 – CircleCI. “CI/CD Pipeline Performance and Delivery Metrics Analysis.” 2024. circleci.com
Jørgensen, M. – IEEE Software. “Estimation Accuracy vs. Delivery Predictability in Software Projects.” March 2023. ieeexplore.ieee.org

\n\n

Development practices and tool capabilities evolve rapidly. Verify current information directly with source organizations before making project decisions.

“,
“excerpt”: “Your sprint estimates feel reasonable until you’re two weeks into a \”two-day task.\” The 2024 Chaos Report shows

is a contributor at Haven Wulf.

View all posts by →

The Conventional Wisdom Gets Causation Backwards

The Three Categories of Invisible Work That Kill Timelines

The Compound Effect of Dependency Chains

When Parallel Work Isn’t Actually Parallel

The Review Bottleneck Nobody Plans For

How Shopify Cut Estimation Error by a substantial portion in Six Months

What the Research Actually Says About Estimation Accuracy

The Data on What Actually Improves Delivery Speed

Where This Is Headed in the Next 18 Months

Sources & References

You May Also Like