#3121: Can You Benchmark Government Value for Money?

A century of attempts to measure whether citizens get a good deal on taxes — and why none have fully worked.

Episode Details

Episode ID: MWP-3291
Published: May 29
Duration: 29:35
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro
Topics: public-transit productivity urban-planning

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

We instinctively compare prices for everything — groceries, streaming services, car repairs — but have no mental framework for "did I get a good deal?" on taxes, the single largest recurring expense most of us will ever have. This episode explores nearly a century of attempts to solve that problem, and why none have fully succeeded.

The first serious attempts came in the 1930s, when the National Industrial Conference Board produced Cost of Government indexes measuring spending per capita across cities and states. The problem: no way to account for quality differences. A city spending more on roads might simply have harsher winters. The rankings became lists of which cities had mild weather and low crime, dressed up as efficiency scores.

The 1970s and 80s brought the Productivity in Government movement, shifting from measuring spending to measuring outputs — potholes filled, permits processed, books checked out. But these were process metrics, not outcome metrics. You could fill a thousand potholes while having terrible roads, because you were filling the wrong ones or filling them badly.

The 1990s saw the most ambitious attempt yet. The Government Performance and Results Act mandated every federal agency create strategic plans with measurable goals. Yet compliance audits show only 34% of agencies actually use the performance data for budget decisions. The rest collect data, file reports, and continue making decisions the same way. The measurement system and decision-making system run on parallel tracks that never intersect.

Modern efforts like Boston's CityScore and Seoul's Smart Dashboard track hundreds of operational metrics updated daily. Yet citizen satisfaction with tax value remains stubbornly low. The dashboards measure whether chosen services are delivered efficiently — not whether the right services are being provided at the right price. A city can build an unwanted park on time and under budget, score a win on metrics, while citizens who wanted better schools get nothing.

The structural problem persists: measuring what's easy to measure rather than what actually matters — outcomes divided by tax burden.

Daniel sent us this one, and it's the kind of question that feels obvious the moment you hear it. We instinctively compare prices for everything — groceries, streaming services, car repairs, you name it. But when it comes to the single largest recurring expense most of us will ever have, which is taxes, we have no mental framework for "did I get a good deal?" And he's asking specifically about Jerusalem, where we've talked before about how residents pay among the highest municipal tax rates in the country while getting demonstrably worse services. But without a benchmark, "worse" is just a feeling. So the question is: have any serious attempts been made over the years to actually benchmark and compare government value for money in comparable terms? And if they have, why don't we use them?

This is one of those questions where the answer is simultaneously "yes, dozens of attempts" and "no, none of them really worked." And the reason they didn't work is the reason the question itself is so interesting. Let me start with the structural problem. In private markets, you have price signals. You buy a coffee, you know within about three seconds whether it was worth what you paid. You have competitors — if this coffee shop charges too much for bad coffee, you walk across the street. The feedback loop is immediate and brutal. Government services have none of that. There's no single price because your tax burden is spread across property tax, income tax, sales tax, fuel tax, and about fifteen other things you don't even notice. There's no competitor to switch to unless you move cities or countries. And the product itself is this diffuse bundle of roads and schools and policing and permits and garbage collection — most of which you consume passively and never price individually.

I can tell you exactly whether my twelve-dollar streaming subscription is worth it because I watched three shows this month or I didn't. But the five thousand dollars I pay in property taxes? I have no idea. I couldn't even tell you what percentage went to which service.

That's not a personal failing. That's a design feature of how we fund government. The opacity isn't accidental. So the obvious question is, hasn't anyone tried to fix this? Actually, people have been trying for nearly a century. Let's walk through the history of government benchmarking, because understanding why most of it failed tells you what would actually need to be true for a useful benchmark to work.

Let's do it.

The first serious attempts were in the nineteen thirties. The National Industrial Conference Board in the United States — which still exists, it's now called The Conference Board — they started producing something called Cost of Government indexes. The idea was straightforward: measure government spending per capita across different cities and states, compare the numbers, and you'd know which governments were more efficient. Lower spending per person meant better value, in theory.

I'm guessing the "in theory" part is where this falls apart.

The problem was they had no way to account for quality differences. City A might spend twice as much per capita on road maintenance as City B, but if City A has brutal winters that destroy asphalt every year and City B is in Arizona, the raw spending number tells you nothing about efficiency. Or one city might spend more on policing because it has higher crime — is that inefficient spending, or is it responding to greater need? The indexes couldn't distinguish. So they produced rankings, but the rankings were basically lists of which cities had mild weather and low crime rates, dressed up as efficiency scores.

The musical equivalent of beige wallpaper. It looked like analysis but it wasn't.

Fast forward to the nineteen seventies and eighties, and you get what was called the Productivity in Government movement. The US Government Accountability Office and the UK Audit Commission started developing performance indicators for local governments. This was a big step forward conceptually because they moved from measuring spending to measuring outputs. How many potholes were filled? How many library books were checked out? How quickly were building permits processed?

Instead of "how much did you spend," it became "what did you actually do with the money."

And for a while this seemed like the solution. But there was a fundamental problem that became clear over time: these were process metrics, not outcome metrics. You could fill a thousand potholes and have terrible roads, because you were filling the wrong potholes, or filling them badly so they reopened in three months. You could process building permits quickly by rubber-stamping everything, which creates a different kind of problem. The metrics captured activity, not results.

This feels like the core design flaw that keeps recurring. Measuring what's easy to measure, not what actually matters.

And it leads us to the nineteen nineties, which is when things got genuinely interesting. David Osborne and Ted Gaebler published a book called Reinventing Government in nineteen ninety-two, which argued that government should be mission-driven rather than rule-driven, and should focus on outcomes rather than inputs. This became hugely influential. The Clinton administration launched the National Performance Review, which was led by Al Gore, and it created the first citizen-facing report cards for federal agencies.

Like the ones you got in school.

They'd rate agencies on things like customer service, efficiency, and results. And in nineteen ninety-three, Congress passed the Government Performance and Results Act — GPRA — which mandated that every federal agency had to create strategic plans with measurable goals and report annually on whether they met them.

This sounds like exactly what we're talking about. Mandatory measurement, public reporting, clear goals. Did it work?

Here's the brutal number. Compliance audits show that only thirty-four percent of federal agencies actually use the performance data they collect for budget decisions. The rest collect the data, file the reports, and continue making budget decisions the same way they always did. It became a checkbox exercise. The reports exist, they're publicly available, but they don't drive resource allocation. The measurement system and the decision-making system are essentially parallel tracks that never intersect.

The transparency is real, but the accountability isn't.

That's a critical distinction. Transparency measures openness. It tells you what happened. Accountability means there are consequences for what happened. You can have perfect transparency about terrible value, and the transparency doesn't fix the value. It just documents the failure more thoroughly.

Like a restaurant that posts detailed nutritional information about its food but the food is still terrible. Knowing the calorie count doesn't make it taste better.

So the two-thousands brought the open data wave. The International Budget Partnership launched the Open Budget Index in two thousand six, which ranks countries on budget transparency. One hundred and twenty countries are now covered. But again, the index measures whether governments publish budget documents, not whether the spending in those documents represents good value. A government can score perfectly on the Open Budget Index while systematically wasting taxpayer money — as long as it documents the waste.

I assume the two-thousands also brought the first wave of digital dashboards.

And this is where we get some of the most sophisticated attempts. Boston launched CityScore in twenty sixteen. It tracks more than thirty operational metrics — three-one-one response times, street cleanliness scores, library visits, emergency medical response times. Everything gets a score, and the overall CityScore is an aggregate. The idea was to give the mayor a single number every day that tells him how the city is performing.

A single number. That's ambitious. What does it actually capture?

Here's the problem. Boston's CityScore can show a ninety-five out of one hundred on operational metrics, and citizens can still feel overtaxed and underserved. Because the metrics don't measure whether the right services are being provided at the right price. They measure whether the services the city has chosen to provide are being delivered efficiently. Those are completely different questions. If the city decides to spend forty million dollars on a new park that nobody asked for and nobody uses, and they build it on time and under budget, CityScore gives them a win. The citizen who wanted better schools gets nothing.

This is the "measuring what's easy versus what matters" problem you mentioned.

Seoul takes it even further. The Smart Seoul Dashboard tracks more than twelve hundred indicators, updated every twenty-four hours. Air quality, traffic flow, crime reports, waste collection rates, public transport punctuality — twelve hundred data points, refreshed daily. It's an extraordinary technical achievement. And yet citizen satisfaction surveys in Seoul consistently show satisfaction with tax value below sixty percent. There's a persistent gap between what the metrics say and what citizens feel.

Because the dashboard tells you whether the bus arrived on time. It doesn't tell you whether you needed that bus route in the first place, or whether the money spent on it could have been better spent on something else.

So that's the history. Nearly a century of attempts, and they all share the same structural flaw. They measure inputs, or they measure processes, or they measure operational efficiency. None of them measure the ratio that actually matters, which is outcomes divided by tax burden. What did citizens actually get, relative to what they paid?

Let's talk about the attempts that have tried to measure that ratio. You mentioned the OECD's Government at a Glance report.

The twenty twenty-three edition is the most ambitious attempt at cross-country comparison I've seen. It compares government spending per capita across thirty-eight OECD countries against twelve outcome indicators — health outcomes, education outcomes, infrastructure quality, public safety, and so on. The idea is you can see which countries are getting more outcomes per dollar spent. Norway and Switzerland tend to come out near the top. The United States tends to come out in the middle of the pack despite very high spending, which suggests lower value for money.

What's the criticism of this approach?

Two big ones. First, the outcome indicators are weighted equally, but citizens don't weight things equally. A country that spends heavily on healthcare and gets great health outcomes but neglects road maintenance and has terrible infrastructure will score well on the OECD's index, because health outcomes are a big part of the composite. But citizens in that country might be furious about the potholes and feel they're getting terrible value. The index imposes a uniform preference structure that no actual citizen holds.

It's like a restaurant review that weights the appetizers, mains, and desserts equally, even though you're someone who only cares about the mains.

Second criticism is that even the outcome indicators are proxies. Healthcare spending per capita versus life expectancy sounds clean, but life expectancy is influenced by diet, genetics, lifestyle, environmental factors — things government healthcare spending has limited control over. So you're measuring government performance with a yardstick that government doesn't fully control.

This connects to something you mentioned about tax receipts. The experiments where governments actually showed people where their money went.

This is one of my favorite findings in this whole field. Sweden did this in twenty ten. They sent citizens a personalized breakdown of exactly where their tax payments went — so many kronor to healthcare, so many to education, so many to infrastructure, so many to defense. Just a simple statement. No policy changes, no tax cuts, no service improvements. Citizen satisfaction with tax value increased by fifteen percent. Japan replicated this in twenty fourteen with similar results. The UK ran a pilot in twenty sixteen and saw satisfaction increases of twelve to eighteen percent, depending on the demographic group.

The value problem is partly a visibility problem. People don't hate what they're getting. They don't know what they're getting.

The implication is kind of staggering when you think about it. A significant portion of tax dissatisfaction isn't about the actual ratio of services to tax burden. It's about the perceived ratio. And perception can be shifted just by making the services visible. Which suggests governments have been leaving satisfaction on the table for decades, simply by being opaque about where the money goes.

Or they've been strategically opaque because they don't want citizens doing the math.

That's the political economy question, isn't it. Let me give you a concrete example of what happens when someone does try to build a real value-for-money framework. The Mercatus Center at George Mason University produces something called the State Fiscal Condition Index. It ranks US states on their fiscal health — debt levels, unfunded pension liabilities, revenue adequacy, and so on. The Tax Foundation produces the State Business Tax Climate Index, which compares tax structures across states. These are serious, data-driven efforts.

They come with an ideological frame.

They're explicit about it. The Tax Foundation's index is designed to measure how favorable a state's tax system is to business, which means lower and simpler taxes score better almost by definition. That's not a value-for-money framework. That's a low-tax advocacy framework dressed in index clothing. A high-tax state with excellent public services could score terribly on the Tax Foundation's index while delivering fantastic value to its citizens.

Which brings us to the misconception that lower taxes always mean better value.

It's probably the most common error in this whole discussion. Value is a ratio. It's outcomes divided by cost. You can improve value by increasing outcomes, not just by cutting costs. A city with high taxes and world-class services can deliver better value than a city with low taxes and crumbling infrastructure. The question isn't "how much am I paying." The question is "what am I getting for what I'm paying." And those are different questions.
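As a rough illustration of the ratio described above, the sketch below compares two hypothetical cities. The outcome scores and tax figures are invented for the example, and `value_ratio` is an illustrative helper, not an established metric.

```python
def value_ratio(outcome_score: float, annual_tax: float) -> float:
    """Outcome points delivered per 1,000 currency units of tax paid."""
    if annual_tax <= 0:
        raise ValueError("annual_tax must be positive")
    return outcome_score * 1000 / annual_tax


# Hypothetical figures, for illustration only.
high_tax_city = value_ratio(outcome_score=90, annual_tax=10_000)
low_tax_city = value_ratio(outcome_score=40, annual_tax=6_000)

# The high-tax city delivers more outcome per unit of tax in this example.
print(high_tax_city > low_tax_city)  # True
```

The point of the sketch is only that the comparison runs on the ratio, so a higher tax bill can still come out ahead if outcomes rise faster than cost.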

Like adopting a feral cat.

I'm going to need you to connect that one for me.

You think you're getting one thing — a cute pet, low maintenance. Then you discover you're actually getting a small ecosystem of veterinary bills, destroyed furniture, and psychological warfare. The initial cost told you nothing about the actual value proposition.

That's actually a perfect analogy. The headline tax rate is the adoption fee. The total value equation includes everything that comes after. So the question becomes: what would a useful tax value benchmark look like if we designed it from scratch?

Let's build it.

So the first thing you need is a clear numerator. That's the total bundle of services you receive. Not just the ones you use directly, but the ones that benefit you indirectly. You may never call the fire department, but you benefit from its existence because your insurance rates depend on the city's fire response capability. You may not have kids in public schools, but you benefit from an educated workforce and the effect of school quality on property values. So the numerator has to capture the full service bundle, not just your personal consumption.

That's already hard, because how do you weight services you don't use?

This is where citizen preference weighting becomes essential. You need to know what citizens actually value, not what a central planner assumes they value. And preferences vary dramatically. A retiree and a parent of three young children live in the same city, pay the same tax rate, and value completely different services. Any single composite score will misrepresent both of them.

You need personalized value scores, not a single number.

Which is technically possible now in a way it wasn't even a decade ago. But let me finish the framework. The denominator is total tax burden. Not just property tax, not just income tax. Property tax, income tax, sales tax, fuel tax, vehicle registration, utility fees, permit fees, everything that flows from your wallet to the government. If you don't capture the full burden, you're comparing partial costs to full benefits, which inflates the apparent value.

That's hard because some of those taxes are invisible. Sales tax is embedded in prices. Fuel tax is built into what you pay at the pump. Most people couldn't tell you their total annual tax burden within a thousand dollars.

Studies show that when you ask people what they pay in total taxes, the average estimate is off by thirty to forty percent, usually low. People are systematically unaware of their own tax burden. So the denominator in any value-for-money calculation is something most citizens don't actually know.

We have an invisible denominator and a numerator that depends on subjective preferences. This is why nobody's solved this.

Here's what's changed. We now have the technical infrastructure to make both visible. There's a concept that's emerged in the last few years called Quality-Adjusted Tax Burden, or QATB. Academic papers by Martinez-Vazquez and Timofeev in twenty twenty-two have proposed formal methodologies for this. The idea is you take total tax burden as the denominator, and for the numerator you create a weighted composite of service quality scores, adjusted for local cost of living and citizen preferences. No government has adopted it yet, but the math exists.
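The episode names the ingredients of a quality-adjusted measure but not its exact formula, so the following is only a minimal sketch under stated assumptions: quality scores run from 0 to 100, preference weights are relative importance points, and the cost-of-living adjustment is a simple multiplier on the numerator. The function names, the sample scores, and the two resident profiles are all hypothetical.

```python
def weighted_quality(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Preference-weighted average of service quality scores (0 to 100)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def qatb_value(scores, weights, total_tax, cost_of_living=1.0):
    """Adjusted quality points per 1,000 currency units of total tax.

    Assumption: cost of living enters as a plain multiplier on the numerator.
    """
    if total_tax <= 0:
        raise ValueError("total_tax must be positive")
    numerator = weighted_quality(scores, weights) * cost_of_living
    return numerator * 1000 / total_tax


# Same city, same tax bill, two hypothetical residents with different priorities.
scores = {"healthcare": 80, "schools": 50, "transit": 70}
retiree = {"healthcare": 6, "schools": 1, "transit": 3}
parent = {"healthcare": 2, "schools": 6, "transit": 2}
total_tax = 8_000

print(qatb_value(scores, retiree, total_tax))  # 9.25
print(qatb_value(scores, parent, total_tax))   # 7.5
```

With identical services and identical taxes, the two residents get different scores purely because their weights differ, which is why a single composite number would misrepresent both.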

The digital twin approach goes even further.

Singapore's Virtual Singapore platform, launched in twenty eighteen, is the closest thing we have to a "what if" tool for government services. It's a three-dimensional digital model of the entire city-state, updated with real-time data on traffic, energy use, service delivery, everything. And citizens can run simulations. What if my property taxes increased by five percent? The system would show you projected changes in park maintenance frequency, bus arrival intervals, school funding levels. It makes the tax-service tradeoff visible and interactive.

It's essentially a price comparison tool for government services. You can see what you'd get for a different tax level.

Helsinki has something similar with their digital twin, launched in twenty twenty-one. And Estonia's X-Road system already provides real-time cost-per-service data across government agencies. The technical infrastructure for personalized tax value assessments exists. It's not science fiction. It's deployed, in production, in multiple countries.

If the tools exist, and the math exists, and the pilots show that transparency alone increases satisfaction, why isn't this everywhere?

Because it's not a technical problem. It's a political problem. Governments don't want to be benchmarked. A genuine tax value index would create a clear, comparable metric that citizens could use to judge their government's performance. And if you're a mayor or a city councilor or a member of parliament, the last thing you want is a number that says "your constituents are getting forty-two cents of value for every dollar they pay in taxes, while the city next door delivers seventy-eight cents."

Especially if that number updates in real time and gets published on a dashboard.

The incentive structure is completely misaligned. The people who would need to implement these systems are the same people who would be evaluated by them. And they control the implementation. It's like asking restaurant owners to voluntarily install a system that publishes health inspection scores on a giant screen facing the street.

Although restaurants did eventually get health inspection scores. They fought it, but it happened.

That's a good counterpoint. Restaurant health grades in places like Los Angeles and New York were fiercely opposed by the restaurant industry when they were proposed. Now they're just part of the landscape, and they demonstrably improved food safety outcomes. So it's not impossible. It just requires enough political pressure to overcome the institutional resistance.

That brings us to Jerusalem. Because this isn't abstract for us.

It's painfully concrete. The numbers are stark. Jerusalem's municipal tax rate — the arnona — is approximately forty percent higher than Tel Aviv's. That's according to twenty twenty-three data from the Israel Ministry of Interior. Meanwhile, service satisfaction scores in Jerusalem are twenty-eight percent lower than Tel Aviv's. So you're paying forty percent more and getting nearly thirty percent less satisfaction. That is the textbook definition of terrible value for money.
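To put the two percentages quoted above on one scale, the arithmetic can be sketched as a single ratio. This assumes, as a simplification, that satisfaction scores and tax rates can both be treated as linear measures relative to Tel Aviv; the variable names are illustrative.

```python
tax_ratio = 1.40           # Jerusalem arnona relative to Tel Aviv (about 40% higher)
satisfaction_ratio = 0.72  # Jerusalem satisfaction relative to Tel Aviv (28% lower)

# Satisfaction per unit of tax, relative to Tel Aviv = 1.0.
relative_value = satisfaction_ratio / tax_ratio
print(f"{relative_value:.2f}")  # 0.51
```

On those assumptions, Jerusalem residents would be getting roughly half the satisfaction per unit of municipal tax that Tel Aviv residents get.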

We've talked before about where the money goes. Twenty-two percent of Jerusalem's arnona revenue gets diverted to coalition-mandated pet projects rather than basic services.

That's the number. Nearly a quarter of the city's primary tax revenue stream goes to politically negotiated earmarks rather than core service delivery. And there's no mechanism for citizens to see this, challenge it, or vote on it directly. The budget process is opaque, the service outcomes are poor, and the tax rates keep rising. Jerusalem is a case study in what happens when there's no value-for-money benchmark — decades of decline without accountability.

Let's talk about what citizens can actually do. Because we've spent a lot of time describing the problem and the failed attempts. What's the actionable framework?

I think there are three things any citizen can do, starting tomorrow. The first is build your own personal tax value index. You need three data points. One, your total tax burden — property tax plus income tax plus sales tax, as best you can estimate it. Two, service quality scores from whatever local surveys or independent audits are available in your area. And three, a comparison point — what are neighboring jurisdictions paying and getting? The formula is simple. Service quality score divided by tax burden, multiplied by a hundred. That's your personal tax value index. It's crude, but it's a starting point. And the act of calculating it changes how you think about what you're paying.
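The formula just described can be sketched in a few lines. All figures below are hypothetical placeholders, and `personal_tax_value_index` is an illustrative name. The absolute number depends on the currency and the score scale, so only the comparison between jurisdictions carries meaning.

```python
def personal_tax_value_index(service_quality: float, total_tax: float) -> float:
    """Service quality score divided by total tax burden, multiplied by 100."""
    if total_tax <= 0:
        raise ValueError("total_tax must be positive")
    return service_quality / total_tax * 100


# Data point 1: total tax burden, as best you can estimate it (hypothetical).
my_taxes = {"property": 5_000, "income": 11_000, "sales": 2_000}
my_total_tax = sum(my_taxes.values())

# Data point 2: a service quality score from a local survey or audit (hypothetical).
my_index = personal_tax_value_index(service_quality=62, total_tax=my_total_tax)

# Data point 3: the same two numbers for a neighboring jurisdiction (hypothetical).
neighbor_index = personal_tax_value_index(service_quality=75, total_tax=15_000)

print(f"mine: {my_index:.3f}  neighbor: {neighbor_index:.3f}")
print("neighbor delivers more per unit of tax:", neighbor_index > my_index)
```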

The second thing?

Demand a tax receipt from your local government. If they don't provide one — most don't — use open budget data to create your own. Most cities publish their budgets online now. You can map the spending categories to your tax payments and build a rough breakdown. The Sweden experiment showed that just seeing the breakdown changes perceived value. Do it yourself if your government won't do it for you.
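A do-it-yourself receipt along these lines could be sketched as follows. It assumes your payment is spent in the same proportions as the published budget, which ignores earmarked revenues and transfers from other levels of government. The budget categories and amounts are hypothetical.

```python
def tax_receipt(tax_paid: float, budget: dict[str, float]) -> dict[str, float]:
    """Split one taxpayer's payment across categories by budget share."""
    total_budget = sum(budget.values())
    if total_budget <= 0:
        raise ValueError("budget must have a positive total")
    return {
        category: round(tax_paid * amount / total_budget, 2)
        for category, amount in budget.items()
    }


# Hypothetical city budget, in millions.
budget = {"education": 400, "roads": 250, "public safety": 200, "sanitation": 150}

receipt = tax_receipt(5_000, budget)
for category, amount in receipt.items():
    print(f"{category:<15}{amount:>10,.2f}")
```

Rounding each line to two decimals can leave the lines a cent or two off the total for less tidy budgets, which is acceptable for a rough personal breakdown.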

The third is to push for participatory budgeting in your municipality. It started in Porto Alegre, Brazil in nineteen eighty-nine and now exists in more than seven thousand cities globally. It's the only mechanism that forces a direct conversation about tradeoffs between tax levels and service bundles. Citizens literally vote on how a portion of the budget is spent. It creates the feedback loop that's missing from conventional representative government. It doesn't solve the whole problem, since it usually only covers a small percentage of the total budget, but it builds the muscle of thinking about tax value.

There's a fourth thing, which is write to your city council and ask for a citizen's value report. An annual document that shows tax rates, service levels, and benchmarks against comparable cities.

If enough people ask, it becomes a political priority. Council members start thinking, "my constituents are asking for this, maybe I should champion it." That's how restaurant health grades eventually happened. Enough people demanded to know what they were getting.

Even if we all build our personal tax value indexes and demand better reporting, there's a deeper question. Would a universal tax value index actually change voting behavior? Would citizens punish low-value governments and reward high-value ones? Or would political tribalism override the data?

That's the question that keeps me up at night. The evidence on this is mixed. There are studies showing that when voters are given clear information about government performance, they do reward and punish accordingly — but the effect is smaller than you'd hope, and it's easily overwhelmed by partisan identity. People will rationalize terrible value from their own party while condemning decent value from the other party. The information doesn't land on a blank slate. It lands on a pre-existing political identity that filters everything.

We might build this beautiful measurement system, and people might just ignore it when it conflicts with their tribal loyalties.

That's the pessimistic take. The optimistic take is that the Sweden experiment suggests something different. When people got their tax receipts, satisfaction changed. It wasn't filtered through partisanship — it was just information, and it shifted perception. Maybe value-for-money assessment is concrete enough, personal enough, that it can cut through some of the identity barriers. You might rationalize away a newspaper article about government waste. It's harder to rationalize away a personalized document showing you got thirty-seven cents of value for every dollar you paid.

There's something about seeing your own name on the statement that changes the psychology.

And this is where I think the next decade is going to be interesting. AI-powered government dashboards are becoming more sophisticated. Estonia's X-Road already provides real-time cost-per-service data. Singapore's digital twin lets you simulate tax scenarios. The technical infrastructure for personalized tax value assessments is being built, whether governments want it or not. The question is whether citizens will demand access to it.

Whether governments will provide access voluntarily or have to be dragged there.

My guess is dragged. The Jerusalem case shows what the default looks like when there's no accountability mechanism. Rates go up, services degrade, and the only people who notice are the ones who can afford to leave. Everyone else just lives with the decline.

The decline becomes normal. You stop expecting better because you've never had a way to prove that better is possible.

That's the insidious part. Without a benchmark, bad becomes the baseline. You don't know you're getting terrible value because you've never seen what good value looks like. The benchmark isn't just a measurement tool. It's an imagination tool. It shows you what's possible.

The tools exist. Tax receipts, digital twins, participatory budgeting, personal value indexes. The math exists. The pilots show it works. The barrier isn't technical or methodological. It's that the people who control implementation are the ones who'd be measured.

That's not a small barrier. But it's not an eternal one either. Restaurant health grades happened. Credit scores happened. Product nutrition labels happened. All of them were opposed by the industries being measured. All of them eventually became normal. Government tax value might be next.

Which means the most useful thing a citizen can do right now is start asking the question. Once enough people ask "what am I actually getting for what I'm paying," the political calculus shifts. The cost of not providing an answer becomes higher than the cost of providing one.

That's the note I want to end on. Not despair about the opacity of government, but recognition that opacity has a breaking point. Every transparency reform in history started with people asking a question that the institutions didn't want to answer. The tax value question is one of the last big ones that hasn't been asked at scale yet. But it's coming.

Now: Hilbert's daily fun fact.

Hilbert: Ship's biscuit, the hardtack ration fed to sailors from at least the thirteenth century, was essentially a chemically preserved food — its near-zero moisture content, achieved by baking at low temperature for hours until the starch matrix crystallized into a glass-like state, made it shelf-stable for decades. Sailors would soften it by dipping in grog or, in the waters off Suriname during the medieval period, in coconut milk when naval expeditions passed through.

Coconut milk hardtack. That's a new one.

This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop for keeping the show running. If you want to dig deeper into any of this, you can find show notes and links at myweirdprompts. And if you found this episode useful, leave us a review wherever you listen. It helps other people find the show.

Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
