Mantic Monday 3/14/22

More Ukraine warcasting, nuclear war risk, forecasters vs. experts


Ukraine Warcasting

Changes in Ukraine prediction markets since my last post February 28:

  1. Will Kiev fall to Russian forces by April 2022?: 69% → 14%

  2. Will at least three of six big cities fall by June 1?: 71% → 70%

  3. Will World War III happen before 2050?: 20% → 21%

  4. Will Russia invade any other country in 2022?: 12% → 10%

  5. Will Putin still be president of Russia next February?: 71% → 80%

  6. Will 50,000 civilians die in any single Ukrainian city?: 8% → 12%

  7. Will Zelenskyy no longer be President of Ukraine on 4/22?: 63% → 20%

If you like getting your news in this format, subscribe to the Metaculus Alert bot for more (and thanks to ACX Grants winner Nikos Bosse for creating it!)

Numbers 1 and 7 are impressive changes! (it’s interesting how similarly they’ve evolved, even though they’re superficially about different things and the questions were on different prediction markets). Early in the war, prediction markets didn’t like Ukraine’s odds; now they’re much more sanguine.

Let’s look at the exact course:

This is almost monotonically decreasing. Every day it’s lower than the day before.

How suspicious should we be of this? If there were a stock that decreased every day for twenty days, we’d be surprised that investors were constantly overestimating it. At some point on day 10, someone should think "looks like this keeps declining, maybe I should short it", and that would halt its decline. In efficient markets, there should never be predictable patterns! So what’s going on here?

Maybe it’s a technical issue with Metaculus? Suppose that at the beginning of the war, people thought there was an 80% chance of occupation. Lots of people predicted 80%. Then events immediately showed the real probability was more like 10%. Each day a couple more people showed up and predicted 10%, which gradually moved the average of all predictions (old and new) down. You can see a description of their updating function here - it seems slightly savvier than the toy version I just described, but not savvy enough to avoid the problem entirely.
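The toy version can be made concrete. Here's a minimal sketch (made-up numbers, and a plain mean rather than Metaculus's actual recency-weighted updating function) of how a lagging average drifts down day after day even though no individual forecaster's belief is changing:

```python
# Toy model of a lagging community average (NOT Metaculus's actual updating
# function). 100 early predictions sit at 80%; each day 10 new predictions
# arrive at 10%, and the displayed forecast is the plain mean of every
# prediction ever made, old and new alike.

def community_average(initial_preds, daily_new_preds, new_value, days):
    preds = list(initial_preds)
    history = []
    for _ in range(days):
        preds.extend([new_value] * daily_new_preds)
        history.append(sum(preds) / len(preds))
    return history

curve = community_average([0.80] * 100, daily_new_preds=10, new_value=0.10, days=20)
print(round(curve[0], 3), round(curve[-1], 3))  # drifts from ~0.736 down to ~0.333
```

The displayed number declines monotonically for the full twenty days, purely as a mechanical artifact of averaging stale predictions with fresh ones.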

But Polymarket has the same problem:

It shouldn’t be able to have technical issues like Metaculus’s, so what’s up?

One possibility is that, by a crazy coincidence, every day some new independent event happened that thwarted Russia and made Ukraine’s chances look better. Twenty dice rolls in a row came up natural 20s for Ukraine. Seems unlikely.

Another possibility is that forecasters started out thinking that Russia was strong, in fact Russia was weak, and every day we’ve gathered slightly more evidence for that underlying reality. I’m having trouble figuring out if this makes sense. You’d still think that after ten straight days of that, people should say "probably tomorrow we’ll get even more evidence of the same underlying reality, might as well update today".

A third possibility is that forecasters are biased against updating. A perfect Bayesian, seeing the failures of the Russian advance over the first few days, would have immediately updated to something like correct beliefs. But the forecasters here were too conservative and didn’t do that.

A fourth possibility is that forecasters are biased towards updating too much. Ukrainian propaganda is so good that every extra day you’re exposed to it, you become more and more convinced that Ukraine is winning.

[EDIT: Commenter HouseAlwaysWins notes "If you plotted a prediction for "will this iodine-131 nucleus have decayed by April 1" you'd also get a roughly linear decline (unless it decayed in which case it would jump up to 100%). Prediction markets are allowed to have "story arcs", so long as the *expected* change is zero." Some other people make similar good points, which you can find in the comments section.]
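HouseAlwaysWins' point can be sketched numerically. Assume an illustrative constant hazard (2%/day, not a calibrated figure) of Kiev falling before a 50-day deadline: the fair price declines on every uneventful day, yet the expected one-day change is exactly zero, so there's nothing to short.

```python
# Constant-hazard sketch of a "story arc": the fair price is the probability
# the event fires at least once in the remaining days.

def fair_price(days_left, hazard=0.02):
    return 1 - (1 - hazard) ** days_left

def expected_next_price(days_left, hazard=0.02):
    # Tomorrow either the event fires (price jumps to 1) or it doesn't
    # (price reprices with one fewer day left).
    return hazard * 1.0 + (1 - hazard) * fair_price(days_left - 1, hazard)

# The price declines on every uneventful day...
assert fair_price(49) < fair_price(50)
# ...but the expected one-day change is exactly zero: the slow decline is
# balanced by a rare jump to 100%, so no short is profitable in expectation.
for d in (50, 30, 10, 1):
    assert abs(expected_next_price(d) - fair_price(d)) < 1e-12
```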

Nuclear Warcasting

A friend recently invited me to their bolthole in the empty part of Northern California. Their argument was: as long as the US and Russia are staring menacingly at each other, there’s a (slight) risk of nuclear war. Maybe we should get out of cities now, and beat the traffic jam when the s#!t finally hits the fan.

I declined their generous offer, but I’ve been wondering whether I made the right call. What exactly is the risk of nuclear war these next few months?

Enter Samotsvety Forecasts. This is a team of some of the best superforecasters in the world. They won the CSET-Foretell forecasting competition by an absolutely obscene margin, "around twice as good as the next-best team in terms of the relative Brier score". If the point of forecasting tournaments is to figure out who you can trust, the science has spoken, and the answer is "these guys".

As a service to the community, they came up with a formal forecast for the risk of near-term nuclear war:

We aggregated the forecasts of 8 excellent forecasters for the question What is the risk of death in the next month due to a nuclear explosion in London? Our aggregate answer is 24 micromorts (7 to 61) when excluding the most extreme on either side. A micromort is defined as a 1 in a million chance of death. Chiefly, we have a low baseline risk, and we think that escalation to targeting civilian populations is even more unlikely.

For San Francisco and most other major cities, we would forecast 1.5-2x lower probability (12-16 micromorts). We focused on London as it seems to be at high risk and is a hub for the effective altruism community, one target audience for this forecast.

Given an estimated 50 years of life left, this corresponds to ~10 hours lost. The forecaster range without excluding extremes was <1 minute to ~2 days lost. Because of productivity losses, hassle, etc., we are currently not recommending that individuals evacuate major cities.
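As a sanity check on the quoted arithmetic (using their stated inputs of 24 micromorts and ~50 years of remaining life):

```python
# Expected life lost = P(death) * remaining life.
MICROMORT = 1e-6                       # a 1-in-a-million chance of death
remaining_life_hours = 50 * 365.25 * 24

hours_lost = 24 * MICROMORT * remaining_life_hours
print(round(hours_lost, 1))            # ~10.5, matching the "~10 hours" figure
```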

You can read more about their methodology and reasoning on the post on Effective Altruism Forum, but I found this table helpful:

Along with reassuring me I made the right choice not to run and hide, this is a new landmark in translating forecasting results to the real world. The whole stack of technologies came together: tournaments to determine who the best predictors are, methods for aggregating probabilities, and a real-world question that lots of people care about. Thanks to Samotsvety and their friends for making this happen!

(see here for some pushback, disagreement, and back-and-forth)

Forecasters Vs. Experts

Also from the EA Forum this month: Comparing Top Forecasters And Domain Experts, by Arb Consulting (the team also includes one of the Samotsvety members who worked on the nuclear risk estimate).

Everyone always tells the story of how Tetlock’s superforecasters beat CIA experts. Is it true? Arb finds that it’s more complicated:

A common misconception is that superforecasters outperformed intelligence analysts by 30%. Instead: Goldstein et al showed that [EDIT: the Good Judgment Project's best-performing aggregation method][2] outperformed the intelligence community, but this was partly due to the different aggregation technique used (the GJP weighting algorithm performs better than prediction markets, given the apparently low volumes of the ICPM market). The forecaster prediction market performed about as well as the intelligence analyst prediction market; and in general, prediction pools outperform prediction markets in the current market regime (e.g. low subsidies, low volume, perverse incentives, narrow demographics). [85% confidence]

In the same study, the forecaster average was notably worse than the intelligence community.

If I’m understanding this right, the average forecaster did worse than the average expert, but Tetlock had the bright idea to use clever aggregation methods for his superforecasters, and the CIA didn’t use clever aggregation methods for their experts. The CIA did try a prediction market, which in theory and under ideal conditions should work at least as well as any other aggregation method, but under real conditions (it was low-volume and poorly-designed) it did not.
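For a flavor of what "clever aggregation" can mean: one technique from the forecasting literature is to average forecasts in log-odds space and then extremize the result. The sketch below is illustrative only; it is not GJP's actual weighting algorithm, and the extremizing factor of 2.5 is an arbitrary choice for the example.

```python
import math

# Illustrative pool aggregation: mean of log-odds, then extremize.

def aggregate(probs, extremize=2.5):
    logits = [math.log(p / (1 - p)) for p in probs]
    mean_logit = sum(logits) / len(probs)
    return 1 / (1 + math.exp(-extremize * mean_logit))

forecasts = [0.65, 0.70, 0.60, 0.75]
simple_mean = sum(forecasts) / len(forecasts)
print(round(simple_mean, 3))           # plain average stays near the middle
print(round(aggregate(forecasts), 3))  # extremized pool is pushed further from 50%
```

The intuition for extremizing is that individual forecasters each hold only part of the available evidence, so a pool that agrees in one direction justifies a more confident answer than any one member gives.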

They go on to review thirteen other studies in a variety of domains (keep in mind that different fields may have different definitions of "expert" and require different levels of expertise to master). Overall there was no clear picture. Eyeballing the results, it looks like forecasters often do a bit better than experts, but with lots of caveats and possible exculpatory factors. Sometimes the results seemed a little silly: in one, forecasters won because the experts didn’t bother to update their forecasts often enough as things changed; in another, "1st place went to one of the very few public-health professionals who was also a skilled Hypermind forecaster."

They conclude:

To distinguish some claims:

1: "Forecasters > the public"
2: "Forecasters > simple models"
3: "Forecasters > experts"

3a: "Forecasters > experts with classified info"
3b: "Averaged forecasters > experts"
3c: "Aggregated forecasters > experts"

We think claim (1) is true with 99% confidence[1] and claim (2) is true with 95% confidence. But surprisingly few studies compare experts to generalists (i.e. study claim 3). Of those we found, the analysis quality and transparency leave much to be desired. The best study found that forecasters and health professionals performed similarly. In other studies, experts had goals besides accuracy, or there were too few of them to produce a good aggregate prediction.

So, kind of weak conclusion, but you can probably believe some vague thing like "forecasters seem around as good as experts in some cases".

Also, keep in mind that in real life almost no one ever tries to aggregate experts in any meaningful way. Real-life comparisons tend to be more like "aggregated forecasters vs. this one expert I heard about one time on the news". I’d go with the forecasters in a situation like this - but again, the studies are too weak to be sure!


1: Taosumer reviews my Prediction Market Cube and asks why I don’t have "decentralized" on there as a desideratum. My answer: decentralization is great, but for me it cashes out in "ease of use" - specifically, it’s easy to use because the government hasn’t shut it down or banned you. Or as "real money" - the reason Manifold isn’t real-money is because they’re centralized and therefore vulnerable and therefore need to obey laws. Or as "easy to create market" - the reason Kalshi doesn’t let you create markets is partly because it’s centralized and therefore vulnerable and therefore needs to limit markets to things regulators like. I agree that, because of those second order effects, decentralization is crucial and needs to be pursued more, and I agree that it’s a tragedy that [whatever happened to Augur] happened to Augur.

2: More people make Ukraine predictions: Maxim Lott, Richard Hanania (again), Samo Burja (again), EHarding (possibly trolling?), Robin Hanson (sort of)

3: Last month we talked about some problems with the Metaculus leaderboard. An alert reader told me about their alternative Points Per Question leaderboard, which is pretty good - although I think different kinds of questions give different average amounts of points so it’s still not perfect.

4: Also last month, I suggested Manifold Markets have a loan feature to help boost investment in long-term markets. They’ve since added this feature: your first $M20 will automatically be a zero-interest loan.

5: Related: I’m testing Manifold as a knowledge-generation device. If you want to help, go bet in the market about how I’ll rank interventions in an upcoming updated version of the Biodeterminists’ Guide To Pregnancy.

6: Reality Cards is a new site that combines the legal hassles of prediction markets with the technical hassles of NFTs. You bid to "rent" the NFT representing a certain outcome, your rent goes into a pot, and then when the event happens the pot goes to whoever held the relevant NFT. I’m not math-y enough to figure out whether this is a proper scoring rule or not, but it sure does sound unnecessarily complicated. I imagine everyone involved will be multimillionaires within a week.

7: In case a prediction market using NFTs isn’t enough for you, this article suggests that OpenDAO is working on a prediction market about NFTs. It claims they should be done by January, but I can’t find it.



Regarding the degradation of Zelenskyy NO on Polymarket, a lot of these markets behave like options: they have a time value, and in the absence of any new meaningful information, the time value falls until it reaches 0.

Good point! Given this time-value-of-options model, does it surprise you that the Zelenskyy NO price fell from 63% to 20% in the first 14 out of the 53 days until expiration?

It does not. I was a decently sized player in helping the odds go from ~60% to 20 or so. The bulk of the move was when market participants realized how inefficient Russian troop movements have been in Northern Ukraine.

Furthermore, the bulk of the shock move was done by one participant (a whale) whose trades move Polymarket substantially.

I see. Sounds like "market participants realized how inefficient Russian troop movements have been" was a bigger factor than "options generally go to 0 due to time value" in this case -- does that seem right to you in explaining what happened in the market?

It's dangerous to immediately price in the updates you expect to see based on the trend! The forecasters are behaving as proper Bayesian agents and updating slowly, according to the trickle of evidence. Moreover, waiting makes existing evidence stronger - nobody debunked it, you've seen nothing that contradicts it despite time in which such information could surface.

"This is almost monotonically decreasing. Every day it’s lower than the day before. How suspicious should we be of this? If there were a stock that decreased every day for twenty days, we’d be surprised that investors were constantly overestimating it. At some point on day 10, someone should think "looks like this keeps declining, maybe I should short it", and that would halt its decline. In efficient markets, there should never be predictable patterns!"

If you plotted a prediction for "will this iodine-131 nucleus have decayed by April 1" you'd also get a roughly linear decline (unless it decayed in which case it would jump up to 100%). Prediction markets are allowed to have "story arcs", so long as the *expected* change is zero.

Shouldn't the probability of Kiev falling by April decrease with each day closer to the deadline that it hasn't fallen? It doesn't seem that strange to me

Does it seem strange to you that it fell from 69% to 14% in the first 14 out of 53 days before the deadline?


April 1 is 17 days away, and the forecast graph only goes back to Feb 28, 14 days ago, so 31 days total. I don't know where your 53 comes from.

To answer the question: no, I don't think that drop is strange. Over those two weeks we saw the results and adjusted quickly to the new information; from here it should slowly decline as the deadline approaches.

Sorry, I'm probably just reading the graph badly, but what period in 2/28 are you thinking of as the "adjusted quickly" period, and what period is the "slowly decline" period? (The graph that Scott posted of the metaculus aggregation over time seems to move pretty smoothly from 2/28 to now; I don't see an obvious "adjusted quickly" period.)

The way I interpret and see the chart, it had some relatively old value Feb 28-Mar5, then steep drop to new lower value, then slow decay from there. This may not be the case, but the main point of my comment is to describe the behavior after that Mar5 drop. The comment from House Always Wins illustrates that point perhaps more articulately.

Ah, thinking of 3/5-3/6 as being a quick adjustment, and the resulting slide from 40% to 14% in the 8 days of the remaining 26, seems helpful. I still wouldn't have guessed that 2/3 of the probability mass "should" lie in the first 1/3 of days, so it does seem like there's more information being incorporated than just "probability decreases with each day closer to the April deadline that Kiev hasn't fallen".

It's probably not mathematically correct to think in these terms, but the psychology of something going from "likely to happen" to "not likely to happen" on that 60%-40% drop seems like a big deal to me.

Oh, I mixed up the deadline on the Zelenskyy polymarket market with the Kiev metaculus market. Should have said:

"Does it seem strange to you that it fell from 69% to 14% in the first 14 out of 32 days before the deadline?"


The slowness of Russia in the early phases of seizing Kyiv suggests a slowness of their attack in the later phases. The distributions of "how long it takes Russia to get its troops to Kyiv," "how long it takes Russia to start shelling Kyiv with artillery," "how long it takes Ukrainian resistance to break down," "how long it takes to seize the first raion," "how long it takes to seize the second raion," etc., are not going to be uncorrelated, even if they don't all have the same distribution.

A prediction market for next week’s Powerball numbers makes no sense, as that is an unknowable thing. Similarly, how Putin, Zelenskyy, Biden, Scholz, and random software problems that accidentally send a Russian cruise missile to Warsaw all interact to influence the future is just as unknowable as the future state of the balls bouncing around at lottery HQ.

What am I missing?

A prediction market for next week's Powerball numbers should have every combination priced at ~0.000000342229781%. The balls bouncing around have a huge number of unknowable interactions, so it's hopeless to form predictions based on predicting every step of the process. But, like with the Powerball *results*, it's possible to estimate what the distribution of *outcomes* might be, even if it's hopeless to predict every footfall of every soldier.
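That per-combination figure can be checked directly, assuming the current Powerball format (5 of 69 white balls plus 1 of 26 red balls):

```python
from math import comb

# Each exact Powerball combination has probability 1 / (C(69,5) * 26).
combinations = comb(69, 5) * 26
print(combinations)                    # 292,201,338 possible tickets
print(f"{1 / combinations:.15%}")      # ~0.000000342229781% per combination
```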

It isn't *just as* unknowable as Powerball, even if it's quite unknowable. All those men you named have stated their intentions and have a history of behaviors in situations of varying degrees of similarity. Human behavior can be predictable to some extent. Degrees of confidence exist.

A prediction market for next week's Powerball numbers is perfectly legitimate - the correct probability to assign to any number would be 1 / 1 million (or whatever the actual Powerball odds are). The only reason people wouldn't use such a market is that it would be so obvious that nobody would bet against the 1 / 1 million number.

What's the correct odds (ie the equivalent of saying "1 / 1 million") for Russia taking Kyiv this month? I'm not sure, but it's probably a different number than the correct odds for Russia taking Mariupol this month, and it's fair to speculate on both sets of odds.

The fact that we can't get certainty is fine, all prediction markets are asking you for is a probability. You probably have opinions on this probability already - 99% would seem unreasonably high, 0.1% unreasonably low.

Does that answer your question?

The odds of winning any lottery, for anybody, is 50%. It’s 50-50. You either do or you don’t.

Is it possible that 'story arcs' in prediction markets come about because most people have far heavier priors on other people's assessment and judgement than they do on events in reality itself?

Maybe most people, even those using prediction markets, have a really low prior on something like 'prediction markets are totally wrong here'? I have far more experience with 'prediction markets being generally correct' than I do with forecasting military outcomes.

For 5., the Biodeterminist's Guide to Parenting, #5 reads "Supplement with nicotinamide mononucleotide (for fathers)"

Should we assume that #s 2, 3, 7, 4, 10, all of them really, apply to the mother prior to birth? Subsequent to birth? The baby?

Yes, all the others are about the mother when pregnant.

"One possibility is that, by a crazy coincidence, every day some new independent event happened that thwarted Russia and made Ukraine’s chances look better. Twenty dice rolls in a row came up natural 20s for Ukraine. Seems unlikely."

That's more or less exactly what's happened. It's not unlikely at all, it's how we experience the world.

On day 1, Ukraine was still standing.

On day 2, Ukraine was still standing.

On day 3, Ukraine was still standing.

etc etc etc

These are all new pieces of information! On any given day N, different things might happen; we don't know. We have to wait and observe. Simply drawing a linear extrapolation through the first few days would be lunacy.

Also, labeling these as "natural 20s" is a huge overstatement of what we know about the underlying distribution. We have very little information on which to base an estimate of "Can Ukraine defend against an attack by the Russian army?", and I offer as evidence the fact that this post exists, we're having this conversation, etc etc. So it *may* be that Ukraine is on a hot streak and rolling 20s, or it may be that the outcomes we see are right in the middle of the distribution because it happens that Ukraine is a better match for Russia than some (many? most? idk!) people thought.

I'm thinking that one of the worst things about surviving a nuclear war would be finding yourself in a society organized and dominated by the kind of people who optimize their lives around surviving a nuclear war.

This past month has really brought back the '80s, when "is nuclear armageddon more or less likely this week?" was a perfectly normal topic for news discussion.


> If there were a stock that decreased every day for twenty days, we’d be surprised that investors were constantly overestimating it. At some point on day 10, someone should think "looks like this keeps declining, maybe I should short it", and that would halt its decline. In efficient markets, there should never be predictable patterns!

Most high-growth tech stocks have been in secular decline since November (see eg NFLX, COIN, DASH, PTON, U, etc), so if you really believe this, you should go pick up those obvious $100 bills just laying on the ground :)

Another explanation for the slow decline in Ukraine predictions:

This might happen if people are thinking something along the lines of "Russia will not capture Kiev by April unless they do something to turn things around before then". If so, then every day that Russia fails to turn things around provides a bit more evidence that Russia will not capture Kiev. This doesn't violate conservation of expected evidence or the efficient market hypothesis, because every day there is some small chance that Russia WILL turn things around and the odds will swing substantially in the other direction.

I'm kind of echoing what's been said before in the comments, but I'm really not sure why the monotonically decreasing pattern is odd. A stock shouldn't decline over time, but a stock is also in theory an infinite prediction (there's no time expiry). A prediction with a specific deadline is quite different.

Suppose your mental model is something like "there is a 1% chance Kiev will fall on any given day". With 30 days until April, you'd predict about a 74% chance of Kiev not falling (.99 ^ 30). With 15 days left, conditional on it not falling yet, you're up to about an 86% chance. Your base rate probability of Kiev falling doesn't have to change (it's still 1% on any given day), but the decreasing time to expiry changes your odds of the actual contract resolving one way or the other.
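The commenter's numbers check out, using their assumed constant 1%/day rate:

```python
# Probability Kiev does NOT fall in the remaining days, with a constant
# 1%/day hazard (the commenter's illustrative rate).
def p_survives(days_left, daily_hazard=0.01):
    return (1 - daily_hazard) ** days_left

print(round(p_survives(30), 2))  # 0.74 with 30 days left
print(round(p_survives(15), 2))  # 0.86 with 15 days left (given no fall yet)
```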


I think the key here is that the evidence has mostly been of the following form:

"the Russians haven't moved very far, and they've already been invading for X days"

That's evidence that becomes monotonically stronger as X increases but there's no sharp cutoff point where it qualitatively changes in character. It's quite different from the updates you would get from crucial pivotal events (agreeing to ship a more advanced SAM system, the fall of a city, losing a key skirmish etc) but I also think it's extremely relevant evidence.

For instance, suppose we think one of two scenarios is likely to be the correct description of the situation:

Scenario 1) Russian forces are fundamentally powerful enough to take Ukraine but take time to deploy effectively

Scenario 2) Russian forces are not powerful enough to take Ukraine

Then we could reasonably believe that every day without solid Russian progress is evidence towards scenario 2, and update accordingly. Every day X ticks one higher, P(evidence | scenario 2) rises and P(evidence | scenario 1) falls, meaning a larger, but still incremental, update towards scenario 2.

Even now, it's still possible that the Russians are gradually getting into position and will still prevail, but the alternative story where their logistics, morale, tactical awareness etc are too constraining to achieve their objectives has become a lot more plausible.
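The two-scenario update above can be sketched with illustrative numbers (the daily progress probabilities below are made up for the example):

```python
# Two competing scenarios, illustrative numbers only:
#   S1: Russia is strong enough but slow (say, 15%/day chance of visible progress)
#   S2: Russia is not strong enough      (say,  2%/day chance of visible progress)
# Each day that passes with no solid progress shifts the posterior toward S2.

def posterior_s2(days_no_progress, p_prog_s1=0.15, p_prog_s2=0.02, prior_s2=0.5):
    like_s1 = (1 - p_prog_s1) ** days_no_progress   # P(no progress so far | S1)
    like_s2 = (1 - p_prog_s2) ** days_no_progress   # P(no progress so far | S2)
    num = prior_s2 * like_s2
    return num / (num + (1 - prior_s2) * like_s1)

for day in (0, 5, 10, 20):
    print(day, round(posterior_s2(day), 3))  # posterior creeps up day by day
```

Each uneventful day produces a modest, not dramatic, update, which is one way to get the smooth monotone slide the post describes without any irrationality.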


© 2022 Scott Alexander