Deference for Bayesians

Most people in the knowledge-producing industry, whether in academia, foundations, media or think tanks, are not Bayesians. This makes it difficult to know how Bayesians should go about deferring to experts.

Many experts are guided by what Bryan Caplan has called ‘myopic empiricism’, also sometimes called scientism. That is, they are guided disproportionately by what the published scientific evidence on a topic says, and less so by theory, common sense, scientific evidence from related domains, and other forms of evidence. The problem with this is that, for various reasons, standards in published science are not very high, as the replication crisis across psychology, empirical economics, medicine and other fields has illustrated. Much published scientific evidence is focused on the discovery of statistically significant results, which is not what we ultimately care about, from a Bayesian point of view. Researcher degrees of freedom, reporting bias and other factors also create major risks of bias.

Moreover, published scientific evidence is not the only thing that should determine our beliefs.

1. Examples

I will now discuss some examples where the experts have taken views which are heavily influenced by myopic empiricism, and so their conclusions can come apart from what an informed Bayesian would say.

Scepticism about the efficacy of masks

Leading public health bodies claimed that masks didn’t work to stop the spread at the start of the pandemic.¹ This was in part because there were observational studies finding no effect (concerns about risk compensation and reserving supplies for medical personnel were also a factor).² But everyone also agrees that COVID-19 spreads by droplets released from the mouth or nose when an infected person coughs, sneezes, or speaks. If you put a mask in the way of these droplets, your strong prior should be that doing so would reduce the spread of covid. There are videos of masks doing the blocking. This should lead one to suspect that the published scientific research finding no effect is mistaken, as has been confirmed by subsequent research.

Scepticism about the efficacy of lockdowns

Some intelligent people are sceptical not only about whether lockdowns pass the cost-benefit analysis, but even about whether lockdowns reduce the incidence of covid. Indeed, there are various published scientific papers suggesting that such measures have no effect.³ One issue such social science studies will have is that the severity of a covid outbreak is positively correlated with the strength of the lockdown measures, so it will be difficult to tease out cause and effect. This is especially true in cross-country regressions, where the sample size isn’t that big and there are dozens of other important factors at play that will be difficult or impossible to properly control for.

As for masks, given our knowledge of how covid spreads, on priors it would be extremely surprising if lockdowns don’t work. If you stop people from going to a crowded pub, this clearly reduces the chance that covid will pass from person to person. Unless we want to give up on the germ theory of disease, we should have an extremely strong presumption that lockdowns work. This means an extremely strong presumption that most of the social science finding a negative result is false.
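The reasoning in the mask and lockdown examples can be made concrete with Bayes' rule. The sketch below is purely illustrative: the prior and the two likelihoods are hypothetical numbers chosen for the example, not estimates taken from any study.

```python
def posterior(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Posterior probability of a hypothesis after seeing the data (Bayes' rule)."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1.0 - prior) * p_data_if_false)

# Hypothesis H: "lockdowns reduce transmission".
# All three inputs are assumptions made for illustration only.
prior = 0.99                # strong prior from what we know about how covid spreads
p_null_if_effect = 0.50     # a noisy, confounded study misses a real effect half the time
p_null_if_no_effect = 0.95  # a study finds nothing if there really is no effect

after_one_null = posterior(prior, p_null_if_effect, p_null_if_no_effect)
print(round(after_one_null, 3))
```

With these made-up inputs, a single null result moves the probability only slightly, from 0.99 to roughly 0.98. A weak study cannot overturn a strong prior; it would take many independent null results, or much more reliable ones, to do so.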

Scepticism about first doses first

In January, the British government decided to implement ‘first doses first’ – an approach of first giving out as many first doses of the vaccine as possible before giving out second doses. This means leaving a longer gap between the two doses – 12 weeks rather than 21 days. However, the 21 day gap was what was tested in the clinical trial of the Oxford/AstraZeneca vaccine. As a result, we don’t know from the trial whether spacing out the doses has a dramatic effect on the efficacy of the vaccine. This has led expert groups, such as the British Medical Association and the WHO, to oppose the UK’s strategy.

But again, on the basis of our knowledge of how other vaccines work and of the immune system, it would be very surprising if the immune system response did decline substantially when the gap between the doses is increased. To borrow an example from Rob Wiblin, the trial also didn’t test whether people turned into a unicorn two years after receiving the vaccine, but that doesn’t mean we should be agnostic about that possibility. Subsequent evidence has confirmed that the immune response doesn’t drop off after the longer delay.

Scepticism about whether the AstraZeneca vaccine works on the over 65s

Almost half of all EU countries have forbidden the use of the Oxford/AstraZeneca vaccines on the over 65s. This was in part because the sample of over 65s in the initial study was not conclusive enough to form a judgement on efficacy for that age group. But there was evidence from the study that the AZ vaccine is highly effective for under 65s. Given what we know about similarities across the human immune system, it is very unlikely that the AZ vaccine has 80% efficacy for under 65s but drops off precipitously for over 65s. To make useful judgments about empirical research, one has to make some judgments about external validity, which need to be informed by non-study considerations. The myopic empiricist approach often neglects this fact.

Scepticism about the effects of mild drinking while pregnant

There are a lot of observational studies that show pretty conclusively that moderate or severe drinking while pregnant is extremely bad for babies: doctors will try to get pregnant alcoholic drug addicts off alcohol before getting them off heroin. However, observational studies have struggled to find an effect of mild drinking on birth outcomes, which has led some experts to argue that mild drinking in pregnancy is in fact safe.⁴

Given what we know about the effects of moderate drinking while pregnant, and theoretical knowledge about the mechanism of how it has this effect, we should have a very strong presumption that mild drinking is also mildly bad for babies. The reason observational studies struggle to find an effect is that they are searching for an effect that is too close to zero to distinguish the signal from the noise, and there are a million potential confounders. The effect is still very likely there.
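The signal-to-noise point can be illustrated with a rough power calculation. This is a minimal sketch using a normal approximation for a simple two-group comparison. The effect sizes and sample size are hypothetical, and confounding is not modelled at all, which would make matters worse.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(effect_size_d: float, n_per_group: int) -> float:
    """Approximate power of a two-sided z-test at the 5% level for a
    standardised mean difference d with equal group sizes."""
    z_crit = 1.959964  # two-sided 5% critical value
    shift = effect_size_d * sqrt(n_per_group / 2.0)
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

# Hypothetical numbers: a moderate effect versus a very small one,
# each studied with 500 people per group.
power_moderate = approx_power(0.5, 500)
power_tiny = approx_power(0.02, 500)
print(f"moderate effect: {power_moderate:.2f}, tiny effect: {power_tiny:.2f}")
```

Under these assumptions the study would almost always detect the moderate effect, but would detect the tiny one only slightly more often than the 5% false positive rate. A null result in that setting says very little about whether the small effect exists.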

Nutritional epidemiology

Nutritional epidemiology tries to tease out the effect of different foods on health. It has produced some extraordinary claims. John Ioannidis describes the effects of different foods found in meta-analyses:

“The emerging picture of nutritional epidemiology is difficult to reconcile with good scientific principles. The field needs radical reform… Assuming the meta-analyzed evidence from cohort studies represents life span–long causal associations, for a baseline life expectancy of 80 years, eating 12 hazelnuts daily (1 oz) would prolong life by 12 years (ie, 1 year per hazelnut), drinking 3 cups of coffee daily would achieve a similar gain of 12 extra years, and eating a single mandarin orange daily (80 g) would add 5 years of life. Conversely, consuming 1 egg daily would reduce life expectancy by 6 years, and eating 2 slices of bacon (30g) daily would shorten life by a decade, an effect worse than smoking.”

As Ioannidis notes: “These implausible estimates of benefits or risks associated with diet probably reflect almost exclusively the magnitude of the cumulative biases in this type of research, with extensive residual confounding and selective reporting”. Sticking to an enlightened common sense prior on nutrition is probably a better bet.

Denying that the minimum wage affects demand for labour

Prior to the 1990s, almost all economists believed that the minimum wage would cause firms to economise on labour either by causing unemployment or reducing hours. This was on the basis of Economics 101 theory – if you increase the price of something, then demand for it falls. In the 1990s, some prominent observational studies found that the minimum wage did not in fact have these effects, which has caused much more sanguine views about the employment effects of the minimum wage among some economists. See this recent IGM poll of economists on the employment effects of a minimum wage increase in the US – many respondents appeal to the empirical evidence on the minimum wage when justifying their conclusion.

“An increase to $15/hour is a big jump, and I’m not sure we have the data to know what the effect on employment would be.”

“Evidence is that small increases in min. wage (starting from US lows) don’t have large disemployment effects. Don’t know what $15 will do”

“The weight of the evidence does not support large job loss. But I’m above extra nervous about setting min $15/hr during the pandemic.”

“Research has shown modest min. wage increases do not increase unemployment. But going from $6 to $15 in the current situation is not modest.”

“Evidence on employment effects of minimum wages is inconclusive, and the employment losses may well be small.”

A lone economist – Carl Shapiro – digs his heels in and sticks to theory:

“Demand for labor is presumably downward sloping, but the question does not ask anything about magnitudes.”

As Bryan Caplan argues, there are several problems with the myopic empiricist approach:

There are strong grounds from theory to think that any randomly selected demand curve will slope downward. The only way minimum wages wouldn’t cause firms to economise on labour is if firms were monopsonistic buyers of labour or were not systematically profit-seeking, which just doesn’t seem to be the case.
The observational evidence is very mixed and is almost always testing very small treatment effects – increases in the minimum wage of a few dollars. Given the quality of observational studies, we should expect a high number of false negatives if enough studies are conducted.
The literature on the impact of immigration on the wages of native workers suggests that the demand curve for labour is highly elastic, which is strongly inconsistent with the view that minimum wages don’t damage employment.
Most economists agree that many European countries have high unemployment due to regulations that increase the cost of hiring workers – minimum wages are one way to increase the costs of hiring workers.
Keynesians think that unemployment is sometimes caused by nominal downward wage rigidity, i.e. that nominal wages fail to fall until the market clears. This view is very hard to reconcile with the view that the minimum wage doesn’t cause firms to economise on labour.

2. Implications for deference

I have outlined some cases above where hewing to the views of certain experts seems likely to lead one to mistaken beliefs. In these cases, taking account of theory, common sense and evidence from other domains leads one to a different view on crucial public policy questions. This suggests that, for Bayesians, a good strategy would be to defer to the experts on what the published scientific evidence says, and let this be one input into one’s all-things-considered judgement about a topic.

For example, we might accept that some studies find limited effects of masks but also discard that evidence given our other knowledge.

Many subject matter experts are not experts on epistemology – on whether Bayesianism is true. So, this approach does not obviously violate epistemic modesty.

Endnotes

1. For an overview of the changing guidance, see this Unherd article by Stuart Ritchie.

2. For an overview see Greenhalgh. For example, “A preprint of a systematic review published on 6 April 2020 examined whether wearing a face mask or other barrier (goggles, shield, veil) prevents transmission of respiratory illness such as coronavirus, rhinovirus, tuberculosis, or influenza. It identified 31 eligible studies, including 12 randomised controlled trials. The authors found that overall, mask wearing both in general and by infected members within households seemed to produce small but statistically non-significant reductions in infection rates. The authors concluded that “The evidence is not sufficiently strong to support the widespread use of facemasks as a protective measure against covid-19” and recommended further high quality randomised controlled trials.” Trisha Greenhalgh et al., ‘Face Masks for the Public during the Covid-19 Crisis’, BMJ 369 (9 April 2020): m1435, https://doi.org/10.1136/bmj.m1435.

3. For an overview of the sceptical literature, see this website.

4. For example, Emily Oster, an economist and author of the popular book Expecting Better argues that there is “little evidence” that one to two drinks per week causes harm to the foetus, as discussed in this Vox piece.

Economic policy in poor countries

When funding policy advocacy in the rich world, Open Philanthropy Project aims to only fund projects that at least meet the ‘100x bar’, which means that the things they fund need to increase incomes for average Americans by $100 for every $1 spent to get as much benefit as giving $1 to GiveDirectly recipients in Africa. The reason for this is that (1) there is roughly a 100:1 ratio between the consumption of Americans to GiveDirectly cash transfer recipients, and (2) the returns of money to welfare are logarithmic. A logarithmic utility function implies that $1 for someone with 100x less consumption is worth 100x as much. Since GiveWell’s top charities are 10x better than GiveDirectly, the standard set by GiveWell’s top charities is a ‘1,000x bar’.
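The step from logarithmic utility to the 100x figure can be checked numerically. Under u(c) = ln(c), the marginal utility of a dollar is 1/c, so the value of a dollar scales inversely with consumption. The sketch below uses hypothetical consumption levels chosen only to give the 100:1 ratio; they are not figures from Open Philanthropy or GiveDirectly.

```python
from math import log

def utility_gain(consumption: float, transfer: float) -> float:
    """Gain in log utility from adding `transfer` to annual `consumption`."""
    return log(consumption + transfer) - log(consumption)

# Hypothetical consumption levels with a 100:1 ratio (assumption for illustration).
rich_consumption = 50_000.0
poor_consumption = 500.0

gain_rich = utility_gain(rich_consumption, 1.0)
gain_poor = utility_gain(poor_consumption, 1.0)
value_ratio = gain_poor / gain_rich  # how much more $1 is worth to the poorer person

# The text's further claim: top charities are 10x GiveDirectly, giving a 1,000x bar.
top_charity_bar = 10 * 100

print(f"$1 is worth about {value_ratio:.1f}x as much to the poorer recipient")
```

The ratio comes out at almost exactly 100, which is the sense in which rich-world work must return $100 per $1 spent to match a cash transfer.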

Since 2015, Open Phil has made roughly 300 grants totalling almost $200 million in their near-termist, human-centric focus areas of criminal justice reform, immigration policy, land use reform, macroeconomic stabilisation policy, and scientific research. In ‘GiveWell’s Top Charities Are (Increasingly) Hard to Beat‘, Alex Berger argues that much of Open Phil’s US policy work probably passes the 100x bar, but relatively little passes the 1,000x bar.

The reason that Open Phil’s policy work is able to meet the 100x bar is that it is leveraged. Although trying to change planning law in California has a low chance of success, the economic payoffs are so large that the expected value of these grants is high. So, even though it is a lot harder to increase welfare in the US, because the policy work has so much leverage, the expected benefits are high enough to meet the 100x bar.

This raises the question: if all of this is true, wouldn’t advocating for improved economic policy in poor countries be much better than GiveWell’s top charities? If policy in the US has high expected benefits because it is leveraged, then policy in Kenya must also have high expected benefits because it is leveraged. We should expect many projects improving economic policy in Kenya to produce 100x the welfare benefits of GiveDirectly, and we should expect a handful to produce 1,000x the welfare benefits of GiveDirectly.

This is an argument for funding work to improve economic policy in the world’s poorest countries. Lant Pritchett has been arguing for this position for at least 7 years without any published response from the EA community. Hauke Hillebrandt and I summarise his arguments here. My former colleagues from Founders Pledge, Stephen Clare and Aidan Goth, discuss the arguments in more depth here.

At present, according to GiveWell, the best way to improve the economic outcomes of very poor people is to deworm them. This is on the basis of one very controversial RCT conducted in 2004. I don’t think this is a tenable position.

Pangea: The Worst of Times

260 million years ago, our planet had an unfamiliar geography. Nearly all of the landmasses were united into a single giant continent known as ‘Pangea’ that stretched from pole to pole. On the other side of the world you would find a vast ocean, even larger than the present Pacific, called Panthalassa. The Pangean era lasted 160 million years, and 80 million of these were extremely inhospitable to animal and plant life, coinciding with two mass extinctions and four other major extinction events. This is why Paul Wignall, a Professor of Palaeoenvironments at Leeds, has called the Pangean era ‘The Worst of Times’. Understanding why the Pangean era was so miserable helps inform several questions of interest to those studying existential risk.

● What level of natural existential risk do we face now, and have we faced in the past?

● What is the threat of super-volcanic eruptions?

● How much existential risk does anthropogenic climate change pose?

1. Background

There have been five mass extinctions so far. The Ordovician–Silurian (450-440 million years ago) and the Late Devonian (375-360 million years ago) each preceded the age of Pangea. The Pangean period coincided with the two worst mass extinctions, the huge Permian-Triassic mass extinction (252 million years ago) and the Triassic-Jurassic extinction event (201 million years ago).[1] The last crisis, the Cretaceous–Paleogene event (65 million years ago), accounted for the dinosaurs and occurred once continental drift had done its business and Pangea had broken apart. With the exception of the end Cretaceous extinction, since the breakup of Pangea, it has been relatively plain sailing for Earth’s various species, until humans started killing off other species themselves.

[2]

As one can see on this diagram, in the 145 million years since the start of the Cretaceous, the average rate of global genus extinctions from extinction events has been around 5% and never passed 15%, except for the death of the dinosaurs. But in the 80 million years from the first Pangean extinction event, the Capitanian, to the early Jurassic extinction events, the average rate of global genus extinctions in extinction events was around 15-20%, and 12 events produced global genus extinction rates in excess of 15%.

Below is a useful chart from Wikipedia on the Phanerozoic, which shows the long-term trend in biodiversity as well as the impact of different extinction events.

Again, this highlights how unusually bad things were in the Pangean era – specifically the 80 million years after the Capitanian extinction event 260 million years ago. But it also highlights how good things have been since the end of the Pangean era and the start of the Cretaceous (145 million years ago).

2. What caused such ecological trauma in Pangea?

Huge volcanic eruptions were implicated in all of the six major extinction events in the Pangean era. One can see this in the first diagram above, where the volcanic eruptions are shown at the top and the line traces down to corresponding extinction events at the bottom. Every Pangean extinction event coincided with the outpouring of enormous fields of lava that, once cooled, produced what geologists call Large Igneous Provinces (LIPs).[3]

To put these LIPs in context, the eruption of Mount Pinatubo in 1991 produced 10 cubic km of magma, which caused the Earth to cool by about half a degree. The eruption of the Siberian Traps which appeared to cause the end Permian extinction produced 3 million cubic km of magma. You can see the volume of magma for all major LIPs at the top of the first diagram above.
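The difference in scale is easy to lose in the prose, so here is the comparison worked out, using only the two volumes quoted above.

```python
# Magma volumes in cubic kilometres, as quoted in the text.
pinatubo_magma_km3 = 10
siberian_traps_magma_km3 = 3_000_000

# How many Pinatubo-sized eruptions it would take to match the Siberian Traps.
pinatubo_equivalents = siberian_traps_magma_km3 / pinatubo_magma_km3
print(f"{pinatubo_equivalents:,.0f} Pinatubo-sized eruptions")
```

On these figures, the Siberian Traps correspond to several hundred thousand eruptions on the scale of Pinatubo, though spread over a far longer period than a single eruption.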

These volcanic eruptions emitted sulphur dioxide, carbon dioxide and halogen gases, each of which could potentially have an effect on the ecosystem. Bond and Grasby provide a useful diagram of the kill mechanisms for the Permian extinction, the greatest ecological disaster of all time.

[4]

A single pulse of sulphur dioxide into the atmosphere can cause cooling for two to three years, after which time it is usually rained out. Thus, sulphur dioxide is probably only capable of driving death-by-cooling if eruptions were frequent and of high volume and were sustained for several centuries at a time. Unfortunately, the geological record of LIPs is not sufficiently resolved to permit an evaluation of whether this has actually happened during a mass extinction interval.[5]

In contrast, a single pulse of CO2 can cause warming on geological timescales: about a fifth of CO2 stays in the atmosphere for several thousand years after it is injected.[6] The total CO2 release from a LIP such as the Siberian Traps has been estimated at 30,000 billion tonnes, ten times the mass in today’s atmosphere,[7] and three orders of magnitude greater than current annual emissions.
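The figures in this paragraph can be unpacked with some simple arithmetic. The sketch below only derives what the quoted numbers imply. It introduces no independent data, and applying the "one fifth" figure to the whole release as if it were a single pulse is a simplifying assumption.

```python
# Figures quoted in the text (billion tonnes of CO2).
lip_release_gt = 30_000
multiple_of_atmosphere = 10    # "ten times the mass in today's atmosphere"
multiple_of_annual = 1_000     # "three orders of magnitude" above annual emissions
long_lived_fraction = 1 / 5    # share remaining for several thousand years

# Values implied by those claims.
implied_atmospheric_co2_gt = lip_release_gt / multiple_of_atmosphere
implied_annual_emissions_gt = lip_release_gt / multiple_of_annual

# Simplifying assumption: treat the whole release as one pulse.
long_lived_co2_gt = lip_release_gt * long_lived_fraction

print(implied_atmospheric_co2_gt, implied_annual_emissions_gt, long_lived_co2_gt)
```

On this reading, even the long-lived fifth of the release would be about twice the mass of CO2 implied for today's atmosphere.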

The release of such massive amounts of CO2 could cause extinctions through several mechanisms:

● Warming places thermal stress on organisms

● Warming reduces the capacity of the ocean to absorb oxygen, which can lead to ocean anoxia.

● CO2 dissolves in the ocean causing ocean acidification, which can lead to marine extinctions.

● CO2 can build up in tissues, a process known as hypercapnia, which Knoll has argued contributed to extinctions.[8]

The Pangean world was generally very hot and arid, with atmospheric CO2 at 3,000ppm[9] (compared to around 400ppm today). Average ocean temperatures before the end Permian disaster were 20°C and peaked at 40°C in the early Triassic. At no point did temperatures drop below 32°C in the 5 million years after the end Permian event.[10] For comparison, modern equatorial sea temperatures average around 28°C and never exceed 30°C. Peak temperatures in the early Triassic were the same as you would find in a bowl of very hot soup.[11]

3. Volcanism in the post-Pangean era

The diagram pasted above shows that:

Massive LIPs + Pangea = major extinction.

In Pangea, every time there was massive volcanism, there was a major extinction.[12]

However, once we remove Pangea from the equation, we no longer get a major extinction. Since the end of Pangea, comparably massive volcanism has barely registered on the fossil record.[13] 135 million years ago, an eruption went to form one of the largest LIPs – the Paraná-Etendeka Province. And yet, the eruption caused neither catastrophic environmental change nor mass extinction.[14]

The most interesting LIP for modern climatologists is the North Atlantic Igneous Province, which, according to Wignall, produced huge amounts of lava, even by LIP standards.[15] This produced 70,000 years of intense global warming, causing the Paleocene-Eocene Thermal Maximum, which is the closest geological analogue for the kind of global warming that humans are set to produce. During the Paleocene-Eocene Thermal Maximum, global temperatures rose by ∼5°C in 10,000 years, with average values in the northern tropics estimated to have been ∼31–34°C.[16] However, losses during this period were very minor and, on the whole, this was a time of flourishing and success for a broad range of animals.[17] The Eocene was very hot compared to today, but the name Eocene comes from the Ancient Greek ἠώς (ēṓs, “dawn”) and καινός (kainós, “new”) and refers to the “dawn” of modern fauna that appeared during the epoch.

One possible exception to this is that the Deccan Traps coincided with the death of the dinosaurs. But then of course this also coincided with a meteorite impact in Chicxulub, Mexico, complicating cause and effect.[18] Indeed, I think Wignall’s arguments in The Worst of Times lend support to the meteorite explanation because it is unclear why massive volcanism would have accounted for the dinosaurs, but have been otherwise so insignificant from the Cretaceous (145 million years ago) onwards.

Overall, the post-Pangean world has been much more robust and resilient than the Pangean world. Indeed, the picture we see in our post-Pangean era is one of climate change – both cooling and warming – having little effect on species extinctions.[19] This holds both when the magnitude of warming was comparable to what we are in for in the next 200 years and when, on a regional basis, the rate of warming was comparable.

4. Why has the post-Pangean era been so benign?

Wignall argues that Pangea was so inhospitable because various carbon cycle feedbacks were not in play due to the unique geography and geology of the supercontinent.[20] By the early Cretaceous, the world was much more efficient at removing CO2, with the result that atmospheric CO2 concentrations were a tenth of the Pangean level.[21] There are several reasons for this:

● Rainfall weathering – In a supercontinent, huge areas are too far away from the sea to receive much rain, which reduces the scope for removal of CO2 by rainfall and weathering.

● Limestone deposition – In Pangea, limestone deposition – which sequesters CO2 in the oceans – was at a minimum because the shelf fringe of the supercontinent is much smaller than the shelf fringe of a collection of much smaller continents.

● The evolution of coccolithophorids – Coccolithophorids appeared in the late Triassic and help to sequester carbon because they use CO2 in shell formation and then sink to the bottom of the ocean when dead, which also helps to counteract ocean acidification.

● Terrestrial plants – The end Permian led to a mass extinction of terrestrial plants. Without plants, the weathering feedback still occurs, but plants make it happen much more rapidly. A world without plants is therefore much more prone to rapid climatic fluctuations.[22]

● Slow circulation in Panthalassa – Ocean circulation in the massive Panthalassa ocean would have been relatively sluggish, contributing to extensive subsurface oxygen shortage.[23]

Wignall provides the following diagram of the various processes of carbon sequestration:

5. Possible lessons

The hypothesis that Pangea was uniquely ecologically vulnerable has many potential lessons for existential risk today.

Natural existential risk

Researchers have recently tried to place a bound on the natural existential risk that we face today.[24] The arguments here suggest that the level of natural existential risk is dependent on our geology. Past mass extinctions may not be a reliable guide to the extinction risk we face today and estimates of the risk may be biased upwards.

Risk from climate change

Some researchers have argued that CO2 release from volcanism is a worrying analogue for contemporary anthropogenic CO2 emissions. For example, Penn et al argue that plausible emissions scenarios predict a magnitude of upper ocean warming by 2300 that is 35 to 50% of that required to account for most of the end-Permian extinction intensity. The historical analogue lends weight to the idea that anthropogenic warming might cause a mass extinction.

If Wignall’s argument is right, we should be less worried about this analogy. The threat the ecosystem faced in the Pangean era was peculiar to the geography of Pangea and has receded since the fracturing of the supercontinent. Mercifully, our ecosystem is now more resilient than it once was.

6. Caveats and uncertainties

Several caveats and uncertainties should be noted:

● The context in which warming is occurring today is importantly different to that since the breakup of Pangea. Warming today is occurring in the context of widespread habitat loss and other environmental stress.

● There is vigorous scholarly disagreement about the causes of past extinction events.[26]

● I am unsure to what extent Wignall’s argument commands assent in the wider literature.

● Given the uncertainties involved in studying events so far in the past, it would not be at all surprising if current hypotheses were rejected in the coming decades.

[1] P. B. Wignall, The Worst of Times: How Life on Earth Survived Eighty Million Years of Extinctions (Princeton: Princeton University Press, 2015), 165.

[2] David P. G. Bond and Stephen E. Grasby, “On the Causes of Mass Extinctions,” Palaeogeography, Palaeoclimatology, Palaeoecology, Mass Extinction Causality: Records of Anoxia, Acidification, and Global Warming during Earth’s Greatest Crises, 478 (July 15, 2017): 14, https://doi.org/10.1016/j.palaeo.2016.11.005.

[3] Wignall, The Worst of Times, 9.

[4] Bond and Grasby, “On the Causes of Mass Extinctions,” 10.

[5] Bond and Grasby, 16.

[6] Bond and Grasby, 16.

[7] Bond and Grasby, 17.

[8] Andrew H. Knoll et al., “Skeletons and Ocean Chemistry: The Long View,” Ocean Acidification, 2011, 67–82.

[9] Wignall, The Worst of Times, 168.

[10] Wignall, 96ff.

[11] Wignall, 97.

[12] Wignall, xvi.

[13] P. B. Wignall, The Worst of Times: How Life on Earth Survived Eighty Million Years of Extinctions (Princeton: Princeton University Press, 2015), 9ff.

[14] Wignall, The Worst of Times, 153ff.

[15] Wignall, 161. However, note that this is at odds with the first Bond and Grasby diagram above. I am not sure what explains this discrepancy.

[16] K. J. Willis and G. M. MacDonald, “Long-Term Ecological Records and Their Relevance to Climate Change Predictions for a Warmer World,” Annual Review of Ecology, Evolution, and Systematics 42, no. 1 (2011): 270, https://doi.org/10.1146/annurev-ecolsys-102209-144704.

[17] Wignall, The Worst of Times, 161ff; Willis and MacDonald, “Long-Term Ecological Records and Their Relevance to Climate Change Predictions for a Warmer World,” 270ff.

[18] Wignall, 9.

[19] Daniel B. Botkin et al., “Forecasting the Effects of Global Warming on Biodiversity,” BioScience 57, no. 3 (March 1, 2007): 227–36, https://doi.org/10.1641/B570306; Kathy J. Willis et al., “4 °C and beyond: What Did This Mean for Biodiversity in the Past?,” Systematics and Biodiversity 8, no. 1 (March 25, 2010): 3–9, https://doi.org/10.1080/14772000903495833; Willis and MacDonald, “Long-Term Ecological Records and Their Relevance to Climate Change Predictions for a Warmer World”; Terence P. Dawson et al., “Beyond Predictions: Biodiversity Conservation in a Changing Climate,” Science 332, no. 6025 (April 1, 2011): 53–58, https://doi.org/10.1126/science.1200303.

[20] Wignall, The Worst of Times, chap. 7.

[21] Wignall, 168.

[22] Wignall, 170ff.

[23] Knoll et al., “Skeletons and Ocean Chemistry,” 13.

[24] Andrew E. Snyder-Beattie, Toby Ord, and Michael B. Bonsall, “An Upper Bound for the Background Rate of Human Extinction,” Scientific Reports 9, no. 1 (July 30, 2019): 1–9, https://doi.org/10.1038/s41598-019-47540-7; David Manheim, “Questioning Estimates of Natural Pandemic Risk,” Health Security 16, no. 6 (2018): 381–390.

[25] Toby Ord, The Precipice: Existential Risk and the Future of Humanity, 1 edition (Bloomsbury Publishing, 2020), 167.

[26] Bond and Grasby, “On the Causes of Mass Extinctions.”

Is mindfulness good for you?

I personally find mindfulness useful for reducing rumination. Its main value is revealing that your mind is basically going completely bananas all the time, and that patterns of thought emerge which are difficult to control. I also find that loving-kindness meditation improves my mood in the short term. However, I think the strength of the evidence on the benefits of meditation is often overstated.

Here, I discuss the overall strength of the evidence on meditation as a treatment for anxiety and depression.

1. What is mindfulness?

With its roots in Buddhism, attention towards mindfulness has grown enormously since the early 2000s.[2] Two commonly studied forms of mindfulness are Mindfulness-Based Stress Reduction and Mindfulness-Based Cognitive Therapy. Mindfulness-Based Stress Reduction is an 8-week group-based program consisting of:

● 20-26 hours of formal meditation practice, including:

○ Weekly 2 to 2.5 hour sessions

○ A whole-day retreat (6 hours)

○ Home practice ~45 mins per day for 6 days a week.[3]

Mindfulness-Based Cognitive Therapy incorporates cognitive therapy into the sessions. Mindfulness-Based Stress Reduction can be led by laypeople, whereas Mindfulness-Based Cognitive Therapy must be led by a licensed health care provider.

In my experience, most people practising mindfulness use an app such as Headspace or the Sam Harris Waking Up app. I personally find the Waking Up app far superior to the Headspace app.

2. How over-optimistic should we expect the evidence to be?

Mindfulness has many features that should make us suspect that the strength of the evidence claimed in the literature is overstated:

● A form of psychology/social psychology research

● Most outcome metrics are subjective

● Many of those researching it seem to be true believers[4]

● Hints of religion, alternative medicine, and woo

Research fields with these features are ripe for replication crisis trauma. We should expect inflated claims of impact which are then brought back to Earth by replications or further analysis of the existing research.

3. Main problems with the evidence

3.1. Reporting bias

Reporting bias includes:

● Study publication bias – the publication of significant results and the failure to publish insignificant results.

● Selective outcome reporting bias, in which outcomes published are chosen based on statistical significance, with non-significant outcomes not published.

● Selective analysis reporting bias, in which data are analyzed with multiple methods but are reported only for those that produce positive results.

● Other biases, such as relegation of non-significant primary outcomes to secondary status when results are published.

There is good evidence of reporting bias in mindfulness research.

Coronado-Montoya et al (2016) test for reporting bias by estimating the expected number of positive trials that mindfulness-based therapy would have produced if its effect size were the same as individual psychotherapy for depression, d = 0.55.[5] As we will see below, this is very likely an overestimate of the true effect of mindfulness-based therapy, and therefore the method used understates reporting bias in mindfulness studies.

Of the 124 RCTs included in Coronado-Montoya et al’s (2016) study, 108 (87%) were classified as positive and 16 (13%) as negative. If the true effect size of mindfulness-based therapy was d = 0.55, then we would expect 68 of 124 studies to be positive, rather than 108, meaning that the ratio of observed to expected positive studies was 1.6.[6] This is clear evidence of reporting bias.
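The arithmetic behind that ratio can be checked directly. The following is a minimal sketch that uses only the trial counts quoted above; the variable names are my own and do not come from the paper.

```python
# Trial counts quoted above from Coronado-Montoya et al (2016).
total_trials = 124
observed_positive = 108
expected_positive = 68  # expected count if the true effect size were d = 0.55

observed_share = observed_positive / total_trials
expected_share = expected_positive / total_trials
ratio = observed_positive / expected_positive

print(f"Observed positive share: {observed_share:.0%}")
print(f"Expected positive share: {expected_share:.0%}")
print(f"Observed / expected positive trials: {ratio:.1f}")
```

Because d = 0.55 is probably too generous to mindfulness, the expected count is if anything too high, so this ratio is best read as a lower bound on the excess of positive results.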

Moreover, Coronado-Montoya et al (2016) also looked at 21 trials that were pre-registered. Of these, none specified which variable would be used to determine success, and 13 (62%) were still unpublished 30 months after the trial was completed.[7]

A recent paper in Nature Human Behaviour found that in psychology, due to selective reporting, meta-analyses produce significantly different effect sizes from large-scale pre-registered replications in 12 out of 15 cases. Where there was a difference, the effect size in the meta-analysis was on average 3 times larger than in the replications.[8] This shows that reporting bias is usually not adequately corrected for in meta-analyses.

3.2. Effect size of meditation compared to other interventions

Goyal et al conducted a meta-analysis of the effect of mindfulness-based therapy for well-being.[9] The key facts are:

● Effect size

○ Cohen’s d ranging from 0.22 to 0.38 for anxiety symptoms.[10]

○ 0.23 to 0.30 for depressive symptoms.[11]

○ These were each usually compared to a nonspecific active control.

○ However, neither of these estimates corrects for reporting bias.[12] I think it is plausible that this biases the estimate of the effect size upwards by a factor of 2 to 3.

○ Comparison to alternative treatments

■ In the 20 RCTs examining comparative effectiveness, mindfulness and mantra programs did not show significant effects when the comparator was a known treatment or therapy.[13]

■ Sample sizes in the comparative effectiveness trials were small (mean size of 37 per group), and none was adequately powered to assess noninferiority or equivalence.[14]

■ Antidepressants

● According to a recent meta-analysis, antidepressants have an effect size of 0.3 for depression vs placebo.

■ CBT

● According to one meta-analysis, compared to wait-list controls, CBT has a Cohen’s d = 0.88 on depression

● Compared to care as usual or non-specific controls, it has a Cohen’s d of 0.38.[15]

● Goyal et al’s assessment of strength of evidence

○ Only 10 of the 47 included studies had a study quality rating of ‘good’, with the remainder having a rating of only ‘fair’ or ‘poor’.[16]

○ Goyal et al state that “none of our conclusions yielded a high strength of evidence grade for a positive or null effect.”[17]

This suggests that the strength of the evidence on meditation is weak, but that there is some evidence of small to moderate positive effect on anxiety and depression. However, the evidence seems to be much weaker than the evidence for CBT and antidepressants, and CBT and antidepressants seem to have a greater effect on depression.

We should beware the man of one study, but also beware the man of a meta-analysis that doesn’t correct for reporting bias or other sources of bias. Indeed, as argued in the section on reporting bias, there is good reason to think that a pre-registered high-powered replication would cut the estimated meta-analytic effect size for meditation by a factor of 2 to 3.
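For readers less familiar with the metric, the effect sizes quoted in this section are standardised mean differences (Cohen’s d): the gap between the treatment and control group means, divided by a pooled standard deviation. The sketch below illustrates the calculation on made-up scores; the data and the function name are hypothetical and are not drawn from any of the studies cited.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


# HYPOTHETICAL improvement scores on a symptom questionnaire (higher = better).
treatment_scores = [6, 8, 7, 9, 5, 7, 8, 6]
control_scores = [5, 7, 6, 8, 5, 6, 7, 6]

d = cohens_d(treatment_scores, control_scores)
print(f"Cohen's d: {d:.2f}")
```

A d of 0.3 therefore means the average treated participant ends up about 0.3 standard deviations better off than the average control participant, whatever the underlying scale.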

3.3. Large variation among mindfulness-based interventions

Most mindfulness-based therapy has been based on the idea of mindfulness-based stress reduction. However, there are large variations among studied mindfulness-based interventions.

● Time commitment[18]

○ The practice hours of the intervention included in the Goyal et al meta-analysis range from 7.5 hours to 78 hours.

○ The homework hours are often not specified, exceed 30 hours in many studies, and reach up to 1,310 hours in one study.

● Methods for teaching and practicing mindful states.[19]

Van Dam et al (2017) contend that there is far greater heterogeneity among mindfulness interventions than among other intervention types such as CBT.[20] This heterogeneity across intervention types means that we should be cautious about broad claims about the efficacy of mindfulness for depression and anxiety.

It is especially important to consider this heterogeneity given that most people practicing meditation practice for 10-20 minutes per day using an app, making their experience very different to a full mindfulness-based stress reduction course.

3.4. Shaky fMRI evidence

Many studies assess the impact of meditation on brain states using fMRI. These methods are highly suspect. There are numerous potential confounds in fMRI studies, such as head movement, pace of breathing, and heart rate.[21] These factors can confound a posited relationship between meditation and change in activity in the amygdala. Moreover, calculating valid estimates of effect sizes across groups in neuroimaging data is very difficult. Consequently, the practical import of such studies remains unclear.

Nonetheless, according to Van Dam et al (2017), meta-analyses of neuroimaging data suggest modest changes in brain structure as a result of practicing mindfulness. Some concomitant modest changes have also been observed in neural function. However, similar changes have been observed following other forms of mental and physical skill acquisition, such as learning to play musical instruments and learning to reason, suggesting that they may not be unique to mindfulness.[22]

It would be interesting to compare the effects of meditation on the brain with the effects of other activities such as reading, exercise, sport, or having a conversation with friends. I suspect that the effects on fMRI scans would be quite similar for many mundane activities, though I have not looked into this.

4. Overall judgement on effectiveness

In light of the above discussion, my best guess for downward adjustments of the effect size estimated in the Goyal et al meta-analysis (which found an effect size of 0.3 on depression) is:

● Reporting bias: this biases the estimate upwards by a factor of 2.

● Time commitment: meditating with an app for 140 minutes per week vs around 390 minutes in MBSR is a ratio of roughly 0.36:1, meriting a discount by a factor of ~3.

I estimate that the true effect size on depression of daily mindfulness meditation for 20 minutes with an app is around 0.05 (0.3 divided by 2 and then by 3). This is very small: if an intervention increased a man’s height with an effect size of 0.05, this would increase his height by around half a centimetre. Mindfulness is not the game-changer it is often painted to be.
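The adjustment can be laid out step by step. This is a rough sketch of the calculation described above rather than a formal model: the discount factors are the best guesses stated in the text, and the standard deviation of male height (about 7 cm) is my own assumption, added only to make the height comparison concrete.

```python
# Figures taken from the text above.
meta_analytic_d = 0.3          # Goyal et al estimate for depression
reporting_bias_factor = 2      # best-guess discount for reporting bias
app_minutes_per_week = 20 * 7  # 20 minutes a day with an app
mbsr_minutes_per_week = 390    # approximate weekly time commitment in MBSR
time_discount_factor = 3       # rounded discount for the lower time commitment

time_ratio = app_minutes_per_week / mbsr_minutes_per_week
adjusted_d = meta_analytic_d / reporting_bias_factor / time_discount_factor

# ASSUMPTION (not from the text): adult male height has a standard
# deviation of roughly 7 cm.
height_sd_cm = 7.0
height_change_cm = adjusted_d * height_sd_cm

print(f"App-to-MBSR time ratio: {time_ratio:.2f}")
print(f"Adjusted effect size: {adjusted_d:.2f}")
print(f"Equivalent change in height: {height_change_cm:.2f} cm")
```

With these inputs the adjusted effect size comes out at about 0.05, and the height comparison lands at a fraction of a centimetre; the exact figure depends on the standard deviation assumed.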

5. Useful resources and reading

● Coronado-Montoya et al, ‘Reporting of Positive Results in Randomized Controlled Trials of Mindfulness-Based Mental Health Interventions’, Plos One (2016)

○ A high-quality analysis of reporting bias in mindfulness research.

● Goyal et al, ‘Meditation Programs for Psychological Stress and Well-being: A Systematic Review and Meta-analysis’ JAMA Internal Medicine 2014.

○ A review commissioned by the U.S. Agency for Healthcare Research and Quality. Provides a good overview of the quality of the evidence and estimates of effect size but crucially does not correct for reporting bias.

● Van Dam et al ‘Mind the Hype: A Critical Evaluation and Prescriptive Agenda for Research on Mindfulness and Meditation’, Perspectives on Psychological Science (2017)

○ A very critical review of the evidence on mindfulness, which raises several problems with the evidence. However, I think the tone is overall too critical given the evidence presented.

● Sam Harris – Waking Up book

○ In my opinion, an overly rosy review of the evidence on meditation, especially in chapter 4.

● The Waking Up meditation app.

○ The best meditation app I have tried.

References

[1] See for example this piece by Rob Mather, and this by Louis Dixon

[2] Nicholas T. Van Dam et al., ‘Mind the Hype: A Critical Evaluation and Prescriptive Agenda for Research on Mindfulness and Meditation’, Perspectives on Psychological Science 13, no. 1 (2018): 36–37.

[3] Stephanie Coronado-Montoya et al., ‘Reporting of Positive Results in Randomized Controlled Trials of Mindfulness-Based Mental Health Interventions’, PLOS ONE 11, no. 4 (8 April 2016): 1, https://doi.org/10.1371/journal.pone.0153220.

[4] For example, a review of RCT evidence on mindfulness by Creswell opens with a quote from Jon Kabat-Zinn, a leading figure in mindfulness studies: “There are few people I know on the planet who couldn’t benefit more from a greater dose of awareness”. [creswell ref]

[5] Coronado-Montoya et al., ‘Reporting of Positive Results in Randomized Controlled Trials of Mindfulness-Based Mental Health Interventions’, 5.

[6] Coronado-Montoya et al., 9.

[7] Coronado-Montoya et al., 10.

[8] Amanda Kvarven, Eirik Strømland, and Magnus Johannesson, ‘Comparing Meta-Analyses and Preregistered Multiple-Laboratory Replication Projects’, Nature Human Behaviour, 23 December 2019, 1–12, https://doi.org/10.1038/s41562-019-0787-z.

[9] Madhav Goyal et al., ‘Meditation Programs for Psychological Stress and Well-Being: A Systematic Review and Meta-Analysis’, JAMA Internal Medicine 174, no. 3 (1 March 2014): 357–68, https://doi.org/10.1001/jamainternmed.2013.13018.

[10] Goyal et al., 364.

[11] Goyal et al., 364.

[12] Goyal et al., 361.

[13] Goyal et al., 364.

[14] Goyal et al., 365.

[15] Ellen Driessen and Steven D. Hollon, ‘Cognitive Behavioral Therapy for Mood Disorders: Efficacy, Moderators and Mediators’, Psychiatric Clinics 33, no. 3 (1 September 2010): 2, https://doi.org/10.1016/j.psc.2010.04.005.

[16] Goyal et al., ‘Meditation Programs for Psychological Stress and Well-Being’, Table 2.

[17] Goyal et al., 365.

[18] Goyal et al., Table 2.

[19] Van Dam et al., ‘Mind the Hype’, 40.

[20] Van Dam et al., 46.

[21] Van Dam et al., 49–51.

[22] Van Dam et al., 50–51.

The case for delaying solar geoengineering research

Tl;dr:

Argument:

1. Solar geoengineering is not feasible for the next few decades.

a. Solar geoengineering poses major governance challenges.

b. These governance challenges are unlikely to be overcome for at least 50 years.

2. Solar geoengineering research is a moral hazard, and research might uncover dangerous weather manipulation methods.

3. Given this risk and given that we can delay research without obvious costs, there is a good case for delaying solar geoengineering research at least for a few decades.

Epistemic status: Seems correct to me, but some experts disagree (though I don’t think they have been exposed to these arguments).

Solar geoengineering is a form of climate intervention that reduces global temperature by reflecting sunlight back to space. The best studied form – stratospheric aerosol injection – involves the injection of aerosols, such as sulphur, into the stratosphere (the higher atmosphere). This mimics the effects of volcanoes, which can have globally significant effects via the same mechanism. For example, the Pinatubo eruption in 1991 cooled large parts of the Earth by about half a degree. Computer modelling studies have suggested that, if done in a certain way and in certain climatic conditions, solar geoengineering could eliminate many of the costs of global warming without having serious side-effects.[1] These models are of course limited and crude, but they do suggest that solar geoengineering could be a useful tool, if it could be deployed and governed safely.

Consequently, interest in the technology is increasing, as discussed in this Economist article. The Open Philanthropy Project has in the past funded solar geoengineering governance research and computer modelling efforts.

Here, I will argue that we should delay solar geoengineering research for a few decades.

1. Solar geoengineering is not feasible for the next few decades

a. Solar geoengineering poses severe governance challenges

In my view, solar geoengineering is only likely to be used once warming is quite extreme, exceeding roughly 4 degrees. The reason for this is that solar geoengineering would likely be extremely difficult to govern. I outline some of the governance challenges in section 3.4 of my paper on solar geoengineering.

Solar geoengineering, if done using the stratospheric aerosol injection method, would affect the weather in most or all regions.[2] Solar geoengineering would therefore politicise the weather in all regions, and would have diverse regional effects. Adverse weather events would likely be blamed on solar geoengineering by affected countries, even if they were not in fact caused by solar geoengineering. Public anger at such weather events would likely be severe if people thought a massive international weather alteration scheme were at fault. Computer models could at best offer highly imperfect attribution of weather events to climatic causes.

This suggests that for solar geoengineering to be feasible, all major global powers would have to agree on the weather, a highly chaotic system. Securing such an agreement would be extremely difficult in the first instance and also extremely difficult to sustain in the longer-term. States would also foresee the problems of sustained agreement, disincentivising successful agreement in the first place.

b. These governance challenges are unlikely to be overcome for at least 50 years.

In light of this, solar geoengineering is only likely to be used once climate change is very bad for all regions. Judging when this point will occur is difficult, but my best guess having looked at the climate impacts literature in some depth is that this would only likely happen after about 3-4 degrees of warming.

We have had about 1 degree of warming thus far and, according to an IMF report, a further 1 degree of warming would be economically positive for many regions, especially Canada, Russia and Eastern Europe, and even potentially China (IMF report page 15).

(Note that even this modest climate change is bad overall for the world.)

Russia is a crucial factor here: global warming seems likely to bring numerous economic benefits for Russia, freeing up the Russian Arctic for exploration and thawing potential farmland. It is very unlikely that they would agree to a global scheme that would likely damage their economic prospects. Without agreement from Russia, I find it difficult to see how solar geoengineering could ever be implemented.

Thus, it seems implausible that solar geoengineering would be practicable at 2 degrees of warming, and 4 degrees is a more plausible threshold, in my view.

However, 4 degrees of warming will take many decades to occur. On the highest emissions scenario considered by the IPCC, 4 degrees of warming would take at least 50 years to occur (IPCC synthesis, p59).

This means that solar geoengineering is unlikely to be used before around 2070, giving us 50 years from now to find a solution.

One potential counter-argument would point to runaway feedback loops that cause rapid warming, such as release of massive amounts of methane from clathrates. I have looked at the evidence for this, and it seems slim overall; the median view in the literature is that this is a negligible risk for the next century at least. See section 4 of my write-up on climate and ex risk for more on feedback loops.

2. Solar geoengineering research is a moral hazard and research might uncover dangerous weather manipulation methods

Research into solar geoengineering itself carries two main risks.

A persistent worry about solar geoengineering research concerns moral hazard: the worry that attention to plan B will reduce commitment to plan A. Having solar geoengineering as a backup will decrease commitment to reducing carbon emissions, which almost all researchers agree to be the top priority. The best discussion of this is in Morrow’s paper,[3] and I discuss the considerations on moral hazard risk at length in sections 4-6 of my paper. Overall, I think this is a genuine risk with solar geoengineering research and a reason not to carry out research.

Another risk of solar geoengineering research is that it will uncover new technologies that could destabilise global civilisation. I discuss weaponisation risks in section 3.2 of my paper. For example, climate researcher David Keith has discussed the possibility that a certain type of nanoparticle could be much longer lasting than ordinary solar geoengineering and so could potentially precipitate an ice age if deployed for long enough. I don’t think this particular technology could actually be a feasible doomsday weapon, but there is a concern that further research could uncover dangerous unknown new geoengineering technologies.

In a nutshell, for those persuaded by the Vulnerable World Hypothesis, research into technologies that could dramatically alter the weather seems like the kind of thing we should avoid if we can.

3. We should delay solar geoengineering research

Solar geoengineering research has clear risks and, given that we cannot deploy it for at least the next 50 years, there is no need to incur these costs now. Instead, the more prudent course seems to be to wait and see how well standard mitigation efforts go and then, if these continue to fail, start researching solar geoengineering in earnest around the middle of the 21st century. This would give us at least 20 years to work out the technical details and a governance framework. This seems to me like enough time, given that:

Solar geoengineering is probably technically feasible with adaptations to various different current technologies.
Extant insights from the governance of other public goods, free rider problems, and free driver problems, could in large part be applied to solar geoengineering, adapted to account for the features of that technology.

I at least don’t think that we need 50 years of forward planning to figure this technology out if we need to use it. Committing research hours only once we know the technology may actually be used makes more sense, given that research risks undermining fragile commitment to mitigation and risks discovering dangerous new technologies.

Note that my view has changed on this and that in my paper on solar geoengineering, I made a tentative case for primarily governance-focused research.

[1] For layman’s discussion of a recent paper, see this Vox piece.

[2] The reason for this is that the particles would be distributed globally by stratospheric winds.

[3] David R. Morrow, “Ethical Aspects of the Mitigation Obstruction Argument against Climate Engineering Research,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 372, no. 2031 (December 28, 2014): 20140062, https://doi.org/10.1098/rsta.2014.0062.

The ITN framework, cost-effectiveness, and cause prioritisation

From reading EA material, one might get the impression that the Importance, Tractability and Neglectedness (ITN) framework is (1) the only, or (2) the best, way to prioritise causes. For example, in EA concepts’ two entries on cause prioritisation, the ITN framework is put forward as the only or leading way to prioritise causes. Will MacAskill’s recent TED talk leaned heavily on the ITN framework as the way to make cause prioritisation decisions. Open Philanthropy Project explicitly prioritises causes using an informal version of the ITN framework.

In this post, I argue that:

Extant versions of the ITN framework are subject to conceptual problems.
A new version of the ITN framework, developed here, is preferable to extant versions.
Non-ITN cost-effectiveness analysis is, when workable, superior to ITN analysis for the purposes of cause prioritisation.
This is because:
1. Marginal cost-effectiveness is what we ultimately care about.
2. If we can estimate the marginal cost-effectiveness of work on a cause without estimating the total scale of a problem or its neglectedness, then we should do that, in order to save time.
3. Marginal cost-effectiveness analysis does not require the assumption of diminishing marginal returns, which may not characterise all problems.
ITN analysis may be useful when it is difficult to produce intuitions about the marginal cost-effectiveness of work on a problem. In that case, we can make progress by zooming out and carrying out an ITN analysis.
In difficult high stakes cause prioritisation decisions, we have to get into the weeds and consider in-depth the arguments for and against different problems being cost-effective to work on. We cannot bypass this process through simple mechanistic scoring and aggregation of the three ITN factors.
For this reason, the EA movement has thus far significantly over-relied on the ITN framework as a way to prioritise causes. For high stakes cause prioritisation decisions, we should move towards in-depth analysis of marginal cost-effectiveness.

[update – my footnotes didn’t transfer from the googledoc, so I am adding them now]

1. Outlining the ITN framework

Importance, tractability and neglectedness are three factors which are widely held to be correlated with cost-effectiveness; if one cause is more important, tractable and neglected than another, then it is likely to be more cost-effective to work on, on the margin. ITN analyses are meant to be useful when it is difficult to estimate directly the cost-effectiveness of work on different causes.

Informal and formal versions of the ITN framework tend to define importance and neglectedness in the same way. As we will see below, they differ on how to define tractability.

Importance or scale = the overall badness of a problem, or correspondingly, how good it would be to solve it. So for example, the importance of malaria is given by the total health burden it imposes, which you could measure in terms of a health or welfare metric like DALYs.

Neglectedness = the total amount of resources or attention a problem currently receives. So for example, a good proxy for the neglectedness of malaria is the total amount of money that currently goes towards dealing with the disease.[^1]

Extant informal definitions of tractability

Tractability is harder to define and harder to quantify than importance and neglectedness. In informal versions of the framework, tractability is sometimes defined in terms of cost-effectiveness. However, this does not make that much sense because, as mentioned, the ITN framework is meant to be most useful when it is difficult to estimate the marginal cost-effectiveness of work on a particular cause. There would be no reason to calculate neglectedness if we already knew tractability, thus defined.

Other informal versions of the ITN framework often use intuitive definitions such as “tractable causes are those in which it is easy to make progress”. This definition seems to suggest that tractability is defined as how much of a problem you can solve with a given amount of funding. However, if you knew this, there would be no point in calculating neglectedness, since with importance and tractability alone, you could calculate the marginal cost-effectiveness of work on a problem, which is ultimately what we care about. This definition renders the Neglectedness part of the analysis unnecessary, or at least suggests that we would only calculate neglectedness as one factor that bears on tractability, rather than treating importance, tractability and neglectedness as three distinct quantities that can be aggregated and scored.

Thus, extant informal versions of the ITN framework have some conceptual difficulties.

Extant formal definitions of tractability

80,000 Hours develops a more formal version of the ITN framework which advances a different definition of tractability:

“% of problem solved / % increase in resources”[^2]

The terms in the 80k ITN definitions cancel out as follows:

Importance = good done / ~~% of problem solved~~
Tractability = ~~% of problem solved / % increase in resources~~
Neglectedness = ~~% increase in resources~~ / extra $

Thus, once we have information on importance, tractability and neglectedness (thus defined), then we can produce an estimate of marginal cost-effectiveness.
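To make the cancellation concrete, here is a small sketch that multiplies the three terms together. Every number in it is hypothetical and chosen only to show how the units cancel; none is an estimate from 80,000 Hours or anyone else.

```python
# HYPOTHETICAL inputs, for illustration only.
good_done_per_pct_solved = 1_000_000  # importance: e.g. DALYs averted per 1% of the problem solved
pct_solved_per_pct_resources = 0.5    # tractability: % of problem solved per 1% increase in resources
current_spending = 200_000_000        # dollars currently going to the problem

# Neglectedness: one extra dollar is this percentage increase in resources.
pct_resources_per_dollar = 100 / current_spending

# The "% of problem solved" and "% increase in resources" units cancel,
# leaving good done per extra dollar, i.e. marginal cost-effectiveness.
good_done_per_dollar = (
    good_done_per_pct_solved
    * pct_solved_per_pct_resources
    * pct_resources_per_dollar
)

print(f"Good done per extra dollar: {good_done_per_dollar:.2f}")
```

The point of the sketch is that anyone who can fill in all three inputs already holds a marginal cost-effectiveness estimate.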

The problem with this is: if we can do this, then why would we calculate these three terms separately in the first place? The ITN is supposed to be useful as a heuristic when we lack information on cost-effectiveness, but on these definitions, we must already have information on cost-effectiveness. On these definitions, there is no reason to calculate neglectedness.

To be as clear as possible, on the 80k framework, if we know the ITN estimates, then we know the difference that an additional $1m (say) will make on solving a problem. So, we do not necessarily have to calculate the neglectedness of a problem in order to prioritise causes.

It is important to bear in mind, but easy to forget, that cause prioritisation in terms of the ITN criteria thus defined involves judgements about cost-effectiveness. For example, all of 80,000 Hours’ cause prioritisation rests on judgements about all ITN factors thus defined, and so we must be able to deduce from them marginal cost-effectiveness estimates for work on AI, biorisk, nuclear security and climate change, and so on.

An alternative ITN framework

Existing versions of the ITN framework seem to have some conceptual problems. Nevertheless, the ITN framework in some form often seems a useful heuristic. The question therefore is: how should we define tractability in a conceptually coherent way such that the ITN framework remains useful?

The research team at Founders Pledge has developed a framework which attempts to meet these criteria. We define tractability in the following way:

Tractability = % of problem solved per marginal resource.

On this definition, neglectedness is just one among many determinants of tractability. Importance and neglectedness can be quantified quite easily, but the other factors, aside from neglectedness, that bear on tractability are harder to quantify. Assuming diminishing returns, the conceptual relationship between the factors can be represented as follows:

[Figure: returns curve under diminishing returns, with ‘good done’ on the y-axis and ‘resources’ on the x-axis.]

The scale or importance of a problem is the highest point the curve reaches on the y-axis – the higher up the y-axis you can go, the better it is. Neglectedness tells you where you are on the x-axis at present. The other factors that bear on tractability tell you the overall shape of the curve. Intractable problems will have flatter curves, such that moving along the x-axis (putting more resources in) doesn’t take you far up the y-axis (solve much of the problem). Correspondingly, easily solvable problems will have steep curves.
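One way to picture this relationship is with a simple concave returns curve. The sketch below assumes an exponential-saturation shape purely for illustration; the functional form, the parameter names and the numbers are my own assumptions and are not part of the framework itself.

```python
import math


def good_done(resources, importance, steepness):
    """Hypothetical concave returns curve that approaches `importance` as resources grow."""
    return importance * (1 - math.exp(-steepness * resources))


def marginal_good(resources, importance, steepness):
    """Slope of the curve at the current level of resources (good done per extra unit)."""
    return importance * steepness * math.exp(-steepness * resources)


# Two HYPOTHETICAL problems with equal importance and equal current resources
# (i.e. equal neglectedness) but differently shaped curves.
importance = 100.0
current_resources = 10.0
steep_curve = marginal_good(current_resources, importance, steepness=0.05)
flat_curve = marginal_good(current_resources, importance, steepness=0.005)

# The same problem looks less promising once it is more crowded.
crowded = marginal_good(50.0, importance, steepness=0.05)

print(f"Marginal good, steep curve: {steep_curve:.2f}")
print(f"Marginal good, flat curve: {flat_curve:.2f}")
print(f"Marginal good, steep curve but crowded: {crowded:.2f}")
```

In this toy version, importance sets the ceiling, neglectedness sets the starting point on the x-axis, and the steepness parameter stands in for all the other factors that bear on tractability.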

When we are initially evaluating a problem, it is often difficult to know the shape of the returns to resources curve, but easy to calculate how big a problem is and how neglected it is. This is why ITN analysis comes into its own when it is difficult to gather information about cost-effectiveness. Thus, when we are carrying out ITN analysis in this new format, the process would be:

1. We quantify importance to neglectedness ratios for different problems.
2. We evaluate the other factors (aside from neglectedness) that bear on the tractability of a problem.
3. We make a judgement about whether the differences in tractability could be sufficient to overcome the initial importance/neglectedness ranking.

For step 1, problems with higher importance/neglectedness ratios should be a higher priority, other things equal. That is, we should prefer to work on huge but neglected problems rather than small crowded ones, other things equal.
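As a rough illustration of step 1, the sketch below ranks three made-up problems by the ratio of their importance to the resources they currently receive. The problem names, units and figures are hypothetical; they are there only to show the mechanics of the initial ranking.

```python
# HYPOTHETICAL problems: importance in DALYs per year, resources in dollars per year.
problems = {
    "problem_a": {"importance": 50_000_000, "resources": 5_000_000_000},
    "problem_b": {"importance": 5_000_000, "resources": 50_000_000},
    "problem_c": {"importance": 1_000_000, "resources": 1_000_000_000},
}


def importance_per_resource(entry):
    """Importance divided by current resources: higher means bigger and more neglected."""
    return entry["importance"] / entry["resources"]


# Step 1 ranking: highest importance-to-resources ratio first.
ranking = sorted(
    problems,
    key=lambda name: importance_per_resource(problems[name]),
    reverse=True,
)

for name in ranking:
    print(name, importance_per_resource(problems[name]))
```

Here the smaller but far more neglected problem_b ranks above the much larger problem_a. Steps 2 and 3 then ask whether differences in the other tractability factors are large enough to overturn this initial ordering.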

For step 2, we would have to find a way to abstract from the current neglectedness of different problems.[^3] One way to do this would be to try to evaluate the average tractability of two different problems. Another way would be to evaluate the two problems imagining that they were at the same level of neglectedness. When we are assessing tractability, controlling for neglectedness, we would consider factors such as:

The level of opposition to working on a problem
The strength of the political or economic incentives to solve a problem
The coordination required to solve a problem

For step 3, once we have information on the other factors (aside from neglectedness) bearing on tractability, we then have to decide how these affect our initial step 1 ranking. One option would be to give problems different very rough scores on tractability, perhaps using a checklist of the factors above. Some problems will dominate others in terms of the three ITN criteria, and prioritisation will then be straightforward. In more difficult cases, some problems will be highly neglected but much less tractable than others (eg in climate change, nuclear power is much more neglected than renewables but also arguably more unpopular at all levels of neglectedness), or the tractability of work on a problem will be very unclear. In these cases, we have to make judgement calls about whether any of the differences in the other factors bearing on tractability are sufficient to change our initial step 1 ranking. That is, we have to make rough claims about the shape of the returns curve for different problems.

On this version of the framework, it is not possible to mechanistically aggregate ITN scores between problems to produce an overall cause ranking. This version of the ITN framework produces rankings between problems that are quite low resolution: it will often be difficult to know the overall ranking of different causes, analysed in this way. This is what we should expect from the ITN framework. The ITN framework is useful precisely when it is difficult to have intuitions about cost-effectiveness.

The advantage of this version of the framework is that it is more conceptually coherent than extant versions of the framework.

The disadvantages of this version of the framework are:

It relies on the assumption of diminishing returns, which may not characterise all problems.
ITN analysis is in some cases inferior to cost-effectiveness analysis as a cause prioritisation tool.

To these two points, I now turn.

2. Cost-effectiveness analysis without ITN analysis

We have seen that on some versions of the framework, ITN analyses necessarily give us the information for a marginal cost-effectiveness estimate. However, it is possible to calculate the marginal cost-effectiveness of work on a cause without carrying out an ITN analysis. There are two main ways in which cost-effectiveness analysis could differ from an ITN analysis:

Calculating the size of the whole problem

ITN analysis involves estimating the size of a whole problem. For example, when estimating the importance of malaria, one would quantify the total scale of the problem of malaria in DALYs. But if you are doing cost-effectiveness analysis, it would not always be necessary to quantify the total scale of the whole problem. Rather, you could estimate directly how good it is to solve part of a problem with a given amount of resources.

Calculating neglectedness

Cost-effectiveness analyses do not necessarily have to calculate the neglectedness of a problem. It is sometimes possible to directly calculate how much good an extra x resources will do, which does not necessarily require you to assess how many resources a problem currently receives in total. This is because neglectedness is just one determinant of tractability (understood as % of problem solved/$) among others, and it may be possible to estimate how tractable a problem is without estimating any one determinant of tractability, whether that be neglectedness, level of political opposition, degree of coordination required, or whatever.

Non-ITN cost-effectiveness estimates have two main advantages over ITN analyses.

Marginal cost-effectiveness is what we ultimately care about. If we can produce an estimate of that without having to go through the extra steps of quantifying the whole problem or calculating neglectedness, then we should do that, purely to save time.
Avoiding theoretical reliance on calculating neglectedness avoids reliance on the assumption of diminishing marginal returns, which may not characterise every problem.[^4]

To illustrate the possibility, and advantages, of non-ITN cost-effectiveness analysis, examples follow.

Giving What We Can on global health

Giving What We Can argued that donating to the best global health charities is better than donating domestically.

“The UK’s National Health Service considers it cost-effective to spend up to £20,000 (about $25,000) for a single year of healthy life added.

By contrast, because of their poverty many developing countries are still plagued by diseases which would cost the developed world comparatively tiny sums to control. For example, GiveWell estimates that the cost per child life saved through an LLIN distribution funded by the Against Malaria Foundation is about $7,500. The NHS would spend this amount to add about four months of healthy life to a patient.”

This argument uses a cost-effectiveness estimate to argue for focusing on the best global health interventions rather than donating in a high-income country.

It is true that Giving What We Can appeals here to neglectedness as a way to explain why the cost-effectiveness of health spending differs between the UK and the best global poverty charities. But the argument from the direct cost-effectiveness estimate alone is sufficient to get to the conclusion: if the cost-effectiveness of health spending were actually higher in the UK than in poor countries, then the point about neglectedness would be moot. This illustrates the relation neglectedness has to cost-effectiveness analysis, and how neglectedness analysis is not always necessary for cause comparisons.
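The arithmetic behind the quoted comparison can be checked directly. The sketch below is only illustrative: the two dollar figures come from the passage quoted above, while the function and constant names are made up for this example.

```python
# Figures taken from the Giving What We Can passage quoted above.
NHS_THRESHOLD_USD_PER_HEALTHY_YEAR = 25_000  # about £20,000, per the quote
AMF_COST_PER_CHILD_LIFE_USD = 7_500          # GiveWell estimate cited in the quote


def nhs_months_bought(budget_usd: float) -> float:
    """Months of healthy life implied by the NHS threshold for a given budget."""
    healthy_years = budget_usd / NHS_THRESHOLD_USD_PER_HEALTHY_YEAR
    return healthy_years * 12


if __name__ == "__main__":
    months = nhs_months_bought(AMF_COST_PER_CHILD_LIFE_USD)
    print(f"{months:.1f} months of healthy life at the NHS threshold")
```

The result is roughly 3.6 months, which matches the "about four months" in the quote. No estimate of neglectedness enters the calculation.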

Lant Pritchett on economic growth

In his paper ‘Alleviating Global Poverty: Labor Mobility, Direct Assistance, and Economic Growth’, Lant Pritchett argues that research on, and advocacy for, economic growth is a better bet than direct ‘evidence-based development’ (eg, distributing bednets, cash transfers and deworming pills).

Here he lays out the potential benefits of one form of evidence-based development, the ‘Graduation approach’:

“Suppose the impact of the Graduation program in Ethiopia was what it was on average for the five countries and generated $1,720 in NPV for each $1000 invested.” (p25)

Thus, one gets a 1.7x return from the Graduation approach. Here he lays out the benefits of research and advocacy for growth:

“The membership of the American Economics Association is about 20,000 and suppose the global total number of economists is twice that and the inclusive cost to someone of an economist per year is $150,000 on average. Then the cost of all economists in the world is about 6 billion dollars. Suppose this was constant for 50 years and hence cost 300 billion to sustain the economics profession from 1960 to 2010. Suppose the only impact of all economists in all these 50 years was to be even a modest part of the many factors that persuaded the Chinese leadership to switch economic strategy and produce 14 trillion dollars in cumulative additional output.” (p24)

Even if the total impact of all economists in the world for 50 years was only to increase by 4% (in absolute terms) the probability of the change in course in Chinese policy, it would still have greater expected value than directly funding the graduation approach. Since development economists likely did much more than this, research and advocacy for growth-friendly policies is better than evidence-based development. Pritchett continues:

“For instance, the World Bank’s internal expenditures (BB budget) on all of Development Economics (of which research is just a portion) in FY2016 was about 50 million dollars. The gains in NPV of GDP from just the Indian 2002 growth acceleration of 2.5 trillion are 50,000 times larger. The losses in NPV from Brazil’s 1980 growth deceleration are 150,000 times larger. So even if by doing decades of research on what accelerates growth (or avoids losses) and even if that only as a small chance of success in changing policies this still could have just enormous returns—because the policy or other changes that create growth induces country-wide gains in A (which are, economically, free) and induces voluntary investments that have no direct fiscal cost (or conversely causes those to disappear).” (p25)

This, again, is a way of placing a lower bound on a cost-effectiveness estimate of research and advocacy for growth, as against direct interventions.
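The 4% figure above can be reproduced from the numbers Pritchett gives. The following is a rough sketch rather than Pritchett's own calculation: the inputs are the ones quoted from pp. 24-25, while the variable and function names are illustrative assumptions.

```python
# Inputs quoted from Pritchett (pp. 24-25); names are illustrative only.
WORLD_ECONOMISTS = 2 * 20_000        # twice the AEA membership
COST_PER_ECONOMIST_USD = 150_000     # inclusive annual cost per economist
YEARS = 50                           # 1960 to 2010
CHINA_GAIN_USD = 14e12               # cumulative additional output
GRADUATION_RETURN = 1_720 / 1_000    # NPV generated per dollar invested


def break_even_probability(cost, gain, benchmark_return):
    """Probability shift at which the expected gain equals what the
    benchmark programme would return on the same spending."""
    return cost * benchmark_return / gain


profession_cost = WORLD_ECONOMISTS * COST_PER_ECONOMIST_USD * YEARS
p_star = break_even_probability(profession_cost, CHINA_GAIN_USD, GRADUATION_RETURN)

if __name__ == "__main__":
    print(f"Cost of the profession: ${profession_cost / 1e9:.0f}bn")
    print(f"Break-even probability: {p_star:.1%}")
```

The break-even point comes out a little under 4%, so a 4 percentage point increase in the probability of the Chinese policy change is enough for the economics profession to beat the Graduation approach in expected value.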

Trying to bend the reasoning here into an ITN analysis would add unnecessary complexity to Pritchett’s argument. This illustrates the advantage of non-ITN cost-effectiveness analysis:

Calculating the scale of the benefits of economic growth

What is the total scale of the problem that economic growth is trying to solve? Economic growth can arguably produce arbitrarily large benefits, so should we use a discount rate of some sort? Which one should we use? Etc. We can avoid these questions by focusing, as Pritchett does, on the limited benefits of particular growth episodes.

Calculating the neglectedness of economics research

At no point does Pritchett appeal to the neglectedness of research and advocacy for growth relative to evidence-based development. Doing so is unnecessary to get to his conclusion.

Bostrom on existential risk

In his paper, ‘Existential Risk Prevention as Global Priority’, Nick Bostrom defends the view that reducing existential risk should be a top priority for our civilisation, and argues:

“Even if we give this allegedly lower bound on the cumulative output potential of a technologically mature civilization a mere 1% chance of being correct, we find that the expected value of reducing existential risk by a mere one billionth of one billionth of one percentage point is worth a hundred billion times as much as a billion human lives.

One might consequently argue that even the tiniest reduction of existential risk has an expected value greater than that of the definite provision of any “ordinary” good, such as the direct benefit of saving 1 billion lives.”

This is an argument in favour of focusing on existential risk that provides a lower bound cost-effectiveness estimate for the expected value of existential risk reduction. The argument is: plausible work on existential risk is likely to make at least very small reductions in existential risk, which will have greater expected value than any action one could plausibly take to improve (eg) global poverty or health. Estimating the total neglectedness of the problem of existential risk is unnecessary here. You just have to know the lower bound of how big an effect a state could expect to have on existential risk. That is, you have to know the lower bound of tractability, understood as ‘% of problem solved /$’. Neglectedness is one determinant of tractability (thus defined), and it is not necessary to evaluate all determinants of tractability when evaluating tractability.

Matheny on existential risk and asteroid protection

In his ‘Reducing the Risk of Human Extinction’, Jason Matheny argues that existential risk reduction has very high expected value, and uses the example of asteroid protection to illustrate this. He concludes that an asteroid detection and deflection system costing $20 billion would produce benefits equivalent to saving a life for $2.50.[^5]

Since this cost per life saved is much lower than what almost all global poverty and health interventions achieve, asteroid protection is better than global poverty and health. One might add that since most experts think that work on other problems such as AI, biosecurity and nuclear security is much more cost-effective than asteroid protection (from a long-termist point of view), working on these problems must (per epistemic modesty) a fortiori be even better than global poverty and health. If there were reliable conversions between the best animal welfare interventions and global poverty and health interventions, then this would also enable us to choose between the causes of existential risk, global poverty and animal welfare.

In this case, one does not need to estimate neglectedness because we can already quantify what a particular amount of resources will achieve in reducing the problem of existential risk.

3. Conclusions

In this post, I have argued that:

Extant versions of the ITN framework are subject to conceptual problems.
A new version of the ITN framework, developed here, is preferable to extant versions.
Non-ITN cost-effectiveness analysis is, when workable, superior to ITN analysis for the purposes of cause prioritisation.
This is because:
1. Marginal cost-effectiveness is what we ultimately care about.
2. If we can estimate the marginal cost-effectiveness of work on a cause without estimating the total scale of a problem or its neglectedness, then we should do that, in order to save time.
3. Marginal cost-effectiveness analysis does not require the assumption of diminishing marginal returns, which may not characterise all problems.

The ITN framework may be preferable to cost-effectiveness analysis when:

At current levels of information, it is difficult to produce intuitions about the effect that a marginal amount of resources will have on a problem. In that case, it may be easier to zoom out and get some lower resolution information on the total scale of a problem and on its neglectedness, and then to try to weigh up the other factors (aside from neglectedness) bearing on tractability. This can be a good way to economise time on cause prioritisation decisions.

As we have seen, this will often leave us uncertain about which cause is best. This is what we should expect from ITN analysis. We should not expect ITN analysis to resolve all of our difficult cause selection decisions. We can resolve this uncertainty by gathering more information about the factors bearing on the cost-effectiveness of working on a problem. This is difficult work that must go far beyond a simple mechanistic process of quantifying and aggregating three scores.

For example, suppose we are deciding how to prioritise global poverty and climate change. This is a high stakes decision for the EA community, as it could affect the allocation of tens of millions of dollars. While it may be relatively easy to bound the importance and neglectedness of these problems, that still leaves out a lot of information on the other multitudinous factors that bear on how cost-effective these causes are to work on. To really be confident in our choice between these causes, we would have to consider factors including:

The best way to make progress on global poverty. Should we focus on health or on growth? How do we increase growth? Etc etc
What are the indirect effects of growth? Does it make people more tolerant and liberal? To what extent does it increase the discovery of civilisation-threatening technologies? Etc etc.
Plausible estimates of the social cost of carbon. How should we quantify the potential mass migration that could result from extreme climate change? How should we discount future benefits? How rich will people be in 100 years? Etc etc.
Ways to convert global poverty reduction benefits into climate change reduction benefits.
What are the best levers to pull on in climate change? How good would a carbon tax be and how tractable is it? Can renewables take over the electricity supply? Is innovation the way forward? What are the prospects of success for nuclear innovation and enhanced geothermal? Do any non-profits stand a chance of affecting innovation? Does increasing nuclear power lead to weapons proliferation? Etc etc.

These are all difficult questions, and we need to answer them in order to make reasonable judgements about cause prioritisation. We should not expect a simple three factor aggregation process to solve difficult cause prioritisation decisions such as these. The more we look in detail at particular causes, the further we get from low resolution ITN analysis, and the closer we get to producing a direct marginal cost-effectiveness estimate of work on these problems.

To have confidence in our high stakes cause prioritisation decisions, the EA community should move away from ITN analysis, and move towards in-depth marginal cost-effectiveness analysis.

Thanks to Stefan Schubert and Martijn Kaag for helpful comments and suggestions.

[^1] Plausibly, we should actually consider the total all-time resources that will go to a problem over time, but that is the subject for another post.

[^2] For mathematical ease, we can make the denominator 1 here and so calculate the good produced by a doubling of resources from the current level.

[^3] Rob Wiblin ‘The Important/Neglected/Tractable framework needs to be applied with care’ (2016)

[^4] On this, see Arepo (Sasha Cooper), ‘Against neglectedness’, EA Forum (Nov 2017); sbehmer, ‘Is Neglectedness a Strong Predictor of Marginal Impact?’, EA Forum (Nov 2018). See also Owen Cotton-Barratt, ‘The law of diminishing returns’, FHI (2014).

How modest should you be?

In this post, I discuss the extent to which we may appeal to the object-level reasons when forming beliefs. I argue that the object-level reasons should sometimes play a role in determining our credences, even if that is only via our choice of epistemic peers and superiors. Thanks to Gregory Lewis, Stefan Schubert, Aidan Goth, Stephen Clare and Johannes Ackva for comments and discussion.

1. Introduction

There was a lively discussion in EA recently about the case for and against epistemic modesty. On epistemic modesty, one’s all-things-considered credences (the probability one puts on different propositions) should be determined, not by consideration of the object-level arguments, but rather by the credences of one’s epistemic peers and superiors.

Suppose I am trying to form a judgement about the costs and benefits of free trade. On epistemic modesty, my credence should be determined, not by my own impression of the merits of the object-level arguments for and against free trade (regarding issues such as comparative advantage and the distributional effects of tariffs), but rather by what the median expert (e.g. the median trade economist) believes about free trade. Even if my impressions of the arguments lean Trumpian, epistemic modesty requires me to defer to the experts. There is thus a difference between (1) my own personal impressions and (2) the all-things-considered credences I ought to have.

In this way, epistemic modesty places restrictions on the extent to which agents may permissibly appeal to object-level reasons when forming credences. When we are deciding what to believe about an issue, meta-level (non-object-level) considerations about the epistemic merits of the expert group are very important.[1] These meta-level considerations include:

Time: These putative experts have put significant time into thinking about the question
Ability: They are selected for high cognitive ability
Scrutiny: Their work has been subject to substantial scrutiny from their peers
Numbers: They are numerous.

On epistemic modesty, a strong argument for the view that we should defer to trade economists on trade is that they are smart people, who put a lot of time into thinking about the topic, have been subject to significant external scrutiny, and they are numerous, which produces a wisdom of crowds effect. These factors are major advantages that trade economists have over me with respect to this topic, which suggests that the aggregate of economists are >10x more likely to be correct than me. So, on this topic, deference seems like a reasonable strategy.

1.1. Do the object-level arguments matter at all?

However, on epistemic modesty, the object-level arguments are not completely irrelevant to the all-things-considered credences I ought to have. Before we get into why, it is useful to distinguish two types of object-level arguments:

1. People’s object-level arguments on propositions relevant to people’s epistemic virtue on p, not including object-level arguments about p.
2. People’s object-level arguments and credences about p.

Suppose that our aim is to assess some proposition, such as the efficacy of sleeping pills. When we are choosing peers and superiors on this question, people’s object-level reasoning on medicine in general is relevant to their epistemic virtue. If we learn that someone has good epistemics on medical questions and knows the literature well (picture a Bayesian health economist), then that would be a reason to upgrade their epistemic virtue on the question of whether sleeping pills work. If we learn that someone has in the past appealed to homeopathic arguments about medicine, then that would be a reason to discount their epistemic virtue on the efficacy of sleeping pills. Thus, 1 is relevant to people’s epistemic virtue, and so is relevant to our all-things-considered credences on p.

This point is worth emphasising: even on epistemic modesty, consideration of the object-level arguments can be important for your all-things-considered credence, although the effect is indirect and comes via peer selection.

However, on some strong versions of epistemic modesty, 2 is not relevant to our assessment of people’s epistemic virtue on p. We must assess people’s epistemic virtue on a proposition p in advance of considering their object-level reasons and verdict on p. Having selected trade economists as my epistemic superiors on the benefits of trade, I cannot demote them merely because their object-level arguments on trade seem implausible. Call this the Object-level Reasons Restriction:

Object-level Reasons Restriction = Your own impressions of the object-level reasons regarding p cannot be used to determine people’s epistemic virtue on p.[2]

2. Problems with the Object-level Reasons Restriction

In this section, I will argue against the Object-level Reasons Restriction using an argument from symmetry: if type 1 reasons are admissible when evaluating epistemic virtue, then type 2 reasons are as well.

Suppose I am forming beliefs about the efficacy of sleeping pills. When I am choosing peers and superiors on this question, on epistemic modesty I am allowed to take into account people’s object-level reasoning on medicine in general: I can justifiably exclude homeopaths from my peer group, for example. But if I choose Geoff as my peer on the efficacy of sleeping pills and he starts making homeopathic arguments about them, then I am not allowed to demote him from my peer group. I see no reason for this asymmetry, so I see no reason to accept the Object-level Reasons Restriction. If the object-level arguments about medicine in general are relevant to one’s epistemic virtue on the efficacy of sleeping pills, then the object-level reasons about the efficacy of sleeping pills are also relevant.

Suppose I have selected Geoff and Hilda as equally good truth-trackers on the efficacy of sleeping pills, and I then get the information that Geoff appeals to homeopathic arguments. From a Bayesian point of view, this is clearly an update away from him being as good a truth-tracker on the efficacy of sleeping pills as Hilda.
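To make the direction of this update concrete, here is a toy application of Bayes’ rule. Every number in it is hypothetical and chosen purely for illustration; nothing in the argument depends on the particular values, only on homeopathic arguments being more likely to come from a poor truth-tracker than from a good one.

```python
def posterior_reliable(prior, p_evidence_if_reliable, p_evidence_if_unreliable):
    """Bayes' rule for P(reliable | evidence), with two exhaustive hypotheses."""
    numerator = prior * p_evidence_if_reliable
    denominator = numerator + (1 - prior) * p_evidence_if_unreliable
    return numerator / denominator


# Hypothetical inputs: Geoff and Hilda start out with the same prior.
PRIOR_RELIABLE = 0.8
P_HOMEOPATHY_IF_RELIABLE = 0.02    # good truth-trackers rarely argue this way
P_HOMEOPATHY_IF_UNRELIABLE = 0.40  # poor truth-trackers do so more often

hilda = PRIOR_RELIABLE  # no new evidence about Hilda
geoff = posterior_reliable(
    PRIOR_RELIABLE, P_HOMEOPATHY_IF_RELIABLE, P_HOMEOPATHY_IF_UNRELIABLE
)

if __name__ == "__main__":
    print(f"Hilda: {hilda:.2f}, Geoff after the evidence: {geoff:.2f}")
```

With these made-up inputs Geoff falls from 0.80 to about 0.17 while Hilda stays at 0.80. Any likelihoods with the same ordering would move Geoff in the same direction, though by a different amount.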

I envision two main responses to this, one from epistemic egoism, and one from rule epistemology.

2.1. Epistemic egoism

Firstly, it might be argued that by downgrading Geoff’s epistemic virtue, I in effect put extra weight on my own beliefs merely because they are mine. Since we are both peers, I should treat each of us equally.

I don’t think this argument works. My argument for demoting Geoff is not “my belief that his reasoning is bad” but rather “his reasoning is bad, which I believe”.[3] These two putative reasons are different. One can see this by using the following counterfactual test. Imagine a hypothetical world where I believe that Geoff’s reasoning is bad, but I am wrong. Do I, in the actual world, believe that I should still demote Geoff in the imagined world? No I do not. If I am wrong, I should not demote. So, my reason for downgrading is not my belief that the reasoning is bad, but is rather the proposition that the reasoning is bad, which I believe. Therefore, there is no objectionable epistemic egoism here.

In the sleeping pills case, when I demote someone from my peer group because they appeal to homeopathy, my reason to demote them is not my belief that homeopathy is unscientific, it is the fact that homeopathy is unscientific.

2.2. Rule epistemology

Another possible response is to say that I do better by following the rule of not downgrading according to the object-level reasons, so I should follow that rule rather than trying to find the exceptions. My response to this is that I do even better by following that rule except when it comes to Geoff. There is a difference between trying (and maybe failing) to justifiably downgrade experts and justifiably downgrading experts. If it is asked – ‘why think you should downgrade in this case?’, then a good answer is just to refer to the object-level reasons. This is a good answer regarding sleeping pills, just as it is, as everyone agrees, regarding medical expertise in general.

It is true that the rule of demoting people on the basis of the object-level reasons is liable to abuse. A mercantilist could use this kind of reasoning to demote all trade economists to be his epistemic inferiors. However, the problem here is bad object-level reasoning and epistemic virtue assessment, not the plausibility of the Object-level Reasons Restriction. As we have seen, proponents of epistemic modesty agree that object-level reasons are sometimes admissible when we are assessing people’s epistemic virtue. For example, the Object-level Reasons Restriction would not rule out the mercantilist using his views on trade as a reason to demote economists on other questions in economics. This is also an error, but one that merely stems from mundane bad reasoning, which we have non-modest reasons to care about.

What is usually going on when mercantilists dismiss the expert consensus on trade is (1) they simply don’t understand the arguments for and against free trade; and (2) that many of them also simply do not know that there is such a strong expert consensus on trade. This is simply inept reasoning, not an argument for never considering the object-level arguments when picking one’s peers.

3. Where does this leave us?

I have argued that it is sometimes appropriate to appeal to the object-level arguments on p when deciding on people’s epistemic virtue on p. I illustrated this with (I hope) a relatively clear case involving a rogue homeopath.

The arguments here are potentially practically important. Any sophisticated assessment of some topic will involve appealing to the object-level reasons to give more weight to the views of some putative experts than to others who appear to do equally well on the meta-level considerations.

This is the approach that GiveWell takes when assessing the evidence on interventions in global health. Their approach is not to just take the median view in the literature on some question, but rather to filter out certain parts of the literature on the basis of general methodological quality. In their post on Common Problems with Formal Evaluations, they say they are hesitant to use non-RCT evidence because of selection effects and publication bias.[4] In this way, object-level arguments play a role in filtering the body of evidence that GiveWell responds to. Nonetheless, their approach is still modest in the sense that the prevailing view found among studies of sufficiently high quality plays a major role in determining their all-things-considered view.
In his review of the effect of saving lives on fertility, David Roodman rules out certain types of evidence, such as cross-country regressions and some of the instrumental variables studies, and puts more weight on other quasi-experimental studies. In this way, his final credence is determined by the expert consensus as weighted by the object-level reasons, rather than the crude median of the aggregate of putative experts.
Continental philosophers – Foucauldians, Hegelians, Marxists, postmodernists – do pretty well on the meta-level considerations – time, general cognitive ability, scrutiny and numbers. But they do poorly at object-level reasoning. This is a good reason to give them less weight than analytic philosophers when forming credences about philosophy.[5]

The examples above count against an ‘anything goes’ approach to assessments of the object-level reasons; the object-level reasoning still has to be done well. David Roodman and GiveWell put incredible amounts of intellectual heft into filtering the evidence. Epistemic peerhood and superiority are relative, and Roodman and GiveWell set a high bar. The point here is just that the object-level reasons are sometimes admissible.

The object-level reasons and the meta-level considerations should each play a role in assessments of epistemic virtue. This can justify positions that seem immodest. Some examples that are top of mind for me:

We should put less weight on estimates of climate sensitivity that update from a uniform prior, as I argued here.
We should almost all ignore nutritional epidemiology and just follow an enlightened common sense prior on nutrition.
We should often not defer to experts who use non-Bayesian epistemology, where this might make a difference relative to the prevailing (eg frequentist or scientistic) epistemology. E.g. this arguably played a role in early mainstream expert scepticism towards face masks as a tool to prevent COVID transmission.

[1] I avoid calling these ‘outside view’ reasons because the object-level reasons might also be outside view/base rate-type reasons. It might be that the people who do well on the meta-level reasons ignore the ‘outside view’ reasons. Witness, for example, the performance of many political scientists in Tetlock’s experiments.

[2] This is how I interpret Greg Lewis’ account of epistemic modesty: “One rough heuristic for strong modesty is this: for any question, find the plausible expert class to answer that question (e.g. if P is whether to raise the minimum wage, talk to economists). If this class converges on a particular answer, believe that answer too. If they do not agree, have little confidence in any answer. Do this no matter whether one’s impression of the object level considerations that recommend (by your lights) a particular answer.”

[3] David Enoch, “Not Just a Truthometer: Taking Oneself Seriously (but Not Too Seriously) in Cases of Peer Disagreement,” Mind 119, no. 476 (January 10, 2010): sec. 7, https://doi.org/10.1093/mind/fzq070.

[4] I have some misgivings about the RCT-focus of GiveWell’s methodology, but I think the general approach of filtering good and bad studies on the basis of methodological quality is correct.

[5] I think a reasonable case could be made for giving near zero weight to the views of Continental philosophers on difficult philosophical topics.

Should we buy coal mines?

At Effective Altruism Global, Will MacAskill proposed the idea of buying a coal mine in order to keep coal in the ground, as a potential longtermist megaproject. The idea was covered in a piece by Richard Fisher for the BBC. There is a large literature in economics on the idea of reducing the supply of fossil fuels, and the idea recently received additional attention thanks to a post by Alex Tabarrok on the Marginal Revolution blog.

I have spent around 50 hours looking into whether it would be worth buying a coal mine.

I talked to more than 10 experts about this. I have mainly focused on feasibility in the US and Australia because the project seemed more promising in those countries due to likely political and legal barriers in countries like India, China and Indonesia. I have concluded that:

Buying a coal mine is much more expensive than it might at first appear due to costs to reclaim the land after mine closure and foregone royalty payments to the mine owner.
Buying mines is legally impossible for some mines, and for others seems extremely practically difficult, for legal, political and cultural reasons. My impression is that there are few viable opportunities to buy coal mines in the world.
The full costs of buying a mine would be in the hundreds of millions of dollars for a mine of a similar size to a top 30 US coal mine. So, the barriers to buying just one mine are very high. For context, total global spending on climate philanthropy is $5-9 billion. So you could spend close to a tenth of global climate philanthropy buying just one coal mine.
It is unclear whether buying a mine and retiring it would make much difference compared to the counterfactual, in the US at least, because coal demand there seems to be in structural decline. The counterfactual benefits seem larger in Australia because the coal industry is more buoyant, but, for that very reason, buying up coal mines would be much more politically difficult there.
It seems likely that there are other better ways to limit coal production or consumption, such as political advocacy, or paying some lawyers to gum up the works. I think EA donors could have more impact through other projects.
Buying up a coal mine might be a good option for donors who would otherwise have lower impact and have the time to identify some workable opportunities.

This is a very quick write-up of my reasoning that I thought I would share in case anyone else had thought about looking into this area. I am optimising for speed and writing up my findings quickly, rather than referencing to back up my claims, doing lots of checking of my numbers, and asking for review before publishing. I am >75% sure that my central claims are correct.

Thanks to Max Daniel, Abie Rohrig and Will MacAskill for comments and discussion.

1. Costs are much higher than they seem

In his recent post on the promise of buying a coal mine, Alex Tabarrok noted that there is a coal mine for sale for $7.8m. If you assume that this is the total cost of buying the mine and shutting it down, then doing so would be highly cost-effective. However, the true cost would be much higher than this.

Reclamation costs

Owners of coal mining leases have to pay ‘reclamation costs’, which cover the environmental cleanup of the mine: turning it into forest, a solar farm etc. In the US, the typical cost of reclamation is ~$10,000 per acre (report). For mines with data in the US, the size ranges from 1,000 acres to 40,000 acres. North Antelope Rochelle, the world’s largest coal mine, is ~5,000 acres. So, the reclamation costs would range from $10m to $400m and would be $50m for North Antelope (calculations are here). For the McDowell mine listed by Tabarrok, reclamation costs would be $30 million, which is several times the price of the mining lease.

Lease owners also have to cover reclamation in Australia. In both the US and Australia, this is typically done with a bond. Many coal producers are strategically going bankrupt to avoid paying for reclamation (report).

Foregone royalties

Governments typically charge royalties on the gross value of fossil fuels sold from an extraction lease. The royalty rate for federal US coal is 12% for surface coal and 8% for underground coal. The top 30 coal mines in the US each produce more than 4m tons of coal per year (US EIA information). North Antelope produces 66m tons. So, the federal royalty for these mines would range from $20m up to $330m per year for North Antelope. States also charge royalties on fossil fuel production, typically around 5-20%.
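The text does not state the coal price behind these royalty figures, but one can be backed out of them. The sketch below assumes the 12% surface rate applies to North Antelope and derives an implied price per ton from the $330m and 66m-ton figures; that implied price is a back-calculation for illustration, not a number from the post or the EIA.

```python
# Back out the coal price implied by the royalty figures in the text.
SURFACE_ROYALTY_RATE = 0.12           # federal rate for surface coal (from the text)
NORTH_ANTELOPE_TONS = 66_000_000      # annual production (from the text)
NORTH_ANTELOPE_ROYALTY = 330_000_000  # annual federal royalty in USD (from the text)

# Royalty paid per ton, and the sale price that would produce it at a 12% rate.
royalty_per_ton = NORTH_ANTELOPE_ROYALTY / NORTH_ANTELOPE_TONS
implied_price_per_ton = royalty_per_ton / SURFACE_ROYALTY_RATE  # assumption, not sourced


def annual_royalty_usd(tons: float,
                       price_per_ton: float = implied_price_per_ton,
                       rate: float = SURFACE_ROYALTY_RATE) -> float:
    """Royalty on gross sale value: tons x price x rate."""
    return tons * price_per_ton * rate


print(f"royalty per ton: ${royalty_per_ton:.2f}")
print(f"implied price per ton: ${implied_price_per_ton:.2f}")
print(f"4m-ton mine: ${annual_royalty_usd(4_000_000) / 1e6:.0f}m per year")
```

Under that implied price, a 4m-ton mine comes out at the $20m lower bound given in the text, so the two ends of the quoted range are consistent with each other.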

Private reserve owners in the US charge commission on the sale value of the coal extracted. In the mine listed on Alex Tabarrok’s initial post, the commission was 4% per ton.

In Australia, the royalty rate is 5-10%, depending on the state. The length of a typical lease is 20 years.

If an environmentalist did buy the lease, the reserve owner would want the buyer to cover the foregone royalties from fossil fuel extraction. This would add tens to hundreds of millions of dollars to costs, depending on the mine size.

Just transition

There has recently been more attention from philanthropists on a ‘just transition’ for communities that were previously reliant on fossil fuel extraction, to give these communities an off-ramp from the loss of jobs and local tax revenue.

An environmentalist aiming to buy a coal mine would have discretion over whether to fund just transition for communities in the places affected. If they did decide to pay for it, this would increase costs. If they decided not to pay for it, the political blowback would be greater.

Total costs

The total cost to buy the 20-year lease for North Antelope Rochelle, including foregone royalty payments and reclamation, would be upwards of $6bn. For a top 30 mine in the US it would be >$500m. The mine listed by Tabarrok would need to be backed by >$100m to cover reclamation and foregone commission (calculations). This is far in excess of the list price of the mine, which was about $8m.
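As a rough check on the North Antelope figure, the sketch below adds the reclamation estimate to 20 years of foregone federal royalties, using only numbers quoted earlier in the post. It assumes no discounting and leaves out state royalties, the lease price and any just transition funding, so it is a simplified lower-bound style estimate rather than a reproduction of the linked calculations.

```python
def total_cost_usd(reclamation_usd: float,
                   annual_royalty_usd: float,
                   lease_years: int = 20) -> float:
    """Undiscounted cost: reclamation plus foregone royalties over the lease."""
    return reclamation_usd + annual_royalty_usd * lease_years


# Figures quoted in the text for North Antelope Rochelle.
north_antelope_total = total_cost_usd(
    reclamation_usd=50_000_000,      # ~5,000 acres at ~$10,000 per acre
    annual_royalty_usd=330_000_000,  # federal royalty per year
)

print(f"North Antelope Rochelle: ${north_antelope_total / 1e9:.2f}bn")
```

The royalty term dominates: reclamation is under 1% of the total here, which is why the result lands at roughly $6.65bn, in line with "upwards of $6bn".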

On the assumption that all of the coal resources in these mines would be extracted, the cost-effectiveness of buying a single mine would still be reasonable – $0.20 to $14 per tonne of CO₂ avoided, depending on the mine. But the environmentalist buyer would still need upwards of $100 million to make the project work and even then, there is no guarantee that the project would work – it would still likely be embroiled in political and legal challenges.

Note there is a trade-off between cost and additionality, i.e. whether the coal would have been left in the ground anyway. If you bid against no-one for a lease, the cost will be low but the chance of additionality is also low. If you win a competitive bid, the cost will be higher, but you would also have additionality.

2. Legal barriers seem very high

In the US, around 33-40% of coal is on federal Bureau of Land Management land. This is mainly in the western US in the Powder River Basin. In the eastern US in Appalachia, coal tends to be privately owned. I think there is also coal on land owned by US states.

In Australia, all coal is owned by the states.

State discretion over coal resources makes buying coal mines difficult. In Germany, the government summarily excluded Greenpeace from the bid for a coal mine without providing an explanation. All in all, I think wherever this is done, there is a real risk of being locked in an interminable legal battle on this. It is worth noting that Citigroup explored the ‘buying mines’ idea but abandoned it, though they were planning to run the mines and then retire them in 2045, so their project was importantly different.

US Federal coal

Coal on federal Bureau of Land Management land is subject to ‘use it or lose it’ laws: coal must be ‘diligently developed’ otherwise the lease owner would lose the lease. So, at present, retiring coal is not possible on federal land.

US state coal

There is some coal on what is known as state trust land, which is land granted to states. Funds from state trust land are used to pay for schools and public amenities.

There are technically no ‘use it or lose it’ laws on state trust land, though states have discretion over who wins energy leases on state trust land. This means that environmentalists are unlikely to win conservation leases without first negotiating with states.

My impression is that there are not (m)any new thermal coal mining leases coming up on state trust land, though I couldn’t get comprehensive info on this. For comparison, there have been around 2 new coal mining leases sold on federal land each year for the last few years (EIA data). The market for metallurgical coal (for steel production) seems more buoyant.

Private US coal leases

For privately owned coal reserves, there are no ‘use it or lose it’ laws. However, my impression is that there is not much publicly available information on which private leases are up for sale. Many of the deals seem to happen behind the scenes between reserve owners and mining companies. My sense was that it would be hard for an environmentalist to develop the relationships and connections required to identify and execute viable deals.

My impression is that there are very few new leases coming up for thermal coal at the moment.

Australian coal

There are no strict ‘use it or lose it’ laws in Australian states. However, states have discretion over who wins mining leases. Given the cultural and political importance of coal mining in Australia, it seems to me very unlikely that a state would grant a mining lease to an environmentalist.

3. Political barriers

I would guess that the cultural and political significance of the mining industry is more important than any concerns about taxation foregone.

Politics in the US

For example, upon learning that the federal government was considering increasing the federal coal royalty to 18%, Wyoming said they would reimburse the industry from state coffers!

Coal mines are partnered with coal power plants in power purchase agreements, so just shutting off a mine would be problematic if it meant losing reliable power. This would increase the risk of a legal challenge.

Politics in Australia

Coal mining and climate are factors that determine elections in Australia. The current Prime Minister once used a lump of coal as a prop in parliament. It is widely thought that he won the last election because the Labor Party was insufficiently pro-coal. I think this is the kind of thing that it would be difficult to fully compensate for with money. It would be too hard to target the money effectively at those affected, and I think many people would view it as something ‘sacred and not to be traded off’.

There is a trade-off between additionality and tractability. The case for additionality is strongest in the jurisdictions where it is clearest that the coal would have been extracted if the environmentalist did not win the bid. But these are also the jurisdictions where the political environment is likely to be pro-coal and where it will be hardest for the environmentalist to win.

4. Would the coal be extracted if you didn’t buy the mine?

US

In the US, market forces and regulations seem to be taking care of coal. Renewables and batteries are likely to decline in cost in the future. They are already cheaper than many coal plants, but the plants keep running due to long-term contracts with utilities. Regulations on coal are also now quite strict and environmental policies are becoming stricter at the state level.

Coal production in the US has been trending downwards for more than a decade, mainly due to shale gas. I think it is unclear whether the trend in declining gas prices will continue and my best guess is that they will be static.

There are no planned new coal power plants in the US. The industry seems to be in structural decline.

So, there is a real risk of buying up a lease for coal that would not have been extracted anyway.

Australia

Australian coal is mainly for export (around three quarters compared to 15% in the US). It is high quality and geographically close to areas of high demand in East Asia. So, the prognosis for that industry looks good, unfortunately. Buying up a mine in Australia stands a greater chance of additionality, but also seems very unlikely to succeed, for the reasons mentioned above.

Climate change & Longtermism

This is a link to a long report I did on climate risk. Report.

Football without boundaries

Until recently, the offside rule stated that you are in an offside position if any part of the body you can score with is behind the last defender. You can’t score with your arm or hands, so they can’t make you offside. If any other part of you is behind the last defender, you’re in an offside position.

Harry Kane is offside here because he is behind the last defender when the ball is played forward to him.

Before 2019, we relied on hapless bald linesmen to make offside decisions. These linesmen would have to monitor when the ball was played forward and simultaneously whether the attacking player is in an offside position. This is a tough task because often the ball will be played forward from more than 20 yards away and some footballers (Ronaldo, Agbonlahor etc) are quick. Linesmen’s decisions were analysed to death in slow motion by clueless football pundits who, with the camaraderie of the changing room a distant memory, are typically clinically bored.

In 2019 the Premier League brought in the Video Assistant Referee (VAR). Humans were replaced by infallible robot referees housed in a chrome and granite bunker in west London guarded by armed Premier League drone swarms. The VAR process is utterly interminable: decisions are not referred to the European Court of Human Rights, but it feels like it. The new robot referees really don’t want to get it wrong lest they upset Gary Neville.

All of this means that the precise meaning of the offside rule really starts to matter. A lot of offside decisions are tight. To arrive at an answer in this case for example, the robot draws geometric lines at tangents off the contorted limbs of Tyrone Mings in a gross perversion of Da Vinci.

Is toothy Brazilian Bobby Firmino offside here? He can score with his shoulder, is his shoulder off? Where does his arm start again? I can score with my nose; can my nose be offside?

To Graeme Souness, all of this is health and safety gone mad and can’t possibly be right. Plain old bloody common sense means that you can’t be a bloody millimetre offside.

The obvious solution: a thicker line. This will be introduced next season in the Premier League. The line may be up to 10cm thick, so we won’t get any of these nonsense decisions any more.

***

Picture Boris Johnson. Due to the pandemic and continual mortar shots from Dominic Cummings, he starts to develop progressive stress-induced male pattern baldness, shedding a hair every day. How many hairs would Bohnson have to lose to become bald? Is there a precise answer to this question? Intuitively, removing one hair from his head cannot be the difference between him being bald and not bald. But if we apply this principle to each hair on his head, then Bohnson would still not be bald even if he had no hair.

This is the sorites paradox, which exploits the vagueness of the term ‘bald’. Most important words are vague, so vagueness might be important. Notably, it might be vague at which point a foetus becomes a person, so vagueness might matter for abortion law.

The sorites can be stated more formally as follows:

Base step: A person with 100,000 hairs is not bald

Induction step: If a person with n hairs is not bald, then that person is also not bald with n-1 hairs.

Conclusion: Therefore, a person with 0 hairs is not bald.

The Base step is clearly true and the Conclusion is clearly false, so something must have gone wrong. The natural thing to do is to follow the logic where it leads and to say that the induction step is false. This is the approach that epistemicists take: there is a precise number of hairs we could take off Bohnson’s head at which he becomes bald where before he was not. The point is just that we cannot know where that point is. Our concepts are learned from clear cases – this man is fat, this man is bald – not quantitative definitions – a man of 100kg is fat, a man with fewer than 10,000 hairs is bald. We are bamboozled by the fact that, without us even realising it, these concepts draw sharp boundaries.
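The epistemicist picture can be made concrete with a toy model. The sketch below assumes a hypothetical cutoff of 10,000 hairs purely for illustration (the epistemicist claim is precisely that the real cutoff is unknowable), and then searches for every n at which the induction step fails.

```python
MAX_HAIRS = 100_000           # base step from the text
HYPOTHETICAL_CUTOFF = 10_000  # illustrative only; not a claim about the real cutoff


def is_bald(hairs: int, cutoff: int = HYPOTHETICAL_CUTOFF) -> bool:
    """Sharp-boundary reading: bald if and only if hairs < cutoff."""
    return hairs < cutoff


def induction_step_failures(cutoff: int = HYPOTHETICAL_CUTOFF) -> list[int]:
    """Every n where a person with n hairs is not bald but is bald with n - 1."""
    return [
        n for n in range(1, MAX_HAIRS + 1)
        if not is_bald(n, cutoff) and is_bald(n - 1, cutoff)
    ]


failures = induction_step_failures()
print(failures)
```

For any cutoff between 1 and 100,000 the list contains exactly one number, the cutoff itself. The induction step holds at every other n, which is why it seems so plausible even though it is false.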

The epistemicist approach accepts a crucial lesson I learned when studying philosophy: a boundary has no width.

Some theories of vagueness try to get round this by saying that there are clear cases and a zone of borderline cases in the middle for which it is neither true nor false that Bohnson is bald.

The most popular theory that takes this approach is known as supervaluationism (discussed here). There are several problems with this approach. Firstly, what does this say about the sorites paradox? Supervaluationism says that the induction step is false but that for any n you might pick, we will know that the claim isn’t true. So, if I were to say that Bohnson minus 50,000 hairs is not bald but Bohnson minus 50,001 hairs is bald, my claim would be neither true nor false. This would apply to any individual number I might pick. Weirdly, for the same reason, supervaluationism says that it is true that “Bohnson minus 50,001 hairs is either bald or not bald” but that it is not true that “Bohnson minus 50,001 hairs is bald” and not true that “Bohnson minus 50,001 hairs is not bald”. This looks like (and is) a contradiction.
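The pattern of verdicts described above can be reproduced with a toy supervaluation. In this sketch a sentence counts as true if it holds on every admissible sharp cutoff, false if it holds on none, and neither otherwise. The set of admissible cutoffs (40,000 to 60,000 in steps of 1,000) is a made-up assumption for illustration, and the function names are hypothetical.

```python
# Hypothetical set of admissible sharpenings of "bald" (an assumption).
ADMISSIBLE_CUTOFFS = range(40_000, 60_001, 1_000)
FULL_HEAD = 100_000


def bald_under(hairs: int, cutoff: int) -> bool:
    """Baldness on one sharpening: bald if and only if hairs < cutoff."""
    return hairs < cutoff


def supervaluate(sentence, cutoffs=ADMISSIBLE_CUTOFFS) -> str:
    """True on all sharpenings, false on all, or neither."""
    values = {sentence(c) for c in cutoffs}
    if values == {True}:
        return "true"
    if values == {False}:
        return "false"
    return "neither"


hairs = FULL_HEAD - 50_001  # "Bohnson minus 50,001 hairs"

verdicts = {
    "is bald": supervaluate(lambda c: bald_under(hairs, c)),
    "is not bald": supervaluate(lambda c: not bald_under(hairs, c)),
    "is bald or not bald": supervaluate(
        lambda c: bald_under(hairs, c) or not bald_under(hairs, c)),
    "the cutoff is exactly 50,000": supervaluate(
        lambda c: not bald_under(50_000, c) and bald_under(49_999, c)),
    "there is some cutoff": supervaluate(
        lambda c: any(not bald_under(n, c) and bald_under(n - 1, c)
                      for n in range(1, FULL_HEAD + 1))),
}

for sentence, verdict in verdicts.items():
    print(f"{sentence}: {verdict}")
```

The disjunction and the existential claim come out true while each individual disjunct and each specific cutoff come out neither true nor false, which is the combination the paragraph above objects to.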

Second, note that on the diagram above, the boundaries to the rectangular zone of borderline cases are sharp. So, on this approach, while there is no sharp transition from baldness to non-baldness, there is a sharp transition from borderline baldness to baldness. If so, where is it? Supervaluationism does posit sharp boundaries, but treats them differently to the sharp boundary between baldness and non-baldness which motivated the theory in the first place. This is known as ‘higher-order vagueness’. Boundary moving is not a good solution to vagueness.

***

Returning to the offside rule, the new thicker line is not really thick, it has just been moved 10cm back from the last defender. In the rules of football, players are offside or they aren’t; there is no purgatory of ‘borderline offside’ where we have a drop ball rather than a free kick. Since a boundary has no width, there just has to be a sharp line – moving it does not solve this problem. Next season, there will still be tight offsides – they will just be measured against a line that has been moved.
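The point that a shifted line is still a sharp line can be shown with a toy decision rule. The sketch assumes positions are measured in whole millimetres and that the new line sits the full 10cm back; both are simplifying assumptions, and the function names are hypothetical.

```python
OLD_SHIFT_MM = 0    # line drawn level with the last defender
NEW_SHIFT_MM = 100  # line moved back by up to 10cm (full 10cm assumed here)


def is_offside(ahead_mm: int, line_shift_mm: int = OLD_SHIFT_MM) -> bool:
    """Offside if the attacker is beyond the line, however it is positioned."""
    return ahead_mm > line_shift_mm


def flip_point_mm(line_shift_mm: int, max_mm: int = 1_000) -> int:
    """Smallest distance ahead of the defender at which the decision flips."""
    return next(mm for mm in range(max_mm + 1) if is_offside(mm, line_shift_mm))


print(flip_point_mm(OLD_SHIFT_MM))
print(flip_point_mm(NEW_SHIFT_MM))
```

Under either setting there is a single millimetre at which onside turns into offside. The new rule only changes where that millimetre is.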

If a millimetre can’t be the difference between being offside and not, then, as in the sorites paradox, this implies that someone who is 6 yards offside is not offside. Each season, we will have to move the line. By 2029, the offside rule will have been abolished and goal hanging will be the norm.

According to top referee mandarins, the rationale for the ‘thicker’ line is that it restores “the benefit of the doubt in favour of the attacker”. But that was never the rule. The rule is and always has been crisp and clear – if you’re offside, you’re offside. There’s no mention of doubt. (Indeed, this is why offside decisions are not cases of vagueness.)

Moreover, with VAR, doubt has been eradicated by machines. The decisions are not in doubt, but they are close. Pundits and managers object because the offside rule is now being enforced with previously unattainable accuracy. But the rule has always been that there is a sharp line, and it will ever be thus. If there is a sharp line, then players can be offside by a nose.

Creating a thicker line fails to come to terms with the fact that the world is full of sharp boundaries. The sorites paradox and VAR make us pay attention to these sharp boundaries. They offend common sense but they exist.