How VERACI-T is validating marginal emissions models

by Tara Larrue, Data Science Manager, WattTime

Marginal emissions data are playing an increasingly important role in addressing climate change. Companies use them to guide clean energy investments, policymakers rely on them to shape carbon reduction strategies, and researchers apply them to assess the environmental impact of electricity use. As the influence of these data grows, so does the need for independent, empirical validation to ensure they are as accurate and reliable as possible.

VERACI-T is a new working group that was launched to investigate the accuracy of electricity sector marginal emissions datasets in standardized ways through peer-reviewed research. This taskforce of energy experts, researchers, and industry leaders applies rigorous, proven validation techniques to test the accuracy of different marginal emissions factor (MEF) datasets, and makes all results free and open for public use.

By using real-world data and transparent methodologies, VERACI-T is helping to investigate what level of confidence is appropriate in emissions modeling, and to ensure that decisions based on these data are backed by the strongest possible evidence.

VERACI-T’s approach to validation

A key challenge in validating marginal emissions data is that it requires comparing the impact of taking an action with the impact of not taking it (the counterfactual). In science, this is formally known as “causal inference.” In most fields that rely on causal inference, testing whether a model is accurate through randomized controlled trials or natural experiments is standard procedure. Clinical trials verify whether a new drug is effective before it’s prescribed to patients. Economists use controlled experiments to understand the effects of policy decisions. Financial analysts stress-test models before putting millions of dollars at risk.

Yet when it comes to electricity emissions data, that same scientific rigor is often missing. Sometimes groups simply assert without evidence that a certain dataset is or is not accurate, or even assume that it is somehow impossible to apply causal inference to emissions data, even though doing so is accepted practice in other fields.

VERACI-T is changing that. Instead of picking sides in the debate over which dataset is best, the working group is building an open, standardized framework for testing any marginal emissions model. VERACI-T’s techniques draw on established causal inference science widely used in other fields. Using natural experiments and empirical tests, VERACI-T researchers are making it possible for anyone to objectively observe whether any given model holds up based on real-world evidence, not assumptions.

Key findings from VERACI-T’s first studies

VERACI-T has already conducted and published three major studies, each testing different aspects of model accuracy. The first paper has been peer reviewed and accepted for publication in Renewable and Sustainable Energy Reviews, and the other two papers are currently in peer review.

Paper 1: Evaluating marginal emissions models with empirical tests

The key to empirically validating any model is to work out the claims it makes about something observable in the real world, then gauge how well the real-world evidence matches those claims. VERACI-T’s first study started by working out some of the simplest possible claims that different models were making, explicitly or implicitly, about observable grid behavior. Using real-world data on actual power plant behavior from the US Clean Air Markets Program, the paper describes a rubric of tests that can be applied to any marginal operating emission rate model.

The tests are divided into four categories:

Establishing bounds for reasonable Carbon Intensities that models use. For example, one model occasionally predicted that increasing load by 100 megawatts for one hour would increase emissions by 268 tons CO₂, an implied rate of 2.68 tons CO₂ per megawatt-hour. EPA data confirm the grid in question was not physically capable of producing this outcome, which was one strike against the accuracy of that model.
Comparing against derived marginal Empirical Annual Averages as a coarse check. In one of these tests, the annual empirical change of emissions with respect to load is expected to be about equal to the annual average MEF.
Assessing hourly Expected Net-Demand Temporal Patterns to study how MEFs consider the ramping constraints of different fuel sources. In one of these tests, MEFs are expected to be higher during peak net-demand times than during low net-demand times in regions without coal.
Examining how models are able to identify periods of Renewable Curtailment. For example, several models sometimes predicted that increasing load by 100 megawatts for one hour would increase emissions by close to zero — and for some models, these times correlated well with times when the grid operator stated that they were indeed handling load increases purely by spilling less surplus solar energy. This was one notch in favor of the accuracy of the models that predicted it.
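
The bounds test in the first category comes down to simple arithmetic. The sketch below is illustrative only: the function names and the 1.2 tons CO₂ per megawatt-hour ceiling are hypothetical placeholders, not values from the paper. In practice the ceiling would come from the emission rate of the dirtiest plant actually operating on the grid in question.

```python
def implied_rate(delta_emissions_tons, delta_load_mw, hours=1.0):
    """Emissions change per MWh of added load (tons CO2 per MWh)."""
    return delta_emissions_tons / (delta_load_mw * hours)


def within_bounds(rate, max_plant_rate, min_rate=0.0):
    """True if a marginal rate lies inside the assumed plausible range."""
    return min_rate <= rate <= max_plant_rate


# Figures quoted in the text: 268 tons CO2 for 100 MW sustained for one hour.
rate = implied_rate(268.0, 100.0)

# Hypothetical ceiling for illustration only (not taken from the paper).
ASSUMED_MAX_PLANT_RATE = 1.2

print(f"implied rate: {rate:.2f} tons CO2/MWh")
print("plausible" if within_bounds(rate, ASSUMED_MAX_PLANT_RATE) else "out of bounds")
```

A model that repeatedly lands outside the range for a given grid would fail this first, coarsest check.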

The paper describes the tests and then demonstrates each on a variety of MEF models, presenting possible explanations where a model fails a test. All tested models behaved as expected in correlating over time with peak net demand and with the percentage of dispatched coal, but some displayed anomalies when compared against the carbon intensities of the dirtiest active power plants. Only two of the studied MEF models accurately predicted curtailment periods.

The extensible test suite designed in this research provides valuable insight into MEF accuracy, as well as a framework for understanding model behavior.
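
As one illustration of how a test in the suite might be coded, the coarse annual-average check can be sketched as a simple regression: fit the hourly change in emissions against the hourly change in load, then compare the slope with the model’s average MEF. This is a minimal sketch on synthetic data, not the paper’s implementation. The function names, the 0.5 "true" rate, and the 0.52 model output are all made-up assumptions.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def annual_average_gap(d_load, d_emissions, model_mefs):
    """Return (empirical slope, model mean MEF, relative gap between them)."""
    empirical = ols_slope(d_load, d_emissions)
    model_mean = sum(model_mefs) / len(model_mefs)
    return empirical, model_mean, abs(model_mean - empirical) / abs(empirical)


# Synthetic year of hourly changes; every number here is invented.
rng = random.Random(0)
TRUE_MEF = 0.5  # tons CO2 per MWh, assumed for the simulation
d_load = [rng.uniform(-200.0, 200.0) for _ in range(8760)]
d_emissions = [TRUE_MEF * dl + rng.gauss(0.0, 5.0) for dl in d_load]
model_mefs = [0.52] * 8760  # hypothetical model output

empirical, model_mean, gap = annual_average_gap(d_load, d_emissions, model_mefs)
print(f"empirical {empirical:.3f}, model {model_mean:.3f}, gap {gap:.1%}")
```

How large a gap should count as a failure is a judgment call that this sketch leaves open.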

Paper 2: Using nuclear outages as a natural experiment

The gold standard in causal inference is a randomized controlled trial (RCT). In the ideal RCT in marginal emissions, companies would randomly increase or decrease their net power consumption by large amounts repeatedly, so one could compare the resulting change in emissions. Of course, literally doing so would be a big ask for a power grid. A common technique in causal inference in such situations is to look for cases where a “natural experiment” causes what essentially is an unplanned RCT to happen by accident.

VERACI-T’s second study leveraged this technique using nuclear power plant outages. Unplanned nuclear power plant outages change the power grid’s net load by hundreds to thousands of megawatts while they last. And because these outages occur randomly, their effect is essentially the same as causing random large variations in load. This natural experiment allows the marginal emissions factors to be estimated in a manner that isolates the change in emissions due to a change in load, rather than due to unrelated factors like weather or temperature.

This technique was used to directly measure the actual short-run MEFs in six balancing authorities across the continental US. Anyone wondering whether a marginal emissions model is accurate can apply it to the study’s list of nuclear outages to see if that model correctly predicts how real-world total emissions changed before, during, and after these outages.
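
To make the idea concrete, here is a deliberately simplified sketch of an outage-based estimate: compare average hourly emissions before and during an outage, and divide by the nuclear output that was lost. It assumes demand and other conditions are comparable across the two windows, which is the role the randomness of the outage plays in the argument above. The function name, window lengths, and all numbers are hypothetical, and the published study’s actual estimator may differ.

```python
def mean(values):
    return sum(values) / len(values)


def outage_mef(before, during):
    """Difference-in-means MEF estimate around one outage (tons CO2 per MWh).

    Each window is a list of (nuclear_mw, emissions_tons_per_hour) tuples.
    """
    lost_mw = mean([n for n, _ in before]) - mean([n for n, _ in during])
    extra_tons = mean([e for _, e in during]) - mean([e for _, e in before])
    return extra_tons / lost_mw


# Synthetic 48-hour windows; every number here is invented.
BASE_TONS = 4000.0  # grid-wide emissions per hour with the reactor online
TRUE_MEF = 0.6      # assumed response of the rest of the fleet
before = [(1000.0, BASE_TONS + (5.0 if h % 2 else -5.0)) for h in range(48)]
during = [
    (0.0, BASE_TONS + TRUE_MEF * 1000.0 + (5.0 if h % 2 else -5.0))
    for h in range(48)
]

measured = outage_mef(before, during)
model_prediction = 0.45  # hypothetical MEF from the model being tested
relative_error = abs(model_prediction - measured) / measured
print(f"measured {measured:.2f}, model {model_prediction:.2f}, error {relative_error:.0%}")
```

A model whose predictions sit close to the measured value across many outages earns more confidence than one that matches only occasionally.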

The empirically measured MEFs produced by this study — available now for public use — can therefore serve as a benchmark for testing the accuracy of any marginal emissions model.

Paper 3: Validating locational marginal emissions models with wind generation

In Paper 3, VERACI-T researchers looked at the changes in emissions that occurred in response to changes in wind generation in ERCOT (a US balancing authority). Because changes in wind generation occur randomly throughout the day, and independently of changes in demand for electricity, they act as a different type of natural experiment (similar to RCTs and the nuclear outages described above).

While wind variation is smaller than nuclear outage variation, it’s also much more common, allowing researchers to test the robustness of marginal emissions models in a different way. Any model that makes accurate implicit claims, accurately predicts the effect of nuclear outages, and accurately predicts the effects of random fluctuations in wind has therefore demonstrated three different types of accuracy. This method was also used to measure MEFs in small clusters of pricing nodes, a much smaller geographic boundary than had previously been studied using causal methods.

Five different signals were tested for their ability to predict when and where emissions would be the lowest:

a regression-based MEF from WattTime,
a dispatch-based MEF from REsurety,
a heat rate-based MEF (like the one used by the CPUC),
the average emissions rate,
and Locational Marginal Pricing (LMP).

The average emissions rate showed no predictive capability. The other four signals were all able to effectively identify times and regions where marginal emission rates are lower, and can therefore be used to reduce total real-world emissions.
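
One way to picture this kind of test is to split hours by a signal’s value and measure the empirical MEF in each half from the response of emissions to wind changes. The sketch below is an assumption-laden illustration on synthetic data, not the paper’s procedure: the median split, the function names, and the 0.3 and 0.7 "true" rates are all invented, and it assumes wind changes are independent of demand.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def empirical_mef_by_signal(signal, d_wind, d_emissions):
    """Split hours at the median signal; return (low-half MEF, high-half MEF)."""
    order = sorted(range(len(signal)), key=lambda i: signal[i])
    half = len(order) // 2
    mefs = []
    for idx in (order[:half], order[half:]):
        slope = ols_slope([d_wind[i] for i in idx], [d_emissions[i] for i in idx])
        mefs.append(-slope)  # extra wind displaces emissions, so MEF = -slope
    return tuple(mefs)


# Synthetic hours; every number here is invented.
rng = random.Random(1)
signal, d_wind, d_emissions = [], [], []
for _ in range(2000):
    s = rng.random()                     # hypothetical signal value
    true_mef = 0.3 if s < 0.5 else 0.7   # assumed link between signal and MEF
    dw = rng.uniform(-100.0, 100.0)      # change in wind output, MW
    signal.append(s)
    d_wind.append(dw)
    d_emissions.append(-true_mef * dw + rng.gauss(0.0, 5.0))

low_mef, high_mef = empirical_mef_by_signal(signal, d_wind, d_emissions)
print(f"low-signal hours: {low_mef:.2f}, high-signal hours: {high_mef:.2f}")
```

A signal with real predictive value should show a clearly lower empirical MEF in its low half. A signal with none should show roughly the same value in both halves.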

In addition, this study provided the first known empirical validation that a nodal-level emissions algorithm (REsurety’s) can accurately predict changes in emissions at a very local level. This also may be the first time that statistical regression-based models have been directly compared to economic/engineering dispatch-based models.

The study found that the two different marginal model types, while using wildly different methods, made similar predictions about real-world changes in emissions, and that both matched the actual change in emissions — strongly suggesting that the models are responding to actual causal effects rather than model bias.

What’s next? The future of marginal emissions validation

VERACI-T’s work doesn’t stop here. The working group is currently turning its attention to build margin models, which attempt to predict how both operational (e.g. timing of load) and structural (e.g. new renewable energy) changes to electric grids influence long-term grid emissions via structural grid changes (e.g. building new power plants).

The upcoming build margin research will:

Test whether different build margin models, when applied to the past, can accurately predict historical renewable energy buildout that actually occurred.
Identify key emissions model assumptions that impact model accuracy.
Use data on past corporate renewable energy projects to design natural experiments that measure long-term structural change to grids.

Looking further ahead, VERACI-T plans to explore the use of real randomized controlled trials, leverage non-public grid data to expand validation datasets to different regions, and continue to refine standardized frameworks for evaluating marginal emissions datasets.

Join the effort

Marginal emissions data are having a growing impact on climate action, shaping decisions that drive clean energy investments and emissions reductions. But ensuring that these data are accurate is essential for their effectiveness. VERACI-T invites researchers, energy experts, funders, and organizations with access to relevant data to contribute to this effort.

All research findings are freely available to the public at veraci-t.org. For more information or to get involved, contact the program coordinator at: contact@veraci-t.org

WattTime’s Pierre Christian, Nat Steinsultz, and Sam Koebrich also contributed to this article.