What is it about?

In addition to testing whether forecast model A performs better than model B according to an intensity-based measure, such as RMSE, it is useful to test whether model A is better more often than model B. For example, if two football teams, A and B, play each other many times, you might want to know the average score differential across the games, but in the end it is how often A beats B that matters. This paper proposes using the power-divergence family of tests for this purpose and empirically evaluates its accuracy (in the sense of type-I error size) and power in the face of temporal dependence and contemporaneous correlation (i.e., when the errors of A and B are correlated with each other). The power-divergence test is found to be robust to these dependencies, and therefore viable for use in the competing forecast verification domain.
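
To make the idea concrete, here is a minimal sketch in Python of what a "frequency of better" test might look like, assuming paired forecasts from two models on the same cases and using SciPy's power_divergence function; the synthetic data, the use of absolute error, and the handling of ties are illustrative assumptions, not the exact procedure from the paper.

```python
# Illustrative sketch only: count how often model A beats model B on
# absolute error, then test whether the win counts are consistent with
# "neither model is better more often" using the power-divergence
# family of goodness-of-fit statistics.
import numpy as np
from scipy.stats import power_divergence

rng = np.random.default_rng(0)

# Hypothetical verification data: observations and two competing forecasts.
obs = rng.normal(size=200)
fcst_a = obs + rng.normal(scale=0.9, size=200)  # model A (slightly smaller errors)
fcst_b = obs + rng.normal(scale=1.0, size=200)  # model B

err_a = np.abs(fcst_a - obs)
err_b = np.abs(fcst_b - obs)

# Count "wins" for each model, ignoring ties for simplicity.
wins_a = int(np.sum(err_a < err_b))
wins_b = int(np.sum(err_b < err_a))

# Under the null hypothesis the wins split 50/50. lambda_="cressie-read"
# selects the commonly recommended member (lambda = 2/3) of the family.
stat, pval = power_divergence(
    f_obs=[wins_a, wins_b],
    f_exp=[(wins_a + wins_b) / 2] * 2,
    lambda_="cressie-read",
)
print(f"A wins: {wins_a}, B wins: {wins_b}, "
      f"statistic = {stat:.3f}, p-value = {pval:.3f}")
```

A small p-value would suggest that one model really is better more often. Note that this toy example uses independent cases, whereas the paper's empirical evaluation is specifically about how the test behaves when the errors are temporally dependent and correlated between the two models.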

Why is it important?

Often forecast model A is a modification of model B, so the two may not differ by much according to measures like RMSE and therefore may not be found to be statistically significantly different. However, if A is truly better, then perhaps it would win more "games" than B, even if by a close margin. This type of testing has not been common in forecast verification, and this paper attempts to change that by proposing an existing statistical test procedure that does not require resampling. Empirical testing shows the procedure to be robust to the kinds of deviations from standard assumptions that are common in competing forecast verification.
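
For readers curious why no resampling is needed: the power-divergence family has a known large-sample reference distribution. A standard form of the statistic, as given in the general statistics literature (background only, not a formula reproduced from the paper), is

\[
2 n I^{\lambda} \;=\; \frac{2}{\lambda(\lambda + 1)} \sum_{i=1}^{k} O_i \left[ \left( \frac{O_i}{E_i} \right)^{\lambda} - 1 \right],
\]

where the O_i and E_i are observed and expected counts in k categories (for example, "A better" versus "B better"), λ = 1 gives Pearson's chi-square statistic, and λ → 0 gives the likelihood-ratio statistic. Under the null hypothesis the statistic is approximately chi-square distributed with k − 1 degrees of freedom, so p-values come from that reference distribution rather than from bootstrapping or permutation.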

Perspectives

The analogy to football games is apt because it doesn't matter whether your team wins by one point or by 50, so long as it wins. The point differential in a single game may not say much about the relative quality of the two teams, but if one team consistently wins more games than the other, that is another story. The same is true in "competing" forecast verification (testing model A against model B according to some verification data set and summary measure). If team A is the same as team B except for the swap of one player, the final scores may not differ drastically, but perhaps the player change adds just enough to account for better performance. This paper describes a path for addressing this type of "better."

Eric Gilleland
National Center for Atmospheric Research

Read the Original

This page is a summary of: Competing Forecast Verification: Using the Power-Divergence Statistic for Testing the Frequency of “Better”, Weather and Forecasting, June 2023, American Meteorological Society,
DOI: 10.1175/waf-d-22-0201.1.
