Validating Deep-Learning Weather Forecast Models on Recent High-Impact Extreme Events

Olivier C. Pasche, Jonathan Wider, Zhongwei Zhang, Jakob Zscheischler and Sebastian Engelke

Artificial Intelligence for the Earth Systems, 2025

Abstract

The forecast accuracy of machine learning (ML) weather prediction models is improving rapidly, leading many to speak of a “second revolution in weather forecasting”. With numerous methods being developed and limited physical guarantees offered by ML models, there is a critical need for a comprehensive evaluation of these emerging techniques. While this need has been partly fulfilled by benchmark datasets, they provide little information on rare and impactful extreme events or on compound impact metrics, for which model accuracy might degrade due to misrepresented dependencies between variables. To address these issues, we compare ML weather prediction models (GraphCast, PanguWeather, and FourCastNet) and ECMWF’s high-resolution forecast system (HRES) in three case studies: the 2021 Pacific Northwest heatwave, the 2023 South Asian humid heatwave, and the North American winter storm in 2021. We find that ML weather prediction models locally achieve similar accuracy to HRES on the record-shattering Pacific Northwest heatwave but underperform when aggregated over space and time. However, they forecast the compound winter storm substantially better. We also highlight structural differences in how the errors of HRES and the ML models build up to that event. The ML forecasts lack important variables for a detailed assessment of the health risks of the 2023 humid heatwave. Using a possible substitute variable, prediction errors show spatial patterns with the highest danger levels over Bangladesh being underestimated by the ML models. Generally, case-study-driven, impact-centric evaluation can complement existing research, increase public trust, and aid in developing reliable ML weather prediction models.

Significance statement

With the performance of machine-learning-based weather forecasting models improving rapidly, thorough analyses are needed to ensure that their forecasts are accurate and reliable before deploying them in operational settings. Existing evaluations often reduce forecast performance to a few metrics, potentially obscuring rare but systematic errors. This is especially problematic for high-impact extreme events, which, by definition, are rare in the data but often substantially affect society. In a detailed analysis of three extreme events, we observe that, although machine learning (ML) models generally outperform the best physics-based numerical weather prediction (NWP) model on benchmark datasets, they do not consistently do so for the studied extreme events or compound impact metrics and lack some impact-relevant variables.

Published article: https://doi.org/10.1175/AIES-D-24-0033.1 (PDF)
Supplementary materials: Supplementary Material
Data release: https://doi.org/10.5281/zenodo.14358212
Code and reproducibility: https://github.com/jonathanwider/DLWP-eval-extremes (release v1.0)

Preprint (superseded by the published version): https://arxiv.org/abs/2404.17652 (PDF)

Dates

First version: April 2024
Online publication: December 2024
Final issue publication: January 2025

Recommended citation: Pasche, O. C., Wider, J., Zhang, Z., Zscheischler, J. and Engelke, S. (2025). "Validating Deep-Learning Weather Forecast Models on Recent High-Impact Extreme Events." Artificial Intelligence for the Earth Systems 4(1), e240033. https://doi.org/10.1175/AIES-D-24-0033.1