Detailed explanations of the proposed model, as well as baselines, datasets and evaluations, are given here, with references to the Extended Data Figs. and Supplementary Information that add to the results provided in the main text.

### Model details

Table of Contents

We describe NowcastNet with important details of the model architectures, the training methods and the hyperparameter tuning strategies. Ablation study of NowcastNet is available in Supplementary Information section A.

#### Evolution network

The 2D continuity equation modified for precipitation evolution^{31} is

$$\frac{\partial {\bf{x}}}{\partial t}+({\bf{v}}\cdot \nabla ){\bf{x}}={\bf{s}}.$$

(2)

Here **x**, **v** and **s** indicate radar fields of composite reflectivity, motion fields and intensity residual fields, respectively, and ∇ denotes the gradient operator. The tendency term (**v** ⋅ ∇)**x** reveals the mass leaving the system, which is the first-order approximation of the difference before and after the advection operation:

$$\frac{{\bf{x}}({\bf{p}}+\Delta t\cdot \Delta {\bf{v}},t+\Delta t)-{\bf{x}}({\bf{p}},t)}{\Delta t},$$

(3)

with **p** and *t* being the position and time, respectively. The residual field **s** shows the additive evolution mechanisms, such as the growth and decay of precipitation intensities. According to the continuity equation, the temporal evolution of precipitation can be modelled as a composition of advection by motion fields and addition by intensity residuals, which is the evolution operator we design for the evolution network. We use deep neural networks to simultaneously predict all these fields based on past radar observations, which enables nonlinear modelling capability for the complex precipitation evolution.

The evolution network (Fig. 1b) takes as input past radar observations \({{\bf{x}}}_{-{T}_{0}:0}\) and predicts future radar fields \({{\bf{x}}}_{1:T}^{{\prime\prime} }\) at a 20-km scale based on a nonlinear, learnable evolution scheme we propose specifically in this article. The architecture details are described in Extended Data Fig. 1a. The backbone of the evolution network is a two-path U-Net^{30}, which has a shared evolution encoder for learning context representations, a motion decoder for learning motion fields **v**_{1:T} and an intensity decoder for learning intensity residuals **s**_{1:T}. The spectral normalization technique^{32} is applied in every convolution layer. In the skip connections of U-Net, all input and output fields are concatenated on the temporal dimension, that is, the channels in convolutional networks.

The evolution operator (Fig. 1c) is at the core of the evolution network. We use the backward semi-Lagrangian scheme as the advection operator. Because **v**_{1:T} is learnable, we directly set it as the departure offset of the semi-Lagrangian scheme. Also, because **s**_{1:T} is learnable, we directly use it to model the growth or decay of precipitation intensities. We take precipitation rate instead of radar reflectivity as the unit of radar field **x**, as this modification will not influence the physical nature of the evolution process. As applying bilinear interpolation for several steps will blur the precipitation fields, we opt for the nearest interpolation in the backward semi-Lagrangian scheme for computing \({{\bf{x}}}_{t}^{{\prime} }\). Yet, the nearest interpolation is not differentiable at **v**_{1:T}. We resolve this gradient difficulty by using bilinear interpolation (bili) to advect \({\left({{\bf{x}}}_{t}^{{\prime} }\right)}_{{\rm{bili}}}\) from \({{\bf{x}}}_{t-1}^{{\prime\prime} }\), **v**_{1:T}, and use \({\left({{\bf{x}}}_{t}^{{\prime} }\right)}_{{\rm{bili}}}\) to compute the accumulation loss for optimizing the motion fields. Then we use the nearest interpolation to compute \({{\bf{x}}}_{t}^{{\prime} }\) from \({{\bf{x}}}_{t-1}^{{\prime\prime} }\), **v**_{1:T}, and compute the evolved field \({{\bf{x}}}_{t}^{{\prime\prime} }={{\bf{x}}}_{t}^{{\prime} }+{{\bf{s}}}_{t}\). After each round of the evolution operator, we detach the gradient between two consecutive time steps because the overall system is underdetermined. Meanwhile, the successive interpolation operations will make end-to-end optimization unstable, and detaching the gradient (stop gradient in Fig. 1c) will markedly improve the numerical stability^{33}.

The objective function for training the evolution network comprises two parts. The first part is the accumulation loss, which is the sum of the weighted *L*_{1} distances between real observations and predicted fields:

$${J}_{{\rm{accum}}}=\mathop{\sum }\limits_{t=1}^{T}\left({L}_{{\rm{wdis}}}\left({{\bf{x}}}_{t},{\left({{\bf{x}}}_{t}^{{\prime} }\right)}_{{\rm{bili}}}\right)+{L}_{{\rm{wdis}}}\left({{\bf{x}}}_{t},{{\bf{x}}}_{t}^{{\prime\prime} }\right)\right).$$

(4)

In particular, the weighted distance has the following form:

$${L}_{{\rm{wdis}}}\left({{\bf{x}}}_{t},{{\bf{x}}}_{t}^{{\prime} }\right)={\left\Vert \left({{\bf{x}}}_{t}-{{\bf{x}}}_{t}^{{\prime} }\right)\odot {\bf{w}}\left({{\bf{x}}}_{t}\right)\right\Vert }_{1},$$

(5)

in which the pixel-wise weight *w*(*x*) = min(24, 1 + *x*) is taken from DGMR^{4}. Because the rain rate approximately follows a log-normal distribution^{17}, it is necessary to add weight to balance different rainfall levels. Otherwise, neural networks will only fit light-to-medium precipitation taking dominant ratio in the data and heavy precipitation will not be accounted for sufficiently. We follow DGMR^{4} and use a weight proportional to the rain rate and clip it at 24 for robustness to spuriously large values in radar observations.

The second part is the motion-regularization term in the form of gradient norm, which is motivated in part by the continuity equation and in part by the fact that large precipitation patterns tend to be longer lived than small ones^{8}:

$${J}_{{\rm{motion}}}=\mathop{\sum }\limits_{t=1}^{T}\left({\parallel \nabla {{\bf{v}}}_{t}^{1}\odot \sqrt{{\bf{w}}({{\bf{x}}}_{t})}\parallel }_{2}^{2}+{\parallel \nabla {{\bf{v}}}_{t}^{2}\odot \sqrt{{\bf{w}}({{\bf{x}}}_{t})}\parallel }_{2}^{2}\right),$$

(6)

in which \({{\bf{v}}}_{t}^{1}\) and \({{\bf{v}}}_{t}^{2}\) are the two components of the motion fields. The gradient of the motion fields ∇**v** is computed approximately with the Sobel filter^{24}:

$${\partial }_{1}{\bf{v}}\approx \left(\begin{array}{ccc}1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1\end{array}\right)\,\ast \,{\bf{v}},\qquad {\partial }_{2}{\bf{v}}\approx \left(\begin{array}{ccc}\,1 & \,2 & \,1\\ \,0 & \,0 & \,0\\ -1 & -2 & -1\end{array}\right)\,\ast \,{\bf{v}},$$

(7)

in which ⁎ denotes the 2D convolution operator in the spatial dimension.

Overall, the objective for training the evolution network (Fig. 1b) is

$${J}_{{\rm{evolution}}}={J}_{{\rm{accum}}}+\lambda {J}_{{\rm{motion}}}\,.$$

(8)

During training, we sample the radar fields with 256 × 256 spatial size as the input. On both the USA and China datasets, we fix input length *T*_{0} = 9 and set output length *T* = 20 for training and take the first 18 predicted fields for evaluation. Note that increasing *T*_{0} does not provide substantial improvements and *T*_{0} ≥ 4 is sufficient. The tradeoff hyperparameter *λ* is set as 1 × 10^{−2}. We use the Adam optimizer^{34} with a batch size of 16 and an initial learning rate of 1 × 10^{−3}, and train the evolution network for 3 × 10^{5} iterations, during which we decay the learning rate to 1 × 10^{−4} at the 2 × 10^{5}th iteration.

#### Generative network

Conditioning on the evolution network predictions \({{\bf{x}}}_{1:T}^{{\prime\prime} }\), the generative network takes as input the past radar observations \({{\bf{x}}}_{-{T}_{0}:0}\) and generates from latent random vectors **z** for the final predicted precipitation fields \({\hat{{\bf{x}}}}_{1:T}\) at a 1–2-km scale. The backbone of the generative network is a U-Net encoder–decoder structure, with architecture details shown in Extended Data Fig. 1b. The nowcast encoder has the identical structure as the evolution encoder (Extended Data Fig. 1a), which takes as input the concatenation of \({{\bf{x}}}_{-{T}_{0}:0}\) and \({{\bf{x}}}_{1:T}^{{\prime\prime} }\). The nowcast decoder is a different convolutional network, which takes as input the contextual representations from the nowcast encoder, along with the transformation of the latent Gaussian vector **z**. The designs of D Block, S Block and Spatial Norm heavily used in the generative network are elaborated in Extended Data Fig. 1e.

The noise projector transforms the latent Gaussian vector **z** to the same spatial size as the contextual representations from the nowcast encoder, as elaborated in Extended Data Fig. 1d. For each forward pass, each element of **z** is independently sampled from the standard Gaussian \({\mathcal{N}}(0,1)\). Then **z** is transformed by the noise projector into a tensor with one-eighth the height and width of input radar observations.

The physics-conditioning mechanism to fuse the generative network and the evolution network is implemented by applying the spatially adaptive normalization^{20} to each convolutional layer of the nowcast decoder (Extended Data Fig. 1b,e). First, each channel of the nowcast decoder is normalized by a parameter-free instance-normalization module^{35}. Then the evolution network predictions \({{\bf{x}}}_{1:T}^{{\prime\prime} }\) are resized to a compatible spatial size and then concatenated to the nowcast decoder at the corresponding layer through average pooling. Finally, a two-layer convolutional network transforms the resized predictions into new mean and variance for each channel of the nowcast decoder, ensuring not to distort the spatial-coherent features from the evolution network predictions \({{\bf{x}}}_{1:T}^{{\prime\prime} }\). Through the physics-conditioning mechanism, the generative network is adaptively informed by the physical knowledge learned with the evolution network, while resolving the inherent conflict between physical-evolution and statistical-learning regimes.

Conditioning on the evolution network predictions at a 20-km scale, the generative network is needed to further generate convective details at a 1–2-km scale through training on a temporal discriminator *D* (Extended Data Fig. 1c). The temporal discriminator takes as input real radar observations \({{\bf{x}}}_{1:T}\) and final predicted fields \({\hat{{\bf{x}}}}_{1:T}\) and outputs scores of how likely they are being real or fake. At its first layer, the inputs are processed by 3D convolution layers with several kernel sizes at the temporal dimension from 4 to the full horizon. Then the multiscale features are concatenated and feedforwarded to subsequent convolutional layers with spectral normalization^{32} applied in each layer. The objective for training the temporal discriminator is

$${J}_{{\rm{d}}{\rm{i}}{\rm{s}}{\rm{c}}}={L}_{{\rm{c}}{\rm{e}}}(D({{\bf{x}}}_{1:T}),1)+{L}_{{\rm{c}}{\rm{e}}}(D({\hat{{\bf{x}}}}_{1:T}),0),$$

(9)

with *L*_{ce} being the cross-entropy loss. Within a two-player minimax game, the nowcast decoder of the generative network is trained to confuse the temporal discriminator by minimizing the adversarial loss modified by^{21}

$${J}_{{\rm{a}}{\rm{d}}{\rm{v}}}={L}_{{\rm{c}}{\rm{e}}}(D({\hat{{\bf{x}}}}_{1:T}),1).$$

(10)

The gradients backpropagate through \({\hat{{\bf{x}}}}_{1:T}\), first to the nowcast decoder and then to the nowcast encoder of the generative network, leading it to predict realistic multiscale fields with convective-scale details.

We take the idea of generative ensemble forecasting from DGMR^{4} and predict a group of precipitation fields \({\hat{{\bf{x}}}}_{1:T}^{{{\bf{z}}}_{i}}\) from several latent inputs **z**_{1:k}, with *k* being the number of ensemble members. Then we aggregate the *k* predictions \({\hat{{\bf{x}}}}_{1:T}^{{{\bf{z}}}_{i}}\) and real fields **x**_{1:T} respectively by a max-pooling layer *Q* in the spatial dimension, with kernel size and stride set as 5 and 2, correspondingly. On the basis of ensemble forecasts, the pool regularization is defined as the weighted distance between spatial-pooled observations and the mean of *k* spatial-pooled predictions

$${J}_{{\rm{p}}{\rm{o}}{\rm{o}}{\rm{l}}}={L}_{{\rm{w}}{\rm{d}}{\rm{i}}{\rm{s}}}\left(Q({{\bf{x}}}_{1:T}),\frac{1}{k}\mathop{\sum }\limits_{i=1}^{k}Q({\hat{{\bf{x}}}}_{1:T}^{{{\bf{z}}}_{i}})\right).$$

(11)

Overall, the objective for training the generative network (Fig. 1a) is

$${J}_{{\rm{generative}}}=\beta {J}_{{\rm{adv}}}+\gamma {J}_{{\rm{pool}}}\,.$$

(12)

We set the number of ensemble members as *k* = 4, adversarial loss weight *β* = 6 and pool-regularization weight *γ* = 20. Similar to the evolution network, we set input length *T*_{0} = 9 and output length *T* = 20. We use the Adam optimizer^{34} with a batch size of 16 and an initial learning rate of 3 × 10^{−5} for the nowcast encoder, the nowcast decoder and the temporal discriminator and train the generative network for 5 × 10^{5} iterations.

#### Transfer learning

NowcastNet is a foundational model for skilful precipitation nowcasting. A large-scale dataset will help NowcastNet be more apt at learning physical evolution and chaotic dynamics of the precipitation processes. Therefore, in countries or regions with intricate atmosphere processes but without sufficient radar observations, we use the transfer learning strategy^{27}, a de facto way to reusing knowledge from pre-trained foundational models. Given a pre-trained NowcastNet model, we use the objectives *J*_{evolution} and *J*_{generative} to fine-tune its evolution network and generative network through decoupled backpropagation, which detaches the gradients between *J*_{evolution} and *J*_{generative}. As the physical knowledge behind the precipitation is universal and transferable across the world, we decrease the learning rate of the evolution network as one-tenth that for the generative network to avoid forgetting^{36} of physical knowledge. We pre-train a NowcastNet model on a large-scale dataset and fine-tune it to a small-scale dataset with the Adam optimizer^{34}, but only for 2 × 10^{5} iterations.

#### Hyperparameter tuning

We use the mean of CSI neighbourhood (CSIN) over all prediction time steps at the rain levels of 16 mm h^{−1}, 32 mm h^{−1} and 64 mm h^{−1} when tuning the hyperparameters of the evolution network. We compute the criterion for hyperparameter tuning as the average of the quantities, \(\frac{{{\rm{CSIN}}}_{16}+{{\rm{CSIN}}}_{32}+{{\rm{CSIN}}}_{64}}{3}\). When tuning the hyperparameters of the generative network, we use the two main evaluation metrics, CSI neighbourhood and PSD. For each model with different hyperparameters, we first ensure that the PSD of the model is no worse than that of pySTEPS. Then we use the average CSI neighbourhood criterion \(\frac{{{\rm{CSIN}}}_{16}+{{\rm{CSIN}}}_{32}+{{\rm{CSIN}}}_{64}}{3}\) to determine the final hyperparameters.

### Baselines

We describe the four baselines used in the comparative study. There is a rich literature of relevant work and we discuss them as further background in Supplementary Information section E.

#### DGMR

DGMR is a state-of-the-art method for precipitation nowcasting, recognized by expert meteorologists. We genuinely reproduce it taking exactly the same architecture and training settings described in ref. ^{4} and the released model files available at https://github.com/deepmind/deepmind-research/tree/master/nowcasting, with the quantitative and qualitative results to match those reported in the original paper. We set the number *k* of ensemble members as 4 during training, which is the same as NowcastNet.

#### PredRNN-V2

We consider PredRNN-V2 (ref. ^{13}), the latest version of PredRNN^{37} with a four-layer convolutional-recurrent network, deployed at the China Meteorological Administration for operational nowcasting. We cut radar fields into 4 × 4 patches and unfold the patches as the channel dimension, which efficiently balances the computation cost and forecasting skill. Reverse scheduled sampling with an exponential increasing strategy is applied in the first 5 × 10^{4} iterations.

#### U-Net

We use the improved version proposed by Ravuri et al.^{4}, which adds a residual structure in each block of the vanilla U-Net^{30}, along with a loss weighted by precipitation intensity, and predicts all fields in a single forward pass.

#### pySTEPS

We use the pySTEPS implementation from ref. ^{9}, following the default settings available at https://github.com/pySTEPS/pysteps.

All deep-learning models, including NowcastNet, DGMR, PredRNN-V2 and U-Net, are trained on the USA dataset (years 2016–2020) by the Adam optimizer with a batch size of 16 for 5 × 10^{5} iterations and transferred to the China dataset by fine-tuning for 2 × 10^{5} iterations. For all models under evaluation, we establish a fair comparison by using the same weighting scheme *w*(*x*) in the weighted distance *L*_{wdis} and the same sampling strategy of training data. Both the weighting scheme and the sampling strategy are taken from DGMR^{4}.

### Datasets

Two large-scale, high-resolution datasets of composite radar observations from the USA and China are used throughout the experiments. The evaluation metrics are described in Supplementary Information section B. More case studies of representative precipitation events and quantitative results of overall performance are available in Extended Data Figs. 2–8 and Supplementary Information sections C and D.

#### USA dataset

The USA dataset consists of radar observations from the MRMS system^{26,38}, collected over the USA. The radar composites cover the area from 20 °N to 55 °N in the south–north direction and 130 °W to 60 °W in the east–west direction. The spatial grid of the composites is 3,500 × 7,000, with a resolution of 0.01° per grid. The missing values on the composites are assigned negative values, which can mask unconcerned positions during evaluation. We use radar observations collected for a 6-year time range from 2016 to 2021, in which the training set covers years 2016–2020 and the test set covers the year 2021. We follow the strategy used in ref. ^{4} such that the radar observations from the first day of each month in the training set are included in the validation set. To trade off computational cost and forecasting skill, we set the temporal resolution as 10 min and downscale the spatial size of radar fields to half of the original width and height, which will keep the most of the convective-scale details. We cap the rain rates at the value of 128 mm h^{−1}.

#### China dataset

The China dataset includes radar observations collected over China by the China Meteorological Administration. The radar composites cover the area from 17° N to 53° N in the south–north direction and 96° E to 132° E in the east–west direction, with a coverage of the middle and east of China. The spatial grid of the composites is 3,584 × 3,584, with a resolution of 0.01° per grid. Similar to the USA dataset, the missing values are replaced by negative values. We use radar observations collected for a nearly 2-year time range from 1 September 2019 to 30 June 2021. Data from 1 September 2019 to 31 March 2021 are taken as the training set, whereas those from 1 April 2021 to 30 June 2021 are taken as the test set. We follow the strategy used in ref. ^{4} such that the radar observations from the first day of each month in the training set are included in the validation set. Notably, the test period covers the flood season when extreme precipitation and rainstorms are frequent in China. We set the temporal resolution, spatial size and rain-rate threshold exactly the same as the USA dataset.

#### Data preparation

We construct the training set and test set for each dataset using an importance-sampling strategy^{4} to increase the ratio of radar series with heavy precipitation. We first crop the full-frame series into smaller spatiotemporal size. For the training set, we cut the series into crops of spatial size 256 × 256 and temporal size 270 min with offsets of 32 in the vertical and horizontal directions. For the test set, we cut the series into crops of spatial size 512 × 512 and temporal size 270 min with offsets of 32 in the vertical and horizontal directions. Then we give each crop an acceptance probability,

$$\Pr ({{\bf{x}}}_{-{T}_{0}:T})=\mathop{\sum }\limits_{t=-{T}_{0}}^{T}{\left\Vert {\bf{g}}({{\bf{x}}}_{t})\right\Vert }_{1}+{\epsilon },$$

(13)

which is the sum of radar fields for all grids and all time steps on this crop, and *ϵ* is a small constant. As done in DGMR^{4}, for the training set, we set g(*x*) = 1 − e^{−x} on each grid with a valid value and g(*x*) = 0 on each grid with a missing value. We use hierarchical sampling during training, by first sampling the full-frame series and then sampling the crop series. To evaluate the forecasting skill of different models on extreme-precipitation events, we define g(*x*) = *x* for the test set. The test set is sampled in advance and kept unchanged throughout evaluation. As our goal is skilful nowcasting of extreme precipitation, this importance-sampling strategy is biased towards weather events with a larger proportion of heavy precipitation.

We also use the uniform-sampling protocol such that all light-to-heavy precipitation can be equally evaluated. In this protocol, the crops in the test set are sampled uniformly from all spatial and temporal ranges. Because the uniformly sampled series usually have scarce precipitation, we enlarge the dataset size to 288,000 for the USA case and 120,000 for the China case, three times larger than the importance-sampled test datasets. The quantitative results under this protocol are available in Supplementary Figs. 10 and 11.

### Evaluation

We perform a meteorologist evaluation as a cognitive assessment task and a quantitative evaluation using operational verification measures.

#### Meteorologist evaluation

To construct the test subsets representative of extreme-precipitation events for expert meteorologist evaluation, we first sample a new test set that contains the crops with spatial size of 512 × 512 using the same strategy detailed in the previous section. After this test set is sampled, we rank the crops by the sum of rain rate on all grids with rate higher than a threshold of 20 mm h^{−1}. This is the threshold of heavy rainfall used in operational practice by the China Meteorological Administration. We take the top 1,200 events as the subset for expert meteorologist evaluation. Because the test events are fewer, we change the strategy to ranking all events by the proportion of grids with a rate higher than 20 mm h^{−1}, which include extreme precipitation with very high probability, while ensuring the temporal diversity. On all crops in this test subset, all models take as input the fields of spatial size 512 × 512, and the central 384 × 384 area of the predicted fields are zoomed in to highlight the convective details.

To enable a professional, transparent and fair meteorologist evaluation, the China Meteorological Administration issued a public announcement to all provincial meteorological observatories, inviting senior meteorologists to participate in the evaluation as volunteers. The announcement states the content, goal and how-to of the expert evaluation, and specifically clarifies that the evaluation results will only be used anonymously for the scientific research but not for the skill test of meteorologists or other purposes. Operationally, we build an anonymous website for the meteorologist evaluation. Each expert logs in to the website using an automatically generated user account with password protection to perform the evaluation anonymously, without being informed of any model information. In the posterior evaluation, we show real radar observations in the past and future horizons and the model predictions anonymously in random order for each event, whereas in the prior evaluation, we only show the real radar observations in the past. Meteorologists can play the video, navigate the progress bar to deliberately observe cloud evolution or arbitrarily stop the video at a certain time step for a meticulous comparison of the forecasting skill and value of all models.

#### Quantitative evaluation

Evaluation with commonly used quantitative metrics involves comparing the difference between ground truths and model predictions on the crops in the test set. Each model outputs 18 future frames of precipitation fields given nine past frames of radar observations, whereas pySTEPS is given four past frames. Similar to the evaluation protocol of DGMR^{4}, the input spatial size is set as 512 × 512 for computing the PSD metric and as 256 × 256 for computing the other metrics. We apply the central-cropping technique, which crops 64 × 64 grid cubes from the central area of the 18 predicted frames, along with the corresponding ground truths. The PSD metric is directly computed on the 512 × 512 precipitation fields, whereas the other metrics are computed between the predicted and ground-truth cubes. The central cropping can eliminate the boundary influence and reduce the computation cost^{4}. For methods with ensemble-forecasting ability, including NowcastNet, DGMR and pySTEPS, we set the number *k* of ensemble members as 4 for computing specific quantitative measures.