### Construction of the training database

Table of Contents

We train PEGSNet on a database of synthetic data augmented with empirical noise. For each example in the database, three-component waveforms at each station location are obtained as follows.

#### Source parameters

The event source location is randomly picked from 1,400 possible locations extracted from the Slab2.0 model of subduction geometry^{38} at two different depths (20 and 30 km). Given latitude, longitude and depth, strike and dip angles are determined by the subduction geometry and rake is randomly extracted from a normal distribution with mean = 90° and standard deviation of 10°. The event final *M*_{w} is drawn from a uniform distribution with min = 5.5 and max = 10.0. We deliberately choose not to use a Gutenberg–Richter distribution for *M*_{w} to avoid a sampling bias during training, by which the model might better estimate certain magnitude values simply because they are more represented in the training database. Finally, from *M*_{w} we compute the scalar moment ({M}_{0}={10}^{1.5{M}_{{rm{w}}}+9.1}).

#### Source time function

Given *M*_{0}, a pseudo-empirical STF is computed using the STF model described in a previous work^{5}, which includes a multiplicative error term and is valid for earthquakes with *M*_{w} > 7.0. In summary,

$${rm{STF}}(t)={M}_{0}frac{f(t)}{int f(t){rm{d}}t},$$

(1)

with:

$$fleft(tright)=t{rm{exp }}left{-0.5{left(lambda tright)}^{2}right}left[1+Nleft(tright)right],$$

(2a)

$$lambda =1{0}^{left(7.24-0.41{rm{log }}left({M}_{0}right)+varepsilon right)},$$

(2b)

$$Nleft(tright)=0.38frac{nleft(tright)}{sigma },$$

(2c)

where *ε* is drawn from a Gaussian distribution with zero mean and standard deviation of 0.15, *n*(*t*) is the time integral of a Gaussian noise time series with zero mean and *σ* is the standard deviation of *n*(*t*). The term *ε* accounts for variability in the STF duration for a given *M*_{0}, whereas *N*(*t*) models the characteristics of noise observed in real STFs^{5}. Examples of final STFs for different magnitude values are shown in Extended Data Fig. 1.

#### Computing synthetic waveforms

With the selected source parameters and STFs, we use the normal-mode approach described in a previous work^{14} to compute three-component synthetic waveforms in a spatial domain of about 20° around the source epicentre. The resulting seismometer responses are convolved with the STF of the corresponding synthetic event and multiplied by the scalar moment to obtain synthetic traces of acceleration sampled at 1 Hz at each station location. Finally, traces are bandpass-filtered between 2.0 mHz (Butterworth, two poles, causal) and 30.0 mHz (Butterworth, six poles, causal). The final seismograms are 700 s long centred at the event origin time.

#### Noise database

The noise database consists of 259 days of three-component waveforms for two non-continuous time intervals: between January 2011 and October 2011 (excluding March 2011) and between January 2014 and April 2014. These intervals have been chosen to sample variable (seasonal) noise conditions. We note that the temporal range spanned by the noise database does not overlap with any of the earthquakes used for real data cases (Extended Data Fig. 10a). We first divide the daily recordings into 1-h-long traces and then apply limited preprocessing, removing the instrument response, the mean and the linear trend, converting to acceleration and decimating the original traces from 20 to 1 Hz. Finally, each trace is filtered using the same bandpass filter applied to the synthetic seismograms (see previous step) and stored. Note that no a priori assumptions on levels and characteristics of the selected noise are made. On the contrary, we include all real noise conditions found in continuous seismic recordings in the specified period range. This is because, in principle, we want the model to be able to generalize well under a broad range of noise conditions.

#### Adding empirical noise to synthetics

From the noise database described in the previous step, a realization of noise (700 s long) is extracted by randomly selecting a starting time point. In this process, we make sure to use different noise data in training, validation and test sets. To preserve spatial coherence of noise across the seismic network, the same time interval is used for all stations for a given event. The selected noise traces are then added to the corresponding acceleration seismograms to produce the final input data for PEGSNet. If noise data are not available for one or more stations in the selected time window for a given event, we discard those stations by setting the corresponding final trace amplitudes (noise and PEGS) to zero in the input data.

#### Preprocessing of input data

Before being fed to PEGSNet, we first sort the input waveform for each example based on station longitude. We found this approach to be effective, but we note that the problem of concatenating station waveforms in a meaningful way in deep learning is an active area of research^{35}. Then, on the basis of the theoretical P-wave arrival time (*T*_{P}) at each station for a given event, we set the amplitude of the seismograms to zero for *t* ≥ *T*_{P}. Note that PEGSNet does not perform P-wave triggering itself. Instead, it relies on theoretical P-wave arrivals. In a real-time scenario, any existing P-wave triggering algorithm (whether based on machine learning or not) can be used to set the switch on the corresponding stations whose data can then be passed to PEGSNet.

To limit the influence of very noisy traces and to suppress high amplitudes (possibly related to the background regional seismicity), we further clipped the resulting trace using a threshold of ±10 nm s^{−2}. This threshold is chosen according to the maximum PEGS amplitude for an *M*_{w} = 10 earthquake at 315 s as found in the calculated database of noise-free synthetics. Amplitudes are finally scaled by 10 nm s^{−2} to facilitate convergence of the optimizer, and at the same time, to preserve information about the relative amplitudes of the PEGS radiation pattern across the seismic network. Finally, to simulate missing data and/or problematic sensors, we randomly mute 5% of the stations for each event by setting to zero the amplitudes of the corresponding traces.

### Deep learning and PEGSNet

#### Previous work

Convolutional neural networks (CNNs) originated in neocognitron^{42} and became practical once it was found that the backpropagation procedure^{43} can be used to compute the gradient of an objective function with respect to the weights of the network. CNNs are a regularized form of neural networks, that is, the function space they represent is simpler and they are more sample-efficient than fully connected neural networks^{44}. Deep CNNs have brought about a revolution in computer vision, and have had a role in almost every state-of-the-art approach for tasks related to recognition and detection in images^{45,46}. In geoscience, machine learning has shown strong potential for data-driven discovery of previously unknown signals and physical processes hidden in large volumes of noisy data^{47,48}.

We note, however, that our choice of a deep learning model over classical machine learning models offers an appealing framework to directly deal with raw seismograms. As a consequence, this choice enables us to explore a larger function space that is not limited by building an a priori set of features, which is a requirement for applying classical machine learning models on seismogram data.

Successful applications of deep learning in seismology have provided new tools for pushing the detection limit of small seismic signals^{31,32} and for the characterization of earthquake source parameters (magnitude and location)^{33,34,35} with EEWS applications^{29,36,37}. We present a deep learning model, PEGSNet, trained to estimate earthquake location and track the time-dependent magnitude, *M*_{w}(*t*), from PEGS data, before P-wave arrivals.

#### Description of PEGSNet architecture

PEGSNet is a deep CNN that combines convolutional layers and fully connected layers in sequence (Extended Data Fig. 2a). The input of the network is a multi-channel image of size (*M*, *N*, *c*) where *M* is 315 (corresponding to 315-s-long traces sampled at 1 Hz), *N* is the number of stations (74) and *c* is the number of seismogram components used (three: east, north and vertical). The outputs of the network are three values corresponding to moment magnitude (*M*_{w}), latitude (*φ*) and longitude (*λ*), where *M*_{w} is time dependent. The training strategy used to learn *M*_{w}(*t*) from the data is described below.

The first part of the model (the CNN) consists of eight convolutional blocks. Each block is made of one convolutional layer with a rectified linear unit (ReLU) activation function followed by a dropout layer. The number of filters in each convolutional layer increases from 32 (blocks 1–5) to 64 (blocks 6–7) to 128 (block 8) to progressively extract more detailed features of the input data. A fixed kernel size of 3 × 3 is used in each convolutional layer. We use spatial dropout with a fixed rate of 4% to reduce overfitting of the training set. Maximum pooling layers are added starting from block 4 to reduce the overall dimension of the input features by a factor of 4. The output of the CNN is then flattened and fed to a sequence of two dense layers of size 512, and 256 with a ReLU activation function and standard dropout with a 4% rate. Fully connected layers perform the high-level reasoning and map the learned features to the desired outputs. The output layer consists of three neurons that perform regression through a hyperbolic tangent activation function (tanh). The labelling strategy for *M*_{w}(*t*), *φ* and *λ* is discussed in detail below. The total number of parameters in the network is 1,479,427.

#### Learning strategy

The purpose of the model is to track the moment released by a given earthquake as it evolves from the origin time. A specific learning strategy has been developed to address this task (Extended Data Fig. 2).

*Labelling*. Labels are *φ*, *λ* and a time-dependent *M*_{w}. *φ*, *λ* simply correspond to the true values for each event. *M*_{w}(*t*) is the time integration of the STF for each event. As detailed in the next section, the model is trained by randomly perturbing the ending time of the input seismograms, so that for a given ending time the input data are associated with the value of *M*_{w}(*t*) at that time. To enforce the role of the tanh activation function in the output layer, we further scale all the labels to fall in the [−1, 1] interval through min/max normalization.

*Learning the time-dependent moment release*. In order for PEGSNet to learn *M*_{w}(*t*), we randomly perturb the starting time of the input data during training (Extended Data Fig. 2c). Every time that an example is extracted from the dataset, a value (*T*_{1}) is drawn at random from a uniform distribution between −315 and 0 (s). *T*_{1} is the time relative to the earthquake origin time (*T*_{0}) corresponding to the starting time of the selected seismograms for that example. In practice, from the 700-s-long seismograms (centred on *T*_{0}) in the database, we extract traces from *T*_{1} to *T*_{2} = *T*_{1} + 315 s: for *T*_{1} = −315 s the extracted traces end at *T*_{0}; for *T*_{1} = 0 s the traces start at *T*_{0} and end 315 s after. Once a value for *T*_{1} is selected, the value of *M*_{w}(*T*_{2}) is assigned as the corresponding label for this example (Extended Data Fig. 2d). This enables the model to learn patterns in the data as the STF evolves with time.

*Training*. The full database (500,000 examples of synthetic earthquakes) is split into training (350,000) validation (100,000) and test (50,000) sets, following a 70/20/10 strategy. The network is trained for 200 epochs (using batches of size 512) on the training set by minimizing the Huber loss between the true values and the predicted earthquake source parameters using the Adam algorithm^{49}, with its default parameters (*β*_{1} = 0.9 and *β*_{2} = 0.999) and a learning rate of 0.001. At the end of each epoch, the model is tested on the validation set to assess the learning performance and avoid overfitting (Extended Data Fig. 2b). After learning, the model that achieved the best performance (lowest loss value) on the validation set is selected as the final model. The final model is then tested against the test set (therefore with data that has never been seen by PEGSNet during training) to assess its final performance.

#### Testing strategy

Once PEGSNet is trained, it can be used to estimate *M*_{w}(*t*) in a real-time scenario. We assess the latency performance of PEGSNet on the test set with the following procedure (Extended Data Fig. 4). For each example in the test set, we slide a 315-s-long window [*T*_{1}, *T*_{2} = *T*_{1} + 315 s] through the data with a time step of 1 s. The starting window ends at the earthquake origin time *T*_{0} (*T*_{2} = *T*_{0} and *T*_{1} = *T*_{0} − 315 s) and the final window starts at the earthquake origin time (*T*_{2} = *T*_{0} + 315 s and *T*_{1} = *T*_{0}). We let PEGSNet predict *M*_{w}(*T*_{2}) at each time step, thus progressively reconstructing the STF. Each *M*_{w}(*t*) estimate made by PEGSNet only makes use of information in the past 315 s. The same procedure is also applied to real data (Fig. 3 and Extended Data Fig. 10), to simulate a playback of the data as if they were fed to PEGSNet in real-time.

#### Test on noise-free synthetics

We investigate the performance of PEGSNet by using the same database described above but without including noise in the input data. Training and testing on noise-free synthetic data provides an upper limit of PEGSNet performance. Although this experiment represents a virtually impossible scenario for real-world applications, the results can reveal inherent limitations of our model or in the input data. Extended Data Fig. 6a shows the accuracy map for the test set. As expected, the model is able to determine the final *M*_{w} of the events with high accuracy and similar performance regardless of the actual *M*_{w} of the event, except at early times. To look at the latency performance more in detail, Extended Data Fig. 6b shows the density plot of the residuals as a function of time for the whole noise-free test set. Errors are mostly confined within ±0.1 but are relatively higher in the first 10–15 s after origin. We relate this observation to a combination of two factors: first, in the first 15 s after origin, very small PEGS amplitudes are expected at very few stations in the seismic network, partially owing to the cancelling effect between the direct and induced terms. This can lead to a situation in which little information is present in the input images and the model ends up predicting the mean value of the labels at these early times. Second, the seismic network geometry may not be optimal for recording PEGS amplitudes in this time window. Finally, we note that similar behaviour is observed for the results obtained on the noisy database (Fig. 2a, b) but with a higher latency (30–40 s). This highlights the role of noise in degrading the optimal performance of PEGSNet.

### Preprocessing of real data

The preprocessing steps for real data closely follow the procedure detailed in previous work^{7}. For each station and each component:

1. Select 1-h-long raw seismograms ending at the theoretical *T*_{P} calculated using the source location from the United States Geological Survey (USGS) catalogue (Extended Data Fig. 10a);

2. Remove the mean;

3. Remove the instrument response and obtain acceleration signals;

4. Lowpass 30.0 mHz (Butterworth, six poles, causal);

5. Decimate to 1 Hz;

6. Highpass 2.0 mHz (Butterworth, two poles, causal);

7. Clip to ±10 nm s^{−2} and scale by the same value;

8. Pad with zeros for *t* ≥ *T*_{P} and select 700-s-long trace centred on *T*_{0}.

This procedure is the same as that used to generate the synthetic database, except that here, traces need to be cut at the P-wave arrival first to avoid contamination of PEGS by the P-wave during instrument response removal.

To speed up our testing procedure (see Methods subsection ‘Testing strategy’), the data are preprocessed once and then sliced into input for PEGSNet at each time step. In an online version of our model, this is unfeasible as all the preprocessing steps need to be applied each time that new packets of data are streamed in. We simulate the conditions of a real-time workflow on the Tohoku-Oki data to assess potential discrepancies with the simplified workflow in the results: at each time step, we apply the preprocessing steps described above, using 1-h-long trace ending at the current time step. We find that the resulting PEGSNet predictions obtained using the two workflows are essentially indistinguishable from each other (Extended Data Fig. 11).

### Predictions on additional real data

We tested PEGSNet on all the subduction earthquakes (dip-slip mechanism within 40 km from the megathrust) with *M*_{w} ≥ 7 that have occurred since January 2003, without considering aftershocks (Extended Data Fig. 10). Among them, the 2003 *M*_{w} = 8.2 Hokkaido earthquake is at the edge of PEGSNet’s lower sensitivity limit of 8.3. For this event, PEGSNet estimates the final *M*_{w} after about two minutes (Extended Data Fig. 10b), in agreement with what was previously observed on the test set for events with similar magnitude (Fig. 2a). However, given the expected lower accuracy and higher errors for this event, we consider these predictions less reliable. For lower-magnitude events, PEGSNet predictions converge toward the noise baseline of 6.5 or never exceed its lower sensitivity limit, confirming that PEGS from *M*_{w} < 8.0 earthquakes are essentially indistinguishable from noise (Extended Data Fig. 10c–f). Deep learning denoising techniques for coherent noise removal^{50} might prove successful in enhancing PEGSNet performance and will be the subject of future work.