### Mice

Twelve male wild-type mice (C57BL/6, Charles River) were used for this study, three of which provided enough simultaneously recorded HD cells for continued experimentation. Mice were housed individually at 22 °C under a 12 h–12 h light–dark cycle and 40% humidity with food and water ad libitum. All experiments were approved by the McGill University and Douglas Hospital Research Centre Animal Use and Care Committee (protocol 2015-7725) and performed in accordance with Canadian Institutes of Health Research guidelines.

### Surgeries

During all surgeries, mice were anaesthetized by inhalation of a combination of oxygen and 5% isoflurane before being transferred to the stereotaxic frame (David Kopf Instruments), where anaesthesia was maintained by inhalation of oxygen and 0.5–2.5% isoflurane for the duration of the surgery. Body temperature was maintained with a heating pad and eyes were hydrated with gel (Optixcare). Carprofen (10 ml kg^{−1}) and saline (0.5 ml) were administered subcutaneously at the beginning and end of each surgery, respectively. Preparation for recordings involved three surgeries per mouse. First, at the age of 7–8 weeks, each mouse was injected with 600 nl of the undiluted viral vector AAV9.syn.GCaMP6f.WPRE.eYFP, obtained from the University of Pennsylvania Vector Core. All injections were administered through glass pipettes connected to the Nanoject II (Drummond Scientific) injector at a flow rate of 23 nl s^{−1}. One week after injection, a 0.5-mm-diameter gradient refractive index (GRIN) relay lens (Go!Foton) was implanted above the ADN (AP, 1.8; ML, 0.8; DV, −3). No aspiration was required. In addition to the GRIN lens, three stainless steel screws were threaded into the skull to stabilize the implant. Dental cement (C&B Metabond) was applied to secure the GRIN lens and anchor screws to the skull. A silicone adhesive (Kwik-Sil, World Precision Instruments) was applied to protect the top surface of the GRIN lens until the next surgery. Then, 2 weeks after lens implantation, an aluminium baseplate was affixed by dental cement (C&B Metabond) to the skull of the mouse, which would later secure the miniaturized fluorescent endoscope (miniscope) in place during recording. The miniscope/baseplate was mounted to a stereotaxic arm and lowered above the implanted GRIN lens until the field of view contained visible cell segments, and dental cement was applied to affix the baseplate to the skull.
A polyoxymethylene cap was affixed to the baseplate when the mice were not being recorded to protect the baseplate and lens. After surgery, animals were continuously monitored until they recovered. For the initial 3 days after surgery, the mice were provided with a soft diet supplemented with carprofen (MediGel CPF) for pain management. Screening and habituation to recording in the experimental environment began 2–3 days after the baseplate surgery. The first 3–4 weeks of recordings were used to confirm the quality and reliability of the calcium data while the animal was exploring the environment with different screen displays.

### Data acquisition

In vivo calcium videos were recorded with a miniscope (v3; https://miniscope.org) containing a monochrome CMOS imaging sensor (MT9V032C12STM, ON Semiconductor) connected to a custom data acquisition (DAQ) box (https://miniscope.org) with a lightweight, flexible coaxial cable. The cable was attached to a noiseless pulley system with a counterbalance (placed outside the recording environment) to prevent interference with the recorded animal’s movements and to alleviate the weight of the miniscope. The DAQ was connected to a PC with a USB 3.0 SuperSpeed cable and controlled using Miniscope custom acquisition software (https://miniscope.org). The outgoing excitation LED was set to 3–6%, depending on the mouse, to maximize the signal quality with the minimum possible excitation light to mitigate the risk of photobleaching. The gain was adjusted to match the dynamic range of the recorded video to the fluctuations of the calcium signal for each recording to avoid saturation. Behavioural video data were recorded using a webcam mounted above the environment. The DAQ simultaneously acquired behavioural and cellular imaging streams at 30 Hz as uncompressed AVI files and all recorded frames were timestamped for post hoc alignment. Two controllable LEDs (green and red) were added and used for tracking such that, whenever the miniscope was attached to the baseplate, the green LED pointed to the right side of the mouse’s head and the red LED pointed to the left side. All other light sources from the miniscope were covered. All recordings took place inside a 360° LED screen (height: 1 m, diameter: 90 cm; Shenzhen Apexls Optoelectronic), at the centre of which we placed a wall-less circular platform (diameter, 20 cm) raised 50 cm above the ground. Mouse bedding was evenly spread over the platform before each recording session. In all recordings, mice were free to move on top of the raised platform. 
A hemispherical dome covered the environment to prevent external light from entering; it also held the behavioural camera. The experimental environment was designed to maximize circular symmetry in the absence of any screen display. During habituation, mice were recorded while exposed to a single vertical stripe or no visual display (darkness). These recordings were also used to confirm the quality of head-direction and cue-location tracking in different conditions. In all experiments in this study, the visual cue refers to a single white vertical stripe (width, 15 cm; height, 1 m).

### Data preprocessing

Calcium imaging data were preprocessed before analyses through a pipeline of open-source MATLAB (MathWorks; v.R2015a) functions to correct for motion artifacts^{45}, segment cells and extract transients^{32,46}, and infer the likelihood of spiking events by deconvolution of the transient trace through a first-order autoregressive model^{31}. We wrote a MATLAB program to perform offline tracking of the LEDs and determine, at each frame, the animal's head direction. Another custom-written program was used to estimate the location of the visual cue. Both scripts were incorporated into the preprocessing pipeline.

### Data analysis

In this study, neural activity refers to the deconvolved calcium traces as described previously^{31}, unless specified otherwise. The resulting time series (per neuron, per session) correspond to the inferred likelihood of spiking events. A moving-average filter of width 3 frames (~100 ms) is then applied to each time series. We refer to the resulting signal as firing activity.
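The study's pipeline is written in MATLAB; purely as an illustration, the smoothing step above can be sketched in Python/NumPy (the function name and array layout are our own, not the study's):

```python
import numpy as np

def firing_activity(deconvolved, width=3):
    """Smooth a deconvolved trace with a centred moving-average filter.

    `deconvolved` is a 1D array of inferred spiking likelihoods (one
    value per frame); width = 3 frames is ~100 ms at 30 Hz.
    """
    kernel = np.ones(width) / width
    # mode="same" keeps the output aligned with the input frames
    return np.convolve(deconvolved, kernel, mode="same")
```
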

### Identification of HD cells

For every identified cell segment (ROI), we construct an HD tuning curve by measuring the occupancy-normalized firing activity within each angle bin (1° per bin) of the horizontal plane (*x* axis). The tuning curve is circularly smoothed with a moving-average filter of width 50°. This enables a better estimate of the angle bin that corresponds to the maximum firing activity of a given neuron's tuning curve, which we refer to as the preferred firing direction (PFD). We next construct a stimulus signal for that specific PFD by convolving the measured HD signal (from the behavioural camera) with a narrow Gaussian kernel (mean = PFD, s.d. = 17°) such that for every neuron *i*:

$${{\rm{stim}}}_{i}\left(t\right)={{\rm{e}}}^{-\frac{{({\rm{angdiff}}({{\rm{PFD}}}_{i},{\theta }_{{\rm{HD}}}(t)))}^{2}}{2{\sigma }^{2}}}$$

where *θ*_{HD} is the measured HD time series, *σ* is the s.d. of the Gaussian kernel and angdiff(*a*, *b*) is a MATLAB function that returns *b* − *a* wrapped to the [−π, π] interval. We correlate the stimulus signal with a normalized version of the firing activity to obtain the Pearson correlation coefficient *r* for each neuron. To determine the threshold value of *r* above which a cell is identified as an HD neuron, we used data from ten baseline recordings (3 min) per animal, randomly selected from the reset experiment. We start with a relatively high value *r*_{thresh} and select all neurons such that *r* > *r*_{thresh}. For each neuron, we produce 1,000 shuffles of the firing activity using the MATLAB circshift function (to preserve the temporal correlation of the firing activity signal) at random shifts. We next correlate each shuffled version with the stimulus signal of the corresponding neuron to obtain a distribution of correlation coefficients (3 separate distributions, 1 per mouse). We define \({r}_{m}^{95{\rm{th}}}\) as the value that corresponds to the 95th percentile of the distribution for mouse *m*. If \({r}_{{\rm{thresh}}} > {r}_{m}^{95{\rm{th}}}\), we keep iterating the same procedure while decreasing *r*_{thresh} by 0.01 until convergence (that is, \({r}_{{\rm{thresh}}}\simeq {r}_{m}^{95{\rm{th}}}\)); the converged value constitutes the correlation coefficient threshold used to identify HD neurons for mouse *m* (see Extended Data Fig. 1d,e for an illustration of the results).
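The procedure above relies on MATLAB's angdiff and circshift; a minimal Python/NumPy sketch of the stimulus signal and the shuffle null distribution, with hypothetical function names, might read:

```python
import numpy as np

def angdiff(a, b):
    """b - a wrapped to [-pi, pi], analogous to MATLAB's angdiff."""
    return (b - a + np.pi) % (2 * np.pi) - np.pi

def stimulus_signal(pfd, theta_hd, sigma=np.deg2rad(17)):
    """Gaussian bump of the measured HD around a neuron's PFD
    (the stim_i(t) equation above); angles in radians."""
    return np.exp(-angdiff(pfd, theta_hd) ** 2 / (2 * sigma ** 2))

def shuffle_corrs(firing, stim, n_shuffles=1000, seed=None):
    """Null distribution of Pearson r from circularly shifted firing
    activity (np.roll plays the role of circshift)."""
    rng = np.random.default_rng(seed)
    shifts = rng.integers(1, firing.size, n_shuffles)
    return np.array([np.corrcoef(np.roll(firing, s), stim)[0, 1]
                     for s in shifts])
```
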

### HD decoding from neural data

We trained a recently developed Bayesian decoder^{37} to infer the head direction from the deconvolved calcium responses of the imaged neural population. Noise independence across neurons was assumed. Conceptually, this decoder is similar to the Bayesian decoding method for spike trains that is commonly used in the literature^{47}, except that we used a zero-inflated gamma distribution, instead of a Poisson distribution, to model the stochasticity of the deconvolved calcium responses. Our previous results showed that the zero-inflated gamma model captures the noise of the calcium signal better and provides better decoding results than the Poisson noise model and a few other alternatives. Details of this procedure can be found in section 4 of ref. ^{37}. Here we smoothed the log-likelihood matrix (rows, angle bins; columns, frames) by iteratively summing the log-likelihoods over 5 frames (~166.7 ms) centred around the corresponding timestep of each iteration, for each angle bin. Note that, owing to the predominance of HD-tuned neurons among detected cell segments and to avoid selection biases, the neural activity from all ADN cells was used as input to the decoding algorithm.
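The observation model itself is detailed in ref. ^{37}; the temporal smoothing and read-out steps alone can be sketched in Python/NumPy as follows (function names and shapes are our own assumptions):

```python
import numpy as np

def smooth_loglik(loglik, width=5):
    """Sum log-likelihoods over a centred window of `width` frames.

    `loglik` has shape (n_angle_bins, n_frames); summing logs over
    5 frames (~166.7 ms at 30 Hz) pools evidence across nearby frames.
    """
    kernel = np.ones(width)
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, loglik)

def decode_hd(loglik, width=5):
    """Decoded HD = angle bin maximizing the smoothed log-likelihood."""
    return np.argmax(smooth_loglik(loglik, width), axis=0)
```
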

### Analysis of drift

We define the offset as the angular difference between the measured head direction (*θ*_{measured}) and the decoded head direction (*θ*_{decoded}):

$${\rm{Offset}}(t)={\rm{angdiff}}({\theta }_{{\rm{decoded}}},{\theta }_{{\rm{measured}}})$$

In all analyses involving drift estimation, both measured and decoded HDs were smoothed with a moving average filter of width 20 frames (~667 ms). For the analysis of drift during darkness (except for heat maps), further smoothing was applied to extract the low-frequency component of the signal whereby a moving average filter of width 300 frames (~10 s) was used. In all cases, a simple linear regression was performed on the unwrapped offset signal over a sliding window of 20 frames (~667 ms) to estimate the drift speed at the centre of the regression window (that is, slope of the fitted line).
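The sliding-window regression for drift speed, described above in MATLAB terms, could be sketched in Python/NumPy like this (hypothetical function name; frame rate assumed to be 30 Hz as stated earlier):

```python
import numpy as np

def drift_speed(offset, fps=30.0, window=20):
    """Slope (deg/s) of a line fitted to the unwrapped offset within a
    sliding window of `window` frames (~667 ms at 30 Hz); the slope at
    the window centre estimates the instantaneous drift speed."""
    unwrapped = np.degrees(np.unwrap(np.radians(offset)))
    t = np.arange(window) / fps          # time axis of one window (s)
    half = window // 2
    speeds = np.full(unwrapped.size, np.nan)
    for c in range(half, unwrapped.size - half):
        seg = unwrapped[c - half:c + half]
        speeds[c] = np.polyfit(t, seg, 1)[0]   # slope of the fitted line
    return speeds
```
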

### Separation of fast and slow resets

Classification of resets within the [70:110]° range was done using the *k*-means clustering function in MATLAB. We used data from the first 1,450 frames after cue display. The algorithm separates the data into two clusters by generating 50 replicates with different initial cluster centroid positions for each replicate and then calculating the sums of point-to-centroid distances for each cluster using the 'city block' distance metric.
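With the city-block metric, the centroid update of *k*-means becomes a component-wise median. As an illustration only (the study used MATLAB's kmeans; names and defaults here are our own), a NumPy sketch of this variant:

```python
import numpy as np

def kmeans_cityblock(X, k=2, replicates=50, iters=25, seed=0):
    """k-means with the city-block (L1) metric: assignment by L1
    distance, centroid update by component-wise median. Keeps the
    replicate with the lowest sum of point-to-centroid distances."""
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(replicates):
        # initialize centroids at k distinct data points
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
            labels = d.argmin(axis=1)
            centroids = np.array([np.median(X[labels == j], axis=0)
                                  for j in range(k)])
        cost = np.abs(X - centroids[labels]).sum()
        if cost < best_cost:
            best, best_cost = labels, cost
    return best
```
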

### Reconstruction of the bump of activity

At any given time, we can reconstruct the bump of activity from the firing activity of each neuron and their respective tuning curves using a normalized weighted sum of tuning curves^{48}:

$$A\left(\theta ,t\right)=\frac{\sum _{i}{f}_{i}(\theta ){r}_{i}(t)}{\sum _{i}{f}_{i}(\theta )},$$

where *A* is a 360 × *T* matrix (each row is a 1° bin of the horizontal plane and each column is a frame within the range *T* of the analysis), *f*_{i} is the tuning curve of neuron *i* and *r*_{i} is the firing activity of neuron *i*.
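The normalized weighted sum above is a single matrix operation; assuming tuning curves stacked column-wise, a NumPy sketch (our own naming) is:

```python
import numpy as np

def reconstruct_bump(F, R):
    """Normalized weighted sum of tuning curves (equation above).

    F: (360, N) tuning curves, one column per neuron, 1-degree bins.
    R: (N, T) firing activity. Returns A with shape (360, T).
    Assumes every angle bin has nonzero summed tuning.
    """
    return (F @ R) / F.sum(axis=1, keepdims=True)
```
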

### Calculation of network gain

We assume that, at any given time, the thalamic HD network is subject to a global gain modulation of the firing activity, applied homogeneously on all ADN neurons such that:

$$\begin{array}{cc}{r}_{i,t}={\alpha }_{t}{f}_{i}\left({\theta }_{t}\right)+\varepsilon , & \varepsilon \sim {\mathscr{N}}(0,{\sigma }^{2})\end{array},$$

where *r*_{i,t} is the instantaneous firing activity of ADN neuron *i*; *α*_{t} is the instantaneous gain factor; *f*_{i} is the tuning curve of ADN neuron *i* (calculated on the basis of the response measured in the baseline condition); *θ*_{t} is the decoded head direction from neural activity at time *t*; and *ε* is the additive Gaussian noise.

Our goal is to estimate the value of *α*_{t} at any given time *t* using a maximum-likelihood estimation approach.

Given the decoded head direction *θ*_{t} at time *t*, as well as the tuning curves *f*_{i} for all ADN neurons, we obtain the likelihood of observing *r*_{i,t} with parameter *α*_{t}:

$$P\left({r}_{i,t}| \,{f}_{i}\left({\theta }_{t}\right);{\alpha }_{t}\right)={\mathscr{N}}({\alpha }_{t}\,{f}_{i}\left({\theta }_{t}\right),{\sigma }^{2}).$$

We define the vectors:

$$\begin{array}{cc}{R}_{t}=\left[\begin{array}{c}{r}_{1,t}\\ {r}_{2,t}\\ \vdots \\ {r}_{N,t}\end{array}\right], & F({\theta }_{t})=\left[\begin{array}{c}{f}_{1}({\theta }_{t})\\ {f}_{2}({\theta }_{t})\\ \vdots \\ {f}_{N}({\theta }_{t})\end{array}\right]\end{array},$$

where *N* is the number of ADN neurons in the network.

Assuming independent activity between said neurons, we can calculate the likelihood of observing *R*_{t}:

$$\begin{array}{l}P({R}_{t}| F({\theta }_{t})\,;{\alpha }_{t})={\prod }_{i}P({r}_{i,t}| \,{f}_{i}({\theta }_{t});{\alpha }_{t})\\ \,\,\,\,\propto {\prod }_{i}\exp \left(-\frac{{({r}_{i,t}-{\alpha }_{t}{f}_{i}({\theta }_{t}))}^{2}}{2{\sigma }^{2}}\right).\end{array}$$

We apply the logarithm on both sides:

$$\log \left(P\left({R}_{t}| F\left({\theta }_{t}\right)\,{\rm{;}}{\alpha }_{t}\right)\right)\propto \,-\,\frac{1}{2}{\sum }_{i}\frac{{({r}_{i,t}-{\alpha }_{t}{f}_{i}\left({\theta }_{t}\right))}^{2}}{{\sigma }^{2}}.$$

Our goal is to determine the parameter \({\hat{\alpha }}_{t}\) that maximizes the log-likelihood such that:

$${\hat{\alpha }}_{t}=\mathop{{\rm{argmax}}}\limits_{{\alpha }_{t}}\,\log (P({R}_{t}|F({\theta }_{t})\,;\,{\alpha }_{t}))=\mathop{{\rm{argmin}}}\limits_{{\alpha }_{t}}{\sum }_{i}{({r}_{i,t}-{\alpha }_{t}{f}_{i}({\theta }_{t}))}^{2}.$$

To do so, we take the derivative of the objective function with respect to *α*_{t} and set it to zero:

$$\frac{d}{d{\alpha }_{t}}{\sum }_{i}{({r}_{i,t}-{\alpha }_{t}{f}_{i}({\theta }_{t}))}^{2}=0.$$

Thus:

$${\hat{\alpha }}_{t}=\frac{\sum _{i}{r}_{i,t}\,{f}_{i}\left({\theta }_{t}\right)}{\sum _{i}{{f}_{i}\left({\theta }_{t}\right)}^{2}}.$$
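This closed-form estimator is a one-line least-squares projection; sketched in Python/NumPy (our own naming), per frame:

```python
import numpy as np

def network_gain(r_t, f_theta):
    """Closed-form MLE of the instantaneous gain (equation above):
    alpha_hat = sum_i r_i f_i(theta) / sum_i f_i(theta)^2.

    r_t: firing activity of all N neurons at time t, shape (N,).
    f_theta: tuning-curve values at the decoded HD, shape (N,).
    """
    return np.dot(r_t, f_theta) / np.dot(f_theta, f_theta)
```

If the population response is an exact scaled copy of the tuning-curve vector, the estimator recovers that scale factor.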

Similar to the offset signal, the obtained gain is smoothed with a moving-average filter of width 20 frames (~667 ms), unless otherwise specified. With the exception of Extended Data Fig. 5b–d, the gain was always normalized to the average baseline activity. In Extended Data Fig. 5b–d, the tuning curves represent the average firing rates as a function of the internal HD for the entire recording session.

### Gain heat-map analysis

Gain heat maps are 2D matrices in which each pixel *p*(*x*, *y*) is a 2D bin of width 1.5° per s corresponding to the measured angular head velocity and height 1° corresponding to the decoded HD. Pixel *p*(*x*, *y*) represents the mean network gain—across mice and across sessions—within a 2D average window of width [*x* − 3:*x* + 3]° per s and height [*y* − 15:*y* + 15]°. A 2D Gaussian filter of s.d. = 15 (15° × 22.5° per s) is then applied. The network gain, the decoded HD and the measured HD were all smoothed with a moving-average filter of width 20 frames (~667 ms) while the measured head angular velocity was approximated by a simple linear regression with a regression window of similar width. To evaluate the significance of the difference between gain heat maps (Fig. 4f), we performed a Wilcoxon rank-sum test to compare, at each pixel, the gain distributions within the 2D window of width [*x* − 3:*x* + 3]° per s and height [*y* − 15:*y* + 15]° between darkness epochs of the 20 s experiment and D2 of the 2 min experiment. As we are only interested in the significance of the positive values (indicating the appearance of new bumps), negative values as well as *P* > 0.001 were marked as NaN.
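The overlapping-window averaging above (before the Gaussian filter) can be sketched in Python/NumPy; this toy version uses our own names, ignores the circular wrap of HD, and omits the subsequent smoothing and significance steps:

```python
import numpy as np

def gain_heatmap(ahv, hd, gain, ahv_bins, hd_bins,
                 ahv_halfwidth=3.0, hd_halfwidth=15.0):
    """Mean gain in overlapping 2D windows of the (AHV, HD) plane.

    For each pixel (x, y), average the gain over samples with
    |ahv - x| <= ahv_halfwidth (deg/s) and |hd - y| <= hd_halfwidth (deg).
    Pixels with no samples are left as NaN.
    """
    H = np.full((len(hd_bins), len(ahv_bins)), np.nan)
    for j, x in enumerate(ahv_bins):
        for i, y in enumerate(hd_bins):
            m = (np.abs(ahv - x) <= ahv_halfwidth) & \
                (np.abs(hd - y) <= hd_halfwidth)
            if m.any():
                H[i, j] = gain[m].mean()
    return H
```
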

### Drift-speed heat-map analysis

Drift-speed heat maps were generated according to the same approach as for the gain heat maps. However, drift speed was approximated by a simple linear regression with a regression window of width 20 frames (~667 ms). The *P*-value matrix for drift speed difference between the 20 s experiment and D2 of the 2 min experiment (Extended Data Fig. 13e) was calculated as described above. However, only *P* values > 0.001 were marked as NaN.

### Vector field analysis

The purpose of this analysis is to illustrate baseline attractiveness. We define the state space (*y* axis, drift speed (° per s); *x* axis, drift angle (°)). We construct a vector field matrix by dividing the *x* axis into 18 bins of width 20° each within the range [−180:180]°, and the *y* axis into 20 bins of width 0.3° per s each within the range [−3:3]° per s. At each bin (*x*, *y*), we calculate the mean drift speed and mean drift acceleration across mice and across sessions. These two quantities represent the velocity components (*u*, *v*) that determine the length and direction of the velocity vector. We assume the vector field has a central symmetry with respect to the baseline point (0, 0) owing to the symmetry of the experimental design. We therefore generate a copy of the original vector field reflected through the origin. The two versions are then averaged to produce the final 2D vector field. Streamlines are generated using the streamline function in MATLAB. For Figs. 4d and 5e, streamlines were simulated over 1,000 timesteps.
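The reflection-and-average step can be sketched in Python/NumPy (hypothetical function name); reflecting through the origin reverses both grid axes and flips the sign of both velocity components:

```python
import numpy as np

def symmetrize_field(U, V):
    """Average a vector field with its point reflection through the
    origin of the (drift-angle, drift-speed) state space.

    U, V: velocity components on a grid whose bins are arranged
    symmetrically around (0, 0). The result is exactly centrally
    symmetric by construction.
    """
    U_ref = -U[::-1, ::-1]
    V_ref = -V[::-1, ::-1]
    return (U + U_ref) / 2, (V + V_ref) / 2
```
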

### Analysis of cue-rotation sessions

At the beginning of each continuous cue-rotation epoch, the visual cue was displayed at the same location as in the baseline. After cue removal and depending on its previous rotation speed, the cue would have reached either ±180° or ±90° (cue orientation in clockwise-cue-rotation sessions was reflected across the *x* axis, so that the cue ends at ±180° (fast cue rotation) or −90° (slow cue rotation)). The offset is therefore expected to start within a close range of these two directions during the second darkness epoch. However, in some cases, drifts during the first darkness epoch were large enough that the initial anchoring to the rotating cue occurred considerably far from the baseline. This caused the drift signal during the second darkness epoch to start further away from the expected location. In Fig. 5d, we limited our analysis to drifts starting within [−180:−145]° ∪ [145:180]° for fast cue rotation and [−125:−55]° for slow cue rotation, to study the effects across sessions with similar stability during the baseline (total *n* = 44 out of 60).

### Dimensionality reduction

It is generally believed that the main function of the HD system is to provide an estimate of the head direction at any given time. As most studies of this network, including ours, record neural activity while animals move on horizontal planes, it is fair to assume that most of the variability in the activity of the HD neural population can be captured by a single variable representing the angle faced by the animal, at an instant *t*, with reference to a given allocentric reference frame. Indeed, previous studies have shown that, in stable conditions, different dimensionality reduction methods^{17,38} produce a circular manifold that can be fairly approximated in a unidimensional polar state space with a fixed radius. Nevertheless, a previous study^{17} observed that the structure becomes more complex during slow-wave sleep. Our guiding hypothesis is that the intrinsic geometric structure of the neural activity in the HD network lies in a multidimensional state space and that latent variables other than the angular component are needed to explain the variability in spiking data during non-stable conditions such as resets and drift. Here we propose the simplest augmentation to the latent structure by adding a radial component that we expect to indicate instantaneous changes in the global firing activity of the HD network. Although we believe that the true intrinsic dimensionality of the HD neural data is higher than two, the current paper focuses mainly on the necessity of at least a second dimension of the HD system during instability.

To test our hypothesis, we developed a deep feedforward neural network that maps the high-dimensional input (neural) data onto the 2D polar space (angular dimension *θ* and radial dimension *R*). The network is trained on circular data from the measured head direction. The radial component *R* is a latent variable that can take any non-negative value. Our previous analyses (not included here) have shown that, although methods such as principal component analysis and Isomap can uncover looped latent structures, these unsupervised learning algorithms tend to produce distorted circles, in the presence of noise, when applied to baseline data (that is, the stable condition), which makes the definition of a radius less straightforward and motivates our use of a supervised learning method.

We used a feedforward neural network with three parallel branches. Two of these branches have three fully connected hidden layers (referred to as the first and second branches, or *B*_{1} and *B*_{2}, respectively), while the third branch has two fully connected hidden layers (referred to as the middle branch, or *B*_{m}) (Extended Data Fig. 4d). The input layer receives an *N* × 1 vector of neural activity from *N* ADN neurons at time *t* (either calcium traces or firing activity from deconvolved spikes can be fed to the model). The output layer is composed of two units that are the results of multiplying the output *g*_{t} of the middle branch with the output *z*_{1,t} of the first branch, on the one hand, and with the output *z*_{2,t} of the second branch, on the other, as illustrated in the diagram of Extended Data Fig. 4d.

We trained our model on baseline data. The objective is to find the set of weights *W* that minimize the distance between the network output \(\left(\begin{array}{c}{g}_{t}{z}_{1,t}\\ {g}_{t}{z}_{2,t}\end{array}\right)\) and the vector \(\left(\begin{array}{c}\cos ({\theta }_{t})\\ \sin ({\theta }_{t})\end{array}\right)\), where *θ*_{t} is the measured head direction of the animal at instant *t*. We define the loss function as the mean squared error:

$${\rm{MSE}}=\frac{1}{T}\mathop{\sum }\limits_{t=1}^{T}{\parallel (\begin{array}{c}\cos ({\theta }_{t})\\ \sin ({\theta }_{t})\end{array})-(\begin{array}{c}{g}_{t}{z}_{1,t}\\ {g}_{t}{z}_{2,t}\end{array})\parallel }_{2}^{2},$$

where *T* is the duration of the training epoch and ‖.‖_{2} is the *L*_{2} norm. If the algorithm converges, we obtain the following approximations:

$$\left\{\begin{array}{c}{z}_{1,t}\approx \frac{\cos ({\theta }_{t})}{{g}_{t}}\\ {z}_{2,t}\approx \frac{\sin ({\theta }_{t})}{{g}_{t}}\end{array}\right..$$

Let \({\hat{R}}_{t}=\frac{1}{{g}_{t}}\), then we can rewrite the output of each branch:

$$\left\{\begin{array}{c}{B}_{1}:{z}_{1,t}\approx {\hat{R}}_{t}\cos \left({\theta }_{t}\right)\\ {B}_{2}:{z}_{2,t}\approx {\hat{R}}_{t}\sin \left({\theta }_{t}\right)\\ {B}_{{\rm{m}}}:{g}_{t}=\frac{1}{{\hat{R}}_{t}}\end{array}\right..$$

In effect, this allows branches *B*_{1} and *B*_{2} to learn a mapping from the input (neural) space to the Cartesian transformation of the polar coordinates of a given state *s*_{t} at any time *t* (*B*_{1} projects the input onto the *x* axis and *B*_{2} projects the input onto the *y* axis). From these two branches, we can extract the decoded angle \({\hat{\theta }}_{t}=\arctan \left(\frac{{z}_{2,t}}{{z}_{1,t}}\right)\), while branch *B*_{m} learns a mapping from the input space to the inverse of the approximate radius \({\hat{R}}_{t}\) of that state in polar space. If we assume that \({\hat{R}}_{t}\) reflects global neural activity, as per our hypothesis, then we expect small fluctuations of population activity in the training data (baseline) to be sufficient to allow the network to extrapolate \({\hat{R}}_{t}\) on test data with larger fluctuations.
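The loss and the read-out of the latent polar coordinates can be sketched in Python/NumPy, independently of the network's training (function names are ours; we use the four-quadrant arctan2 for the angle read-out):

```python
import numpy as np

def polar_loss(theta, z1, z2, g):
    """MSE between the network output (g*z1, g*z2) and the target
    (cos(theta), sin(theta)) -- the loss function above."""
    return np.mean((np.cos(theta) - g * z1) ** 2 +
                   (np.sin(theta) - g * z2) ** 2)

def readout(z1, z2, g):
    """Decoded angle and radius from the branch outputs:
    theta_hat = arctan2(z2, z1), R_hat = 1/g."""
    return np.arctan2(z2, z1), 1.0 / g
```

At convergence, z1 ≈ R·cos(θ), z2 ≈ R·sin(θ) and g ≈ 1/R, so the loss vanishes and the read-out recovers (θ, R).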

### Statistics and reproducibility

All statistical tests are noted where the corresponding results are reported throughout the main text and Supplementary Information. All tests were uncorrected two-tailed tests unless otherwise noted. Outliers were identified as data points that fall outside the mean ± (3 s.d.) range.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.