# Champion-level drone racing using deep reinforcement learning – Nature

#### ByAUTHOR

Aug 30, 2023

To enable large-scale training, we use a high-fidelity simulation of the quadrotor dynamics. This section briefly explains the simulation. The dynamics of the vehicle can be written as

$$\dot{{\bf{x}}}=[\begin{array}{c}{\dot{{\bf{p}}}}_{{\mathcal{W}}{\mathcal{B}}}\\ {\dot{{\bf{q}}}}_{{\mathcal{W}}{\mathcal{B}}}\\ {\dot{{\bf{v}}}}_{{\mathcal{W}}}\\ {\dot{{\boldsymbol{\omega }}}}_{{\mathcal{B}}}\\ \dot{{\boldsymbol{\Omega }}}\end{array}]=\left[\begin{array}{c}{{\bf{v}}}_{{\mathcal{W}}}\\ {{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}}\cdot [\begin{array}{c}0\\ {{\boldsymbol{\omega }}}_{{\mathcal{B}}}/2\end{array}]\\ \frac{1}{m}({{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}}\odot ({{\bf{f}}}_{{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}}+{{\bf{f}}}_{{\rm{a}}{\rm{e}}{\rm{r}}{\rm{o}}}))+{{\bf{g}}}_{{\mathcal{W}}}\\ {J}^{-1}({{\boldsymbol{\tau }}}_{{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}}+{{\boldsymbol{\tau }}}_{{\rm{m}}{\rm{o}}{\rm{t}}}+{{\boldsymbol{\tau }}}_{{\rm{a}}{\rm{e}}{\rm{r}}{\rm{o}}}+{{\boldsymbol{\tau }}}_{{\rm{i}}{\rm{n}}{\rm{e}}{\rm{r}}})\\ \frac{1}{{k}_{{\rm{m}}{\rm{o}}{\rm{t}}}}({{\boldsymbol{\Omega }}}_{{\rm{s}}{\rm{s}}}-{\boldsymbol{\Omega }})\end{array}\right],$$

(1)

in which represents quaternion rotation, $${{\bf{p}}}_{{\mathcal{W}}{\mathcal{B}}},{{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}},{{\bf{v}}}_{{\mathcal{W}}}$$ and $${{\boldsymbol{\omega }}}_{{\mathcal{B}}}$$ denote the position, attitude quaternion, inertial velocity and body rates of the quadcopter, respectively. The motor time constant is kmot and the motor speeds Ω and Ωss are the actual and steady-state motor speeds, respectively. The matrix J is the inertia of the quadcopter and $${{\bf{g}}}_{{\mathcal{W}}}$$ denotes the gravity vector. Two forces act on the quadrotor: the lift force fprop generated by the propellers and an aerodynamic force faero that aggregates all other forces, such as aerodynamic drag, dynamic lift and induced drag. The torque is modelled as a sum of four components: the torque generated by the individual propeller thrusts τprop, the yaw torque τmot generated by a change in motor speed, an aerodynamic torque τaero that accounts for various aerodynamic effects such as blade flapping and an inertial term τiner. The individual components are given as

$${{\bf{f}}}_{{\rm{prop}}}=\sum _{i}{{\bf{f}}}_{i},\,{{\boldsymbol{\tau }}}_{{\rm{prop}}}=\sum _{i}{{\boldsymbol{\tau }}}_{i}+{{\bf{r}}}_{{\rm{P}},i}\times {{\bf{f}}}_{i},$$

(2)

$${{\boldsymbol{\tau }}}_{{\rm{mot}}}={J}_{{\rm{m}}+{\rm{p}}}\sum _{i}\,{{\boldsymbol{\zeta }}}_{i}{\dot{\Omega }}_{i},\,{{\boldsymbol{\tau }}}_{{\rm{iner}}}=-{{\boldsymbol{\omega }}}_{{\mathcal{B}}}\times {\bf{J}}{{\boldsymbol{\omega }}}_{{\mathcal{B}}}$$

(3)

in which rP,i is the location of propeller i, expressed in the body frame, and fi and τi are the forces and torques, respectively, generated by the ith propeller. The axis of rotation of the ith motor is denoted by ζi, the combined inertia of the motor and propeller is Jm+p and the derivative of the ith motor speed is $${\dot{\Omega }}_{i}$$. The individual propellers are modelled using a commonly used quadratic model, which assumes that the lift force and drag torque are proportional to the square of the propeller speed Ωi:

$${{\bf{f}}}_{i}({\Omega }_{i})={\left[\begin{array}{ccc}0 & 0 & {c}_{{\rm{l}}}\cdot {\Omega }_{i}^{2}\end{array}\right]}^{\top },\quad {{\boldsymbol{\tau }}}_{i}({\Omega }_{i})={\left[\begin{array}{ccc}0 & 0 & {c}_{{\rm{d}}}\cdot {\Omega }_{i}^{2}\end{array}\right]}^{\top }$$

(4)

in which cl and cd denote the propeller lift and drag coefficients, respectively.

#### Aerodynamic forces and torques

The aerodynamic forces and torques are difficult to model with a first-principles approach. We thus use a data-driven model43. To maintain the low computational complexity required for large-scale RL training, a grey-box polynomial model is used rather than a neural network. The aerodynamic effects are assumed to primarily depend on the velocity $${{\bf{v}}}_{{\mathcal{B}}}$$ (in the body frame) and the average squared motor speed $$\overline{{\Omega }^{2}}$$. The aerodynamic forces fx, fy and fz and torques τx, τy and τz are estimated in the body frame. The quantities vx, vy and vz denote the three axial velocity components (in the body frame) and vxy denotes the speed in the (x, y) plane of the quadrotor. On the basis of insights from the underlying physical processes, linear and quadratic combinations of the individual terms are selected. For readability, the coefficients multiplying each summand have been omitted:

$$\begin{array}{c}{f}_{x}\,\sim \,{v}_{x}+{v}_{x}|{v}_{x}|+\bar{{\Omega }^{2}}+{v}_{x}\,\bar{{\Omega }^{2}}\\ {f}_{y}\,\sim \,{v}_{y}+{v}_{y}|{v}_{y}|+\bar{{\Omega }^{2}}+{v}_{y}\,\bar{{\Omega }^{2}}\\ {f}_{z}\,\sim \,{v}_{z}+{v}_{z}|{v}_{z}|+{v}_{xy}+{v}_{xy}^{2}+{v}_{xy}\,\bar{{\Omega }^{2}}+{v}_{z}\,\bar{{\Omega }^{2}}+{v}_{xy}\,{v}_{z}\,\bar{{\Omega }^{2}}\\ \,{\tau }_{x}\,\sim \,{v}_{y}+{v}_{y}|{v}_{y}|+\bar{{\Omega }^{2}}+{v}_{y}\,\bar{{\Omega }^{2}}+{v}_{y}|{v}_{y}|\,\bar{{\Omega }^{2}}\\ \,{\tau }_{y}\,\sim \,{v}_{x}+{v}_{x}|{v}_{x}|+\bar{{\Omega }^{2}}+{v}_{x}\,\bar{{\Omega }^{2}}+{v}_{x}|{v}_{x}|\,\bar{{\Omega }^{2}}\\ \,{\tau }_{z}\,\sim \,{v}_{x}+{v}_{y}\end{array}$$

The respective coefficients are then identified from real-world flight data, in which motion capture is used to provide ground-truth forces and torque measurements. We use data from the race track, allowing the dynamics model to fit the track. This is akin to the human pilots’ training for days or weeks before the race on the specific track that they will be racing. In our case, the human pilots are given a week of practice on the same track ahead of the competition.

#### Betaflight low-level controller

To control the quadrotor, the neural network outputs collective thrust and body rates. This control signal is known to combine high agility with good robustness to simulation-to-reality transfer44. The predicted collective thrust and body rates are then processed by an onboard low-level controller that computes individual motor commands, which are subsequently translated into analogue voltage signals through an electronic speed controller (ESC) that controls the motors. On the physical vehicle, this low-level proportional–integral–derivative (PID) controller and ESC are implemented using the open-source Betaflight and BLHeli32 firmware45. In simulation, we use an accurate model of both the low-level controller and the motor speed controller.

Because the Betaflight PID controller has been optimized for human-piloted flight, it exhibits some peculiarities, which the simulation correctly captures: the reference for the D-term is constantly zero (pure damping), the I-term gets reset when the throttle is cut and, under motor thrust saturation, the body rate control is assigned priority (proportional downscaling of all motor signals to avoid saturation). The gains of the controller used for simulation have been identified from the detailed logs of the Betaflight controller’s internal states. The simulation can predict the individual motor commands with less than 1% error.

#### Battery model and ESC

The low-level controller converts the individual motor commands into a pulse-width modulation (PWM) signal and sends it to the ESC, which controls the motors. Because the ESC does not perform closed-loop control of the motor speeds, the steady-state motor speed Ωi,ss for a given PWM motor command cmdi is a function of the battery voltage. Our simulation thus models the battery voltage using a grey-box battery model46 that simulates the voltage based on instantaneous power consumption Pmot:

$${P}_{{\rm{mot}}}=\frac{{c}_{{\rm{d}}}{\Omega }^{3}}{\eta }$$

(5)

The battery model46 then simulates the battery voltage based on this power demand. Given the battery voltage Ubat and the individual motor command ucmd,i, we use the mapping (again omitting the coefficients multiplying each summand)

$${\Omega }_{i,{\rm{s}}{\rm{s}}}\sim 1+{U}_{{\rm{b}}{\rm{a}}{\rm{t}}}+\sqrt{{u}_{{\rm{c}}{\rm{m}}{\rm{d}},i}}+{u}_{{\rm{c}}{\rm{m}}{\rm{d}},i}+{U}_{{\rm{b}}{\rm{a}}{\rm{t}}}\sqrt{{u}_{{\rm{c}}{\rm{m}}{\rm{d}},i}}$$

(6)

to calculate the corresponding steady-state motor speed Ωi,ss required for the dynamics simulation in equation (1). The coefficients have been identified from Betaflight logs containing measurements of all involved quantities. Together with the model of the low-level controller, this enables the simulator to correctly translate an action in the form of collective thrust and body rates to desired motor speeds Ωss in equation (1).

### Policy training

We train deep neural control policies that directly map observations ot in the form of platform state and next gate observation to control actions ut in the form of mass-normalized collective thrust and body rates44. The control policies are trained using model-free RL in simulation.

#### Training algorithm

Training is performed using proximal policy optimization31. This actor-critic approach requires jointly optimizing two neural networks during training: the policy network, which maps observations to actions, and the value network, which serves as the ‘critic’ and evaluates actions taken by the policy. After training, only the policy network is deployed on the robot.

#### Observations, actions and rewards

An observation $${{\bf{o}}}_{t}\in {{\mathbb{R}}}^{31}$$ obtained from the environment at time t consists of: (1) an estimate of the current robot state; (2) the relative pose of the next gate to be passed on the track layout; and (3) the action applied in the previous step. Specifically, the estimate of the robot state contains the position of the platform, its velocity and attitude represented by a rotation matrix, resulting in a vector in $${{\mathbb{R}}}^{15}$$. Although the simulation uses quaternions internally, we use a rotation matrix to represent attitude to avoid ambiguities47. The relative pose of the next gate is encoded by providing the relative position of the four gate corners with respect to the vehicle, resulting in a vector in $${{\mathbb{R}}}^{12}$$. All observations are normalized before being passed to the network. Because the value network is only used during training time, it can access privileged information about the environment that is not accessible to the policy48. This privileged information is concatenated with other inputs to the policy network and contains the exact position, orientation and velocity of the robot.

For each observation ot, the policy network produces an action $${{\bf{a}}}_{t}\in {{\mathbb{R}}}^{4}$$ in the form of desired mass-normalized collective thrust and body rates.

We use a dense shaped reward formulation to learn the task of perception-aware autonomous drone racing. The reward rt at time step t is given by

$${r}_{t}={r}_{t}^{{\rm{prog}}}+{r}_{t}^{{\rm{perc}}}+{r}_{t}^{{\rm{cmd}}}-{r}_{t}^{{\rm{crash}}}$$

(7)

in which rprog rewards progress towards the next gate35, rperc encodes perception awareness by adjusting the attitude of the vehicle such that the optical axis of the camera points towards the centre of the next gate, rcmd rewards smooth actions and rcrash is a binary penalty that is only active when colliding with a gate or when the platform leaves a predefined bounding box. If rcrash is triggered, the training episode ends.

Specifically, the reward terms are

$$\begin{array}{r}{r}_{t}^{{\rm{prog}}}\,=\,{\lambda }_{1}\left[{d}_{t-1}^{{\rm{Gate}}}-{d}_{t}^{{\rm{Gate}}}\right]\\ {r}_{t}^{{\rm{perc}}}\,=\,{\lambda }_{2}\exp \left[{\lambda }_{3}\cdot {\delta }_{{\rm{cam}}}^{4}\right]\end{array}$$

(8)

$${r}_{t}^{{\rm{cmd}}}={\lambda }_{4}{{\bf{a}}}_{t}^{\omega }+{\lambda }_{5}\parallel {{\bf{a}}}_{t}-{{\bf{a}}}_{t-1}{\parallel }^{2}$$

(9)

$${r}_{t}^{{\rm{crash}}}=\left\{\begin{array}{l}5.0,\,{\rm{if}}\,{p}_{z} < 0\,{\rm{or\; in\; collision\; with\; gate}}\\ 0,\quad {\rm{otherwise}}\,\end{array}\right.$$

in which $${d}_{t}^{{\rm{Gate}}}$$ denotes the distance from the centre of mass of the vehicle to the centre of the next gate at time step t, δcam represents the angle between the optical axis of the camera and the centre of the next gate and $${{\bf{a}}}_{t}^{\omega }$$ are the commanded body rates. The hyperparameters λ1,…, λ5 balance different terms (Extended Data Table 1a).

#### Training details

Data collection is performed by simulating 100 agents in parallel that interact with the environment in episodes of 1,500 steps. At each environment reset, every agent is initialized at a random gate on the track, with bounded perturbation around a state previously observed when passing this gate. In contrast to previous work44,49,50, we do not perform randomization of the platform dynamics at training time. Instead, we perform fine-tuning based on real-world data. The training environment is implemented using TensorFlow Agents51. The policy network and the value network are both represented by two-layer perceptrons with 128 nodes in each layer and LeakyReLU activations with a negative slope of 0.2. Network parameters are optimized using the Adam optimizer with learning rate 3 × 10−4 for both the policy network and the value network.

Policies are trained for a total of 1 × 108 environment interactions, which takes 50 min on a workstation (i9 12900K, RTX 3090, 32 GB RAM DDR5). Fine-tuning is performed for 2 × 107 environment interactions.

### Residual model identification

We perform fine-tuning of the original policy based on a small amount of data collected in the real world. Specifically, we collect three full rollouts in the real world, corresponding to approximately 50 s of flight time. We fine-tune the policy by identifying residual observations and residual dynamics, which are then used for training in simulation. During this fine-tuning phase, only the weights of the control policy are updated, whereas the weights of the gate-detection network are kept constant.

#### Residual observation model

Navigating at high speeds results in substantial motion blur, which can lead to a loss of tracked visual features and severe drift in linear odometry estimates. We fine-tune policies with an odometry model that is identified from only a handful of trials recorded in the real world. To model the drift in odometry, we use Gaussian processes36, as they allow fitting a posterior distribution of odometry perturbations, from which we can sample temporally consistent realizations.

Specifically, the Gaussian process model fits residual position, velocity and attitude as a function of the ground-truth robot state. The observation residuals are identified by comparing the observed visual–inertial odometry (VIO) estimates during a real-world rollout with the ground-truth platform states, which are obtained from an external motion-tracking system.

We treat each dimension of the observation separately, effectively fitting a set of nine 1D Gaussian processes to the observation residuals. We use a mixture of radial basis function kernels

$$\kappa ({{\bf{z}}}_{i},{{\bf{z}}}_{j})={\sigma }_{{\rm{f}}}^{2}\,\exp \left(-\frac{1}{2}{({{\bf{z}}}_{i}-{{\bf{z}}}_{j})}^{{\rm{\top }}}{L}^{-2}({{\bf{z}}}_{i}-{{\bf{z}}}_{j})\right)+{\sigma }_{{\rm{n}}}^{2}$$

(10)

in which L is the diagonal length scale matrix and σf and σn represent the data and prior noise variance, respectively, and zi and zj represent data features. The kernel hyperparameters are optimized by maximizing the log marginal likelihood. After kernel hyperparameter optimization, we sample new realizations from the posterior distribution that are then used during fine-tuning of the policy. Extended Data Fig. 1 illustrates the residual observations in position, velocity and attitude in real-world rollouts, as well as 100 sampled realizations from the Gaussian process model.

#### Residual dynamics model

We use a residual model to complement the simulated robot dynamics52. Specifically, we identify residual accelerations as a function of the platform state s and the commanded mass-normalized collective thrust c:

$${{\bf{a}}}_{{\rm{res}}}={\rm{KNN}}({\bf{s}},c)$$

(11)

We use k-nearest neighbour regression with k = 5. The size of the dataset used for residual dynamics model identification depends on the track layout and ranges between 800 and 1,000 samples for the track layout used in this work.

### Gate detection

To correct for drift accumulated by the VIO pipeline, the gates are used as distinct landmarks for relative localization. Specifically, gates are detected in the onboard camera view by segmenting gate corners26. The greyscale images provided by the Intel RealSense Tracking Camera T265 are used as input images for the gate detector. The architecture of the segmentation network is a six-level U-Net53 with (8, 16, 16, 16, 16, 16) convolutional filters of size (3, 3, 3, 5, 7, 7) per level and a final extra layer operating on the output of the U-Net containing 12 filters. As the activation function, LeakyReLU with α = 0.01 is used. For deployment on the NVIDIA Jetson TX2, the network is ported to TensorRT. To optimize memory footprint and computation time, inference is performed in half-precision mode (FP16) and images are downsampled to size 384 × 384 before being fed to the network. One forward pass through the network takes 40 ms on the NVIDIA Jetson TX2.

### VIO drift estimation

The odometry estimates from the VIO pipeline54 exhibit substantial drift during high-speed flight. We use gate detection to stabilize the pose estimates produced by VIO. The gate detector outputs the coordinates of the corners of all visible gates. A relative pose is first estimated for all predicted gates using infinitesimal plane-based pose estimation (IPPE)34. Given this relative pose estimate, each gate observation is assigned to the closest gate in the known track layout, thus yielding a pose estimate for the drone.

Owing to the low frequency of the gate detections and the high quality of the VIO orientation estimate, we only refine the translational components of the VIO measurements. We estimate and correct for the drift of the VIO pipeline using a Kalman filter that estimates the translational drift pd (position offset) and its derivative, the drift velocity vd. The drift correction is performed by subtracting the estimated drift states pd and vd from the corresponding VIO estimates. The Kalman filter state x is given by $${\bf{x}}={[{{\bf{p}}}_{{\rm{d}}}^{\top },{{\bf{v}}}_{{\rm{d}}}^{\top }]}^{\top }\in {{\mathbb{R}}}^{6}$$.

The state x and covariance P updates are given by:

$${{\bf{x}}}_{k+1}=F{{\bf{x}}}_{k},\,{P}_{k+1}=F{P}_{k}{F}^{{\rm{\top }}}+Q,$$

(12)

$$F=[\begin{array}{cc}{{\mathbb{I}}}^{3\times 3} & {\rm{d}}t\,{{\mathbb{I}}}^{3\times 3}\\ {0}^{3\times 3} & {{\mathbb{I}}}^{3\times 3}\end{array}],\,Q=[\begin{array}{cc}{\sigma }_{{\rm{p}}{\rm{o}}{\rm{s}}}{{\mathbb{I}}}^{3\times 3} & {0}^{3\times 3}\\ {0}^{3\times 3} & {\sigma }_{{\rm{v}}{\rm{e}}{\rm{l}}}{{\mathbb{I}}}^{3\times 3}\end{array}].$$

(13)

On the basis of measurements, the process noise is set to σpos = 0.05 and σvel = 0.1. The filter state and covariance are initialized to zero. For each measurement zk (pose estimate from a gate detection), the predicted VIO drift $${{\bf{x}}}_{k}^{-}$$ is corrected to the estimate $${{\bf{x}}}_{k}^{+}$$ according to the Kalman filter equations:

$$\begin{array}{c}{K}_{k}\,=\,{P}_{k}^{-}{H}_{k}^{{\rm{\top }}}{({H}_{k}{P}_{k}^{-}{H}_{k}^{{\rm{\top }}}+R)}^{-1}\,,\\ {{\bf{x}}}_{k}^{+}\,=\,{{\bf{x}}}_{k}^{-}+{K}_{k}({{\bf{z}}}_{k}-H({{\bf{x}}}_{k}^{-})),\\ {P}_{k}^{+}\,=\,(I-{K}_{k}{H}_{k}){P}_{k}^{-},\end{array}$$

(14)

in which Kk is the Kalman gain, R is the measurement covariance and Hk is the measurement matrix. If several gates have been detected in a single camera frame, all relative pose estimates are stacked and processed in the same Kalman filter update step. The main source of measurement error is the uncertainty in the gate-corner detection of the network. This error in the image plane results in a pose error when IPPE is applied. We opted for a sampling-based approach to estimate the pose error from the known average gate-corner-detection uncertainty. For each gate, the IPPE algorithm is applied to the nominal gate observation as well as to 20 perturbed gate-corner estimates. The resulting distribution of pose estimates is then used to approximate the measurement covariance R of the gate observation.

### Simulation results

Reaching champion-level performance in autonomous drone racing requires overcoming two challenges: imperfect perception and incomplete models of the system’s dynamics. In controlled experiments in simulation, we assess the robustness of our approach to both of these challenges. To this end, we evaluate performance in a racing task when deployed in four different settings. In setting (1), we simulate a simplistic quadrotor model with access to ground-truth state observations. In setting (2), we replace the ground-truth state observations with noisy observations identified from real-world flights. These noisy observations are generated by sampling one realization from the residual observation model and are independent of the perception awareness of the deployed controller. Settings (3) and (4) share the observation models with the previous two settings, respectively, but replace the simplistic dynamics model with more accurate aerodynamical simulation43. These four settings allow controlled assessment of the sensitivity of the approach to changes in the dynamics and the observation fidelity.

In all four settings, we benchmark our approach against the following baselines: zero-shot, domain randomization and time-optimal. The zero-shot baseline represents a learning-based racing policy35 trained using model-free RL that is deployed zero-shot from the training domain to the test domain. The training domain of the policy is equal to experimental setting (1), that is, idealized dynamics and ground-truth observations. Domain randomization extends the learning strategy from the zero-shot baseline by randomizing observations and dynamics properties to increase robustness. The time-optimal baseline uses a precomputed time-optimal trajectory28 that is tracked using an MPC controller. This approach has shown the best performance in comparison with other model-based methods for time-optimal flight55,56. The dynamics model used by the trajectory generation and the MPC controller matches the simulated dynamics of experimental setting (1).

Performance is assessed by evaluating the fastest lap time, the average and minimum observed gate margin of successfully passed gates and the percentage of track successfully completed. The gate margin metric measures the distance between the drone and the closest point on the gate when crossing the gate plane. A high gate margin indicates that the quadrotor passed close to the centre of the gate. Leaving a smaller gate margin can increase speed but can also increase the risk of collision or missing the gate. Any lap that results in a crash is not considered valid.

The results are summarized in Extended Data Table 1c. All approaches manage to successfully complete the task when deployed in idealized dynamics and ground-truth observations, with the time-optimal baseline yielding the lowest lap time. When deployed in settings that feature domain shift, either in the dynamics or the observations, the performance of all baselines collapses and none of the three baselines are able to complete even a single lap. This performance drop is exhibited by both learning-based and traditional approaches. By contrast, our approach, which features empirical models of dynamics and observation noise, succeeds in all deployment settings, with small increases in lap time.

The key feature that enables our approach to succeed across deployment regimes is the use of an empirical model of dynamics and observation noise, estimated from real-world data. A comparison between an approach that has access to such data and approaches that do not is not entirely fair. For that reason, we also benchmark the performance of all baseline approaches when having access to the same real-world data used by our approach. Specifically, we compare the performance in experimental setting (2), which features the idealized dynamics model but noisy perception. All baseline approaches are provided with the predictions of the same Gaussian process model that we use to characterize observation noise. The results are summarized in Extended Data Table 1b. All baselines benefit from the more realistic observations, yielding higher completion rates. Nevertheless, our approach is the only one that reliably completes the entire track. As well as the predictions of the observation noise model, our approach also takes into account the uncertainty of the model. For an in-depth comparison of the performance of RL versus optimal control in controlled experiments, we refer the reader to ref. 57.

### Fine-tuning for several iterations

We investigate the extent of variations in behaviour across iterations. The findings of our analysis reveal that subsequent fine-tuning operations result in negligible enhancements in performance and alterations in behaviour (Extended Data Fig. 2).

In the following, we provide more details on this investigation. We start by enumerating the fine-tuning steps to provide the necessary notation:

1. 1.

Train policy-0 in simulation.

2. 2.

Deploy policy-0 in the real world. The policy operates on ground-truth data from a motion-capture system.

3. 3.

Identify residuals observed by policy-0 in the real world.

4. 4.

Train policy-1 by fine-tuning policy-0 on the identified residuals.

5. 5.

Deploy policy-1 in the real world. The policy operates only on onboard sensory measurements.

6. 6.

Identify residuals observed by policy-1 in the real world.

7. 7.

Train policy-2 by fine-tuning policy-1 on the identified residuals.

We compare the performance of policy-1 and policy-2 in simulation after fine-tuning on their respective residuals. The results are illustrated in Extended Data Fig. 2. We observe that the difference in distance from gate centres, which is a metric for the safety of the policy, is 0.09 ± 0.08 m. Furthermore, the difference in the time taken to complete a single lap is 0.02 ± 0.02 s. Note that this lap-time difference is substantially smaller than the difference between the single-lap completion times of Swift and the human pilots (0.16 s).

### Drone hardware configuration

The quadrotors used by the human pilots and Swift have the same weight, shape and propulsion. The platform design is based on the Agilicious framework58. Each vehicle has a weight of 870 g and can produce a maximum static thrust of approximately 35 N, which results in a static thrust-to-weight ratio of 4.1. The base of each platform consists of an Armattan Chameleon 6″ main frame that is equipped with T-Motor Velox 2306 motors and 5″, three-bladed propellers. An NVIDIA Jetson TX2 accompanied by a Connect Tech Quasar carrier board provides the main compute resource for the autonomous drones, featuring a six-core CPU running at 2 GHz and a dedicated GPU with 256 CUDA cores running at 1.3 GHz. Although forward passes of the gate-detection network are performed on the GPU, the racing policy is evaluated on the CPU, with one inference pass taking 8 ms. The autonomous drones carry an Intel RealSense Tracking Camera T265 that provides VIO estimates59 at 100 Hz that are fed by USB to the NVIDIA Jetson TX2. The human-piloted drones carry neither a Jetson computer nor a RealSense camera and are instead equipped with a corresponding ballast weight. Control commands in the form of collective thrust and body rates produced by the human pilots or Swift are sent to a commercial flight controller, which runs on an STM32 processor operating at 216 MHz. The flight controller is running Betaflight, an open-source flight-control software45.

### Human pilot impressions

The following quotes convey the impressions of the three human champions who raced against Swift.

Alex Vanover:

• These races will be decided at the split S, it is the most challenging part of the track.

• This was the best race! I was so close to the autonomous drone, I could really feel the turbulence when trying to keep up with it.

Thomas Bitmatta:

• The possibilities are endless, this is the start of something that could change the whole world. On the flip side, I’m a racer, I don’t want anything to be faster than me.

• As you fly faster, you trade off precision for speed.

• It’s inspiring to see the potential of what drones are actually capable of. Soon, the AI drone could even be used as a training tool to understand what would be possible.

Marvin Schaepper:

### Research ethics

The study has been conducted in accordance with the Declaration of Helsinki. The study protocol is exempt from review by an ethics committee according to the rules and regulations of the University of Zurich, because no health-related data has been collected. The participants gave their written informed consent before participating in the study.