Quadrotor simulation
Quadrotor dynamics
To enable large-scale training, we use a high-fidelity simulation of the quadrotor dynamics. This section briefly describes the simulation. The dynamics of the vehicle can be written as
$$\dot{{\bf{x}}}=\left[\begin{array}{c}{\dot{{\bf{p}}}}_{{\mathcal{W}}{\mathcal{B}}}\\ {\dot{{\bf{q}}}}_{{\mathcal{W}}{\mathcal{B}}}\\ {\dot{{\bf{v}}}}_{{\mathcal{W}}}\\ {\dot{{\boldsymbol{\omega }}}}_{{\mathcal{B}}}\\ \dot{{\boldsymbol{\Omega }}}\end{array}\right]=\left[\begin{array}{c}{{\bf{v}}}_{{\mathcal{W}}}\\ {{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}}\cdot \left[\begin{array}{c}0\\ {{\boldsymbol{\omega }}}_{{\mathcal{B}}}/2\end{array}\right]\\ \frac{1}{m}\left({{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}}\odot ({{\bf{f}}}_{{\rm{prop}}}+{{\bf{f}}}_{{\rm{aero}}})\right)+{{\bf{g}}}_{{\mathcal{W}}}\\ {J}^{-1}({{\boldsymbol{\tau }}}_{{\rm{prop}}}+{{\boldsymbol{\tau }}}_{{\rm{mot}}}+{{\boldsymbol{\tau }}}_{{\rm{aero}}}+{{\boldsymbol{\tau }}}_{{\rm{iner}}})\\ \frac{1}{{k}_{{\rm{mot}}}}({{\boldsymbol{\Omega }}}_{{\rm{ss}}}-{\boldsymbol{\Omega }})\end{array}\right],$$
(1)
in which ⊙ represents quaternion rotation and \({{\bf{p}}}_{{\mathcal{W}}{\mathcal{B}}},{{\bf{q}}}_{{\mathcal{W}}{\mathcal{B}}},{{\bf{v}}}_{{\mathcal{W}}}\) and \({{\boldsymbol{\omega }}}_{{\mathcal{B}}}\) denote the position, attitude quaternion, inertial velocity and body rates of the quadcopter, respectively. The motor time constant is k_{mot}, and Ω and Ω_{ss} denote the actual and steady-state motor speeds, respectively. The matrix J is the inertia of the quadcopter and \({{\bf{g}}}_{{\mathcal{W}}}\) denotes the gravity vector. Two forces act on the quadrotor: the lift force f_{prop} generated by the propellers and an aerodynamic force f_{aero} that aggregates all other forces, such as aerodynamic drag, dynamic lift and induced drag. The torque is modelled as the sum of four components: the torque τ_{prop} generated by the individual propeller thrusts, the yaw torque τ_{mot} generated by changes in motor speed, an aerodynamic torque τ_{aero} that accounts for various aerodynamic effects such as blade flapping, and an inertial term τ_{iner}. The individual components are given as
$${{\bf{f}}}_{{\rm{prop}}}=\sum _{i}{{\bf{f}}}_{i},\,{{\boldsymbol{\tau }}}_{{\rm{prop}}}=\sum _{i}{{\boldsymbol{\tau }}}_{i}+{{\bf{r}}}_{{\rm{P}},i}\times {{\bf{f}}}_{i},$$
(2)
$${{\boldsymbol{\tau }}}_{{\rm{mot}}}={J}_{{\rm{m}}+{\rm{p}}}\sum _{i}\,{{\boldsymbol{\zeta }}}_{i}{\dot{\Omega }}_{i},\,{{\boldsymbol{\tau }}}_{{\rm{iner}}}={{\boldsymbol{\omega }}}_{{\mathcal{B}}}\times {\bf{J}}{{\boldsymbol{\omega }}}_{{\mathcal{B}}}$$
(3)
in which r_{P,i} is the location of propeller i, expressed in the body frame, and f_{i} and τ_{i} are the forces and torques, respectively, generated by the ith propeller. The axis of rotation of the ith motor is denoted by ζ_{i}, the combined inertia of the motor and propeller is J_{m+p} and the derivative of the ith motor speed is \({\dot{\Omega }}_{i}\). The individual propellers are modelled using a commonly used quadratic model, which assumes that the lift force and drag torque are proportional to the square of the propeller speed Ω_{i}:
$${{\bf{f}}}_{i}({\Omega }_{i})={\left[\begin{array}{ccc}0 & 0 & {c}_{{\rm{l}}}\cdot {\Omega }_{i}^{2}\end{array}\right]}^{\top },\quad {{\boldsymbol{\tau }}}_{i}({\Omega }_{i})={\left[\begin{array}{ccc}0 & 0 & {c}_{{\rm{d}}}\cdot {\Omega }_{i}^{2}\end{array}\right]}^{\top }$$
(4)
in which c_{l} and c_{d} denote the propeller lift and drag coefficients, respectively.
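As an illustrative sketch, the quadratic propeller model of equation (4) and the aggregation of equation (2) can be written in a few lines of Python. The coefficients, propeller geometry and spin directions below are placeholder assumptions, not the identified parameters of the platform:

```python
import numpy as np

# Placeholder propeller coefficients (assumed, not the identified values).
C_L = 1.5e-6   # lift coefficient c_l
C_D = 2.0e-8   # drag coefficient c_d

def propeller_wrench(omega):
    """Lift force and drag torque of one propeller at speed omega (equation (4))."""
    f = np.array([0.0, 0.0, C_L * omega**2])
    tau = np.array([0.0, 0.0, C_D * omega**2])
    return f, tau

def collective_wrench(omegas, r_props, spin_dirs):
    """Aggregate per-propeller forces and torques as in equation (2).

    r_props: (4, 3) propeller positions r_P,i in the body frame.
    spin_dirs: +1/-1 per motor, sign of the drag torque about the body z-axis.
    """
    f_prop, tau_prop = np.zeros(3), np.zeros(3)
    for omega, r, s in zip(omegas, r_props, spin_dirs):
        f, tau = propeller_wrench(omega)
        f_prop += f
        tau_prop += s * tau + np.cross(r, f)
    return f_prop, tau_prop
```

For a symmetric configuration with alternating spin directions and equal motor speeds, the torque contributions cancel and only the collective lift remains, as expected at hover.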
Aerodynamic forces and torques
The aerodynamic forces and torques are difficult to model with a first-principles approach. We thus use a data-driven model^{43}. To maintain the low computational complexity required for large-scale RL training, a grey-box polynomial model is used rather than a neural network. The aerodynamic effects are assumed to depend primarily on the velocity \({{\bf{v}}}_{{\mathcal{B}}}\) (in the body frame) and the average squared motor speed \(\overline{{\Omega }^{2}}\). The aerodynamic forces f_{x}, f_{y} and f_{z} and torques τ_{x}, τ_{y} and τ_{z} are estimated in the body frame. The quantities v_{x}, v_{y} and v_{z} denote the three axial velocity components (in the body frame) and v_{xy} denotes the speed in the (x, y) plane of the quadrotor. On the basis of insights into the underlying physical processes, linear and quadratic combinations of the individual terms are selected. For readability, the coefficients multiplying each summand have been omitted:
$$\begin{array}{c}{f}_{x}\,\sim \,{v}_{x}+{v}_{x}{v}_{x}+\bar{{\Omega }^{2}}+{v}_{x}\,\bar{{\Omega }^{2}}\\ {f}_{y}\,\sim \,{v}_{y}+{v}_{y}{v}_{y}+\bar{{\Omega }^{2}}+{v}_{y}\,\bar{{\Omega }^{2}}\\ {f}_{z}\,\sim \,{v}_{z}+{v}_{z}{v}_{z}+{v}_{xy}+{v}_{xy}^{2}+{v}_{xy}\,\bar{{\Omega }^{2}}+{v}_{z}\,\bar{{\Omega }^{2}}+{v}_{xy}\,{v}_{z}\,\bar{{\Omega }^{2}}\\ \,{\tau }_{x}\,\sim \,{v}_{y}+{v}_{y}{v}_{y}+\bar{{\Omega }^{2}}+{v}_{y}\,\bar{{\Omega }^{2}}+{v}_{y}{v}_{y}\,\bar{{\Omega }^{2}}\\ \,{\tau }_{y}\,\sim \,{v}_{x}+{v}_{x}{v}_{x}+\bar{{\Omega }^{2}}+{v}_{x}\,\bar{{\Omega }^{2}}+{v}_{x}{v}_{x}\,\bar{{\Omega }^{2}}\\ \,{\tau }_{z}\,\sim \,{v}_{x}+{v}_{y}\end{array}$$
The respective coefficients are then identified from real-world flight data, in which motion capture is used to provide ground-truth force and torque measurements. We use data from the race track, allowing the dynamics model to fit the track. This is akin to human pilots training on the specific track for days or weeks before a race. In our case, the human pilots were given a week of practice on the same track ahead of the competition.
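This identification step amounts to a linear least-squares fit over the polynomial features listed above. A minimal sketch for f_x only, with illustrative function names and synthetic (not real) measurements:

```python
import numpy as np

def fx_features(v_x, omega_sq_mean):
    """Feature vector for f_x: v_x, v_x*v_x, mean(Omega^2), v_x*mean(Omega^2)."""
    return np.stack([v_x, v_x * v_x, omega_sq_mean, v_x * omega_sq_mean], axis=-1)

def fit_coefficients(v_x, omega_sq_mean, fx_measured):
    """Identify the polynomial coefficients from flight-data measurements
    by ordinary least squares; in the paper the targets come from motion
    capture, here they would be supplied by the user."""
    A = fx_features(v_x, omega_sq_mean)
    coeffs, *_ = np.linalg.lstsq(A, fx_measured, rcond=None)
    return coeffs
```

The fits for the other force and torque components follow the same pattern with their respective feature sets.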
Betaflight low-level controller
To control the quadrotor, the neural network outputs collective thrust and body rates. This control signal is known to combine high agility with good robustness to the simulation-to-reality transfer^{44}. The predicted collective thrust and body rates are then processed by an onboard low-level controller that computes individual motor commands, which are subsequently translated into analogue voltage signals by an electronic speed controller (ESC) that drives the motors. On the physical vehicle, this low-level proportional–integral–derivative (PID) controller and the ESC are implemented using the open-source Betaflight and BLHeli32 firmware^{45}. In simulation, we use an accurate model of both the low-level controller and the motor speed controller.
Because the Betaflight PID controller has been optimized for human-piloted flight, it exhibits some peculiarities, which the simulation correctly captures: the reference for the D-term is constantly zero (pure damping), the I-term is reset when the throttle is cut and, under motor thrust saturation, body rate control is given priority (all motor signals are proportionally downscaled to avoid saturation). The gains of the controller used for simulation have been identified from detailed logs of the Betaflight controller's internal states. The simulation can predict the individual motor commands with less than 1% error.
Battery model and ESC
The low-level controller converts the individual motor commands into a pulse-width modulation (PWM) signal and sends it to the ESC, which controls the motors. Because the ESC does not perform closed-loop control of the motor speeds, the steady-state motor speed Ω_{i,ss} for a given PWM motor command u_{cmd,i} is a function of the battery voltage. Our simulation thus models the battery voltage using a grey-box battery model^{46} that simulates the voltage based on the instantaneous power consumption P_{mot}:
$${P}_{{\rm{mot}}}=\frac{{c}_{{\rm{d}}}{\Omega }^{3}}{\eta }$$
(5)
The battery model^{46} then simulates the battery voltage based on this power demand. Given the battery voltage U_{bat} and the individual motor command u_{cmd,i}, we use the mapping (again omitting the coefficients multiplying each summand)
$${\Omega }_{i,{\rm{ss}}}\sim 1+{U}_{{\rm{bat}}}+\sqrt{{u}_{{\rm{cmd}},i}}+{u}_{{\rm{cmd}},i}+{U}_{{\rm{bat}}}\sqrt{{u}_{{\rm{cmd}},i}}$$
(6)
to calculate the corresponding steady-state motor speed Ω_{i,ss} required for the dynamics simulation in equation (1). The coefficients have been identified from Betaflight logs containing measurements of all involved quantities. Together with the model of the low-level controller, this enables the simulator to correctly translate an action in the form of collective thrust and body rates into desired motor speeds Ω_{ss} in equation (1).
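Evaluating the mapping in equation (6) amounts to a dot product with a small feature vector. A sketch with assumed placeholder coefficients (the real ones are identified from Betaflight logs):

```python
import numpy as np

# Assumed placeholder coefficients for the five summands in equation (6):
# [1, U_bat, sqrt(u_cmd), u_cmd, U_bat * sqrt(u_cmd)].
A = np.array([100.0, 5.0, 800.0, 300.0, 40.0])

def steady_state_speed(u_cmd, u_bat):
    """Steady-state motor speed Omega_ss for a normalized motor command
    u_cmd in [0, 1] and battery voltage u_bat [V]."""
    features = np.array([1.0, u_bat, np.sqrt(u_cmd), u_cmd, u_bat * np.sqrt(u_cmd)])
    return float(A @ features)
```

With positive coefficients the speed grows monotonically with the command, and the battery-voltage terms capture the sag-dependent thrust loss over a race.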
Policy training
We train deep neural control policies that directly map observations o_{t}, in the form of platform state and next-gate observation, to control actions u_{t}, in the form of mass-normalized collective thrust and body rates^{44}. The control policies are trained using model-free RL in simulation.
Training algorithm
Training is performed using proximal policy optimization^{31}. This actorcritic approach requires jointly optimizing two neural networks during training: the policy network, which maps observations to actions, and the value network, which serves as the ‘critic’ and evaluates actions taken by the policy. After training, only the policy network is deployed on the robot.
Observations, actions and rewards
An observation \({{\bf{o}}}_{t}\in {{\mathbb{R}}}^{31}\) obtained from the environment at time t consists of: (1) an estimate of the current robot state; (2) the relative pose of the next gate to be passed on the track layout; and (3) the action applied in the previous step. Specifically, the estimate of the robot state contains the position of the platform, its velocity and its attitude represented by a rotation matrix, resulting in a vector in \({{\mathbb{R}}}^{15}\). Although the simulation uses quaternions internally, we use a rotation matrix to represent attitude to avoid ambiguities^{47}. The relative pose of the next gate is encoded by providing the relative position of the four gate corners with respect to the vehicle, resulting in a vector in \({{\mathbb{R}}}^{12}\). All observations are normalized before being passed to the networks. Because the value network is only used at training time, it can access privileged information about the environment that is not accessible to the policy^{48}. This privileged information is concatenated with the other inputs to the value network and contains the exact position, orientation and velocity of the robot.
For each observation o_{t}, the policy network produces an action \({{\bf{a}}}_{t}\in {{\mathbb{R}}}^{4}\) in the form of desired mass-normalized collective thrust and body rates.
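The observation layout described above can be sketched as follows; the function name is illustrative and the normalization step is omitted for brevity:

```python
import numpy as np

def build_observation(p, R, v, gate_corners, prev_action):
    """Assemble the 31-dimensional observation.

    p: (3,) position, R: (3, 3) attitude rotation matrix, v: (3,) velocity,
    gate_corners: (4, 3) corner positions of the next gate relative to the
    vehicle, prev_action: (4,) previous collective thrust and body rates.
    """
    state = np.concatenate([p, R.reshape(-1), v])     # 15 state entries
    gate = gate_corners.reshape(-1)                   # 12 gate-corner entries
    return np.concatenate([state, gate, prev_action]) # 31 entries in total
```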
We use a dense shaped reward formulation to learn the task of perceptionaware autonomous drone racing. The reward r_{t} at time step t is given by
$${r}_{t}={r}_{t}^{{\rm{prog}}}+{r}_{t}^{{\rm{perc}}}+{r}_{t}^{{\rm{cmd}}}+{r}_{t}^{{\rm{crash}}}$$
(7)
in which r^{prog} rewards progress towards the next gate^{35}, r^{perc} encodes perception awareness by adjusting the attitude of the vehicle such that the optical axis of the camera points towards the centre of the next gate, r^{cmd} rewards smooth actions and r^{crash} is a binary penalty that is only active when colliding with a gate or when the platform leaves a predefined bounding box. If r^{crash} is triggered, the training episode ends.
Specifically, the reward terms are
$$\begin{array}{r}{r}_{t}^{{\rm{prog}}}\,=\,{\lambda }_{1}\left[{d}_{t-1}^{{\rm{Gate}}}-{d}_{t}^{{\rm{Gate}}}\right]\\ {r}_{t}^{{\rm{perc}}}\,=\,{\lambda }_{2}\exp \left[{\lambda }_{3}\cdot {\delta }_{{\rm{cam}}}^{4}\right]\end{array}$$
(8)
$${r}_{t}^{{\rm{cmd}}}={\lambda }_{4}\parallel {{\bf{a}}}_{t}^{\omega }\parallel +{\lambda }_{5}\parallel {{\bf{a}}}_{t}-{{\bf{a}}}_{t-1}{\parallel }^{2}$$
(9)
$${r}_{t}^{{\rm{crash}}}=\left\{\begin{array}{ll}-5.0, & {\rm{if}}\,{p}_{z} < 0\,{\rm{or\; in\; collision\; with\; gate}}\\ 0, & {\rm{otherwise}}\end{array}\right.$$
in which \({d}_{t}^{{\rm{Gate}}}\) denotes the distance from the centre of mass of the vehicle to the centre of the next gate at time step t, δ_{cam} represents the angle between the optical axis of the camera and the centre of the next gate and \({{\bf{a}}}_{t}^{\omega }\) are the commanded body rates. The hyperparameters λ_{1},…, λ_{5} balance different terms (Extended Data Table 1a).
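A sketch of the shaped reward, with placeholder λ values assumed for illustration (the identified values are listed in Extended Data Table 1a) and the crash condition reduced to a boolean flag:

```python
import numpy as np

# Assumed placeholder hyperparameters, not the values in Extended Data Table 1a.
LAM = dict(l1=1.0, l2=0.02, l3=-10.0, l4=-2e-4, l5=-1e-4)

def reward(d_prev, d_curr, delta_cam, body_rates, a_t, a_prev, crashed):
    """Shaped racing reward: progress, perception, command smoothness, crash."""
    r_prog = LAM["l1"] * (d_prev - d_curr)                  # progress to gate
    r_perc = LAM["l2"] * np.exp(LAM["l3"] * delta_cam**4)   # camera on gate
    r_cmd = (LAM["l4"] * np.linalg.norm(body_rates)
             + LAM["l5"] * np.linalg.norm(a_t - a_prev) ** 2)  # smooth actions
    r_crash = -5.0 if crashed else 0.0                      # terminal penalty
    return r_prog + r_perc + r_cmd + r_crash
```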
Training details
Data collection is performed by simulating 100 agents in parallel that interact with the environment in episodes of 1,500 steps. At each environment reset, every agent is initialized at a random gate on the track, with a bounded perturbation around a state previously observed when passing this gate. In contrast to previous work^{44,49,50}, we do not randomize the platform dynamics at training time. Instead, we perform fine-tuning based on real-world data. The training environment is implemented using TensorFlow Agents^{51}. The policy network and the value network are both represented by two-layer perceptrons with 128 nodes in each layer and LeakyReLU activations with a negative slope of 0.2. Network parameters are optimized using the Adam optimizer with a learning rate of 3 × 10^{−4} for both networks.
Policies are trained for a total of 1 × 10^{8} environment interactions, which takes 50 min on a workstation (i9 12900K, RTX 3090, 32 GB DDR5 RAM). Fine-tuning is performed for 2 × 10^{7} environment interactions.
Residual model identification
We fine-tune the original policy based on a small amount of data collected in the real world. Specifically, we collect three full rollouts in the real world, corresponding to approximately 50 s of flight time. We fine-tune the policy by identifying residual observations and residual dynamics, which are then used for training in simulation. During this fine-tuning phase, only the weights of the control policy are updated, whereas the weights of the gate-detection network are kept constant.
Residual observation model
Navigating at high speeds results in substantial motion blur, which can lead to a loss of tracked visual features and severe drift in linear odometry estimates. We fine-tune policies with an odometry model that is identified from only a handful of trials recorded in the real world. To model the drift in odometry, we use Gaussian processes^{36}, as they allow fitting a posterior distribution of odometry perturbations, from which we can sample temporally consistent realizations.
Specifically, the Gaussian process model fits residual position, velocity and attitude as a function of the ground-truth robot state. The observation residuals are identified by comparing the visual–inertial odometry (VIO) estimates observed during a real-world rollout with the ground-truth platform states, which are obtained from an external motion-tracking system.
We treat each dimension of the observation separately, effectively fitting a set of nine 1D Gaussian processes to the observation residuals. We use a mixture of radial basis function kernels
$$\kappa ({{\bf{z}}}_{i},{{\bf{z}}}_{j})={\sigma }_{{\rm{f}}}^{2}\,\exp \left(-\frac{1}{2}{({{\bf{z}}}_{i}-{{\bf{z}}}_{j})}^{{\rm{\top }}}{L}^{-2}({{\bf{z}}}_{i}-{{\bf{z}}}_{j})\right)+{\sigma }_{{\rm{n}}}^{2}$$
(10)
in which L is the diagonal length scale matrix, σ_{f} and σ_{n} represent the data and prior noise variance, respectively, and z_{i} and z_{j} represent data features. The kernel hyperparameters are optimized by maximizing the log marginal likelihood. After kernel hyperparameter optimization, we sample new realizations from the posterior distribution, which are then used during fine-tuning of the policy. Extended Data Fig. 1 illustrates the residual observations in position, velocity and attitude in real-world rollouts, as well as 100 sampled realizations from the Gaussian process model.
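The fit-and-sample procedure can be sketched for a single 1D Gaussian process; the hyperparameters below are fixed by hand for illustration rather than optimized by marginal likelihood as in the paper:

```python
import numpy as np

def rbf_kernel(z1, z2, sigma_f=1.0, length=0.2):
    """1D radial basis function kernel (noise-free part of equation (10))."""
    d = z1[:, None] - z2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / length) ** 2)

def sample_posterior(z_train, y_train, z_test, sigma_n=0.01, n_samples=100, seed=0):
    """Condition the GP on observed residuals and draw temporally
    consistent realizations from the posterior."""
    K = rbf_kernel(z_train, z_train) + sigma_n**2 * np.eye(len(z_train))
    Ks = rbf_kernel(z_train, z_test)
    Kss = rbf_kernel(z_test, z_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(z_test)),
                                      size=n_samples)
    return mean, samples
```

In the paper, nine such 1D processes (position, velocity and attitude residuals) are fitted and 100 realizations are drawn for fine-tuning.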
Residual dynamics model
We use a residual model to complement the simulated robot dynamics^{52}. Specifically, we identify residual accelerations as a function of the platform state s and the commanded massnormalized collective thrust c:
$${{\bf{a}}}_{{\rm{res}}}={\rm{KNN}}({\bf{s}},c)$$
(11)
We use k-nearest-neighbour regression with k = 5. The size of the dataset used for identifying the residual dynamics model depends on the track layout; it ranges between 800 and 1,000 samples for the track used in this work.
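A sketch of the k-nearest-neighbour residual model in equation (11), written in plain NumPy (the paper does not specify the implementation used):

```python
import numpy as np

def knn_residual(query, data_inputs, data_residuals, k=5):
    """Residual acceleration for a query point as the mean residual of its
    k nearest neighbours in the identification dataset.

    query: (d,) concatenated [state, collective thrust] features.
    data_inputs: (n, d) features of the identification dataset.
    data_residuals: (n, 3) measured residual accelerations.
    """
    dists = np.linalg.norm(data_inputs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return data_residuals[nearest].mean(axis=0)
```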
Gate detection
To correct for drift accumulated by the VIO pipeline, the gates are used as distinct landmarks for relative localization. Specifically, gates are detected in the onboard camera view by segmenting gate corners^{26}. The greyscale images provided by the Intel RealSense Tracking Camera T265 are used as input to the gate detector. The architecture of the segmentation network is a six-level U-Net^{53} with (8, 16, 16, 16, 16, 16) convolutional filters of size (3, 3, 3, 5, 7, 7) per level and a final extra layer, containing 12 filters, operating on the output of the U-Net. As the activation function, LeakyReLU with α = 0.01 is used. For deployment on the NVIDIA Jetson TX2, the network is ported to TensorRT. To optimize memory footprint and computation time, inference is performed in half-precision mode (FP16) and images are downsampled to 384 × 384 pixels before being fed to the network. One forward pass through the network takes 40 ms on the NVIDIA Jetson TX2.
VIO drift estimation
The odometry estimates from the VIO pipeline^{54} exhibit substantial drift during high-speed flight. We use gate detections to stabilize the pose estimates produced by VIO. The gate detector outputs the coordinates of the corners of all visible gates. A relative pose is first estimated for each predicted gate using infinitesimal plane-based pose estimation (IPPE)^{34}. Given this relative pose estimate, each gate observation is assigned to the closest gate in the known track layout, yielding a pose estimate for the drone.
Owing to the low frequency of the gate detections and the high quality of the VIO orientation estimate, we only refine the translational components of the VIO measurements. We estimate and correct for the drift of the VIO pipeline using a Kalman filter that estimates the translational drift p_{d} (position offset) and its derivative, the drift velocity v_{d}. The drift correction is performed by subtracting the estimated drift states p_{d} and v_{d} from the corresponding VIO estimates. The Kalman filter state x is given by \({\bf{x}}={[{{\bf{p}}}_{{\rm{d}}}^{\top },{{\bf{v}}}_{{\rm{d}}}^{\top }]}^{\top }\in {{\mathbb{R}}}^{6}\).
The state x and covariance P updates are given by:
$${{\bf{x}}}_{k+1}=F{{\bf{x}}}_{k},\,{P}_{k+1}=F{P}_{k}{F}^{{\rm{\top }}}+Q,$$
(12)
$$F=[\begin{array}{cc}{{\mathbb{I}}}^{3\times 3} & {\rm{d}}t\,{{\mathbb{I}}}^{3\times 3}\\ {0}^{3\times 3} & {{\mathbb{I}}}^{3\times 3}\end{array}],\,Q=[\begin{array}{cc}{\sigma }_{{\rm{p}}{\rm{o}}{\rm{s}}}{{\mathbb{I}}}^{3\times 3} & {0}^{3\times 3}\\ {0}^{3\times 3} & {\sigma }_{{\rm{v}}{\rm{e}}{\rm{l}}}{{\mathbb{I}}}^{3\times 3}\end{array}].$$
(13)
On the basis of measurements, the process noise is set to σ_{pos} = 0.05 and σ_{vel} = 0.1. The filter state and covariance are initialized to zero. For each measurement z_{k} (pose estimate from a gate detection), the predicted VIO drift \({{\bf{x}}}_{k}^{-}\) is corrected to the estimate \({{\bf{x}}}_{k}^{+}\) according to the Kalman filter equations:
$$\begin{array}{c}{K}_{k}\,=\,{P}_{k}^{-}{H}_{k}^{{\rm{\top }}}{({H}_{k}{P}_{k}^{-}{H}_{k}^{{\rm{\top }}}+R)}^{-1},\\ {{\bf{x}}}_{k}^{+}\,=\,{{\bf{x}}}_{k}^{-}+{K}_{k}({{\bf{z}}}_{k}-{H}_{k}{{\bf{x}}}_{k}^{-}),\\ {P}_{k}^{+}\,=\,(I-{K}_{k}{H}_{k}){P}_{k}^{-},\end{array}$$
(14)
in which K_{k} is the Kalman gain, R is the measurement covariance and H_{k} is the measurement matrix. If several gates have been detected in a single camera frame, all relative pose estimates are stacked and processed in the same Kalman filter update step. The main source of measurement error is the uncertainty of the network's gate-corner detections. This error in the image plane results in a pose error when IPPE is applied. We opt for a sampling-based approach to estimate the pose error from the known average gate-corner-detection uncertainty. For each gate, the IPPE algorithm is applied to the nominal gate observation as well as to 20 perturbed gate-corner estimates. The resulting distribution of pose estimates is then used to approximate the measurement covariance R of the gate observation.
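The drift filter in equations (12)-(14) can be sketched directly; the detection interval and the initial covariance used in the test are assumed values for illustration:

```python
import numpy as np

DT = 0.033                      # assumed detection interval [s]
SIGMA_POS, SIGMA_VEL = 0.05, 0.1

# Constant-velocity drift model: state x = [p_d, v_d] in R^6 (equation (13)).
F = np.block([[np.eye(3), DT * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])
Q = np.block([[SIGMA_POS * np.eye(3), np.zeros((3, 3))],
              [np.zeros((3, 3)), SIGMA_VEL * np.eye(3)]])
H = np.hstack([np.eye(3), np.zeros((3, 3))])  # gates observe position drift only

def predict(x, P):
    """Propagate drift state and covariance (equation (12))."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Correct the predicted drift with a gate-based measurement (equation (14))."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(6) - K @ H) @ P
    return x_new, P_new
```

Starting from a zero state and covariance, a single predict-update cycle already pulls the estimated position drift most of the way towards the measured offset.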
Simulation results
Reaching champion-level performance in autonomous drone racing requires overcoming two challenges: imperfect perception and incomplete models of the system's dynamics. In controlled experiments in simulation, we assess the robustness of our approach to both of these challenges. To this end, we evaluate performance in a racing task when deployed in four different settings. In setting (1), we simulate a simplistic quadrotor model with access to ground-truth state observations. In setting (2), we replace the ground-truth state observations with noisy observations identified from real-world flights. These noisy observations are generated by sampling one realization from the residual observation model and are independent of the perception awareness of the deployed controller. Settings (3) and (4) share the observation models of the previous two settings, respectively, but replace the simplistic dynamics model with a more accurate aerodynamic simulation^{43}. These four settings allow a controlled assessment of the sensitivity of the approach to changes in dynamics and observation fidelity.
In all four settings, we benchmark our approach against the following baselines: zero-shot, domain randomization and time-optimal. The zero-shot baseline represents a learning-based racing policy^{35} trained using model-free RL that is deployed zero-shot from the training domain to the test domain. The training domain of the policy is equal to experimental setting (1), that is, idealized dynamics and ground-truth observations. Domain randomization extends the learning strategy of the zero-shot baseline by randomizing observations and dynamics properties to increase robustness. The time-optimal baseline uses a precomputed time-optimal trajectory^{28} that is tracked using a model predictive control (MPC) controller. This approach has shown the best performance in comparison with other model-based methods for time-optimal flight^{55,56}. The dynamics model used by the trajectory generation and the MPC controller matches the simulated dynamics of experimental setting (1).
Performance is assessed by evaluating the fastest lap time, the average and minimum observed gate margin of successfully passed gates and the percentage of track successfully completed. The gate margin metric measures the distance between the drone and the closest point on the gate when crossing the gate plane. A high gate margin indicates that the quadrotor passed close to the centre of the gate. Leaving a smaller gate margin can increase speed but can also increase the risk of collision or missing the gate. Any lap that results in a crash is not considered valid.
The results are summarized in Extended Data Table 1c. All approaches manage to successfully complete the task when deployed in idealized dynamics and groundtruth observations, with the timeoptimal baseline yielding the lowest lap time. When deployed in settings that feature domain shift, either in the dynamics or the observations, the performance of all baselines collapses and none of the three baselines are able to complete even a single lap. This performance drop is exhibited by both learningbased and traditional approaches. By contrast, our approach, which features empirical models of dynamics and observation noise, succeeds in all deployment settings, with small increases in lap time.
The key feature that enables our approach to succeed across deployment regimes is its use of empirical models of dynamics and observation noise, estimated from real-world data. A comparison between an approach that has access to such data and approaches that do not is not entirely fair. For that reason, we also benchmark the performance of all baseline approaches when given access to the same real-world data used by our approach. Specifically, we compare performance in experimental setting (2), which features the idealized dynamics model but noisy perception. All baseline approaches are provided with the predictions of the same Gaussian process model that we use to characterize observation noise. The results are summarized in Extended Data Table 1b. All baselines benefit from the more realistic observations, yielding higher completion rates. Nevertheless, our approach is the only one that reliably completes the entire track. Our approach takes into account not only the predictions of the observation-noise model but also the uncertainty of the model. For an in-depth comparison of the performance of RL versus optimal control in controlled experiments, we refer the reader to ref. ^{57}.
Fine-tuning for several iterations
We investigate the extent of variations in behaviour across fine-tuning iterations. Our analysis reveals that subsequent fine-tuning operations result in negligible changes in performance and behaviour (Extended Data Fig. 2).
In the following, we provide more details on this investigation. We start by enumerating the fine-tuning steps to establish the necessary notation:

1. Train policy0 in simulation.

2. Deploy policy0 in the real world. The policy operates on ground-truth data from a motion-capture system.

3. Identify residuals observed by policy0 in the real world.

4. Train policy1 by fine-tuning policy0 on the identified residuals.

5. Deploy policy1 in the real world. The policy operates only on onboard sensory measurements.

6. Identify residuals observed by policy1 in the real world.

7. Train policy2 by fine-tuning policy1 on the identified residuals.
We compare the performance of policy1 and policy2 in simulation after fine-tuning on their respective residuals. The results are illustrated in Extended Data Fig. 2. We observe that the difference in distance from gate centres, which is a metric for the safety of the policy, is 0.09 ± 0.08 m. Furthermore, the difference in the time taken to complete a single lap is 0.02 ± 0.02 s. Note that this lap-time difference is substantially smaller than the difference between the single-lap completion times of Swift and the human pilots (0.16 s).
Drone hardware configuration
The quadrotors used by the human pilots and Swift have the same weight, shape and propulsion. The platform design is based on the Agilicious framework^{58}. Each vehicle weighs 870 g and can produce a maximum static thrust of approximately 35 N, resulting in a static thrust-to-weight ratio of 4.1. The base of each platform is an Armattan Chameleon 6″ main frame equipped with T-Motor Velox 2306 motors and 5″, three-bladed propellers. An NVIDIA Jetson TX2, accompanied by a Connect Tech Quasar carrier board, provides the main compute resource for the autonomous drones, featuring a six-core CPU running at 2 GHz and a dedicated GPU with 256 CUDA cores running at 1.3 GHz. Although forward passes of the gate-detection network are performed on the GPU, the racing policy is evaluated on the CPU, with one inference pass taking 8 ms. The autonomous drones carry an Intel RealSense Tracking Camera T265 that provides VIO estimates^{59} at 100 Hz, which are fed via USB to the NVIDIA Jetson TX2. The human-piloted drones carry neither a Jetson computer nor a RealSense camera and are instead equipped with a corresponding ballast weight. Control commands in the form of collective thrust and body rates produced by the human pilots or Swift are sent to a commercial flight controller, which runs on an STM32 processor operating at 216 MHz. The flight controller runs Betaflight, an open-source flight-control software^{45}.
Human pilot impressions
The following quotes convey the impressions of the three human champions who raced against Swift.
Alex Vanover:

"These races will be decided at the Split-S, it is the most challenging part of the track."

"This was the best race! I was so close to the autonomous drone, I could really feel the turbulence when trying to keep up with it."

Thomas Bitmatta:

"The possibilities are endless, this is the start of something that could change the whole world. On the flip side, I'm a racer, I don't want anything to be faster than me."

"As you fly faster, you trade off precision for speed."

"It's inspiring to see the potential of what drones are actually capable of. Soon, the AI drone could even be used as a training tool to understand what would be possible."

Marvin Schaepper:
Research ethics
The study has been conducted in accordance with the Declaration of Helsinki. The study protocol is exempt from review by an ethics committee according to the rules and regulations of the University of Zurich, because no health-related data have been collected. The participants gave their written informed consent before participating in the study.