EV-Pose: Event-Camera-Enhanced High-Frequency Visual Positioning Service for Accurate Drone Landing

IEEE TMC Submission

Demo video

EV-Pose estimates a drone's 6-DoF pose by redesigning drone-oriented VPS with event cameras. Compared to conventional VPS systems, EV-Pose enables rapid and high-frequency drone pose tracking, ensuring precise flight control and landing.



Abstract

Real-time 6-DoF drone pose tracking enables precise flight control and accurate drone landing. With the widespread availability of urban 3D maps, the Visual Positioning Service (VPS) has been adapted for drone landing to overcome GPS unreliability, serving as a global localization system that yields absolute global coordinates by continuously aligning visual features with the prior 3D map. However, deploying conventional vision-based VPS on highly dynamic drones faces bottlenecks in both pose estimation accuracy and efficiency. In this work, we pioneer EV-Pose, an event camera-enhanced VPS designed to deliver drift-free, absolute global coordinates at ultra-high frequencies for drones. EV-Pose addresses the 2D-3D modality gap via a novel Spatio-Temporal Feature-instructed Pose Estimation module, which extracts a Temporal Distance Field (TDF) to enable continuous, differentiable matching with a prior 3D point map for pose estimation. To fully exploit this formulation, we propose a Motion-aware Hierarchical Fusion and Optimization scheme. This architecture utilizes onboard IMU motion information for fine-grained early-stage event filtering and seamlessly optimizes pose estimation in a later-stage factor graph. Evaluation shows that EV-Pose achieves a rotation accuracy of 1.76$\degree$ and a translation accuracy of 7.5 mm with a latency of 10.08 ms, outperforming baselines by >40% and enabling accurate drone landings.

Event Camera Preliminary

Event cameras are bio-inspired sensors that differ fundamentally from traditional frame cameras. Frame cameras capture synchronous images with a global shutter at fixed time intervals, whereas an event camera outputs asynchronous event streams with $ms$-level temporal resolution, as shown in Fig. (a). Each pixel of an event camera responds to brightness changes independently and asynchronously. Each event \( e = (\boldsymbol{x}, i, p) \) indicates that the pixel at location \(\boldsymbol{x} = (u, v)\) underwent a brightness change of a predefined magnitude at time \( i \), as shown in Fig. (b). The polarity \( p \) encodes the direction of the intensity change: 'ON' for an increase (brighter) and 'OFF' for a decrease (darker).
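To make the event tuple concrete, the following minimal sketch mirrors the \( e = (\boldsymbol{x}, i, p) \) notation above and accumulates a stream into a simple per-pixel polarity-count image. The `Event` type and `accumulate` function are illustrative names, not part of EV-Pose.

```python
from collections import namedtuple

# Hypothetical minimal event type mirroring e = (x, i, p) from the text:
# pixel location (u, v), timestamp t, and polarity p (+1 = 'ON', -1 = 'OFF').
Event = namedtuple("Event", ["u", "v", "t", "p"])

def accumulate(events, width, height):
    """Sum event polarities per pixel to form a simple event-count image."""
    img = [[0] * width for _ in range(height)]
    for e in events:
        img[e.v][e.u] += e.p
    return img

stream = [Event(1, 0, 0.001, +1), Event(1, 0, 0.002, +1), Event(2, 1, 0.003, -1)]
frame = accumulate(stream, width=4, height=3)
# frame[0][1] == 2 (two ON events), frame[1][2] == -1 (one OFF event)
```

Note that summing polarities discards the timestamps; the time-surface representation introduced in §4.1 keeps them, which is what enables the temporal distance field.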


RGB camera-based VPS & event camera-enhanced VPS

(a) Current VPS uses an RGB camera, an IMU, and 3D point clouds for pose estimation.
(b) EV-Pose leverages event cameras for accurate and low-latency 6-DoF drone pose tracking.


System Overview

From a top-level perspective, we design EV-Pose, an event-based 6-DoF pose tracking system for drones that redesigns the current VPS pipeline with an event camera. EV-Pose leverages prior 3D point maps and the temporal consistency between the event camera and the IMU to achieve accurate, low-latency 6-DoF drone pose tracking.

(i) Spatio-Temporal feature-instructed Pose Estimation (STPE) module (§4). This module first introduces the concept of a separated-polarity time surface, a novel spatio-temporal representation for event streams (§4.1). Subsequently, it leverages the temporal relationships among events encoded in the time surface to generate a distance field, which is then used as a feature representation for the event stream (§4.2). Finally, the 2D event-3D point map matching module formulates drone pose estimation as aligning the event stream's distance-field feature with the 3D point map, thus enabling absolute pose estimation of the drone (§4.3).
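The two representations above can be sketched as follows. This is a simplified, pure-Python reading of the idea: the time surface stores the most recent timestamp per pixel, separated by polarity, and the distance field assigns each pixel its distance to the nearest recently-fired pixel. The exact decay, normalisation, and TDF formulation in §4.1-§4.2 may differ; all function names here are illustrative.

```python
import math

def time_surface(events, width, height):
    """Separated-polarity time surface: per polarity, store the most
    recent event timestamp at each pixel (a sketch of §4.1)."""
    ts = {+1: [[None] * width for _ in range(height)],
          -1: [[None] * width for _ in range(height)]}
    for (u, v, t, p) in events:
        ts[p][v][u] = t  # later events overwrite earlier ones
    return ts

def distance_field(surface, t_now, window):
    """Brute-force field: each pixel gets its Euclidean distance to the
    nearest pixel that fired within `window` of t_now (one plausible
    reading of the temporal distance field in §4.2)."""
    h, w = len(surface), len(surface[0])
    active = [(u, v) for v in range(h) for u in range(w)
              if surface[v][u] is not None and t_now - surface[v][u] <= window]
    field = [[math.inf] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            for (au, av) in active:
                d = math.hypot(u - au, v - av)
                field[v][u] = min(field[v][u], d)
    return field

events = [(1, 1, 0.9, +1), (3, 2, 0.5, -1)]  # the OFF event is too old
ts = time_surface(events, width=5, height=4)
f = distance_field(ts[+1], t_now=1.0, window=0.2)
# f[1][1] == 0.0 at the recent ON event; values grow with distance from it.
```

Because the field varies smoothly with pixel position, residuals of projected 3D map points against it are continuous and differentiable, which is what makes the 2D-3D alignment in §4.3 amenable to gradient-based optimization.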

(ii) Motion-aware Hierarchical Fusion and Optimization (MHFO) scheme (§5). This scheme first introduces motion-optical flow-instructed event filtering (§5.1), which combines drone motion information with structural data from the 3D point map to predict event polarity and perform fine-grained event filtering. This approach fuses the event camera with the IMU at the early stage of raw data processing, improving the efficiency of matching-based pose estimation. The scheme then introduces a graph-informed joint fusion and optimization module (§5.2). This module first infers the drone's relative motion through proprioceptive tracking and then uses a carefully designed factor graph to fuse these measurements with exteroceptive data from the STPE module. This fusion, performed at the later stage of pose estimation, further improves the accuracy of matching-based pose estimation.
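The early-stage filtering step can be illustrated with the brightness-constancy relation \( dI/dt = -\nabla I \cdot \boldsymbol{v} \): given a per-pixel flow predicted from drone motion and map structure, an event's polarity should match the sign of \( -\nabla I \cdot \boldsymbol{v} \). The sketch below keeps only events that agree with that prediction; `grad_map` and `flow_map` are hypothetical inputs, and the paper's §5.1 formulation may differ in detail.

```python
def predict_polarity(grad, flow):
    """Predicted brightness-change sign under brightness constancy:
    dI/dt = -grad(I) . v, so an event's polarity should match
    sign(-grad(I) . v). `grad` and `flow` are (x, y) tuples."""
    d = -(grad[0] * flow[0] + grad[1] * flow[1])
    return 0 if d == 0 else (1 if d > 0 else -1)

def filter_events(events, grad_map, flow_map):
    """Keep only events whose measured polarity agrees with the
    motion-predicted polarity at their pixel (fine-grained filtering)."""
    kept = []
    for (u, v, t, p) in events:
        if predict_polarity(grad_map[v][u], flow_map[v][u]) == p:
            kept.append((u, v, t, p))
    return kept

# Toy 1x2 maps: both pixels have a rightward intensity gradient and
# rightward flow -> brightness decreases -> expect OFF (-1) events.
grad_map = [[(1.0, 0.0), (1.0, 0.0)]]
flow_map = [[(2.0, 0.0), (2.0, 0.0)]]
events = [(0, 0, 0.1, -1), (1, 0, 0.2, +1)]  # second event contradicts prediction
print(filter_events(events, grad_map, flow_map))  # [(0, 0, 0.1, -1)]
```

Events that fail the check are likely noise or inconsistent with the drone's motion, so discarding them shrinks the set passed to the matching stage without touching geometrically informative events.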

Relationship between STPE and MHFO. STPE extracts a temporal distance field feature from the event stream and aligns it with a prior 3D point map to facilitate matching-based drone pose estimation. To further enhance the efficiency and accuracy of this estimation, EV-Pose incorporates MHFO, which leverages drone motion information for early-stage event filtering, reducing the number of events involved in matching, and for later-stage pose optimization, recovering scale and producing a 6-DoF trajectory with minimal drift.


Implementation and Experiments Setup

As illustrated in the figure, EV-Pose is implemented on a 450 mm-wide drone equipped with
(i) a Prophesee EVK4 HD event camera with 1280 x 720 resolution;
(ii) an Intel RealSense D435i depth camera for frame image capture;
(iii) a Pixhawk 4 flight controller for drone control and IMU measurements.
EV-Pose runs on an Intel NUC with a Core i7 CPU, 16 GB RAM, and Ubuntu 20.04. Indoor and outdoor environments are mapped in advance using a Livox MID-360 LiDAR and the FAST-LIO2 algorithm. All algorithms of EV-Pose are implemented in C++ with ROS.


We conduct field studies of EV-Pose both indoors and outdoors to evaluate its pose tracking performance, as shown in Fig. 10. The drone flies along a square spiral trajectory within the test field. The environment is pre-mapped, and the drone operates within this 3D point-mapped environment for pose estimation. In indoor settings, the drone's ground-truth pose is obtained using a CHINGMU motion capture system with 16 MC1300 infrared cameras operating at 210 FPS. In outdoor settings, we deploy a private RTK station using a Hi-Target D8 to provide accurate ground truth. We conduct 20+ hours of extensive experiments, collecting 200+ GB of raw data.