Enabling High-Frequency Cross-Modality Visual Positioning Service for Accurate Drone Landing

IEEE TMC Submission

Demo video

EV-Pose estimates a drone's 6-DoF pose by redesigning drone-oriented VPS around event cameras. Compared to conventional VPS systems, EV-Pose enables rapid, high-frequency drone pose tracking, ensuring precise flight control and landing.

EV-Pose redesigns drone-oriented VPS with event cameras.


Abstract

After years of growth, the drone-driven low-altitude economy is transforming logistics. At its core, real-time 6-DoF drone pose tracking enables precise flight control and accurate drone landing. With the widespread availability of urban 3D maps, the Visual Positioning Service (VPS), a mobile pose estimation system, has been adapted to enhance drone pose tracking during the landing phase, since conventional systems such as GPS are unreliable in urban environments due to signal attenuation and multi-path propagation. However, deploying current VPS on drones faces limitations in both estimation accuracy and efficiency. In this work, we redesign drone-oriented VPS with an event camera and introduce EV-Pose to enable accurate, low-latency 6-DoF drone pose tracking. EV-Pose introduces a spatio-temporal feature-instructed pose estimation module that extracts a temporal distance field for pose estimation via 3D point map matching, and a motion-aware hierarchical fusion and optimization scheme that enhances the accuracy and efficiency of this estimation by utilizing drone motion in the early stage of event filtering and the later stage of pose optimization. Evaluation shows that EV-Pose achieves a rotation accuracy of 1.34° and a translation accuracy of 6.9 \(mm\) with a tracking latency of 10.08 \(ms\), outperforming baselines by more than 50% and thus enabling accurate drone landings.

Event Camera Preliminary

Event cameras are bio-inspired sensors that differ fundamentally from traditional frame cameras. Frame cameras capture synchronous images with a global shutter at fixed time intervals, whereas an event camera outputs an asynchronous event stream with $ms$-level temporal resolution, as shown in Fig. (a). Each pixel of an event camera independently and asynchronously responds to changes in brightness. Each event \( e = (\boldsymbol{x}, i, p) \) indicates that the pixel at location \(\boldsymbol{x} = (u, v)\) has undergone a brightness change of a predefined magnitude at time \( i \), as shown in Fig. (b). \( p \) is the polarity of the intensity change: 'ON' for an increase in brightness and 'OFF' for a decrease.
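The event model above can be summarized in a few lines of code. The snippet below is a minimal illustrative sketch (hypothetical types, not part of the EV-Pose codebase): an event carries a pixel location, a timestamp, and a polarity bit, and a pixel fires whenever its log-intensity change since its last event exceeds a contrast threshold.

```cpp
// Minimal sketch of the event model described above (hypothetical types).
#include <cstdint>
#include <vector>

struct Event {
    uint16_t u, v;    // pixel location x = (u, v)
    int64_t  t_us;    // timestamp i, in microseconds
    bool     on;      // polarity p: true = 'ON' (brighter), false = 'OFF' (darker)
};

// An event camera emits an asynchronous stream of such events rather than frames.
using EventStream = std::vector<Event>;

// Trigger model: a pixel fires when its log-intensity change since the last
// event exceeds a contrast threshold C; the sign of the change gives the polarity.
inline bool triggers(double logI_now, double logI_last, double C, bool& on) {
    const double d = logI_now - logI_last;
    if (d >= C)  { on = true;  return true; }
    if (d <= -C) { on = false; return true; }
    return false;
}
```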


RGB camera-based VPS & event camera-enhanced VPS

(a) Current VPS uses an RGB camera, an IMU, and 3D point clouds for pose estimation.
(b) EV-Pose leverages event cameras for accurate and low-latency 6-DoF drone pose tracking.


System Overview

From a top-level perspective, we design EV-Pose, an event-based 6-DoF pose tracking system for drones that redesigns the current VPS with an event camera. EV-Pose leverages prior 3D point maps and the temporal consistency between the event camera and the IMU to achieve accurate, low-latency 6-DoF drone pose tracking.

(i) Spatio-Temporal feature-instructed Pose Estimation (STPE) module (§4). This module first introduces the separated-polarity time surface, a novel spatio-temporal representation of the event stream (§4.1). It then leverages the temporal relationships among events encoded in the time surface to generate a distance field, which serves as the feature representation of the event stream (§4.2). Finally, a 2D event–3D point map matching module formulates drone pose estimation as aligning the event stream's distance field feature with the 3D point map, thereby enabling absolute pose estimation of the drone (§4.3). A minimal sketch of the time surface and distance field follows this list.

(ii) Motion-aware Hierarchical Fusion and Optimization (MHFO) scheme (§5). This scheme first introduces motion–optical-flow-instructed event filtering (§5.1), which combines drone motion information with structural cues from the 3D point map to predict event polarity and perform fine-grained event filtering; this fuses the event camera with the IMU at the early stage of raw data processing and improves the efficiency of matching-based pose estimation (see the sketch after the next paragraph). The scheme then introduces a graph-informed joint fusion and optimization module (§5.2), which first infers the drone's relative motion through proprioceptive tracking and then fuses these measurements with exteroceptive data from the STPE module in a carefully designed factor graph. This fusion, performed at the later stage of pose estimation, further improves the accuracy of matching-based pose estimation.
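The following is a minimal sketch of the two STPE ingredients named in (i), under simplifying assumptions and reusing the Event struct from the preliminaries sketch: a separated-polarity time surface updated asynchronously per event, and a distance field built from recently active pixels (here via OpenCV's distance transform). This is not the authors' implementation; EV-Pose additionally aligns the resulting field with the projected 3D point map, which is omitted here.

```cpp
// Sketch of a separated-polarity time surface and a derived distance field.
// Reuses the Event struct from the preliminaries sketch above.
#include <opencv2/imgproc.hpp>

struct PolarityTimeSurface {
    cv::Mat on_ts, off_ts;   // per-pixel timestamp of the latest ON / OFF event (seconds)

    PolarityTimeSurface(int h, int w)
        : on_ts(h, w, CV_64F, cv::Scalar(0)), off_ts(h, w, CV_64F, cv::Scalar(0)) {}

    // Asynchronously update the surface that matches the event's polarity.
    void update(const Event& e) {
        (e.on ? on_ts : off_ts).at<double>(e.v, e.u) = e.t_us * 1e-6;
    }

    // Build a distance field at query time t: pixels with a recent event
    // (within tau seconds) act as "edge" pixels; every other pixel stores the
    // distance to the nearest such pixel. The matching step that aligns this
    // field with the projected 3D point map is omitted in this sketch.
    cv::Mat distanceField(double t, double tau, bool on_polarity) const {
        const cv::Mat& ts = on_polarity ? on_ts : off_ts;
        cv::Mat recent(ts.size(), CV_8U, cv::Scalar(255));       // 255 = non-edge
        for (int v = 0; v < ts.rows; ++v)
            for (int u = 0; u < ts.cols; ++u)
                if (t - ts.at<double>(v, u) < tau)
                    recent.at<uchar>(v, u) = 0;                   // 0 = recent event pixel
        cv::Mat field;
        cv::distanceTransform(recent, field, cv::DIST_L2, 3);     // CV_32F distances
        return field;
    }
};
```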

Relationship between STPE and MHFO. STPE extracts a temporal distance field feature from the event stream and aligns it with a prior 3D point map to enable matching-based drone pose estimation. To further enhance the efficiency and accuracy of this estimation, EV-Pose incorporates MHFO, which leverages drone motion information for early-stage event filtering (reducing the number of events involved in matching) and for later-stage pose optimization (recovering scale and producing a 6-DoF trajectory with minimal drift).
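Below is a hypothetical sketch of the early-stage filtering idea in (ii): predict the polarity an event should have from the motion-induced optical flow and the local scene gradient, and keep only events whose measured polarity agrees. The flow and gradient callables are placeholders; the real module derives them from IMU motion and the prior 3D point map.

```cpp
// Sketch of motion-aware polarity-consistency filtering (illustrative only).
// Reuses Event and EventStream from the preliminaries sketch.
#include <Eigen/Core>

// Predicted brightness change ~ -∇L(x) · flow(x): if the scene gradient and the
// motion-induced flow point the same way, the pixel darkens; otherwise it brightens.
inline bool predictPolarity(const Eigen::Vector2d& flow,
                            const Eigen::Vector2d& grad) {
    return -grad.dot(flow) > 0.0;   // true = expect an 'ON' event
}

// Keep only events whose measured polarity matches the motion-based prediction.
// flowAt(x) and gradAt(x) are placeholder callables returning the optical flow
// and scene gradient at pixel x.
template <class FlowFn, class GradFn>
EventStream filterEvents(const EventStream& in, FlowFn flowAt, GradFn gradAt) {
    EventStream out;
    out.reserve(in.size());
    for (const Event& e : in) {
        const Eigen::Vector2d x(e.u, e.v);
        if (predictPolarity(flowAt(x), gradAt(x)) == e.on)
            out.push_back(e);        // consistent with predicted motion: keep
        // otherwise drop: likely noise or dynamic clutter
    }
    return out;
}
```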


Implementation and Experimental Setup

As shown in the figure, EV-Pose is implemented on a 450 mm-wide drone equipped with
(i) a Prophesee EVK4 HD event camera with 1280×720 resolution;
(ii) an Intel RealSense D435i depth camera for frame image capture;
(iii) a Pixhawk 4 flight controller for drone control and IMU measurements.
EV-Pose runs on an Intel NUC with a Core i7 CPU, 16 GB RAM, and Ubuntu 20.04. Indoor and outdoor environments are mapped in advance using a Livox MID-360 LiDAR and the FAST-LIO2 algorithm. All EV-Pose algorithms are implemented in C++ and ROS.
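For context, a minimal sketch of how such a pipeline could be wired up as a ROS node is shown below. The topic names and the event message type (dvs_msgs::EventArray from the common rpg_dvs_ros stack) are assumptions for illustration, not the actual EV-Pose interfaces.

```cpp
// Hypothetical ROS node skeleton: subscribe to event and IMU streams and feed
// them to the STPE / MHFO stages (see the sketches above). Topic names assumed.
#include <ros/ros.h>
#include <sensor_msgs/Imu.h>
#include <dvs_msgs/EventArray.h>

void onEvents(const dvs_msgs::EventArray::ConstPtr& msg) {
    // Feed events to the time surface / distance field construction.
}

void onImu(const sensor_msgs::Imu::ConstPtr& msg) {
    // Feed angular velocity and acceleration to the proprioceptive tracker.
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "ev_pose_node");
    ros::NodeHandle nh;
    ros::Subscriber ev_sub  = nh.subscribe("/events", 10000, onEvents);
    ros::Subscriber imu_sub = nh.subscribe("/mavros/imu/data", 1000, onImu);
    ros::spin();
    return 0;
}
```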


We conduct field studies of EV-Pose both indoors and outdoors to evaluate its pose tracking performance, as shown in Fig. 10. The drone flies along a square spiral trajectory within the test field. The environments are mapped in advance, and the drone performs pose estimation against the resulting 3D point map. In the indoor setting, the drone's ground-truth pose is obtained with a CHINGMU motion capture system consisting of 16 MC1300 infrared cameras operating at 210 FPS. In the outdoor setting, we deploy a private RTK station using a Hi-Target D8 to provide accurate ground truth. In total, we conduct over 20 hours of experiments and collect more than 200 GB of raw data.

Quantitative Results

Overall Performance on Field Study


Latency on Field Study


Overall Performance on Public Datasets


Robustness Evaluation


Ablation Study


System Overhead
