History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

1. College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics
2. Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, China
3. The State Key Lab of Brain-Machine Intelligence, Zhejiang University
AAAI 2026

*Indicates Equal Contribution
Corresponding Authors

In this paper, we introduce a History-Enhanced Two-Stage Transformer (HETT) for AVLN, which integrates coarse-grained and fine-grained multimodal information to bridge the gap between global planning and local perception.

Abstract

Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. A historical grid map dynamically aggregates visual features into a structured spatial memory, giving the agent more comprehensive scene awareness. The CityNav dataset annotations are also manually refined to improve data quality. Experiments on the refined CityNav dataset show that HETT delivers significant performance gains, and extensive ablation studies verify the effectiveness of each component.
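To make the idea of a historical grid map concrete, here is a minimal sketch of a spatial memory that aggregates per-step visual features into grid cells. It is not the authors' implementation: the class name `HistoryGridMap`, the running-mean aggregation, the cell size, and the feature dimension are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


class HistoryGridMap:
    """Hypothetical spatial memory: one running-mean feature vector per grid cell."""

    def __init__(self, rows: int, cols: int, cell_size: float, feat_dim: int) -> None:
        self.rows, self.cols, self.cell_size = rows, cols, cell_size
        # feats[r][c] holds the mean of all features observed in that cell so far.
        self.feats: List[List[List[float]]] = [
            [[0.0] * feat_dim for _ in range(cols)] for _ in range(rows)
        ]
        # counts[r][c] is the number of observations aggregated into that cell.
        self.counts: List[List[int]] = [[0] * cols for _ in range(rows)]

    def cell_of(self, x: float, y: float) -> Tuple[int, int]:
        """Map a world position to a (row, col) cell, clamped to the grid bounds."""
        col = min(max(int(x // self.cell_size), 0), self.cols - 1)
        row = min(max(int(y // self.cell_size), 0), self.rows - 1)
        return row, col

    def update(self, x: float, y: float, feature: Sequence[float]) -> None:
        """Fold a new observation into its cell with an incremental mean (assumed rule)."""
        r, c = self.cell_of(x, y)
        self.counts[r][c] += 1
        n = self.counts[r][c]
        cell = self.feats[r][c]
        for i, v in enumerate(feature):
            cell[i] += (v - cell[i]) / n

    def visited_mask(self) -> List[List[bool]]:
        """Cells that have received at least one observation."""
        return [[n > 0 for n in row] for row in self.counts]


# Toy usage: three observations, two of which fall into the same cell.
grid = HistoryGridMap(rows=4, cols=4, cell_size=10.0, feat_dim=2)
grid.update(5.0, 5.0, [1.0, 0.0])
grid.update(7.0, 3.0, [3.0, 2.0])
grid.update(35.0, 12.0, [0.5, 0.5])
```

In this toy version the memory is a plain grid of averaged vectors; a learned model would typically feed such a grid to the transformer as additional tokens, but how HETT does this is not described on this page.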

Dataset Refinement

Statistics of Dataset Refinement.
"Missing" indicates missing landmark references. "Minor" refers to spelling mistakes or other typos. "Major" denotes critical landmark extraction errors that misalign instructions with their intended targets. "Deletion" corresponds to instructions removed from the dataset because they lack valid landmark references.


Method

Overview of HETT Framework.
In the Coarse-Grained Target Prediction stage, the agent uses the predicted target position to guide navigation. In the Fine-Grained Action Refinement stage, the agent uses local action estimation to adjust its immediate movements until the predicted progress reaches a threshold.
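The two-stage control flow described above can be sketched as a simple loop. This is an illustrative sketch, not the HETT implementation: the function names, the `NavConfig` fields, the distance-based rule for switching from the coarse to the fine stage, and the stub predictors are all assumptions, since the page only states that the fine stage runs until the predicted progress reaches a threshold.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec = Tuple[float, float]


@dataclass
class NavConfig:
    switch_radius: float = 20.0       # assumed: start fine stage this close to the coarse target
    progress_threshold: float = 0.9   # assumed: stop once predicted progress reaches this value
    coarse_step: float = 10.0         # assumed: distance flown per coarse step
    max_steps: int = 200


def step_toward(pos: Vec, target: Vec, step: float) -> Vec:
    """Move from pos toward target by at most `step`."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target
    return (pos[0] + step * dx / dist, pos[1] + step * dy / dist)


def navigate(
    start: Vec,
    predict_target: Callable[[Vec], Vec],
    predict_action: Callable[[Vec], Tuple[Vec, float]],
    cfg: NavConfig,
) -> Tuple[Vec, List[str]]:
    """Run the coarse stage, then the fine stage; return final position and stage log."""
    pos, stage, log = start, "coarse", []
    for _ in range(cfg.max_steps):
        log.append(stage)
        if stage == "coarse":
            # Coarse stage: fly toward the predicted target position.
            target = predict_target(pos)
            pos = step_toward(pos, target, cfg.coarse_step)
            if math.dist(pos, target) <= cfg.switch_radius:
                stage = "fine"
        else:
            # Fine stage: apply local adjustments until progress reaches the threshold.
            delta, progress = predict_action(pos)
            if progress >= cfg.progress_threshold:
                break
            pos = (pos[0] + delta[0], pos[1] + delta[1])
    return pos, log


# Stub predictors standing in for the learned models (purely hypothetical).
GOAL: Vec = (103.0, 4.0)


def stub_target(pos: Vec) -> Vec:
    return (100.0, 0.0)  # coarse estimate, deliberately a few metres off the goal


def stub_action(pos: Vec) -> Tuple[Vec, float]:
    nxt = step_toward(pos, GOAL, 1.0)
    progress = 1.0 / (1.0 + math.dist(pos, GOAL))
    return (nxt[0] - pos[0], nxt[1] - pos[1]), progress


final_pos, stage_log = navigate((0.0, 0.0), stub_target, stub_action, NavConfig())
```

With these stubs the coarse stage carries the agent most of the way, and the fine stage corrects the residual offset between the coarse estimate and the goal before stopping.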

Experiments

Quantitative Results of HETT.

Qualitative Results of HETT.

BibTeX

@article{HETT2026,
        title={History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation},
        volume={40},
        url={https://ojs.aaai.org/index.php/AAAI/article/view/38885},
        DOI={10.1609/aaai.v40i22.38885},
        number={22},
        journal={Proceedings of the AAAI Conference on Artificial Intelligence},
        author={Ding, Xichen and Gao, Jianzhe and Pan, Cong and Wang, Wenguan and Qin, Jie},
        year={2026},
        month={Mar.},
        pages={18225--18233}
}