A collaborative international research team led by Tongji University has introduced KEPT (Knowledge-Enhanced Prediction of Trajectories), an AI system that improves short-term trajectory prediction for autonomous vehicles by letting the vehicle recall and learn from a large repository of previously encountered driving scenarios. KEPT combines a vision-language model with a memory retrieval mechanism, shifting away from conventional end-to-end planning toward a more transparent, retrieval-augmented approach.
At the core of KEPT is a video encoder designed to capture both the spatial and the temporal structure of driving scenes. This module, the temporal frequency–spatial fusion (TFSF) encoder, combines a frequency attention mechanism based on the fast Fourier transform with a multi-scale Swin Transformer and a lightweight temporal transformer that processes frame sequences sampled at 2 Hz. Together, these components let the system pick up small motion variations and the spatial layout that matter for near-term motion planning. The encoder is trained in a self-supervised manner, without manual annotations, using a contrastive loss that pulls the embeddings of similar clips together and pushes those of dissimilar clips apart. The resulting representations are semantically meaningful and robust enough to support accurate retrieval.
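To make the contrastive objective concrete, here is a minimal sketch. It assumes an InfoNCE-style loss with cosine similarity and a temperature of 0.1; the article does not say which contrastive formulation KEPT uses, and the embedding values below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Subtract the largest logit before exponentiating, for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Toy clip embeddings (hypothetical values, not taken from the paper).
anchor = [1.0, 0.0, 0.2]
similar_clip = [0.9, 0.1, 0.25]
unrelated_clips = [[-1.0, 0.3, 0.0], [0.0, 1.0, -0.5]]

# Pairing the anchor with a similar clip should give a small loss;
# pairing it with an unrelated clip should give a large one.
loss_good = info_nce(anchor, similar_clip, unrelated_clips)
loss_bad = info_nce(anchor, unrelated_clips[0], [similar_clip, unrelated_clips[1]])
print(loss_good < loss_bad)
```

Minimizing this quantity over many clips is what moves similar clips closer together in embedding space, which is the property the retrieval step relies on.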
Retrieval is central to KEPT’s performance. A large corpus of historical driving video clips is embedded into a vector database; at run time, the system embeds the current driving sequence and queries for the most contextually similar prior scenes. Matching proceeds in two stages: k-means cluster routing first narrows the search, and a hierarchical navigable small-world (HNSW) index then identifies the nearest neighbors. KEPT retrieves several relevant exemplars together with their ground-truth trajectories. These trajectories are not passive data points: they are inserted into chain-of-thought prompts that guide the vision-language model to compare the current scene with the past examples, weigh similarities and differences, and generate a feasible, safe, and smooth 3-second ego trajectory.
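The two-stage lookup can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the centroids are hard-coded rather than fitted with k-means, the second stage is assumed to search only inside the routed cluster, and an exact scan stands in for the HNSW index so that the example needs no third-party library. All clip IDs, embeddings, and trajectories are hypothetical.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def route_to_cluster(query, centroids):
    """Stage 1: return the index of the nearest cluster centroid."""
    return min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))

def retrieve(query, centroids, clusters, k=2):
    """Stage 2: rank the clips in the routed cluster and return the top k.

    An exact scan is used here as a stand-in for an approximate
    nearest-neighbor index such as HNSW.
    """
    members = clusters[route_to_cluster(query, centroids)]
    ranked = sorted(members, key=lambda clip: dist(query, clip["embedding"]))
    return ranked[:k]

# Hypothetical memory bank: each clip stores its embedding and the trajectory
# that was actually driven, as (x, y) waypoints in metres.
centroids = [[0.0, 0.0], [10.0, 10.0]]
clusters = [
    [
        {"id": "clip_a", "embedding": [0.1, 0.2], "trajectory": [(0.0, 2.0), (0.0, 4.1)]},
        {"id": "clip_b", "embedding": [1.5, -0.5], "trajectory": [(0.2, 1.8), (0.6, 3.5)]},
        {"id": "clip_c", "embedding": [-2.0, 2.0], "trajectory": [(-0.3, 1.5), (-0.9, 2.8)]},
    ],
    [
        {"id": "clip_d", "embedding": [9.5, 10.5], "trajectory": [(0.0, 0.5), (0.0, 0.8)]},
    ],
]

exemplars = retrieve([0.3, 0.0], centroids, clusters, k=2)
exemplar_ids = [clip["id"] for clip in exemplars]
print(exemplar_ids)
```

The retrieved trajectories would then be written into the prompt alongside the current frames. Routing first keeps the fine-grained search small, at the cost of missing neighbors that fall just across a cluster boundary.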
KEPT targets short-horizon trajectory prediction, which demands rapid decisions in dynamic, complex scenes. Many existing autonomous driving systems struggle here because they must extrapolate future states from limited current input. By drawing on a large, diverse memory of past events, KEPT can “remember” analogous situations and apply what happened in them, which reduces prediction errors and lowers collision risk in these critical moments.
The researchers adapted the vision-language backbone through a three-stage fine-tuning regimen intended to improve the model’s scene understanding and prediction accuracy. In the first stage, the model is fine-tuned on visual question-answering datasets that emphasize spatial reasoning about object categories, dimensions, and distances. In the second, it learns to regress future trajectories directly from multi-view imagery and basic kinematic parameters, and is penalized for unsafe maneuvers such as excessive curvature or abrupt acceleration. In the third, it learns to predict trajectories from consecutive front-view frames alone, aligning its language-based reasoning with short-term temporal dynamics. The adaptation uses lightweight Low-Rank Adaptation (LoRA) modules, which keep fine-tuning computationally efficient without compromising performance.
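One way to picture the second-stage penalty on unsafe maneuvers is a hinge term on acceleration and curvature, computed from predicted waypoints by finite differences. The sketch below is an assumption about the general shape of such a term, not the paper's loss: the 0.5-second waypoint spacing, the limits `max_accel` and `max_curvature`, and the hinge form are all hypothetical.

```python
import math

def finite_diff(points, dt):
    """First-order finite difference of a list of (x, y) pairs."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def comfort_penalty(waypoints, dt=0.5, max_accel=3.0, max_curvature=0.2):
    """Sum of hinge penalties for acceleration and curvature above the limits."""
    vel = finite_diff(waypoints, dt)   # m/s
    acc = finite_diff(vel, dt)         # m/s^2
    penalty = 0.0
    for (vx, vy), (ax, ay) in zip(vel, acc):
        speed = math.hypot(vx, vy)
        penalty += max(0.0, math.hypot(ax, ay) - max_accel)
        if speed > 1e-6:
            # Curvature of a planar curve: |vx*ay - vy*ax| / speed^3.
            curvature = abs(vx * ay - vy * ax) / speed ** 3
            penalty += max(0.0, curvature - max_curvature)
    return penalty

# Six waypoints spaced 0.5 s apart cover a 3-second horizon.
straight = [(0.0, 2.5 * i) for i in range(1, 7)]  # constant 5 m/s, no turning
swerve = [(0.0, 2.5), (0.0, 5.0), (2.5, 6.0), (5.0, 6.0), (5.0, 8.5), (5.0, 11.0)]

print(comfort_penalty(straight), comfort_penalty(swerve) > 0.0)
```

Added to a regression loss, a term like this leaves smooth trajectories untouched and raises the cost of abrupt ones, which is the behavior the fine-tuning description calls for.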
Evaluated on the nuScenes benchmark, KEPT outperforms both traditional trajectory prediction baselines and recent planners driven by vision-language models. It consistently reduces positional prediction error while keeping collision probabilities at or below those of rival methods. Ablation studies indicate that each architectural element contributes to the system’s effectiveness and robustness: the self-supervised TFSF encoder, the two-stage retrieval pipeline, the three-stage fine-tuning, and the use of multiple retrieved exemplars.
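Positional prediction error in open-loop evaluation is commonly reported as the L2 distance between predicted and ground-truth waypoints. The sketch below assumes a plain average over all waypoints; the article does not say which averaging convention KEPT's evaluation uses, and the trajectories here are invented.

```python
import math

def l2_errors(predicted, ground_truth):
    """Per-waypoint Euclidean distance between two (x, y) trajectories."""
    return [math.hypot(px - gx, py - gy)
            for (px, py), (gx, gy) in zip(predicted, ground_truth)]

def average_l2(predicted, ground_truth):
    """Mean displacement error over all waypoints, in metres."""
    errors = l2_errors(predicted, ground_truth)
    return sum(errors) / len(errors)

# Hypothetical waypoints in metres, in the ego vehicle's frame.
ground_truth = [(0.0, 2.0), (0.0, 4.0), (0.0, 6.0)]
predicted = [(0.0, 2.0), (0.0, 4.5), (0.0, 7.0)]

print(l2_errors(predicted, ground_truth))
print(average_l2(predicted, ground_truth))
```

Because errors typically grow with the horizon, results are often broken out by time step as well as averaged.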
Prof. Bingzhao Gao, the project’s corresponding author, describes the reasoning behind the design. Vision-language models are powerful but prone to hallucination and to ignoring physical constraints, so the team grounds the model’s reasoning in concrete, real-world trajectories. By building physical feasibility and collision risk explicitly into the training objectives, KEPT turns a capable but often opaque reasoning engine into a practical module that can be engineered for real-world deployment.
The study’s implications extend beyond immediate performance metrics and open-loop results. It proposes a design pattern for AI systems in autonomous vehicles: combine large-scale pre-trained models with retrieval-augmented reasoning and structured, physics-informed prompting. This design improves transparency, reduces reliance on extensive data annotation, and puts safety considerations at the core of the decision-making model. The current work focuses on short-term prediction from monocular front-camera footage, but it lays a foundation for future extensions, including closed-loop testing, richer sensor suites, and generalization to broader geographic and environmental conditions.
KEPT’s potential applications extend beyond fully autonomous vehicles to advanced driver-assistance systems (ADAS) that do more than support driving: they explain their recommendations in natural language, which can build trust and understanding among human drivers. By combining retrieval, visual perception, and language reasoning, KEPT is a concrete step toward autonomous systems that are not only competent drivers but also interpretable partners that can explain their decisions.
As autonomous vehicle technology moves toward widespread adoption, KEPT illustrates how careful system design can draw on modern machine learning (large transformer models, self-supervised learning, and efficient retrieval architectures) while embedding domain-specific constraints that protect human life and foster trust in intelligent transportation systems.
Subject of Research: Autonomous Driving, AI-based Trajectory Prediction, Vision-Language Models, Self-Supervised Learning, Retrieval-Augmented AI
Article Title: KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models
News Publication Date: 31-Mar-2026
Web References: https://doi.org/10.26599/COMMTR.2026.9640012
References: Communications in Transportation Research
Image Credits: Communications in Transportation Research
Keywords: Autonomous Vehicles, Trajectory Prediction, Vision-Language Models, Self-Supervised Learning, Temporal Frequency-Spatial Fusion Encoder, Retrieval-Augmented AI, Chain-of-Thought Prompting, nuScenes Benchmark, Advanced Driver-Assistance Systems, Motion Planning, Transformer Models, Safety-Aware AI

