Voice at the wheel: Commands navigate, wisdom travels from COMMTR2024

April 29, 2024
in Technology and Engineering

Recently, the team led by Professor Xu Chengzhong and Assistant Professor Li Zhenning from the University of Macau’s State Key Laboratory of Internet of Things for Smart City unveiled the Context-Aware Visual Grounding Model (CAVG). This model stands as the first Visual Grounding autonomous driving model to integrate natural language processing with large language models. They published their study in Communications in Transportation Research.

Schematic Overview of the CAVG Model Architecture

Credit: Communications in Transportation Research, Tsinghua University Press

As interest in autonomous driving grows, leading automotive and technology companies have publicly demonstrated driverless vehicles that navigate safely around obstacles and handle emergencies. Yet the public remains cautious about entrusting full control to AI systems, which underscores the importance of systems that let passengers control the vehicle through voice commands. This goal sits at the intersection of two domains: computer vision and natural language processing (NLP). A pivotal research challenge is to use cross-modal algorithms to link intricate verbal instructions with real-world contexts, so that the driving system can grasp passengers’ intent and choose intelligently among candidate targets. In response, Thierry Deruyttere and colleagues launched the Talk2Car challenge in 2019. The competition tasks researchers with pinpointing the region of a front-view image from a real-world traffic scene that best matches a given textual description.
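The article does not say how a predicted region is judged against the reference. Visual grounding tasks of this kind are commonly scored by intersection over union (IoU) between the predicted and ground-truth boxes, with a prediction counted as correct when the IoU exceeds 0.5. The sketch below illustrates that convention only; the `(x1, y1, x2, y2)` box format, the function names, and the sample coordinates are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(predicted, truth, threshold=0.5):
    """True when the predicted box overlaps the reference box strongly enough.

    The 0.5 threshold is an assumed, commonly used default.
    """
    return iou(predicted, truth) > threshold


# Hypothetical boxes in pixel coordinates.
reference = (20, 20, 60, 60)
print(is_hit((10, 10, 50, 50), reference))  # loose prediction
print(is_hit((22, 18, 60, 62), reference))  # tight prediction
```

Averaging `is_hit` over every command in a test set gives a single accuracy figure for comparing models.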

 

Owing to the rapid advancement of large language models (LLMs), linguistic interaction with autonomous vehicles has become a reality. The paper first frames the alignment of textual instructions with visual scenes as a mapping task: textual descriptions must be converted into vectors that correspond to the most suitable subregion among a set of candidates. To address this, it introduces the CAVG model, which is built on a cross-modal attention mechanism. Following the two-stage approach, CAVG uses the CenterNet model to delineate numerous candidate regions within an image and then extracts a feature vector for each region. The model follows an encoder-decoder structure, comprising Text, Emotion, Vision, and Context encoders alongside a Cross-Modal encoder and a Multimodal decoder. To handle the complexity of contextual semantics and human emotional nuance, the paper leverages GPT-4V and integrates a novel multi-head cross-modal attention mechanism and a Region-Specific Dynamics (RSD) layer. This layer modulates attention and interprets cross-modal inputs, helping the model identify the candidate region that most closely matches the given instruction.
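The core idea of scoring candidate regions against a command vector can be sketched with plain scaled dot-product attention. This is a minimal illustration and not the CAVG implementation: it has no learned projections, no RSD layer, and no encoders. The toy embeddings and the names `attention_weights` and `select_region` are hypothetical. Each head attends over a slice of the embedding, and the per-head weights are averaged before the best region is picked.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(text_vec, region_vecs):
    """Scaled dot-product attention of one text query over region keys."""
    scale = math.sqrt(len(text_vec))
    scores = [
        sum(t * r for t, r in zip(text_vec, region)) / scale
        for region in region_vecs
    ]
    return softmax(scores)


def select_region(text_vec, region_vecs, num_heads=2):
    """Average per-head attention weights; return (best index, weights)."""
    dim = len(text_vec)
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    averaged = [0.0] * len(region_vecs)
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        head_weights = attention_weights(
            text_vec[lo:hi], [region[lo:hi] for region in region_vecs]
        )
        for i, weight in enumerate(head_weights):
            averaged[i] += weight / num_heads
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return best, averaged


# Toy 4-dimensional embeddings, made up for illustration.
command = [1.0, 0.0, 1.0, 0.0]
regions = [
    [0.9, 0.1, 0.8, 0.0],  # candidate 0: close to the command
    [0.0, 1.0, 0.0, 1.0],  # candidate 1: orthogonal to the command
    [0.5, 0.5, 0.5, 0.5],  # candidate 2: partial match
]
best_index, weights = select_region(command, regions)
print(best_index, [round(w, 3) for w in weights])
```

In a two-stage pipeline like the one described, the region vectors would come from a detector's candidate boxes and the command vector from a text encoder; here both are hard-coded so the selection step can be read in isolation.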

 

Furthermore, to evaluate the model’s generalizability, the study devised specific test environments that pose additional difficulties: low-visibility nighttime settings, urban scenarios with dense traffic and intricate object interactions, environments with ambiguous instructions, and scenarios with significantly reduced visibility. These conditions were designed to make accurate prediction harder. According to the findings, the proposed model sets new benchmarks on the Talk2Car dataset, remains effective when trained on only 50% and 75% of the data in the CAVG (50%) and CAVG (75%) configurations, and shows superior performance across the specialized challenge datasets.

 

Future research will focus on improving the precision with which textual commands are integrated with visual data in autonomous navigation, and on using large language models as sophisticated assistants in autonomous driving. It will also incorporate a broader range of data modalities, including Bird’s Eye View (BEV) imagery and trajectory data. The aim is to develop comprehensive deep learning strategies that can synthesize and exploit information from multiple modalities, thereby significantly improving model performance.

 


About Communications in Transportation Research

Communications in Transportation Research was launched in 2021, with academic support provided by Tsinghua University and the China Intelligent Transportation Systems Association. The Editors-in-Chief are Professor Xiaobo Qu of Tsinghua University, a member of the Academia Europaea, and Professor Shuai’an Wang of Hong Kong Polytechnic University. The journal mainly publishes high-quality, original research and review articles of significant importance to emerging transportation systems. It aims to become an international platform for showcasing and exchanging innovative achievements in transportation and related fields, and to promote the exchange and development of transportation research between China and the international academic community. It has been indexed in ESCI, Ei Compendex, Scopus, DOAJ, TRID and other databases. In 2022, it was selected as a high-starting-point new journal project of the “China Science and Technology Journal Excellence Action Plan”.



Journal: Communications in Transportation Research

DOI: 10.1016/j.commtr.2023.100116

Article Title: GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Article Publication Date: 21-Feb-2024
