Image Segmentation-Based Scene Understanding using a Text-to-Image Model for Autonomous Robot Navigation
Stephane Claude Lee
University of West Florida Libraries
Master of Science (MS), University of West Florida
2025
Abstract
Robots require an effective system to determine whether their environment is safe to navigate through. This includes identifying objects, analyzing terrain, accounting for the physical form and mode of locomotion of the robot, whether wheeled, legged, or flying, and finding a path to safely reach a destination. Without accurate perception and classification of their surroundings, autonomous robots are prone to misinterpret obstacles, misjudge terrain types, or become non-functional in dynamic or unstructured environments.
The most common way to interpret visual data from a robot’s camera is with a trained segmentation model. These models are typically trained on thousands of annotated images to recognize and label objects in new, unseen data. However, this approach presents multiple limitations. First, trained segmentation models are computationally expensive and time-consuming to develop, requiring substantial hardware resources and manual labeling. Second, their effectiveness is often constrained to the domain they were trained on; once deployed in a new environment, they may fail to generalize without additional training or adaptation. Lastly, many of these models require fine-tuning when transitioning across sensor types, scene variations, or robot platforms.
This research explores an alternative to traditional segmentation by proposing a generative AI-based approach for class identification, terrain analysis, and visual perception. Specifically, the study investigates whether a text-to-image model, DALLE-3, can be prompted to generate segmentation masks without any prior exposure to labeled training data. By eliminating the dependency on pre-trained segmentation networks, the goal is to create a universal segmentation pipeline that is adaptable across multiple robotic platforms.
A complete segmentation pipeline is developed that includes prompt formatting, segmentation generation, region recognition, and evaluation against ground truth data from the RELLIS-3D dataset. RELLIS-3D is a benchmark comprising dynamic outdoor scenes with 20 defined semantic classes. The segmentation performance of DALLE-3 is evaluated using the mean Intersection over Union (mIoU) metric and compared to the outputs of state-of-the-art trained models, including HRNet+OCR and GSCNN.
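For readers unfamiliar with the mIoU metric named above, the following is a minimal sketch of how it can be computed for a single predicted label map against its ground truth. It is an illustration only, not the thesis's evaluation code: the function name `mean_iou`, its parameters, and the choice to skip classes absent from both maps are assumptions, and benchmark evaluations often accumulate intersections and unions over a whole dataset rather than per image.

```python
def mean_iou(pred, truth, num_classes, ignore_label=None):
    """Mean Intersection over Union between two equally sized label maps.

    pred and truth are 2D lists of integer class IDs in [0, num_classes).
    Ground-truth pixels equal to ignore_label are skipped (assumed convention).
    Classes that appear in neither map are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if t == ignore_label:
                continue
            if p == t:
                # Pixel agrees: it is in both the intersection and the union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x2 example with two classes present out of three declared.
truth = [[0, 0],
         [1, 1]]
pred = [[0, 1],
        [1, 1]]
print(round(mean_iou(pred, truth, num_classes=3), 4))  # 0.5833 (IoU 1/2 for class 0, 2/3 for class 1)
```

In the toy example, class 0 overlaps on one pixel out of a union of two, and class 1 overlaps on two pixels out of a union of three, so the mean is (1/2 + 2/3) / 2.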
The findings of this research point toward a future where generative models may supplement or replace traditional segmentation pipelines in scenarios where training data is scarce or unavailable. This thesis lays the groundwork for further research in combining generative models with robotic perception systems, enabling greater flexibility, rapid deployment, and task adaptation across a wide range of autonomous platforms.
Files and links (1)
pdf
Image Segmentation-Based Scene Understanding using a Text-to-Image Model for Autonomous Robot Navigation (24.69 MB)
Preprint Thesis pdf Open Access
Details
Title
Image Segmentation-Based Scene Understanding using a Text-to-Image Model for Autonomous Robot Navigation
Resource Type
Thesis
Contributors
Hakki Erhan Sevil (Committee Chair)
Yazan Alqudah (Committee Member)
Jiaming Fu (Committee Member)
Publisher
University of West Florida Libraries
Format
pdf
Number of pages
86
Copyright
Permission granted to the University of West Florida Libraries by the author to digitize and/or display this information for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires the permission of the copyright holder.
Identifiers
99381469215906600
Academic Unit
Dr. Muhammad Harunur Rashid Department of Electrical and Computer Engineering
Language
English
Awarding Institution
University of West Florida; Master of Science (MS)