Object Detection with Visual Question Answering Using a Team of Multi-Agent UAVs

Nathan Wesley Rayon

This study proposes an autonomous multi-agent UAV system that performs Visual Question Answering (VQA) for object detection using aerial footage. The system uses an entropy-based distributed behavior model to move each UAV to a specified waypoint. The behavior model calculates the entropy of the system to generate each UAV's next move, with the overall goal of minimizing entropy; when entropy is high, the group of UAVs moves closer together. A VQA model then takes the aerial footage provided by these UAVs and answers questions about the scenario in natural language. VQA is a machine learning task that takes two inputs, an image and a question, and answers the question about the image in natural language. The VQA model used in this study has a multi-modal architecture with three components: a computer-vision-based object detection model, a natural language processing model, and a merging layer. Training results and test case analyses are presented for the object detection model as well as the final VQA system. This thesis discusses applications such as post-disaster response, intelligence gathering, and surveillance, as well as relevant studies on distributed behavior modeling, Convolutional Neural Network (CNN) based object detection, and VQA. The main goal of this thesis is to evaluate how VQA performance and overall surface area coverage change with different distributed behavior model configurations. These configurations vary the number of UAVs, the UAV formation, the altitude, and the separation distance. In addition to these experiments, separate parameter testing is performed on the distributed behavior model to find the parameters that minimize the average distance traveled by each UAV.
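The abstract does not define the entropy measure, so the following is only an illustrative sketch under stated assumptions, not the thesis's model. It assumes Shannon entropy over a histogram of UAV positions binned into square cells, and a hypothetical centralized update (`step_toward_centroid`) that pulls each UAV toward the group centroid. The names `position_entropy`, `step_toward_centroid`, `cell_size`, and `gain` are invented for this example.

```python
import math
from collections import Counter


def position_entropy(positions, cell_size=10.0):
    """Shannon entropy (in bits) of UAV positions binned into square cells.

    ASSUMPTION: this occupancy-histogram definition is illustrative only;
    the source does not state how system entropy is computed.
    """
    cells = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size))
        for x, y in positions
    )
    n = len(positions)
    return -sum((c / n) * math.log2(c / n) for c in cells.values())


def step_toward_centroid(positions, gain=0.5):
    """Move every UAV a fraction `gain` of the way toward the group centroid.

    ASSUMPTION: a centralized stand-in for the distributed rule described
    in the abstract; it only shows how clustering lowers the entropy above.
    """
    cx = sum(x for x, _ in positions) / len(positions)
    cy = sum(y for _, y in positions) / len(positions)
    return [(x + gain * (cx - x), y + gain * (cy - y)) for x, y in positions]


if __name__ == "__main__":
    spread = [(0.0, 0.0), (50.0, 0.0), (0.0, 50.0), (50.0, 50.0)]
    clustered = step_toward_centroid(spread, gain=0.9)
    print(position_entropy(spread), position_entropy(clustered))
```

Under these assumptions, a spread-out group occupies many cells and has high entropy, while a tightly clustered group occupies few cells and has low entropy, which matches the stated behavior that the UAVs move closer together when entropy is high.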
Further testing is performed to overcome a time delay introduced by performing VQA with multiple UAVs. These tests compare two input strategies. The first is a sequentially fed input stream in which images are processed one at a time by the VQA model. The second merges each UAV's image into a grid pattern that serves as a single input to the VQA model. After each test case is analyzed, a final optimized configuration for maximizing surface area coverage and VQA model performance is listed at the end of this thesis. According to the results, the sequential input stream was more efficient when a smaller number of questions was asked of the VQA model. However, as the number of questions increases, the grid approach may outperform the sequential input stream, because only one grid image needs to be processed for each question, whereas the sequential approach must process each UAV's image separately for each question. In addition, an optimal configuration was chosen by testing multiple configurations. It uses a wider formation and a higher altitude to maximize surface area coverage, and a smaller number of UAVs to reduce the time delay that caused a decrease in UAV accuracy. The results of this study could be used for applications such as rapid disaster response, since multiple UAVs can survey a much larger area faster than a single autonomous UAV. Another application is military intelligence, surveillance, and reconnaissance (ISR), since the proposed system can answer a wide variety of questions regarding people and vehicles in an image.
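The difference between the two input strategies can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names (`grid_shape`, `merge_into_grid`, `vqa_passes`), the near-square layout, and the zero-filled padding cells are all assumptions made for this example. Images are represented as plain nested lists of pixel values of equal size.

```python
import math


def grid_shape(num_images):
    """Rows and columns of a near-square grid holding `num_images` tiles."""
    cols = math.ceil(math.sqrt(num_images))
    rows = math.ceil(num_images / cols)
    return rows, cols


def merge_into_grid(images):
    """Tile equally sized images (lists of pixel rows) into one grid image.

    ASSUMPTION: unused grid cells are filled with zero-valued pixels.
    """
    rows, cols = grid_shape(len(images))
    height, width = len(images[0]), len(images[0][0])
    blank = [[0] * width for _ in range(height)]
    padded = list(images) + [blank] * (rows * cols - len(images))
    grid = []
    for r in range(rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(height):
            # Concatenate row y of every tile in this grid row.
            grid.append([px for tile in tiles for px in tile[y]])
    return grid


def vqa_passes(num_uavs, num_questions, strategy):
    """Number of VQA model invocations needed under each input strategy."""
    if strategy == "sequential":
        return num_uavs * num_questions  # one pass per image per question
    if strategy == "grid":
        return num_questions  # one pass per question on the merged image
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(vqa_passes(4, 3, "sequential"), vqa_passes(4, 3, "grid"))
```

This count covers only model invocations. It does not model the cost of building the grid or any change in per-pass latency or accuracy, so it should not be read as a prediction of which strategy is faster for a given number of questions.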

University of West Florida