Visual recognition of text in images for question answering using deep learning

Vlachos Konstantinos

Full record

URI:

http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26

Year

2024

Type of Item

Diploma Work

License

Details

Bibliographic Citation

Konstantinos Vlachos, "Visual recognition of text in images for question answering using deep learning", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2024 https://doi.org/10.26233/heallink.tuc.100599

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Intelligent Systems Laboratory

Summary

Visual Question Answering (VQA) is a complex challenge that combines the domains of Computer Vision and Natural Language Processing. The key concept behind VQA is to be able to automatically answer questions, provided in the form of natural language text, about the content of a digital color image, provided also as part of the input. The answer is to be delivered also in the same form of natural language text. This diploma thesis explores the development of a VQA model, utilizing existing systems, trained on millions of data using deep machine learning techniques. More specifically, the two systems utilized are: EfficientNetB0 as the image feature extractor and BERT for question embedding. The feature maps generated by these two components are concatenated and are subsequently passed through a convolutional Neural Network architecture with two dense layers, which is responsible for making predictions. The goal of this model’s architecture is to correctly classify inputs, consisting of a question and an image, to answers selected from a predefined set of 500 possible responses. Training the model involved leveraging Colab’s Pro GPUs, experimenting with various configurations to optimize performance, and employing a range of callbacks for enhanced training stability. The resulting model demonstrated good performance in many cases, accurately recognizing objects, understanding scenes, and performing spatial reasoning to answer questions related to the input image. These results are illustrated through a series of correct and incorrect predicted answers on selected instances. Finally, limitations, future extensions and potential applications of the proposed approach are discussed.

Search

Browse

My Space

Visual recognition of text in images for question answering using deep learning

Vlachos Konstantinos

Summary

Available Files

Services

Export

Share

Statistics

Metadata & Content in a METS Package:

Metadata in Format: