Lecture title: Image Captioning
In recent years, significant progress has been made in building systems capable of automatically describing an image. Numerous solutions based on deep neural networks have been proposed, in which a Convolutional Neural Network (CNN) handles the visual part and a Recurrent Neural Network (RNN), or a variant thereof, handles language modeling. Such models are often further extended with an attention mechanism that helps generate more detailed descriptions. Moreover, the majority of modern image captioning systems use Maximum Likelihood Estimation (MLE) as their training objective and are trained and evaluated on image-caption pairs from the MS COCO dataset. However, despite these advances, such systems can generate only factual descriptions that literally convey the content of an image and are still far from how people would describe a scene. Many challenges remain before machines can match the ease with which people express their experience of the world around them. In this talk, we discuss several of those challenges and provide some directions for further research.
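The CNN-encoder / RNN-decoder pipeline mentioned above can be sketched very roughly as follows. This is a toy illustration only, with random weights, a hypothetical six-word vocabulary, and a single global image-feature vector standing in for the CNN output; a real system would use a pretrained CNN, learned decoder weights (typically fit with MLE on MS COCO), and word embeddings instead of one-hot vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions (all assumed for illustration).
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
V, D, H = len(VOCAB), 8, 8  # vocab size, image-feature dim, hidden dim

# Stand-in for the CNN encoder's output: one global feature vector.
image_features = rng.standard_normal(D)

# Randomly initialised decoder parameters (learned in a real system).
W_xh = rng.standard_normal((H, V)) * 0.1  # previous word -> hidden
W_hh = rng.standard_normal((H, H)) * 0.1  # hidden -> hidden (recurrence)
W_ih = rng.standard_normal((H, D)) * 0.1  # image features -> initial hidden
W_ho = rng.standard_normal((V, H)) * 0.1  # hidden -> vocabulary logits

def greedy_caption(features, max_len=10):
    """Greedy decoding with a vanilla RNN cell: condition the initial
    hidden state on the image, then emit the argmax word at each step."""
    h = np.tanh(W_ih @ features)        # initialise hidden state from image
    word = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        x = np.zeros(V)
        x[word] = 1.0                   # one-hot stand-in for an embedding
        h = np.tanh(W_xh @ x + W_hh @ h)  # RNN state update
        word = int(np.argmax(W_ho @ h))   # greedy word choice
        if VOCAB[word] == "<end>":
            break
        caption.append(VOCAB[word])
    return caption

print(greedy_caption(image_features))
```

With random weights the output is of course meaningless; the point is only the data flow: image features initialise the decoder state, and each generated word is fed back as input for the next step. An attention mechanism would replace the single feature vector with a grid of region features and recompute a weighted average of them at every decoding step.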