We introduce ReINVenTA (Research and Innovation Network for Visual and Textual Analysis of Multimodal Objects), an inter-institutional research network funded by FAPEMIG and CNPq. ReINVenTA focuses on the computational semantic processing of multimodal objects, i.e., objects composed of communicative modes such as verbal language, gestures, and video, which interact to produce meaning. To this end, it mobilizes laboratories and research groups with expertise in Model Development for Natural Language Understanding, Artificial Intelligence, Knowledge Discovery, and Assistive Technologies. More specifically, ReINVenTA investigates applications of the FrameNet semantic-computational model to the semantic representation of multimodal objects. The methodology consists of building a semantically annotated dataset and training AI models on it. This approach is based on the premise that, just as words can evoke frames, so can other semiotic elements; the semantic-computational representation of multimodal genres therefore results from intricate intersemiotic relations. The dataset will be composed of image-caption pairings and of TV shows and movies featuring audio description. Both the visual and the verbal modalities, including the audio description, will be annotated for semantic frames, and the semantic representations of each modality will be contrasted. Moreover, eye-tracking experiments will be conducted to validate the relevance of the image annotation in terms of the viewers' attention to specific regions of the images. In recent years, FrameNet Brasil has been developing the analytical criteria and the software tools that allow for extending the FrameNet model to multimodal genre analysis.
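The annotate-and-contrast methodology described above can be sketched as a data structure in which both a caption span and an image region evoke semantic frames, after which the per-modality representations are compared. The record layout, field names, and region identifiers below are hypothetical illustrations, not the actual FrameNet Brasil annotation schema; only the `Self_motion` frame and its `Self_mover` and `Path` frame elements come from FrameNet itself.

```python
from dataclasses import dataclass, field

# Hypothetical multimodal annotation record (illustrative sketch only;
# not the FrameNet Brasil schema).
@dataclass
class FrameAnnotation:
    frame: str                  # evoked semantic frame, e.g. "Self_motion"
    evoking_element: str        # word or image region that evokes the frame
    modality: str               # "text" or "image"
    frame_elements: dict = field(default_factory=dict)

@dataclass
class MultimodalObject:
    caption: str
    image_id: str
    annotations: list = field(default_factory=list)

    def frames_by_modality(self, modality: str) -> set:
        # Collect the frames evoked within a single modality.
        return {a.frame for a in self.annotations if a.modality == modality}

# An image-caption pair in which both modalities evoke the same frame.
obj = MultimodalObject(
    caption="A cyclist rides along the beach.",
    image_id="img_0042",  # hypothetical identifier
    annotations=[
        FrameAnnotation("Self_motion", "rides", "text",
                        {"Self_mover": "A cyclist", "Path": "along the beach"}),
        FrameAnnotation("Self_motion", "region:person_on_bicycle", "image",
                        {"Self_mover": "region:person_on_bicycle"}),
    ],
)

# Contrasting the representations: frames shared across modalities.
shared = obj.frames_by_modality("text") & obj.frames_by_modality("image")
```

Here the contrast step reduces to a set intersection over evoked frames; the actual project annotation would, of course, compare richer structures than frame names alone.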
The combination of the expertise of researchers and the lab infrastructure located at five Brazilian universities (UFJF, UFMG, UFU, PUC-MG, and UFPB) will result in: (i) the expansion of FrameNet coverage for Brazilian Portuguese; (ii) the production of a gold-standard dataset of multimodal objects annotated for semantic frames and psycholinguistically validated using eye-tracking; (iii) the development of AI algorithms for automatic semantic role labeling and knowledge discovery in multimodal objects; and (iv) the proposal of best practices for the audio description of videos.