AI4TV '20: Proceedings of the 2nd International Workshop on AI for Smart TV Content Production, Access and Delivery
SESSION: Keynote & Invited Talks
Session details: Keynote & Invited Talks
- Raphael Troncy
AI in the Media Spotlight
- Alexandre Rouxel
The use of AI technology offers many new opportunities for the media sector; in particular, it increases the productivity and efficiency with which relevant information can be conveyed to the appropriate viewers quickly and accurately.
In this keynote, I will show how AI is gradually transforming the content production and distribution chain for broadcasters and media in general. We will start with an overview of AI applications in the media field and the underlying technologies. Then we will go through some projects led or developed by the EBU for Public Service Media (PSM). To conclude, I will sketch the trends, limitations and potential evolution of the uptake of AI in media.
The range of AI applications across the written press, cinema, radio, television and advertising is wide. Starting with content production and post-production, AI is used for video creation and editing and, in the written press, for automatic or assisted writing, information analysis and verification. Without being exhaustive, in the broad field of audience analytics, AI can identify the optimal audience for a given piece of content and personalise and recommend content for a targeted audience or a specific user, depending on the granularity. From the perspective of accessibility and inclusion, AI plays a predominant role in improving access to content through transcription, translation, speech synthesis and recommendation.
In this context of the rise of AI in the media sphere, PSM organisations face the need to innovate and transform their value chain to better reach their audiences. This cannot be done without preserving the PSM remit, which combines a full range of distinctive, quality content to fulfil its central mission: to inform, educate and entertain.
As such, the EBU is leading projects and developing technologies to leverage AI capabilities for media while meeting the PSM remit [1]. Firstly, the EBU is leading a project to benchmark AI tools on the market in the context of PSM. As a first step, we are focusing on Automatic Speech Recognition; I will describe the objectives, the metrics and the evolution of the tool. Among other activities related to machine learning and metadata, the EBU is developing a tool to generate high-level tags for written content. Since NLP has recently achieved breakthroughs in many applications, we are working on leveraging this technology to produce high-level, explainable tags on written content. These tags are called high level because they identify properties correlated with several groups of linguistic features, such as vocabulary, grammar, semantics and formality. Originally designed to detect fake news, they can also feed recommender systems or classifiers. I will detail the machine learning algorithms behind the tool and outline future work.
And, Action! Towards Leveraging Multimodal Patterns for Storytelling and Content Analysis
- Natalie Parde
Humans perform intelligent tasks by productively leveraging relevant information from numerous sensory and experiential inputs, and recent scientific and hardware advances have made it increasingly possible for machines to attempt this as well. However, improved resource availability does not automatically give rise to humanlike performance in complex tasks [1]. In this talk, I discuss recent work towards three tasks that benefit from an elegant synthesis of linguistic and visual input: visual storytelling, visual question answering (VQA), and affective content analysis. I focus primarily on visual storytelling, a burgeoning task with the goal of generating coherent, sensible narratives for sequences of input images [2]. I analyze recent work in this area, and then introduce a novel visual storytelling approach that employs a hierarchical context-based network, with a co-attention mechanism that jointly attends to patterns in visual (image) and linguistic (description) input.
Following this, I describe ongoing work in VQA, another inherently multimodal task with the goal of producing accurate, sensible answers to questions about images. I explore a formulation in which the VQA model generates unconstrained, free-form text, providing preliminary evidence that harnessing the linguistic patterns latent in language models results in competitive task performance [3].
Finally, I introduce some intriguing new work that investigates the utility of linguistic patterns in a task that is not inherently multimodal: analyzing the affective content of images. I close by suggesting some exciting future directions for each of these tasks as they pertain to multimodal media analysis.
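To make the co-attention mechanism mentioned above more concrete, the following is a minimal sketch of a joint attention block over image-region and description-token features, written in PyTorch. The bilinear affinity formulation, tensor names and dimensions are illustrative assumptions and not the network presented in the talk.

```python
import torch
import torch.nn as nn


class CoAttention(nn.Module):
    """Minimal co-attention block: image regions and description tokens attend to each other.

    Shapes are illustrative: V is (batch, regions, d), L is (batch, tokens, d).
    """

    def __init__(self, d: int):
        super().__init__()
        self.affinity = nn.Linear(d, d, bias=False)  # learned bilinear similarity space

    def forward(self, V: torch.Tensor, L: torch.Tensor):
        # Affinity between every image region and every description token: (batch, regions, tokens).
        C = torch.bmm(self.affinity(V), L.transpose(1, 2))
        attn_over_regions = torch.softmax(C, dim=1)   # for each token, a distribution over regions
        attn_over_tokens = torch.softmax(C, dim=2)    # for each region, a distribution over tokens
        L_ctx = torch.bmm(attn_over_regions.transpose(1, 2), V)  # visual context for each token
        V_ctx = torch.bmm(attn_over_tokens, L)                   # linguistic context for each region
        return V_ctx, L_ctx


# Toy usage with random features standing in for CNN region features and word embeddings.
co_attn = CoAttention(d=256)
V = torch.randn(2, 36, 256)   # e.g. 36 image regions per image
L = torch.randn(2, 20, 256)   # e.g. 20 description tokens
V_ctx, L_ctx = co_attn(V, L)
print(V_ctx.shape, L_ctx.shape)  # torch.Size([2, 36, 256]) torch.Size([2, 20, 256])
```

The context vectors produced this way can then be fed into a hierarchical decoder that generates one sentence per image while conditioning on the story so far.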
SESSION: Session 1: Video Analytics and Storytelling
Session details: Session 1: Video Analytics and Storytelling
- Vasileios Mezaris
Predicting Your Future Audience's Popular Topics to Optimize TV Content Marketing Success
- Lyndon Nixon
TV broadcasters and other organisations with online media collections that wish to extend the reach of, and engagement with, their media assets conduct digital marketing activities. Marketing success depends on the relevance of the topics of the media content to the audience; this is made even more difficult when planning future marketing activities, as one needs to know the topics that the future audience will be interested in. This paper presents an innovative application of AI-based predictive analytics to identify the topics that will be more popular among future audiences, and its use in the digital content marketing strategy of media organisations.
Neural Style Transfer Based Voice Mimicking for Personalized Audio Stories
- Syeda Maryam Fatima
- Marina Shehzad
- Syed Sami Murtuza
- Syeda Saleha Raza
This paper demonstrates CNN-based neural style transfer on audio data to make storytelling a personalized experience by asking users to record a few sentences that are used to mimic their voice. User audios are converted to spectrograms, the style of which is transferred to the spectrogram of a base voice narrating the story; this neural style transfer is analogous to style transfer on images. The approach stands out because it needs only a small dataset and therefore takes less time to train the model. The project is intended specifically for children who prefer digital interaction and are increasingly leaving the storytelling culture behind, and for working parents who are not able to spend enough time with their children. By using a parent's initial recording to narrate a given story, it is designed to serve as a bridge between storytelling and screen time, engaging children through the implicit ethical themes of the stories and connecting them to their loved ones while ensuring an innocuous and meaningful learning experience.
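As an illustration of the spectrogram-based transfer described above, here is a minimal sketch that optimises a log-spectrogram against a Gatys-style content loss (from the base narration) and style loss (from the parent's recording), using a single random 1-D convolution as the feature extractor. The file names, network and hyper-parameters are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import librosa
import torch
import torch.nn.functional as F

def log_spectrogram(path, sr=22050, n_fft=1024, hop=256):
    """Load audio and return a log-magnitude spectrogram of shape (freq_bins, frames)."""
    y, _ = librosa.load(path, sr=sr)
    return torch.tensor(np.log1p(np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))),
                        dtype=torch.float32)

def gram(features):
    """Gram matrix over channels, the usual style statistic in Gatys-style transfer."""
    c, t = features.shape[-2:]
    f = features.reshape(c, t)
    return f @ f.t() / t

# Hypothetical file names, for illustration only.
content = log_spectrogram("base_voice_story.wav")  # narration whose words are kept
style = log_spectrogram("parent_recording.wav")    # voice whose character is transferred

# A random, frozen 1-D convolution over time (frequency bins as channels) serves as the
# feature extractor; a trained CNN could be substituted.
torch.manual_seed(0)
conv = torch.nn.Conv1d(content.shape[0], 128, kernel_size=11, padding=5)
for p in conv.parameters():
    p.requires_grad_(False)

x = content.clone().requires_grad_(True)  # optimise the spectrogram itself
opt = torch.optim.Adam([x], lr=0.02)
with torch.no_grad():
    content_feat = conv(content.unsqueeze(0))
    style_gram = gram(conv(style.unsqueeze(0)))

for step in range(500):
    opt.zero_grad()
    feat = conv(x.unsqueeze(0))
    loss = F.mse_loss(feat, content_feat) + 1e-2 * F.mse_loss(gram(feat), style_gram)
    loss.backward()
    opt.step()

# x now holds the stylised log-spectrogram; audio can be recovered with Griffin-Lim,
# e.g. librosa.griffinlim(np.expm1(x.detach().numpy()), hop_length=256).
```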
Video Analysis for Interactive Story Creation: The Sandmännchen Showcase
- Miggi Zwicklbauer
- Willy Lamm
- Martin Gordon
- Konstantinos Apostolidis
- Basil Philipp
- Vasileios Mezaris
This paper presents a method to interactively create a new Sandmännchen story. We built an application which is deployed on a smart speaker, interacts with a user, selects appropriate segments from a database of Sandmännchen episodes and combines them to generate a new story that is compatible with the user's requests. The underlying video analysis technologies are presented and evaluated. We additionally showcase example results from using the complete application, as a proof of concept.
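As a rough illustration of how annotated segments might be matched to a spoken request and stitched into a new story, here is a minimal sketch; the annotation schema, segment roles and matching rule are assumptions for illustration and not the deployed Sandmännchen system.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A Sandmännchen episode fragment with concept annotations from video analysis."""
    episode: str
    start: float   # seconds
    end: float
    concepts: set  # e.g. {"moon", "forest", "bedtime"}
    role: str      # "intro", "adventure" or "ending"

def build_story(request_concepts, segments):
    """Pick one intro, the best-matching adventure segments and one ending."""
    def score(seg):
        return len(seg.concepts & request_concepts)

    intro = max((s for s in segments if s.role == "intro"), key=score)
    ending = max((s for s in segments if s.role == "ending"), key=score)
    adventures = sorted((s for s in segments if s.role == "adventure"),
                        key=score, reverse=True)[:3]  # keep the three most relevant middles
    return [intro, *adventures, ending]

# Toy catalogue and a spoken request such as "a story about the moon and a fox".
catalogue = [
    Segment("ep01", 0, 30, {"sandman", "greeting"}, "intro"),
    Segment("ep14", 45, 120, {"moon", "rocket"}, "adventure"),
    Segment("ep22", 10, 90, {"forest", "fox"}, "adventure"),
    Segment("ep07", 200, 230, {"sleep", "goodnight"}, "ending"),
]
story = build_story({"moon", "fox"}, catalogue)
print([f"{s.episode}[{s.start}-{s.end}]" for s in story])
```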
SESSION: Session 2: Video Annotation and Summarization
Session details: Session 2: Video Annotation and Summarization
- Jorma Laaksonen
Named Entity Recognition for Spoken Finnish
- Dejan Porjazovski
- Juho Leinonen
- Mikko Kurimo
In this paper we present a Bidirectional LSTM neural network with a Conditional Random Field layer on top, which utilizes word, character and morph embeddings in order to perform named entity recognition on various Finnish datasets. To overcome the lack of annotated training corpora that arises when dealing with low-resource languages like Finnish, we tried a knowledge transfer technique to transfer tags from an Estonian dataset. On the human-annotated in-domain Digitoday dataset, our system achieved an F1 score of 84.73. On the out-of-domain Wikipedia set we obtained an F1 score of 67.66. In order to see how well the system performs on speech data, we used two datasets containing automatic speech recognition outputs. Since we do not have true labels for those datasets, we used a rule-based system to annotate them and used those annotations as reference labels. On the first dataset, which contains Finnish parliament sessions, we obtained an F1 score of 42.09, and on the second, which contains talks from Yle Pressiklubi, an F1 score of 74.54.
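For readers unfamiliar with the architecture, the following is a minimal sketch of a BiLSTM-CRF tagger of the kind named in the abstract, using the third-party pytorch-crf package for the CRF layer. The embedding sizes, tag-set size and the assumption that character and morph embeddings arrive as precomputed vectors are illustrative choices, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    """BiLSTM encoder with a CRF decoding layer for NER.

    Word, character-level and morph embeddings are concatenated per token;
    here the character/morph embeddings are assumed to be precomputed vectors.
    """

    def __init__(self, vocab_size, num_tags, word_dim=100, char_dim=50, morph_dim=50, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.lstm = nn.LSTM(word_dim + char_dim + morph_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)   # emission scores per tag
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, words, char_vecs, morph_vecs):
        x = torch.cat([self.word_emb(words), char_vecs, morph_vecs], dim=-1)
        h, _ = self.lstm(x)
        return self.proj(h)

    def loss(self, words, char_vecs, morph_vecs, tags, mask):
        emissions = self._emissions(words, char_vecs, morph_vecs)
        return -self.crf(emissions, tags, mask=mask)   # negative log-likelihood

    def predict(self, words, char_vecs, morph_vecs, mask):
        emissions = self._emissions(words, char_vecs, morph_vecs)
        return self.crf.decode(emissions, mask=mask)   # best tag sequence per sentence

# Toy forward pass: batch of 2 sentences, 7 tokens each, 5 NER tags (e.g. a small BIO scheme).
model = BiLSTMCRFTagger(vocab_size=1000, num_tags=5)
words = torch.randint(1, 1000, (2, 7))
chars = torch.randn(2, 7, 50)
morphs = torch.randn(2, 7, 50)
mask = torch.ones(2, 7, dtype=torch.bool)
print(model.predict(words, chars, morphs, mask))
```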
Avoid Crowding in the Battlefield: Semantic Placement of Social Messages in Entertainment Programs
- Yashaswi Rauthan
- Vatsala Singh
- Rishabh Agrawal
- Satej Kadlay
- Niranjan Pedanekar
- Shirish Karande
- Manasi Malik
- Iaphi Tariang
Crisis situations often require authorities to convey important messages to a large population of varying demographics. An example of such a message is "maintain a distance of 6 ft from others" during the present COVID-19 crisis. In this paper, we propose a method to programmatically place such messages in existing entertainment media as overlays at semantically relevant locations. For this purpose, we use generic semantic annotations on the media and subsequent spatio-temporal querying on these annotations to find candidate locations for message placement. We then propose choosing the final locations optimally using parameters such as the spacing between messages, the length of the messages and the confidence of the query results. We present preliminary results for optimal placement of messages in popular entertainment media.
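A minimal sketch of this kind of placement step follows: candidate locations returned by spatio-temporal queries are filtered greedily using query confidence, a minimum spacing between messages and the time needed to read each message. The parameter names, thresholds and greedy strategy are assumptions for illustration, not the authors' optimisation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate overlay location returned by a spatio-temporal query."""
    start: float        # seconds into the programme
    duration: float     # how long the overlay can stay on screen
    confidence: float   # confidence of the semantic query match, 0..1

def place_messages(candidates, messages, min_gap=120.0, secs_per_word=0.5):
    """Greedy selection: prefer confident slots, keep chosen slots at least min_gap
    seconds apart, and require each slot to be long enough to read its message."""
    chosen = []                 # (start_time, message)
    remaining = list(messages)
    for cand in sorted(candidates, key=lambda c: -c.confidence):
        if not remaining:
            break
        far_enough = all(abs(cand.start - s) >= min_gap for s, _ in chosen)
        msg = remaining[0]
        if far_enough and cand.duration >= secs_per_word * len(msg.split()):
            chosen.append((cand.start, msg))
            remaining.pop(0)
    return sorted(chosen)

# Toy example: two public-health messages and three queried candidate slots.
slots = [Candidate(300, 6, 0.9), Candidate(340, 8, 0.7), Candidate(900, 5, 0.8)]
msgs = ["Maintain a distance of 6 ft from others", "Wash your hands regularly"]
print(place_messages(slots, msgs))  # [(300, '...6 ft...'), (900, 'Wash your hands regularly')]
```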
Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework
- Vishal Kaushal
- Suraj Kothawade
- Rishabh Iyer
- Ganesh Ramakrishnan
Automatic video summarization remains an unsolved problem, and we take steps towards making it more realistic by addressing three challenges. Firstly, the currently available datasets either have very short videos or have only a few long videos of a single type. We introduce a new benchmarking dataset called VISIOCITY which comprises longer videos across six different categories with dense concept annotations capable of supporting different flavors of video summarization and other vision problems. Secondly, for long videos, the human reference summaries necessary for supervised video summarization techniques are difficult to obtain. We present a novel recipe based on Pareto optimality to automatically generate multiple reference summaries from the indirect ground truth present in VISIOCITY, and we show that these summaries are on par with human summaries. Thirdly, we demonstrate that in the presence of multiple ground truth summaries (due to the highly subjective nature of the task), learning from a single combined ground truth summary using a single loss function is not a good idea. We propose a simple recipe, VISIOCITY-SUM, to enhance an existing model using a combination of losses and demonstrate that it beats the current state-of-the-art techniques. We also present a study of the different desired characteristics of a good summary and demonstrate that a single measure (say F1) for evaluating a summary, as is the current typical practice, falls short in some ways. We propose an evaluation framework for better quantitative assessment of summary quality which is closer to human judgment than a single measure. We report the performance of a few representative video summarization techniques on VISIOCITY assessed using various measures, bring out the limitations of the techniques and/or the assessment mechanism in modeling human judgment, and demonstrate the effectiveness of our evaluation framework in doing so.
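To illustrate the Pareto-optimality idea used for generating multiple reference summaries, here is a minimal sketch that keeps only candidate summaries whose multi-criteria scores are not dominated by any other candidate. The criteria names and the candidate scores are invented for illustration and are not the VISIOCITY recipe or data.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every criterion and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return candidate summaries whose score vectors are not dominated by any other candidate."""
    return [
        (name, scores) for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates if other is not scores)
    ]

# Each candidate summary is scored along (importance, diversity, continuity),
# hypothetically derived from dense concept annotations; higher is better.
candidates = [
    ("summary_A", (0.82, 0.40, 0.70)),
    ("summary_B", (0.75, 0.65, 0.60)),
    ("summary_C", (0.70, 0.30, 0.50)),   # dominated by summary_A, so it is discarded
    ("summary_D", (0.60, 0.80, 0.55)),
]
for name, scores in pareto_front(candidates):
    print(name, scores)   # summary_A, summary_B and summary_D survive as reference summaries
```

Keeping the whole non-dominated set, rather than a single best summary, is what allows several distinct but equally defensible reference summaries to coexist for one video.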