FAT/MM '19: Proceedings of the 1st International Workshop on Fairness, Accountability, and Transparency in MultiMedia


SESSION: Keynote Talks

Fairness in Algorithmic and Crowd-Generated Descriptions of People Images

  • Jahna Otterbacher

Image analysis algorithms have become indispensable in the modern information ecosystem. Beyond their early use in restricted domains (e.g., military, medical), they are now widely used in consumer applications and social media. With the rise of the "Algorithm Economy," image analysis algorithms are increasingly being commercialized as Cognitive Services. This practice is proving to be a boon to the development of applications where user modeling, personalization and adaptation are required. From e-stores, where image recognition is used to curate a "personal style" for a given shopper based on previously viewed items, to dating apps, which can now act as "visual matchmakers," the technology has gained increasing influence in our digital interactions and experiences. However, proprietary image tagging services are black boxes and there are numerous social and ethical issues surrounding their use in contexts where people can be harmed. In this talk, I will discuss recent work in analyzing proprietary image tagging services (e.g., Clarifai, Google Vision, Amazon Rekognition) for their gender and racial biases when tagging images depicting people. I will present our techniques for discrimination discovery in this domain, as well as our work on understanding user perceptions of fairness. Finally, I will explore the sources of such biases, by comparing human versus machine descriptions of the same people images.

Deep Learning for Video Retrieval by Natural Language

  • Xirong Li

Videos are everywhere. Video retrieval, i.e., finding videos that meet the information need of a specific user, is important for a wide range of applications including communication, education, entertainment, business, and security. Among the many ways of expressing an information need, natural-language text is the most intuitive starting point for retrieval. For instance, a user may want to find video shots showing "a person in front of a blackboard talking or writing in a classroom". Such a query can be submitted easily, by typing or speech recognition, to a video retrieval system. Given a video as a sequence of frames and a query as a sequence of words, a fundamental problem in video retrieval by natural language is how to properly associate visual and linguistic information presented in sequential order. We address this problem along three dimensions: (1) query representation, (2) video representation, and (3) common space. These three dimensions also account for the major design choices in state-of-the-art systems. We introduce a set of deep learning methods recently developed by our joint team of RUC, ZJGU, UvA and CAS. We evaluate the deep models on the TRECVID Ad-hoc Video Search (AVS) benchmark over the last three years (2016-2018). Much room exists for future research. Compared to video retrieval with semantic representations, deep learning approaches lack an intuitive explanation of the results obtained, in particular when the results are unsatisfactory. As retrieval performance continues to improve, the accountability of video retrieval models requires more research attention. While a well-performing deep model can largely be expected given adequate training data, novel algorithms that can learn a video retrieval model from limited training resources are much in demand. Consider, for instance, visual annotation and retrieval for a target language other than English.
Data and code used for this research are available at http://github.com/li-xirong/video-retrieval.
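
As a rough illustration of the "common space" dimension mentioned in the abstract above, the sketch below ranks videos by cosine similarity between a query vector and video vectors that are assumed to live in the same embedding space. The three-dimensional vectors, the video names, and the helpers `cosine` and `rank_videos` are hypothetical and are not taken from the authors' released code; in a real system the vectors would come from learned query and video encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return (video_id, score) pairs sorted from most to least similar to the query."""
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in video_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: both the query and the videos are assumed to have
# already been projected into the same three-dimensional common space.
videos = {
    "classroom_lecture": [0.9, 0.1, 0.0],
    "soccer_match": [0.0, 0.2, 0.9],
    "cooking_show": [0.1, 0.9, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for an encoded text query

ranking = rank_videos(query, videos)
for video_id, score in ranking:
    print(f"{video_id}: {score:.3f}")
```

Cosine similarity is only one possible matching function; it is used here because it is simple to inspect by hand.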

SESSION: Full Papers

Social Multimedia, Diversity, and Global South Cities: A Double Blind Side

  • Daniel Gatica-Perez
  • Darshan Santani
  • Joan Isaac-Biel
  • Thanh-Trung Phan

Social media provides opportunities to examine urban phenomena at scale, and we believe that studying cities in the Global South through citizen-contributed data and AI-driven analytics should be a priority of multimedia research. However, little work has been done in our community, and we argue that this contributes to a double blind side problem. We exemplify this situation by studying Ma3Route, a mobile social media channel to crowdsource and broadcast transit reports in Nairobi, Kenya. Using multimedia data from its Twitter stream, we first conduct a descriptive analysis that shows an active community generating rich traffic-related reports, and then discover latent topics that identify both regular and ephemeral thematic clusters of reports involving accidents, traffic conditions, and attitudes of citizens towards authorities. Second, we conduct a deep learning-based analysis of Ma3Route images to understand the kind of visual content shared on the platform; this analysis shows the limitations of deep neural network models trained on data largely from the US and Europe, which do not fully match the reality and diversity of other world regions. We conclude by presenting a multidisciplinary research agenda for future work in this domain.

A Software Defined Network Based Research on Fairness in Multimedia

  • Ahmed Osama Basil
  • Mu Mu
  • Ali Al-Sherbaz

The demand for online distribution of high-quality, high-throughput content has led to non-cooperative competition for network resources among a growing number of media applications. This significantly affects network efficiency and the quality of user experience (QoE), and creates discrepancies in QoE across user devices. Within a multi-user, multi-device environment, measuring and maintaining perceivable fairness becomes as critical as achieving QoE for individual user applications. This paper discusses application- and human-level fairness over networked multimedia applications and how such fairness can be managed through novel network designs using programmable networks such as software-defined networks (SDN).

Toward Fairness in Face Matching Algorithms

  • Jamal Alasadi
  • Ahmed Al Hilli
  • Vivek K. Singh

Automated face matching algorithms are used in a wide variety of societal applications, ranging from access authentication to criminal identification to application customization. Hence, it is important for such algorithms to be equitable in their performance for different demographic groups. If the algorithms work well only for certain racial or gender identities, they would adversely affect others. Recent efforts in the algorithmic fairness literature (typically not focused on multimedia or computer vision tasks such as face matching) have argued for designing algorithms and architectures to tackle such bias via trade-offs between accuracy and fairness. Here, we show that adopting an adversarial deep learning-based approach allows the model to maintain face matching accuracy while also reducing demographic disparities compared to a baseline (non-adversarial deep learning) approach. The results motivate and pave the way for more accurate and fair face matching algorithms.
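
The abstract above does not say how demographic disparity is measured. One simple way to make the notion concrete, offered here only as an assumption and not as the authors' metric, is to compute match accuracy per demographic group and report the gap between the best- and worst-served groups. The group labels, the records, and the function names below are hypothetical.

```python
from collections import defaultdict


def group_accuracy(records):
    """Accuracy per group from (group, predicted_match, actual_match) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(records):
    """Difference between the highest and lowest per-group accuracy."""
    accuracies = group_accuracy(records)
    return max(accuracies.values()) - min(accuracies.values())


# Hypothetical face matching outcomes: (group, predicted_match, actual_match).
records = [
    ("group_a", True, True),
    ("group_a", False, False),
    ("group_a", True, True),
    ("group_a", True, False),   # error
    ("group_b", True, True),
    ("group_b", False, True),   # error
    ("group_b", True, False),   # error
    ("group_b", False, False),
]

per_group = group_accuracy(records)
gap = accuracy_gap(records)
print(per_group, gap)
```

A gap near zero alongside unchanged overall accuracy would correspond to the outcome the abstract reports, namely reduced disparity without a loss of accuracy.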

Learning Facial Recognition Biases through VAE Latent Representations

  • Diego Celis
  • Meghana Rao

The use of facial recognition technology in fields such as law enforcement is creating increased concern about algorithm-embedded biases. MIT research recently demonstrated that commercial facial recognition technology performs poorly on people of color, particularly women. The objective of this work is to implement a variational autoencoder (VAE) that generates low-dimensional latent representations of faces and then to analyze these representations to interpret potentially learned biases. While the field of interpreting facial recognition biases is still emerging, previous work has also relied on VAEs to better understand the relationship between images and their latent representations. We implement a 10-layer VAE and analyze its image encoding to a single latent feature and to ten latent features. For the single latent feature encoding, by observing the images corresponding to the highest and lowest activation values, we hypothesized that the model focuses on the overall brightness and darkness of an image. For example, faces with darker tones are assigned the lowest values and faces with lighter tones are assigned the highest values. To test this hypothesis, we manually input values for the single latent feature into the decoder at equally spaced increments and observed the reconstructed images. The reconstructed images support the hypothesis: as the latent feature value increases, the image appears to become brighter, with whiter face attributes. This shows that protected features such as race (e.g., skin color) play a role in latent representations. This lays the groundwork for interpreting the encodings of individual latent features to address algorithmic biases. Future work could entail finding a set of latent features that more accurately represents said protected characteristics.
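
The latent traversal described above (feeding equally spaced values of a single latent feature into the decoder and inspecting the reconstructions) can be sketched as follows. The decoder here is a toy stand-in, a per-pixel sigmoid of a linear function of the latent value, and is purely an assumption for illustration; it is not the 10-layer VAE from the paper, and `toy_decoder`, `traverse`, and the weights are hypothetical.

```python
import math


def toy_decoder(z, weights, biases):
    """Toy decoder: maps one latent value to pixel intensities in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-(w * z + b))) for w, b in zip(weights, biases)]


def traverse(decoder, low, high, steps):
    """Decode equally spaced latent values and record each image's mean brightness."""
    step = (high - low) / (steps - 1)
    results = []
    for i in range(steps):
        z = low + i * step
        image = decoder(z)
        results.append((z, sum(image) / len(image)))
    return results


# Hypothetical parameters for a four-pixel "image"; all weights are positive,
# so this toy decoder brightens as the latent value grows.
weights = [0.8, 1.2, 0.5, 1.0]
biases = [0.0, -0.3, 0.2, 0.1]

sweep = traverse(lambda z: toy_decoder(z, weights, biases), low=-2.0, high=2.0, steps=5)
for z, brightness in sweep:
    print(f"z={z:+.1f} mean brightness={brightness:.3f}")
```

With a trained VAE, the same loop would call the real decoder, and the reconstructed images themselves, not only their mean brightness, would be inspected.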

QoE-fair Resource Allocation for DASH Video Delivery Systems

  • Luca De Cicco
  • Gioacchino Manfredi
  • Saverio Mascolo
  • Vittorio Palmisano

Services delivering videos to massive audiences are required to provide users with a satisfactory Quality of Experience (QoE) to keep engagement high and avoid service abandonment. Adaptive BitRate (ABR) algorithms running in video players are designed to dynamically change the video bitrate to provide the best possible QoE given the user device features and the available end-to-end network bandwidth. Well-designed ABR algorithms strive to improve the individual QoE obtained by each user, resulting, in the optimal case, in the maximization of the sum of the QoE individually perceived by users. However, when resources are scarce, maximizing the sum of the QoE might favor some clients at the expense of others, which instead obtain poor QoE, with the possible consequence of service abandonment. Thus, we argue that video service providers should directly address fairness issues when designing their delivery networks so as to gracefully degrade the QoE for all users when resources are scarce. This paper addresses this open issue and shows that the Multi-Commodity Flow Problem (MCFP) optimization framework is a suitable methodology for achieving a QoE-fair distribution of resources. The proposed solution is based on a bandwidth reservation approach that slices network resources and assigns similar video requests to the same network slice according to a proposed similarity metric that depends on video quality. The results show that the proposed approach achieves its goal and provides a fair level of QoE to heterogeneous clients.
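
To illustrate what a fair split of scarce bandwidth can look like, the sketch below implements classic max-min fair allocation on a single shared link. This is a much simpler technique than the MCFP formulation and network slicing proposed in the paper and is swapped in only for illustration; the client names, the demands, and the function `max_min_fair` are hypothetical.

```python
def max_min_fair(capacity, demands):
    """Max-min fair split of `capacity` among clients with the given demands.

    Clients whose demand fits within an equal share are fully served first;
    the leftover capacity is then split equally among the remaining clients.
    """
    allocation = {client: 0.0 for client in demands}
    remaining = dict(demands)
    available = float(capacity)
    while remaining and available > 1e-12:
        share = available / len(remaining)
        satisfied = {c: d for c, d in remaining.items() if d <= share}
        if not satisfied:
            # Nobody can be fully served: split what is left equally.
            for client in remaining:
                allocation[client] += share
            break
        for client, demand in satisfied.items():
            allocation[client] += demand
            available -= demand
            del remaining[client]
    return allocation


# Hypothetical bitrate demands in Mbps on a link with 10 Mbps of capacity.
demands = {"phone": 2, "laptop": 4, "tv": 8}
allocation = max_min_fair(10, demands)
print(allocation)
```

In this example the phone and laptop are fully served and the TV receives the remaining 4 Mbps, so no client is starved in order to raise the total.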