PhD thesis abstracts

June 2011

Dinesh Babu Jayagopi

Computational Modeling of Face-to-Face Social Interaction

The computational modeling of face-to-face interactions using nonverbal behavioral cues is an emerging and relevant problem in social computing. Studying face-to-face interactions in small groups helps in understanding the basic processes of individual and group behavior, and in improving team productivity and satisfaction in the modern workplace. Apart from the verbal channel, nonverbal behavioral cues form a rich communication channel through which people infer - often automatically and unconsciously - emotions, relationships, and traits of fellow members.

There exists a solid body of knowledge about small groups and the multimodal nature of the nonverbal phenomenon in social psychology and nonverbal communication. However, the problem has only recently begun to be studied in the multimodal processing community. A recent trend is to analyze these interactions in the context of face-to-face group conversations using multiple sensors, and to make inferences automatically without the need for a human expert. These problems can be formulated in a machine learning framework involving the extraction of relevant audio and video features and the design of supervised or unsupervised learning models.

While attempting to bridge social psychology, perception, and machine learning, certain factors have to be considered. Firstly, various group conversation patterns emerge at different time scales. For example, turn-taking patterns evolve over shorter time scales, whereas dominance or group-interest trends get established over larger time scales. Secondly, a set of audio and visual cues that are not only relevant but also robustly computable needs to be chosen. Thirdly, unlike typical machine learning problems where ground truth is well defined, interaction modeling involves data annotation that needs to factor in interannotator variability. Finally, principled ways of integrating the multimodal cues have to be investigated.

In the thesis, we have investigated individual social constructs in small groups like dominance and status (two facets of the so-called vertical dimension of social relations). In the first part of this work, we have investigated how dominance perceived by external observers can be estimated by different nonverbal audio and video cues, and how the estimate is affected by annotator variability, the estimation method, and the exact task involved. In the second part, we jointly study perceived dominance and role-based status to understand whether dominant people are the ones with high status and whether dominance and status in small group conversations can be automatically explained by the same nonverbal cues. We employ speaking activity, visual activity, and visual attention cues in both studies.

In the second part of the thesis, we have investigated group social constructs using both supervised and unsupervised approaches. We first propose a novel framework to characterize groups. The two-layer framework consists of an individual layer and a group layer. At the individual layer, the floor-occupation patterns of the individuals are captured. At the group layer, the identity information of the individuals is not used. We define group cues by aggregating individual cues over time and person, and use them to classify group conversational contexts - cooperative vs competitive and brainstorming vs decision-making. We then propose a framework to discover group interaction patterns using probabilistic topic models. An objective evaluation of our methodology involving human judgment and multiple annotators showed that the learned topics are indeed meaningful, and also that the discovered patterns resemble prototypical leadership styles - autocratic, participative, and free-rein - proposed in social psychology.
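As a rough illustration of the two-layer idea, the sketch below derives identity-free group cues from per-person speaking activity. The function names, the binary frame representation, and the two cues chosen (ranked floor shares and their entropy) are assumptions made for this example; the abstract does not specify the actual cue set.

```python
from math import log

def speaking_fraction(activity):
    """Individual layer: fraction of frames in which one person holds the floor."""
    return sum(activity) / len(activity)

def group_cues(activities):
    """Group layer: aggregate individual cues over time and person.

    `activities` maps a participant label to a binary speaking-activity
    sequence (1 = speaking in that frame). Labels are only used to iterate;
    the returned cues do not depend on who is who.
    """
    fractions = sorted((speaking_fraction(a) for a in activities.values()), reverse=True)
    total = sum(fractions)
    shares = [f / total for f in fractions] if total > 0 else [0.0] * len(fractions)
    # Entropy of the floor shares: high when speaking time is evenly spread.
    entropy = -sum(p * log(p) for p in shares if p > 0)
    return {"ranked_shares": shares, "floor_entropy": entropy}

# Hypothetical three-person meeting, eight frames each.
meeting = {
    "A": [1, 1, 1, 0, 1, 1, 0, 1],
    "B": [0, 0, 0, 1, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0, 0, 1, 0],
}
cues = group_cues(meeting)
print(cues["ranked_shares"])  # [0.75, 0.125, 0.125]
```

Because the shares are sorted before aggregation, relabeling the participants leaves the cues unchanged, which is one simple way to discard identity information at the group layer.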

Advisor(s): Thesis supervisor: Daniel Gatica-Perez

SIG MM member(s): Daniel Gatica-Perez


Social Computing group

Social computing is an emerging research domain focused on the automatic sensing, analysis, and interpretation of human and social behavior from sensor data. Through microphones and cameras in multi-sensor spaces, mobile phones, and the web, sensor data depicting human behavior can increasingly be obtained at large scale - longitudinally and population-wise. The research group integrates models and methods from multimedia signal processing and information systems, statistical machine learning, and ubiquitous computing, and applies knowledge from the social sciences to address questions related to the discovery, recognition, and prediction of short-term and long-term behavior of individuals, groups, and communities in real life. This can range from people at work having meetings, to users of social media sites, to people with mobile phones in urban environments. The group's research methods are aimed at creating ethical, personally and socially meaningful applications that support social interaction and communication, in the contexts of work, leisure, healthcare, and creative expression.

Katayoun Farrahi

A Probabilistic Approach to Socio-Geographic Reality Mining

As we live our daily lives, our surroundings know about it. Our surroundings consist of people, but also our electronic devices. Our mobile phones, for example, continuously sense our movements and interactions. This socio-geographic data could be continuously captured by hundreds of millions of people around the world and promises to reveal important behavioral clues about humans in a manner never before possible. Mining patterns of human behavior from large-scale mobile phone data has deep potential impact on society. For example, by understanding a community’s movements and interactions, appropriate measures may be put in place to prevent the threat of an epidemic. The study of such human-centric massive datasets requires advanced mathematical models and tools. In this thesis, we investigate probabilistic topic models as unsupervised machine learning tools for large-scale socio-geographic activity mining.

We first investigate two types of probabilistic topic models for large-scale location-driven phone data mining. We propose a methodology based on Latent Dirichlet Allocation, followed by the Author Topic Model, for the discovery of dominant location routines mined from the MIT Reality Mining data set containing the activities of 97 individuals over the course of a 16-month period. We investigate the many possibilities of our proposed approach in terms of activity modeling, including differentiating users with high and low varying lifestyles and determining when a user’s activities fluctuate from the norm over time.

We then consider both location and interaction features from cell tower connections and Bluetooth, in single and multimodal forms for routine discovery, where the daily routines discovered contain information about the interactions of the day in addition to the locations visited. We also propose a method for the prediction of missing multimodal data based on Latent Dirichlet Allocation. We further consider a supervised approach for day type and student type classification using similar socio-geographic features.

We then propose two new probabilistic approaches to alleviate some of the limitations of Latent Dirichlet Allocation for activity modeling. Long-duration activities and activities of varying duration cannot be modeled with the initially proposed methods due to problems with input and model parameter size explosion. We first propose a Multi-Level Topic Model as a method to incorporate multiple time duration sequences into a probabilistic generative topic model. We then propose the Pairwise-Distance Topic Model as an approach to address the problem of modeling long duration activities with topics.

Finally, we consider an application of our work to the study of influencing factors in human opinion change with mobile sensor data. We consider the Social Evolution Project Reality Mining dataset, and investigate other mobile phone sensor features including communication logs. We consider the difference in behaviors of individuals who change political opinion and those who do not. We combine several types of data to form multimodal exposure features, which express the exposure of individuals to others’ political opinions. We use the previously defined methodology based on Latent Dirichlet Allocation to define each group’s behaviors in terms of their exposure to opinions, and determine statistically significant features which differentiate those who change opinions from those who do not. We also consider the difference in exposure features between individuals who increase their interest in politics and those who do not.

Overall, this thesis addresses several important issues in the recent body of work called Computational Social Science. Investigations grounded in mathematical models and multiple types of mobile phone sensor data are performed to mine real-life human activities in large-scale scenarios.

Advisor(s): Daniel Gatica-Perez, thesis supervisor

SIG MM member(s): Daniel Gatica-Perez

ISBN number: DOI: 10.5075/epfl-thesis-5018

Kimiaki Shirahama

Intelligent Video Processing Using Data Mining Techniques

Due to the rapidly increasing volume of video data on the Web, much research effort has been devoted to developing video retrieval methods which can efficiently retrieve videos of interest. Considering the limited manpower available, retrieval methods which use features automatically extracted from videos are highly desirable. However, since features only represent physical contents (e.g. color, edge, motion, etc.), retrieval methods require knowledge of how to use and integrate features for retrieving videos relevant to queries. To obtain such knowledge, this thesis concentrates on `video data mining' where videos are analyzed using data mining techniques which extract previously unknown, interesting patterns in underlying data. Thereby, patterns for retrieving shots relevant to queries are extracted as explicit knowledge.

Queries can be classified into three types. For the first type of queries, a user can find keywords suitable for retrieving relevant videos. For the second type of queries, the user cannot find such keywords due to lexical ambiguity, but can provide some example videos. For the final type of queries, the user has neither keywords nor example videos. Thus, this thesis develops a video retrieval system with `multi-modal' interfaces by implementing three video data mining methods, one for each of the above three query types. For the first query type, the system provides a `Query-By-Keyword' (QBK) interface where patterns which characterize videos relevant to certain keywords are extracted. For the second query type, a `Query-By-Example' (QBE) interface is provided where relevant videos are retrieved based on their similarities to example videos provided by the user. To this end, patterns for defining meaningful shot similarities are extracted using example videos. For the final query type, a `Query-By-Browsing' (QBB) interface is devised where abnormal video editing patterns are detected to characterize impressive segments in videos, so that the user can browse these videos to find keywords or example videos. Finally, to improve retrieval performance, the integration of QBK and QBE is explored, where information from the text and image/video modalities is exchanged using a knowledge base which represents relations among semantic contents.

The developed video data mining methods and the integration method are summarized as follows.

The method for the QBK interface builds on the observation that a certain semantic content is presented by concatenating several shots taken by different cameras. Thus, this method extracts `sequential patterns' which relate adjacent shots relevant to certain keyword queries. Such patterns are extracted by connecting characteristic features in adjacent shots. However, the extraction of sequential patterns is computationally expensive because a huge number of feature sequences have to be examined as candidate patterns. Hence, time constraints are adopted to eliminate semantically irrelevant sequences of features.

The method for the QBE interface addresses the large variation among relevant shots. That is, even for the same query, relevant shots contain significantly different features due to varied camera techniques and settings. Thus, `rough set theory' is used to extract multiple patterns which characterize different subsets of example shots. This pattern extraction requires counter-example shots to compare against the example shots, but none are provided. Hence, `partially supervised learning' is used to collect counter-example shots from a large set of shots left behind in the database. In particular, to characterize the boundary between relevant and irrelevant shots, the method collects counter-example shots which are as similar to example shots as possible.

The method for the QBB interface assumes that impressive actions of a character are presented by abnormal video editing patterns. For example, thrilling actions of the character are presented by shots with very short durations while his/her romantic actions are presented by shots with very long durations. Based on this, the method detects `bursts' as patterns consisting of abnormally short or long durations of the character's appearance. The method firstly performs a probabilistic time-series segmentation to divide a video into segments characterized by distinct patterns of the character's appearance. It then examines whether each segment contains a burst or not.
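The notion of a burst can be illustrated with a deliberately simplified sketch. It does not implement the probabilistic time-series segmentation described above; instead it substitutes a plain threshold around the median duration, and the function name `find_bursts` and the `ratio` parameter are hypothetical choices for this example.

```python
from statistics import median

def find_bursts(durations, ratio=3.0):
    """Flag runs of abnormally short or long appearance durations.

    A duration is "short" if it is below median / ratio and "long" if it is
    above median * ratio. Consecutive flagged shots with the same label are
    merged into one burst, returned as (label, first_index, last_index).
    """
    mid = median(durations)
    labels = []
    for d in durations:
        if d < mid / ratio:
            labels.append("short")
        elif d > mid * ratio:
            labels.append("long")
        else:
            labels.append(None)

    bursts = []
    start = None
    for i, label in enumerate(labels + [None]):  # sentinel closes the last run
        if start is not None and label != labels[start]:
            bursts.append((labels[start], start, i - 1))
            start = None
        if start is None and label is not None:
            start = i
    return bursts

# Hypothetical durations (in seconds) of a character's successive appearances.
durations = [4.0, 5.0, 0.8, 0.6, 0.9, 4.5, 5.5, 20.0, 18.0, 4.2]
bursts = find_bursts(durations)
print(bursts)  # [('short', 2, 4), ('long', 7, 8)]
```

A fixed global threshold like this ignores the fact that the typical duration drifts over a video, which is presumably why the thesis segments the time series first and tests each segment separately.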

The integration of QBK and QBE is achieved by constructing a `video ontology' where concepts such as Person, Car and Building are organized into a hierarchical structure. Specifically, this is constructed by considering the generalization/specialization relation among concepts and their co-occurrences in the same shots. Based on the video ontology, concepts related to a keyword query are selected by tracing its hierarchical structure. Shots in which few of the selected concepts are detected are filtered out, and then QBE is performed on the remaining shots.
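The filtering step can be sketched as follows, under strong simplifying assumptions: the ontology is a plain parent-to-children dictionary covering only the generalization/specialization relation (co-occurrence is left out), concept detections per shot are given as sets, and all names, including the toy concepts beyond Person, Car and Building, are hypothetical.

```python
def related_concepts(ontology, concept):
    """Return `concept` plus all of its descendants in the hierarchy."""
    selected = {concept}
    stack = [concept]
    while stack:
        for child in ontology.get(stack.pop(), []):
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected

def filter_shots(shots, selected, min_hits=1):
    """Keep shots in which at least `min_hits` selected concepts were detected."""
    return [sid for sid, detected in shots.items()
            if len(detected & selected) >= min_hits]

# Toy hierarchy: each key generalizes the concepts in its list.
ontology = {
    "Vehicle": ["Car", "Bus"],
    "Car": ["Taxi"],
    "Outdoor": ["Building", "Road"],
}
# Concepts detected in each shot (hypothetical detector output).
shots = {
    "shot_1": {"Taxi", "Road"},
    "shot_2": {"Person", "Building"},
    "shot_3": {"Bus", "Car", "Person"},
}

selected = related_concepts(ontology, "Vehicle")
candidates = filter_shots(shots, selected)
print(candidates)  # ['shot_1', 'shot_3']
```

Only the shots in `candidates` would then be passed on to the QBE stage, which keeps the more expensive similarity computation away from clearly irrelevant shots.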

Experimental results validate the effectiveness of all the developed methods. In the future, the multi-modal video retrieval system will be extended by adding a `Query-By-Gesture' (QBG) interface based on virtual reality techniques. This enables a user to create example shots for any arbitrary queries by synthesizing his/her gesture, 3DCG and background images.

Advisor(s): Prof. Dr. Kuniaki Uehara (supervisor)

SIG MM member(s): Kimiaki Shirahama


CS 24 Uehara Laboratory at Graduate School of System Informatics, Kobe University

Our research group aims at developing fundamental and practical technologies to utilize knowledge extracted from multimedia data. To this end, we are conducting research in broad areas of artificial intelligence, more specifically, machine learning, video data mining, time-series data analysis, information retrieval, trend analysis, knowledge discovery, etc. with typically a large amount of data.

As a part of the research efforts, we are developing a multi-modal video retrieval system where different media, such as text, image, video, and audio, are analyzed using machine learning and data mining techniques. We formulate video retrieval as a classification problem to discriminate between relevant and irrelevant shots to a query. Various techniques, such as rough set theory, partially supervised learning, multi-task learning, and Hidden Markov Model (HMM), are applied to the classification. Recently, we began to develop a gesture-based video retrieval system where information from various sensors, including RGB cameras, depth sensors, and magnetic sensors, is fused using virtual reality and computer vision techniques. In addition, transfer learning and collaborative filtering are utilized to refine the video annotation.

Another pillar of our research group is concerned with deeper analysis of natural language text. Our primary focus is to distill both explicit and implicit information contained therein. The former is generally seen as the problems of information extraction, question answering, passage retrieval, and annotation, and the latter as hypothesis discovery and text mining. Explicit information is directly described in text but not readily accessible by computers as it is embedded in complex human language. On the other hand, implicit information cannot be found in a single document and is only understood by synthesizing knowledge fragments scattered across a large number of documents. We take statistical natural language processing (NLP)- and machine learning-based approaches, such as kernel-based online learning and transductive transfer learning, to tackle these problems.

As described above, the common foundation underlying our research methodologies is machine learning, which requires more and more computing power reflecting increasingly available large-scale data and more complex algorithms. To deal with it, we are also engaged in developing parallel machine learning frameworks using MapReduce, MPI, Cell, and GPGPU. These works are ongoing and will be shared with the research community soon. More details of our research group can be found on our web site at

Pinaki Sinha

Automatic Summarization of Personal Photo Collections

Photo taking and sharing devices (e.g., smart phones, digital cameras, etc.) have become extremely popular in recent times. Photo enthusiasts today capture moments of their personal lives using these devices. This has resulted in huge collections of photos stored in various personal archives. The exponential growth of online social networks and web-based photo sharing platforms has added fuel to this fire. According to recent estimates [46], three billion photos are uploaded to the social network Facebook per month. This photo data overload has created some major challenges. One of them is the automatic generation of representative overviews from large photo collections. Manual browsing of photo corpora is not only tedious, but also time-inefficient. Hence, the development of an automatic photo summarization system is not only a research challenge but also a practical one. In this dissertation, we present a principled approach for the generation of size-constrained overview summaries from large personal photo collections.

We define a photo summary as an extractive subset which is a good representative of the larger photo set. We propose three properties that an effective summary should satisfy: Quality, Diversity and Coverage. Modern digital photos come with heterogeneous content and context data. We propose models which can combine this multimodal data to compute the summary properties. Our summarization objective is modeled as an optimization of these properties. Further, the summarization framework can integrate user preferences in the form of inputs. Thus, different summaries may be generated from the same corpus to accommodate preference variations among the users.
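One way to picture such an optimization is a greedy selection that trades off the three properties. The sketch below is only an illustration under assumptions of our own: photos are reduced to a quality score and a set of tags, diversity and coverage are computed from tag overlap, and the weights stand in for user preferences. The abstract does not state that the dissertation uses a greedy procedure or these particular measures.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def summarize(photos, k, w_quality=1.0, w_diversity=1.0, w_coverage=1.0):
    """Greedily pick up to k photos, scoring quality, diversity and coverage.

    Each photo is a dict with an "id", a "quality" score in [0, 1] and a set
    of "tags" standing in for its content and context data.
    """
    all_tags = set().union(*(p["tags"] for p in photos))
    chosen, covered = [], set()
    remaining = list(photos)
    while remaining and len(chosen) < k:
        def gain(p):
            # Coverage: share of all tags that this photo would newly cover.
            new_tags = p["tags"] - covered
            coverage = len(new_tags) / len(all_tags) if all_tags else 0.0
            # Diversity: 1 minus the largest overlap with an already chosen photo.
            overlap = max((jaccard(p["tags"], c["tags"]) for c in chosen), default=0.0)
            return (w_quality * p["quality"]
                    + w_diversity * (1.0 - overlap)
                    + w_coverage * coverage)
        best = max(remaining, key=gain)
        chosen.append(best)
        covered |= best["tags"]
        remaining.remove(best)
    return [p["id"] for p in chosen]

# Hypothetical four-photo collection.
photos = [
    {"id": "p1", "quality": 0.9, "tags": {"beach", "sunset"}},
    {"id": "p2", "quality": 0.8, "tags": {"beach", "sunset"}},
    {"id": "p3", "quality": 0.6, "tags": {"family", "dinner"}},
    {"id": "p4", "quality": 0.5, "tags": {"hike", "mountain"}},
]
summary = summarize(photos, k=2)
print(summary)  # ['p1', 'p3']
```

In this toy run the near-duplicate `p2` is skipped despite its high quality score, because it adds neither diversity nor coverage once `p1` is in the summary. Changing the weights yields different summaries from the same collection.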

A traditional way of intrinsic evaluation in information retrieval is comparing the retrieved result set with a manually generated ground truth. However, given the variability of human behavior in selection of appealing photos, it may be difficult and non-intuitive to generate a unique ground truth summary of a larger data corpus. Due to the personal nature of the dataset, only the contributor of a particular photo corpus can possibly summarize it (since personal photos typically come with lots of background personal knowledge). While considerable efforts have been directed towards evaluation of annotation and ranking in multimedia, relatively few experiments have been done to evaluate photo summaries.

We conducted extensive user studies on summarization of photos from single life events. The experiments showed a certain uniformity and some diversity of user preferences in generating and evaluating photo summaries. We also posit that photo summaries should serve the twin objectives of information discovery and reuse. Based on this assumption, we propose novel objective metrics which enable us to evaluate summaries from large personal photo corpora without user-generated ground truths. We also create a dataset of personal photos along with a host of contextual data which can be helpful in future research. Our experiments show that the summarization properties and framework proposed can indeed be used to generate effective summaries. This framework can be extended to include other types of information (e.g., social ties among multiple users present in a dataset) and to create personalized photo summaries.

Advisor(s): Professor Ramesh Jain (supervisor), Professor Sharad Mehrotra (committee member), Professor Padhraic Smyth (committee member), Professor Deva Ramanan (committee member)

SIG MM member(s): Ramesh Jain


Radu Andrei Negoescu

Modeling and understanding communities in online social

The amount of multimedia content is on a constant increase, and people interact with each other and with content on a daily basis through social media systems.

The goal of this thesis was to model and understand emerging online communities that revolve around multimedia content, more specifically photos, by using large-scale data and probabilistic models in a quantitative approach.

The dissertation has four contributions. First, using data from two online photo management systems, this thesis examined different aspects of the behavior of users of these systems pertaining to the uploading and sharing of photos with other users and online groups.

Second, probabilistic topic models were used to model online entities, such as users and groups of users, and the new proposed representations were shown to be useful for further understanding such entities, as well as to have practical applications in search and recommendation scenarios. Third, by jointly modeling users from two different social photo systems, it was shown that differences at the level of vocabulary exist, and different sharing behaviors can be observed.

Finally, by modeling online user groups as entities in a topic-based model, hyper-communities were discovered in an automatic fashion based on various topic-based representations.

These hyper-communities were shown, both through an objective and a subjective evaluation with a number of users, to be generally homogeneous, and therefore likely to constitute a viable exploration technique for online communities.

Advisor(s): Daniel Gatica-Perez, supervisor

SIG MM member(s): Daniel Gatica-Perez

ISBN number: DOI: 10.5075/epfl-thesis-5059

Social computing group

Our recent work has investigated methods to analyze small groups at work in multisensor spaces, populations of mobile phone users in urban environments, and on-line communities in social media.
