PhD thesis abstracts
June 2011
Dinesh Babu Jayagopi: Computational Modeling of Face-to-Face Social Interaction
There exists a solid body of knowledge about small groups and the multimodal nature of nonverbal phenomena in social psychology and nonverbal communication. However, the problem has only recently begun to be studied in the multimodal processing community. A recent trend is to analyze these interactions in the context of face-to-face group conversations, using multiple sensors to make inferences automatically without the need for a human expert. These problems can be formulated in a machine learning framework involving the extraction of relevant audio and video features and the design of supervised or unsupervised learning models. While attempting to bridge social psychology, perception, and machine learning, certain factors have to be considered. Firstly, various group conversation patterns emerge at different time scales. For example, turn-taking patterns evolve over shorter time scales, whereas dominance or group-interest trends get established over longer time scales. Secondly, a set of audio and visual cues that are not only relevant but also robustly computable needs to be chosen. Thirdly, unlike typical machine learning problems where ground truth is well defined, interaction modeling involves data annotation that needs to factor in inter-annotator variability. Finally, principled ways of integrating the multimodal cues have to be investigated. In this thesis, we have investigated individual social constructs in small groups, such as dominance and status (two facets of the so-called vertical dimension of social relations). In the first part of this work, we have investigated how dominance perceived by external observers can be estimated from different nonverbal audio and video cues, and how the estimates are affected by annotator variability, the estimation method, and the exact task involved.
We then jointly study perceived dominance and role-based status to understand whether dominant people are the ones with high status, and whether dominance and status in small-group conversations can be automatically explained by the same nonverbal cues. We employ speaking activity, visual activity, and visual attention cues in both studies. In the second part of the thesis, we have investigated group social constructs using both supervised and unsupervised approaches. We first propose a novel framework to characterize groups. The two-layer framework consists of an individual layer and a group layer. At the individual layer, the floor-occupation patterns of the individuals are captured. At the group layer, the identity information of the individuals is not used. We define group cues by aggregating individual cues over time and across persons, and use them to classify group conversational contexts: cooperative vs. competitive and brainstorming vs. decision-making. We then propose a framework to discover group interaction patterns using probabilistic topic models. An objective evaluation of our methodology, involving human judgment and multiple annotators, showed that the learned topics are indeed meaningful, and that the discovered patterns resemble the prototypical leadership styles proposed in social psychology: autocratic, participative, and free-rein.
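As a rough illustration of the two-layer idea, the sketch below aggregates per-person binary speaking-activity tracks into identity-free group cues. The specific cues (overlap, silence, and an entropy-based floor-sharing measure) are a simplifying assumption for illustration, not the thesis's actual feature set:

```python
import numpy as np

def group_cues(speaking):
    """Aggregate per-person binary speaking-activity tracks
    (shape: persons x frames) into identity-free group cues."""
    speaking = np.asarray(speaking, dtype=float)
    n_persons = speaking.shape[0]
    active = speaking.sum(axis=0)              # number of simultaneous speakers
    overlap = float(np.mean(active > 1))       # fraction of overlapped speech
    silence = float(np.mean(active == 0))      # fraction of group silence
    share = speaking.sum(axis=1)
    share = share / max(share.sum(), 1e-12)    # floor-occupation distribution
    nz = share[share > 0]
    # normalized entropy of speaking-time shares:
    # 1.0 = perfectly egalitarian, 0.0 = one person holds the floor
    equality = float(-(nz * np.log2(nz)).sum()) / np.log2(n_persons)
    return {"overlap": overlap, "silence": silence, "equality": equality}
```

Because the cues are aggregated over persons, who spoke is discarded and only how the group behaved remains, matching the identity-free group layer described above.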
Katayoun Farrahi: A Probabilistic Approach to Socio-Geographic Reality Mining
We first investigate two types of probabilistic topic models for large-scale location-driven phone data mining. We propose a methodology based on Latent Dirichlet Allocation, followed by the Author Topic Model, for the discovery of dominant location routines mined from the MIT Reality Mining dataset, which contains the activities of 97 individuals over a 16-month period. We investigate the many possibilities of our proposed approach in terms of activity modeling, including differentiating users with highly varying lifestyles from those with low variation, and determining when a user's activities deviate from the norm over time. We then consider both location and interaction features, from cell tower connections and Bluetooth, in single and multimodal forms for routine discovery, where the daily routines discovered contain information about the day's interactions in addition to the locations visited. We also propose a method for the prediction of missing multimodal data based on Latent Dirichlet Allocation. We further consider a supervised approach for day-type and student-type classification using similar socio-geographic features. We then propose two new probabilistic approaches to alleviate some of the limitations of Latent Dirichlet Allocation for activity modeling. Long-duration activities and activities of varying time duration cannot be modeled with the initially proposed methods due to the explosion of input and model parameter sizes. We first propose a Multi-Level Topic Model as a method to incorporate multiple time-duration sequences into a probabilistic generative topic model. We then propose the Pairwise-Distance Topic Model as an approach to the problem of modeling long-duration activities with topics. Finally, we consider an application of our work to the study of influencing factors in human opinion change with mobile sensor data.
We consider the Social Evolution Project Reality Mining dataset, and investigate other mobile phone sensor features, including communication logs. We consider the difference in behavior between individuals who change political opinion and those who do not. We combine several types of data to form multimodal exposure features, which express the exposure of individuals to others' political opinions. We use the previously defined methodology based on Latent Dirichlet Allocation to describe each group's behavior in terms of its exposure to opinions, and determine statistically significant features that differentiate those who change opinions from those who do not. We also consider the difference in exposure features between individuals whose interest in politics increases and those whose interest does not. Overall, this thesis addresses several important issues in the recent body of work called Computational Social Science. Investigations grounded in mathematical models and multiple types of mobile phone sensor data are performed to mine real-life human activities in large-scale scenarios.
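The routine-discovery step can be pictured with an off-the-shelf LDA implementation. The sketch below is only illustrative: the day-level "location documents" and the two-topic setting are invented here, not taken from the Reality Mining data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical day-level "location documents": each token is a coarse
# location label observed in one timeslot of the corresponding day
days = [
    "home home work work work home",
    "home work work work cafe home",
    "home home home gym gym home",
    "home gym gym home home home",
]

vec = CountVectorizer()
X = vec.fit_transform(days)                       # days x location-word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                      # per-day routine mixtures
# each row of theta is a distribution over the discovered "routines"
```

Treating each day as a document of location words is what lets standard topic models surface recurring routines; the thesis's extensions address exactly the cases (long or variable-duration activities) where this flat bag-of-words encoding breaks down.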
Kimiaki Shirahama: Intelligent Video Processing Using Data Mining Techniques
Queries can be classified into three types. For the first type of query, a user can find keywords suitable for retrieving relevant videos. For the second type, the user cannot find such keywords due to lexical ambiguity, but can provide some example videos. For the final type, the user has neither keywords nor example videos. Thus, this thesis develops a video retrieval system with "multi-modal" interfaces by implementing three video data mining methods to support each of the above three query types. For the first query type, the system provides a "Query-By-Keyword" (QBK) interface, where patterns that characterize videos relevant to certain keywords are extracted. For the second query type, a "Query-By-Example" (QBE) interface is provided, where relevant videos are retrieved based on their similarity to example videos provided by the user; patterns for defining meaningful shot similarities are thus extracted from the example videos. For the final query type, a "Query-By-Browsing" (QBB) interface is devised, where abnormal video editing patterns are detected to characterize impressive segments in videos, so that the user can browse these videos to find keywords or example videos. Finally, to improve retrieval performance, the integration of QBK and QBE is explored, where information from the text and image/video modalities is interchanged using a knowledge base that represents relations among semantic contents. The developed video data mining methods and the integration method are summarized as follows. The method for the QBK interface exploits the fact that a certain semantic content is presented by concatenating several shots taken by different cameras. Thus, this method extracts "sequential patterns" that relate adjacent shots relevant to certain keyword queries. Such patterns are extracted by connecting characteristic features in adjacent shots.
However, the extraction of sequential patterns incurs an expensive computational cost, because a huge number of feature sequences have to be examined as candidate patterns. Hence, time constraints are adopted to eliminate semantically irrelevant sequences of features. The method for the QBE interface addresses the large variation among relevant shots: even for the same query, relevant shots contain significantly different features due to varied camera techniques and settings. Thus, "rough set theory" is used to extract multiple patterns that characterize different subsets of example shots. Although this pattern extraction requires counter-example shots to compare against the example shots, these are not provided by the user. Hence, "partially supervised learning" is used to collect counter-example shots from the large set of shots left behind in the database. In particular, to characterize the boundary between relevant and irrelevant shots, the method collects counter-example shots that are as similar to the example shots as possible. The method for the QBB interface assumes that impressive actions of a character are presented by abnormal video editing patterns. For example, thrilling actions of the character are presented by shots with very short durations, while his/her romantic actions are presented by shots with very long durations. Based on this, the method detects "bursts" as patterns consisting of abnormally short or long durations of the character's appearance. The method first performs a probabilistic time-series segmentation to divide a video into segments characterized by distinct patterns of the character's appearance, and then examines whether each segment contains a burst. The integration of QBK and QBE is achieved by constructing a "video ontology" in which concepts such as Person, Car, and Building are organized into a hierarchical structure.
Specifically, this ontology is constructed by considering the generalization/specialization relation among concepts and their co-occurrences in the same shots. Based on the video ontology, concepts related to a keyword query are selected by tracing its hierarchical structure. Shots in which few of the selected concepts are detected are filtered out, and QBE is then performed on the remaining shots. Experimental results validate the effectiveness of all the developed methods. In the future, the multi-modal video retrieval system will be extended with a "Query-By-Gesture" (QBG) interface based on virtual reality techniques. This will enable a user to create example shots for arbitrary queries by synthesizing his/her gestures, 3DCG, and background images.
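A much-simplified stand-in for the burst idea: instead of the thesis's probabilistic time-series segmentation, the sketch below merely flags shots whose log-duration deviates strongly from the mean. The z-score criterion and the threshold k are illustrative assumptions:

```python
import numpy as np

def detect_bursts(durations, k=2.0):
    """Flag shots with abnormally short or long durations.
    A z-score on log-durations is a crude proxy for the probabilistic
    segmentation described in the abstract."""
    d = np.log(np.asarray(durations, dtype=float))
    z = (d - d.mean()) / max(d.std(), 1e-12)
    # report each outlier shot index with the direction of the anomaly
    return [(i, "short" if z[i] < 0 else "long")
            for i in range(len(z)) if abs(z[i]) > k]
```

Working in log-duration keeps one very long shot from dwarfing the statistics, since shot lengths are heavily right-skewed.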
Pinaki Sinha: Automatic Summarization of Personal Photo Collections
We define a photo summary as an extractive subset that is a good representative of the larger photo set. We propose three properties that an effective summary should satisfy: Quality, Diversity, and Coverage. Modern digital photos come with heterogeneous content and context data, and we propose models that combine this multimodal data to compute the summary properties. Our summarization objective is modeled as an optimization of these properties. Further, the summarization framework can integrate user preferences in the form of inputs; thus, different summaries may be generated from the same corpus to accommodate preference variations among users. A traditional way of intrinsic evaluation in information retrieval is to compare the retrieved result set with a manually generated ground truth. However, given the variability of human behavior in selecting appealing photos, it may be difficult and non-intuitive to generate a unique ground-truth summary of a larger data corpus. Due to the personal nature of the dataset, only the contributor of a particular photo corpus can plausibly summarize it, since personal photos typically come with a lot of background personal knowledge. While considerable effort has been directed towards the evaluation of annotation and ranking in multimedia, relatively few experiments have been done to evaluate photo summaries. We conducted extensive user studies on the summarization of photos from single life events. The experiments showed certain uniformity and some diversity of user preferences in generating and evaluating photo summaries. We also posit that photo summaries should serve the twin objectives of information discovery and reuse. Based on this assumption, we propose novel objective metrics which enable us to evaluate summaries from large personal photo corpora without user-generated ground truths. We also create a dataset of personal photos along with a host of contextual data, which can be helpful in future research.
Our experiments show that the proposed summarization properties and framework can indeed be used to generate effective summaries. The framework can be extended to include other types of information (e.g., social ties among multiple users present in a dataset) and to create personalized photo summaries.
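One common way to realize such a quality/diversity trade-off is greedy selection: repeatedly pick the photo whose quality, discounted by its redundancy with the photos already chosen, is highest. The sketch below is a generic illustration under assumed inputs (feature vectors and per-photo quality scores), not the thesis's actual model:

```python
import numpy as np

def summarize(features, quality, k, lam=0.5):
    """Greedily pick k photo indices, trading per-photo quality against
    redundancy (max cosine similarity to already-selected photos)."""
    F = np.asarray(features, dtype=float)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)   # unit-normalize rows
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(F)):
            if i in chosen:
                continue
            # penalty grows when a similar photo is already in the summary
            redundancy = max((float(F[i] @ F[j]) for j in chosen), default=0.0)
            score = quality[i] - lam * redundancy
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

The parameter lam (a made-up knob here) plays the role of a user preference input: raising it favors Diversity over raw Quality, so the same corpus yields different summaries.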
Radu Andrei Negoescu: Modeling and Understanding Communities in Online Social Media
The goal of this thesis was to model and understand emerging online communities that revolve around multimedia content, more specifically photos, by using large-scale data and probabilistic models in a quantitative approach. The dissertation has four contributions. First, using data from two online photo management systems, this thesis examined different aspects of user behavior pertaining to the uploading and sharing of photos with other users and online groups. Second, probabilistic topic models were used to model online entities, such as users and groups of users, and the newly proposed representations were shown to be useful for further understanding such entities, as well as to have practical applications in search and recommendation scenarios. Third, by jointly modeling users from two different social photo systems, it was shown that differences at the level of vocabulary exist and that different sharing behaviors can be observed. Finally, by modeling online user groups as entities in a topic-based model, hyper-communities were discovered automatically based on various topic-based representations. Through both an objective and a subjective evaluation with a number of users, these hyper-communities were shown to be generally homogeneous, and therefore likely to constitute a viable exploration technique for online communities.
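The hyper-community step can be sketched as follows: represent each group by its topic distribution, link pairs of groups whose distributions are sufficiently similar, and read off the connected components. The similarity measure (cosine) and the threshold used here are illustrative assumptions, not the dissertation's exact choices.

```python
import numpy as np

def hyper_communities(topic_dists, threshold=0.9):
    """Link groups whose topic distributions are cosine-similar above a
    threshold; connected components become hyper-communities."""
    T = np.asarray(topic_dists, dtype=float)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    sim = T @ T.T                      # pairwise cosine similarities
    n = len(T)
    parent = list(range(n))            # union-find over group indices
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return sorted(comps.values())
```

Because groups are compared only through their topic mixtures, two groups with no members or tags in common can still land in the same hyper-community if they discuss the same themes.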