PhD thesis abstracts
Volume 4, Issue 1, March 2012 (ISSN 1947-4598)
Anita Sobe
Self-Organizing Multimedia Delivery

In this thesis, the non-sequential delivery of media in dynamic networks is investigated. Consider a scenario where people participate in a social event. With the increased popularity of smartphones and tablet computers, people produce more and more multimedia content. They share their content and consume it on popular web platforms. The production and consumption of such media are, however, different from the typical sequential movie pattern: we call this non-sequential media access. If the infrastructure is not available, visitors cannot share their content with other visitors during the event. The idea is to connect the devices directly, which remains robust even if people move during the event (dynamic networks). Non-sequential media access in combination with dynamic networks brings new challenges for the whole multimedia life cycle. A formalism called Video Notation helps to define the individual parts of the life cycle with a simple and short notation. New measures for transport are needed as well. A caching technique is introduced that evaluates how suitable content is for caching based on its popularity in different user groups. However, this cache alone does not cope with the dynamic network requirement, because such a delivery has to be robust, adaptive and scalable. Therefore, we concentrate on self-organizing algorithms that provide these characteristics. The algorithm implemented in this thesis is inspired by the endocrine system of higher mammals. A client can express its demands by creating hormones that are released to the network. The corresponding resources are attracted by this hormone and travel towards higher hormone concentrations. This leads to a placement of content near the users. Furthermore, robustness and service quality are increased by placing replicas of the traveling content along the transport path. Unused replicas are automatically removed from the nodes to ensure storage balancing. Finally, we show with a use case that a middleware based on hormone-based delivery, including well-defined interfaces to the user and to the network, can be used for content delivery. For such a general application, recommendations on possible configurations are given.
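The hormone-inspired placement can be pictured with a small simulation: clients deposit hormone that diffuses and evaporates over the network graph, and content climbs the resulting gradient while leaving replicas behind. The sketch below only illustrates this intuition; the topology, the diffusion and evaporation constants, and the replica rule are illustrative assumptions, not the parameters or interfaces of the thesis's middleware.

```python
# Minimal sketch of a hormone-inspired content placement loop.
# Topology, decay/diffusion rates, and the replica rule are
# illustrative assumptions, not the settings used in the thesis.

NEIGHBORS = {            # a small static topology for illustration
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"],
}
EVAPORATION = 0.2        # fraction of hormone lost per step
DIFFUSION = 0.3          # fraction of the remainder spread to neighbors

hormone = {n: 0.0 for n in NEIGHBORS}      # hormone level per node
content_at = "A"                           # where the requested item lives
replicas = set()                           # replicas dropped along the path

def emit(node, amount=1.0):
    """A client expresses demand by releasing hormone at its node."""
    hormone[node] += amount

def diffuse():
    """Hormone spreads to neighbors and partly evaporates."""
    new = {n: 0.0 for n in NEIGHBORS}
    for n, level in hormone.items():
        kept = level * (1.0 - EVAPORATION)
        share = kept * DIFFUSION / max(len(NEIGHBORS[n]), 1)
        new[n] += kept * (1.0 - DIFFUSION)
        for m in NEIGHBORS[n]:
            new[m] += share
    hormone.update(new)

def move_content():
    """Content climbs the hormone gradient, leaving replicas behind."""
    global content_at
    best = max(NEIGHBORS[content_at], key=lambda m: hormone[m])
    if hormone[best] > hormone[content_at]:
        replicas.add(content_at)           # a replica stays on the old node
        content_at = best

for step in range(10):
    emit("D")                              # the client at node D keeps asking
    diffuse()
    move_content()

print("content ends up at", content_at, "with replicas on", sorted(replicas))
```

Running the loop moves the content hop by hop from node A towards the requesting client at node D, which is the qualitative behaviour the abstract describes; a real deployment would additionally evaporate hormones of unused replicas so that they are evicted over time.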
Dominik Kaspar
Multipath Aggregation of Heterogeneous Access Networks

The explosive deployment of wired and wireless communication infrastructure has recently enabled many novel applications and sparked new research problems. One of the unsolved issues in today's Internet, and the main topic of this thesis, is the goal of increasing the data transfer speeds of end hosts by aggregating and simultaneously using multiple network interfaces. This objective is most interesting when the Internet is accessible through several relatively slow and variable (typically wireless) networks, which are unable to single-handedly provide the required data rate for resource-intensive applications such as bulk file transfers and high-definition multimedia streaming. Communication devices equipped with multiple network interfaces are now commonplace. Smartphones and laptops are often shipped with built-in network adapters of different wireless technologies, typically enabling them to connect to wireless local area networks and cellular data networks (such as WLAN and HSPA). At the same time, wireless network coverage has become so widespread that mobile devices are often located in overlapping coverage areas of independent access networks. However, even if multiple interfaces are successfully connected to the Internet, operating systems typically use only a single default interface for data transmission, leaving secondary interfaces idle. This technical restriction stems from the fact that the majority of current Internet traffic is conveyed by transport protocols (TCP and UDP) that do not support multiple IP addresses per endpoint. Another crucial factor is that path heterogeneity introduces packet reordering, which can negatively affect the performance of transport protocols and strain the buffer requirements of applications. This thesis explores the problem of multipath aggregation, attempting to find solutions that achieve increased data throughput by concurrently utilizing heterogeneous and dynamic access networks. We strive for approaches that support existing applications without requiring significant modifications to the current infrastructure. The exploration starts at the IP layer with a study of the IP packet reordering that is caused by the use of heterogeneous paths. Our practical experiments confirm that multipath reordering exceeds the typical packet reordering in the Internet to an extent that renders the usual reordering metric useless. The outcome of our network-layer study is a proxy-based, adaptive multipath scheduler that is able to mitigate packet reordering while transparently forwarding a single transport connection over multiple paths. After introducing a novel metric for quantifying the benefit of path aggregation, our analysis continues on the transport layer, where we investigate TCP's resilience to multipath-inflicted packet reordering. While IP packet reordering is known for its destructive effect on TCP's performance, our practical experiments indicate that a modern implementation is significantly more robust to packet reordering than standard TCP and achieves a substantial aggregation benefit even in certain cases of extreme multipath heterogeneity. In addition, we run a large set of emulation-based multipath experiments and identify several TCP parameters that lead to improved multipath performance when correctly tuned. Finally, we present an application-layer solution that builds upon the idea of logical file segmentation for streaming a single video to multihomed clients.
The novelty of this approach lies in diverting standard protocol features (i.e., HTTP pipelining and range retrieval requests) from their intended purpose and using them for scheduling video segments over different paths. Interoperable with existing server infrastructure, our proposed solution can be deployed in a lightweight and purely client-based manner. We validate the proposed algorithms by implementing them in an existing video streaming platform.
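To make the application-layer idea concrete, the sketch below fetches logical byte-range segments of one file over two local interfaces and reassembles them on the client. The host, path, interface addresses, and segment size are placeholders, and the plain round-robin scheduler stands in for the adaptive, pipelined scheduling developed in the thesis.

```python
# Sketch of client-side multipath download via HTTP range requests.
# HOST, PATH, INTERFACES, and SEGMENT are placeholder assumptions;
# a simple round-robin replaces the thesis's adaptive scheduler.
import http.client
from concurrent.futures import ThreadPoolExecutor

HOST, PATH = "example.com", "/video.mp4"
INTERFACES = ["192.0.2.10", "198.51.100.20"]   # e.g. WLAN and HSPA addresses
SEGMENT = 512 * 1024                           # 512 KiB logical segments

def fetch_range(local_ip, start, end):
    """Fetch bytes [start, end] of the file via one local interface."""
    conn = http.client.HTTPConnection(HOST, 80, source_address=(local_ip, 0))
    conn.request("GET", PATH, headers={"Range": f"bytes={start}-{end}"})
    resp = conn.getresponse()                  # expect 206 Partial Content
    data = resp.read()
    conn.close()
    return start, data

def download(total_size):
    """Round-robin the segments over the interfaces, then reassemble."""
    jobs = []
    with ThreadPoolExecutor(max_workers=len(INTERFACES)) as pool:
        for i, start in enumerate(range(0, total_size, SEGMENT)):
            end = min(start + SEGMENT - 1, total_size - 1)
            ip = INTERFACES[i % len(INTERFACES)]
            jobs.append(pool.submit(fetch_range, ip, start, end))
        parts = sorted(job.result() for job in jobs)
    return b"".join(data for _, data in parts)
```

Because only standard range retrieval requests are used, such a client works against unmodified HTTP servers, which is the deployment property the abstract emphasizes.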
Sree Hari Krishnan Parthasarathi
Privacy-Sensitive Audio Features for Conversational Speech Processing

The work described in this thesis takes place in the context of capturing real-life audio for the analysis of spontaneous social interactions. Towards this goal, we wish to capture conversational and ambient sounds using portable audio recorders. Analysis of conversations can then proceed by modeling the speaker turns and durations produced by speaker diarization. However, a key factor against the ubiquitous capture of real-life audio is privacy. In particular, recording and storing raw audio would breach the privacy of people whose consent has not been explicitly obtained. In this thesis, we instead study audio features for recording and storage that can respect privacy by minimizing the amount of linguistic information, while achieving state-of-the-art performance in conversational speech processing tasks. Indeed, the main contributions of this thesis are the achievement of state-of-the-art performance in speech/nonspeech detection and speaker diarization tasks using such features, which we refer to as privacy-sensitive. Beyond this, we provide a comprehensive analysis of these features for the two tasks in a variety of conditions, such as (predominantly) indoor and outdoor audio. To objectively evaluate the notion of privacy, we propose the use of human and automatic speech recognition tests, with higher accuracy in either being interpreted as yielding lower privacy. For the speech/nonspeech detection (SND) task, this thesis investigates three different approaches to privacy-sensitive features. These approaches are based on simple, instantaneous feature extraction methods, excitation-source-based methods, and feature obfuscation methods. These approaches are benchmarked against Perceptual Linear Prediction (PLP) features under many conditions on a large meeting dataset of nearly 450 hours. Additionally, automatic speech (phoneme) recognition studies on TIMIT showed that the proposed features yield low phoneme recognition accuracies, implying higher privacy. For the speaker diarization task, we interpret the extraction of privacy-sensitive features as an objective that maximizes the mutual information (MI) with speakers while minimizing the MI with phonemes. The source-filter model arises naturally out of this formulation. We then investigate two different approaches for extracting excitation-source-based features, namely the Linear Prediction (LP) residual and deep neural networks. Diarization experiments on the single and multiple distant microphone scenarios from the NIST Rich Transcription evaluation datasets show that these features yield a performance close to that of Mel Frequency Cepstral Coefficient (MFCC) features. Furthermore, listening tests support the proposed approaches in terms of yielding low intelligibility in comparison with MFCC features. The last part of the thesis studies the application of our methods to SND and diarization in outdoor settings. While our diarization study was more preliminary in nature, our study on SND leads to the conclusion that privacy-sensitive features trained on outdoor audio yield performance comparable to that of PLP features trained on outdoor audio. Lastly, we explored the suitability of using SND models trained on indoor conditions for outdoor audio. Such an acoustic mismatch caused a large drop in performance, which could not be compensated for even by combining indoor models.
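One of the excitation-source signals mentioned above is the LP residual, i.e., what remains of a speech frame after its short-term spectral envelope has been predicted away. The sketch below computes it with the standard autocorrelation method; the frame length, LP order, and windowing are common defaults assumed here, not the exact settings of the thesis.

```python
# Sketch of extracting an LP residual, one kind of excitation-source
# signal used as a privacy-sensitive representation. Frame length,
# LP order, and the autocorrelation method are assumed defaults.
import numpy as np

def lp_residual(frame, order=12):
    """Return the linear-prediction residual of one speech frame."""
    frame = frame * np.hamming(len(frame))
    # autocorrelation method: solve the normal equations R a = r
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])        # predictor coefficients
    # residual = actual sample minus its prediction from past samples
    predicted = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
    return frame - predicted

# Example: one frame of filtered noise standing in for speech
rng = np.random.default_rng(0)
frame = np.convolve(rng.standard_normal(400), [1.0, -0.9], mode="same")
res = lp_residual(frame)
print("frame energy:", float(np.sum(frame**2)),
      "residual energy:", float(np.sum(res**2)))
```

The residual keeps much of the speaker-dependent excitation while discarding most of the spectral-envelope (and hence phonetic) detail, which is the intuition behind using it as a privacy-sensitive feature.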
Mohammad Kazemi Varnamkhasti
Multiple Description Video Coding Based on Base and Enhancement Layers of SVC and Channel Adaptive Optimization

Multiple description coding (MDC) is a promising solution for video transmission over lossy channels. In MDC, multiple descriptions of a source are generated which are independently decodable and mutually refinable. When all descriptions are available, the corresponding quality is called central quality; otherwise it is called side quality. Generally, there exists a trade-off between side and central quality in all MDC schemes. MDC methods which provide a better central-side quality trade-off are of more interest to designers. In this thesis, a new MDC scheme is introduced which has a better trade-off between side and central quality than existing schemes. In other words, for the same central quality it provides higher side quality, or equivalently, for the same side quality it has higher central quality. This method is based on mixing the base and enhancement layers of Coarse-Grain Scalable (CGS) coding and is hence called Mixed Layer MDC (MLMDC). At the central decoder the layers are separated and we obtain two-layer quality, as in the CGS decoder. At the side decoder, some descriptions are not available and hence we cannot separate the layers directly. We propose to use estimation for this purpose. MLMDC for two-description and four-description coding is implemented in JM16.0, the H.264/AVC reference software. The experimental results show that for videos with sufficiently dynamic content (texture and motion activity), MLMDC provides higher side quality than conventional MDCs for the same central quality. In addition, for video transmission over channels with packet loss (such as the Internet), MLMDC provides higher average video quality, in particular for four-description coding. In error-prone environments we need higher side quality, while in less noisy conditions higher central quality is more important. Therefore, in order to obtain the best quality under different channel conditions, optimization is needed. For this purpose, a new model for end-to-end distortion is introduced which takes into account both quantization and transmission distortion for predicting the quality at the receiver side. The transmission distortion is the result of error propagation, which in turn originates from the mismatch between side and central decoder outputs. The derived model is applicable to all DCT-domain MDCs. The model is first verified experimentally and then used for MDC optimization. The results demonstrate the performance of the optimizer and show that MLMDC achieves higher video quality than conventional MDCs when both are designed optimally.
While the thesis is in Farsi, its first part is available in English in the following paper:
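The channel-adaptive optimization described in this abstract weighs central against side quality according to the channel loss rate. The snippet below is only a textbook-style expected-distortion calculation for a two-description scheme with independent description losses; it is not the end-to-end distortion model derived in the thesis, and the distortion numbers are invented for illustration.

```python
# Generic expected end-to-end distortion for a two-description scheme,
# used here only to illustrate channel-adaptive operating-point selection.

def expected_distortion(p_loss, d_central, d_side, d_lost):
    """Expected distortion when each description is lost independently
    with probability p_loss."""
    both = (1.0 - p_loss) ** 2          # both descriptions arrive
    one = 2.0 * p_loss * (1.0 - p_loss) # exactly one arrives
    none = p_loss ** 2                  # everything lost
    return both * d_central + one * d_side + none * d_lost

# Pick, among candidate encoder configurations, the one minimizing the
# expected distortion for the current loss rate (numbers are illustrative).
configs = [  # (central distortion, side distortion)
    (10.0, 80.0),   # low redundancy: good central, poor side quality
    (15.0, 40.0),   # medium redundancy
    (25.0, 28.0),   # high redundancy: side quality close to central
]
p = 0.1
best = min(configs, key=lambda c: expected_distortion(p, c[0], c[1], 200.0))
print("best (central, side) distortion pair at loss rate", p, "is", best)
```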
Mona Omidyeganeh
Parametric Analysis and Modeling of Video Signals

Video modeling and analysis have been of great interest in the video research community, due to their essential contribution to systematic improvements in a wide range of video processing techniques. Parametric modeling and analysis of video provide appropriate means for processing the signal and mining the information necessary for its efficient representation. Video comparison, human action recognition, video retrieval, video abstraction, video transmission, and video clustering are some applications that can benefit from video modeling and analysis. In this thesis, the parametric analysis and modeling of the video signal is studied through two schemes. In the first scheme, spatial parameters are first extracted from video frames and the temporal evolution of these spatial parameters is investigated. The spatial parameters are selected based on the statistics of the 2D wavelet transform of the video frames, where the wavelet transform provides a sparse representation of the signal and structurally conforms to the frequency sensitivity distribution of the human visual system. To analyze the temporal relations and evolution of these spatial parameters, three methods are considered: inter-frame distance measurement, temporal decomposition, and autoregressive (AR) modeling. In the first method, employing the Kullback–Leibler (KL) distance between spatial parameters as the similarity measure, the temporal evolution of the spatial features is studied. This analysis is used to determine shot boundaries, segment shots into clusters, and select keyframes based on both similarity and dissimilarity criteria, within and outside the corresponding cluster, respectively. In the second method, the video signal is assumed to be a sequence of overlapping independent visual components called events, which are typically temporally overlapping compact functions that describe the temporal evolution of a given set of spatial parameters of the video signal. This event-based temporal decomposition technique is used for video abstraction, where no shot boundary detection or clustering is required. In the third method, the video signal is assumed to be a combination of spatial feature time series that are temporally approximated by the AR model. The AR model describes each spatial feature vector as a linear combination of the previous vectors within a reasonable time interval. Shot boundaries are reliably detected based on the AR prediction errors, and then at least one keyframe is extracted from each shot. To evaluate these models, subjective and objective tests on the TRECVID and Hollywood2 datasets are conducted, and simulation results indicate high accuracy and effectiveness of these techniques. In the second scheme, spatio-temporal parameters are extracted from the 3D wavelet transform of the natural video signal based on a statistical analysis of this transform. Joint and marginal statistics are studied and the extracted parameters are utilized for human action recognition and video activity level detection. Subjective and objective test results on the popular Hollywood2 and KTH datasets confirm the high efficiency of this analysis method compared to current techniques. While the thesis is written in Farsi, the following English papers encompass some of the main technical aspects of the thesis:
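As an illustration of the third temporal-analysis method in the abstract above, the sketch below fits an AR model to a per-frame spatial-feature time series and flags frames whose prediction error spikes as shot boundaries. The AR order, the least-squares fit, and the global threshold rule are illustrative assumptions rather than the thesis's tuned settings.

```python
# Sketch: AR modeling of a spatial-feature time series with shot-boundary
# detection from the prediction error. Order and threshold are assumed.
import numpy as np

def shot_boundaries(features, order=3, thresh=3.0):
    """features: (T, d) array of per-frame spatial feature vectors."""
    T, d = features.shape
    # stack lagged feature vectors as predictors for frame t
    X = np.hstack([features[order - k - 1:T - k - 1] for k in range(order)])
    Y = features[order:]                       # frames to be predicted
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least-squares AR coefficients
    err = np.linalg.norm(Y - X @ A, axis=1)    # per-frame prediction error
    cut = err.mean() + thresh * err.std()      # simple global threshold
    return [t + order for t in np.where(err > cut)[0]]

# Example: two "shots" of smooth features with an abrupt change at frame 60
rng = np.random.default_rng(1)
shot1 = np.cumsum(rng.standard_normal((60, 8)) * 0.1, axis=0)
shot2 = 5.0 + np.cumsum(rng.standard_normal((40, 8)) * 0.1, axis=0)
frames = np.vstack([shot1, shot2])
print("detected boundaries near frame:", shot_boundaries(frames))
```

Within each shot the features evolve smoothly and are well predicted, while the abrupt change at the cut produces an error far above the threshold, which is the effect the AR-based boundary detection exploits.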
Xirong Li
Content-Based Visual Search Learned from Social Media

In a world with increasing amounts of digital pictures, content-based visual search is an important and scientifically challenging problem in ICT research. This thesis tackles the problem by learning from social media. The fundamental question addressed in this thesis is: what is the value of socially tagged images for visual search? To that end, we propose the neighbor voting algorithm (Chapter 2) and its multi-feature variant (Chapter 3) to verify whether what people spontaneously say about an image is factually reflected in the pictorial content. The two algorithms are used to find high-quality positive examples for learning automated image taggers. To obtain negative training examples without manual verification, we go beyond the classical random sampling approach by introducing informative negative bootstrapping (Chapter 4). For answering complex visual search queries, we introduce the notion of bi-concepts as a method for retrieving unlabeled images in which two concepts co-occur (Chapter 5). Finally, as users have their own associations with image semantics, we propose personalized image tagging by jointly exploiting personal tagging history and content-based analysis, optimized through Monte Carlo sampling (Chapter 6). On the basis of the reported theories, algorithms, and experiments, this thesis reveals the value of socially tagged images for content-based visual search, providing a basis for uncovering universal knowledge about images and semantics. With the methodologies established, this thesis opens up promising avenues for image search engines that provide access to the semantics of visual content without the need for manual labeling.
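The neighbor-voting idea can be summarized in a few lines: a tag is relevant to an image if it occurs among the image's visual neighbors more often than it occurs in the collection at large. The sketch below captures only this core vote-minus-prior computation; the feature representation, the distance, and the neighborhood size k are placeholder assumptions, not the exact configuration evaluated in the thesis.

```python
# Compact sketch of neighbor voting for tag relevance: count a tag's
# occurrences among an image's k visual neighbors and subtract the count
# expected from the tag's overall frequency. Features, distance, and k
# are placeholder assumptions.
import numpy as np

def tag_relevance(query_feat, features, tag_sets, k=50):
    """features: (N, d) visual features; tag_sets: list of N tag sets."""
    dists = np.linalg.norm(features - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]          # k visually closest images
    counts = {}
    for i in neighbors:
        for tag in tag_sets[i]:
            counts[tag] = counts.get(tag, 0) + 1
    n = len(tag_sets)
    prior = {t: sum(t in s for s in tag_sets) / n for t in counts}
    # votes from neighbors minus the votes expected from random images
    return {t: counts[t] - k * prior[t] for t in counts}
```

Tags with a strongly positive score are the ones whose social annotation is supported by the visual content, which is how the algorithm selects high-quality positive training examples.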