Feb 25 (Fri) @ 3:00pm: "Data-driven Methods for Evaluating the Quality and Authenticity of Visual Media," Ekta Prashnani, ECE PhD Defense
Data-driven approaches, especially those that leverage deep learning (DL), have led to significant progress on many important problems in computer vision and image/video processing over the last decade, fueled by the availability of large-scale training datasets. Typically, for supervised DL tasks that assess the unambiguous aspects of visual media (such as classifying an object in an image or recognizing an activity in a video), large-scale datasets can be reliably captured with human-provided labels specifying the expected right answer. In contrast, an important class of perceptual tasks deserves special attention: assessing the different aspects of the quality of visual media. DL for these tasks can enable widespread downstream applications. However, the subjective nature of these tasks makes it difficult to capture unambiguous and consistent large-scale human-annotated training data. This poses an interesting challenge in designing DL frameworks for such perceptual tasks with noisy or limited training data, which is the focus of this research.
We first explore DL for perceptually-consistent image error assessment, where we want to predict the perceived error between a reference and a distorted image. We begin by addressing the limitations of existing training datasets: we label our proposed large-scale dataset using a novel, noise-robust scheme based on pairwise visual preference, which reliably captures the human perception of visual error. We then design a learning framework to leverage this dataset and obtain state-of-the-art results in perceptual image-error prediction.
Perceptual metrics have been vital to the advancement of deep generative models for images and videos, which, although promising, also pose a looming societal threat (e.g., in the form of malicious deepfakes). Therefore, we also explore a complementary question: given a high-quality video without any human-perceivable artifacts, can we predict whether it is authentic? Within this context, we specifically focus on robust deepfake detection using domain-invariant, generalizable input features.
Lastly, we find that for certain perceptual tasks, such as modeling the visual saliency of a stimulus, the only way to overcome the ambiguity and noise in the training data is to query more humans, e.g., using a gaze tracker. This tends to be onerous, especially for video-based stimuli. Hence, most existing datasets are limited in their accuracy. Since noise-robust dataset capture is often impossible in this case, we design a noise-aware training paradigm for video and image saliency prediction that prevents overfitting to the noise in the training data and shows consistent improvement over traditional training schemes. Further, since the existing video-saliency datasets do not capture video-specific aspects such as temporally-evolving content, we design a novel videogame-based saliency dataset with temporally-evolving semantics and multiple attractors of human attention. Overall, through this dissertation, we make critical strides towards robust DL for visual perceptual tasks related to quality assessment.
Ekta Prashnani is a PhD candidate in the ECE department and is advised by Professor B. S. Manjunath. Her main research interests include robust deep learning from limited or noisy labels, visual perceptual metrics for images and videos, and visual media forensics. Specifically, she has worked on improving the robustness and accuracy of deep learning methods for evaluating perceptual quality aspects of visual media.
Hosted by: Professor B. S. Manjunath
Submitted by: Ekta Prashnani <firstname.lastname@example.org>