Dec 2 (Tue) @ 9:00am: "Advancing Dynamic Scene Understanding: Improving generalization, temporal consistency, and interactive grounding," Raphael Ruschel dos Santos, ECE PhD Defense
Location: Engineering Science Bldg (ESB), Room 2001
Research Area: Communications & Signal Processing
Research Keywords: Activity Detection, Machine Learning, Tracking
Abstract
Visual scene understanding is a cornerstone of computer vision, supporting applications ranging from autonomous navigation to human-computer interaction. While traditional Scene Graph Generation (SGG) methods have made progress in parsing static images, they face significant limitations when applied to dynamic, real-world video data. Existing approaches often struggle to generalize to unseen object–relationship compositions, fail to maintain temporal consistency across frames, and rely on coarse bounding box representations that limit user interactivity. This dissertation systematically addresses these challenges through three contributions that advance the state of the art in Dynamic Scene Graph Generation (DSG).
First, to overcome dataset bias and poor generalization, we introduce the Decoupled Dynamic Scene Graph (DDS) network. By separating the feature-learning processes for objects and relationships, DDS forces the model to learn disentangled representations. This design substantially improves compositional zero-shot learning, achieving a 24-fold improvement in detecting unseen interaction triplets on the Action Genome dataset compared to baseline methods.
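As a rough illustration of the decoupling idea only (a minimal sketch, not the DDS implementation; the module name, feature dimension, and class counts are placeholders), the object and relationship streams can be written as two heads that share no parameters, so predicate prediction cannot shortcut through object identity alone:

```python
# Minimal sketch of decoupled object/relationship learning (illustrative only).
import torch
import torch.nn as nn

class DecoupledSceneGraphHead(nn.Module):
    """Hypothetical head with independent object and relationship streams."""
    def __init__(self, feat_dim=256, num_objects=36, num_predicates=26):
        super().__init__()
        # Object stream: classifies subject/object categories from region features.
        self.object_stream = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_objects),
        )
        # Relationship stream: classifies the predicate from union-region features,
        # without conditioning on the object stream's predictions.
        self.relation_stream = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_predicates),
        )

    def forward(self, subj_feat, obj_feat, union_feat):
        subj_logits = self.object_stream(subj_feat)
        obj_logits = self.object_stream(obj_feat)
        pred_logits = self.relation_stream(union_feat)
        return subj_logits, obj_logits, pred_logits

# Toy usage with random features standing in for detector outputs.
head = DecoupledSceneGraphHead()
subj, obj, union = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
s, o, p = head(subj, obj, union)
print(s.shape, o.shape, p.shape)  # (4, 36), (4, 36), (4, 26)
```

Keeping the predicate head blind to the object classifier's output is the kind of separation that encourages generalization to unseen (subject, predicate, object) compositions.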
Second, to address tracklet fragmentation in video analysis, we present Temporally Consistent Dynamic Scene Graphs (TCDSG). This end-to-end framework integrates a Sequence-Level Bipartite Matching objective and Temporally Conditioned Queries to enforce long-term identity persistence during training. TCDSG removes the need for fragile post-processing heuristics and doubles temporal recall (tR@50) on standard benchmarks. To support efficient training on variable-length video sequences, we also propose BLoad, a specialized distributed data-loading strategy.
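A minimal sketch of the sequence-level matching idea, assuming per-frame assignment costs are already computed (the function name and tensor shapes are illustrative, not the TCDSG code): costs are aggregated over the clip before a single Hungarian assignment is solved, so each query is bound to one ground-truth track for the entire sequence rather than re-matched frame by frame.

```python
# Illustrative sequence-level bipartite matching (not the authors' implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def sequence_level_match(frame_costs):
    """frame_costs: array of shape (T, num_queries, num_gt) with per-frame costs."""
    clip_cost = frame_costs.sum(axis=0)               # aggregate cost over time
    query_idx, gt_idx = linear_sum_assignment(clip_cost)  # one assignment per clip
    return list(zip(query_idx, gt_idx))

# Toy example: 3 frames, 4 queries, 2 ground-truth tracks.
rng = np.random.default_rng(0)
costs = rng.random((3, 4, 2))
print(sequence_level_match(costs))  # each matched query keeps its track for all frames
```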
Finally, to bridge the gap between automated perception and human intent, we introduce Click-to-Graph, the first interactive framework that unifies promptable foundation models with structured scene understanding. Built on the Segment Anything Model 2, this system allows users to specify subjects through simple visual prompts and automatically generates temporally consistent, pixel-level scene graphs. By combining a Dynamic Interaction Discovery Module with semantic classification heads, Click-to-Graph moves beyond coarse bounding boxes to deliver high-fidelity, user-guided panoptic understanding of complex interactions.
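The interactive loop might look roughly like the sketch below; the Click type, the promptable segmenter, and the interaction-discovery head are hypothetical stand-ins for illustration, not the SAM 2 API or the Click-to-Graph implementation.

```python
# Hypothetical click-to-graph interaction loop (illustrative skeleton only).
from dataclasses import dataclass

@dataclass
class Click:
    """A user point prompt on a specific video frame."""
    frame_idx: int
    x: int
    y: int

def click_to_graph(video, clicks, segmenter, interaction_head):
    """For each user click, track a pixel-level mask through the video,
    then classify interactions among the tracked entities per frame."""
    tracks = {}
    for obj_id, click in enumerate(clicks):
        # Hypothetical promptable segmenter: point prompt -> per-frame masks.
        tracks[obj_id] = segmenter.track_from_point(video, click)

    graph = []
    for frame_idx in range(len(video)):
        masks = {oid: track[frame_idx] for oid, track in tracks.items()}
        # Hypothetical head: proposes (subject, predicate, object) triplets.
        graph.extend(interaction_head.predict(frame_idx, masks))
    return graph
```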
Together, these contributions establish a robust framework for generating more generalizable, temporally coherent, and interactive dynamic scene graphs, paving the way for the next generation of intelligent video analytics systems.
Bio
Raphael Ruschel is a Ph.D. candidate in Electrical and Computer Engineering at the University of California, Santa Barbara, advised by Professor B. S. Manjunath. He joined the Vision Research Lab in 2019, where his research focuses on activity detection, with an emphasis on improving generalization, consistency, and interpretability. His experience also includes a 2023 internship at Synaptics, where he developed deep-learning-based algorithms for moisture detection on touchscreens.
As a teaching assistant, Raphael received the Outstanding Teaching Assistant Award three times. After completing his Ph.D., he will join Zoox to develop perception algorithms for autonomous vehicles.
He holds a B.S. in Computer Engineering from the State University of Rio Grande do Sul (UERGS) and an M.S. in Electrical Engineering from the Federal University of Rio Grande do Sul (UFRGS), both in Brazil.
Hosted By: ECE Professor B.S. Manjunath
Submitted By: Raphael Ruschel dos Santos <raphael251@ucsb.edu>