TUBITAK 1002 Project

Unsupervised Video Summarization Using Diffusion Models

Istanbul Technical University  ·  Machine Learning & AI Research Group

🎬 Focus Area: Video Summarization
🧠 Core Architecture: DDPM + GAN + VAE
🔬 Learning Paradigm: Unsupervised
📋 Funder: TUBITAK 1002

Motivation

In recent years, the rapid growth of online video-sharing platforms, along with applications such as surveillance and media archiving, has made the efficient analysis of large-scale video data increasingly important. Because watching full-length videos is often time-consuming and resource-intensive, video summarization has become an essential technology.

In this study, we focus on the unsupervised video summarization problem and propose a summarization framework that preserves both the temporal structure and semantic content of video data.

Background

Recently, unsupervised approaches built on Generative Adversarial Networks (GANs) have demonstrated strong potential by casting video summarization as a reconstruction problem. Modern techniques further enhance this process by replacing Long Short-Term Memory (LSTM) architectures with self-attention mechanisms in the frame selection stage, enabling more efficient capture of long-range temporal relationships among video frames.
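A toy illustration of how self-attention can yield per-frame importance scores: each frame attends to every other frame, and a frame's score is the average attention it receives. This is a minimal NumPy sketch with no learned projections (Q and K are taken to be the raw frame features), which is a deliberate simplification of the attention modules used in the actual models.

```python
import numpy as np

def attention_importance(frame_feats, d_k=None):
    """Toy single-head self-attention over frame features.

    frame_feats : (T, D) array of per-frame features.
    Returns a (T,) vector of importance scores: the mean attention
    each frame receives from all frames.
    """
    T, D = frame_feats.shape
    d_k = d_k or D
    # Simplification: Q = K = the raw features (no learned projections).
    scores = frame_feats @ frame_feats.T / np.sqrt(d_k)   # (T, T) similarity
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # row-wise softmax
    return attn.mean(axis=0)                              # attention received per frame
```

Because each softmax row sums to one, the returned scores also sum to one, so they can be read as a distribution of importance over frames — and, as the next paragraph notes, such scores can be noisy and temporally inconsistent.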

However, importance scores generated by self-attention may exhibit high variance and temporal inconsistency, which can negatively affect the quality of frame selection.

Proposed Method

In this project, we propose a novel model that integrates a diffusion-based score refinement mechanism into the adversarial training loop. To this end, a Denoising Diffusion Probabilistic Model (DDPM) is employed to act as a regularizer on the outputs of the self-attention module.

During training, the diffusion process refines noisy attention scores to produce more stable and representative importance estimates. These refined scores guide the Variational Autoencoder (VAE) and GAN components to reconstruct the video using only the most informative segments.
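The denoising half of this refinement can be sketched as a standard DDPM reverse step applied to a 1D sequence of importance scores. In the proposed model the noise estimate comes from a 1D convolutional network conditioned on a timestep embedding; in this NumPy sketch it is simply an input, and the linear β schedule is an illustrative assumption rather than a value taken from the paper.

```python
import numpy as np

T_STEPS = 1000
betas = np.linspace(1e-4, 0.02, T_STEPS)   # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_reverse_step(x_t, t, eps_pred, rng):
    """One DDPM reverse (denoising) step on a 1D score sequence.

    x_t      : noisy importance scores at diffusion step t, shape (num_frames,)
    eps_pred : predicted noise (stand-in for the conditional denoiser network)
    """
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:                               # inject noise except at the final step
        return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean
```

A useful sanity check: at t = 0, if the predicted noise equals the noise actually added in the forward process, this step recovers the clean scores exactly.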

Experimental results demonstrate that the proposed method outperforms conventional GAN-based video summarization approaches in terms of capturing important video segments and preserving summary coherence.

Model Architecture

Forward and backward pass of the proposed diffusion-based video summarization model
Figure 1. Overview of the proposed framework. Left (Forward Pass): Self-attention importance scores are extracted from video frames and corrupted with noise across time steps t=0…T. Right (Backward Pass): A 1D convolutional DDPM denoises the noisy scores conditioned on a timestep embedding, yielding refined importance scores that guide the GAN-VAE reconstruction.
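The forward corruption in Figure 1 has a convenient closed form: a noisy score sequence at any step t can be sampled directly from the clean scores, without simulating the intermediate steps. A minimal NumPy sketch (the linear β schedule and its endpoints are illustrative assumptions, not values from the paper):

```python
import numpy as np

T_STEPS = 1000
betas = np.linspace(1e-4, 0.02, T_STEPS)   # illustrative linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def corrupt_scores(x0, t, rng):
    """Closed-form forward noising q(x_t | x_0) of an importance-score
    sequence: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

At small t the output stays close to the clean scores; by t = T the signal coefficient is nearly zero, so the sequence is almost pure noise, matching the left panel of the figure.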

Publications and Source Codes

The paper associated with this project is currently under review; the source code will be released once the review process is complete.