TALC: Time-Aligned Captions for
Multi-Scene Text-to-Video Generation

1 University of California Los Angeles,   2 Google Research  

Abstract

Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., "a red panda climbing a tree"). However, it is pertinent to generate multi-scene videos, since they are ubiquitous in the real world (e.g., "a red panda climbing a tree" followed by "the red panda sleeps on the top of the tree"). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and the scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., "a red panda climbing a tree") and the second scene description (e.g., "the red panda sleeps on the top of the tree"), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entities and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence under human evaluation. The project website is https://talc-mst2v.github.io/.


Figure 1: (a) Generating a video from the merged descriptions leads to poor text-video alignment.
(b) Generating videos for the individual text descriptions and concatenating them temporally leads to a lack of background consistency.
(c) Our approach (TALC) enhances the scene-level text-video alignment and maintains background consistency.

TALC: Time-Aligned Captions for Multi-Scene T2V Generation

Figure 2: The architecture of Time-Aligned Captions (TALC). During video generation, the first half of the video frames is conditioned on the embeddings of the scene 1 description, while the subsequent frames are conditioned on the embeddings of the scene 2 description.

Existing T2V generative models such as ModelScope and Lumiere are trained on large-scale datasets in which each instance consists of a video and a human-written video description. These videos either lack the depiction of multiple events, or their descriptions focus only on the main event in the video. As a result, pretrained T2V generative models synthesize only single-scene videos depicting individual events.

TALC is a novel and effective framework for generating multi-scene videos from diffusion-based T2V generative models given scene descriptions. TALC leverages the widely adopted text-conditioning mechanism and modifies it to be aware of the alignment between the text descriptions and the individual video scenes.

We illustrate the framework in Figure 2. TALC equips the generative model with the ability to depict all the events in the multi-scene descriptions, while visual consistency is ensured by the temporal modules (attention and convolution blocks) in the denoiser network. TALC can be applied to pretrained T2V models such as ModelScope and Lumiere during inference.
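To make the conditioning concrete, the sketch below builds per-frame text embeddings from a list of scene-caption embeddings so that the first chunk of frames attends to the scene 1 embedding and later chunks attend to later scenes. This is a minimal PyTorch illustration; the function name, tensor shapes, and the even frame split are assumptions for exposition rather than the exact ModelScope or Lumiere integration.

```python
import torch

def time_aligned_text_embeddings(scene_embeds: list[torch.Tensor],
                                 num_frames: int) -> torch.Tensor:
    """Assign each video frame the text embedding of the scene it belongs to.

    scene_embeds: one (seq_len, dim) text-encoder output per scene description.
    Returns a (num_frames, seq_len, dim) tensor: frames are split into
    contiguous chunks, one per scene, and every frame in a chunk is
    conditioned on that scene's caption embedding.
    """
    num_scenes = len(scene_embeds)
    # Frame indices where each scene's chunk starts/ends (roughly equal splits).
    boundaries = torch.linspace(0, num_frames, num_scenes + 1).long()
    per_frame = []
    for i, embed in enumerate(scene_embeds):
        chunk_len = int(boundaries[i + 1] - boundaries[i])
        per_frame.append(embed.unsqueeze(0).expand(chunk_len, -1, -1))
    return torch.cat(per_frame, dim=0)

# Example: two scene captions and 16 frames -> frames 0-7 attend to the
# scene 1 embedding and frames 8-15 to the scene 2 embedding. The denoiser's
# cross-attention then uses per-frame text states instead of one merged caption.
scene_1 = torch.randn(77, 1024)  # e.g., encoding of "a red panda climbing a tree"
scene_2 = torch.randn(77, 1024)  # e.g., encoding of "the red panda sleeps on the top of the tree"
frame_text = time_aligned_text_embeddings([scene_1, scene_2], num_frames=16)
print(frame_text.shape)  # torch.Size([16, 77, 1024])
```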



Multi-Scene Video-Text Data Generation

Figure 3: Overview of the multi-scene video-text data generation pipeline. PySceneDetect identifies scene boundaries in a video, a representative frame is selected from each scene, and these frames, together with the overall video caption, are fed to Gemini-Pro-Vision to generate scene-specific captions.

Our TALC approach improves multi-scene video generation, but struggles with accurate text adherence due to limited training on relevant data. Multi-scene video-text datasets are scarce and challenging to compile because creating detailed captions requires considerable time and resources. Previous projects like ActivityNet have generated captions for specific video scenes, but these scenes often overlap or are spaced far apart, which is problematic for producing smooth transitions in multi-scene videos. Consequently, there is a lack of continuous, high-quality captions needed for effective training of T2V models.

We develop a comprehensive multi-scene video-text dataset to enhance the training of existing T2V models. Using the multimodal model Gemini-Pro-Vision, we create synthetic, high-quality video-text data. We start with a basic video-text dataset and use the PySceneDetect library to identify the different scenes within a video. We then select a representative frame from each scene to capture its essence. These frames, along with the overall video caption, are fed into a large multimodal model to generate coherent, scene-specific captions. Using this method, illustrated in Figure 3, we generate 20K video scene-captions; multi-scene videos make up 73% of the generated dataset.
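A simplified sketch of this pipeline is shown below. It uses PySceneDetect's content detector and OpenCV to pull one representative (middle) frame per detected scene; the final captioning call is a hypothetical placeholder standing in for the Gemini-Pro-Vision request, whose exact prompt and API wiring are assumptions here.

```python
import cv2
from scenedetect import detect, ContentDetector

def scene_representative_frames(video_path: str):
    """Detect scene boundaries and grab the middle frame of each scene."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, end in scenes:
        mid_frame = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid_frame)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def caption_scenes(frames, video_caption, multimodal_model):
    """Ask a multimodal model for one caption per scene (placeholder call).

    `multimodal_model.generate` is a stand-in for the actual Gemini-Pro-Vision
    API; the prompt wording is illustrative, not the released pipeline.
    """
    prompt = (f"The overall video is described as: '{video_caption}'. "
              "Write one coherent caption for the scene shown in each image, in order.")
    return multimodal_model.generate(prompt=prompt, images=frames)
```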

Experimental Results

Figure 4: Automatic evaluation results for (a) ModelScope and (b) Lumiere. In (a), we observe that the TALC-finetuned ModelScope model achieves the highest overall score, which is the average of the visual consistency and text adherence scores. In (b), we find that the TALC framework with the Lumiere base model outperforms merging captions and merging videos on the overall score. We report the average performance across the diverse multi-scene prompts and numbers of generated scenes.

We show the efficacy of TALC in multi-scene video generation using two models: ModelScope and Lumiere. We evaluate two settings: one where TALC is applied to the base model during inference, and one where we finetune ModelScope on the curated multi-scene video-text data with the TALC framework.

TALC outperforms the baselines without any finetuning. Figures 4(a) and 4(b) show that the base ModelScope and Lumiere models with TALC achieve the highest overall scores in multi-scene video generation, outperforming other methods such as merging captions and merging videos. TALC and merging captions excel in visual consistency, whereas merging videos struggles to maintain consistent visual elements across scenes. TALC also leads in text adherence, demonstrating its effectiveness in aligning closely with scene-specific descriptions, unlike merging videos which, while high in text adherence, fails to integrate descriptions across multiple scenes effectively. This indicates that TALC is particularly effective at producing coherent and textually consistent multi-scene videos.
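For reference, the overall score used in these comparisons is simply the mean of the visual consistency and text adherence scores; a minimal sketch, assuming both are reported on the same scale, is:

```python
def overall_score(visual_consistency: float, text_adherence: float) -> float:
    """Overall score = average of visual consistency and text adherence.

    The two sub-scores are assumed to share the same scale (e.g., 0-100);
    only the averaging is prescribed here.
    """
    return (visual_consistency + text_adherence) / 2

# e.g., 90.0 visual consistency and 70.0 text adherence -> overall score of 80.0
print(overall_score(90.0, 70.0))  # 80.0
```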

Finetuning with TALC achieves the best performance. We finetuned the ModelScope T2V model on the multi-scene video-text data using both the TALC framework and the merging-captions method. Finetuning with TALC achieved the highest overall score, maintaining strong visual consistency while significantly enhancing text adherence. Conversely, finetuning with merged captions resulted in a notable decrease in visual consistency. This suggests that finetuning on multi-scene data predominantly benefits text adherence, particularly when using the TALC approach.

Qualitative Examples


Generated Video Examples

We show videos generated with the merging-captions baseline and with TALC for various multi-scene descriptions. Overall, we find that the merging-captions baseline generates poor-quality videos. This highlights that finetuning a T2V model on multi-scene video-text data by naively merging the scene-specific descriptions in the raw text space leads to undesirable artifacts in the generated videos.

Scene 1: Superman is surfing on the waves.
Scene 2: The Superman falls into the water.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: Spiderman is surfing on the waves.
Scene 2: Darth Vader is surfing on the same waves.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: A stuffed toy is lying on the road.
Scene 2: A person enters and picks the stuffed toy.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: Red panda is moving in the forest.
Scene 2: The red panda spots a treasure chest.
Scene 3: The red panda finds a map inside the treasure chest.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: A labrador moves towards the camera.
Scene 2: The labrador moves away from the camera.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: A koala climbs a tree.
Scene 2: The koala eats the eucalyptus leaves.
Scene 3: The koala takes a nap.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: Superman is flying in the air.
Scene 2: Spiderman is flying in the air.

Merging Captions

Teaser

TALC (Ours)

Teaser

Scene 1: A cook pours the batter into a pan.
Scene 2: The cook stirs the batter in the pan.
Scene 3: The cook puts the cake on the table.
Scene 4: The cook cuts the cake on the table.

Merging Captions

Teaser

TALC (Ours)

Teaser