Paper Detail

CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun

huggingface Score 9.0

Published 2026-03-31 · First seen 2026-04-01

General AI

Abstract

Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models~(MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{zhao2026cutclaw,
  title = {CutClaw: Agentic Hours-Long Video Editing via Music Synchronization},
  author = {Shifang Zhao and Yihan Hu and Ying Shan and Yunchao Wei and Xiaodong Cun},
  year = {2026},
  abstract = {Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models\textasciitilde{}(MLLMs) as an agent system. },
  url = {https://huggingface.co/papers/2603.29664},
  keywords = {multimodal language models, multi-agent framework, video editing, audio alignment, narrative consistency, visual content selection, aesthetic criteria, semantic criteria, code available, huggingface daily},
  eprint = {2603.29664},
  archiveprefix = {arXiv},
}

Metadata

{}