Claude Code's /watch plugin: Frame extraction, Whisper, and hook microscopes

Anthropic’s Claude models process text and images, but video input remains absent from the official API as of 2025. Developers have built workarounds that combine frame extraction, OpenAI’s Whisper transcription, and Claude Code hooks to give Claude functional video analysis capabilities.

TL;DR: Claude lacks native video input, but developers bridge this gap by extracting keyframes with FFmpeg, generating transcripts with Whisper, and routing both through Claude Code hooks. This pipeline lets Claude answer questions about video content, though it consumes significant tokens and loses temporal nuance between sampled frames.

Why Can’t Claude Watch Videos Natively?

Claude’s multimodal input handles images and PDF documents, but the standard API does not accept video files. The model works on static images rather than continuous temporal streams, so any video must be converted into still frames before Claude can see it.

Video files typically arrive as compressed streams using codecs like H.264 or VP9, which exploit spatial and temporal redundancy across the thousands of frames in each minute of footage. Claude’s vision encoder expects individual rasterized images, not inter-frame delta compression or motion vectors. Feeding a raw video file directly would fail at the input parsing stage.

Anthropic’s official documentation for Claude’s vision capabilities confirms support for JPEG, PNG, GIF, and WebP formats. The API accepts base64-encoded image data or image URLs, with each image counting toward token limits based on resolution. A 1080p frame typically consumes roughly 1,600 tokens after automatic resizing.
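The per-frame figure can be approximated with a few lines of arithmetic. The sketch below assumes three constants drawn from Anthropic’s vision documentation (a 1,568-pixel long-edge limit, roughly 750 pixels per token, and a ceiling of about 1,600 tokens per image); treat all three as assumptions and check the current docs before relying on them. The function name estimate_image_tokens is illustrative.

```python
# Assumed constants; verify against the current vision documentation.
MAX_LONG_EDGE_PX = 1568   # images with a longer edge are downscaled
PIXELS_PER_TOKEN = 750    # tokens are roughly width * height / 750
MAX_IMAGE_TOKENS = 1600   # approximate per-image ceiling after resizing


def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate the token cost of one image after automatic resizing."""
    w, h = float(width), float(height)
    # Step 1: shrink so the longer edge fits within the limit.
    long_edge = max(w, h)
    if long_edge > MAX_LONG_EDGE_PX:
        scale = MAX_LONG_EDGE_PX / long_edge
        w, h = w * scale, h * scale
    # Step 2: a further area-based downscale lands on the ceiling,
    # so capping the estimate is equivalent for budgeting purposes.
    tokens = min(w * h / PIXELS_PER_TOKEN, MAX_IMAGE_TOKENS)
    return round(tokens)


print(estimate_image_tokens(1920, 1080))  # a 1080p frame
```

Under these assumptions a 1080p frame lands at the ceiling, which is consistent with the roughly 1,600 tokens quoted above; smaller frames cost proportionally less.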

The absence of native video support forces developers to decompose video into discrete modalities: visual frames and audio tracks. Each modality gets processed separately, then the outputs merge into a unified context window for Claude to reason over. This decomposition is lossy but functional.
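A minimal sketch of the merge step, assuming the base64 image block format used by the Claude Messages API. The helper name build_content and the labeling scheme are illustrative, and the API call itself is deliberately left out.

```python
import base64


def build_content(frames, transcript, question):
    """Merge frames and transcript into one list of content blocks.

    frames: ordered list of (timestamp_label, jpeg_bytes) tuples.
    """
    content = []
    for label, jpeg_bytes in frames:
        # A short text label before each image keeps the timeline explicit.
        content.append({"type": "text", "text": f"Frame at {label}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(jpeg_bytes).decode("ascii"),
            },
        })
    content.append({"type": "text", "text": f"Transcript:\n{transcript}"})
    content.append({"type": "text", "text": question})
    return content


blocks = build_content([("00:00:00", b"abc")], "Hello.", "What happens?")
print([b["type"] for b in blocks])
```

The resulting list would be sent as the content of a single user message, so the frames and the transcript share one context.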

How Do You Extract Keyframes from Video Files?

FFmpeg serves as the standard tool for frame extraction, available on macOS via Homebrew (brew install ffmpeg) and on Linux through most package managers. The process involves sampling frames at regular intervals, typically one frame every 1 to 5 seconds depending on content density.

A basic extraction command looks like this:

ffmpeg -i input.mp4 -vf "fps=1" frame_%04d.jpg

This command produces one JPEG per second of video. For a 10-minute clip, that yields 600 images, which far exceeds Claude’s practical context limits. Developers must tune the fps parameter based on video length and the density of visual changes.

A more conservative approach samples one frame every 10 seconds:

ffmpeg -i input.mp4 -vf "fps=1/10" frame_%04d.jpg

This produces 60 frames for a 10-minute video, a manageable payload. For talking-head videos where visual changes are minimal, even sparser sampling at one frame per 30 seconds may suffice. Action-heavy content requires denser sampling to capture motion sequences.
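The sampling arithmetic is easy to script. The sketch below is one way to pick an interval from a frame budget and build the matching FFmpeg command without running it; the function names are hypothetical and the frame counts are approximations of what the fps filter emits.

```python
import math


def frame_count(duration_s: float, interval_s: float) -> int:
    """Approximate number of frames produced by fixed-interval sampling."""
    return math.floor(duration_s / interval_s)


def interval_for_budget(duration_s: float, max_frames: int) -> int:
    """Smallest whole-second interval that keeps the frame count within budget."""
    return max(1, math.ceil(duration_s / max_frames))


def extraction_command(video: str, interval_s: int, pattern: str = "frame_%04d.jpg"):
    """Build (but do not run) an FFmpeg command for the chosen interval."""
    return ["ffmpeg", "-i", video, "-vf", f"fps=1/{interval_s}", pattern]


interval = interval_for_budget(600, 60)          # 10-minute clip, 60-frame budget
print(interval, frame_count(600, interval))
print(extraction_command("input.mp4", interval))
```

The command list can be passed to subprocess.run once the interval looks right for the content.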

Scene detection offers a smarter alternative to fixed-interval sampling. FFmpeg’s select filter with scene change thresholds identifies moments where the visual content shifts significantly:

ffmpeg -i input.mp4 -vf "select='gt(scene,0.3)',showinfo" -vsync vfr frame_%04d.jpg

The scene threshold ranges from 0 to 1; a value of 0.3 captures moderate changes, while 0.5 captures only major cuts. This approach drops redundant frames from static scenes while preserving transitions, slides, and visual events. For lecture recordings or presentation videos, scene detection can yield around 80% fewer frames than fixed-interval sampling at comparable coverage, though the savings depend heavily on how often the picture actually changes.
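Because scene-detected frames are not evenly spaced, their timestamps have to be recovered from FFmpeg’s log. The showinfo filter prints a pts_time field for each frame it passes; a minimal sketch of parsing it follows, assuming the log was captured from FFmpeg’s stderr. The sample log lines are abbreviated and illustrative, not real output.

```python
import re

# showinfo logs one line per selected frame, including "pts_time:<seconds>".
PTS_TIME = re.compile(r"pts_time:\s*([0-9]+(?:\.[0-9]+)?)")


def scene_timestamps(ffmpeg_log: str):
    """Return the pts_time value (in seconds) of each frame logged by showinfo."""
    return [float(m.group(1)) for m in PTS_TIME.finditer(ffmpeg_log)]


# Illustrative excerpt; real showinfo lines carry many more fields.
sample_log = """
[Parsed_showinfo_1 @ 0x0] n:   0 pts:  38400 pts_time:2.5 fmt:yuv420p
[Parsed_showinfo_1 @ 0x0] n:   1 pts: 645120 pts_time:42 fmt:yuv420p
"""
print(scene_timestamps(sample_log))
```

The nth timestamp in the returned list corresponds to the nth output file, since FFmpeg numbers the frames in the order it writes them.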

Extracted frames should be organized with consistent naming conventions to preserve temporal ordering. A JSON manifest mapping frame filenames to timestamps enables Claude to reference specific moments:

[
 {"frame": "frame_0001.jpg", "timestamp": "00:00:00"},
 {"frame": "frame_0002.jpg", "timestamp": "00:00:10"}
]

This manifest becomes part of the prompt context, allowing Claude to cite specific times when describing events. Without timestamp metadata, Claude loses the ability to sequence events chronologically across frames.
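For fixed-interval sampling the manifest can be generated rather than written by hand. This sketch assumes FFmpeg’s 1-based numbering for the frame_%04d.jpg pattern and treats the timestamps as approximate, since the fps filter does not guarantee frames land exactly on interval boundaries. The helper names are illustrative.

```python
import json


def hms(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_manifest(n_frames: int, interval_s: int):
    """Map 1-based frame filenames to approximate timestamps."""
    return [
        {"frame": f"frame_{i:04d}.jpg", "timestamp": hms((i - 1) * interval_s)}
        for i in range(1, n_frames + 1)
    ]


print(json.dumps(build_manifest(2, 10), indent=1))
```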

What Role Does Whisper Play in Video Understanding?

OpenAI’s Whisper model transcribes the audio track from video files into structured text with timestamps. Since many videos convey most of their informational content through speech, the transcript often matters more than the visual frames. Whisper bridges this gap.

Whisper is available in multiple model sizes, from tiny (39M parameters) to large-v3 (1.5B parameters). The base model offers a reasonable balance of speed and accuracy for English content, while medium or large models improve performance on accented speech or non-English languages. Running locally via the whisper Python package avoids API costs:

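import whisper

# Load the base model; larger models trade speed for accuracy.
model = whisper.load_model("base")
# audio.wav is the 16 kHz mono track extracted from the video with FFmpeg.
result = model.transcribe("audio.wav")

# Print each segment with its start and end time in seconds.
for segment in result["segments"]:
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")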

Audio extraction from video uses FFmpeg again:

ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

Whisper outputs segments with start and end timestamps, per-segment confidence measures (such as avg_logprob and no_speech_prob), and the detected language. Feeding the full transcript into Claude’s context window provides the narrative backbone of the video. Claude can then cross-reference transcript timestamps with extracted frame timestamps to correlate spoken content with visual changes.

Whisper’s word-level timestamp mode (word_timestamps=True) enables fine-grained alignment between audio and frames. This precision matters for videos where visual changes happen mid-sentence, such as coding tutorials where the speaker references on-screen code. Without word-level alignment, Claude might associate a frame with the wrong segment of speech.
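A minimal sketch of that alignment, assuming word entries shaped like the ones Whisper attaches to each segment when word_timestamps=True (dicts with word, start, and end keys). The words_near helper, the two-second window, and the sample data are all hypothetical.

```python
def words_near(words, frame_time, window_s=2.0):
    """Return the words spoken within window_s seconds of a frame."""
    lo, hi = frame_time - window_s, frame_time + window_s
    # A word counts if any part of it overlaps the window.
    return [w["word"].strip() for w in words if w["end"] >= lo and w["start"] <= hi]


# Hypothetical word entries shaped like Whisper's word-level output.
words = [
    {"word": " Now", "start": 9.0, "end": 9.2},
    {"word": " open", "start": 9.3, "end": 9.6},
    {"word": " main.py", "start": 9.7, "end": 10.4},
    {"word": " and", "start": 14.0, "end": 14.2},
]
print(words_near(words, frame_time=10.0))
```

Attaching these few words to each frame as a caption tells Claude what was being said when the frame was captured, which a segment-level match can blur.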

The transcript can also hint at non-speech audio. Whisper is a speech recognizer rather than a sound-event detector, but it sometimes emits bracketed tags such as [Music] and reports a per-segment no_speech_prob for likely silence. These cues are not reliable, yet they can give Claude rough context about the video’s structure, such as intro music or pause segments.

How Does Whisper Transcription Compare to Claude’s Native Audio Processing?

OpenAI’s Whisper is among the most accurate open-source speech-to-text systems available. A commonly cited figure is a 7.6% word error rate on the LibriSpeech benchmark, though results vary considerably with model size and test split. Claude does not natively process audio tracks from video files, which means a dedicated transcription layer is required for any project involving spoken content. The Claude API accepts transcribed text as standard input, allowing developers to pass Whisper output directly into a prompt context.

Whisper handles 96 languages with automatic detection, and its large-v3 model processes roughly 30 seconds of audio per second on an NVIDIA A100 GPU. This speed matters. A 10-minute video transcribes in about 20 seconds, after which the text feeds into Claude for summarization, analysis, or question answering. The transcription pipeline adds one external dependency but solves a fundamental limitation.

Anthropic introduced a Voice Mode on Claude mobile apps that supports multilingual input, as reported by TestingCatalog on X. This feature handles real-time conversation but does not replace file-based transcription workflows. For pre-recorded video analysis, Whisper remains the standard bridge between audio and Claude’s language capabilities.

What Are the Technical Requirements for Building a Claude Video Pipeline?

A functional Claude Watch pipeline requires three components: a frame extraction tool (typically FFmpeg), a transcription service (Whisper API or local model), and the Claude API for final synthesis. FFmpeg extracts frames at configurable intervals — one frame every 2 seconds produces 30 images from a 60-second clip. Claude’s vision API accepts up to 100 images per request, so longer videos need batched processing.

The system must manage token budgets carefully. Each image consumes approximately 1,600 tokens in Claude’s vision format, meaning a 100-image batch uses 160,000 input tokens before any text analysis occurs. Claude’s context window accommodates this, but costs add up quickly at current API pricing.

For transcription, OpenAI’s Whisper API charges $0.006 per minute of audio. A 60-minute video costs $0.36 to transcribe. Local Whisper deployments eliminate per-minute costs but require GPU resources. The Claude API call itself depends on token volume and model selection.

Here is a breakdown of typical pipeline costs for a 10-minute video at standard settings:

Component	Resource Used	Cost (USD)
Frame extraction (FFmpeg)	300 frames at 2 s intervals (0.5 fps)	$0 (local processing)
Image processing (Claude Vision)	100 selected frames × 1,600 tokens	~$0.48 with Claude Opus
Audio transcription (Whisper API)	10 minutes of audio	$0.06
Text analysis (Claude API)	~5,000 tokens transcript + prompt	~$0.23 with Claude Opus
Total estimated cost	Full pipeline	~$0.77
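The arithmetic behind these line items can be sketched as follows. The 1,600-token and $0.006-per-minute constants come from this article; the per-million-token price is a caller-supplied assumption, since model pricing changes and is not fixed here.

```python
TOKENS_PER_IMAGE = 1_600          # per-frame estimate used in this article
WHISPER_USD_PER_MINUTE = 0.006    # Whisper API rate quoted in this article


def image_tokens(n_frames: int) -> int:
    """Input tokens consumed by the frames alone."""
    return n_frames * TOKENS_PER_IMAGE


def image_cost(n_frames: int, usd_per_million_input: float) -> float:
    """Input-token cost of the frames at a caller-supplied price."""
    return image_tokens(n_frames) * usd_per_million_input / 1_000_000


def whisper_cost(minutes: float) -> float:
    """Transcription cost at the quoted per-minute rate."""
    return minutes * WHISPER_USD_PER_MINUTE


print(image_tokens(100))
print(round(image_cost(100, 3.0), 2))   # 3.0 USD per million is an example rate
print(round(whisper_cost(10), 2))
```

Under this formula the $0.48 image line corresponds to an input price of $3 per million tokens; substitute the current rate for whichever model is in use.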

Developers building production systems should consider caching strategies. Repeated analysis of the same video should reuse extracted frames and transcripts rather than regenerating them.
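One way to key such a cache is by a hash of the video’s contents, so a renamed copy of the same file still hits the cache. This is a sketch under assumed conventions: the .video_cache directory name and the transcript.json marker file are hypothetical choices, not part of any tool mentioned here.

```python
import hashlib
from pathlib import Path


def video_cache_dir(video_path: str, cache_root: str = ".video_cache") -> Path:
    """Return a cache directory keyed by the video's content hash."""
    digest = hashlib.sha256()
    with open(video_path, "rb") as fh:
        # Read in 1 MiB chunks so large videos are not loaded into memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return Path(cache_root) / digest.hexdigest()[:16]


def is_cached(cache_dir: Path) -> bool:
    """Treat the video as processed once a transcript file exists."""
    return (cache_dir / "transcript.json").exists()
```

A pipeline would call is_cached before running FFmpeg or Whisper and write frames and the transcript into the returned directory afterward.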

Which Claude Model Should You Use for Video Analysis?

Model selection depends on task complexity and latency requirements. Claude Opus provides the deepest reasoning for complex video analysis tasks involving multiple visual elements and contextual understanding. Claude Sonnet offers a balance of speed and capability that suits most frame-by-frame description workflows.

According to the Claude Code CLI guide by Blake Crosley, Claude Code v2.1.173 uses Opus 4.8 as the default model, with a fallback chain configuration option. The guide describes a --safe-mode flag and Auto Mode in Bedrock that affects model routing decisions. These configuration options matter for developers building automated video pipelines.

Claude Fable 5 and Claude Mythos 5 represent newer tiers in the model lineup, as documented in the Claude API Docs. However, access to Fable has faced regulatory hurdles — the White House ordered its shutdown shortly after launch, and Anthropic is currently negotiating to restore availability, according to reporting from Next.Gazeta.pl. For production pipelines today, Opus and Sonnet remain the accessible choices.

How Do You Handle Long Videos That Exceed Token Limits?

Long-form content requires chunking strategies that split video analysis into manageable segments. A 60-minute video at one frame every 5 seconds generates 720 frames — far exceeding Claude’s 100-image-per-request limit. The solution involves temporal segmentation: divide the video into chapters, process each independently, then synthesize results.
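Splitting an ordered frame list into request-sized batches takes only a few lines. The sketch below uses the 100-image limit quoted in this article as its default; batch_frames is an illustrative name.

```python
def batch_frames(frames, max_per_request=100):
    """Split an ordered frame list into request-sized batches."""
    return [
        frames[i:i + max_per_request]
        for i in range(0, len(frames), max_per_request)
    ]


# 60 minutes at one frame every 5 seconds gives 720 frames.
frames = [f"frame_{i:04d}.jpg" for i in range(1, 721)]
batches = batch_frames(frames)
print(len(batches), len(batches[-1]))
```

Batching by count alone ignores content boundaries, so in practice the cut points are better placed at chapter or scene changes where those are known.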

FFmpeg supports precise timestamp-based extraction, allowing developers to pull frames from specific time ranges. A practical approach extracts one frame every 10 seconds for overview analysis, then denser sampling (1 frame per second) for segments requiring detailed examination. This hierarchical strategy balances coverage with token economy.
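The two passes can be expressed as one command builder. This sketch places -ss and -t before -i so FFmpeg treats them as input options; the range_command name, the file prefixes, and the example time range are assumptions for illustration, and nothing is executed.

```python
def range_command(video, start_s, duration_s, fps, prefix):
    """Build (but do not run) an FFmpeg command for one time range."""
    return [
        "ffmpeg",
        "-ss", str(start_s),      # seek to the start of the range
        "-t", str(duration_s),    # limit how much input is read
        "-i", video,
        "-vf", f"fps={fps}",
        f"{prefix}_%04d.jpg",
    ]


# Overview pass: the whole hour at one frame every 10 seconds.
overview = range_command("input.mp4", 0, 3600, "1/10", "overview")
# Detail pass: one frame per second for a minute starting at 10:00.
detail = range_command("input.mp4", 600, 60, "1", "detail")
print(detail)
```

Using separate filename prefixes keeps the overview and detail frames from overwriting each other.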

For transcription, Whisper processes audio in 30-second windows natively and produces timestamped segments of varying length. These segments do not line up with frame extraction intervals automatically, but developers can match frames to transcript segments by timestamp, giving Claude synchronized multimodal context for each chunk.
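A minimal sketch of that correlation, bucketing frames and transcript segments into fixed-length chunks. It assumes segments are dicts with start, end, and text keys (the shape Whisper returns) and assigns each segment to the chunk in which it starts; the five-minute chunk length and the helper name are arbitrary choices.

```python
def group_by_chunk(frames, segments, chunk_s=300):
    """Bucket frames and transcript segments into fixed-length chunks.

    frames: list of (timestamp_seconds, filename) tuples.
    segments: list of dicts with "start", "end", and "text" keys.
    """
    chunks = {}
    for ts, name in frames:
        bucket = chunks.setdefault(int(ts // chunk_s), {"frames": [], "text": []})
        bucket["frames"].append(name)
    for seg in segments:
        # Assign each segment to the chunk in which it starts.
        bucket = chunks.setdefault(int(seg["start"] // chunk_s), {"frames": [], "text": []})
        bucket["text"].append(seg["text"].strip())
    return [chunks[k] for k in sorted(chunks)]


frames = [(0, "a.jpg"), (290, "b.jpg"), (310, "c.jpg")]
segments = [
    {"start": 1.0, "end": 4.0, "text": " Hello"},
    {"start": 305.5, "end": 309.0, "text": " Next topic"},
]
print(group_by_chunk(frames, segments))
```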

The synthesis step combines chapter summaries into a final output. Claude excels at this consolidation task, taking 10-15 chunk summaries and producing a coherent full-video analysis. This approach scales to arbitrary video length without hitting architectural limits.
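The consolidation step follows a simple map-then-reduce shape. In the sketch below, ask_model is a hypothetical callable standing in for a Claude API request, and the prompt wording is illustrative only.

```python
from typing import Callable, List


def summarize_video(chunks: List[str], ask_model: Callable[[str], str]) -> str:
    """Summarize each chunk, then consolidate the chunk summaries."""
    # Map: one summary per chunk.
    summaries = [
        ask_model(f"Summarize chunk {i} of {len(chunks)}:\n{chunk}")
        for i, chunk in enumerate(chunks, start=1)
    ]
    # Reduce: a single request over the numbered summaries.
    combined = "\n".join(
        f"Chunk {i}: {s}" for i, s in enumerate(summaries, start=1)
    )
    return ask_model(f"Combine these chunk summaries into one analysis:\n{combined}")
```

Passing the model call in as a parameter keeps the orchestration testable with a stub and leaves model choice to the caller.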

Frequently Asked Questions

Can Claude directly watch and analyze video files?

No, Claude cannot process video files natively. The Claude API accepts images (up to 100 per request) and text, but has no video decoding capability. Developers must extract frames using tools like FFmpeg and transcribe audio using Whisper before sending content to Claude. Anthropic’s Voice Mode on mobile apps handles real-time speech but does not extend to file-based video processing.

How much does it cost to analyze a 30-minute video with Claude?

A 30-minute video typically costs between $1.50 and $3.00 depending on model choice and frame sampling rate. Whisper API transcription costs $0.18 for 30 minutes of audio at $0.006 per minute. Claude Vision processing of 180 frames (one every 10 seconds) consumes approximately 288,000 tokens, costing roughly $1.38 with Opus pricing; because 180 frames exceed the 100-image-per-request limit, they must be split across at least two requests. Additional text analysis adds $0.30-$0.50.

Does Claude’s Cowork feature support video analysis?

Claude Cowork provides file, browser, and app access on your computer, as documented in the Claude Help Center safety guidelines. Cowork can open video files and interact with media players, but it still requires the same frame extraction and transcription pipeline — it does not add native video understanding. The Cowork safety documentation emphasizes understanding risks before granting file system access.

Is Claude Fable 5 available for video analysis pipelines?

Claude Fable 5 is documented in the Claude API Docs as part of the model lineup alongside Mythos 5, but its availability is currently restricted. The White House ordered Fable’s shutdown after launch, and Anthropic is negotiating restoration, per Next.Gazeta.pl reporting. Production video pipelines should target Claude Opus or Sonnet, which remain fully accessible.

Summary

Building a Claude Watch pipeline requires bridging three technologies: FFmpeg for frame extraction, Whisper for audio transcription, and the Claude API for multimodal synthesis. Here are the key takeaways:

Claude cannot process video directly — frame extraction and transcription are mandatory preprocessing steps that convert video into formats Claude understands (images and text).
Token budgeting is critical — each image consumes roughly 1,600 tokens, so a 100-image batch uses 160,000 tokens before analysis begins. Sampling strategy directly impacts both cost and coverage quality.
Whisper delivers production-grade transcription at $0.006 per minute, with a commonly cited 7.6% word error rate on LibriSpeech that varies by model size. It supports 96 languages and integrates cleanly with Claude’s text input.
Model selection matters — Opus for complex reasoning, Sonnet for speed-sensitive tasks. Fable 5 remains restricted by White House order, so target Opus or Sonnet for production systems.
Long videos need chunking — temporal segmentation with timestamp correlation between frames and transcript segments allows Claude to process content of arbitrary length through hierarchical summarization.

For developers ready to build, start with a simple prototype: extract one frame every 5 seconds from a short clip, transcribe with Whisper, and send both to Claude with a structured prompt. The architecture scales from there. The Claude API Docs and Help Center provide the technical specifications needed for production deployment.