CVPR 2026 Workshop

3rd CV4Smalls Workshop

Computer Vision with Small Data: Beyond Scale
Toward Data-Efficient, Dynamics-Aware Video Intelligence

Small Data as a Design Principle

While large-scale pretraining has revolutionized image understanding, video remains bound by high annotation costs and complex dynamics. At this year's 3rd annual CV4Smalls workshop, we seek to treat these constraints as opportunities, using small data as a design driver for physics-aware, motion-centric intelligence. Join us to define the future of generalizable video AI, featuring a new challenge co-hosted by Voxel51 and Twelve Labs.

Call for Papers & Participation

Algorithmic Track

We invite full papers (up to 8 pages, to be included in the official CVPR Workshop Proceedings upon acceptance) that advance research in small-data, motion-centric, and long-form video understanding, including but not limited to:

  • Physics-aware and dynamics-driven spatiotemporal representation learning
  • Long-form video reasoning, temporal segmentation, and event discovery
  • Data-efficient models for action recognition, captioning, and video question answering
  • Self-, weakly-, and semi-supervised learning for video perception under limited data
  • Domain adaptation, generalization, and cross-modal transfer for video ML
  • Synthetic, simulated, and generative data for small-data video training
  • Causality-aware video understanding and temporal grounding
  • Multimodal fusion across video, audio, language, and embodied sensor streams

Emerging Applications

We also welcome short papers (up to 4 pages, non-archival) that showcase deployment-focused innovations and case studies to be presented as posters, including:

  • Healthcare and surgical assistance with limited labeled data
  • Assistive and autonomous robotics in dynamic, real-world environments
  • Security, surveillance, and public safety using privacy-preserving models
  • Environmental and wildlife monitoring with sparse, long-duration video data
  • Synthetic-to-real adaptation for physics-based and simulation-driven platforms

Challenge Track

Our challenge at CVPR aims to advance data-centric video understanding through a hands-on exploration of real-world safety scenarios.

This edition will feature a collaboration between Voxel51 and Twelve Labs, combining Twelve Labs’ multimodal video understanding platform with FiftyOne’s powerful visualization and curation capabilities.

Participants will engage with a worker safety dataset, focusing on analyzing and classifying safe versus unsafe behaviors in manufacturing environments. Through this challenge, attendees will have the opportunity to explore multiple latent spaces, experiment with diverse models and algorithms, and visualize their results using integrated workflows.
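To give a sense of the kind of workflow the challenge encourages, below is a minimal sketch using FiftyOne's Python API to load a video dataset, compute frame embeddings, and explore the resulting latent space in the App. The dataset path, dataset name, and choice of embedding model are illustrative placeholders, not the official challenge setup.

```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load a local directory of video clips (path and name are hypothetical)
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/worker_safety_clips",
    dataset_type=fo.types.VideoDirectory,
    name="worker-safety-demo",
)

# Work at the frame level: sample frames from each clip to disk
frames = dataset.to_frames(sample_frames=True)

# Compute embeddings with an off-the-shelf zoo model
# (any embedding model could be substituted to compare latent spaces)
model = foz.load_zoo_model("clip-vit-base32-torch")
embeddings = frames.compute_embeddings(model)

# Project the embeddings into 2D for interactive exploration
fob.compute_visualization(
    frames,
    embeddings=embeddings,
    brain_key="frame_embeddings",
    method="umap",
)

# Launch the FiftyOne App to browse clips alongside the embedding plot
session = fo.launch_app(frames)
session.wait()
```

Swapping in different embedding models or projection methods in the same workflow is one way to compare latent spaces before committing to a classification approach.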

Keynote Speakers

Prof. Juan Carlos Niebles
Research Director, Salesforce AI Research
Co-Director, Stanford Vision and Learning Lab
Talk Title: To be announced

Dr. Vasudev Lal
Director, Applied Science, Multimodal GenAI, Oracle
Talk Title: “Learning More from Less Multimodal Data: Lessons from Counterfactual Augmentation, Spatial Re-Captioning, and Cross-Lingual Transfer”

Motivation & Scope

Despite remarkable advances in image- and text-based machine learning, video understanding remains fundamentally constrained by its data inefficiency. Large-scale pretraining has revolutionized vision and language domains, yet video models continue to face prohibitive annotation costs, complex spatiotemporal dependencies, and high domain variability that severely hinder generalization.

In practical applications, videos rarely arrive as neatly segmented clips of interest. Real-world streams are continuous, ambiguous, and context-dependent, requiring models to reason dynamically over extended time horizons and under uncertainty. These challenges become even more pronounced in long-form video understanding, where systems must track entities, events, and evolving contexts across minutes or even hours.

Yet most existing video-language models discard fine-grained motion cues early in their pipelines, relying on frame-level encoders that fail to capture causal and long-term temporal dependencies. As a result, current systems miss the true spatiotemporal dynamics that define complex interactions and behaviors in the real world.

The Paradigm Shift

We argue that small-data constraints should not be treated as limitations, but as core design drivers for algorithms, benchmarks, and deployment pipelines. By reframing small data as an opportunity rather than a barrier, this workshop will chart the path toward data-efficient, dynamics-aware, and deployable video perception systems that thrive under real-world conditions.

Important Dates

  • Dec 2025

    Call for Papers

    Official release.

  • March 4, 2026

    Submission Deadline

    Full & short papers due.

  • March 18, 2026

    Decision Notifications

    Reviews completed and decisions sent.

  • April 8, 2026

    Camera-Ready

    Camera-ready versions due.

Organized By

  • Northeastern University
  • University of Central Florida
  • University of Michigan
  • Voxel51
  • Twelve Labs
  • Amazon Prime Video
  • Microsoft Research