Tuesday, April 7, 2026

Video-MME-v2 raises the bar for video understanding evaluation; Adam's Law reveals textual frequency scaling in LLMs; Gemma 4 family dominates model releases with MoE and any-to-any variants

video-understanding-benchmarks · empirical-scaling-laws · agent-trajectory-optimization · tool-use-efficiency · gemma-4-ecosystem · virtual-try-on-and-video-synthesis

Executive Summary

April 7th is defined by evaluation rigor and empirical laws. The day's top paper, Video-MME-v2, arrives with 68 upvotes and establishes a comprehensive new benchmark for video understanding that exposes capability gaps even in frontier multimodal models. It is paired with Adam's Law (48 upvotes), which reports a power-law relationship between token frequency in training data and LLM behavior. The day's research message is clear: understanding what models actually learn, and how to measure it, matters more than scaling further.

Agent trajectory retrieval and tool-use efficiency form the day's secondary theme. Two 18-upvote papers attack complementary problems: one teaches agents to retrieve useful information from execution histories, while the other catalogs inefficiency patterns when LLMs use tools for reasoning. Together, they point toward a maturing agent ecosystem that is moving past "can agents work" to "how do we make agents work well."

On the model front, Google's Gemma 4 family dominates HuggingFace trending with six variants spanning 2B to 31B parameters, including MoE (26B-A4B) and any-to-any modality models. The abliterated Gemma 4 variant from DealignAI and Jackrong's Claude Opus reasoning distillation into Qwen3.5 signal an active community immediately stress-testing and remixing new releases. GitHub trending is led by NousResearch's hermes-agent at 3,009 stars/day, confirming that agent infrastructure remains the dominant category in open-source AI.

Researcher Notes

Video-MME-v2 is the most important benchmark paper of the week, and its 68 upvotes reflect genuine community demand. The original Video-MME was already a go-to evaluation for video LLMs, but v2 substantially expands coverage with longer videos, more diverse question types, and harder temporal reasoning tasks. The timing is what makes this significant: video understanding is the next frontier now that image understanding has been largely commoditized by GPT-4o and Gemini. The benchmark's release alongside the Gemma 4 any-to-any models creates an immediate test surface, so expect leaderboard results within days. The key question is whether any model reaches human parity on the temporal reasoning subset, historically the weakest capability.

Adam's Law is a sleeper hit that could reshape how we think about LLM training data. The paper identifies a power-law relationship between textual token frequency and LLM behavior — essentially showing that models develop predictable biases tied to how often they encounter specific patterns during training. This is not merely an academic curiosity: it has direct implications for data curation, deduplication strategies, and understanding failure modes. If frequency-dependent behavior is as lawful as the paper claims, it becomes possible to predict (and correct) model weaknesses from training data statistics alone. The 48 upvotes suggest practitioners see the practical value immediately.

The agent efficiency papers reveal that the field is entering its "optimization phase." The paper on learning to retrieve from agent trajectories (Zhou et al.) and the analysis of inefficiency patterns in tool-integrated reasoning (Su et al.) both scored 18 upvotes — modest by engagement standards but disproportionately important. They represent the shift from "proof of concept" agent work to engineering discipline. The inefficiency paper is particularly telling: it catalogs specific failure patterns where LLMs waste tool calls, loop unproductively, or choose suboptimal tool sequences. This kind of systematic failure analysis is exactly what precedes rapid improvement in any engineering field. Combined with the broader agent evaluation work trending this week, we are seeing the agent research community mature in real time.

The Gemma 4 model tsunami deserves attention for its structural implications, not just its performance. Six Gemma 4 variants trending simultaneously, including the first any-to-any modality models from Google at the 2B and 4B scale, signal that Google is pursuing a "flood the zone" strategy for open-weight models. The MoE variant (26B total, 4B active) is particularly interesting because it brings mixture-of-experts to a scale accessible to consumer hardware. Meanwhile, the community response is immediate: DealignAI's abliterated Gemma 4 31B (705 likes) removes safety guardrails. Jackrong's Claude Opus distillation (2,454 likes), which targets a Qwen3.5-27B base rather than Gemma 4, shows the same appetite for injecting stronger reasoning into open weights. The speed of community remixing is itself a signal: it suggests Gemma 4's architecture is modular enough to support diverse modifications.

Virtual try-on and compositional video synthesis represent quiet but steady progress in applied generative AI. Vanast and ONE-SHOT both address practical video generation challenges — human image animation for try-on and compositional human-environment synthesis respectively. Neither will make headlines, but they represent the kind of application-ready research that drives commercial adoption. The fact that they appear alongside the more academic Video-MME-v2 benchmark highlights the dual nature of today's video AI landscape: fundamental evaluation is improving at the same time as commercial applications ship.

Themes & Trends

Video Understanding Benchmarks

rising

Video-MME-v2 leads a push toward more rigorous evaluation of video understanding capabilities, exposing persistent temporal reasoning gaps in frontier multimodal models.

Empirical Scaling Laws

rising

Adam's Law reveals predictable frequency-dependent behaviors in LLMs, extending the tradition of scaling laws beyond loss curves into behavioral characterization of trained models.

Agent Trajectory Optimization

rising

Multiple papers address how agents learn from, retrieve, and optimize their execution histories, marking a shift from agent capability to agent efficiency.

Tool-Use Efficiency

rising

Analysis of how LLMs interact with external tools reveals systematic inefficiency patterns, suggesting that tool orchestration is becoming a first-class optimization target.

Gemma 4 Ecosystem Explosion

rising

Google's simultaneous release of dense, MoE, and any-to-any Gemma 4 variants triggers an ecosystem wave including abliterations, quantizations, and reasoning distillations within hours.

Reasoning and Self-Refinement

stable

Joint optimization of reasoning and self-correction capabilities in LLMs, rather than treating them as separate add-on capabilities, shows measurable gains.

Trending Papers (13)

Video-MME-v2: A Comprehensive Video Understanding Benchmark

High Relevance

Chaoyou Fu, Yuhan Dai, Yongdong Luo, et al. University of Science and Technology of China, Shanghai AI Laboratory

A substantially expanded video understanding benchmark that tests multimodal models across longer videos, diverse question types, and harder temporal reasoning tasks, establishing a new standard for video LLM evaluation.

Key Findings

  • Comprehensive evaluation suite covering temporal reasoning, long-form video comprehension, and multi-turn QA

  • Frontier models including GPT-4o and Gemini still exhibit significant capability gaps on temporal reasoning subsets

  • Benchmark design enables fine-grained diagnosis of video understanding failures across duration and complexity axes

benchmark · video-understanding · multimodal · evaluation · temporal-reasoning
68 upvotes

Adam's Law: Textual Frequency Law on LLMs

High Relevance

Hongyuan Adam Lu, et al. University of Waterloo

Uncovers a power-law relationship between textual token frequency in training data and LLM behavioral patterns, demonstrating that model biases and failure modes are predictable from data statistics alone.

Key Findings

  • LLM behavior follows a lawful power-law relationship with token frequency in training corpora

  • Frequency-dependent biases can be predicted from training data statistics without running the model

  • The law holds across multiple model families and scales, suggesting a universal phenomenon

scaling-laws · LLM-behavior · training-data · frequency-analysis · empirical-laws
48 upvotes
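
A power law of the form y = a * f^b becomes a straight line in log-log space, so its exponent can be estimated with ordinary least squares on the logarithms. The sketch below illustrates that idea only. The functional form, the helper name `fit_power_law`, and the synthetic data are assumptions for illustration, not the paper's method or results.

```python
import math

def fit_power_law(freqs, values):
    """Fit values ~ a * freqs**b by least squares on log-transformed data.

    Hypothetical helper for illustration; all inputs must be positive.
    """
    xs = [math.log(f) for f in freqs]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)  # back-transformed intercept is the prefactor
    return a, b

# Synthetic data generated from a known law (a=2.0, b=-0.5), not real measurements.
token_freqs = [10, 100, 1_000, 10_000, 100_000]
error_rates = [2.0 * f ** -0.5 for f in token_freqs]

a, b = fit_power_law(token_freqs, error_rates)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

On real measurements the fit would be noisy, and checking the residuals in log-log space is one simple way to judge whether a power law is a reasonable description at all.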

Learning to Retrieve from Agent Trajectories

High Relevance

Yuqi Zhou, et al. University of Michigan

Proposes a learned retrieval method for extracting useful information from agent execution histories, enabling more efficient reuse of past experience for decision-making in LLM agent systems.

Key Findings

  • Learned retrieval over agent trajectories significantly outperforms heuristic selection methods

  • Past execution traces contain reusable knowledge that transfers across similar tasks

  • The approach reduces redundant exploration and improves task completion rates in multi-step agent tasks

agent-trajectories · retrieval · LLM-agents · experience-reuse · efficiency
18 upvotes
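
To make the contrast with heuristic selection concrete, here is a minimal sketch of the kind of heuristic baseline a learned retriever would be compared against: ranking stored trajectories by word overlap with the new task. The memory layout, the function names, and the Jaccard heuristic are all assumptions for illustration. The paper's learned method is not described in enough detail here to reproduce.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve(task, memory, top_k=1):
    """Return the top_k stored trajectories whose task text best matches `task`."""
    ranked = sorted(memory, key=lambda item: jaccard(task, item["task"]), reverse=True)
    return ranked[:top_k]

# Made-up trajectory store; each entry pairs a task description with its steps.
memory = [
    {"task": "book a flight to Paris", "steps": ["open_site", "search_flights", "pay"]},
    {"task": "resize an image file", "steps": ["load_image", "resize", "save"]},
]

best = retrieve("book a flight to Tokyo", memory)[0]
print(best["steps"])
```

A learned retriever would replace `jaccard` with a scoring function trained on which past trajectories actually helped, which is where the reported gains over heuristics would come from.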

Beyond Accuracy: Inefficiency Patterns in Tool-Integrated Reasoning

High Relevance

Qisheng Su, et al. Tsinghua University

Systematically catalogs inefficiency patterns in how LLMs use external tools for reasoning, identifying wasteful tool calls, unproductive loops, and suboptimal tool selection sequences.

Key Findings

  • LLMs exhibit systematic inefficiency patterns including redundant tool calls and unproductive retry loops

  • Tool selection order significantly impacts reasoning efficiency even when final accuracy is similar

  • Efficiency metrics reveal capability gaps invisible to accuracy-only evaluation

tool-use · LLM-reasoning · efficiency · failure-analysis · agent-evaluation
18 upvotes
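
One of the patterns named above, redundant tool calls, is easy to quantify once a trajectory is logged. The sketch below counts repeated (tool, arguments) pairs in a single trace. The trace format and the `redundancy_rate` metric are hypothetical and are not taken from the paper.

```python
from collections import Counter

def redundancy_stats(tool_calls):
    """Count repeated (tool, arguments) pairs in one trajectory.

    `tool_calls` is a list of (tool_name, arguments) tuples; this trace
    format is a hypothetical stand-in, not the paper's data schema.
    """
    counts = Counter(tool_calls)
    total = len(tool_calls)
    redundant = sum(c - 1 for c in counts.values())  # every call beyond the first
    return {
        "total_calls": total,
        "redundant_calls": redundant,
        "redundancy_rate": redundant / total if total else 0.0,
    }

# Made-up trajectory in which the same search is issued three times.
trajectory = [
    ("search", "population of France"),
    ("search", "population of France"),
    ("calculator", "67.8 * 2"),
    ("search", "population of France"),
]

stats = redundancy_stats(trajectory)
print(stats)
```

Two runs can reach the same final answer with very different redundancy rates, which is the sense in which efficiency metrics expose gaps that accuracy alone hides.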

Vanast: Virtual Try-On with Human Image Animation

Hyunsoo Cha, et al. KAIST

Combines virtual try-on with human image animation to produce realistic clothing visualization on moving subjects, bridging the gap between static garment transfer and dynamic video generation.

Key Findings

  • Joint modeling of garment transfer and body animation produces more coherent try-on videos than sequential approaches

  • Temporal consistency in generated animations significantly improves perceived realism

  • The approach generalizes across diverse body types and clothing categories

virtual-try-on · video-generation · human-animation · generative-AI · e-commerce
10 upvotes

ONE-SHOT: Compositional Human-Environment Video Synthesis

Fengyuan Yang, et al. University of California, San Diego

Enables compositional video synthesis combining human subjects with diverse environments in a single-shot framework, addressing the challenge of generating coherent human-scene interactions.

Key Findings

  • Single-shot composition produces realistic human-environment interactions without multi-stage pipelines

  • Environment-aware human motion generation improves physical plausibility of synthesized videos

  • Approach handles diverse environments including indoor, outdoor, and complex scene layouts

video-synthesis · compositional-generation · human-scene-interaction · diffusion-models · generative-AI
8 upvotes

ThinkTwice: Jointly Optimizing LLMs for Reasoning and Self-Refinement

Difan Jiao, et al. University of Illinois Urbana-Champaign

Proposes joint optimization of reasoning and self-refinement capabilities in LLMs, showing that training both abilities together yields superior performance compared to sequential or separate training.

Key Findings

  • Joint optimization of reasoning and self-refinement outperforms training each capability independently

  • Self-refinement benefits from being co-trained with the reasoning objective rather than added post-hoc

  • The approach improves performance on both mathematical reasoning and code generation benchmarks

reasoning · self-refinement · LLM-training · joint-optimization · chain-of-thought
5 upvotes

Synthetic Sandbox for Training MLE Agents

Yuhang Zhou, et al. National University of Singapore

Constructs synthetic machine learning engineering environments for training and evaluating autonomous ML agents, providing controlled sandboxes that test end-to-end ML workflow capabilities.

Key Findings

  • Synthetic ML engineering tasks provide a controllable evaluation environment for MLE agents

  • Agent performance varies dramatically across ML workflow stages from data preprocessing to deployment

  • Sandbox environments enable safe iteration on agent capabilities without real infrastructure costs

MLE-agents · synthetic-environments · agent-training · machine-learning-engineering · sandbox
5 upvotes

Mimic Intent, Not Just Trajectories

Renming Huang, et al. Chinese University of Hong Kong

Argues that imitation learning for agents should focus on replicating the underlying intent behind demonstrations rather than surface-level trajectory matching, leading to more robust and generalizable policies.

Key Findings

  • Intent-level imitation produces policies that generalize better to novel situations than trajectory-level cloning

  • Disentangling intent from execution details reduces compounding errors in sequential decision-making

  • The approach is complementary to existing behavioral cloning methods and can be integrated as an auxiliary objective

imitation-learning · intent-modeling · LLM-agents · behavioral-cloning · generalization
5 upvotes

ACES: Leave-One-Out AUC Consistency for Code Generation

Hui Sun, et al. Microsoft Research

Introduces a novel code generation evaluation metric based on leave-one-out AUC consistency, providing a more robust signal for model selection than pass@k metrics alone.

Key Findings

  • Leave-one-out AUC consistency captures code generation reliability that pass@k misses

  • The metric is more stable across random seeds and problem subsets than existing evaluation approaches

  • ACES enables better model ranking decisions for deployment in code generation pipelines

code-generation · evaluation-metrics · AUC · reliability · model-selection
4 upvotes
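
For context on the baseline ACES is compared against, pass@k is usually computed with the widely used unbiased estimator: draw n samples per problem, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below implements only that baseline. The leave-one-out AUC consistency metric is not specified in this summary, so it is not implemented here.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 10 samples, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Because the estimate depends on which samples happen to pass, it can shift across seeds and problem subsets, which is the instability the paper's metric is said to address.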

The Geometric Alignment Tax

Prashant C. Raju, Independent Researcher

Formalizes the cost of aligning LLMs in geometric terms, showing that alignment procedures distort the model's representation geometry in ways that reduce downstream capabilities on non-aligned tasks.

Key Findings

  • Alignment procedures create measurable geometric distortions in model representation spaces

  • The distortion magnitude correlates with capability degradation on tasks outside the alignment distribution

  • The geometric framework provides a principled way to quantify the alignment tax across model families

alignment · representation-geometry · alignment-tax · LLM-capabilities · theoretical
3 upvotes

Paper Espresso: From Paper Overload to Research Insight

Mingzhe Du, et al. National University of Singapore

An automated research paper summarization and insight extraction tool addressing information overload in fast-moving AI/ML research, with multi-level summarization that preserves technical details.

Key Findings

  • Automated pipeline reduces time-to-insight for literature review by an order of magnitude

  • Multi-level summarization preserves key technical details that simple abstractive summaries lose

  • Open-source tool designed for integration into existing research workflows

research-tools · summarization · literature-review · productivity · open-source
2 upvotes

BidirLM: From Text to Omnimodal Bidirectional Encoders

Nicolas Boizard, et al. Meta AI

Extends bidirectional language modeling to omnimodal inputs, converting text-only bidirectional encoders into models that process text, images, and audio within a unified bidirectional framework.

Key Findings

  • Bidirectional encoding over multiple modalities improves cross-modal retrieval compared to causal architectures

  • Text-pretrained bidirectional encoders can be efficiently adapted to process visual and audio inputs

  • The approach maintains the embedding quality advantages of bidirectional models while adding multimodal capability

multimodal · bidirectional-encoding · omnimodal · embeddings · representation-learning
2 upvotes

Trending Models (11)

Gemma 4 31B-IT

Google · image-text-to-text · 31B

Google's flagship Gemma 4 instruction-tuned model at 31B parameters, supporting image-text-to-text tasks. The largest dense variant in the Gemma 4 family, trending as the community benchmarks it against GPT-4o and Claude.

multimodal · instruction-tuned · gemma-4
884.0K downloads · 1.4K likes

Community-created reasoning model distilling Claude Opus 4.6 reasoning capabilities into a Qwen3.5-27B base, achieving strong performance on reasoning benchmarks through knowledge distillation from a frontier model.

reasoning · distillation · community
552.0K downloads · 2.5K likes

Gemma-4-31B-JANG_4M-CRACK

DealignAI · text-generation · 31B

Abliterated (uncensored) variant of Gemma 4 31B with safety guardrails removed, trending rapidly as the community explores the model's full unfiltered capabilities.

abliterated · uncensored · gemma-4
29.0K downloads · 705 likes

Void Model

Netflix · video-inpainting · undisclosed

Netflix's video inpainting and object removal model, designed for seamless removal of unwanted objects from video footage. Notable as Netflix's first open model release.

video-inpainting · object-removal · production
0 downloads · 574 likes

Gemma 4 26B-A4B-IT

Google · text-generation · 26B (4B active)

Mixture-of-experts Gemma 4 variant with 26B total parameters but only 4B active per token, bringing MoE efficiency to consumer-accessible hardware.
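
The gap between total and active parameters comes from routing: each token is sent to only a few experts, so most expert weights sit idle for that token. The sketch below shows generic top-k routing with softmax-normalized weights. The expert count, the value of k, and the router scores are made up, and Gemma 4's actual routing scheme is not described in this listing.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    max_logit = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routes = top_k_route(logits, k=2)
print(routes)  # experts 1 and 4 are selected

# Active fraction implied by the listing: 4B active out of 26B total.
active_fraction = 4 / 26
print(f"{active_fraction:.1%} of parameters active per token")
```

Note that all 26B parameters still have to be held in memory. The saving is in compute per token, not in storage.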

MoE · efficient-inference · gemma-4
660.0K downloads · 508 likes

Gemma 4 E4B-IT

Google · any-to-any · 4B

Any-to-any modality Gemma 4 model at 4B parameters, capable of processing and generating across text, image, and audio modalities in a single unified architecture.

any-to-any · multimodal · gemma-4
474.0K downloads · 476 likes

Bonsai-8B-GGUF

Prism ML · text-generation · 8B (1-bit)

1-bit quantized 8B parameter language model pushing the limits of extreme quantization, demonstrating that binary weight models can achieve surprisingly coherent text generation.
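
As a rough picture of what binary weights mean, the sketch below applies a common textbook binarization: keep only the sign of each weight and a single shared scale equal to the mean absolute value. This is an assumed scheme for illustration. The quantization method actually used by Bonsai is not described in this listing.

```python
def binarize(weights):
    """Quantize a weight vector to signs plus one shared scale.

    Uses sign(w) with scale = mean(|w|); an assumed scheme for illustration.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale

def dequantize(signs, scale):
    """Reconstruct approximate weights from the stored signs and scale."""
    return [s * scale for s in signs]

# Toy weights, not taken from any real model.
weights = [0.5, -1.5, 1.0, -1.0]
signs, scale = binarize(weights)
print(signs, scale)
print(dequantize(signs, scale))
```

Each weight then costs one bit plus a share of the scale, at the price of losing all magnitude information within the group.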

1-bit · quantization · efficient-inference
53.0K downloads · 506 likes

GLM-5.1

Zhipu AI · text-generation · undisclosed MoE

Latest generation of the GLM series as a mixture-of-experts text generation model, representing Zhipu AI's continued push to compete with Western frontier labs on open-weight models.

MoE · text-generation · Chinese-AI
389 downloads · 450 likes

Qianfan-OCR

Baidu · image-text-to-text · undisclosed

Baidu's dedicated OCR and vision-language model optimized for document understanding and text extraction, achieving strong results on multilingual document benchmarks.

OCR · document-understanding · vision-language
40.0K downloads · 1.1K likes

OmniVoice

k2-fsa · text-to-speech · undisclosed

Voice cloning and text-to-speech model with high-fidelity voice replication, trending for its ability to clone voices from short audio samples.

TTS · voice-cloning · speech-synthesis
105.0K downloads · 360 likes

Holo3-35B-A3B

Hcompany · multimodal · 35B (3B active)

Multimodal mixture-of-experts model with 35B total parameters and 3B active, designed for efficient multimodal reasoning across text and vision tasks.

MoE · multimodal · efficient-inference
1.8K downloads · 246 likes

Trending GitHub Repos (11)

A modular, extensible agent framework that grows with the user, featuring plugin-based tool integration and persistent memory. Exploding in popularity with 3,009 stars in a single day.

LLM-agents · agent-framework · tool-integration
Python · 32.9K · +3.0K today · 4.2K

Client-side knowledge graph for codebases that enables semantic search and visualization of code relationships, running entirely in the browser.

knowledge-graph · code-search · developer-tools
TypeScript · 24.7K · +1.2K today · 2.8K

NVIDIA's framework for generating persona-diverse synthetic data, enabling creation of realistic and varied training datasets for LLM fine-tuning and evaluation.

synthetic-data · persona-generation · data-augmentation
Python · 8.0K · +662 today · 1.2K

Google's C++ runtime for efficient on-device LLM inference, optimized for mobile and embedded deployment with minimal memory footprint.

on-device-inference · LLM-runtime · edge-deployment
C++ · 2.6K · +528 today · 253

NVIDIA's tool for designing and generating high-quality synthetic training data pipelines, part of the NeMo ecosystem for large-scale model training.

synthetic-data · data-generation · NeMo
Python · 1.5K · +244 today · 132

SEO content generation tool powered by Claude, automating keyword research, content planning, and article generation for search engine optimization.

SEO · content-generation · Claude
Python · 4.0K · +215 today · 668

Agent-native learning assistant from HKU that uses LLM agents to provide personalized tutoring, adaptive explanations, and interactive problem-solving.

education · LLM-agents · tutoring
Python · 12.4K · +168 today · 1.7K

Agent memory system that learns from past interactions, enabling LLM agents to build persistent, improving knowledge bases from their execution histories.

agent-memory · learning · knowledge-base
Python · 7.8K · +160 today · 486

High Relevance

Zero-code LLM agent framework from HKU that enables building autonomous agents through natural language specifications without programming.

agent-framework · no-code · LLM-agents
Python · 9.0K · +76 today · 1.3K

Sources Checked

07:10 AM UTC
07:15 AM UTC
07:20 AM UTC