Friday, May 29, 2026

LaRA detects data contamination in RL post-training via layer-wise representation analysis; NAVA from Baidu achieves native audio-visual alignment for joint generation; agent skills ecosystem continues explosive growth with Understand-Anything gaining 3,776 stars/day

rl-training-integrity · native-multimodal-generation · agent-skills-ecosystem · cinematic-video-generation · compact-edge-models · ai-output-quality-alignment

Executive Summary

Thursday's HuggingFace Daily Papers features five submissions spanning RL training integrity, multimodal generation, cinematic video control, agent benchmarking, and physics-based animal simulation. The top paper is LaRA from Yonsei University (5 upvotes), which introduces a layer-wise representation analysis framework for detecting data contamination in RL post-trained LLMs, a critical problem as reinforcement learning reshapes model behavior through trajectory-level rewards rather than token likelihoods. NAVA from Baidu proposes a native audio-visual alignment framework for joint audio-video generation with a novel Align-then-Fuse MMDiT architecture. SmartDirector and MoZoo from Orange Team tackle keyframe-conditioned cinematic generation and physics-based animal simulation respectively. AsyncTool benchmarks the under-explored problem of asynchronous function calling in multi-task agent scenarios.

The model landscape sees DeepSeek-V4-Pro maintaining dominance with 5.3M downloads and 4,405 likes. HauhauCS's uncensored Qwen3.6-35B variant surpasses 1.95M downloads. Notable new entries include LiquidAI's LFM2.5-8B-A1B, a mixture-of-experts model with 8B total / 1B active parameters that launched yesterday and already has 116 likes. ByteDance Lance reaches 956 likes for multimodal any-to-any generation, while Qwen/Qwen3.6-27B leads official releases at 4.8M downloads and 1,510 likes. OpenBMB MiniCPM5-1B grows to 498 likes, consistent with sustained demand for compact edge models.

GitHub trending is dominated by the agent skills and tooling ecosystem, which has reached unprecedented scale. Understand-Anything, which converts code into interactive knowledge graphs, gains 3,776 stars/day (42.9K total). MoneyPrinterTurbo, for AI video generation, surges to 66.5K stars at 4,698 stars/day. NousResearch/hermes-agent (171.7K stars, 1,411/day) and ECC (197.4K stars, 1,385/day) represent the mature agent platform layer, while taste-skill (26.6K stars, 2,234/day) and stop-slop (6.5K stars, 761/day) signal a growing focus on AI output quality alignment.

Researcher Notes

LaRA's contribution is methodologically significant because it attacks a blind spot in the RL post-training evaluation pipeline. Existing contamination detection relies on output-level signals such as likelihood or entropy, but RL changes the relationship between these signals and model behavior: it shapes behavior through trajectory-level rewards, not token-level probabilities. LaRA instead examines perturbation sensitivity, directional collapse, and local representation rigidity across layers, and finds that contamination produces progressive geometric deviations. This kind of geometric analysis could plausibly generalize across RL algorithms and reward structures, provided the underlying mechanism (memorization of specific trajectories) leaves similar representational signatures regardless of how the reward is structured, though that generalization remains to be tested. The practical implication: as RL post-training becomes standard practice (RLHF, GRPO, etc.), contamination detection needs to work at the representation level, not only at the output level.

NAVA from Baidu represents a principled middle ground in the multimodal generation architecture debate. The current landscape has two extremes: dual-tower designs that generate audio and video separately and then align them post hoc, and fully unified tri-modal architectures that mix text, audio, and video in one shared space. The former loses fine-grained co-evolution; the latter couples semantic conditioning with low-level synchronization. NAVA's Align-then-Fuse MMDiT first establishes audio-video correspondence in a dedicated interaction space, then uses external context to condition joint denoising. Timbre-in-Context Conditioning is a particularly clever addition: it associates reference timbre cues with the corresponding speech spans to make speech timbre controllable. At 6.3B parameters, NAVA achieves competitive quality while remaining significantly smaller than many unified approaches.

AsyncTool highlights a critical gap in how we evaluate agent capabilities. Most tool-calling benchmarks assume synchronous execution — the agent calls a tool, gets a response, then decides what to do next. But real-world agent deployments involve concurrent tasks with different latencies, and the ability to productively use idle time while waiting for slow tool responses is essential for practical efficiency. The finding that delayed tool feedback causes clear performance degradation across current models is important but unsurprising — models are trained primarily on synchronous interaction patterns. The more interesting question is whether this capability can be improved through training or whether it requires architectural changes to how models handle temporal context.
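
The difference between the two interaction patterns can be made concrete with a small sketch. This is not AsyncTool's harness; `call_tool`, the tool names, and the delays are hypothetical stand-ins that use `asyncio.sleep` to simulate tool latency. It only illustrates what "using idle time" means: the asynchronous agent starts the slow call, finishes an independent task while waiting, and collects the slow result afterwards.

```python
import asyncio


async def call_tool(name: str, delay: float, log: list) -> str:
    """Hypothetical tool call: sleeps to simulate latency, then records completion."""
    await asyncio.sleep(delay)
    log.append(f"done:{name}")
    return f"{name}-result"


async def run_sync_agent() -> list:
    """Synchronous pattern: block on each tool call before starting the next."""
    log = []
    await call_tool("slow_search", 0.2, log)
    await call_tool("fast_lookup", 0.05, log)
    return log


async def run_async_agent() -> list:
    """Asynchronous pattern: start the slow call, work on something else, then collect."""
    log = []
    # Schedule the slow call without waiting for its result.
    slow = asyncio.create_task(call_tool("slow_search", 0.2, log))
    # Use the idle time for an independent, faster task.
    await call_tool("fast_lookup", 0.05, log)
    # Only now wait for the slow call to finish.
    await slow
    return log
```

Running `asyncio.run(run_async_agent())` completes the fast lookup before the slow search returns, and the whole run takes roughly as long as the slow call alone, whereas `run_sync_agent` pays for both delays back to back. A real agent additionally has to track which pending result belongs to which task, which is where the benchmark reports failures.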

The GitHub trending data reveals an inflection point in the AI agent ecosystem. The sheer scale is remarkable: obra/superpowers (211K stars), affaan-m/ECC (197K stars), NousResearch/hermes-agent (172K stars), DigitalPlatDev/FreeDomain (171K stars), and anthropics/skills (143K stars). But today's most notable signal is the emergence of AI output quality tools as a distinct category. taste-skill (2,234 stars/day) and stop-slop (761 stars/day) both focus on making AI-generated content less generic and more authentic. This is a demand-side signal that complements the supply-side scaling of agent platforms — users don't just want agents that work, they want agents that produce output indistinguishable from skilled human work. Microsoft's RAMPART (304 stars, new) adds safety testing for agentic AI applications, addressing the governance side of the same maturation trend.

The compact model trend continues its strong trajectory. MiniCPM5-1B (498 likes, growing), LiquidAI LFM2.5-8B-A1B (116 likes, just launched), and NemoStation Marlin-2B (430 likes) all demonstrate that the community is actively seeking models that balance capability with deployability. LiquidAI's MoE approach (8B total, 1B active) is particularly interesting as it applies the mixture-of-experts scaling strategy to the edge deployment use case — getting the representational capacity of a larger model while keeping inference cost at the 1B-parameter level. The continued success of unsloth's quantized variants (806K downloads for Qwen3.6-27B-MTP-GGUF) confirms that efficient local inference remains a primary community priority.
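
The "8B total, 1B active" arithmetic can be sketched as follows. The configuration below is hypothetical and chosen only so the totals land on 8B and 1B; it is not LiquidAI's published architecture, and it collapses per-layer experts into a single pool for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    shared_params: int      # attention, embeddings, etc., used for every token
    params_per_expert: int  # size of one expert feed-forward block
    num_experts: int        # experts stored in the model
    top_k: int              # experts the router selects per token

    def total_params(self) -> int:
        """Parameters that must be stored: shared weights plus every expert."""
        return self.shared_params + self.num_experts * self.params_per_expert

    def active_params(self) -> int:
        """Parameters used for one token: shared weights plus the routed experts."""
        return self.shared_params + self.top_k * self.params_per_expert


# Hypothetical numbers (assumption, not the real LFM2.5-8B-A1B layout).
toy = MoEConfig(
    shared_params=500_000_000,
    params_per_expert=250_000_000,
    num_experts=30,
    top_k=2,
)
active_fraction = toy.active_params() / toy.total_params()
```

With these toy numbers each token touches one eighth of the stored weights. Per-token compute scales with the active count, but memory still has to hold all experts, which matters when judging edge deployability.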

Themes & Trends

RL Training Integrity and Evaluation (rising)

LaRA addresses the critical blind spot of data contamination detection in RL post-training, while AsyncTool exposes temporal reasoning gaps in tool-calling agents, together highlighting fundamental evaluation challenges as AI systems become more complex.

Native Multimodal Generation Architectures (rising)

NAVA's Align-then-Fuse approach and SmartDirector's keyframe conditioning represent the ongoing shift from modular pipelines to natively integrated multimodal architectures for audio, video, and visual content generation.

Agent Skills and Quality Alignment Ecosystem (rising)

The massive growth of taste-skill, stop-slop, and cybersecurity skills alongside established platforms like Superpowers and ECC signals a maturation from 'can agents work' to 'how do agents produce quality, safe output.'

Compact and Edge-Deployable Models (stable)

MiniCPM5-1B, LFM2.5-8B-A1B, Marlin-2B, and HRM-Text-1B demonstrate sustained demand for models that balance capability with deployability, with MoE architectures extending this to larger representational capacity at small inference cost.

Domain-Specific AI Foundation Models (rising)

Kronos for financial markets, Anthropic's financial-services toolkit, and MetaTrader MCP server reflect growing investment in vertical AI applications beyond general-purpose language and vision models.

Trending Papers (5)

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

High Relevance

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim — Yonsei University, Georgia Institute of Technology

Introduces LaRA, a layer-wise representation analysis framework for detecting data contamination in RL post-trained LLMs. Uses three complementary metrics — perturbation sensitivity, directional collapse, and local representation rigidity — to identify progressive geometric deviations caused by contamination across model layers.

Key Findings

• Existing output-level contamination detection methods (likelihood, entropy) become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards
• Contamination produces progressive geometric deviations across layers including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity
• The aggregated representation-level detection protocol outperforms existing output-level baselines for contamination detection in RL-trained reasoning models
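
To give a feel for what a layer-wise geometric signal looks like, here is a minimal sketch on toy vectors. The formulas are stand-ins, not the paper's definitions: "directional collapse" is approximated as mean pairwise cosine similarity, "perturbation sensitivity" as the relative L2 shift of a representation, and `toy_layers` is made-up data. Local rigidity is not modeled at all.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def directional_collapse(vectors):
    """Stand-in metric: mean pairwise cosine similarity (near 1 = shared direction)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def perturbation_sensitivity(clean, perturbed):
    """Stand-in metric: relative L2 shift of a representation under input perturbation."""
    shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, perturbed)))
    return shift / math.sqrt(sum(a * a for a in clean))


def layerwise_profile(layers):
    """Compute the collapse score for each layer's set of hidden states."""
    return [directional_collapse(vectors) for vectors in layers]


# Made-up hidden states for three layers; deeper layers point in more similar directions.
toy_layers = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]],
]
profile = layerwise_profile(toy_layers)
```

In this toy data the score rises with depth, which is the shape of signal a layer-wise detector would aggregate. Whether real contaminated models behave this way under these particular formulas is not something this sketch shows.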

reinforcement-learning · data-contamination · representation-analysis · llm-evaluation · post-training

5 upvotes

Native Audio-Visual Alignment for Generation

High Relevance

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He — Baidu, ERNIE Research

Proposes NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation that first establishes audio-video correspondence in a dedicated interaction space, then conditions joint denoising with external context. Achieves superior video quality and precise audio-visual synchronization at 6.3B parameters.

Key Findings

• Align-then-Fuse MMDiT architecture transitions from modality-aware audio-video alignment to modality-shared joint denoising, avoiding the weaknesses of both dual-tower and unified tri-modal designs
• Timbre-in-Context Conditioning associates reference timbre cues with speech spans for controllable speech timbre generation
• Achieves superior video quality and precise audio-visual synchronization compared to existing methods at only 6.3B parameters

audio-visual-generation · multimodal · diffusion-transformer · speech-synthesis · video-generation

1 upvote

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li — Orange Team

Proposes SmartDirector, a framework that enhances narrative capacity of video generation through multiple keyframes. Operates in two stages: Director-Gen generates low-resolution video conditioned on keyframes, and Director-SR refines using high-resolution keyframes as semantic anchors.

Key Findings

• Two-stage architecture separates narrative structure generation from visual quality refinement using keyframes as conditioning signals
• Supports flexible generation scenarios including single-shot, multi-shot narrative synthesis, and video extension
• Data pipeline curates single-shot and multi-shot sequences from movies for robust multi-keyframe training

video-generation · keyframe-conditioning · narrative-control · cinematic · temporal-pacing

1 upvote

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

High Relevance

Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao — University of Science and Technology of China

Introduces AsyncTool, a benchmark for evaluating LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. Demonstrates that delayed tool responses pose substantial challenges to current agents, identifying key failure modes in task coordination and temporal reasoning.

Key Findings

• Delayed tool feedback causes clear performance degradation across current LLM-based agents, exposing weaknesses in temporal reasoning
• Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance
• Efficiency-oriented metrics at step, sub-task, and task levels provide more granular evaluation of agent coordination capabilities

agent-evaluation · tool-calling · benchmark · asynchronous · multi-task

1 upvote

MoZoo: Unleashing Video Diffusion Power in Animal Fur and Muscle Simulation

Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia, Zhehan Kan, Nisha Huang, Jun Liang, Wenming Yang, Jin Li — Orange Team, Tsinghua University

Presents MoZoo, a generative dynamics solver that synthesizes high-fidelity animal videos from coarse meshes under multimodal guidance. Introduces Role-Aware RoPE for motion alignment and Asymmetric Decoupled Attention for computational efficiency, along with MoZoo-Data synthetic-to-real pipeline and MoZooBench benchmark.

Key Findings

• Role-Aware RoPE employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets
• Asymmetric Decoupled Attention partitions latent sequences to enforce unidirectional information flow, preventing feature interference
• MoZoo-Data synthetic-to-real pipeline addresses training data scarcity with paired sequences from rendering engines and inverse mapping
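
The two attention-side ideas above can be illustrated with a small sketch. Everything here is an assumption made for illustration and not MoZoo's implementation: the token layout (reference tokens first, then target tokens), the direction of the allowed flow (targets may read references, references may not read targets), the role names, and the offset value of 1000.

```python
def build_asymmetric_mask(num_ref: int, num_target: int) -> list[list[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens 0..num_ref-1 are reference tokens; the rest are target tokens.
    """
    total = num_ref + num_target
    mask = []
    for q in range(total):
        q_is_ref = q < num_ref
        row = []
        for k in range(total):
            k_is_ref = k < num_ref
            # Reference queries never read target keys; every other pair is allowed.
            row.append(not (q_is_ref and not k_is_ref))
        mask.append(row)
    return mask


def remap_positions(roles: list[str], frames: list[int], ref_offset: int = 1000) -> list[int]:
    """Temporal index per token: motion-aligned roles keep their frame index,
    while reference tokens are shifted by a fixed offset so they never collide."""
    return [
        frame + ref_offset if role == "reference" else frame
        for role, frame in zip(roles, frames)
    ]


mask = build_asymmetric_mask(num_ref=2, num_target=3)
positions = remap_positions(["mesh", "video", "reference"], [3, 3, 0])
```

In this sketch the mesh and video tokens for frame 3 share a temporal index, so a rotary encoding would treat them as aligned in time, while the reference token sits at a distant index. The mask keeps the reference block unaffected by whatever the target tokens contain.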

video-diffusion · animal-simulation · physics-based · positional-encoding · benchmark

1 upvote

Trending Models (15)

DeepSeek-V4-Pro

DeepSeek AI · text-generation · unknown
