Friday, May 29, 2026
LaRA detects data contamination in RL post-training via layer-wise representation analysis; NAVA from Baidu achieves native audio-visual alignment for joint generation; agent skills ecosystem continues explosive growth with Understand-Anything gaining 3,776 stars/day
Executive Summary
Thursday's HuggingFace Daily Papers features five submissions spanning RL training integrity, multimodal generation, cinematic video control, agent benchmarking, and physics-based animal simulation. The top paper is LaRA from Yonsei University (5 upvotes), which introduces a layer-wise representation analysis framework for detecting data contamination in RL post-trained LLMs, a critical problem as reinforcement learning reshapes model behavior through trajectory-level rewards rather than token likelihoods. NAVA from Baidu proposes a native audio-visual alignment framework for joint audio-video generation with a novel Align-then-Fuse MMDiT architecture. SmartDirector and MoZoo from Orange Team tackle keyframe-conditioned cinematic generation and physics-based animal simulation respectively. AsyncTool benchmarks the under-explored problem of asynchronous function calling in multi-task agent scenarios.
The model landscape sees DeepSeek-V4-Pro maintaining dominance with 5.3M downloads and 4,405 likes. HauhauCS's uncensored Qwen3.6-35B variant surpasses 1.95M downloads. Notable new entries include LiquidAI's LFM2.5-8B-A1B, a mixture-of-experts model with 8B total and 1B active parameters that launched yesterday and already has 116 likes. ByteDance Lance reaches 956 likes for multimodal any-to-any generation, while Qwen/Qwen3.6-27B leads official releases at 4.8M downloads and 1,510 likes. OpenBMB MiniCPM5-1B grows to 498 likes, confirming sustained demand for compact edge models.
GitHub trending is dominated by the agent skills and tooling ecosystem reaching unprecedented scale. Understand-Anything leads AI repos with 3,776 stars/day (42.9K total), converting code into interactive knowledge graphs. MoneyPrinterTurbo surges to 66.5K stars with 4,698 stars/day for AI video generation. NousResearch/hermes-agent (171.7K stars, 1,411/day) and ECC (197.4K stars, 1,385/day) represent the mature agent platform layer, while taste-skill (26.6K stars, 2,234/day) and stop-slop (6.5K stars, 761/day) signal growing focus on AI output quality alignment.
Researcher Notes
LaRA's contribution is methodologically significant because it attacks a blind spot in the RL post-training evaluation pipeline. Existing contamination detection relies on output-level signals like likelihood or entropy, but RL fundamentally changes the relationship between these signals and model behavior — RL shapes behavior through trajectory-level rewards, not token-level probabilities. LaRA instead examines perturbation sensitivity, directional collapse, and local representation rigidity across layers, finding that contamination produces progressive geometric deviations. This is exactly the kind of geometric analysis that should generalize across different RL algorithms and reward structures, because the underlying mechanism (memorization of specific trajectories) produces consistent representational signatures regardless of how the reward was structured. The practical implication is clear: as RL post-training becomes standard practice (RLHF, GRPO, etc.), we need contamination detection methods that work at the representation level, not the output level.
NAVA from Baidu represents a principled middle ground in the multimodal generation architecture debate. The current landscape has two extremes: dual-tower designs that generate audio and video separately and then align them post hoc, and fully unified tri-modal architectures that mix text, audio, and video in one shared space. The former loses fine-grained co-evolution; the latter couples semantic conditioning with low-level synchronization. NAVA's Align-then-Fuse MMDiT first establishes audio-video correspondence in a dedicated interaction space, then uses external context to condition joint denoising. The Timbre-in-Context Conditioning is a particularly clever addition: it associates reference timbre cues with their corresponding speech spans to make speech timbre controllable. At 6.3B parameters, NAVA achieves competitive quality while remaining significantly smaller than many unified approaches.
AsyncTool highlights a critical gap in how we evaluate agent capabilities. Most tool-calling benchmarks assume synchronous execution — the agent calls a tool, gets a response, then decides what to do next. But real-world agent deployments involve concurrent tasks with different latencies, and the ability to productively use idle time while waiting for slow tool responses is essential for practical efficiency. The finding that delayed tool feedback causes clear performance degradation across current models is important but unsurprising — models are trained primarily on synchronous interaction patterns. The more interesting question is whether this capability can be improved through training or whether it requires architectural changes to how models handle temporal context.
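The difference between synchronous and asynchronous tool use can be made concrete with a minimal sketch. It is not taken from the AsyncTool benchmark: `slow_tool`, `run_agent`, the tool names, and the delays are all hypothetical, chosen only to show an agent loop that dispatches a slow call, uses the waiting time on an independent sub-task, and then returns to the delayed result.

```python
import asyncio


async def slow_tool(name: str, delay: float) -> str:
    """Hypothetical tool whose result arrives after `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def run_agent() -> list[str]:
    """Dispatch a slow call, do independent work meanwhile, then collect it."""
    log: list[str] = []

    # Start the slow tool without blocking on its result.
    pending = asyncio.create_task(slow_tool("report", 0.2))
    log.append("dispatched report")

    # Use the idle time on a faster, independent sub-task.
    log.append(await slow_tool("lookup", 0.05))

    # Come back to the delayed result once the other work is finished.
    log.append(await pending)
    return log


if __name__ == "__main__":
    print(asyncio.run(run_agent()))
```

A synchronous agent would await the slow call first and pay both delays in sequence. Here the two calls overlap, which is the coordination behavior the paragraph above describes as missing from most benchmarks.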
The GitHub trending data reveals an inflection point in the AI agent ecosystem. The sheer scale is remarkable: obra/superpowers (211K stars), affaan-m/ECC (197K stars), NousResearch/hermes-agent (172K stars), DigitalPlatDev/FreeDomain (171K stars), and anthropics/skills (143K stars). But today's most notable signal is the emergence of AI output quality tools as a distinct category. taste-skill (2,234 stars/day) and stop-slop (761 stars/day) both focus on making AI-generated content less generic and more authentic. This is a demand-side signal that complements the supply-side scaling of agent platforms — users don't just want agents that work, they want agents that produce output indistinguishable from skilled human work. Microsoft's RAMPART (304 stars, new) adds safety testing for agentic AI applications, addressing the governance side of the same maturation trend.
The compact model trend continues its strong trajectory. MiniCPM5-1B (498 likes, growing), LiquidAI LFM2.5-8B-A1B (116 likes, just launched), and NemoStation Marlin-2B (430 likes) all demonstrate that the community is actively seeking models that balance capability with deployability. LiquidAI's MoE approach (8B total, 1B active) is particularly interesting as it applies the mixture-of-experts scaling strategy to the edge deployment use case — getting the representational capacity of a larger model while keeping inference cost at the 1B-parameter level. The continued success of unsloth's quantized variants (806K downloads for Qwen3.6-27B-MTP-GGUF) confirms that efficient local inference remains a primary community priority.
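The "8B total, 1B active" idea can be illustrated with a small routing sketch. This is a generic top-k mixture-of-experts gate, not Liquid AI's implementation: the function names, the expert count, and the per-expert sizes are assumptions picked only so the arithmetic reproduces an 8-to-1 ratio.

```python
import math


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def active_fraction(num_experts: int, k: int, expert_params: float, shared_params: float) -> float:
    """Fraction of all parameters that a single token actually uses."""
    total = shared_params + num_experts * expert_params
    active = shared_params + k * expert_params
    return active / total


if __name__ == "__main__":
    # Hypothetical gate scores for four experts; only two are evaluated per token.
    print(top_k_route([0.1, 2.0, -1.0, 2.0], k=2))
    # Made-up sizes in billions: 0.5 shared + 15 experts of 0.5 each = 8.0 total,
    # with one expert active per token = 1.0 active.
    print(active_fraction(num_experts=15, k=1, expert_params=0.5, shared_params=0.5))
```

Only the selected experts run for a given token, so per-token compute tracks the active parameter count while the full parameter set still has to be stored.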
Themes & Trends
RL Training Integrity and Evaluation
rising · LaRA addresses the critical blind spot of data contamination detection in RL post-training, while AsyncTool exposes temporal reasoning gaps in tool-calling agents, together highlighting fundamental evaluation challenges as AI systems become more complex.
Native Multimodal Generation Architectures
rising · NAVA's Align-then-Fuse approach and SmartDirector's keyframe conditioning represent the ongoing shift from modular pipelines to natively integrated multimodal architectures for audio, video, and visual content generation.
Agent Skills and Quality Alignment Ecosystem
rising · The massive growth of taste-skill, stop-slop, and cybersecurity skills alongside established platforms like Superpowers and ECC signals a maturation from 'can agents work' to 'how do agents produce quality, safe output.'
Compact and Edge-Deployable Models
stable · MiniCPM5-1B, LFM2.5-8B-A1B, Marlin-2B, and HRM-Text-1B demonstrate sustained demand for models that balance capability with deployability, with MoE architectures extending this to larger representational capacity at small inference cost.
Domain-Specific AI Foundation Models
rising · Kronos for financial markets, Anthropic's financial-services toolkit, and MetaTrader MCP server reflect growing investment in vertical AI applications beyond general-purpose language and vision models.
Trending Papers (5)
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
High Relevance · Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim · Yonsei University, Georgia Institute of Technology
Introduces LaRA, a layer-wise representation analysis framework for detecting data contamination in RL post-trained LLMs. Uses three complementary metrics — perturbation sensitivity, directional collapse, and local representation rigidity — to identify progressive geometric deviations caused by contamination across model layers.
Key Findings
- Existing output-level contamination detection methods (likelihood, entropy) become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards
- Contamination produces progressive geometric deviations across layers including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity
- The aggregated representation-level detection protocol outperforms existing output-level baselines for contamination detection in RL-trained reasoning models
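To give a feel for what a layer-wise signal looks like, here is a toy sketch of one of the three quantities named above, perturbation sensitivity. It is not LaRA's method: this summary does not give the paper's metric definitions, so the relative-displacement measure, the random tanh network, and every name (`make_layers`, `forward`, `layerwise_sensitivity`) are assumptions made purely for illustration.

```python
import math
import random


def make_layers(depth: int, width: int, seed: int = 0) -> list[list[list[float]]]:
    """Build `depth` random square weight matrices standing in for model layers."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(width)
    return [[[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)]


def forward(layers: list[list[list[float]]], x: list[float]) -> list[list[float]]:
    """Return the hidden state after every layer (tanh of a linear map)."""
    states, h = [], x
    for w in layers:
        h = [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]
        states.append(h)
    return states


def layerwise_sensitivity(layers, x: list[float], eps: float = 1e-2, seed: int = 1) -> list[float]:
    """Relative displacement of each layer's state under a small input perturbation."""
    rng = random.Random(seed)
    noisy = [xi + eps * rng.gauss(0.0, 1.0) for xi in x]
    scores = []
    for clean, pert in zip(forward(layers, x), forward(layers, noisy)):
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, pert)))
        norm = math.sqrt(sum(a * a for a in clean)) or 1.0  # guard against a zero state
        scores.append(diff / norm)
    return scores


if __name__ == "__main__":
    layers = make_layers(depth=4, width=8)
    print(layerwise_sensitivity(layers, [0.5] * 8))
```

A representation-level detector in this spirit would compare such per-layer profiles between suspected and clean inputs, rather than looking only at output likelihood or entropy.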
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control
Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li — Orange Team
Proposes SmartDirector, a framework that enhances narrative capacity of video generation through multiple keyframes. Operates in two stages: Director-Gen generates low-resolution video conditioned on keyframes, and Director-SR refines using high-resolution keyframes as semantic anchors.
Key Findings
- Two-stage architecture separates narrative structure generation from visual quality refinement using keyframes as conditioning signals
- Supports flexible generation scenarios including single-shot, multi-shot narrative synthesis, and video extension
- Data pipeline curates single-shot and multi-shot sequences from movies for robust multi-keyframe training
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
High Relevance · Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao · University of Science and Technology of China
Introduces AsyncTool, a benchmark for evaluating LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. Demonstrates that delayed tool responses pose substantial challenges to current agents, identifying key failure modes in task coordination and temporal reasoning.
Key Findings
- Delayed tool feedback causes clear performance degradation across current LLM-based agents, exposing weaknesses in temporal reasoning
- Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance
- Efficiency-oriented metrics at step, sub-task, and task levels provide more granular evaluation of agent coordination capabilities
MoZoo: Unleashing Video Diffusion Power in Animal Fur and Muscle Simulation
Dongxia Liu, Jie Ma, Xiaochen Yang, Jiancheng Zhang, Bin Xia, Zhehan Kan, Nisha Huang, Jun Liang, Wenming Yang, Jin Li — Orange Team, Tsinghua University
Presents MoZoo, a generative dynamics solver that synthesizes high-fidelity animal videos from coarse meshes under multimodal guidance. Introduces Role-Aware RoPE for motion alignment and Asymmetric Decoupled Attention for computational efficiency, along with MoZoo-Data synthetic-to-real pipeline and MoZooBench benchmark.
Key Findings
- Role-Aware RoPE employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets
- Asymmetric Decoupled Attention partitions latent sequences to enforce unidirectional information flow, preventing feature interference
- MoZoo-Data synthetic-to-real pipeline addresses training data scarcity with paired sequences from rendering engines and inverse mapping
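The phrase "unidirectional information flow" between partitions of a sequence can be pictured as a block attention mask. The sketch below is a generic one-way mask, not MoZoo's Asymmetric Decoupled Attention: the split into "reference" and "target" tokens, the direction of flow, and the function name `one_way_mask` are all assumptions for illustration.

```python
def one_way_mask(n_ref: int, n_tgt: int) -> list[list[bool]]:
    """Build mask[q][k], True where query position q may attend to key position k.

    Positions 0..n_ref-1 are treated as reference tokens, the rest as target tokens.
    Reference queries see only reference keys; target queries see every key.
    """
    n = n_ref + n_tgt
    mask = []
    for q in range(n):
        q_is_ref = q < n_ref
        row = []
        for k in range(n):
            k_is_ref = k < n_ref
            # Blocked case: a reference query looking at a target key.
            row.append(k_is_ref or not q_is_ref)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in one_way_mask(n_ref=2, n_tgt=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because the reference rows never read from target columns, the reference features stay unaffected by the target tokens, while the target tokens can still condition on the reference.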
Trending Models (15)
DeepSeek AI · text-generation · unknown
The dominant open-weight large language model, maintaining its position with over 5.3 million downloads and 4,405 likes, the most-liked model on HuggingFace trending.
Qwen (Alibaba) · image-text-to-text · 27B
Qwen's flagship 27B-parameter multimodal model leading official releases with 4.8M downloads and 1,510 likes, supporting image-text-to-text tasks with conversation capabilities.
SulphurAI · text-to-video · unknown
A leading open text-to-video generation model with 1.47M downloads and 1,421 likes, available in both diffusers and GGUF formats for video generation workloads.
Tencent · translation · 1.8B
A compact 1.8B-parameter translation model from Tencent's Hunyuan family supporting 35+ languages, with 1,079 likes reflecting strong demand for dedicated multilingual translation models.
HauhauCS · image-text-to-text · 35B-A3B (MoE)
A community-produced uncensored variant of Qwen3.6-35B using mixture-of-experts architecture (3B active parameters), distributed in GGUF format with vision capabilities. Leads community models in downloads at 1.95M.
ByteDance Research · any-to-any · unknown
ByteDance's multimodal any-to-any generation model supporting image generation, video generation, and image editing, with rising engagement at 956 likes.
Supertone · text-to-speech · unknown
A multilingual text-to-speech model supporting 30+ languages using ONNX format, with 727 likes and 52K downloads reflecting strong demand for high-quality open TTS solutions.
Unsloth · image-text-to-text · 27B
Unsloth's GGUF quantization of Qwen3.6-27B with Multi-Token Prediction support, built for efficient local inference. Its 807K downloads confirm that local inference remains a primary community priority.
OpenBMB · text-generation · 1B
A compact 1B-parameter model in the MiniCPM series designed for on-device and edge AI deployment, with 498 likes and growing, featuring tool-calling and long-context support.
NemoStation · video-text-to-text · 2B
A 2B-parameter video-text-to-text model for video captioning and temporal grounding, based on Qwen3.5-2B with 430 likes, targeting efficient video understanding.
Sapient Inc · text-generation · 1B
A 1B-parameter text generation model with hierarchical reasoning and prefix-LM architecture, maintaining high download volume of 122K for lightweight text generation.
Meituan · audio-text-to-video · unknown
An audio-driven video avatar generation model supporting audio-text-to-video and audio-image-text-to-video pipelines, with 368 likes and growing interest in avatar generation.
NVIDIA · image-text-to-text · 3B
NVIDIA's 3B-parameter grounding and object detection model based on the Eagle architecture, supporting visual grounding tasks with 199 likes.
NuMind · image-to-text · 4B
A vision-language model for structured document extraction, OCR, and document-to-markdown conversion based on Qwen3.5-4B, with 186 likes and 45K downloads.
Liquid AI · text-generation · 8B-A1B (MoE)
A newly launched mixture-of-experts model with 8B total parameters and 1B active, designed for edge deployment. Represents Liquid AI's latest entry with 116 likes and zero downloads (just launched).
Trending GitHub Repos (15)
AI-powered one-click short video generation using LLMs, surging with 4,698 stars/day to 66.5K total. The highest daily velocity among Python repos.
Turns code into interactive knowledge graphs for exploration and Q&A. Leading AI-related repos in star velocity at 3,776 stars/day, up to 42.9K total stars.
A skill for giving AI agents better taste and preventing generic output, leading the AI quality alignment movement at 2,234 stars/day to 26.6K total.
An agentic skills framework and software development methodology, now the largest agent framework on GitHub at 211K stars with sustained momentum of 1,730 stars/day.
NousResearch's self-growing agent framework at 171.7K stars, gaining 1,411 stars/day as a leading open-source agentic platform.
Microsoft's Python tool for converting files and office documents to Markdown, maintaining massive scale at 128K stars with 1,410 stars/day.
The agent harness performance optimization system with skills, instincts, memory, and security, continuing massive growth to 197K stars with 1,385 stars/day.
A skill file for removing AI tells from prose, gaining 761 stars/day to 6.5K total. Part of the emerging AI output quality alignment trend.
754 structured cybersecurity skills for AI agents mapped to MITRE ATT&CK, NIST CSF 2.0, and other frameworks, gaining 737 stars/day to 11.6K total.
Anthropic's official agent skills repository for Claude with 143K stars and 718 stars/day, the canonical skills ecosystem for the Claude platform.
Anthropic's financial services toolkit at 28.5K stars gaining 385 stars/day, providing AI integration patterns for the financial sector.
A foundation model for the language of financial markets, gaining 357 stars/day to 27K total, representing growing interest in domain-specific AI for finance.
Anthropic's agentic coding tool for the terminal, with 127K stars and 319 stars/day, serving as the primary interface for Claude-based development workflows.
Open-source LLM-friendly web crawler and scraper at 67K stars, a foundational tool in the AI data pipeline ecosystem.
A pytest-native safety and security testing framework for agentic AI applications from Microsoft, newly trending at 304 stars.