Saturday, May 23, 2026
RLVR token-credit assignment (DelTA) advances fine-grained LLM training signals; full-attention sparsification shows LLMs are intrinsically sparse; agent governance and tooling ecosystems explode on GitHub
Executive Summary
Several of today's most-engaged papers center on reinforcement learning improvements for LLMs. DelTA (124 upvotes) introduces a discriminator view of RLVR updates, showing that policy-gradient steps implicitly act as linear discriminators over token-gradient vectors. The finding could reshape how the community thinks about credit assignment in post-training pipelines. Meanwhile, Full Attention Strikes Back demonstrates that standard full-attention LLMs are already intrinsically sparse and can be converted to sparse attention in under 100 training steps, which has immediate implications for inference efficiency without sacrificing expressivity.
On the evaluation and dataset front, TransitLM (162 upvotes) releases the largest publicly known transit-planning corpus (13M+ records, 120K stations), raising the scale of benchmarks for geographic reasoning without map APIs. Perception or Prejudice formalizes Grounded Personality Reasoning for MLLMs and surfaces a key reliability gap: current multimodal models often pattern-match rather than genuinely perceive behavioral signals. π-Bench and Spreadsheet-RL extend the agentic evaluation frontier into long-horizon proactive workflows and real-world spreadsheet automation, respectively.
GitHub trends signal a rapidly maturing agent-infrastructure layer: Microsoft's agent-governance-toolkit (covering OWASP Agentic Top 10), plastic-labs/honcho (stateful agent memory), and Anthropic's claude-plugins-official directory all gained significant traction. The proliferation of Claude Code plugin and skill registries, combined with pre-indexed code knowledge graphs (codegraph, Understand-Anything), suggests the developer tooling stack around AI coding agents is consolidating fast.
Researcher Notes
The DelTA-sparsification connection is non-obvious but important. DelTA shows that RLVR updates implicitly discriminate over token-gradient vectors, while Full Attention Strikes Back shows that trained LLMs are already sparse in attention patterns. Together, these suggest a future where sparse-attention models trained with token-discriminative RL rewards might be both cheaper to run and more precisely shaped by fine-grained feedback — a compounding efficiency gain worth tracking.
Unsupervised PRMs are a sleeper hit. With only 17 upvotes, the unsupervised process reward model paper may be underappreciated relative to its potential impact. If PRMs can be trained without expert step-level annotations, the main scaling bottleneck for verifier-guided search shrinks substantially. This pairs well with DelTA's credit-assignment framing: both papers pursue finer-grained, process-level training signals without per-step human labels, from different angles.
The KV-cache stack is fragmenting into specialized solutions. WorldKV (video diffusion), KVServe (disaggregated LLM serving), and the attention-sparsification work (Full Attention Strikes Back) all touch KV-cache efficiency, but from entirely different angles: video generation consistency, network bandwidth under SLO constraints, and attention sparsity, respectively. The absence of a unified framework is notable, and the first system to integrate these perspectives might define the next generation of inference engines.
Agent governance is crossing the chasm. Microsoft's agent-governance-toolkit explicitly maps to OWASP Agentic Top 10, AWS's aidlc-workflows provides adaptive steering rules, and Tracer-Cloud's opensre targets AI SRE use cases. The simultaneous emergence of governance tooling from a hyperscaler (Microsoft), cloud provider (AWS), and startup (Tracer-Cloud) within the same trending window suggests enterprise AI agent deployment is moving from pilot to production at scale, creating urgent demand for policy enforcement and zero-trust identity primitives.
π-Bench and Spreadsheet-RL mark a maturation point in agentic benchmarks. Early agent benchmarks (WebArena, SWE-bench) tested reactive task completion. The new generation (π-Bench with hidden-intent proactive workflows, Spreadsheet-RL with multi-step real-world spreadsheet operations) tests whether agents can sustain intentional, long-horizon behavior. This shift mirrors what happened in NLP evaluation when GLUE gave way to BIG-Bench, and suggests the next 12-18 months will see capability thresholds redefined by proactive and sustained-action metrics.
Themes & Trends
RLVR Credit Assignment and Process Supervision
Rising: Multiple papers tackle the granularity and efficiency of reward signals in LLM training, from token-level discriminative assignment (DelTA) to eliminating annotation bottlenecks in process reward models (Unsupervised PRMs). Together they signal a push toward more principled and scalable RL feedback.
Attention Sparsification and Inference Efficiency
Rising: Full Attention Strikes Back demonstrates that LLMs are intrinsically sparse, while Gated DeltaNet-2 improves linear attention's memory operations. Both reduce inference cost without sacrificing model quality, pointing toward a convergence on efficient attention mechanisms.
Agentic Evaluation and Long-Horizon Benchmarks
Rising: A new generation of benchmarks (π-Bench for proactive workflows, Spreadsheet-RL for realistic office tasks, and CUSP for scientific forecasting) moves beyond reactive task completion to evaluate sustained, intentional, and proactive agent behavior.
Agent Infrastructure and Governance
Rising: GitHub trending data shows agent infrastructure maturing rapidly. Governance toolkits, stateful memory libraries, and MCP-based browser tooling all trended simultaneously, suggesting that enterprise agent deployment is moving toward production.
Multimodal Robustness and Grounded Reasoning
Stable: Perception or Prejudice, LatentOmni, and SpaceDG all probe whether multimodal models genuinely reason or merely pattern-match. The theme spans personality grounding, audio-visual temporal reasoning, and spatial understanding under degraded inputs.
KV-Cache Innovation Across Domains
Stable: KV-cache optimization is fragmenting into domain-specific solutions: WorldKV for video diffusion consistency, KVServe for network-efficient disaggregated LLM serving, and attention sparsification for standard LLM inference. Unification remains an open challenge.
Trending Papers (14)
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
High Relevance · Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu — Peking University, ByteDance
Releases a corpus of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines. The dataset is designed as a continual pre-training corpus and benchmark for LLM-based transit planning that does not rely on map infrastructure, enabling geographic reasoning in resource-constrained settings.
Key Findings
- First large-scale open dataset for map-free transit route planning with 13M+ records
- Covers 4 Chinese cities, 120,845 stations, and 13,666 lines at unprecedented scale
- Enables continual pre-training of LLMs for transit reasoning without external map APIs
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
High Relevance · Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang — Tsinghua University, Renmin University of China
Formalizes the task of Grounded Personality Reasoning (GPR) to evaluate whether multimodal LLMs genuinely perceive personality through behavioral understanding or merely prejudge via superficial pattern matching. The work reveals a systematic reliability gap in current MLLMs.
Key Findings
- Current MLLMs predominantly pattern-match superficial cues rather than reasoning from behavior
- Introduces GPR as a formal evaluation framework distinguishing genuine perception from prejudice
- Identifies a category of failure modes specific to first-impression bias in multimodal models
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
High Relevance · Kaiyi Zhang, Wei Wu, Yankai Lin — Renmin University of China, Tsinghua University
Introduces a discriminator view of RLVR updates, demonstrating that policy-gradient steps implicitly act as linear discriminators over token-gradient vectors, determining which token probabilities increase or decrease. This theoretical reframing enables more principled credit assignment at the token level.
Key Findings
- Policy-gradient RLVR updates are equivalent to linear discriminators over token-gradient vectors
- Provides theoretical grounding for fine-grained token-level credit assignment in RL fine-tuning
- Discriminative framing opens new design space for reward shaping and token selection
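The discriminator view can be made concrete with a toy softmax policy. The sketch below is not DelTA's algorithm. It only illustrates the first-order fact behind the summary: a REINFORCE-style update direction w (the advantage-weighted sum of token gradients) changes each token's log-probability by roughly lr * (g_t · w), so the sign of that dot product decides whether the token's probability rises or falls. The context-free logit table, the sampled tokens, the advantages, and all function names are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_grad(logits, token):
    """Gradient of log pi(token) w.r.t. the logits: onehot(token) - softmax."""
    probs = softmax(logits)
    return [(1.0 if i == token else 0.0) - p for i, p in enumerate(probs)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def update_direction(logits, sampled_tokens, advantages):
    """REINFORCE-style direction w = sum_i A_i * grad log pi(token_i)."""
    w = [0.0] * len(logits)
    for tok, adv in zip(sampled_tokens, advantages):
        g = token_grad(logits, tok)
        w = [wi + adv * gi for wi, gi in zip(w, g)]
    return w

# Toy setup (assumed values): 4-token vocabulary, three sampled tokens.
logits = [0.2, -0.1, 0.5, 0.0]
sampled = [2, 0, 1]
advs = [1.0, -0.5, -0.5]
lr = 1e-3

w = update_direction(logits, sampled, advs)
new_logits = [x + lr * wi for x, wi in zip(logits, w)]

# Linear "discriminator" score for every token: g_t . w
scores = [dot(token_grad(logits, t), w) for t in range(len(logits))]

# Actual change in log-probability after the gradient step.
old_probs, new_probs = softmax(logits), softmax(new_logits)
actual = [math.log(n) - math.log(o) for n, o in zip(new_probs, old_probs)]

for t, (s, a) in enumerate(zip(scores, actual)):
    print(f"token {t}: score={s:+.4f}  predicted={lr * s:+.2e}  actual={a:+.2e}")
```

In this toy case the token with positive advantage gets a positive score and the others get negative scores, and the predicted and actual log-probability changes agree to first order in the learning rate.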
π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
High Relevance · Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang — Peking University, Alibaba Group
Introduces a benchmark evaluating whether agents can identify and act on hidden user intents before they are explicitly stated, across sustained long-horizon workflows. Addresses the proactive assistance challenge that reactive benchmarks cannot capture.
Key Findings
- First benchmark specifically targeting proactive intent identification in long-horizon workflows
- Reveals that current agents fail to act on implicit user goals without explicit instruction
- Long-horizon workflow structure exposes compounding failures invisible in short-task evaluations
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
High Relevance · Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu — Peking University, Microsoft Research
Demonstrates that full-attention LLMs are already intrinsically sparse in their attention patterns, and can be converted into highly sparse models with minimal adaptation — under 100 training steps. Identifies three key observations about inherent sparsity patterns.
Key Findings
- Full-attention LLMs exhibit intrinsic sparsity that can be exploited without retraining from scratch
- Sparse conversion requires fewer than 100 training steps, making it practical for deployed models
- Three distinct sparsity pattern types identified across model families
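One simple way to quantify "intrinsic sparsity" is to count how many keys a query needs to cover a fixed share of its attention mass. The sketch below is an illustrative measurement, not the paper's procedure. The 0.9 mass threshold, the function names, and the two synthetic attention rows are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def keys_for_mass(attn_row, mass=0.9):
    """Smallest number of keys whose attention weights sum to at least `mass`."""
    total = 0.0
    for count, weight in enumerate(sorted(attn_row, reverse=True), start=1):
        total += weight
        if total >= mass:
            return count
    return len(attn_row)

def sparsity(attn_row, mass=0.9):
    """Fraction of keys that could be dropped while keeping `mass` of the attention."""
    return 1.0 - keys_for_mass(attn_row, mass) / len(attn_row)

# Two synthetic rows over 64 keys: one sharply peaked, one uniform.
peaked = softmax([8.0, 7.5] + [0.0] * 62)
flat = softmax([0.0] * 64)

print(keys_for_mass(peaked), sparsity(peaked))  # 2 keys cover 90% of the mass
print(keys_for_mass(flat), sparsity(flat))      # 58 keys are needed
```

A model whose rows mostly resemble `peaked` can skip the bulk of its key-value pairs at little cost, which is the intuition behind converting full attention to sparse attention with only light adaptation.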
ACC: Compiling Agent Trajectories for Long-Context Training
Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao — Shanghai AI Laboratory, Fudan University
Proposes using agent trajectories — which naturally contain evidence scattered across many turns — as long-context training data for LLMs. Addresses the challenge that agentic problem-solving requires integrating distant context, making trajectories ideal for training long-context integration.
Key Findings
- Agent trajectories naturally encode long-context dependencies, making them ideal training data
- Compilation approach aggregates multi-turn agentic outputs into coherent long-context examples
- Training on compiled trajectories improves long-context reasoning on downstream benchmarks
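As a rough illustration of what "compiling" a trajectory could look like, the sketch below flattens a multi-turn trajectory into one long context plus a target. This digest does not describe ACC's actual pipeline, so the `Turn` type, the `[role] content` layout, and the choice of the final turn as the target are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # e.g. "user", "assistant", "tool" (assumed role names)
    content: str

def compile_trajectory(turns):
    """Flatten all turns but the last into one context; the last turn is the target."""
    if len(turns) < 2:
        raise ValueError("need at least one context turn and one target turn")
    context = "\n".join(f"[{t.role}] {t.content}" for t in turns[:-1])
    return {"context": context, "target": turns[-1].content}

# Hypothetical trajectory: the evidence sits in an early tool turn.
turns = [
    Turn("user", "Find the bug"),
    Turn("tool", "stack trace: line 42"),
    Turn("assistant", "The bug is on line 42"),
]
example = compile_trajectory(turns)
print(example["context"])
print(example["target"])
```

The point of the layout is that answering the target requires reading evidence placed many turns earlier, which is the long-range dependency the paper summary highlights.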
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong — Nanyang Technological University, NVIDIA
Presents a unified framework for generating simulation-ready 3D assets across rigid, deformable, and articulated object types through a novel geometry pipeline. Addresses the fragmentation of prior work that handled each asset class separately.
Key Findings
- Unified pipeline handles rigid, deformable, and articulated objects in a single framework
- Generated assets are immediately simulation-ready without manual post-processing
- Novel geometry pipeline enables physically accurate asset generation at scale
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu — Zhejiang University, Alibaba Group
Proposes a unified latent space for audio-visual reasoning instead of text-based chain-of-thought, addressing the core problem that explicit CoT compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and cross-modal alignment.
Key Findings
- Text-based CoT for audio-visual tasks loses temporal grounding by discretizing continuous signals
- Unified latent reasoning space preserves audio-visual continuity across modalities
- Outperforms text-CoT baselines on temporal grounding and cross-modal reasoning tasks
Spreadsheet-RL: Advancing LLM Agents on Realistic Spreadsheet Tasks via RL
Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang — National University of Singapore, Microsoft
Applies reinforcement learning to train LLM agents on realistic spreadsheet automation tasks, addressing the limitations of specialized prompting on complex multi-step operations. Extends agentic RL into the practical enterprise office automation domain.
Key Findings
- Specialized prompting fails on complex multi-step spreadsheet operations
- RL training significantly improves agent performance on realistic spreadsheet benchmarks
- Office automation tasks require sustained multi-step planning that RL is well-suited for
Unsupervised Process Reward Models
High Relevance · Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic — EPFL, ETH Zurich
Introduces a method for training process reward models without human supervision, eliminating the need for step-by-step annotations or ground-truth verification. Directly attacks the expert annotation bottleneck that has limited PRM scaling.
Key Findings
- PRMs can be trained without any step-level human annotations
- Unsupervised approach matches or approaches supervised PRMs on reasoning benchmarks
- Removes the primary scaling bottleneck for verifier-guided search in LLM reasoning
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Ali Hatamizadeh, Yejin Choi, Jan Kautz — NVIDIA
Decouples the erase and write operations in linear attention's compressed memory representation, arguing that a single scalar gate causes interference between the two operations. Separate gating mechanisms enable independent control over memory retention and updates.
Key Findings
- Single scalar gate in linear attention causes interference between erase and write operations
- Separate gating for erase and write improves memory control in linear attention models
- Gated DeltaNet-2 achieves better language modeling perplexity and downstream task performance
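The erase/write distinction can be sketched with a delta-rule style matrix memory. The code below is a minimal illustration, assuming a state matrix S that is read as S·k and updated as S ← S − erase·(S·k)kᵀ + write·v·kᵀ. Setting erase = write = β recovers the classic single-gate delta rule. The exact parameterization in Gated DeltaNet-2 is not given in this digest, so the function names and scalar gates here are assumptions.

```python
def read(S, k):
    """Retrieve the value stored under key k: S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def step(S, k, v, erase, write):
    """One memory update with separate erase and write strengths.

    erase removes a fraction of what is currently stored under k;
    write adds a fraction of the new value v under k.
    """
    old = read(S, k)
    return [
        [s_ij - erase * old_i * k_j + write * v_i * k_j for s_ij, k_j in zip(row, k)]
        for row, old_i, v_i in zip(S, old, v)
    ]

S = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]  # unit-norm key

# Store v = [2, -1] under k (single-gate behaviour: erase == write).
S = step(S, k, v=[2.0, -1.0], erase=1.0, write=1.0)

S_kept = step(S, k, v=[0.0, 0.0], erase=0.0, write=0.0)     # both gates closed
S_cleared = step(S, k, v=[5.0, 5.0], erase=1.0, write=0.0)  # erase only
S_added = step(S, k, v=[1.0, 1.0], erase=0.0, write=1.0)    # write without erasing

print(read(S, k), read(S_cleared, k), read(S_added, k))
```

With a single shared gate, the "erase only" and "write without erasing" updates cannot be expressed, which is the interference the paper summary attributes to a single scalar gate.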
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu — Tsinghua University, Peking University
Applies RL to orchestrate multiple LLMs and modular skills, exploiting complementary strengths across domains instead of relying on a single monolithic model. Hierarchical ensemble enables dynamic routing based on task requirements.
Key Findings
- RL-based orchestration outperforms static routing and monolithic models across diverse domains
- Complementary specialization across models can be exploited by a learned orchestrator
- Hierarchical skill ensembles reduce inference cost while improving accuracy on mixed workloads
Forecasting Scientific Progress with AI
Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada — Allen Institute for AI, UCLA
Introduces CUSP, a benchmark for evaluating AI systems on scientific forecasting under controlled knowledge constraints, enabling multi-disciplinary event-level evaluation of how well models can predict future scientific developments.
Key Findings
- CUSP provides controlled knowledge cutoffs enabling fair comparison of forecasting capabilities
- Multi-disciplinary coverage reveals domain-specific forecasting strengths and weaknesses
- Current frontier models show significant gaps in scientific event prediction accuracy
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng — Waymo, Google
Converts in-the-wild dashcam video to match proprietary AV sensor configurations, enabling the use of massive diverse dashcam datasets for training AV perception systems that require structured fleet sensor data.
Key Findings
- Dashcam-to-AV sensor conversion bridges the data gap between consumer and fleet sensor configurations
- Cross-embodiment approach enables leveraging billions of dashcam frames for AV training
- Sensor conversion quality is sufficient to improve downstream AV perception task performance
Trending Models (11)
DeepSeek AI · text-generation · Unknown (MoE)
Latest flagship text generation model from DeepSeek with 4.2M+ downloads and 4,152 likes, representing the leading open-weight frontier model. Continues the V4 series with architectural improvements.
Qwen / Alibaba · image-text-to-text · 27B
Qwen3.6 27B dense multimodal model with 4M downloads and 1,390 likes, supporting image-text-to-text tasks. Part of the Qwen3.6 series advancing open multimodal frontier models.
Circlestone Labs · image-generation · Unknown
Highly-liked ComfyUI diffusion model with 1,499 likes and 602K downloads, designed for high-quality image/video generation workflows in the ComfyUI ecosystem.
SulphurAI · text-to-video · Unknown
Text-to-video diffusion model available in GGUF and diffusers formats with 1.25M downloads, indicating strong community adoption for local video generation workflows.
OpenBMB / Tsinghua · image-text-to-text · ~4.6B
Efficient multimodal model with 221K downloads and 904 likes, part of the MiniCPM-V series known for strong performance relative to its compact size in image-text-to-text tasks.
ByteDance Research · image-generation · Unknown
ByteDance Research multimodal model supporting both image and video generation with 649 likes, representing ByteDance's entry into unified visual generation.
Supertone · text-to-speech · Unknown
Advanced text-to-speech and speech synthesis model in ONNX format with 37K downloads and 582 likes, offering high-quality voice synthesis capabilities.
Unsloth · text-generation · 27B
Unsloth-optimized GGUF quantization of Qwen3.6 27B MTP variant with 532K downloads and 413 likes, enabling efficient local deployment of the Qwen3.6 series.
Tencent · translation · 1.8B
Tencent's compact 1.8B translation-capable text generation model based on HunyuanV1 dense architecture, designed for efficient multilingual translation tasks.
ResembleAI · text-to-speech · Unknown
High-quality TTS and voice cloning model with 1,354 downloads and 230 likes, specialized for dramatic and expressive speech synthesis with voice cloning capabilities.
TencentARC · image-to-3d · Unknown
Image-to-3D model from TencentARC with 192 likes enabling single-image 3D reconstruction, contributing to the growing ecosystem of 3D generation tools.
Trending GitHub Repos (12)
Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode that reduces token usage and tool calls while running 100% locally. It gained the most stars today in the trending list (3,684), indicating strong resonance with AI coding workflows.
Official Anthropic-managed directory of high-quality Claude Code plugins, serving as the authoritative registry for the growing Claude Code plugin ecosystem. Explosive growth with 2,549 stars today signals rapid ecosystem adoption.
General-purpose agent framework from NousResearch with 163K stars and 1,743 new stars today, positioned as a composable and growing agent platform built on the Hermes model family.
Converts code repositories into interactive knowledge graphs compatible with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. Gained 1,393 stars today, reflecting demand for code-comprehension tooling across AI coding agent stacks.
Comprehensive AI engineering curriculum covering building and shipping AI applications from scratch, with 988 stars today showing strong community interest in practical AI engineering education.
Converts WiFi signals into real-time spatial intelligence, vital sign monitoring, and presence detection without requiring video cameras. Gained 978 stars today, representing a novel privacy-preserving sensing approach.
Chrome DevTools as an MCP server for AI coding agents, enabling programmatic browser inspection and debugging within agent workflows. 501 stars today reflects growing adoption of browser tooling in agentic developer stacks.
Agentic video generation system combining Director, Screenwriter, Producer, and Video Generator roles into a single all-in-one system, demonstrating multi-agent collaboration for complex creative tasks.
Memory library for building stateful agents, providing persistent memory infrastructure for long-running agent workflows. Gained 133 stars today as stateful agent memory becomes a critical infrastructure component.
Microsoft's AI agent governance toolkit covering policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering, explicitly addressing all ten risks in the OWASP Agentic Top 10.
Fully autonomous and self-evolving research system that operates from idea generation to paper writing, representing a significant step toward automated scientific research pipelines.
Meta's Segment Anything Model 3 for inference and fine-tuning, extending the highly influential SAM series with improved capabilities for segmentation tasks.