Saturday, May 23, 2026

RLVR token-credit assignment (DelTA) advances fine-grained LLM training signals; full-attention sparsification shows LLMs are intrinsically sparse; agent governance and tooling ecosystems explode on GitHub

rlvr-and-credit-assignment · attention-sparsification · agentic-evaluation-benchmarks · agent-infrastructure-and-governance · multimodal-robustness · kv-cache-and-inference-efficiency

Executive Summary

Much of today's most-engaged research centers on reinforcement learning improvements for LLMs. DelTA (124 upvotes) introduces a discriminator-theoretic view of RLVR updates, showing that policy-gradient steps implicitly act as linear discriminators over token-gradient vectors. The finding could reshape how the community thinks about credit assignment in post-training pipelines. Meanwhile, Full Attention Strikes Back demonstrates that standard full-attention LLMs are already intrinsically sparse and can be converted to sparse attention in under 100 training steps, which has immediate implications for cutting inference cost without sacrificing expressivity.

On the evaluation and dataset front, TransitLM (162 upvotes) released the largest publicly known transit-planning corpus (13M+ records, 120K stations), setting a new bar for scale in geographic reasoning without map APIs. Perception or Prejudice formalizes Grounded Personality Reasoning for MLLMs, surfacing a key reliability gap: current multimodal models often pattern-match rather than genuinely perceive behavioral signals. π-Bench and Spreadsheet-RL extend the agentic evaluation frontier into long-horizon proactive workflows and real-world spreadsheet automation, respectively.

GitHub trends signal a rapidly maturing agent-infrastructure layer: Microsoft's agent-governance-toolkit (covering OWASP Agentic Top 10), plastic-labs/honcho (stateful agent memory), and Anthropic's claude-plugins-official directory all gained significant traction. The proliferation of Claude Code plugin and skill registries, combined with pre-indexed code knowledge graphs (codegraph, Understand-Anything), suggests the developer tooling stack around AI coding agents is consolidating fast.

Researcher Notes

The DelTA-sparsification connection is non-obvious but important. DelTA shows that RLVR updates implicitly discriminate over token-gradient vectors, while Full Attention Strikes Back shows that trained LLMs are already sparse in attention patterns. Together, these suggest a future where sparse-attention models trained with token-discriminative RL rewards might be both cheaper to run and more precisely shaped by fine-grained feedback — a compounding efficiency gain worth tracking.

Unsupervised PRMs are a sleeper hit. With only 17 upvotes, the unsupervised process reward model paper may be underappreciated relative to its potential impact. If PRMs can be trained without expert step-level annotations, the main scaling bottleneck for verifier-guided search largely disappears. This pairs well with DelTA's credit-assignment framing: both papers attack the labeling cost of process-level supervision from different angles.

The KV-cache stack is fragmenting into specialized solutions. WorldKV (video diffusion), KVServe (disaggregated LLM serving), and the attention sparsification work (Full Attention Strikes Back) all touch KV-cache efficiency, but from entirely different angles: video generation consistency, network bandwidth under SLO constraints, and attention sparsity, respectively. The absence of a unified framework is notable, and the first system to integrate these perspectives might define the next generation of inference engines.

Agent governance is crossing the chasm. Microsoft's agent-governance-toolkit explicitly maps to OWASP Agentic Top 10, AWS's aidlc-workflows provides adaptive steering rules, and Tracer-Cloud's opensre targets AI SRE use cases. The simultaneous emergence of governance tooling from a hyperscaler (Microsoft), cloud provider (AWS), and startup (Tracer-Cloud) within the same trending window suggests enterprise AI agent deployment is moving from pilot to production at scale, creating urgent demand for policy enforcement and zero-trust identity primitives.

π-Bench and Spreadsheet-RL mark an inflection point in the maturity of agentic benchmarks. Early agent benchmarks (WebArena, SWE-bench) tested reactive task completion. The new generation, with π-Bench's hidden-intent proactive workflows and Spreadsheet-RL's multi-step real-world spreadsheet operations, tests whether agents can sustain intentional, long-horizon behavior. This shift mirrors what happened in NLP evaluation when GLUE gave way to BIG-Bench, and suggests the next 12-18 months will see capability thresholds redefined by proactive and sustained-action metrics.

Themes & Trends

RLVR Credit Assignment and Process Supervision

rising

Multiple papers tackle the granularity and efficiency of reward signals in LLM training — from token-level discriminative assignment (DelTA) to eliminating annotation bottlenecks in process reward models (Unsupervised PRMs). Together they signal a push toward more principled and scalable RL feedback.

Attention Sparsification and Inference Efficiency

rising

Full Attention Strikes Back demonstrates that LLMs are intrinsically sparse, while Gated DeltaNet-2 improves linear attention's memory operations. Both reduce inference cost without sacrificing model quality, pointing toward a convergence on efficient attention mechanisms.

Agentic Evaluation and Long-Horizon Benchmarks

rising

A new generation of benchmarks (π-Bench for proactive workflows, Spreadsheet-RL for realistic office tasks, and CUSP for scientific forecasting) moves beyond reactive task completion to evaluate sustained, intentional, and proactive agent behavior.

Agent Infrastructure and Governance

rising

The GitHub trending data shows rapid maturation of agent infrastructure: governance toolkits, stateful memory libraries, and MCP-based browser tooling all trended simultaneously, indicating enterprise-ready agent deployment is imminent.

Multimodal Robustness and Grounded Reasoning

stable

Perception or Prejudice, LatentOmni, and SpaceDG all probe whether multimodal models genuinely reason or merely pattern-match. The theme spans personality grounding, audio-visual temporal reasoning, and spatial understanding under degraded inputs.

KV-Cache Innovation Across Domains

stable

KV-cache optimization is fragmenting into domain-specific solutions: WorldKV for video diffusion consistency, KVServe for network-efficient disaggregated LLM serving, and attention sparsification for standard LLM inference. Unification remains an open challenge.

Trending Papers (14)

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

High Relevance

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu · Peking University, ByteDance

Releases a corpus of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines. The dataset is designed as a continual pre-training corpus and benchmark for LLM-based transit planning that does not rely on map infrastructure, enabling geographic reasoning in resource-constrained settings.

Key Findings

  • First large-scale open dataset for map-free transit route planning with 13M+ records

  • Covers 4 Chinese cities, 120,845 stations, and 13,666 lines at unprecedented scale

  • Enables continual pre-training of LLMs for transit reasoning without external map APIs

dataset · transit-planning · geographic-reasoning · llm · benchmark
162 upvotes

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

High Relevance

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang · Tsinghua University, Renmin University of China

Formalizes the task of Grounded Personality Reasoning (GPR) to evaluate whether multimodal LLMs genuinely perceive personality through behavioral understanding or merely prejudge via superficial pattern matching. The work reveals a systematic reliability gap in current MLLMs.

Key Findings

  • Current MLLMs predominantly pattern-match superficial cues rather than reasoning from behavior

  • Introduces GPR as a formal evaluation framework distinguishing genuine perception from prejudice

  • Identifies a category of failure modes specific to first-impression bias in multimodal models

multimodal · evaluation · personality-reasoning · mllm · benchmark
152 upvotes

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

High Relevance

Kaiyi Zhang, Wei Wu, Yankai Lin · Renmin University of China, Tsinghua University

Introduces a discriminator view of RLVR updates, demonstrating that policy-gradient steps implicitly act as linear discriminators over token-gradient vectors, determining which token probabilities increase or decrease. This theoretical reframing enables more principled credit assignment at the token level.

Key Findings

  • Policy-gradient RLVR updates are equivalent to linear discriminators over token-gradient vectors

  • Provides theoretical grounding for fine-grained token-level credit assignment in RL fine-tuning

  • Discriminative framing opens new design space for reward shaping and token selection

rlvr · reinforcement-learning · credit-assignment · llm-training · theory
124 upvotes
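
One way to read the discriminator claim is through a first-order expansion, sketched below in plain Python. This is an illustration under stated assumptions, not the paper's derivation or code: the toy linear-softmax policy, the random +1/-1 advantages, and the helper names (token_gradient, discriminator_check) are all hypothetical. For a step W <- W + lr * w with w = sum_t A_t * g_t, the log-probability of token k changes by roughly lr * <g_k, w>, so the sign of a linear score over token-gradient vectors predicts which token probabilities rise or fall.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(W, x):
    # Toy policy (assumption): logits = W @ x for a context feature vector x.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def log_prob(W, x, y):
    """log pi(y | x) under the toy linear-softmax policy."""
    return math.log(softmax(logits_for(W, x))[y])


def token_gradient(W, x, y):
    """Flattened gradient of log pi(y | x) with respect to W (row-major)."""
    p = softmax(logits_for(W, x))
    return [((1.0 if i == y else 0.0) - p[i]) * x_j
            for i in range(len(W)) for x_j in x]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def discriminator_check(vocab=5, dim=4, n_tokens=6, lr=1e-5, seed=0):
    rng = random.Random(seed)
    W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    tokens = [([rng.uniform(-1, 1) for _ in range(dim)], rng.randrange(vocab))
              for _ in range(n_tokens)]
    # Hypothetical per-token advantages standing in for verifiable rewards.
    advantages = [rng.choice([-1.0, 1.0]) for _ in range(n_tokens)]

    grads = [token_gradient(W, x, y) for x, y in tokens]
    # Direction of the policy-gradient step: w = sum_t A_t * g_t.
    w = [sum(a * g[i] for a, g in zip(advantages, grads))
         for i in range(vocab * dim)]
    # Linear score per token: positive predicts that its log-prob increases.
    scores = [dot(g, w) for g in grads]

    # Apply W <- W + lr * w and measure the actual log-prob change per token.
    W_new = [[W[i][j] + lr * w[i * dim + j] for j in range(dim)]
             for i in range(vocab)]
    changes = [log_prob(W_new, x, y) - log_prob(W, x, y) for x, y in tokens]
    return scores, changes, lr
```

Calling discriminator_check() returns the linear scores and the measured log-probability changes; for a small learning rate the two agree in sign. A token's own advantage contributes A_k * |g_k|^2 to its score, but contributions from other tokens in the batch can outweigh it.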

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

High Relevance

Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang · Peking University, Alibaba Group

Introduces a benchmark evaluating whether agents can identify and act on hidden user intents before they are explicitly stated, across sustained long-horizon workflows. Addresses the proactive assistance challenge that reactive benchmarks cannot capture.

Key Findings

  • First benchmark specifically targeting proactive intent identification in long-horizon workflows

  • Reveals that current agents fail to act on implicit user goals without explicit instruction

  • Long-horizon workflow structure exposes compounding failures invisible in short-task evaluations

agent-evaluation · proactive-assistance · long-horizon · benchmark · llm-agents
81 upvotes

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

High Relevance

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu · Peking University, Microsoft Research

Demonstrates that full-attention LLMs are already intrinsically sparse in their attention patterns, and can be converted into highly sparse models with minimal adaptation — under 100 training steps. Identifies three key observations about inherent sparsity patterns.

Key Findings

  • Full-attention LLMs exhibit intrinsic sparsity that can be exploited without retraining from scratch

  • Sparse conversion requires fewer than 100 training steps, making it practical for deployed models

  • Three distinct sparsity pattern types identified across model families

attention-sparsification · efficiency · inference · llm · sparse-attention
75 upvotes
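
To make "intrinsically sparse" concrete, a minimal sketch in plain Python can measure how many keys a query needs to cover most of its softmax attention mass. The helper names (topk_coverage, keys_needed) are hypothetical, and this only illustrates the kind of measurement the claim implies. It is not the paper's conversion procedure, and the summary above does not describe the three sparsity patterns in enough detail to reproduce them.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_coverage(attn_row, k):
    """Fraction of one query's attention mass held by its k largest weights."""
    return sum(sorted(attn_row, reverse=True)[:k])


def keys_needed(attn_row, mass=0.9):
    """Smallest number of keys whose weights sum to at least `mass`."""
    running = 0.0
    for count, weight in enumerate(sorted(attn_row, reverse=True), start=1):
        running += weight
        if running >= mass:
            return count
    return len(attn_row)
```

A sharply peaked row such as softmax([8, 0, 0, 0, 0, 0, 0, 0]) needs a single key to reach 90% of the mass, while a uniform row over eight keys needs all eight. Rows of the first kind are the ones a sparse attention pattern could serve with little loss.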

ACC: Compiling Agent Trajectories for Long-Context Training

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao · Shanghai AI Laboratory, Fudan University

Proposes using agent trajectories — which naturally contain evidence scattered across many turns — as long-context training data for LLMs. Addresses the challenge that agentic problem-solving requires integrating distant context, making trajectories ideal for training long-context integration.

Key Findings

  • Agent trajectories naturally encode long-context dependencies, making them ideal training data

  • Compilation approach aggregates multi-turn agentic outputs into coherent long-context examples

  • Training on compiled trajectories improves long-context reasoning on downstream benchmarks

long-context · agent-training · data-curation · llm · trajectories
52 upvotes

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong · Nanyang Technological University, NVIDIA

Presents a unified framework for generating simulation-ready 3D assets across rigid, deformable, and articulated object types through a novel geometry pipeline. Addresses the fragmentation of prior work that handled each asset class separately.

Key Findings

  • Unified pipeline handles rigid, deformable, and articulated objects in a single framework

  • Generated assets are immediately simulation-ready without manual post-processing

  • Novel geometry pipeline enables physically accurate asset generation at scale

3d-generation · simulation · physics · robotics · asset-generation
42 upvotes

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu · Zhejiang University, Alibaba Group

Proposes a unified latent space for audio-visual reasoning instead of text-based chain-of-thought, addressing the core problem that explicit CoT compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and cross-modal alignment.

Key Findings

  • Text-based CoT for audio-visual tasks loses temporal grounding by discretizing continuous signals

  • Unified latent reasoning space preserves audio-visual continuity across modalities

  • Outperforms text-CoT baselines on temporal grounding and cross-modal reasoning tasks

multimodal · audio-visual · latent-reasoning · temporal-grounding · omni-modal
35 upvotes

Spreadsheet-RL: Advancing LLM Agents on Realistic Spreadsheet Tasks via RL

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang · National University of Singapore, Microsoft

Applies reinforcement learning to train LLM agents on realistic spreadsheet automation tasks, addressing the limitations of specialized prompting on complex multi-step operations. Extends agentic RL into the practical enterprise office automation domain.

Key Findings

  • Specialized prompting fails on complex multi-step spreadsheet operations

  • RL training significantly improves agent performance on realistic spreadsheet benchmarks

  • Office automation tasks require sustained multi-step planning that RL is well-suited for

reinforcement-learning · spreadsheet · office-automation · llm-agents · rl
29 upvotes

Unsupervised Process Reward Models

High Relevance

Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic · EPFL, ETH Zurich

Introduces a method for training process reward models without human supervision, eliminating the need for step-by-step annotations or ground-truth verification. Directly attacks the expert annotation bottleneck that has limited PRM scaling.

Key Findings

  • PRMs can be trained without any step-level human annotations

  • Unsupervised approach matches or approaches supervised PRMs on reasoning benchmarks

  • Removes the primary scaling bottleneck for verifier-guided search in LLM reasoning

process-reward-models · unsupervised-learning · rlhf · reasoning · llm-training
17 upvotes

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz · NVIDIA

Decouples the erase and write operations in linear attention's compressed memory representation, arguing that a single scalar gate causes interference between the two operations. Separate gating mechanisms enable independent control over memory retention and updates.

Key Findings

  • Single scalar gate in linear attention causes interference between erase and write operations

  • Separate gating for erase and write improves memory control in linear attention models

  • Gated DeltaNet-2 achieves better language modeling perplexity and downstream task performance

linear-attention · efficient-transformers · memory · architecture · ssm
16 upvotes
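
The interference argument can be illustrated with a toy delta-rule memory in plain Python. The sketch below is an assumption-laden illustration, not the paper's formulation: the summary above does not give the Gated DeltaNet-2 update equations, so the function delta_step and its erase and write parameters are hypothetical. With erase equal to write, it reduces to the classic delta rule S <- S(I - b k k^T) + b v k^T; letting the two gates differ shows how clearing an old association and storing a new one become independently controllable.

```python
def read(S, k):
    """Read the value stored under key k from memory S (d_v x d_k rows)."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_step(S, k, v, erase, write):
    """One memory update: S <- S - erase * (S k) k^T + write * v k^T.

    S is a d_v x d_k matrix (list of rows), k has length d_k, v has length d_v.
    erase == write recovers the single-gate delta rule; separate values are a
    hypothetical stand-in for decoupled erase and write gates.
    """
    Sk = read(S, k)  # what the memory currently returns for this key
    return [[S[i][j] - erase * Sk[i] * k[j] + write * v[i] * k[j]
             for j in range(len(k))] for i in range(len(v))]
```

With a unit-norm key, erase=1 and write=0 clears the stored value without writing anything, while erase=0 and write=1 adds the new value on top of the old one. A single shared gate cannot express either case, because it ties the strength of forgetting to the strength of writing.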

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu · Tsinghua University, Peking University

Applies RL to orchestrate multiple LLMs and modular skills, exploiting complementary strengths across domains instead of relying on a single monolithic model. Hierarchical ensemble enables dynamic routing based on task requirements.

Key Findings

  • RL-based orchestration outperforms static routing and monolithic models across diverse domains

  • Complementary specialization across models can be exploited by a learned orchestrator

  • Hierarchical skill ensembles reduce inference cost while improving accuracy on mixed workloads

model-ensemble · reinforcement-learning · llm-routing · multi-agent · orchestration
17 upvotes

Forecasting Scientific Progress with AI

Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada · Allen Institute for AI, UCLA

Introduces CUSP, a benchmark for evaluating AI systems on scientific forecasting under controlled knowledge constraints, enabling multi-disciplinary event-level evaluation of how well models can predict future scientific developments.

Key Findings

  • CUSP provides controlled knowledge cutoffs enabling fair comparison of forecasting capabilities

  • Multi-disciplinary coverage reveals domain-specific forecasting strengths and weaknesses

  • Current frontier models show significant gaps in scientific event prediction accuracy

scientific-ai · forecasting · benchmark · knowledge-cutoff · multi-disciplinary
29 upvotes

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng · Waymo, Google

Converts in-the-wild dashcam video to match proprietary AV sensor configurations, enabling the use of massive diverse dashcam datasets for training AV perception systems that require structured fleet sensor data.

Key Findings

  • Dashcam-to-AV sensor conversion bridges the data gap between consumer and fleet sensor configurations

  • Cross-embodiment approach enables leveraging billions of dashcam frames for AV training

  • Sensor conversion quality is sufficient to improve downstream AV perception task performance

autonomous-driving · sensor-fusion · data-augmentation · cross-embodiment · perception
22 upvotes

Trending Models (11)

DeepSeek-V4-Pro

DeepSeek AI · text-generation · Unknown (MoE)


Latest flagship text generation model from DeepSeek with 4.2M+ downloads and 4,152 likes, the highest download and like counts among today's trending models. Continues the V4 series with architectural improvements.

text-generation · transformers · safetensors · deepseek
4.3M downloads · 4.2K likes
Qwen3.6-27B

Qwen / Alibaba · image-text-to-text · 27B


Qwen3.6 27B dense multimodal model with 4M downloads and 1,390 likes, supporting image-text-to-text tasks. Part of the Qwen3 series advancing open multimodal frontier models.

transformers · safetensors · multimodal · qwen3
4.0M downloads · 1.4K likes
Anima

Circlestone Labs · image-generation · Unknown


Highly-liked ComfyUI diffusion model with 1,499 likes and 602K downloads, designed for high-quality image/video generation workflows in the ComfyUI ecosystem.

diffusion-single-file · comfyui · image-generation
602.5K downloads · 1.5K likes
Sulphur-2-base

SulphurAI · text-to-video · Unknown


Text-to-video diffusion model available in GGUF and diffusers formats with 1.25M downloads, indicating strong community adoption for local video generation workflows.

diffusers · gguf · text-to-video
1.2M downloads · 1.3K likes
MiniCPM-V-4.6

OpenBMB / Tsinghua · image-text-to-text · ~4.6B


Efficient multimodal model with 221K downloads and 904 likes, part of the MiniCPM-V series known for strong performance relative to its compact size in image-text-to-text tasks.

transformers · safetensors · multimodal · efficient
221.6K downloads · 904 likes
Lance

ByteDance Research · image-generation · Unknown


ByteDance Research multimodal model supporting both image and video generation with 649 likes, representing ByteDance's entry into unified visual generation.

multimodal · image-generation · video-generation · safetensors
1.0K downloads · 649 likes
Supertonic-3

Supertone · text-to-speech · Unknown


Advanced text-to-speech and speech synthesis model in ONNX format with 37K downloads and 582 likes, offering high-quality voice synthesis capabilities.

supertonic · onnx · text-to-speech · speech-synthesis · tts
37.5K downloads · 582 likes
Qwen3.6-27B-MTP-GGUF

Unsloth · text-generation · 27B


Unsloth-optimized GGUF quantization of Qwen3.6 27B MTP variant with 532K downloads and 413 likes, enabling efficient local deployment of the Qwen3.6 series.

transformers · gguf · unsloth · qwen · quantization
532.3K downloads · 413 likes
Hy-MT2-1.8B

Tencent · translation · 1.8B


Tencent's compact 1.8B translation-capable text generation model, based on the HunyuanV1 dense architecture and designed for efficient multilingual translation tasks.

transformers · safetensors · text-generation · translation
564 downloads · 280 likes
Dramabox

ResembleAI · text-to-speech · Unknown


High-quality TTS and voice cloning model with 1,354 downloads and 230 likes, specialized for dramatic and expressive speech synthesis with voice cloning capabilities.

tts · voice-cloning · speech-synthesis · drama
1.4K downloads · 230 likes
Pixal3D

TencentARC · image-to-3d · Unknown


Image-to-3D model from TencentARC with 192 likes enabling single-image 3D reconstruction, contributing to the growing ecosystem of 3D generation tools.

image-to-3d · 3d-generation · reconstruction
0 downloads · 192 likes

Trending GitHub Repos (12)

Pre-indexed code knowledge graph for Claude Code, Codex, Cursor, and OpenCode that reduces token usage and tool calls while running 100% locally. The highest star gain in today's trending list (3,684) indicates strong resonance with AI coding workflows.

knowledge-graph · claude-code · coding-agents · developer-tools · efficiency
TypeScript · 16.7K · +3.7K today · 919

Official Anthropic-managed directory of high-quality Claude Code plugins, serving as the authoritative registry for the growing Claude Code plugin ecosystem. Explosive growth with 2,549 stars today signals rapid ecosystem adoption.

claude-code · plugins · anthropic · developer-tools · ai-coding
Python · 25.0K · +2.5K today · 2.8K

General-purpose agent framework from NousResearch with 163K stars and 1,743 new stars today, positioned as a composable, growing agent platform built on the Hermes model family.

llm-agents · hermes · open-source · agent-framework
Python · 163.2K · +1.7K today · 26.7K

Converts code repositories into interactive knowledge graphs compatible with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. Gained 1,393 stars today, reflecting demand for code-comprehension tooling across AI coding agent stacks.

knowledge-graph · code-understanding · developer-tools · ai-coding
TypeScript · 18.8K · +1.4K today · 1.7K

Comprehensive AI engineering curriculum covering building and shipping AI applications from scratch, with 988 stars today showing strong community interest in practical AI engineering education.

ai-engineering · education · llm · machine-learning · curriculum
Python · 12.0K · +988 today · 2.3K

Converts WiFi signals into real-time spatial intelligence, vital sign monitoring, and presence detection without requiring video cameras. Gained 978 stars today, representing a novel privacy-preserving sensing approach.

spatial-intelligence · wifi-sensing · privacy · rust · computer-vision
Rust · 64.1K · +978 today · 8.5K

Chrome DevTools as an MCP server for AI coding agents, enabling programmatic browser inspection and debugging within agent workflows. 501 stars today reflects growing adoption of browser tooling in agentic developer stacks.

mcp · chrome-devtools · browser-automation · coding-agents · developer-tools
TypeScript · 41.0K · +501 today · 2.6K

Agentic video generation system combining Director, Screenwriter, Producer, and Video Generator roles into a single all-in-one system, demonstrating multi-agent collaboration for complex creative tasks.

video-generation · multi-agent · creative-ai · agentic
Python · 6.7K · +266 today · 1.1K

Memory library for building stateful agents, providing persistent memory infrastructure for long-running agent workflows. Gained 133 stars today as stateful agent memory becomes a critical infrastructure component.

agent-memory · stateful-agents · llm-agents · memory-management
Python · 4.0K · +133 today · 467

Microsoft's AI agent governance toolkit covering policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering, explicitly addressing all ten risks in the OWASP Agentic Top 10.

agent-governance · security · owasp · policy-enforcement · enterprise-ai
Python · 1.8K · +86 today · 343

Fully autonomous and self-evolving research system that operates from idea generation to paper writing, representing a significant step toward automated scientific research pipelines.

autonomous-research · scientific-ai · paper-generation · self-evolving
Python · 12.5K · +73 today · 1.5K

Meta's Segment Anything Model 3 for inference and fine-tuning, extending the highly influential SAM series with improved capabilities for segmentation tasks.

segmentation · computer-vision · meta · foundation-model · fine-tuning
Python · 10.1K · +63 today · 1.5K

Sources Checked