Friday, May 22, 2026

Agent trajectory compilation (ACC) opens new long-context training paradigm; Gated DeltaNet-2 decouples linear attention memory editing; code knowledge graphs and agentic skills frameworks explode on GitHub

agent-training-from-trajectories · efficient-attention-mechanisms · agent-benchmarks-evaluation · agentic-coding-tools · curriculum-reinforcement-learning

Executive Summary

Today's research centers on improving how LLMs learn from agent interactions and how efficient attention mechanisms manage compressed memory. The standout paper is ACC (Agent Trajectory Compilation), which reframes the massive trajectories produced by tool-using agents as a natural source of long-context training data — an elegant insight that sidesteps the cost of synthetic long-document curation. Gated DeltaNet-2 tackles a fundamental tension in linear attention: how to edit a fixed-size recurrent state without corrupting existing associations, introducing decoupled erase-write gates that improve on KDA's channel-wise decay.

The agent and benchmark space continues to mature rapidly. TerminalWorld introduces a scalable data engine that reverse-engineers evaluation tasks from real terminal recordings, yielding 1,530 validated tasks across 18 categories. Spreadsheet-RL applies reinforcement learning to spreadsheet automation, and pi-Bench evaluates proactive personal assistants on hidden-intent detection. Meanwhile, SCRL uses curriculum RL over verifiable subproblems to tackle credit assignment in LLM reasoning.

On the model front, DeepSeek V4 continues its dominance with Pro (4M+ downloads) and Flash (2.4M+ downloads), while ByteDance's Lance and OpenBMB's MiniCPM-V-4.6 push multimodal boundaries. GitHub is dominated by the agentic coding revolution: codegraph (4,294 stars today), andrej-karpathy-skills (2,614 stars today), and superpowers (1,576 stars today) reflect massive developer appetite for AI-assisted development tooling.

Researcher Notes

ACC's insight that agent trajectories are natural long-context training data is deceptively simple but potentially transformative. Tool-using agents scatter evidence across many turns, so answering a question requires integrating distant context segments. This closely mirrors the kind of long-range reasoning we want LLMs to learn. Rather than relying on expensive synthetic data curation, this approach harvests training signal from the very process of agents doing useful work. Watch for follow-up work applying this to code agents, where trajectories are especially rich.

Gated DeltaNet-2 represents a meaningful step toward practical linear attention. The core problem is well-characterized here: delta-rule models use a single scalar gate for both erasing and writing, so one operation can inadvertently distort the other. Decoupling these operations into separate gating mechanisms is the kind of surgical architectural improvement that could compound at scale. The connection to KDA (Kimi Delta Attention) and its channel-wise decay suggests this line of work is converging on a practical alternative to full softmax attention for long sequences.

The benchmarking wave deserves attention for what it reveals about field maturity. TerminalWorld (1,530 tasks from 80K real recordings), Spreadsheet-RL (realistic spreadsheet automation), and pi-Bench (proactive assistant evaluation) all share a common philosophy: evaluation grounded in real-world usage patterns rather than synthetic tasks. This shift from 'can the model do X on a toy problem' to 'can the model handle the messy reality of X' is a leading indicator of practical deployment readiness.

The GitHub trending data tells the story of 2026: agentic coding has gone mainstream. codegraph's 4,294 daily stars for pre-indexed code knowledge graphs, combined with andrej-karpathy-skills' 2,614 daily stars for a single CLAUDE.md best-practices file, show that developers now optimize their workflows around AI coding agents rather than treating them as novelties. The rise of hermes-agent (161K total stars) and agency-agents (103K total stars) suggests agent orchestration platforms are becoming core infrastructure.

Sleeper hit: WorldKV for persistent video world generation. While the engagement numbers are modest, the problem of maintaining consistent world state across long video rollouts — where revisiting a viewpoint should yield the same content — is fundamental to real-time interactive applications. The retrieval-and-compression approach to KV-cache management could have implications well beyond video generation.

Themes & Trends

Agent Training from Trajectories

rising

ACC demonstrates that agent trajectories provide natural long-context training data, while SCRL introduces curriculum RL for credit assignment — both advancing how we train models from agent interactions.

Efficient Attention and Memory

rising

Gated DeltaNet-2 and WorldKV both address the fundamental challenge of managing compressed memory in sequence models — decoupling erase/write operations in linear attention and retrieval-compression for video world models.

Agent Benchmarks and Real-World Evaluation

rising

TerminalWorld, Spreadsheet-RL, and pi-Bench all push evaluation toward real-world usage patterns rather than synthetic tasks, reflecting growing demand for ecologically valid agent assessment.

Agentic Coding Tools Ecosystem

rising

GitHub trending is dominated by AI coding tools — codegraph, andrej-karpathy-skills, academic-research-skills, superpowers, and claude-plugins-official collectively gaining 10K+ daily stars, signaling mainstream adoption of AI-assisted development.

Video Generation and World Models

stable

Bernini's semantic planning for video diffusion, WorldKV's persistent world generation, and ViMax's agentic video generation framework show continued strong momentum in controllable video synthesis.

Trending Papers (10)

ACC: Compiling Agent Trajectories for Long-Context Training

High Relevance

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao · Tsinghua University, ByteDance

Proposes using the massive trajectories produced by tool-using agents as a natural source of long-context training data for LLMs. Agent trajectories scatter evidence across many turns of tool invocation and environment observation, requiring integration of distant context segments — exactly the capacity long-context training aims to develop.

Key Findings

  • Agent trajectories provide naturally structured long-context training data without expensive manual curation

  • Evidence scattered across multi-turn tool interactions requires long-range context integration

  • The compilation approach sidesteps heuristic context synthesis methods used in prior work

agents · long-context · training-data · trajectories
9 upvotes
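
A minimal Python sketch of the core idea follows. It assumes a trajectory is stored as a list of tool-call and observation turns, and that the caller already knows which turns hold the evidence for a question. The `Turn` schema, `compile_sample`, and the `min_span` filter are hypothetical illustrations, not ACC's actual compilation pipeline.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # hypothetical schema: "tool_call" or "observation"
    content: str


def compile_sample(turns, evidence_ids, question, answer, min_span=2):
    """Flatten one agent trajectory into a long-context QA training sample.

    evidence_ids are the turn indices the answer depends on (assumed given).
    Returns None when the evidence sits too close together to require
    long-range integration.
    """
    span = max(evidence_ids) - min(evidence_ids)
    if span < min_span:
        return None
    context = "\n".join(f"[{i}] {t.role}: {t.content}" for i, t in enumerate(turns))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "target": answer,
        "evidence_span": span,
    }


turns = [
    Turn("tool_call", "ls data/"),
    Turn("observation", "a.csv b.csv"),
    Turn("tool_call", "wc -l data/a.csv"),
    Turn("observation", "120 data/a.csv"),
    Turn("tool_call", "wc -l data/b.csv"),
    Turn("observation", "80 data/b.csv"),
]
# The two row counts live in separate observations (turns 3 and 5).
sample = compile_sample(turns, [3, 5], "How many rows do a.csv and b.csv have in total?", "200")
print(sample["evidence_span"])  # 2
```

In real trajectories the evidence turns can be hundreds of turns apart, and that distance is what makes them useful for long-context training. `min_span` is one crude knob for keeping only samples that demand that reach.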

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

High Relevance

Ali Hatamizadeh, Yejin Choi, Jan Kautz · NVIDIA, University of Washington

Addresses a fundamental limitation in delta-rule linear attention models where a single scalar gate controls both erasing and writing to the compressed recurrent state. By decoupling these operations into separate gating mechanisms, the model avoids the interference where one operation scrambles the other's associations.

Key Findings

  • Single-gate delta-rule models suffer from erase-write interference in compressed memory

  • Decoupled gating mechanisms allow independent control of memory erasure and new value writing

  • Improves upon KDA's channel-wise decay approach for managing the fixed-size recurrent state

linear-attention · efficient-transformers · architecture · memory
1 upvote
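
The toy single-head update below makes the erase/write distinction concrete in plain Python. The update form `S_t = alpha * S_{t-1} (I - erase * k k^T) + write * v k^T`, with separate scalar gates `erase` and `write`, is an assumption chosen for illustration. The paper's exact parameterization, and KDA's channel-wise decay, may differ. Setting `erase == write == beta` recovers the single-gate gated delta rule.

```python
def delta_step(S, k, v, alpha, erase, write):
    """One update of a delta-rule memory S (d_v rows x d_k columns).

    Hypothetical decoupled form:
        S_t = alpha * S_{t-1} (I - erase * k k^T) + write * v k^T
    With erase == write == beta this is the single-gate gated delta rule.
    """
    d_v, d_k = len(S), len(k)
    # What the memory currently returns for key k: S k.
    recalled = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [
        [alpha * (S[i][j] - erase * recalled[i] * k[j]) + write * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    """Linear-attention readout: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


k = [1.0, 0.0]                      # unit-norm key, so erase=1 fully removes its value
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_step(S, k, [1.0, 2.0], alpha=1.0, erase=1.0, write=1.0)
print(read(S, k))                   # [1.0, 2.0]

# Full erase plus full write replaces the value stored under k.
print(read(delta_step(S, k, [5.0, 5.0], 1.0, 1.0, 1.0), k))   # [5.0, 5.0]
# Write without erase: old and new values pile up (interference).
print(read(delta_step(S, k, [5.0, 5.0], 1.0, 0.0, 1.0), k))   # [6.0, 7.0]
```

With a single gate, choosing a write strength also fixes how much is erased. Separate gates let the two be tuned independently, which is the coupling the paper targets.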

WorldKV: Efficient World Memory with World Retrieval and Compression

High Relevance

Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun · NAVER AI Lab, Korea Advanced Institute of Science and Technology

Proposes a retrieval-and-compression approach to KV-cache management for autoregressive video diffusion models, enabling persistent world generation where revisiting previously seen viewpoints yields consistent content without breaking real-time constraints.

Key Findings

  • Full KV-cache attention preserves world consistency but memory and compute grow linearly with rollout length

  • Sliding window inference restores throughput but sacrifices long-term consistency

  • WorldKV combines retrieval and compression to maintain both consistency and real-time performance

video-generation · world-models · KV-cache · real-time
2 upvotes
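
Here is one way to picture retrieval plus compression for a frame-level KV cache, sketched in Python under loud assumptions:

- Each past frame is stored as a `(pose, tokens)` pair.
- Retrieval ranks older frames by Euclidean distance between camera poses.
- Compression is plain average pooling of token vectors.

None of these choices comes from the WorldKV paper. `select_context`, `compress`, and the pose tuple are hypothetical.

```python
import math


def compress(tokens, factor=2):
    """Average-pool consecutive groups of token vectors (stand-in compression)."""
    groups = [tokens[i:i + factor] for i in range(0, len(tokens), factor)]
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


def select_context(history, pose, window=4, top_r=2):
    """Pick the KV entries the next frame attends to.

    history: list of (pose, tokens) for past frames, oldest first.
    Keeps the last `window` frames at full resolution (sliding window) and
    adds the `top_r` older frames whose camera pose is closest to `pose`,
    compressed, so revisited viewpoints can stay consistent at bounded cost.
    Requires window >= 1.
    """
    recent = history[-window:]
    older = history[:-window]
    nearest = sorted(older, key=lambda f: math.dist(f[0], pose))[:top_r]
    return [(p, compress(toks)) for p, toks in nearest] + recent


# Eight frames along the x-axis, each with two 1-d tokens.
history = [((float(i), 0.0, 0.0), [[float(i)], [float(i)]]) for i in range(8)]
ctx = select_context(history, (1.2, 0.0, 0.0))
print([p[0] for p, _ in ctx])   # [1.0, 2.0, 4.0, 5.0, 6.0, 7.0]
```

The attended context stays bounded (`top_r` compressed frames plus `window` recent ones), however long the rollout grows. The linear scan over `older` is the part a real system would replace with an index.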

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang · Zhejiang University, Alibaba Group

Applies reinforcement learning to train LLM agents for realistic spreadsheet automation tasks. Addresses limitations of specialized prompting approaches that struggle with complex multi-step spreadsheet operations beyond simple cell manipulation.

Key Findings

  • Specialized prompting over general-purpose LLMs struggles with complex spreadsheet operations

  • RL training enables agents to learn multi-step spreadsheet manipulation strategies

  • Bridges the gap between toy spreadsheet tasks and real-world data-centric workflows

agents · reinforcement-learning · spreadsheets · automation
2 upvotes
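
RL on spreadsheet tasks needs a reward the environment can check automatically. Below is a minimal, hypothetical outcome reward in Python. It compares the final values of target cells (given as A1-style references) against expected values, using a numeric tolerance. The cell format, `spreadsheet_reward`, and the partial-credit scheme are assumptions for illustration, not the reward used in Spreadsheet-RL.

```python
import math


def spreadsheet_reward(produced, expected, rel_tol=1e-9):
    """Fraction of target cells whose final value matches the expected value.

    produced, expected: dicts mapping cell refs like "B2" to values.
    Numbers are compared with a relative tolerance so floating-point noise
    from formulas does not zero out the reward; other values use equality.
    """
    if not expected:
        return 0.0
    hits = 0
    for ref, want in expected.items():
        got = produced.get(ref)
        both_numeric = all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in (got, want)
        )
        if both_numeric:
            hits += math.isclose(got, want, rel_tol=rel_tol)
        else:
            hits += got == want
    return hits / len(expected)


produced = {"B2": 0.1 + 0.2, "C2": "ok", "D2": 3}
expected = {"B2": 0.3, "C2": "ok", "D2": 4}
print(spreadsheet_reward(produced, expected))  # 0.6666666666666666
```

A binary "all cells correct" reward is the stricter alternative. Partial credit gives a denser learning signal but can also reward half-finished edits.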

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian · Indian Institute of Technology Hyderabad, Samsung Research

Introduces a training-free frame selection algorithm inspired by the brain's predictive coding that identifies high-information moments in long-form video by modeling it as a differentiable trajectory in visual latent space and computing velocity-based surprise scores.

Key Findings

  • Most frames in long-form video are redundant; critical information resides in temporal surprises

  • Taylor-series-based velocity computation identifies moments where visual features deviate from their predicted evolution

  • Training-free approach requires no task-specific fine-tuning for frame selection

video-understanding · frame-selection · predictive-coding · efficiency
2 upvotes
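
The velocity-based surprise idea can be sketched with a first-order Taylor expansion, z(t) ≈ z(t-1) + z'(t-1). Here the velocity z' is estimated by a finite difference between consecutive frame latents. This is a simplified stand-in that assumes per-frame latent vectors are already available (for example, from a frozen image encoder). `surprise_scores` and `select_frames` are hypothetical names, and the paper's trajectory model and scoring may differ.

```python
def surprise_scores(z):
    """Per-frame surprise for a sequence of latent vectors z[0..T-1].

    Predict z[t] by first-order Taylor extrapolation,
        z[t] ~= z[t-1] + (z[t-1] - z[t-2]),
    and score the squared prediction error. The first two frames have no
    velocity estimate and get 0.
    """
    scores = [0.0] * min(len(z), 2)
    for t in range(2, len(z)):
        pred = [2 * a - b for a, b in zip(z[t - 1], z[t - 2])]
        scores.append(sum((x - p) ** 2 for x, p in zip(z[t], pred)))
    return scores


def select_frames(z, k):
    """Indices of the k most surprising frames, in temporal order."""
    scores = surprise_scores(z)
    top = sorted(range(len(z)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)


# A 1-d latent moving at constant speed, then jumping at frame 4 (a scene cut).
z = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
print(surprise_scores(z))   # [0.0, 0.0, 0.0, 0.0, 36.0, 36.0, 0.0]
print(select_frames(z, 2))  # [4, 5]
```

Frame 5 also scores high because the finite-difference velocity still carries the jump. A practical selector might smooth the velocity estimate or suppress adjacent peaks.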

Diversed Model Discovery via Structured Table Discovery

Zhengyuan Dong, Renée J. Miller · Northeastern University

Argues that model search is inherently comparative and proposes leveraging structured artifacts from model cards — performance tables, configuration data, dataset metadata — to produce diverse, differentiated model recommendations beyond what text-based semantic similarity can achieve.

Key Findings

  • Text-based model search produces homogeneous results due to semantic similarity clustering

  • Structured table artifacts in model cards capture differentiation dimensions text misses

  • Comparative model search requires balancing task alignment with measurable differentiation

model-discovery · model-cards · information-retrieval · structured-data
2 upvotes
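
Greedy maximal marginal relevance (MMR) is a standard way to balance task alignment against differentiation. The sketch below applies it to numeric feature vectors assumed to have been extracted from model-card tables (for example, benchmark scores). MMR is a well-known technique swapped in for illustration, not the paper's method. The input format and `diverse_top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def diverse_top_k(features, relevance, k, lam=0.5):
    """Greedy MMR: trade task relevance against similarity to models already picked.

    features: model name -> numeric vector from model-card tables (assumed).
    relevance: model name -> task-alignment score (assumed).
    lam = 1.0 reduces to plain relevance ranking.
    """
    chosen, pool = [], list(features)

    def mmr(m):
        redundancy = max((cosine(features[m], features[c]) for c in chosen), default=0.0)
        return lam * relevance[m] - (1 - lam) * redundancy

    while pool and len(chosen) < k:
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen


features = {"A": [0.9, 0.1], "A-finetune": [0.91, 0.1], "B": [0.1, 0.9]}
relevance = {"A": 1.0, "A-finetune": 0.95, "B": 0.6}
print(diverse_top_k(features, relevance, 2))   # ['A', 'B']
```

Plain relevance ranking would return the near-duplicate pair `A` and `A-finetune`. The redundancy penalty surfaces the measurably different `B` instead.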

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

High Relevance

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang · Shanghai Jiao Tong University, Tsinghua University

Introduces SCRL, a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and uses progressive difficulty scheduling. It targets the credit assignment problem in outcome-based RLVR (reinforcement learning with verifiable rewards), where correct final-answer rollouts are too rare for efficient learning on hard problems.

Key Findings

  • Outcome-based RLVR is inefficient on hard problems because correct final-answer rollouts are rare

  • Decomposing problems into verifiable subproblems enables partial credit assignment from failed attempts

  • Curriculum scheduling from easy to hard subproblems improves sample efficiency

reinforcement-learning · reasoning · curriculum-learning · credit-assignment
0 upvotes
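
The sketch below illustrates the curriculum idea, assuming a reference solution is available as a list of reasoning steps. Each subproblem reveals a prefix of the reference chain and asks for the same checkable final answer, so fewer revealed steps means a harder problem. Training advances a stage once the policy's success rate clears a threshold. The prefix-hint construction, `make_subproblems`, `current_stage`, `reward`, and the 0.6 threshold are all hypothetical; SCRL's actual subproblem derivation and scheduler may differ.

```python
def make_subproblems(question, steps, answer):
    """Build subproblems from a reference chain, easiest first.

    The subproblem with hidden = h shows all but the last h reference steps
    and asks for the final answer, which stays verifiable by exact match.
    """
    subproblems = []
    for hidden in range(1, len(steps) + 1):
        shown = steps[: len(steps) - hidden]
        prompt = "\n".join([question, *shown])
        subproblems.append({"prompt": prompt, "answer": answer, "hidden": hidden})
    return subproblems


def current_stage(success_rates, threshold=0.6):
    """First curriculum stage whose recent success rate is still below threshold."""
    for stage, rate in enumerate(success_rates):
        if rate < threshold:
            return stage
    return len(success_rates) - 1   # everything mastered: stay on the hardest stage


def reward(response, answer):
    """Outcome reward on the (sub)problem's final answer."""
    return 1.0 if response.strip() == answer else 0.0


subs = make_subproblems("Q: c = ?", ["a = 2", "b = a + 3", "c = b * 2"], "10")
print([s["hidden"] for s in subs])        # [1, 2, 3]
print(current_stage([0.9, 0.7, 0.3]))     # 2
```

When the full problem (every step hidden) almost never succeeds, the easier stages still produce nonzero rewards. Those rewards carry signal about the later steps of the chain.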

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi · ByteDance

Unifies multimodal large language models and diffusion models through a division of labor: MLLMs perform semantic planning while diffusion models render pixels from high-level semantic guidance and low-level visual features, enabling controllable video generation with strong semantic grounding.

Key Findings

  • MLLMs and diffusion models can be unified through semantic planning plus pixel rendering

  • Latent semantic representations bridge the gap between language reasoning and visual generation

  • The division of labor lets each architecture family play to its strengths

video-generation · diffusion · multimodal · semantic-planning
0 upvotes

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

High Relevance

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li · Renmin University of China, Ant Group

Introduces a scalable data engine that reverse-engineers evaluation tasks from 80,870 real terminal recordings, producing 1,530 validated tasks spanning 18 categories and 1,280 unique commands, with a curated verified subset of 200 tasks for comprehensive agent evaluation.

Key Findings

  • Automated pipeline converts 80K real terminal recordings into 1,530 validated evaluation tasks

  • Tasks span 18 real-world categories, ranging from short operations to workflows of 50+ steps

  • Coverage of 1,280 unique commands provides breadth unavailable in manually crafted benchmarks

benchmarks · agents · terminal · evaluation
0 upvotes
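
The paper's data engine reverse-engineers tasks from recordings. The snippet below only sketches the bookkeeping around such a pipeline, under these assumptions:

- Each recording has already been reduced to a list of shell command lines.
- Sessions are filtered by step count and deduplicated.
- Breadth is measured as the set of distinct programs invoked.

`build_tasks`, `command_names`, the step bounds, and the operator-splitting regex are hypothetical simplifications. Quoting and subshells are ignored, so an operator inside quotes (as in `grep 'a|b'`) would be mis-split.

```python
import re

# Split on shell control operators; "||" must precede "|" in the alternation.
_OPERATORS = re.compile(r"\|\||&&|[|;]")


def command_names(line):
    """Program names invoked in one command line (simplified: ignores quoting)."""
    names = []
    for segment in _OPERATORS.split(line):
        words = segment.split()
        if words:
            names.append(words[0])
    return names


def build_tasks(sessions, min_steps=2, max_steps=60):
    """Filter and deduplicate recorded sessions; report distinct programs covered."""
    seen, tasks, programs = set(), [], set()
    for commands in sessions:
        key = tuple(commands)
        if key in seen or not min_steps <= len(commands) <= max_steps:
            continue
        seen.add(key)
        tasks.append(commands)
        for line in commands:
            programs.update(command_names(line))
    return tasks, sorted(programs)


sessions = [
    ["ls", "cat notes.txt | grep TODO"],
    ["ls", "cat notes.txt | grep TODO"],          # duplicate recording
    ["pwd"],                                       # too short
    ["git status", "git add . && git commit -m 'wip'"],
]
tasks, programs = build_tasks(sessions)
print(len(tasks), programs)   # 2 ['cat', 'git', 'grep', 'ls']
```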

pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang · Peking University, Microsoft Research

Evaluates whether personal assistant agents can identify and act on hidden intents — needs, constraints, and preferences that users leave unstated — in sustained long-horizon workflows, addressing a core challenge in proactive assistance that existing benchmarks overlook.

Key Findings

  • Existing benchmarks rarely evaluate proactive identification of unstated user needs

  • Long-horizon workflows amplify the importance of hidden intent detection

  • Proactive assistance requires different capabilities than reactive task completion

agents · personal-assistants · benchmarks · proactive-AI
0 upvotes

Trending Models (10)

DeepSeek V4 Pro

DeepSeek · text-generation · unknown

Latest flagship model from DeepSeek with 4M+ downloads, continuing the V4 architecture's dominance in the open-source LLM ecosystem for conversational and general text generation tasks.

conversational · text-generation · frontier
4.0M downloads · 4.1K likes

DeepSeek V4 Flash

DeepSeek · text-generation · unknown

Efficient variant of DeepSeek V4 optimized for faster inference while maintaining strong conversational and text generation capabilities, achieving 2.4M+ downloads.

conversational · text-generation · efficient
2.4M downloads · 1.2K likes

Anima

Circlestone Labs · image-generation · unknown

Diffusion model with 1,468 likes gaining strong traction in the generative image community, compatible with ComfyUI workflows.

diffusion · image-generation · comfyui
591.8K downloads · 1.5K likes

Sulphur-2-base

SulphurAI · text-to-video · unknown

Text-to-video model with over 1.1M downloads, available in both diffusers and GGUF formats, establishing itself as a leading open-source video generation model.

text-to-video · diffusers · gguf
1.2M downloads · 1.2K likes

MiniCPM-V-4.6

OpenBMB · image-text-to-text · unknown

Multimodal vision-language model with 196K downloads and 876 likes, continuing the MiniCPM-V series' strong performance in image-text understanding at efficient model sizes.

multimodal · vision-language · efficient
196.1K downloads · 876 likes

Lance

ByteDance Research · multimodal · unknown

Any-to-any multimodal model from ByteDance that supports image and video generation. It is rapidly gaining community attention, with 572 likes despite a low download count, suggesting strong interest from early adopters.

multimodal · image-generation · video-generation
739 downloads · 572 likes

Fara-7B

Microsoft · image-text-to-text · 7B

7B-parameter multimodal vision-language model from Microsoft built on the Qwen2.5-VL architecture, with 592 likes and 15K downloads for image-text understanding tasks.

multimodal · vision-language · qwen
15.2K downloads · 592 likes

Supertonic-3

Supertone · text-to-speech · unknown

Text-to-speech model with ONNX format support, achieving 535 likes and 34K downloads for high-quality speech synthesis applications.

tts · speech-synthesis · onnx
35.0K downloads · 535 likes

HiDream-O1-Image

HiDream AI · image-text-to-image · unknown

Vision-language model combining image understanding and image generation capabilities in a single architecture based on Qwen3-VL, with 417 likes and 21K downloads.

multimodal · image-generation · vision-language
21.6K downloads · 417 likes

Qwen3.6-27B-MTP-GGUF

Unsloth · text-generation · 27B

GGUF quantized version of Qwen3.6-27B with Multi-Token Prediction, enabling efficient local deployment with 478K downloads.

gguf · quantized · qwen · efficient-inference
478.5K downloads · 376 likes

Trending GitHub Repos (15)

codegraph

Pre-indexed code knowledge graph for AI coding agents (Claude Code, Codex, Cursor, OpenCode), reducing token usage and tool calls while keeping everything local. Leading today's GitHub trending with 4,294 daily stars.

code-knowledge-graph · ai-coding · developer-tools
TypeScript · 13.7K stars · +4.3K today · 781 forks
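
A pre-built index saves tool calls because questions become lookups. The toy, Python-only call graph below is built with the standard `ast` module. Once it is indexed, answering "who calls `load`?" is a dictionary lookup rather than a series of grep and file-read calls. This is a hypothetical sketch, not codegraph's implementation, which targets multiple languages and richer relations. It only tracks plain-name calls such as `f(x)`, and a nested function's calls are also attributed to its enclosing function.

```python
import ast
from collections import defaultdict


def index_calls(source):
    """Map each function name in a module to the plain names it calls."""
    calls = defaultdict(set)
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for node in ast.walk(fn):
                # Only direct calls like f(x); method calls obj.m(x) are skipped.
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    calls[fn.name].add(node.func.id)
    return calls


def callers_of(calls, target):
    """Functions that call `target`, sorted by name."""
    return sorted(fn for fn, callees in calls.items() if target in callees)


SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    print(parse("data.txt"))
'''

graph = index_calls(SOURCE)
print(callers_of(graph, "load"))    # ['parse']
print(callers_of(graph, "parse"))   # ['main']
```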

andrej-karpathy-skills

A single CLAUDE.md file derived from Andrej Karpathy's observations on LLM coding pitfalls, rapidly adopted as best-practice guidance for Claude Code agents. 143K total stars.

ai-coding · best-practices · claude-code
143.3K stars · +2.6K today · 14.7K forks

academic-research-skills

Academic research workflow skills for Claude Code covering the full pipeline from research to writing, review, revision, and finalization. 2,579 daily stars.

academic-research · ai-writing · claude-code
Python · 18.2K stars · +2.6K today · 1.6K forks

hermes-agent

Nous Research's personal AI agent platform with 161K total stars and 2,056 daily stars, positioning itself as the leading open-source personal agent framework.

agents · personal-ai · open-source
Python · 161.6K stars · +2.1K today · 26.3K forks

superpowers

Agentic skills framework and software development methodology with 201K total stars, providing structured approaches to AI-assisted development.

agentic-skills · development-methodology · ai-assisted-dev
Shell · 201.6K stars · +1.6K today · 18.0K forks

Comprehensive learning resource for AI engineering covering the full stack from foundations to deployment, gaining 1,333 daily stars.

ai-engineering · education · full-stack
Python · 10.8K stars · +1.3K today · 2.1K forks

Cross-platform Electron desktop app for streaming and downloading media content with zero ads, gaining 1,094 daily stars.

electron · streaming · desktop-app
JavaScript · 4.0K stars · +1.1K today · 315 forks

agency-agents

Complete AI agency framework with specialized expert agents for different domains, from frontend development to community management, each with defined processes and deliverables. 103K total stars.

ai-agents · specialized-agents · agency
Shell · 103.7K stars · +1.0K today · 17.1K forks

Curated collection of inspiring lists, manuals, cheatsheets, and developer tools with 222K total stars, a perennial trending resource.

developer-resources · cheatsheets · reference
222.5K stars · +756 today · 13.3K forks

Free, open-source, self-hosted WhatsApp API gateway gaining 730 daily stars, enabling programmatic WhatsApp integration.

whatsapp · api-gateway · messaging
TypeScript · 5.4K stars · +730 today · 1.1K forks

claude-plugins-official

Anthropic's official directory of high-quality Claude Code plugins, gaining 682 daily stars as the plugin ecosystem matures.

claude-code · plugins · ecosystem
Python · 22.6K stars · +682 today · 2.6K forks

Converts code into interactive knowledge graphs for exploration, search, and Q&A — compatible with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. 666 daily stars.

knowledge-graph · code-understanding · visualization
TypeScript · 16.7K stars · +666 today · 1.6K forks

Aims to make any software agent-native through CLI interfaces, with 39K total stars. Part of HKUDS's agent-native software initiative.

cli · agent-native · software-tools
Python · 39.2K stars · +656 today · 3.7K forks

ViMax

High Relevance

Agentic video generation system that combines Director, Screenwriter, Producer, and Video Generator roles in one framework. 537 daily stars.

video-generation · agentic · creative-ai
Python · 6.5K stars · +537 today · 1.0K forks

Open-source managed agents platform that turns coding agents into real teammates with task assignment, progress tracking, and skill compounding. 534 daily stars.

managed-agents · agent-platform · task-management
Go · 30.8K stars · +534 today · 3.7K forks

Sources Checked

03:00 PM UTC