Friday, May 22, 2026
Agent trajectory compilation (ACC) opens new long-context training paradigm; Gated DeltaNet-2 decouples linear attention memory editing; code knowledge graphs and agentic skills frameworks explode on GitHub
Executive Summary
Today's research centers on improving how LLMs learn from agent interactions and how efficient attention mechanisms manage compressed memory. The standout paper is ACC (Agent Trajectory Compilation), which reframes the massive trajectories produced by tool-using agents as a natural source of long-context training data, an elegant insight that sidesteps the cost of synthetic long-document curation. Gated DeltaNet-2 tackles a fundamental tension in linear attention, namely how to edit a fixed-size recurrent state without corrupting existing associations, by introducing decoupled erase and write gates that improve on the channel-wise decay of KDA (Kimi Delta Attention).
The agent and benchmark space continues to mature rapidly. TerminalWorld introduces a scalable data engine that reverse-engineers evaluation tasks from real terminal recordings, yielding 1,530 validated tasks across 18 categories. Spreadsheet-RL applies reinforcement learning to spreadsheet automation, and pi-Bench evaluates proactive personal assistants on hidden-intent detection. Meanwhile, SCRL introduces curriculum RL with verifiable subproblems to solve credit assignment in reasoning.
On the model front, DeepSeek V4 continues its dominance with Pro (4M+ downloads) and Flash (2.4M+), while ByteDance's Lance and OpenBMB's MiniCPM-V-4.6 push multimodal boundaries. GitHub is dominated by the agentic coding revolution: codegraph (4,294 stars today), andrej-karpathy-skills (2,614 stars today), and superpowers (1,576 stars today) reflect massive developer appetite for AI-assisted development tooling.
Researcher Notes
ACC's insight that agent trajectories are natural long-context training data is deceptively simple but potentially transformative. The observation that tool-using agents scatter evidence across many turns — requiring integration of distant context segments — mirrors exactly the kind of long-range reasoning we want LLMs to learn. Rather than expensive synthetic data curation, this approach harvests training signal from the very process of agents doing useful work. Watch for follow-up work applying this to code agents, where trajectories are especially rich.
Gated DeltaNet-2 represents a meaningful step toward practical linear attention. The core problem — that delta-rule models use a single scalar gate for both erasing and writing, causing one operation to inadvertently distort the other — is well-characterized here. Decoupling these operations into separate gating mechanisms is the kind of surgical architectural improvement that compounds across scale. The connection to KDA (Kimi Delta Attention) and its channel-wise decay suggests this line of work is converging on a practical alternative to full softmax attention for long sequences.
The benchmarking wave deserves attention for what it reveals about field maturity. TerminalWorld (1,530 tasks from 80K real recordings), Spreadsheet-RL (realistic spreadsheet automation), and pi-Bench (proactive assistant evaluation) all share a common philosophy: evaluation grounded in real-world usage patterns rather than synthetic tasks. This shift from 'can the model do X on a toy problem' to 'can the model handle the messy reality of X' is a leading indicator of practical deployment readiness.
The GitHub trending data tells the story of 2026: agentic coding has gone mainstream. codegraph's 4,294 daily stars for pre-indexed code knowledge graphs, combined with andrej-karpathy-skills' 2,614 daily stars for a single CLAUDE.md best-practices file, shows that developers are now optimizing their workflow around AI coding agents rather than treating them as novelties. The emergence of hermes-agent at 161K total stars and agency-agents at 103K total stars suggests agent orchestration platforms are becoming core infrastructure.
Sleeper hit: WorldKV for persistent video world generation. While the engagement numbers are modest, the problem of maintaining consistent world state across long video rollouts — where revisiting a viewpoint should yield the same content — is fundamental to real-time interactive applications. The retrieval-and-compression approach to KV-cache management could have implications well beyond video generation.
Themes & Trends
Agent Training from Trajectories
Rising · ACC demonstrates that agent trajectories provide natural long-context training data, while SCRL introduces curriculum RL for credit assignment; both advance how we train models from agent interactions.
Efficient Attention and Memory
Rising · Gated DeltaNet-2 and WorldKV both address the fundamental challenge of managing compressed memory in sequence models: the former decouples erase and write operations in linear attention, and the latter applies retrieval and compression to the KV cache of video world models.
Agent Benchmarks and Real-World Evaluation
Rising · TerminalWorld, Spreadsheet-RL, and pi-Bench all push evaluation toward real-world usage patterns rather than synthetic tasks, reflecting growing demand for ecologically valid agent assessment.
Agentic Coding Tools Ecosystem
Rising · GitHub trending is dominated by AI coding tools, with codegraph, andrej-karpathy-skills, academic-research-skills, superpowers, and claude-plugins-official collectively gaining 10K+ daily stars, signaling mainstream adoption of AI-assisted development.
Video Generation and World Models
Stable · Bernini's semantic planning for video diffusion, WorldKV's persistent world generation, and ViMax's agentic video generation framework show continued strong momentum in controllable video synthesis.
Trending Papers (10)
ACC: Compiling Agent Trajectories for Long-Context Training
High Relevance · Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao · Tsinghua University, ByteDance
Proposes using the massive trajectories produced by tool-using agents as a natural source of long-context training data for LLMs. Agent trajectories scatter evidence across many turns of tool invocation and environment observation, requiring integration of distant context segments — exactly the capacity long-context training aims to develop.
Key Findings
- Agent trajectories provide naturally structured long-context training data without expensive manual curation
- Evidence scattered across multi-turn tool interactions requires long-range context integration
- The compilation approach sidesteps heuristic context synthesis methods used in prior work
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
High Relevance · Ali Hatamizadeh, Yejin Choi, Jan Kautz · NVIDIA, University of Washington
Addresses a fundamental limitation in delta-rule linear attention models where a single scalar gate controls both erasing and writing to the compressed recurrent state. By decoupling these operations into separate gating mechanisms, the model avoids the interference where one operation scrambles the other's associations.
Key Findings
- Single-gate delta-rule models suffer from erase-write interference in compressed memory
- Decoupled gating mechanisms allow independent control of memory erasure and new value writing
- Improves upon KDA's channel-wise decay approach for managing the fixed-size recurrent state
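The erase-write interference described above can be illustrated with a minimal sketch in plain Python. It assumes the textbook delta-rule update S <- S - beta * (S k - v) k^T with a unit-norm key and leaves out any decay term. The decoupled form with separate `erase` and `write` scalars, and all function names, are illustrative assumptions and not the paper's actual formulation.

```python
def read(S, k):
    """Value the state S (d_v x d_k, list of rows) returns for key k: S @ k."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def decoupled_update(S, k, v, erase, write):
    """Assumed decoupled form: S <- S - erase * (S k) k^T + write * v k^T."""
    old = read(S, k)
    return [
        [S[i][j] - erase * old[i] * k[j] + write * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(S))
    ]


def coupled_update(S, k, v, beta):
    """Delta rule with one gate: beta scales both the erase and the write."""
    return decoupled_update(S, k, v, erase=beta, write=beta)


key = [1.0, 0.0]                      # unit-norm key (assumption)
state = [[0.0, 0.0], [0.0, 0.0]]
state = coupled_update(state, key, [1.0, 2.0], beta=1.0)  # store [1, 2]

# Overwrite the same key with [4, 4] at half write strength.
shared = coupled_update(state, key, [4.0, 4.0], beta=0.5)
split = decoupled_update(state, key, [4.0, 4.0], erase=1.0, write=0.5)

print(read(shared, key))  # [2.5, 3.0]: half of the old value survives
print(read(split, key))   # [2.0, 2.0]: old value fully erased
```

With a shared gate of 0.5, half of the old value [1, 2] leaks into the read-out. Separate gates let the state erase the old association completely while still writing the new value at half strength.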
WorldKV: Efficient World Memory with World Retrieval and Compression
High Relevance · Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun · NAVER AI Lab, Korea Advanced Institute of Science and Technology
Proposes a retrieval-and-compression approach to KV-cache management for autoregressive video diffusion models, enabling persistent world generation where revisiting previously seen viewpoints yields consistent content without breaking real-time constraints.
Key Findings
- Full KV-cache attention preserves world consistency but memory and compute grow linearly with rollout length
- Sliding window inference restores throughput but sacrifices long-term consistency
- WorldKV combines retrieval and compression to maintain both consistency and real-time performance
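The trade-off in the findings above can be sketched as a toy cache policy. The summary does not say how WorldKV retrieves or compresses, so everything here is an assumption made for illustration: older entries are compressed by averaging fixed-size chunks, retrieval is a dot-product top-k against the current query, and the names `compress` and `build_context` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def compress(entries, chunk):
    """Average each run of `chunk` consecutive (key, value) pairs into one pair."""
    out = []
    for i in range(0, len(entries), chunk):
        group = entries[i:i + chunk]
        out.append((mean_vec([k for k, _ in group]),
                    mean_vec([v for _, v in group])))
    return out


def build_context(cache, query, window=2, chunk=2, top_k=1):
    """Keep the last `window` entries (window >= 1) plus the top_k best-matching
    compressed older entries, so the attended context has a fixed size."""
    recent = cache[-window:]
    older = compress(cache[:-window], chunk)
    retrieved = sorted(older, key=lambda kv: dot(kv[0], query), reverse=True)[:top_k]
    return retrieved + recent


cache = [
    ([1.0, 0.0], [1.0]), ([1.0, 0.0], [3.0]),   # early viewpoint
    ([0.0, 1.0], [5.0]), ([0.0, 1.0], [7.0]),   # a different viewpoint
    ([0.0, 0.0], [9.0]), ([0.0, 0.0], [11.0]),  # most recent frames
]
context = build_context(cache, query=[1.0, 0.0])
print(len(context))  # 3
print(context[0])    # ([1.0, 0.0], [2.0])
```

In this sketch the attended context stays at `top_k + window` entries no matter how long the rollout gets, while a query that matches the early viewpoint still recovers a summary of it.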
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang — Zhejiang University, Alibaba Group
Applies reinforcement learning to train LLM agents for realistic spreadsheet automation tasks. Addresses limitations of specialized prompting approaches that struggle with complex multi-step spreadsheet operations beyond simple cell manipulation.
Key Findings
- Specialized prompting over general-purpose LLMs fails on complex spreadsheet operations
- RL training enables agents to learn multi-step spreadsheet manipulation strategies
- Bridges the gap between toy spreadsheet tasks and real-world data-centric workflows
Swift Sampling: Selecting Temporal Surprises via Taylor Series
Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian — Indian Institute of Technology Hyderabad, Samsung Research
Introduces a training-free frame selection algorithm inspired by the brain's predictive coding that identifies high-information moments in long-form video by modeling it as a differentiable trajectory in visual latent space and computing velocity-based surprise scores.
Key Findings
- Most frames in long-form video are redundant; critical information resides in temporal surprises
- Taylor series-based velocity computation identifies moments where visual features deviate from predicted evolution
- Training-free approach requires no task-specific fine-tuning for frame selection
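One way to read the velocity-based surprise score is as the distance between a frame's features and a first-order Taylor extrapolation from the two preceding frames. The sketch below assumes exactly that; the paper may use higher-order terms, a different norm, or a different selection rule, and the function names are hypothetical.

```python
import math


def surprise_scores(features):
    """Score frame t by its distance from the extrapolation f[t-1] + (f[t-1] - f[t-2])."""
    scores = [0.0] * len(features)
    for t in range(2, len(features)):
        prev2, prev1, cur = features[t - 2], features[t - 1], features[t]
        predicted = [b + (b - a) for a, b in zip(prev2, prev1)]
        scores[t] = math.dist(cur, predicted)
    return scores


def select_frames(features, k):
    """Indices of the k most surprising frames, returned in temporal order."""
    scores = surprise_scores(features)
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]
    return sorted(top)


# One-dimensional toy features: steady motion, a jump at frame 4, then steady again.
features = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
print(surprise_scores(features))   # [0.0, 0.0, 0.0, 0.0, 6.0, 6.0, 0.0]
print(select_frames(features, 2))  # [4, 5]
```

Frames that continue the current motion score zero, so only the jump and the frame right after it (where the extrapolated velocity is still wrong) are selected. No training is involved, which matches the training-free framing above.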
Diversed Model Discovery via Structured Table Discovery
Zhengyuan Dong, Renée J. Miller — Northeastern University
Argues that model search is inherently comparative and proposes leveraging structured artifacts from model cards — performance tables, configuration data, dataset metadata — to produce diverse, differentiated model recommendations beyond what text-based semantic similarity can achieve.
Key Findings
- Text-based model search produces homogeneous results due to semantic similarity clustering
- Structured table artifacts in model cards capture differentiation dimensions text misses
- Comparative model search requires balancing task alignment with measurable differentiation
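The balance between task alignment and differentiation can be illustrated with maximal marginal relevance (MMR), a standard diversification heuristic. MMR is swapped in here purely as an illustration and is not the paper's method; the relevance and similarity numbers are made up, and in the paper's setting the similarity would presumably come from structured model-card tables.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedy MMR: trade relevance against redundancy with already selected items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical models: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
pair_sim = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}


def similarity(x, y):
    return pair_sim[frozenset((x, y))]


print(mmr_select(["a", "b", "c"], relevance, similarity, k=2, lam=1.0))  # ['a', 'b']
print(mmr_select(["a", "b", "c"], relevance, similarity, k=2, lam=0.5))  # ['a', 'c']
```

With `lam=1.0` the ranking is pure relevance and returns two near-identical models. Lowering `lam` penalizes redundancy, so the second slot goes to the less relevant but clearly different model.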
From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
High Relevance · Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang · Shanghai Jiao Tong University, Tsinghua University
Introduces SCRL, a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and uses progressive difficulty scheduling to solve the credit assignment problem in outcome-based RLVR, where correct final-answer rollouts are too rare for efficient learning on hard problems.
Key Findings
- Outcome-based RLVR is inefficient on hard problems because correct final-answer rollouts are rare
- Decomposing problems into verifiable subproblems enables partial credit assignment from failed attempts
- Curriculum scheduling from easy to hard subproblems improves sample efficiency
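The contrast between outcome-only reward and subproblem-level credit can be made concrete with a small sketch. It assumes subproblem answers are checked by exact match against a reference chain and that the curriculum simply grows the number of leading subproblems linearly over training; both choices, and all function names, are hypothetical rather than taken from SCRL.

```python
def outcome_reward(answers, reference):
    """Outcome-only signal: 1.0 only if the final answer matches."""
    return 1.0 if answers[-1] == reference[-1] else 0.0


def subproblem_reward(answers, reference):
    """Partial credit: fraction of subproblem answers that match exactly."""
    hits = sum(a == r for a, r in zip(answers, reference))
    return hits / len(reference)


def curriculum_depth(step, total_steps, n_sub):
    """Linear easy-to-hard schedule: how many leading subproblems to train on."""
    return min(n_sub, 1 + step * n_sub // total_steps)


reference = ["4", "12", "36"]   # answers to three chained subproblems
rollout = ["4", "12", "35"]     # right until the last step

print(outcome_reward(rollout, reference))               # 0.0
print(round(subproblem_reward(rollout, reference), 2))  # 0.67
print([curriculum_depth(s, 100, 3) for s in (0, 50, 99)])  # [1, 2, 3]
```

The rollout earns nothing under the outcome-only signal even though two of its three steps are correct, which is the sparse-reward problem the findings describe. Subproblem checks turn that failed attempt into a usable training signal.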
Bernini: Latent Semantic Planning for Video Diffusion
Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi — ByteDance
Unifies multimodal large language models and diffusion models through a division of labor: MLLMs perform semantic planning while diffusion models render pixels from high-level semantic guidance and low-level visual features, enabling controllable video generation with strong semantic grounding.
Key Findings
- MLLMs and diffusion models can be unified through semantic planning plus pixel rendering
- Latent semantic representations bridge the gap between language reasoning and visual generation
- The division of labor leverages each architecture family's strengths without compromise
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
High Relevance · Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li · Renmin University of China, Ant Group
Introduces a scalable data engine that reverse-engineers evaluation tasks from 80,870 real terminal recordings, producing 1,530 validated tasks spanning 18 categories and 1,280 unique commands, with a curated verified subset of 200 tasks for comprehensive agent evaluation.
Key Findings
- Automated pipeline converts 80K real terminal recordings into 1,530 validated evaluation tasks
- Tasks span 18 real-world categories from short operations to 50+ step workflows
- Coverage of 1,280 unique commands provides breadth unavailable in manually crafted benchmarks
pi-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang — Peking University, Microsoft Research
Evaluates whether personal assistant agents can identify and act on hidden intents — needs, constraints, and preferences that users leave unstated — in sustained long-horizon workflows, addressing a core challenge in proactive assistance that existing benchmarks overlook.
Key Findings
- Existing benchmarks rarely evaluate proactive identification of unstated user needs
- Long-horizon workflows amplify the importance of hidden intent detection
- Proactive assistance requires different capabilities than reactive task completion
Trending Models (10)
DeepSeek · text-generation · unknown
DeepSeek V4 Pro: latest flagship model from DeepSeek with 4M+ downloads, continuing the V4 architecture's dominance in the open-source LLM ecosystem for conversational and general text generation tasks.
DeepSeek · text-generation · unknown
DeepSeek V4 Flash: efficient variant of DeepSeek V4 optimized for faster inference while maintaining strong conversational and text generation capabilities, with 2.4M+ downloads.
Circlestone Labs · image-generation · unknown
Diffusion model with 1,468 likes gaining strong traction in the generative image community, compatible with ComfyUI workflows.
SulphurAI · text-to-video · unknown
Text-to-video model with over 1.1M downloads, available in both diffusers and GGUF formats, establishing itself as a leading open-source video generation model.
OpenBMB · image-text-to-text · unknown
MiniCPM-V-4.6: multimodal vision-language model with 196K downloads and 876 likes, continuing the MiniCPM-V series' strong performance in image-text understanding at efficient model sizes.
ByteDance Research · multimodal · unknown
Lance: any-to-any multimodal model from ByteDance supporting image and video generation. Its 572 likes against a relatively low download count suggest strong interest from early adopters.
Microsoft · image-text-to-text · 7B
7B-parameter multimodal vision-language model from Microsoft for image-text understanding, built on the Qwen2.5-VL architecture, with 592 likes and 15K downloads.
Supertone · text-to-speech · unknown
Text-to-speech model with ONNX format support, achieving 535 likes and 34K downloads for high-quality speech synthesis applications.
HiDream AI · image-text-to-image · unknown
Vision-language model combining image understanding and image generation capabilities in a single architecture based on Qwen3-VL, with 417 likes and 21K downloads.
Unsloth · text-generation · 27B
GGUF quantized version of Qwen3.6-27B with Multi-Token Prediction, enabling efficient local deployment with 478K downloads.
Trending GitHub Repos (15)
codegraph: pre-indexed code knowledge graph for AI coding agents (Claude Code, Codex, Cursor, OpenCode), reducing token usage and tool calls while keeping everything local. Leading today's GitHub trending with 4,294 daily stars.
andrej-karpathy-skills: a single CLAUDE.md file derived from Andrej Karpathy's observations on LLM coding pitfalls, rapidly adopted as best-practice guidance for Claude Code agents. 143K total stars, 2,614 daily stars.
academic-research-skills: academic research workflow skills for Claude Code covering the full pipeline from research to writing, review, revision, and finalization. 2,579 daily stars.
hermes-agent: Nous Research's personal AI agent platform with 161K total stars and 2,056 daily stars, positioning itself as the leading open-source personal agent framework.
superpowers: agentic skills framework and software development methodology with 201K total stars and 1,576 daily stars, providing structured approaches to AI-assisted development.
Comprehensive learning resource for AI engineering covering the full stack from foundations to deployment, gaining 1,333 daily stars.
Cross-platform Electron desktop app for streaming and downloading media content with zero ads, gaining 1,094 daily stars.
agency-agents: complete AI agency framework with specialized expert agents for different domains, from frontend development to community management, each with defined processes and deliverables. 103K total stars.
Curated collection of inspiring lists, manuals, cheatsheets, and developer tools with 222K total stars, a perennial trending resource.
Free, open-source, self-hosted WhatsApp API gateway gaining 730 daily stars, enabling programmatic WhatsApp integration.
claude-plugins-official: Anthropic's official directory of high-quality Claude Code plugins, gaining 682 daily stars as the plugin ecosystem matures.
Converts code into interactive knowledge graphs for exploration, search, and Q&A — compatible with Claude Code, Codex, Cursor, Copilot, and Gemini CLI. 666 daily stars.
Makes all software agent-native through CLI interfaces, with 39K total stars. Part of HKUDS's agent-native software initiative.
ViMax: agentic video generation system that combines Director, Screenwriter, Producer, and Video Generator roles in one framework. 537 daily stars.
Open-source managed agents platform that turns coding agents into real teammates with task assignment, progress tracking, and skill compounding. 534 daily stars.