Wednesday, April 8, 2026
In-Place Test-Time Training enables LLMs to adapt during inference; Polynomial Mixer achieves linear-time attention replacement; Gym-Anything turns any software into an agent environment
Executive Summary
April 8th delivers a strong showing in adaptive inference and efficient architectures. The headline paper, In-Place Test-Time Training, breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly addressing the long-context performance ceiling that plagues fixed-weight models. This joins yesterday's test-time scaling work to form a clear two-day trend: the field is converging on inference as a first-class optimization target, not just a cost center.
The Polynomial Mixer (PoM) offers a mathematically rigorous linear-time replacement for attention that provably preserves the universal approximation properties of transformers. Unlike previous linear attention approximations that sacrifice expressivity, PoM satisfies the contextual mapping property — a theoretical guarantee that could finally make sub-quadratic transformers viable for production workloads. Meanwhile, Gym-Anything automates environment creation for computer-use agents, producing 10K+ long-horizon tasks across occupational domains — a critical infrastructure contribution as the agent ecosystem matures.
The model landscape sees NousResearch/hermes-agent explode to 3,009 stars/day on GitHub, dwarfing all other repos. NVIDIA enters the agent space with PersonaPlex and DataDesigner, while Hindsight from Vectorize introduces learning agent memory — signals that agent infrastructure is becoming the dominant category in open-source AI tooling.
Researcher Notes
In-Place Test-Time Training is the most architecturally ambitious paper today. The core idea — allowing LLMs to modify their own parameters at inference time — directly addresses the fundamental limitation that models are frozen after training. While test-time compute scaling (more tokens at inference) has been the dominant paradigm, test-time training (weight updates at inference) is a qualitatively different capability. The connection to yesterday's T^2 scaling laws paper is direct: if inference is now an optimization target, then the boundary between training and inference is dissolving. Watch for rapid follow-up work combining both approaches.
The Polynomial Mixer deserves more attention than it will probably get. PoM's proof that it satisfies the contextual mapping property while maintaining linear complexity is the strongest theoretical result for efficient attention alternatives in recent memory. Previous linear attention schemes (Mamba, RWKV, etc.) traded theoretical guarantees for empirical performance; PoM keeps both. The paper comes from David Picard's group, which has a strong track record in vision architectures. The immediate question: does the theoretical guarantee translate to practical gains at scale, or is there a constant-factor penalty that makes it uncompetitive with FlashAttention?
The agent evaluation crisis is becoming acute. Three papers today — Claw-Eval, ACE-Bench, and Gym-Anything — all address the same problem from different angles: we cannot reliably evaluate autonomous agents. Claw-Eval records full execution trajectories, ACE-Bench provides controllable difficulty scaling, and Gym-Anything generates environments automatically. The fact that three independent teams are building evaluation infrastructure simultaneously signals that the community recognizes agent benchmarking as a critical bottleneck. The contrast with yesterday's SimpleStream result (simple baseline beats 13 complex methods) suggests current agent benchmarks may face the same reckoning.
HaloProbe's Bayesian approach to hallucination detection is the sleeper hit. Rather than treating hallucinations as classification problems, it decomposes description statistics into factorized probabilities — a fundamentally more principled approach. The paper targets vision-language models specifically, but the statistical framework could generalize to text-only hallucination detection. At a time when every VLM vendor claims low hallucination rates, principled detection methods that don't rely on the model's own confidence are increasingly valuable.
The GitHub trending data tells a clear story: agent infrastructure is eating the world. NousResearch/hermes-agent at 3,009 stars/day is the highest single-day gain we've tracked. Vectorize's Hindsight (agent memory that learns), NVIDIA's DataDesigner (synthetic data for agents), and HKUDS's AutoAgent (zero-code agent framework) all reinforce the same trend. The interesting signal is the diversity of agent tooling: memory, evaluation, persona management, data generation, and framework construction are all simultaneously trending. This is infrastructure build-out, not hype — these are the tools builders actually need.
Themes & Trends
Test-Time Adaptation
risingThe boundary between training and inference is dissolving, with papers on in-place test-time training and target policy optimization showing that inference is becoming a first-class optimization target.
Efficient Architecture Alternatives
risingThe Polynomial Mixer provides the strongest theoretical guarantee yet for linear-time attention replacement, joining the ongoing race to make sub-quadratic transformers production-ready.
Agent Evaluation Crisis
risingThree independent papers tackle agent evaluation from different angles — trajectory recording, configurable difficulty, and automated environment generation — signaling community recognition of a critical bottleneck.
LLM Safety and Alignment
stableExclusive unlearning inverts the safety paradigm (keep-only vs delete-specific), while constrained decoding snowballing reveals hidden alignment taxes in structured output generation.
Agent Infrastructure Build-Out
risingGitHub trending is dominated by agent tooling: frameworks (hermes-agent), memory (hindsight), personas (personaplex), data (DataDesigner), and evaluation — the full agent stack is being built simultaneously.
Trending Papers (14)
In-Place Test-Time Training
High RelevanceGuhao Feng, Shengjie Luo, Kai Hua, et al. — Tsinghua University, Microsoft Research
Breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly targeting improved performance on long contexts and distribution shifts without retraining.
Key Findings
- •
LLMs can update parameters in-place during inference for dynamic adaptation
- •
Significant performance improvements on long-context tasks compared to frozen-weight models
- •
Framework addresses distribution shift without requiring access to original training data
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
High RelevanceDavid Picard, Nicolas Dufour, Lucas Degeorge, et al. — ENPC, Valeo.ai
Introduces the Polynomial Mixer, a novel token mixing mechanism with linear complexity that provably satisfies the contextual mapping property, maintaining transformer universality while eliminating quadratic attention cost.
Key Findings
- •
PoM satisfies the contextual mapping property — the first linear-time method with this guarantee
- •
Maintains universal approximation capabilities of full attention transformers
- •
Achieves competitive performance with significantly reduced computational cost
Gym-Anything: Turn any Software into an Agent Environment
High RelevancePranjal Aggarwal, Graham Neubig, Sean Welleck — Carnegie Mellon University
Frames environment creation for computer-use agents as a multi-agent task, automatically producing 10K+ long-horizon tasks across diverse occupational domains from arbitrary software.
Key Findings
- •
Automated environment creation produces 10K+ long-horizon tasks from arbitrary software
- •
Multi-agent task framing enables scalable environment generation without manual annotation
- •
Tasks span diverse occupational domains, providing realistic evaluation for computer-use agents
HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
High RelevanceReihaneh Zohrabi, Hosein Hasani, Akshita Gupta, et al. — University of Alberta, Amii
Presents a Bayesian framework that factorizes description statistics to detect and mitigate object hallucinations in vision-language models, offering a principled alternative to classification-based approaches.
Key Findings
- •
Factorized Bayesian statistics detect hallucination probabilities without relying on model confidence
- •
Framework enables both detection and mitigation of object hallucinations in VLMs
- •
Outperforms existing hallucination detection methods across multiple VLM architectures
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
High RelevanceBowen Ye, Rang Li, Qibin Yang, et al. — Zhejiang University, Alibaba Group
Introduces a comprehensive evaluation suite with 300 tasks recording full execution trajectories — including audit logs and environment snapshots — for trustworthy assessment of autonomous LLM agents.
Key Findings
- •
300 tasks with full trajectory recording across execution traces, audit logs, and snapshots
- •
Reveals significant gaps between task completion rates and execution quality in current agents
- •
Trajectory-level evaluation catches failure modes invisible to outcome-only metrics
Action Images: End-to-End Policy Learning via Multiview Video Generation
High RelevanceHaoyu Zhen, Zixian Gao, Qiao Sun, et al. — Tsinghua University, Shanghai AI Laboratory
Formulates robot policy learning through multiview video generation with pixel-grounded action representations, enabling end-to-end policy learning that bridges perception and control.
Key Findings
- •
Pixel-grounded action representations enable direct policy extraction from generated videos
- •
Multiview generation provides spatial consistency critical for real-world robot deployment
- •
End-to-end approach eliminates the need for separate perception and planning pipelines
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
High RelevanceKomal Kumar, Aman Chadha, Salman Khan, et al. — MBZUAI, Stanford University, Amazon
Introduces an open-source multi-agent system with discovery and analysis pipelines for academic literature, addressing the challenge of efficient research synthesis at scale.
Key Findings
- •
Multi-agent architecture separates discovery from analysis for efficient research workflows
- •
Open-source framework enables reproducible and extensible research automation
- •
Outperforms single-agent approaches on literature review quality metrics
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
High RelevanceQimin Zhong, Hao Liao, Haiming Qin, et al. — Peking University, ByteDance
Analyzes multi-token prediction gradient bias in world models and proposes anchoring predictions to ground-truth trajectories for improved consistency, contributing to the debate on whether LLMs develop coherent internal world models.
Key Findings
- •
Multi-token prediction introduces gradient bias that degrades world model consistency
- •
Anchoring to ground-truth trajectories corrects drift in sequential predictions
- •
Latent semantic enhancement improves the coherence of learned internal representations
Exclusive Unlearning
High RelevanceMutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, et al. — University of Tokyo, RIKEN
Proposes a novel machine unlearning approach that removes broad categories of harmful content by forgetting everything except desired knowledge domains, inverting the typical targeted-deletion paradigm.
Key Findings
- •
Exclusive unlearning (keep-only) is more effective than inclusive unlearning (delete-specific) for safety
- •
Approach scales better to unknown harmful content categories than enumeration-based methods
- •
Maintains model utility on retained knowledge domains while broadly removing harmful capabilities
Target Policy Optimization
High RelevanceJean Kaddour — Google DeepMind
Separates target distribution construction from parameter updates in RL for language models, demonstrating improved performance on sparse reward tasks by decoupling these traditionally entangled components.
Key Findings
- •
Decoupling target distribution from parameter updates improves sparse reward optimization
- •
Cleaner theoretical framework than PPO/DPO for RLHF by separating what-to-optimize from how-to-optimize
- •
Achieves state-of-the-art on sparse reward benchmarks with simpler training dynamics
Artificial Intelligence and the Structure of Mathematics
High RelevanceMaissam Barkeshli, Michael R. Douglas, Michael H. Freedman — University of Maryland, Harvard University, Microsoft Research
Discusses how AI may reveal the global structure of formal proofs and enable mathematical discovery, authored by Fields Medal-level mathematicians including Michael Freedman.
Key Findings
- •
AI could reveal hidden structural patterns in the space of formal mathematical proofs
- •
Automated proof systems may enable discovery of connections between distant mathematical domains
- •
The paper outlines concrete paths for AI-assisted mathematical research beyond theorem proving
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
High RelevanceHongxu Zhou — Independent Researcher
Reveals that constrained decoding in LLM self-correction triggers 'structure snowballing' rather than improving reflection, exposing a hidden alignment tax in structured output generation.
Key Findings
- •
Constrained decoding triggers structure snowballing that compounds errors rather than correcting them
- •
Self-correction mechanisms fail under constrained output formats due to cascading structural commitments
- •
Identifies a fundamental tension between structured output requirements and genuine model reflection
ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty
Wang Yang, Chaoda Song, Xinpeng Li, et al. — Chinese Academy of Sciences, University of Chinese Academy of Sciences
Proposes a unified grid-based planning framework for agent evaluation with fine-grained control over task horizon and difficulty, addressing the high overhead and limited configurability of existing agent benchmarks.
Key Findings
- •
Grid-based planning tasks enable continuous difficulty scaling for agent evaluation
- •
Controllable horizon length isolates planning capability from task-specific knowledge
- •
Lightweight environments dramatically reduce the cost of large-scale agent benchmarking
Trending Models (10)
NousResearch · agent-framework · Various
NousResearch's agent framework that grows with users has exploded to 3,009 stars/day on GitHub, representing the fastest-growing AI agent project tracked. Model-native agent design from NousResearch's deep open-weight expertise.
Alibaba · text-generation · MoE + linear attention
Alibaba's latest release featuring 1M context window, 65K output tokens, and always-on chain-of-thought reasoning. Beats Claude Opus on Terminal-Bench 2.0 (61.6 vs 59.3) and available as free preview on OpenRouter.
Google · image-text-to-text · 31B
Google's flagship 31B dense Gemma-4 instruction-tuned model continues strong trending with 678k downloads and 1,158 likes. Apache 2.0 license makes it the first Google model with fully permissive enterprise licensing.
Jackrong (Community) · text-generation · 27B
Community-built Qwen3.5-27B distilled from Claude Opus reasoning outputs continues massive traction with 2,403 likes and 548k downloads, representing the pinnacle of closed-to-open capability transfer.
NVIDIA · data-generation · N/A
NVIDIA's synthetic data generation tool for creating high-quality training data from scratch or seed data, trending at 244 stars/day as enterprises seek data-centric AI approaches.
Zhipu AI · text-generation · 744B (40B active)
Zhipu AI's frontier reasoning model with 744B total / 40B active parameters, trained on Huawei silicon under MIT license. Achieves 50.4% on Humanity's Last Exam, demonstrating competitive non-NVIDIA training infrastructure.
Vectorize · agent-memory · N/A
Agent memory system that learns and improves over time, trending at 160 stars/day. Addresses a critical gap in the agent stack: persistent, learning memory beyond simple RAG retrieval.
Google · image-text-to-text · 26B (4B active)
Gemma-4 MoE variant with 26B total / 4B active parameters, offering strong multimodal performance at fraction of dense model inference cost. 476k downloads show strong enterprise adoption.
NVIDIA · persona-generation · N/A
NVIDIA's system for generating and managing AI personas, trending at 662 stars/day. Signals NVIDIA's expanding role beyond hardware into agent personality and character management.
OpenAI · text-generation · 117B (5.1B active)
OpenAI's first Apache 2.0 open-weight model at 117B total / 5.1B active parameters with MXFP4 quantization and 128K context. A landmark shift in OpenAI's open-source strategy.
Trending GitHub Repos (12)
NousResearch's extensible AI agent framework that grows with users. Explosive growth from 28.9k to 32.7k stars, the highest daily gain tracked in this project's history.
Client-side knowledge graph creator running entirely in-browser. Drop in GitHub repos or ZIP files for interactive knowledge graphs with built-in Graph RAG Agent capabilities.
Google's showcase gallery for on-device ML/GenAI use cases. Continued strong growth to 18.9k stars, enabling local model experimentation on mobile devices.
NVIDIA's PersonaPlex system for generating and managing AI personas. Surging from 7.5k to 8k stars as NVIDIA expands into agent personality infrastructure.
Create Reddit Videos with just one command. Resurgent popularity at 636 stars/day, likely driven by content creator demand for automated video pipelines.
Google's lightweight runtime for running language models on edge devices. Complementing AI Edge Gallery with C++ inference infrastructure at 528 stars/day.
NeMo Data Designer: Generate high-quality synthetic data from scratch or seed data. NVIDIA's data-centric AI approach gaining traction at 244 stars/day.
Specialized Claude workspace for creating long-form, SEO-optimized blog content with research, writing, analysis, and optimization features. 215 stars/day.
Agent-native personalized learning assistant from HKU. Steady growth at 168 stars/day as education-focused AI tools gain traction.
Hindsight: Agent Memory That Learns. A novel agent memory system that improves over time, addressing the critical gap between simple context windows and full persistent memory.
Fully-automated and zero-code LLM agent framework from HKU. Enables building agents without programming, at 76 stars/day.