Wednesday, April 8, 2026

In-Place Test-Time Training enables LLMs to adapt during inference; Polynomial Mixer achieves linear-time attention replacement; Gym-Anything turns any software into an agent environment

test-time-adaptation · linear-attention-replacements · agent-environment-infrastructure · hallucination-detection · autonomous-agent-evaluation · agent-tooling-dominance

Executive Summary

April 8th delivers a strong showing in adaptive inference and efficient architectures. The headline paper, In-Place Test-Time Training, breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly addressing the long-context performance ceiling that plagues fixed-weight models. This joins yesterday's test-time scaling work to form a clear two-day trend: the field is converging on inference as a first-class optimization target, not just a cost center.

The Polynomial Mixer (PoM) offers a mathematically rigorous linear-time replacement for attention that provably preserves the universal approximation properties of transformers. Unlike previous linear attention approximations that sacrifice expressivity, PoM satisfies the contextual mapping property — a theoretical guarantee that could finally make sub-quadratic transformers viable for production workloads. Meanwhile, Gym-Anything automates environment creation for computer-use agents, producing 10K+ long-horizon tasks across occupational domains — a critical infrastructure contribution as the agent ecosystem matures.

The open-source landscape sees NousResearch/hermes-agent explode to 3,009 stars/day on GitHub, dwarfing all other repos. NVIDIA enters the agent space with PersonaPlex and DataDesigner, while Hindsight from Vectorize introduces agent memory that learns over time. Together these signal that agent infrastructure is becoming the dominant category in open-source AI tooling.

Researcher Notes

In-Place Test-Time Training is the most architecturally ambitious paper today. The core idea — allowing LLMs to modify their own parameters at inference time — directly addresses the fundamental limitation that models are frozen after training. While test-time compute scaling (more tokens at inference) has been the dominant paradigm, test-time training (weight updates at inference) is a qualitatively different capability. The connection to yesterday's T^2 scaling laws paper is direct: if inference is now an optimization target, then the boundary between training and inference is dissolving. Watch for rapid follow-up work combining both approaches.

The Polynomial Mixer deserves more attention than it will probably get. PoM's proof that it satisfies the contextual mapping property while maintaining linear complexity is the strongest theoretical result for efficient attention alternatives in recent memory. Previous sub-quadratic alternatives (linear attention variants, state-space models such as Mamba, recurrent designs such as RWKV) traded theoretical guarantees for empirical performance; PoM keeps both. The paper comes from David Picard's group, which has a strong track record in vision architectures. The immediate question: does the theoretical guarantee translate to practical gains at scale, or is there a constant-factor penalty that makes it uncompetitive with FlashAttention?

The agent evaluation crisis is becoming acute. Three papers today — Claw-Eval, ACE-Bench, and Gym-Anything — all address the same problem from different angles: we cannot reliably evaluate autonomous agents. Claw-Eval records full execution trajectories, ACE-Bench provides controllable difficulty scaling, and Gym-Anything generates environments automatically. The fact that three independent teams are building evaluation infrastructure simultaneously signals that the community recognizes agent benchmarking as a critical bottleneck. The contrast with yesterday's SimpleStream result (simple baseline beats 13 complex methods) suggests current agent benchmarks may face the same reckoning.

HaloProbe's Bayesian approach to hallucination detection is the sleeper hit. Rather than treating hallucinations as classification problems, it decomposes description statistics into factorized probabilities — a fundamentally more principled approach. The paper targets vision-language models specifically, but the statistical framework could generalize to text-only hallucination detection. At a time when every VLM vendor claims low hallucination rates, principled detection methods that don't rely on the model's own confidence are increasingly valuable.

The GitHub trending data tells a clear story: agent infrastructure is eating the world. NousResearch/hermes-agent at 3,009 stars/day is the highest single-day gain we've tracked. Vectorize's Hindsight (agent memory that learns), NVIDIA's DataDesigner (synthetic data for agents), and HKUDS's AutoAgent (zero-code agent framework) all reinforce the same trend. The interesting signal is the diversity of agent tooling: memory, evaluation, persona management, data generation, and framework construction are all simultaneously trending. This is infrastructure build-out, not hype — these are the tools builders actually need.

Themes & Trends

Test-Time Adaptation

rising

The boundary between training and inference is dissolving, with papers on in-place test-time training and target policy optimization showing that inference is becoming a first-class optimization target.

Efficient Architecture Alternatives

rising

The Polynomial Mixer provides the strongest theoretical guarantee yet for linear-time attention replacement, joining the ongoing race to make sub-quadratic transformers production-ready.

Agent Evaluation Crisis

rising

Three independent papers tackle agent evaluation from different angles — trajectory recording, configurable difficulty, and automated environment generation — signaling community recognition of a critical bottleneck.

LLM Safety and Alignment

stable

Exclusive unlearning inverts the safety paradigm (keep-only vs delete-specific), while constrained decoding snowballing reveals hidden alignment taxes in structured output generation.

Agent Infrastructure Build-Out

rising

GitHub trending is dominated by agent tooling: frameworks (hermes-agent), memory (hindsight), personas (personaplex), data (DataDesigner), and evaluation — the full agent stack is being built simultaneously.

Trending Papers (14)

In-Place Test-Time Training

High Relevance

Guhao Feng, Shengjie Luo, Kai Hua, et al. · Tsinghua University, Microsoft Research

Breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly targeting improved performance on long contexts and distribution shifts without retraining.

Key Findings

  • LLMs can update parameters in-place during inference for dynamic adaptation

  • Significant performance improvements on long-context tasks compared to frozen-weight models

  • Framework addresses distribution shift without requiring access to original training data

test-time-training · adaptive-inference · long-context · LLM-training · dynamic-adaptation
4 upvotes
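
The difference between a frozen model and one that updates its weights at inference can be shown on a toy problem. The sketch below is not the paper's method: it assumes a one-parameter predictor and plain SGD on next-value prediction over the context, and `adapt_in_place`, `lr`, and `steps` are hypothetical names chosen for illustration.

```python
def predict(w, x):
    """One-parameter model: the next value is w times the current value."""
    return w * x


def adapt_in_place(w, context, lr=0.1, steps=20):
    """Take a few SGD steps on next-value prediction over the context itself."""
    pairs = list(zip(context[:-1], context[1:]))
    for _ in range(steps):
        for x, y in pairs:
            grad = 2 * (predict(w, x) - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w


frozen_w = 1.0                      # "pretrained" weight: expects x[t+1] == x[t]
context = [1.0, 0.5, 0.25, 0.125]   # shifted input: x[t+1] == 0.5 * x[t]
true_next = 0.0625

adapted_w = adapt_in_place(frozen_w, context)
frozen_error = abs(predict(frozen_w, context[-1]) - true_next)
adapted_error = abs(predict(adapted_w, context[-1]) - true_next)
```

The frozen weight keeps predicting under the old distribution, while the adapted weight moves toward 0.5 using only the context, with no access to the original training data.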

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

High Relevance

David Picard, Nicolas Dufour, Lucas Degeorge, et al. · ENPC, Valeo.ai

Introduces the Polynomial Mixer, a novel token mixing mechanism with linear complexity that provably satisfies the contextual mapping property, maintaining transformer universality while eliminating quadratic attention cost.

Key Findings

  • PoM satisfies the contextual mapping property — the first linear-time method with this guarantee

  • Maintains universal approximation capabilities of full attention transformers

  • Achieves competitive performance with significantly reduced computational cost

efficient-attention · linear-complexity · token-mixing · transformers · theoretical-guarantees
5 upvotes
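
To see why a token mixer can run in linear time, it helps to compare it with attention's all-pairs comparison. The sketch below is an illustrative toy and may differ from the paper's formulation: it assumes tokens are summarized into one shared state of elementwise polynomial features, which each token then reads through a sigmoid gate. Learned projections are omitted, and `poly_features` and `linear_time_mixer` are hypothetical names.

```python
import math


def poly_features(token, degree):
    """Concatenate elementwise powers token**1 ... token**degree."""
    return [v ** k for k in range(1, degree + 1) for v in token]


def linear_time_mixer(tokens, degree=2):
    n = len(tokens)
    width = len(tokens[0]) * degree
    # Pass 1, O(n): average every token's polynomial features into one state.
    state = [0.0] * width
    for tok in tokens:
        for i, f in enumerate(poly_features(tok, degree)):
            state[i] += f / n
    # Pass 2, O(n): each token reads the shared state through its own gate.
    mixed = []
    for tok in tokens:
        gate = [1.0 / (1.0 + math.exp(-v)) for v in tok] * degree
        mixed.append([g * s for g, s in zip(gate, state)])
    return mixed


tokens = [[0.0, 0.0], [2.0, 4.0]]
mixed = linear_time_mixer(tokens)
```

Both passes touch each token once, so cost grows with sequence length n rather than n squared. Whether a mixer of this shape keeps the contextual mapping property is exactly what the paper's proof addresses; the toy makes no such claim.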

Gym-Anything: Turn any Software into an Agent Environment

High Relevance

Pranjal Aggarwal, Graham Neubig, Sean Welleck · Carnegie Mellon University

Frames environment creation for computer-use agents as a multi-agent task, automatically producing 10K+ long-horizon tasks across diverse occupational domains from arbitrary software.

Key Findings

  • Automated environment creation produces 10K+ long-horizon tasks from arbitrary software

  • Multi-agent task framing enables scalable environment generation without manual annotation

  • Tasks span diverse occupational domains, providing realistic evaluation for computer-use agents

agent-environments · computer-use · benchmark-generation · automation · LLM-agents
5 upvotes

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

High Relevance

Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, et al. · University of Alberta, Amii

Presents a Bayesian framework that factorizes description statistics to detect and mitigate object hallucinations in vision-language models, offering a principled alternative to classification-based approaches.

Key Findings

  • Factorized Bayesian statistics detect hallucination probabilities without relying on model confidence

  • Framework enables both detection and mitigation of object hallucinations in VLMs

  • Outperforms existing hallucination detection methods across multiple VLM architectures

hallucination-detection · VLM · Bayesian · vision-language · reliability
5 upvotes
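
A factorized Bayesian detector can be illustrated with ordinary Bayes' rule. The sketch below is a generic naive-Bayes style posterior, not HaloProbe's actual factorization: it assumes binary features that are conditionally independent given the label. The feature names and all probabilities are invented for illustration.

```python
def hallucination_posterior(prior, likelihoods, observed):
    """P(hallucinated | observed features), assuming conditional independence.

    likelihoods[name] = (P(feature | hallucinated), P(feature | grounded))
    observed[name]    = True if the feature is present for this object mention
    """
    p_h, p_g = prior, 1.0 - prior
    for name, present in observed.items():
        l_h, l_g = likelihoods[name]
        p_h *= l_h if present else 1.0 - l_h
        p_g *= l_g if present else 1.0 - l_g
    return p_h / (p_h + p_g)


# Hypothetical statistics estimated from descriptions, not from model confidence.
likelihoods = {
    "late_in_caption": (0.7, 0.3),
    "rare_cooccurrence": (0.6, 0.2),
}
prior = 0.2

risky = hallucination_posterior(
    prior, likelihoods, {"late_in_caption": True, "rare_cooccurrence": True})
safe = hallucination_posterior(
    prior, likelihoods, {"late_in_caption": False, "rare_cooccurrence": False})
```

Because the score is built from description statistics, it does not depend on the VLM's own confidence, which is the property the summary above highlights.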

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

High Relevance

Bowen Ye, Rang Li, Qibin Yang, et al. · Zhejiang University, Alibaba Group

Introduces a comprehensive evaluation suite with 300 tasks recording full execution trajectories — including audit logs and environment snapshots — for trustworthy assessment of autonomous LLM agents.

Key Findings

  • 300 tasks with full trajectory recording across execution traces, audit logs, and snapshots

  • Reveals significant gaps between task completion rates and execution quality in current agents

  • Trajectory-level evaluation catches failure modes invisible to outcome-only metrics

agent-evaluation · benchmark · autonomous-agents · trustworthy-AI · trajectory-analysis
37 upvotes

Action Images: End-to-End Policy Learning via Multiview Video Generation

High Relevance

Haoyu Zhen, Zixian Gao, Qiao Sun, et al. · Tsinghua University, Shanghai AI Laboratory

Formulates robot policy learning through multiview video generation with pixel-grounded action representations, enabling end-to-end policy learning that bridges perception and control.

Key Findings

  • Pixel-grounded action representations enable direct policy extraction from generated videos

  • Multiview generation provides spatial consistency critical for real-world robot deployment

  • End-to-end approach eliminates the need for separate perception and planning pipelines

robotics · policy-learning · video-generation · multiview · world-models
3 upvotes

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

High Relevance

Komal Kumar, Aman Chadha, Salman Khan, et al. · MBZUAI, Stanford University, Amazon

Introduces an open-source multi-agent system with discovery and analysis pipelines for academic literature, addressing the challenge of efficient research synthesis at scale.

Key Findings

  • Multi-agent architecture separates discovery from analysis for efficient research workflows

  • Open-source framework enables reproducible and extensible research automation

  • Outperforms single-agent approaches on literature review quality metrics

research-automation · multi-agent · literature-review · open-source · scientific-discovery
10 upvotes

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

High Relevance

Qimin Zhong, Hao Liao, Haiming Qin, et al. · Peking University, ByteDance

Analyzes multi-token prediction gradient bias in world models and proposes anchoring predictions to ground-truth trajectories for improved consistency, contributing to the debate on whether LLMs develop coherent internal world models.

Key Findings

  • Multi-token prediction introduces gradient bias that degrades world model consistency

  • Anchoring to ground-truth trajectories corrects drift in sequential predictions

  • Latent semantic enhancement improves the coherence of learned internal representations

world-models · multi-token-prediction · LLM-internals · consistency · representation-learning
5 upvotes

Exclusive Unlearning

High Relevance

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, et al. · University of Tokyo, RIKEN

Proposes a novel machine unlearning approach that removes broad categories of harmful content by forgetting everything except desired knowledge domains, inverting the typical targeted-deletion paradigm.

Key Findings

  • Exclusive unlearning (keep-only) is more effective than inclusive unlearning (delete-specific) for safety

  • Approach scales better to unknown harmful content categories than enumeration-based methods

  • Maintains model utility on retained knowledge domains while broadly removing harmful capabilities

machine-unlearning · LLM-safety · harmful-content · alignment · knowledge-management
5 upvotes

Target Policy Optimization

High Relevance

Jean Kaddour · Google DeepMind

Separates target distribution construction from parameter updates in RL for language models, demonstrating improved performance on sparse reward tasks by decoupling these traditionally entangled components.

Key Findings

  • Decoupling target distribution from parameter updates improves sparse reward optimization

  • Cleaner theoretical framework than PPO/DPO for RLHF by separating what-to-optimize from how-to-optimize

  • Achieves state-of-the-art on sparse reward benchmarks with simpler training dynamics

RLHF · policy-optimization · sparse-rewards · LLM-training · reinforcement-learning
5 upvotes
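
The split between what to optimize and how to optimize can be sketched in two functions. This is a toy under stated assumptions, not the paper's algorithm: the target is assumed to be the current policy reweighted by exp(reward / beta), and the update is assumed to be plain gradient descent on cross-entropy. `build_target`, `fit_to_target`, and `beta` are hypothetical names.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_target(policy_probs, rewards, beta=1.0):
    """Step 1 (what to optimize): reweight the policy by exponentiated reward."""
    weights = [p * math.exp(r / beta) for p, r in zip(policy_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def fit_to_target(logits, target, lr=0.5, steps=200):
    """Step 2 (how to optimize): descend cross-entropy toward the fixed target.

    For softmax outputs the cross-entropy gradient is softmax(logits) - target.
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        logits = [z - lr * (p - q) for z, p, q in zip(logits, probs, target)]
    return logits


logits = [0.0, 0.0, 0.0, 0.0]      # uniform policy over four candidates
rewards = [0.0, 0.0, 0.0, 1.0]     # sparse: only one candidate is rewarded
target = build_target(softmax(logits), rewards, beta=0.5)
new_probs = softmax(fit_to_target(logits, target))
```

Because the target is fixed before any parameter moves, the fitting step is an ordinary supervised problem, which is one way to read the "simpler training dynamics" finding.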

Artificial Intelligence and the Structure of Mathematics

High Relevance

Maissam Barkeshli, Michael R. Douglas, Michael H. Freedman · University of Maryland, Harvard University, Microsoft Research

Discusses how AI may reveal the global structure of formal proofs and enable mathematical discovery. The author list includes Fields Medalist Michael Freedman.

Key Findings

  • AI could reveal hidden structural patterns in the space of formal mathematical proofs

  • Automated proof systems may enable discovery of connections between distant mathematical domains

  • The paper outlines concrete paths for AI-assisted mathematical research beyond theorem proving

AI-for-math · formal-proofs · mathematical-discovery · automated-reasoning · foundations
5 upvotes

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

High Relevance

Hongxu Zhou · Independent Researcher

Reveals that constrained decoding in LLM self-correction triggers 'structure snowballing' rather than improving reflection, exposing a hidden alignment tax in structured output generation.

Key Findings

  • Constrained decoding triggers structure snowballing that compounds errors rather than correcting them

  • Self-correction mechanisms fail under constrained output formats due to cascading structural commitments

  • Identifies a fundamental tension between structured output requirements and genuine model reflection

constrained-decoding · self-correction · alignment · structured-output · LLM-limitations
5 upvotes

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty

Wang Yang, Chaoda Song, Xinpeng Li, et al. · Chinese Academy of Sciences, University of Chinese Academy of Sciences

Proposes a unified grid-based planning framework for agent evaluation with fine-grained control over task horizon and difficulty, addressing the high overhead and limited configurability of existing agent benchmarks.

Key Findings

  • Grid-based planning tasks enable continuous difficulty scaling for agent evaluation

  • Controllable horizon length isolates planning capability from task-specific knowledge

  • Lightweight environments dramatically reduce the cost of large-scale agent benchmarking

agent-evaluation · benchmark · configurable-difficulty · planning · autonomous-agents
2 upvotes
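
Grid-based planning tasks make horizon and difficulty easy to dial. The sketch below is an illustration rather than ACE-Bench's generator: it assumes difficulty is set by obstacle density and horizon is measured as the BFS shortest-path length from the top-left to the bottom-right cell. All function and parameter names are hypothetical.

```python
import random
from collections import deque


def make_grid(size, obstacle_rate, seed):
    """True marks an obstacle; start and goal cells are always kept free."""
    rng = random.Random(seed)
    grid = [[rng.random() < obstacle_rate for _ in range(size)]
            for _ in range(size)]
    grid[0][0] = grid[size - 1][size - 1] = False
    return grid


def shortest_path_length(grid):
    """BFS from top-left to bottom-right; None if the goal is unreachable."""
    size = len(grid)
    queue, seen = deque([(0, 0, 0)]), {(0, 0)}
    while queue:
        r, c, dist = queue.popleft()
        if (r, c) == (size - 1, size - 1):
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < size and 0 <= nc < size
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc, dist + 1))
    return None


def sample_task(size, obstacle_rate, max_tries=1000):
    """Return the first solvable grid and its horizon (optimal step count)."""
    for seed in range(max_tries):
        grid = make_grid(size, obstacle_rate, seed)
        horizon = shortest_path_length(grid)
        if horizon is not None:
            return grid, horizon
    raise ValueError("no solvable grid found")


easy_grid, easy_horizon = sample_task(size=5, obstacle_rate=0.0)
hard_grid, hard_horizon = sample_task(size=6, obstacle_rate=0.2)
```

Raising `size` stretches the minimum horizon and raising `obstacle_rate` forces detours, so planning depth can be varied without changing the task's domain knowledge.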

Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

High Relevance

Changgeon Ko, Jisu Shin, Hoyun Song, et al. · Seoul National University, KAIST

Demonstrates that LLM agents serving as human delegates in multi-agent environments have their accuracy significantly degraded by social pressure and rhetorical manipulation, exposing a critical vulnerability in collective AI decision-making.

Key Findings

  • Representative LLM agents' decision accuracy declines significantly under social pressure

  • Rhetorical manipulation is more effective against LLM agents than logical argumentation

  • Multi-agent collective decision-making inherits and amplifies individual agent vulnerabilities

multi-agent · social-manipulation · decision-making · LLM-vulnerabilities · collective-intelligence
5 upvotes

Trending Models (10)

NousResearch Hermes Agent

NousResearch · agent-framework · Various

NousResearch's agent framework that grows with users has exploded to 3,009 stars/day on GitHub, making it the fastest-growing AI agent project tracked. Its model-native agent design draws on NousResearch's deep open-weight expertise.

AI-agents · framework · open-source
0 downloads · 33.7K likes

Qwen 3.6 Plus

Alibaba · text-generation · MoE + linear attention

Alibaba's latest release features a 1M context window, 65K output tokens, and always-on chain-of-thought reasoning. It beats Claude Opus on Terminal-Bench 2.0 (61.6 vs 59.3) and is available as a free preview on OpenRouter.

qwen · long-context · reasoning · MoE
150.0K downloads · 380 likes

Gemma-4-31B-IT

Google · image-text-to-text · 31B

Google's flagship 31B dense Gemma-4 instruction-tuned model continues to trend strongly with 678k downloads and 1,158 likes. Its Apache 2.0 license is a first for the Gemma family, whose earlier releases shipped under Google's custom Gemma terms.

gemma4 · multimodal · instruction-tuned · apache-2.0
884.3K downloads · 1.4K likes

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Jackrong (Community) · text-generation · 27B

Community-built Qwen3.5-27B distilled from Claude Opus reasoning outputs continues to gain massive traction with 2,403 likes and 548k downloads, making it a prominent example of closed-to-open capability transfer.

qwen3.5 · distillation · reasoning · claude-opus
552.0K downloads · 2.5K likes

NeMo Data Designer

NVIDIA · data-generation · N/A

NVIDIA's synthetic data generation tool for creating high-quality training data from scratch or seed data, trending at 244 stars/day as enterprises seek data-centric AI approaches.

synthetic-data · NVIDIA · data-centric · NeMo
0 downloads · 1.5K likes

GLM-5

Zhipu AI · text-generation · 744B (40B active)

Zhipu AI's frontier reasoning model with 744B total / 40B active parameters, trained on Huawei silicon under MIT license. Achieves 50.4% on Humanity's Last Exam, demonstrating competitive non-NVIDIA training infrastructure.

frontier-model · MoE · MIT-license · reasoning
389 downloads · 519 likes

Hindsight: Agent Memory That Learns

Vectorize · agent-memory · N/A

Agent memory system that learns and improves over time, trending at 160 stars/day. Addresses a critical gap in the agent stack: persistent, learning memory beyond simple RAG retrieval.

agent-memory · learning · RAG-alternative · infrastructure
0 downloads · 7.8K likes

Gemma-4-26B-A4B-IT

Google · image-text-to-text · 26B (4B active)

Gemma-4 MoE variant with 26B total / 4B active parameters, offering strong multimodal performance at a fraction of the dense model's inference cost. Its 476k downloads suggest strong adoption.

gemma4 · MoE · efficient-inference · multimodal
659.8K downloads · 515 likes

PersonaPlex

NVIDIA · persona-generation · N/A

NVIDIA's system for generating and managing AI personas, trending at 662 stars/day. Signals NVIDIA's expanding role beyond hardware into agent personality and character management.

personas · NVIDIA · AI-characters · agent-infrastructure
0 downloads · 8.1K likes

GPT-OSS-120B

OpenAI · text-generation · 117B (5.1B active)

OpenAI's first Apache 2.0 open-weight model at 117B total / 5.1B active parameters with MXFP4 quantization and 128K context. A landmark shift in OpenAI's open-source strategy.

OpenAI · open-weight · MoE · apache-2.0
3.7M downloads · 4.7K likes

Trending GitHub Repos (12)

NousResearch's extensible AI agent framework that grows with users. Explosive growth from 28.9k to 32.7k stars, the highest daily gain tracked in this project's history.

AI-agents · LLM-agents · NousResearch · framework
Python · 32.7K · +3.0K today · 4.2K

Client-side knowledge graph creator running entirely in-browser. Drop in GitHub repos or ZIP files for interactive knowledge graphs with built-in Graph RAG Agent capabilities.

knowledge-graph · code-analysis · RAG · browser-based
TypeScript · 24.6K · +1.2K today · 2.8K

Mini CLI search engine for docs, knowledge bases, and meeting notes using local state-of-the-art approaches. Continued strong growth to 19.7k stars.

search · local-first · CLI · knowledge-management
TypeScript · 19.7K · +859 today · 1.2K

NVIDIA's PersonaPlex system for generating and managing AI personas. Surging from 7.5k to 8k stars as NVIDIA expands into agent personality infrastructure.

personas · NVIDIA · AI-characters · generation
Python · 8.0K · +662 today · 1.2K

Create Reddit Videos with just one command. Resurgent popularity at 636 stars/day, likely driven by content creator demand for automated video pipelines.

video-generation · Reddit · content-creation · automation
Python · 10.1K · +636 today · 2.5K

Google's lightweight runtime for running language models on edge devices. It complements AI Edge Gallery with C++ inference infrastructure and is trending at 528 stars/day.

edge-inference · LLM-runtime · C++ · Google
C++ · 2.6K · +528 today · 253

NeMo Data Designer: Generate high-quality synthetic data from scratch or seed data. NVIDIA's data-centric AI approach gaining traction at 244 stars/day.

synthetic-data · NVIDIA · data-generation · NeMo
Python · 1.5K · +244 today · 132

Specialized Claude workspace for creating long-form, SEO-optimized blog content with research, writing, analysis, and optimization features. 215 stars/day.

SEO · content-generation · Claude · writing
Python · 4.0K · +215 today · 665

High Relevance

Agent-native personalized learning assistant from HKU. Steady growth at 168 stars/day as education-focused AI tools gain traction.

education · AI-tutor · personalization · agents
Python · 12.4K · +168 today · 1.7K

Hindsight: Agent Memory That Learns. A novel agent memory system that improves over time, addressing the critical gap between simple context windows and full persistent memory.

agent-memory · learning · RAG-alternative · infrastructure
Python · 7.8K · +160 today · 486

High Relevance

Fully-automated and zero-code LLM agent framework from HKU. Enables building agents without programming, at 76 stars/day.

agent-framework · zero-code · automation · LLM-agents
Python · 9.0K · +76 today · 1.3K

Sources Checked