Wednesday, April 8, 2026

In-Place Test-Time Training enables LLMs to adapt during inference; Polynomial Mixer achieves linear-time attention replacement; Gym-Anything turns any software into an agent environment

test-time-adaptation · linear-attention-replacements · agent-environment-infrastructure · hallucination-detection · autonomous-agent-evaluation · agent-tooling-dominance

Executive Summary

April 8th delivers a strong showing in adaptive inference and efficient architectures. The headline paper, In-Place Test-Time Training, breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly addressing the long-context performance ceiling that plagues fixed-weight models. This joins yesterday's test-time scaling work to form a clear two-day trend: the field is converging on inference as a first-class optimization target, not just a cost center.

The Polynomial Mixer (PoM) offers a mathematically rigorous linear-time replacement for attention that provably preserves the universal approximation properties of transformers. Unlike previous linear attention approximations that sacrifice expressivity, PoM satisfies the contextual mapping property — a theoretical guarantee that could finally make sub-quadratic transformers viable for production workloads. Meanwhile, Gym-Anything automates environment creation for computer-use agents, producing 10K+ long-horizon tasks across occupational domains — a critical infrastructure contribution as the agent ecosystem matures.

On the tooling side, NousResearch/hermes-agent surges to 3,009 stars/day on GitHub, far ahead of every other tracked repo. NVIDIA enters the agent space with PersonaPlex and DataDesigner, while Vectorize's Hindsight introduces agent memory that learns. Together these signal that agent infrastructure is becoming the dominant category in open-source AI tooling.

Researcher Notes

In-Place Test-Time Training is the most architecturally ambitious paper today. The core idea — allowing LLMs to modify their own parameters at inference time — directly addresses the fundamental limitation that models are frozen after training. While test-time compute scaling (more tokens at inference) has been the dominant paradigm, test-time training (weight updates at inference) is a qualitatively different capability. The connection to yesterday's T^2 scaling laws paper is direct: if inference is now an optimization target, then the boundary between training and inference is dissolving. Watch for rapid follow-up work combining both approaches.

The Polynomial Mixer deserves more attention than it will probably get. PoM's proof that it satisfies the contextual mapping property while maintaining linear complexity is the strongest theoretical result for efficient attention alternatives in recent memory. Previous sub-quadratic alternatives (Mamba, RWKV, etc.) traded theoretical guarantees for empirical performance; PoM keeps both. The paper comes from David Picard's group, which has a strong track record in vision architectures. The immediate question: does the theoretical guarantee translate to practical gains at scale, or is there a constant-factor penalty that makes it uncompetitive with FlashAttention?
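As a rough intuition for how a token mixer can run in linear time, the sketch below pools polynomial features of all tokens into one shared state and lets each token read that state through its own gate. This is a generic pooled-polynomial mixer written under our own assumptions, not the operator defined in the PoM paper; the function names, the elementwise powers, and the sigmoid gate are all hypothetical.

```python
import math

def poly_features(token, degree=3):
    """Elementwise powers x, x^2, ..., x^degree, concatenated."""
    return [x ** k for k in range(1, degree + 1) for x in token]

def pooled_polynomial_mixer(tokens, degree=3):
    """Hypothetical linear-time mixer: two passes over the sequence."""
    n = len(tokens)
    dim = len(tokens[0])
    # Pass 1, O(n): average polynomial features of every token into one state.
    state = [0.0] * (dim * degree)
    for token in tokens:
        for j, feature in enumerate(poly_features(token, degree)):
            state[j] += feature / n
    # Pass 2, O(n): each token reads the shared state through a sigmoid gate.
    outputs = []
    for token in tokens:
        gate = [1.0 / (1.0 + math.exp(-x)) for x in token]
        outputs.append([gate[j % dim] * state[j] for j in range(dim * degree)])
    return outputs

tokens = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
mixed = pooled_polynomial_mixer(tokens)
```

No token-to-token score matrix is ever formed, so cost grows with sequence length rather than its square. Whether a mixer of this shape keeps the expressivity of attention is exactly what the contextual mapping result is about, and the sketch makes no claim on that point.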

The agent evaluation crisis is becoming acute. Three papers today — Claw-Eval, ACE-Bench, and Gym-Anything — all address the same problem from different angles: we cannot reliably evaluate autonomous agents. Claw-Eval records full execution trajectories, ACE-Bench provides controllable difficulty scaling, and Gym-Anything generates environments automatically. The fact that three independent teams are building evaluation infrastructure simultaneously signals that the community recognizes agent benchmarking as a critical bottleneck. The contrast with yesterday's SimpleStream result (simple baseline beats 13 complex methods) suggests current agent benchmarks may face the same reckoning.
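The gap between outcome-only scoring and trajectory-level scoring can be sketched in a few lines. The data model below (`Step`, `Trajectory`) and both checker functions are hypothetical names invented for this illustration and do not reflect Claw-Eval's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str
    target: str

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    final_state: dict = field(default_factory=dict)

def outcome_only_pass(trajectory, expected_state):
    """Outcome-only metric: looks at the final state and nothing else."""
    return all(trajectory.final_state.get(k) == v for k, v in expected_state.items())

def trajectory_violations(trajectory, forbidden_actions):
    """Trajectory-level check: flags every step that used a forbidden action."""
    return [
        (index, step.action, step.target)
        for index, step in enumerate(trajectory.steps)
        if step.action in forbidden_actions
    ]

# Hypothetical run: the task is completed, but a backup is deleted on the way.
run = Trajectory(
    steps=[
        Step("read_file", "notes.txt"),
        Step("delete_file", "backup.db"),
        Step("write_file", "report.md"),
    ],
    final_state={"report_written": True},
)
passed = outcome_only_pass(run, {"report_written": True})
violations = trajectory_violations(run, {"delete_file"})
```

Here the outcome-only check passes while the trajectory check reports the destructive step, which is the kind of failure mode that recording full execution traces is meant to expose.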

HaloProbe's Bayesian approach to hallucination detection is the sleeper hit. Rather than treating hallucinations as classification problems, it decomposes description statistics into factorized probabilities — a fundamentally more principled approach. The paper targets vision-language models specifically, but the statistical framework could generalize to text-only hallucination detection. At a time when every VLM vendor claims low hallucination rates, principled detection methods that don't rely on the model's own confidence are increasingly valuable.
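For readers unfamiliar with factorized Bayesian scoring, the following sketch shows the general shape of such a computation: a prior on "hallucinated" is combined with per-feature likelihoods that are assumed conditionally independent. This is a textbook naive-Bayes posterior, not HaloProbe's actual factorization, and every number below is made up purely for illustration.

```python
def hallucination_posterior(prior, likelihoods_if_hallucinated, likelihoods_if_real):
    """Posterior P(hallucinated | features) under a factorized likelihood.

    Assumes the features are conditionally independent given the label,
    which is an illustrative simplification.
    """
    weight_hallucinated = prior
    weight_real = 1.0 - prior
    for p_h, p_r in zip(likelihoods_if_hallucinated, likelihoods_if_real):
        weight_hallucinated *= p_h
        weight_real *= p_r
    return weight_hallucinated / (weight_hallucinated + weight_real)

# Illustrative, invented values: a 20% prior and two observed features,
# e.g. "object mentioned late in the caption" and "rare co-occurrence".
posterior = hallucination_posterior(
    prior=0.2,
    likelihoods_if_hallucinated=[0.6, 0.7],
    likelihoods_if_real=[0.3, 0.2],
)
```

The score depends only on statistics of the description, not on the model's own token confidence, which is the property the paragraph above highlights.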

The GitHub trending data tells a clear story: agent infrastructure is eating the world. NousResearch/hermes-agent at 3,009 stars/day is the highest single-day gain we've tracked. Vectorize's Hindsight (agent memory that learns), NVIDIA's DataDesigner (synthetic data for agents), and HKUDS's AutoAgent (zero-code agent framework) all reinforce the same trend. The interesting signal is the diversity of agent tooling: memory, evaluation, persona management, data generation, and framework construction are all simultaneously trending. This is infrastructure build-out, not hype — these are the tools builders actually need.

Themes & Trends

Test-Time Adaptation (rising)

The boundary between training and inference is dissolving, with papers on in-place test-time training and target policy optimization showing that inference is becoming a first-class optimization target.

Efficient Architecture Alternatives (rising)

The Polynomial Mixer provides the strongest theoretical guarantee yet for linear-time attention replacement, joining the ongoing race to make sub-quadratic transformers production-ready.

Agent Evaluation Crisis (rising)

Three independent papers tackle agent evaluation from different angles — trajectory recording, configurable difficulty, and automated environment generation — signaling community recognition of a critical bottleneck.

LLM Safety and Alignment (stable)

Exclusive unlearning inverts the safety paradigm (keep-only vs delete-specific), while constrained decoding snowballing reveals hidden alignment taxes in structured output generation.

Agent Infrastructure Build-Out (rising)

GitHub trending is dominated by agent tooling: frameworks (hermes-agent), memory (hindsight), personas (personaplex), data (DataDesigner), and evaluation — the full agent stack is being built simultaneously.

Trending Papers (14)

In-Place Test-Time Training

High Relevance

Guhao Feng, Shengjie Luo, Kai Hua, et al. — Tsinghua University, Microsoft Research

Breaks the static train-then-deploy paradigm by enabling LLMs to update their parameters during inference, directly targeting improved performance on long contexts and distribution shifts without retraining.

Key Findings

• LLMs can update parameters in-place during inference for dynamic adaptation
• Significant performance improvements on long-context tasks compared to frozen-weight models
• Framework addresses distribution shift without requiring access to original training data

test-time-training · adaptive-inference · long-context · LLM-training · dynamic-adaptation

4 upvotes

arXiv HF PDF

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

High Relevance

David Picard, Nicolas Dufour, Lucas Degeorge, et al. — ENPC, Valeo.ai

Introduces the Polynomial Mixer, a novel token mixing mechanism with linear complexity that provably satisfies the contextual mapping property, maintaining transformer universality while eliminating quadratic attention cost.

Key Findings

• PoM satisfies the contextual mapping property; it is described as the first linear-time method with this guarantee
• Maintains universal approximation capabilities of full attention transformers
• Achieves competitive performance with significantly reduced computational cost

efficient-attention · linear-complexity · token-mixing · transformers · theoretical-guarantees

5 upvotes

arXiv HF PDF

Gym-Anything: Turn any Software into an Agent Environment

High Relevance

Pranjal Aggarwal, Graham Neubig, Sean Welleck — Carnegie Mellon University

Frames environment creation for computer-use agents as a multi-agent task, automatically producing 10K+ long-horizon tasks across diverse occupational domains from arbitrary software.

Key Findings

• Automated environment creation produces 10K+ long-horizon tasks from arbitrary software
• Multi-agent task framing enables scalable environment generation without manual annotation
• Tasks span diverse occupational domains, providing realistic evaluation for computer-use agents

agent-environments · computer-use · benchmark-generation · automation · LLM-agents

5 upvotes

arXiv HF PDF

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

High Relevance

Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, et al. — University of Alberta, Amii

Presents a Bayesian framework that factorizes description statistics to detect and mitigate object hallucinations in vision-language models, offering a principled alternative to classification-based approaches.

Key Findings

• Factorized Bayesian statistics detect hallucination probabilities without relying on model confidence
• Framework enables both detection and mitigation of object hallucinations in VLMs
• Outperforms existing hallucination detection methods across multiple VLM architectures

hallucination-detection · VLM · Bayesian · vision-language · reliability

5 upvotes

arXiv HF PDF

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

High Relevance

Bowen Ye, Rang Li, Qibin Yang, et al. — Zhejiang University, Alibaba Group

Introduces a comprehensive evaluation suite with 300 tasks recording full execution trajectories — including audit logs and environment snapshots — for trustworthy assessment of autonomous LLM agents.

Key Findings

• 300 tasks with full trajectory recording across execution traces, audit logs, and snapshots
• Reveals significant gaps between task completion rates and execution quality in current agents
• Trajectory-level evaluation catches failure modes invisible to outcome-only metrics

agent-evaluation · benchmark · autonomous-agents · trustworthy-AI · trajectory-analysis

37 upvotes

arXiv HF PDF

Action Images: End-to-End Policy Learning via Multiview Video Generation

High Relevance

Haoyu Zhen, Zixian Gao, Qiao Sun, et al. — Tsinghua University, Shanghai AI Laboratory

Formulates robot policy learning through multiview video generation with pixel-grounded action representations, enabling end-to-end policy learning that bridges perception and control.

Key Findings

• Pixel-grounded action representations enable direct policy extraction from generated videos
• Multiview generation provides spatial consistency critical for real-world robot deployment
• End-to-end approach eliminates the need for separate perception and planning pipelines

robotics · policy-learning · video-generation · multiview · world-models

3 upvotes

arXiv HF PDF

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

High Relevance

Komal Kumar, Aman Chadha, Salman Khan, et al. — MBZUAI, Stanford University, Amazon

Introduces an open-source multi-agent system with discovery and analysis pipelines for academic literature, addressing the challenge of efficient research synthesis at scale.

Key Findings

• Multi-agent architecture separates discovery from analysis for efficient research workflows
• Open-source framework enables reproducible and extensible research automation
• Outperforms single-agent approaches on literature review quality metrics

research-automation · multi-agent · literature-review · open-source · scientific-discovery

10 upvotes

arXiv HF PDF

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

High Relevance

Qimin Zhong, Hao Liao, Haiming Qin, et al. — Peking University, ByteDance

Analyzes multi-token prediction gradient bias in world models and proposes anchoring predictions to ground-truth trajectories for improved consistency, contributing to the debate on whether LLMs develop coherent internal world models.

Key Findings

• Multi-token prediction introduces gradient bias that degrades world model consistency
• Anchoring to ground-truth trajectories corrects drift in sequential predictions
• Latent semantic enhancement improves the coherence of learned internal representations

world-models · multi-token-prediction · LLM-internals · consistency · representation-learning

5 upvotes

arXiv HF PDF

Exclusive Unlearning

High Relevance

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, et al. — University of Tokyo, RIKEN

Proposes a novel machine unlearning approach that removes broad categories of harmful content by forgetting everything except desired knowledge domains, inverting the typical targeted-deletion paradigm.

Key Findings

• Exclusive unlearning (keep-only) is more effective than inclusive unlearning (delete-specific) for safety
• Approach scales better to unknown harmful content categories than enumeration-based methods
• Maintains model utility on retained knowledge domains while broadly removing harmful capabilities

machine-unlearning · LLM-safety · harmful-content · alignment · knowledge-management

5 upvotes

arXiv HF PDF

Target Policy Optimization

High Relevance

Jean Kaddour — Google DeepMind

Separates target distribution construction from parameter updates in RL for language models, demonstrating improved performance on sparse reward tasks by decoupling these traditionally entangled components.

Key Findings

• Decoupling target distribution from parameter updates improves sparse reward optimization
• Cleaner theoretical framework than PPO/DPO for RLHF by separating what-to-optimize from how-to-optimize
• Achieves state-of-the-art on sparse reward benchmarks with simpler training dynamics

RLHF · policy-optimization · sparse-rewards · LLM-training · reinforcement-learning

5 upvotes

arXiv HF PDF
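The split between "what to optimize" and "how to optimize" described in the Target Policy Optimization entry can be illustrated with a two-stage toy. The target construction used here, tilting the current policy by exponentiated reward, is a common general recipe that we assume for illustration; the paper's actual construction may differ. The names `build_target` and `fit_to_target` and the temperature `beta` are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def build_target(policy_probs, rewards, beta=1.0):
    """Stage 1: construct a target distribution (assumed reward-tilted form)."""
    weights = [p * math.exp(r / beta) for p, r in zip(policy_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def fit_to_target(logits, target, lr=0.5, steps=20):
    """Stage 2: plain gradient descent on cross-entropy toward the target."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # d/d(logit_j) of -sum_i target_i * log(probs_i) equals probs_j - target_j.
        logits = [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]
    return logits

start_logits = [0.0, 0.0, 0.0, 0.0]   # uniform policy over four candidates
rewards = [0.0, 0.0, 0.0, 1.0]        # sparse reward: only one candidate pays
target = build_target(softmax(start_logits), rewards)
new_probs = softmax(fit_to_target(start_logits, target))
```

Because the target is fixed before any parameter update, the fitting stage is an ordinary supervised problem, which is one way to read the claim of simpler training dynamics.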

Artificial Intelligence and the Structure of Mathematics

High Relevance

Maissam Barkeshli, Michael R. Douglas, Michael H. Freedman — University of Maryland, Harvard University, Microsoft Research

Discusses how AI may reveal the global structure of formal proofs and enable mathematical discovery. Co-authored by Fields Medalist Michael Freedman.

Key Findings

• AI could reveal hidden structural patterns in the space of formal mathematical proofs
• Automated proof systems may enable discovery of connections between distant mathematical domains
• The paper outlines concrete paths for AI-assisted mathematical research beyond theorem proving

AI-for-math · formal-proofs · mathematical-discovery · automated-reasoning · foundations

5 upvotes

arXiv HF PDF

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

High Relevance

Hongxu Zhou — Independent Researcher

Reveals that constrained decoding in LLM self-correction triggers 'structure snowballing' rather than improving reflection, exposing a hidden alignment tax in structured output generation.

Key Findings

• Constrained decoding triggers structure snowballing that compounds errors rather than correcting them
• Self-correction mechanisms fail under constrained output formats due to cascading structural commitments
• Identifies a fundamental tension between structured output requirements and genuine model reflection

constrained-decoding · self-correction · alignment · structured-output · LLM-limitations

5 upvotes

arXiv HF PDF

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty

Wang Yang, Chaoda Song, Xinpeng Li, et al. — Chinese Academy of Sciences, University of Chinese Academy of Sciences

Proposes a unified grid-based planning framework for agent evaluation with fine-grained control over task horizon and difficulty, addressing the high overhead and limited configurability of existing agent benchmarks.

Key Findings

• Grid-based planning tasks enable continuous difficulty scaling for agent evaluation
• Controllable horizon length isolates planning capability from task-specific knowledge
• Lightweight environments dramatically reduce the cost of large-scale agent benchmarking

agent-evaluation · benchmark · configurable-difficulty · planning · autonomous-agents

2 upvotes

arXiv HF PDF
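A grid-based planning task with separate horizon and difficulty knobs, as in the ACE-Bench entry above, might look roughly like the sketch below. It is an assumption-laden toy rather than ACE-Bench's generator: grid size stands in for horizon, obstacle density stands in for difficulty, and `make_grid` and `shortest_path_length` are hypothetical names.

```python
import random
from collections import deque

def make_grid(size, obstacle_density, seed=0):
    """Build a size x size grid; True marks an obstacle. Corners stay free."""
    rng = random.Random(seed)
    grid = [[rng.random() < obstacle_density for _ in range(size)] for _ in range(size)]
    grid[0][0] = False
    grid[size - 1][size - 1] = False
    return grid

def shortest_path_length(grid):
    """Breadth-first search from top-left to bottom-right; None if unreachable."""
    size = len(grid)
    goal = (size - 1, size - 1)
    queue = deque([((0, 0), 0)])
    seen = {(0, 0)}
    while queue:
        (row, col), dist = queue.popleft()
        if (row, col) == goal:
            return dist
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and not grid[nxt[0]][nxt[1]] and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

short_horizon = shortest_path_length(make_grid(5, 0.0))   # open 5x5 grid
long_horizon = shortest_path_length(make_grid(8, 0.0))    # open 8x8 grid
blocked = shortest_path_length(make_grid(4, 1.0))         # every inner cell blocked
```

The search gives a ground-truth optimal path length for each generated task, so an agent's plan can be scored against it without any task-specific knowledge. Instances that come back as unreachable would need to be filtered or regenerated with a different seed.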

Trending Models (10)

NousResearch Hermes Agent

NousResearch · agent-framework · Various

View on HF

NousResearch's agent framework that grows with users has exploded to 3,009 stars/day on GitHub, representing the fastest-growing AI agent project tracked. Model-native agent design from NousResearch's deep open-weight expertise.

AI-agents · framework · open-source

0 downloads · 33.7K likes

Qwen 3.6 Plus

Alibaba · text-generation · MoE + linear attention

View on HF

Alibaba's latest release features a 1M-token context window, 65K output tokens, and always-on chain-of-thought reasoning. It beats Claude Opus on Terminal-Bench 2.0 (61.6 vs 59.3) and is available as a free preview on OpenRouter.

qwen · long-context · reasoning · MoE

150.0K downloads · 380 likes

Gemma-4-31B-IT

Google · image-text-to-text · 31B

View on HF

Google's flagship 31B dense Gemma-4 instruction-tuned model continues strong trending with 678k downloads and 1,158 likes. The Apache 2.0 license is a shift from the custom Gemma terms that governed earlier Gemma releases, giving enterprises fully permissive licensing.

gemma4 · multimodal · instruction-tuned · apache-2.0

884.3K downloads · 1.4K likes

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Jackrong (Community) · text-generation · 27B

View on HF

Community-built Qwen3.5-27B distilled from Claude Opus reasoning outputs continues massive traction with 2,403 likes and 548k downloads, representing the pinnacle of closed-to-open capability transfer.

qwen3.5 · distillation · reasoning · claude-opus

552.0K downloads · 2.5K likes

NeMo Data Designer

NVIDIA · data-generation · N/A

View on HF

NVIDIA's synthetic data generation tool for creating high-quality training data from scratch or seed data, trending at 244 stars/day as enterprises seek data-centric AI approaches.

synthetic-data · NVIDIA · data-centric · NeMo

0 downloads · 1.5K likes

GLM-5

Zhipu AI · text-generation · 744B (40B active)

View on HF

Zhipu AI's frontier reasoning model with 744B total / 40B active parameters, trained on Huawei silicon under MIT license. Achieves 50.4% on Humanity's Last Exam, demonstrating competitive non-NVIDIA training infrastructure.

frontier-model · MoE · MIT-license · reasoning

389 downloads · 519 likes

Hindsight: Agent Memory That Learns

Vectorize · agent-memory · N/A

View on HF

Agent memory system that learns and improves over time, trending at 160 stars/day. Addresses a critical gap in the agent stack: persistent, learning memory beyond simple RAG retrieval.

agent-memory · learning · RAG-alternative · infrastructure

0 downloads · 7.8K likes

Gemma-4-26B-A4B-IT

Google · image-text-to-text · 26B (4B active)

View on HF

Gemma-4 MoE variant with 26B total / 4B active parameters, offering strong multimodal performance at a fraction of the dense model's inference cost. 476k downloads suggest strong adoption.

gemma4 · MoE · efficient-inference · multimodal

659.8K downloads · 515 likes

PersonaPlex

NVIDIA · persona-generation · N/A

View on HF

NVIDIA's system for generating and managing AI personas, trending at 662 stars/day. Signals NVIDIA's expanding role beyond hardware into agent personality and character management.

personas · NVIDIA · AI-characters · agent-infrastructure

0 downloads · 8.1K likes

GPT-OSS-120B

OpenAI · text-generation · 117B (5.1B active)

View on HF

OpenAI's first Apache 2.0 open-weight model at 117B total / 5.1B active parameters with MXFP4 quantization and 128K context. A landmark shift in OpenAI's open-source strategy.

OpenAI · open-weight · MoE · apache-2.0

3.7M downloads · 4.7K likes

Trending GitHub Repos (12)

NousResearch/hermes-agent

High Relevance · GitHub

NousResearch's extensible AI agent framework that grows with users. Explosive growth from 28.9k to 32.7k stars, the highest daily gain tracked in this project's history.

AI-agents · LLM-agents · NousResearch · framework

Python · 32.7K · +3.0K today · 4.2K

abhigyanpatwari/GitNexus

High Relevance · GitHub

Client-side knowledge graph creator running entirely in-browser. Drop in GitHub repos or ZIP files for interactive knowledge graphs with built-in Graph RAG Agent capabilities.

knowledge-graph · code-analysis · RAG · browser-based

TypeScript · 24.6K · +1.2K today · 2.8K

google-ai-edge/gallery

High Relevance · GitHub

Google's showcase gallery for on-device ML/GenAI use cases. Continued strong growth to 18.9k stars, enabling local model experimentation on mobile devices.

on-device-AI · mobile-ML · Google · demo

Kotlin · 18.9K · +897 today · 1.8K

tobi/qmd

GitHub

Mini CLI search engine for docs, knowledge bases, and meeting notes using local state-of-the-art approaches. Continued strong growth to 19.7k stars.

search · local-first · CLI · knowledge-management

TypeScript · 19.7K · +859 today · 1.2K

NVIDIA/personaplex

High Relevance · GitHub

NVIDIA's PersonaPlex system for generating and managing AI personas. Surging from 7.5k to 8k stars as NVIDIA expands into agent personality infrastructure.

personas · NVIDIA · AI-characters · generation

Python · 8.0K · +662 today · 1.2K

elebumm/RedditVideoMakerBot

GitHub

Create Reddit Videos with just one command. Resurgent popularity at 636 stars/day, likely driven by content creator demand for automated video pipelines.

video-generation · Reddit · content-creation · automation

Python · 10.1K · +636 today · 2.5K

google-ai-edge/LiteRT-LM

High Relevance · GitHub

Google's lightweight runtime for running language models on edge devices. Complementing AI Edge Gallery with C++ inference infrastructure at 528 stars/day.

edge-inference · LLM-runtime · C++ · Google

C++ · 2.6K · +528 today · 253

NVIDIA-NeMo/DataDesigner

High Relevance · GitHub

NeMo Data Designer: Generate high-quality synthetic data from scratch or seed data. NVIDIA's data-centric AI approach gaining traction at 244 stars/day.

synthetic-data · NVIDIA · data-generation · NeMo

Python · 1.5K · +244 today · 132

TheCraigHewitt/seomachine

GitHub

Specialized Claude workspace for creating long-form, SEO-optimized blog content with research, writing, analysis, and optimization features. 215 stars/day.

SEO · content-generation · Claude · writing

Python · 4.0K · +215 today · 665

HKUDS/DeepTutor

High Relevance · GitHub

Agent-native personalized learning assistant from HKU. Steady growth at 168 stars/day as education-focused AI tools gain traction.

education · AI-tutor · personalization · agents

Python · 12.4K · +168 today · 1.7K

vectorize-io/hindsight

High Relevance · GitHub

Hindsight: Agent Memory That Learns. A novel agent memory system that improves over time, addressing the critical gap between simple context windows and full persistent memory.

agent-memory · learning · RAG-alternative · infrastructure

Python · 7.8K · +160 today · 486

HKUDS/AutoAgent

High Relevance · GitHub

Fully-automated and zero-code LLM agent framework from HKU. Enables building agents without programming, at 76 stars/day.

agent-framework · zero-code · automation · LLM-agents

Python · 9.0K · +76 today · 1.3K

Sources Checked

arXiv: 06:30 AM UTC

HuggingFace Daily Papers: 06:30 AM UTC

HuggingFace Trending Models: 06:30 AM UTC

GitHub Trending: 06:30 AM UTC

GitHub Trending (Python): 06:31 AM UTC

Web Search (Supplementary): 06:32 AM UTC

AlphaXiv: 06:30 AM UTC

Executive Summary

Researcher Notes

Themes & Trends

Test-Time Adaptation

Efficient Architecture Alternatives

Agent Evaluation Crisis

LLM Safety and Alignment

Agent Infrastructure Build-Out

Trending Papers (14)

In-Place Test-Time Training

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

Gym-Anything: Turn any Software into an Agent Environment

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Action Images: End-to-End Policy Learning via Multiview Video Generation

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Exclusive Unlearning

Target Policy Optimization

Artificial Intelligence and the Structure of Mathematics

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty

Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Trending Models (10)

Trending GitHub Repos (12)

Sources Checked