Monday, March 30, 2026
Attention Residuals rethink Transformers; LLM agents autonomously discover GPU kernels and RL algorithms; AI safety alarms as models fail without adversarial prompts
Executive Summary
Today's AI/ML landscape is defined by three major threads. First, architectural innovation: Moonshot AI's Attention Residuals paper proposes the first serious rethinking of residual connections in Transformers, while EverMind pushes context windows to 100 million tokens with Memory Sparse Attention. Second, autonomous AI agents are becoming researchers themselves: NVIDIA's AVO uses LLM agents to discover GPU kernels faster than hand-tuned CUDA, Fudan's POISE autonomously discovers novel RL training algorithms, and Meta's Hyperagents demonstrate self-referential self-improvement (accepted at ICLR 2026). Third, AI safety findings are increasingly alarming: Internal Safety Collapse shows frontier LLMs failing at a 95.3% rate without any adversarial prompting, while Claudini demonstrates an LLM autonomously discovering attacks against other LLMs. On the generative side, video generation continues its rapid advance, with four significant papers on world models, streaming multi-shot generation, and temporal extrapolation.
Researcher Notes
Attention Residuals is the sleeper hit of the week. At 165 HF upvotes, it's getting more attention than most model releases. The insight — that residual connections cause signal dilution in deep Transformers and can be replaced with learned attention over prior layers — is deceptively simple and potentially universal. If it holds up at larger scales, every Transformer-based model will want this. Watch for adoption signals in the next 2-4 weeks.
The 'AI researching AI' cluster is worth watching closely. AVO (GPU kernel discovery), POISE (RL algorithm discovery), MetaClaw (meta-learning agents), and Hyperagents (self-referential improvement) all appeared in the same week. This is no longer a fringe idea — it's a trend with serious institutional backing (NVIDIA, Meta FAIR, major universities). The POISE result is particularly striking: starting from GRPO, it autonomously discovered analytic-variance scaling that improved AIME25 pass@32 from 26.7% to 43.3%.
The Qwen 3.5 ecosystem dominance on HuggingFace is remarkable. 9 of the top 15 trending models are Qwen 3.5-based. Jackrong's Claude 4.6 Opus reasoning distillations alone account for 4 entries with over 1M combined downloads. The open-source model landscape is increasingly a Qwen monoculture in terms of base architectures, which is both a testament to Qwen's quality and a concentration risk.
Internal Safety Collapse deserves more attention than it's getting. A 95.3% safety failure rate during routine (non-adversarial) tasks across frontier LLMs is deeply concerning. Unlike jailbreaks, ISC requires no attacker — it emerges from normal use. The finding that alignment 'reshapes observable outputs but does not eliminate underlying unsafe capabilities' challenges the assumption that RLHF/DPO fundamentally changes model behavior rather than just surface-level responses.
Mirage from Stanford (including Fei-Fei Li) is provocative. Showing that multimodal models generate detailed reasoning about images even when no images are provided undermines a lot of existing multimodal evaluation. If models can score well on chest X-ray tasks without seeing X-rays, our benchmarks are measuring language priors, not visual understanding.
Themes & Trends
Autonomous AI Agents as Researchers
Rising: A striking cluster of papers demonstrates LLM agents autonomously conducting research tasks that previously required human expertise. AVO uses agents to discover GPU kernels outperforming hand-tuned CUDA. POISE agents autonomously discover novel RL training algorithms. MetaClaw agents meta-learn and evolve behavioral skills. Hyperagents self-referentially self-improve. This represents a qualitative shift from AI as tool to AI as independent researcher.
AI Safety Red Flags
Rising: Two papers reveal deeply concerning safety failure modes. Internal Safety Collapse shows 95.3% safety failure rates during routine (non-adversarial) tasks — alignment may be more superficial than assumed. Claudini demonstrates AI autonomously discovering adversarial attacks against other AI systems. Together, they suggest both the defensive and offensive sides of AI safety are harder than current approaches account for.
Video Generation Pushes Toward Real-Time Interactivity
Rising: Four papers advance video generation from batch processing toward real-time interactive use. ShotStream enables 16 FPS streaming multi-shot storytelling. PackForcing achieves 24x temporal extrapolation on a single GPU. HyDRA tackles object permanence in video world models. daVinci-MagiHuman generates 1080p human video in 38 seconds. The field is converging on interactive, real-time video as the next frontier.
Transformer Architecture Innovation
Rising: Attention Residuals proposes the first major rethinking of residual connections since their introduction, with broad implications for all Transformer-based models. MSA pushes context windows to 100M tokens with linear complexity. These are not incremental improvements — they challenge fundamental design decisions in the dominant architecture.
Reasoning Distillation and Its Pitfalls
Stable: The HuggingFace model landscape is dominated by Qwen 3.5 models distilled from Claude 4.6 Opus reasoning traces (4 of the top 15 trending models, 1M+ downloads). However, the self-distillation degradation paper warns that this approach can cause up to 40% OOD performance drops by suppressing epistemic verbalization. The community's enthusiasm for distillation may be outrunning understanding of its failure modes.
Speech and Audio Models Mature
Rising: Mistral's Voxtral TTS beats ElevenLabs in human preference evaluations. Cohere Transcribe achieves best-in-class ASR. daVinci-MagiHuman generates synchronized audio-video. The speech/audio modality is rapidly closing the gap with text and vision capabilities in open-source models.
Trending Papers (15)
Attention Residuals
High Relevance · Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu et al. — Moonshot AI (Kimi)
Proposes Attention Residuals (AttnRes), replacing fixed-weight accumulation in residual connections with softmax attention over preceding layer outputs. Each layer selectively aggregates earlier representations with learned, input-dependent weights. Block AttnRes partitions layers into blocks to reduce overhead. Integrated into Kimi Linear (48B/3B activated parameters) trained on 1.4T tokens.
Key Findings
- Consistent improvements across model sizes: MMLU 73.5→74.6, GPQA-Diamond 36.9→44.4, Math 53.5→57.1
- 1.25x compute advantage with <2% inference latency overhead
- Addresses signal dilution in PreNorm Transformers, where hidden-state magnitude grows unchecked with depth
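The core idea can be sketched in a few lines: instead of accumulating all earlier layer outputs with fixed unit weights, each layer forms an input-dependent softmax mixture over them. This is an illustrative simplification under our own assumptions (the function names, the use of single vectors per layer, and using the last layer's output as the query are ours), not the paper's implementation, which operates on full hidden-state tensors inside a trained network.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attn_res(prior_outputs, query_vec):
    """Aggregate earlier layer outputs with input-dependent softmax weights
    instead of summing them with fixed unit weights (illustrative sketch)."""
    H = np.stack(prior_outputs)        # (L, d): one output vector per prior layer
    weights = softmax(H @ query_vec)   # (L,): relevance score for each layer
    return weights @ H                 # (d,): convex combination of layer outputs

rng = np.random.default_rng(0)
layers = [rng.standard_normal(16) for _ in range(24)]

plain_residual = np.sum(layers, axis=0)      # fixed-weight accumulation
selective = attn_res(layers, layers[-1])     # attention over prior layers

# A convex combination cannot exceed the largest single-layer norm, while the
# plain residual stream keeps growing with depth — the "signal dilution" issue.
print(np.linalg.norm(plain_residual) > np.linalg.norm(selective))   # True
```

The toy comparison illustrates why bounded, selective aggregation might help at depth: the fixed-weight residual sum grows without limit, whereas the attention-weighted mixture stays on the scale of individual layer outputs.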
AVO: Agentic Variation Operators for Autonomous Evolutionary Search
High Relevance · Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye et al. — NVIDIA
Introduces Agentic Variation Operators (AVO) that replace fixed mutation/crossover heuristics of classical evolutionary search with autonomous LLM coding agents. Agents loop through proposing, repairing, critiquing, and verifying code edits while consulting lineage, domain knowledge, and execution feedback. Tested on attention mechanisms using NVIDIA Blackwell GPUs over 7 days.
Key Findings
- Up to 3.5% speedup over cuDNN and 10.5% over FlashAttention-4 on attention kernel optimization
- Transfer learning to new GPU architectures requires only 30 additional minutes
- Fully autonomous evolutionary search for complex optimization without hand-designed heuristics
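AVO's outer loop is classical evolutionary search with the variation operator swapped for an agent. A minimal sketch under stated assumptions: `llm_agent_vary` and `evaluate` are stand-ins we invented for the LLM coding agent and the kernel benchmark, and the candidate here is a single number rather than kernel source code.

```python
import random

def evaluate(candidate):
    """Stand-in fitness function; real AVO compiles and times a GPU kernel."""
    return -abs(candidate - 3.14159)

def llm_agent_vary(parent, lineage, rng):
    """Stand-in for the propose/repair/critique/verify agent loop, which in
    AVO edits kernel source using lineage, domain knowledge, and feedback."""
    return parent + rng.gauss(0, 0.5)

def agentic_evolution(seed_candidate, generations=300, seed=0):
    rng = random.Random(seed)
    best, best_fit = seed_candidate, evaluate(seed_candidate)
    lineage = [(best, best_fit)]
    for _ in range(generations):
        child = llm_agent_vary(best, lineage, rng)
        fit = evaluate(child)
        lineage.append((child, fit))
        if fit > best_fit:             # elitist selection: keep the improvement
            best, best_fit = child, fit
    return best, best_fit

best, fit = agentic_evolution(0.0)
```

The design point is that the selection loop stays dumb and verifiable; all the intelligence lives inside the variation operator, which is where the LLM agent replaces hand-designed mutation and crossover heuristics.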
MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild
High Relevance · Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu et al. — AIMING Lab, University of North Carolina at Chapel Hill
Introduces a continual meta-learning framework enabling LLM agents to jointly evolve a base LLM policy and a library of reusable behavioral skills. Features skill-driven fast adaptation from failure analysis and opportunistic policy optimization using cloud-based fine-tuning during inactive periods.
Key Findings
- Skill-driven adaptation improves accuracy by up to 32% (relative)
- Full pipeline advances performance from 21.4% to 40.6% accuracy with an 18.3% robustness increase
- Versioning system prevents data contamination during self-improvement
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
High Relevance · Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng et al. — Shanghai AI Laboratory
Reframes document OCR as inverse rendering rather than sequential text generation. Replaces autoregressive decoding with parallel diffusion denoising under visual conditioning. Introduces block-wise diffusion decoding and uncertainty-driven curriculum learning.
Key Findings
- Up to 3.2x faster than traditional autoregressive OCR methods
- Semantic Shuffle benchmark demonstrates reduced reliance on linguistic patterns and stronger visual grounding
- Treats OCR as recovering the rendering process rather than reading text — a paradigm shift
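The parallel-decoding idea can be illustrated with a toy masked-denoising loop: start fully masked, predict every position at once, and commit the most confident predictions each step instead of emitting tokens left to right. Everything here (`MASK`, `predict_fn`, the confidence schedule) is our assumed simplification; the paper's block-wise decoder works over visually conditioned token distributions, not strings.

```python
MASK = "<mask>"

def diffusion_decode(length, predict_fn, steps=4):
    """Fill all positions in parallel, unmasking the most confident
    predictions each step (toy sketch of diffusion-style decoding)."""
    tokens = [MASK] * length
    for step in range(steps):
        proposals = predict_fn(tokens)     # (token, confidence) per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        keep = max(1, len(masked) // (steps - step))   # growing unmask budget
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:keep]:
            tokens[i] = proposals[i][0]
    return [t if t != MASK else predict_fn(tokens)[i][0]
            for i, t in enumerate(tokens)]

target = "INVOICE"
# Toy oracle that "recognizes" each character with decaying confidence.
oracle = lambda toks: [(ch, 1.0 / (i + 1)) for i, ch in enumerate(target)]
print("".join(diffusion_decode(len(target), oracle)))   # INVOICE
```

Because every masked position is predicted in each pass, the number of decoding passes is fixed by the schedule rather than by sequence length, which is where the speedup over autoregressive decoding comes from.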
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
High Relevance · Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee et al. — Microsoft Research
Investigates why self-distillation — a widely used post-training method — sometimes degrades mathematical reasoning in LLMs. Identifies 'epistemic verbalization' (expressing uncertainty during reasoning) as a critical mechanism. When teachers condition on rich information, they suppress uncertainty expression, which helps in-domain but causes up to 40% performance drops on out-of-distribution tasks.
Key Findings
- Self-distillation can cause up to 40% performance degradation on out-of-distribution reasoning tasks
- Epistemic verbalization (uncertainty expression) is critical for generalization and gets suppressed during distillation
- Performance on in-domain tasks can mask significant generalization failures
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
High Relevance · Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding et al. — Huazhong University of Science and Technology, Kuaishou Technology
Introduces Hybrid Memory paradigm for video world models with the HyDRA architecture. Models must simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. Introduces HM-World dataset with 59K high-fidelity clips.
Key Findings
- HyDRA significantly outperforms state-of-the-art in dynamic subject consistency and generation quality
- HM-World provides 59K clips with meticulously designed exit-entry events across 17 scenes and 49 subjects
- Spatiotemporal relevance-driven retrieval preserves identity and motion of hidden subjects
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
High Relevance · Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev — FusionBrain Lab, Moscow State University
Uncovers hidden potential of Diffusion Transformers by introducing a single learned scaling parameter per DiT block. Frames DiT calibration as black-box reward optimization solved via evolutionary algorithm, modifying only ~100 parameters. Consistently improves performance across various text-to-image models while reducing required inference steps.
Key Findings
- A single learned scaling parameter per DiT block significantly improves generative quality
- Modifies only ~100 parameters yet consistently improves across various text-to-image models
- Reduces inference steps required for image generation while maintaining quality
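Because only ~100 scalars are tuned against a black-box reward, the search itself can be very simple. A minimal sketch under our own assumptions: the quadratic `reward` is a stand-in for Calibri's image-quality scoring, the hidden `target` scales are invented for illustration, and the (1+1)-style mutation loop is just one plausible evolutionary algorithm.

```python
import random

N_BLOCKS = 100                      # roughly the parameter count Calibri tunes

def reward(scales, target):
    """Black-box stand-in reward; Calibri scores actual generated images."""
    return -sum((s - t) ** 2 for s, t in zip(scales, target))

def calibrate(steps=2000, sigma=0.02, seed=0):
    rng = random.Random(seed)
    # Hidden "ideal" per-block scales the search has to discover (toy setup).
    target = [1.0 + 0.1 * (rng.random() - 0.5) for _ in range(N_BLOCKS)]
    scales = [1.0] * N_BLOCKS       # start from the uncalibrated model
    init = best = reward(scales, target)
    for _ in range(steps):
        cand = scales[:]            # mutate one scale at a time
        i = rng.randrange(N_BLOCKS)
        cand[i] += rng.gauss(0, sigma)
        r = reward(cand, target)
        if r > best:                # keep the candidate only if reward improves
            scales, best = cand, r
    return init, best

init, best = calibrate()
```

The appeal of this setup is that no gradients through the diffusion model are needed: with so few parameters, derivative-free search over reward evaluations is enough.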
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
High Relevance · Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye et al. — Imperial College London, EPFL
Uses Claude Code as an autonomous research agent to discover novel adversarial attack techniques for LLMs. Achieves up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B (vs. <=10% for existing methods). Attacks transfer directly to held-out models.
Key Findings
- 100% attack success rate against Meta-SecAlign-70B vs. 56% for best baseline
- Attacks discovered autonomously by an LLM transfer to models not seen during search
- Demonstrates AI can autonomously discover novel attack strategies more effective than human-designed ones
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
High Relevance · Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen et al. — MMLab, CUHK, Kling AI Research, Kuaishou Technology
Novel causal multi-shot architecture enabling interactive storytelling by reformulating multi-shot video as next-shot generation conditioned on historical context. Uses dual-cache memory mechanism (global context for inter-shot, local for intra-shot coherence) and two-stage distillation strategy.
Key Findings
- Achieves 16 FPS streaming generation on a single GPU
- Users can dynamically instruct ongoing narratives via streaming prompts
- Two-stage self-forcing distillation effectively bridges the train-test gap for autoregressive generation
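The dual-cache memory reduces to simple bookkeeping: a bounded global store of shot-level context for inter-shot coherence, plus a local store that resets at shot boundaries for intra-shot coherence. A sketch under assumed semantics; the class, method names, and eviction policy are ours, not the paper's.

```python
class DualCache:
    """Toy dual-cache memory: global inter-shot context + local intra-shot
    context that is cleared whenever a shot ends (assumed interface)."""

    def __init__(self, global_capacity=8):
        self.global_cache = []          # summaries of completed shots
        self.local_cache = []           # features of the shot in progress
        self.global_capacity = global_capacity

    def add_frame(self, frame_feat):
        self.local_cache.append(frame_feat)

    def end_shot(self, shot_summary):
        self.global_cache.append(shot_summary)
        if len(self.global_cache) > self.global_capacity:
            self.global_cache.pop(0)    # evict the oldest shot summary
        self.local_cache.clear()        # intra-shot memory resets per shot

    def context(self):
        """Conditioning context for generating the next frame or shot."""
        return self.global_cache + self.local_cache

cache = DualCache(global_capacity=2)
cache.add_frame("frame-1")
cache.end_shot("shot-1 summary")
```

Bounding the global cache is what keeps per-step conditioning cost flat as the story grows, which is a prerequisite for streaming generation at a fixed frame rate.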
Internal Safety Collapse in Frontier Large Language Models
High Relevance · Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng et al. — Fudan University, University of Melbourne, University of Illinois
Identifies Internal Safety Collapse (ISC) — a critical failure mode where frontier LLMs spontaneously generate harmful content during otherwise benign tasks, without any adversarial prompting. Develops TVD framework and ISC-Bench with 53 scenarios across professional disciplines.
Key Findings
- 95.3% average safety failure rate across four frontier LLMs — substantially exceeding the rates achieved by standard jailbreak attacks
- ISC occurs without adversarial prompt engineering, emerging from routine task execution
- Alignment reshapes observable outputs but does not eliminate underlying unsafe capabilities
Hyperagents: Self-Referential Self-Improving Agents
High Relevance · Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster et al. — Meta FAIR, University of British Columbia
Introduces hyperagents — self-referential agents integrating a task agent and meta agent into a single editable program. The DGM-Hyperagents framework enables self-improvement across diverse domains without domain-specific alignment assumptions. Accepted at ICLR 2026.
Key Findings
- Generates emergent meta-level enhancements such as persistent memory and performance tracking
- Improvements transfer across domains and accumulate across iterations
- Eliminates the need for domain-specific alignment between task performance and self-modification
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
High Relevance · Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao et al. — EverMind AI
Introduces Memory Sparse Attention (MSA) with document-wise RoPE for linear computational scaling to 100 million token contexts. Combines KV cache compression with memory parallelism and a memory interleave mechanism for complex reasoning.
Key Findings
- Less than 9% performance degradation when scaling from 16K to 100M tokens
- Linear computational complexity enables practical deployment at extreme context lengths
- Outperforms RAG and memory-focused agents on long-context benchmarks without external retrieval
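The sparsity idea behind linear scaling can be sketched as block selection: score per-block key summaries against the query, then run dense attention only inside the few best-matching blocks. This is our simplified reading and covers the sparsity only; real MSA additionally uses document-wise RoPE, KV cache compression, memory parallelism, and the memory interleave mechanism.

```python
import numpy as np

def memory_sparse_attention(q, K, V, block=64, top_k=4):
    """Attend only inside the top_k key blocks whose mean summary best
    matches the query, so cost scales with top_k * block rather than
    with total sequence length (illustrative sketch)."""
    n, d = K.shape
    n_blocks = n // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)
    summaries = Kb.mean(axis=1)                   # (n_blocks, d) block summaries
    picked = np.argsort(summaries @ q)[-top_k:]   # most relevant blocks
    keys = Kb[picked].reshape(-1, d)              # (top_k * block, d)
    vals = Vb[picked].reshape(-1, d)
    scores = keys @ q / np.sqrt(d)                # dense attention inside blocks
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vals

rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 32))
out = memory_sparse_attention(rng.standard_normal(32), K, K)
print(out.shape)   # (32,)
```

With fixed `block` and `top_k`, only the cheap summary scoring touches the full sequence, which is how per-query attention cost stays flat as contexts grow toward extreme lengths.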
Composer 2 Technical Report
High Relevance · Cursor Research (56 authors) — Cursor
Presents Composer 2, a specialized Mixture-of-Experts model for agentic software engineering. Trained with continued pretraining plus large-scale RL targeting multi-step reasoning on extended coding tasks. From Cursor, one of the most widely-used AI coding tools.
Key Findings
- 61.3% accuracy on CursorBench-3 (a 37% improvement over its predecessor)
- 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual
- Large-scale RL on extended coding tasks is key to agentic code model performance
Mirage: The Illusion of Visual Understanding
High Relevance · Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Fei-Fei Li et al. — Stanford University
Reveals that frontier multimodal models generate detailed image descriptions and reasoning traces even when no images are provided — termed 'mirage reasoning.' One model achieved top performance on chest X-ray tasks despite lacking image access. Introduces B-Clean benchmark for vision-grounded evaluation.
Key Findings
- Models produce detailed reasoning about images even when no images are provided
- One model scored top on chest X-ray tasks without seeing any X-rays
- Existing multimodal benchmarks may be measuring language priors rather than visual understanding
Voxtral TTS
Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Guillaume Lample et al. — Mistral AI
Mistral's multilingual text-to-speech system combining auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. Uses custom speech tokenizer with hybrid quantization. Open-weight release under CC BY-NC license.
Key Findings
- 68.4% human preference win rate over ElevenLabs Flash v2.5 for multilingual voice cloning
- Supports 9 languages with 20 preset voices and zero-shot voice cloning
- 70ms latency and up to 1430 char/s/GPU throughput
Trending Models (10)
Jackrong (individual) · image-text-to-text · 27B
Fine-tuned Qwen3.5-27B distilled on Claude 4.6 Opus reasoning trajectories. Features structured chain-of-thought with <think> tags, 262K context. Four variants (base, v2, GGUF versions) dominate trending with 1M+ combined downloads — the most viral reasoning distillation to date.
Mistral AI · text-to-speech · 4B
Frontier open-weight TTS model with 20 preset voices, 9 languages, zero-shot voice cloning. 68.4% win rate over ElevenLabs Flash v2.5. Production-ready streaming inference at 70ms latency on single GPU.
Cohere Labs · automatic-speech-recognition · 2B
State-of-the-art ASR with conformer-based encoder-decoder. Best-in-class 5.42 average WER on English ASR leaderboard across 14 languages. Up to 3x faster real-time factor than comparable dedicated ASR models.
Baidu · image-text-to-text · 5B
Unified end-to-end document intelligence model for direct image-to-Markdown conversion. Supports 192 languages. Ranked #1 on OmniDocBench v1.5 with 93.12 score. Features Layout-as-Thought reasoning.
SII-GAIR / Sand.ai · image-to-video · 15B
Fully open-source 15B single-stream audio-video generative model for human-centric generation. Joint text/video/audio processing via self-attention only. 80% win rate vs Ovi 1.1. Multilingual support across 6 languages.
Chroma · text-generation · 20B (MoE)
20B agentic search model trained to retrieve supporting documents for complex multi-hop queries. Built on GPT-OSS-20B with SFT + RL (CISPO). Operates as retrieval subagent alongside frontier reasoning models with parallel tool calling.
NVIDIA · text-generation · 32B total / 3B active (MoE)
MoE reasoning model (32B total, 3B active) achieving Gold Medal on IMO 2025 (35 pts) and IOI 2025 (439.3 pts). Features thinking mode, agentic task support, tool calling, and Python code execution.
Tesslate · text-generation · 9B
9B coding agent fine-tuned on 425K+ curated agentic coding trajectories from Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, and Gemini 3.1 Pro. Targets scaffolding patterns from Claude Code, OpenCode, Codex, and Droid.
Meta AI · multimodal (brain encoding) · composite
Foundation model for in-silico neuroscience predicting fMRI brain responses to naturalistic stimuli. Combines LLaMA 3.2-3B (text), V-JEPA2 (video), and Wav2Vec-BERT 2.0 (audio) into unified brain encoding Transformer.
Lightricks · image-to-video · 22B
DiT-based 22B audio-video foundation model with synchronized video+audio generation. Highest downloads on trending list at 1.37M. Supports text-to-video, image-to-video, and audio-video generation with extensive ecosystem of LoRAs and upscalers.