Monday, March 30, 2026
Attention Residuals rethink Transformers; LLM agents autonomously discover GPU kernels and RL algorithms; AI safety alarms as models fail without adversarial prompts
Executive Summary
Today's AI/ML landscape is defined by three major threads. First, architectural innovation: Moonshot AI's Attention Residuals paper proposes the first serious rethinking of residual connections in Transformers, while EverMind pushes context windows to 100 million tokens with Memory Sparse Attention. Second, autonomous AI agents are becoming researchers themselves: NVIDIA's AVO uses LLM agents to discover GPU kernels faster than hand-tuned CUDA, Fudan's POISE autonomously discovers novel RL training algorithms, and Meta's Hyperagents demonstrate self-referential self-improvement (accepted at ICLR 2026). Third, AI safety findings are increasingly alarming: Internal Safety Collapse shows frontier LLMs failing at a 95.3% rate without any adversarial prompting, while Claudini demonstrates an LLM autonomously discovering attacks against other LLMs. On the generative side, video generation continues its rapid advance, with four significant papers on world models, streaming multi-shot generation, and temporal extrapolation.
Researcher Notes
Attention Residuals is the sleeper hit of the week. At 165 HF upvotes, it's getting more attention than most model releases. The insight — that residual connections cause signal dilution in deep Transformers and can be replaced with learned attention over prior layers — is deceptively simple and potentially universal. If it holds up at larger scales, every Transformer-based model will want this. Watch for adoption signals in the next 2-4 weeks.
The 'AI researching AI' cluster is worth watching closely. AVO (GPU kernel discovery), POISE (RL algorithm discovery), MetaClaw (meta-learning agents), and Hyperagents (self-referential improvement) all appeared in the same week. This is no longer a fringe idea — it's a trend with serious institutional backing (NVIDIA, Meta FAIR, major universities). The POISE result is particularly striking: starting from GRPO, it autonomously discovered analytic-variance scaling that improved AIME25 pass@32 from 26.7% to 43.3%.
The Qwen 3.5 ecosystem dominance on HuggingFace is remarkable. 9 of the top 15 trending models are Qwen 3.5-based. Jackrong's Claude 4.6 Opus reasoning distillations alone account for 4 entries with over 1M combined downloads. The open-source model landscape is increasingly a Qwen monoculture in terms of base architectures, which is both a testament to Qwen's quality and a concentration risk.
Internal Safety Collapse deserves more attention than it's getting. A 95.3% safety failure rate during routine (non-adversarial) tasks across frontier LLMs is deeply concerning. Unlike jailbreaks, ISC requires no attacker — it emerges from normal use. The finding that alignment 'reshapes observable outputs but does not eliminate underlying unsafe capabilities' challenges the assumption that RLHF/DPO fundamentally changes model behavior rather than just surface-level responses.
Mirage from Stanford (including Fei-Fei Li) is provocative. Showing that multimodal models generate detailed reasoning about images even when no images are provided undermines a lot of existing multimodal evaluation. If models can score well on chest X-ray tasks without seeing X-rays, our benchmarks are measuring language priors, not visual understanding.
Themes & Trends
Autonomous AI Agents as Researchers
Rising: A striking cluster of papers demonstrates LLM agents autonomously conducting research tasks that previously required human expertise. AVO uses agents to discover GPU kernels outperforming hand-tuned CUDA. POISE agents autonomously discover novel RL training algorithms. MetaClaw agents meta-learn and evolve behavioral skills. Hyperagents self-referentially self-improve. This represents a qualitative shift from AI as tool to AI as independent researcher.
AI Safety Red Flags
Rising: Two papers reveal deeply concerning safety failure modes. Internal Safety Collapse shows 95.3% safety failure rates during routine (non-adversarial) tasks — alignment may be more superficial than assumed. Claudini demonstrates AI autonomously discovering adversarial attacks against other AI systems. Together, they suggest both the defensive and offensive sides of AI safety are harder than current approaches account for.
Video Generation Pushes Toward Real-Time Interactivity
Rising: Four papers advance video generation from batch processing toward real-time interactive use. ShotStream enables 16 FPS streaming multi-shot storytelling. PackForcing achieves 24x temporal extrapolation on a single GPU. HyDRA tackles object permanence in video world models. daVinci-MagiHuman generates 1080p human video in 38 seconds. The field is converging on interactive, real-time video as the next frontier.
Transformer Architecture Innovation
Rising: Attention Residuals proposes the first major rethinking of residual connections since their introduction, with broad implications for all Transformer-based models. MSA pushes context windows to 100M tokens with linear complexity. These are not incremental improvements — they challenge fundamental design decisions in the dominant architecture.
Reasoning Distillation and Its Pitfalls
Stable: The HuggingFace model landscape is dominated by Qwen 3.5 models distilled from Claude 4.6 Opus reasoning traces (4 of the top 15 trending models, 1M+ downloads). However, the self-distillation degradation paper warns that this approach can cause up to 40% OOD performance drops by suppressing epistemic verbalization. The community's enthusiasm for distillation may be outrunning understanding of its failure modes.
Speech and Audio Models Mature
Rising: Mistral's Voxtral TTS beats ElevenLabs in human preference evaluations. Cohere Transcribe achieves best-in-class ASR. daVinci-MagiHuman generates synchronized audio-video. The speech/audio modality is rapidly closing the gap with text and vision capabilities in open-source models.
Trending Papers (15)
Attention Residuals
High Relevance · Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu et al. — Moonshot AI (Kimi)
Proposes Attention Residuals (AttnRes), replacing fixed-weight accumulation in residual connections with softmax attention over preceding layer outputs. Each layer selectively aggregates earlier representations with learned, input-dependent weights. Block AttnRes partitions layers into blocks to reduce overhead. Integrated into Kimi Linear (48B/3B activated parameters) trained on 1.4T tokens.
Key Findings
- Consistent improvements across model sizes: MMLU 73.5→74.6, GPQA-Diamond 36.9→44.4, Math 53.5→57.1
- 1.25x compute advantage with <2% inference latency overhead
- Addresses signal dilution in PreNorm Transformers, where hidden-state magnitude grows unchecked with depth
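The core idea can be sketched in a few lines: instead of accumulating all earlier layer outputs with fixed unit weights, each layer forms an input-dependent softmax mixture over them. This is an illustrative simplification under our own assumptions (the function names, the use of single vectors per layer, and using the last layer's output as the query are ours), not the paper's implementation, which operates on full hidden-state tensors inside a trained network.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attn_res(prior_outputs, query_vec):
    """Aggregate earlier layer outputs with input-dependent softmax weights
    instead of summing them with fixed unit weights (illustrative sketch)."""
    H = np.stack(prior_outputs)        # (L, d): one output vector per prior layer
    weights = softmax(H @ query_vec)   # (L,): relevance score for each layer
    return weights @ H                 # (d,): convex combination of layer outputs

rng = np.random.default_rng(0)
layers = [rng.standard_normal(16) for _ in range(24)]

plain_residual = np.sum(layers, axis=0)      # fixed-weight accumulation
selective = attn_res(layers, layers[-1])     # attention over prior layers

# A convex combination cannot exceed the largest single-layer norm, while the
# plain residual stream keeps growing with depth — the "signal dilution" issue.
print(np.linalg.norm(plain_residual) > np.linalg.norm(selective))   # True
```

The toy comparison illustrates why bounded, selective aggregation might help at depth: the fixed-weight residual sum grows without limit, whereas the attention-weighted mixture stays on the scale of individual layer outputs.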
AVO: Agentic Variation Operators for Autonomous Evolutionary Search
High Relevance · Terry Chen, Zhifan Ye, Bing Xu, Zihao Ye et al. — NVIDIA
Introduces Agentic Variation Operators (AVO) that replace fixed mutation/crossover heuristics of classical evolutionary search with autonomous LLM coding agents. Agents loop through proposing, repairing, critiquing, and verifying code edits while consulting lineage, domain knowledge, and execution feedback. Tested on attention mechanisms using NVIDIA Blackwell GPUs over 7 days.
Key Findings
- Up to 3.5% speedup over cuDNN and 10.5% over FlashAttention-4 on attention kernel optimization
- Transfer learning to new GPU architectures requires only 30 additional minutes
- Fully autonomous evolutionary search for complex optimization without hand-designed heuristics
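AVO's outer loop is classical evolutionary search with the variation operator swapped for an agent. A minimal sketch under stated assumptions: `llm_agent_vary` and `evaluate` are stand-ins we invented for the LLM coding agent and the kernel benchmark, and the candidate here is a single number rather than kernel source code.

```python
import random

def evaluate(candidate):
    """Stand-in fitness function; real AVO compiles and times a GPU kernel."""
    return -abs(candidate - 3.14159)

def llm_agent_vary(parent, lineage, rng):
    """Stand-in for the propose/repair/critique/verify agent loop, which in
    AVO edits kernel source using lineage, domain knowledge, and feedback."""
    return parent + rng.gauss(0, 0.5)

def agentic_evolution(seed_candidate, generations=300, seed=0):
    rng = random.Random(seed)
    best, best_fit = seed_candidate, evaluate(seed_candidate)
    lineage = [(best, best_fit)]
    for _ in range(generations):
        child = llm_agent_vary(best, lineage, rng)
        fit = evaluate(child)
        lineage.append((child, fit))
        if fit > best_fit:             # elitist selection: keep the improvement
            best, best_fit = child, fit
    return best, best_fit

best, fit = agentic_evolution(0.0)
```

The design point is that the selection loop stays dumb and verifiable; all the intelligence lives inside the variation operator, which is where the LLM agent replaces hand-designed mutation and crossover heuristics.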
MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild
High Relevance · Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu et al. — AIMING Lab, University of North Carolina at Chapel Hill
Introduces a continual meta-learning framework enabling LLM agents to jointly evolve a base LLM policy and a library of reusable behavioral skills. Features skill-driven fast adaptation from failure analysis and opportunistic policy optimization using cloud-based fine-tuning during inactive periods.
Key Findings
- Skill-driven adaptation improves accuracy by up to 32% (relative)
- Full pipeline advances performance from 21.4% to 40.6% accuracy with an 18.3% robustness increase
- Versioning system prevents data contamination during self-improvement
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding
High Relevance · Hejun Dong, Junbo Niu, Bin Wang, Weijun Zeng et al. — Shanghai AI Laboratory
Reframes document OCR as inverse rendering rather than sequential text generation. Replaces autoregressive decoding with parallel diffusion denoising under visual conditioning. Introduces block-wise diffusion decoding and uncertainty-driven curriculum learning.
Key Findings
- Up to 3.2x faster than traditional autoregressive OCR methods
- Semantic Shuffle benchmark demonstrates reduced reliance on linguistic patterns and stronger visual grounding
- Treats OCR as recovering the rendering process rather than reading text — a paradigm shift
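The parallel-decoding idea can be illustrated with a toy masked-denoising loop: start fully masked, predict every position at once, and commit the most confident predictions each step instead of emitting tokens left to right. Everything here (`MASK`, `predict_fn`, the confidence schedule) is our assumed simplification; the paper's block-wise decoder works over visually conditioned token distributions, not strings.

```python
MASK = "<mask>"

def diffusion_decode(length, predict_fn, steps=4):
    """Fill all positions in parallel, unmasking the most confident
    predictions each step (toy sketch of diffusion-style decoding)."""
    tokens = [MASK] * length
    for step in range(steps):
        proposals = predict_fn(tokens)     # (token, confidence) per position
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        keep = max(1, len(masked) // (steps - step))   # growing unmask budget
        for i in sorted(masked, key=lambda i: -proposals[i][1])[:keep]:
            tokens[i] = proposals[i][0]
    return [t if t != MASK else predict_fn(tokens)[i][0]
            for i, t in enumerate(tokens)]

target = "INVOICE"
# Toy oracle that "recognizes" each character with decaying confidence.
oracle = lambda toks: [(ch, 1.0 / (i + 1)) for i, ch in enumerate(target)]
print("".join(diffusion_decode(len(target), oracle)))   # INVOICE
```

Because every masked position is predicted in each pass, the number of decoding passes is fixed by the schedule rather than by sequence length, which is where the speedup over autoregressive decoding comes from.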
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
High Relevance · Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee et al. — Microsoft Research
Investigates why self-distillation — a widely used post-training method — sometimes degrades mathematical reasoning in LLMs. Identifies 'epistemic verbalization' (expressing uncertainty during reasoning) as a critical mechanism. When teachers condition on rich information, they suppress uncertainty expression, which helps in-domain but causes up to 40% performance drops on out-of-distribution tasks.
Key Findings
- Self-distillation can cause up to 40% performance degradation on out-of-distribution reasoning tasks
- Epistemic verbalization (uncertainty expression) is critical for generalization and gets suppressed during distillation
- Performance on in-domain tasks can mask significant generalization failures
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
High Relevance · Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding et al. — Huazhong University of Science and Technology, Kuaishou Technology
Introduces Hybrid Memory paradigm for video world models with the HyDRA architecture. Models must simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. Introduces HM-World dataset with 59K high-fidelity clips.
Key Findings
- HyDRA significantly outperforms state-of-the-art in dynamic subject consistency and generation quality
- HM-World provides 59K clips with meticulously designed exit-entry events across 17 scenes and 49 subjects
- Spatiotemporal relevance-driven retrieval preserves identity and motion of hidden subjects
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
High Relevance · Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev — FusionBrain Lab, Moscow State University
Uncovers hidden potential of Diffusion Transformers by introducing a single learned scaling parameter per DiT block. Frames DiT calibration as black-box reward optimization solved via evolutionary algorithm, modifying only ~100 parameters. Consistently improves performance across various text-to-image models while reducing required inference steps.
Key Findings
- A single learned scaling parameter per DiT block significantly improves generative quality
- Modifies only ~100 parameters yet consistently improves across various text-to-image models
- Reduces inference steps required for image generation while maintaining quality
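Because only ~100 scalars are tuned against a black-box reward, the search itself can be very simple. A minimal sketch under our own assumptions: the quadratic `reward` is a stand-in for Calibri's image-quality scoring, the hidden `target` scales are invented for illustration, and the (1+1)-style mutation loop is just one plausible evolutionary algorithm.

```python
import random

N_BLOCKS = 100                      # roughly the parameter count Calibri tunes

def reward(scales, target):
    """Black-box stand-in reward; Calibri scores actual generated images."""
    return -sum((s - t) ** 2 for s, t in zip(scales, target))

def calibrate(steps=2000, sigma=0.02, seed=0):
    rng = random.Random(seed)
    # Hidden "ideal" per-block scales the search has to discover (toy setup).
    target = [1.0 + 0.1 * (rng.random() - 0.5) for _ in range(N_BLOCKS)]
    scales = [1.0] * N_BLOCKS       # start from the uncalibrated model
    init = best = reward(scales, target)
    for _ in range(steps):
        cand = scales[:]            # mutate one scale at a time
        i = rng.randrange(N_BLOCKS)
        cand[i] += rng.gauss(0, sigma)
        r = reward(cand, target)
        if r > best:                # keep the candidate only if reward improves
            scales, best = cand, r
    return init, best

init, best = calibrate()
```

The appeal of this setup is that no gradients through the diffusion model are needed: with so few parameters, derivative-free search over reward evaluations is enough.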
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
High Relevance · Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye et al. — Imperial College London, EPFL
Uses Claude Code as an autonomous research agent to discover novel adversarial attack techniques for LLMs. Achieves up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B (vs. <=10% for existing methods). Attacks transfer directly to held-out models.
Key Findings
- 100% attack success rate against Meta-SecAlign-70B vs. 56% for best baseline
- Attacks discovered autonomously by an LLM transfer to models not seen during search
- Demonstrates AI can autonomously discover novel attack strategies more effective than human-designed ones
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
High Relevance · Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen et al. — MMLab, CUHK, Kling AI Research, Kuaishou Technology
Novel causal multi-shot architecture enabling interactive storytelling by reformulating multi-shot video as next-shot generation conditioned on historical context. Uses dual-cache memory mechanism (global context for inter-shot, local for intra-shot coherence) and two-stage distillation strategy.
Key Findings
- Achieves 16 FPS streaming generation on a single GPU
- Users can dynamically instruct ongoing narratives via streaming prompts
- Two-stage self-forcing distillation effectively bridges the train-test gap for autoregressive generation
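The dual-cache memory reduces to simple bookkeeping: a bounded global store of shot-level context for inter-shot coherence, plus a local store that resets at shot boundaries for intra-shot coherence. A sketch under assumed semantics; the class, method names, and eviction policy are ours, not the paper's.

```python
class DualCache:
    """Toy dual-cache memory: global inter-shot context + local intra-shot
    context that is cleared whenever a shot ends (assumed interface)."""

    def __init__(self, global_capacity=8):
        self.global_cache = []          # summaries of completed shots
        self.local_cache = []           # features of the shot in progress
        self.global_capacity = global_capacity

    def add_frame(self, frame_feat):
        self.local_cache.append(frame_feat)

    def end_shot(self, shot_summary):
        self.global_cache.append(shot_summary)
        if len(self.global_cache) > self.global_capacity:
            self.global_cache.pop(0)    # evict the oldest shot summary
        self.local_cache.clear()        # intra-shot memory resets per shot

    def context(self):
        """Conditioning context for generating the next frame or shot."""
        return self.global_cache + self.local_cache

cache = DualCache(global_capacity=2)
cache.add_frame("frame-1")
cache.end_shot("shot-1 summary")
```

Bounding the global cache is what keeps per-step conditioning cost flat as the story grows, which is a prerequisite for streaming generation at a fixed frame rate.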
Internal Safety Collapse in Frontier Large Language Models
High Relevance · Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng et al. — Fudan University, University of Melbourne, University of Illinois
Identifies Internal Safety Collapse (ISC) — a critical failure mode where frontier LLMs spontaneously generate harmful content during otherwise benign tasks, without any adversarial prompting. Develops TVD framework and ISC-Bench with 53 scenarios across professional disciplines.
Key Findings
- 95.3% average safety failure rate across four frontier LLMs — substantially exceeding the rates achieved by standard jailbreak attacks
- ISC occurs without adversarial prompt engineering, emerging from routine task execution
- Alignment reshapes observable outputs but does not eliminate underlying unsafe capabilities
Hyperagents: Self-Referential Self-Improving Agents
High Relevance · Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster et al. — Meta FAIR, University of British Columbia
Introduces hyperagents — self-referential agents integrating a task agent and meta agent into a single editable program. The DGM-Hyperagents framework enables self-improvement across diverse domains without domain-specific alignment assumptions. Accepted at ICLR 2026.
Key Findings
- Generates emergent meta-level enhancements such as persistent memory and performance tracking
- Improvements transfer across domains and accumulate across iterations
- Eliminates the need for domain-specific alignment between task performance and self-modification
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
High Relevance · Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao et al. — EverMind AI
Introduces Memory Sparse Attention (MSA) with document-wise RoPE for linear computational scaling to 100 million token contexts. Combines KV cache compression with memory parallelism and a memory interleave mechanism for complex reasoning.
Key Findings
- Less than 9% performance degradation when scaling from 16K to 100M tokens
- Linear computational complexity enables practical deployment at extreme context lengths
- Outperforms RAG and memory-focused agents on long-context benchmarks without external retrieval
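The sparsity idea behind linear scaling can be sketched as block selection: score per-block key summaries against the query, then run dense attention only inside the few best-matching blocks. This is our simplified reading and covers the sparsity only; real MSA additionally uses document-wise RoPE, KV cache compression, memory parallelism, and the memory interleave mechanism.

```python
import numpy as np

def memory_sparse_attention(q, K, V, block=64, top_k=4):
    """Attend only inside the top_k key blocks whose mean summary best
    matches the query, so cost scales with top_k * block rather than
    with total sequence length (illustrative sketch)."""
    n, d = K.shape
    n_blocks = n // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)
    summaries = Kb.mean(axis=1)                   # (n_blocks, d) block summaries
    picked = np.argsort(summaries @ q)[-top_k:]   # most relevant blocks
    keys = Kb[picked].reshape(-1, d)              # (top_k * block, d)
    vals = Vb[picked].reshape(-1, d)
    scores = keys @ q / np.sqrt(d)                # dense attention inside blocks
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vals

rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 32))
out = memory_sparse_attention(rng.standard_normal(32), K, K)
print(out.shape)   # (32,)
```

With fixed `block` and `top_k`, only the cheap summary scoring touches the full sequence, which is how per-query attention cost stays flat as contexts grow toward extreme lengths.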
Composer 2 Technical Report
High Relevance · Cursor Research (56 authors) — Cursor
Presents Composer 2, a specialized Mixture-of-Experts model for agentic software engineering. Trained with continued pretraining plus large-scale RL targeting multi-step reasoning on extended coding tasks. From Cursor, one of the most widely-used AI coding tools.
Key Findings
- 61.3% accuracy on CursorBench-3 (a 37% improvement over its predecessor)
- 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual
- Large-scale RL on extended coding tasks is key to agentic code model performance
Mirage: The Illusion of Visual Understanding
High Relevance · Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Fei-Fei Li et al. — Stanford University
Reveals that frontier multimodal models generate detailed image descriptions and reasoning traces even when no images are provided — termed 'mirage reasoning.' One model achieved top performance on chest X-ray tasks despite lacking image access. Introduces B-Clean benchmark for vision-grounded evaluation.
Key Findings
- Models produce detailed reasoning about images even when no images are provided
- One model scored top on chest X-ray tasks without seeing any X-rays
- Existing multimodal benchmarks may be measuring language priors rather than visual understanding
Voxtral TTS
Alexander H. Liu, Alexis Tacnet, Andy Ehrenberg, Guillaume Lample et al. — Mistral AI
Mistral's multilingual text-to-speech system combining auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. Uses custom speech tokenizer with hybrid quantization. Open-weight release under CC BY-NC license.
Key Findings
- 68.4% human preference win rate over ElevenLabs Flash v2.5 for multilingual voice cloning
- Supports 9 languages with 20 preset voices and zero-shot voice cloning
- 70ms latency and up to 1430 char/s/GPU throughput
Trending Models (10)
Jackrong (individual) · image-text-to-text · 27B
Fine-tuned Qwen3.5-27B distilled on Claude 4.6 Opus reasoning trajectories. Features structured chain-of-thought with <think> tags, 262K context. Four variants (base, v2, GGUF versions) dominate trending with 1M+ combined downloads — the most viral reasoning distillation to date.
Mistral AI · text-to-speech · 4B
Frontier open-weight TTS model with 20 preset voices, 9 languages, zero-shot voice cloning. 68.4% win rate over ElevenLabs Flash v2.5. Production-ready streaming inference at 70ms latency on single GPU.
Cohere Labs · automatic-speech-recognition · 2B
State-of-the-art ASR with conformer-based encoder-decoder. Best-in-class 5.42 average WER on English ASR leaderboard across 14 languages. Up to 3x faster real-time factor than comparable dedicated ASR models.
Baidu · image-text-to-text · 5B
Unified end-to-end document intelligence model for direct image-to-Markdown conversion. Supports 192 languages. Ranked #1 on OmniDocBench v1.5 with 93.12 score. Features Layout-as-Thought reasoning.
SII-GAIR / Sand.ai · image-to-video · 15B
Fully open-source 15B single-stream audio-video generative model for human-centric generation. Joint text/video/audio processing via self-attention only. 80% win rate vs Ovi 1.1. Multilingual support across 6 languages.
Chroma · text-generation · 20B (MoE)
20B agentic search model trained to retrieve supporting documents for complex multi-hop queries. Built on GPT-OSS-20B with SFT + RL (CISPO). Operates as retrieval subagent alongside frontier reasoning models with parallel tool calling.
NVIDIA · text-generation · 32B total / 3B active (MoE)
MoE reasoning model (32B total, 3B active) achieving Gold Medal on IMO 2025 (35 pts) and IOI 2025 (439.3 pts). Features thinking mode, agentic task support, tool calling, and Python code execution.
Tesslate · text-generation · 9B
9B coding agent fine-tuned on 425K+ curated agentic coding trajectories from Claude Opus 4.6, GPT-5.3-Codex, GPT-5.4, and Gemini 3.1 Pro. Targets scaffolding patterns from Claude Code, OpenCode, Codex, and Droid.
Meta AI · multimodal (brain encoding) · composite
Foundation model for in-silico neuroscience predicting fMRI brain responses to naturalistic stimuli. Combines LLaMA 3.2-3B (text), V-JEPA2 (video), and Wav2Vec-BERT 2.0 (audio) into unified brain encoding Transformer.
Lightricks · image-to-video · 22B
DiT-based 22B audio-video foundation model with synchronized video+audio generation. Highest downloads on trending list at 1.37M. Supports text-to-video, image-to-video, and audio-video generation with extensive ecosystem of LoRAs and upscalers.