Tuesday, March 31, 2026

Trillion-parameter scientific foundation model arrives; Agent skill distillation from trajectories gains traction; Coding agents get specialized models and organicity benchmarks

trillion-scale-models · agent-skill-learning · coding-agents · medical-ai · diffusion-transformers · video-generation

Executive Summary

Today's AI/ML landscape is shaped by three converging trends. First, scale meets science: Intern-S1-Pro debuts as the first trillion-parameter scientific multimodal foundation model, promising comprehensive enhancement across general and scientific domains with advanced agent capabilities. Second, agent engineering matures rapidly: Trace2Skill introduces a principled framework for distilling reusable skills from agent trajectories, Natural-Language Agent Harnesses proposes externalizing agent control logic as portable natural-language artifacts, and Learning to Commit tackles the overlooked 'organicity' problem where LLM-generated PRs get rejected despite being functionally correct. Third, coding agents and benchmarks evolve in tandem: Cursor's Composer 2 achieves frontier-level agentic software engineering through RL at scale, while SlopCodeBench reveals how coding agents degrade over iterative tasks — a critical blind spot in current evaluation.

On the medical AI front, MedOpenClaw exposes a striking paradox where VLMs actually perform worse when given professional tools, and Medical AI Scientist demonstrates the first autonomous clinical research framework. In computer vision, Calibri shows that a single learned scaling parameter can significantly enhance Diffusion Transformers, and GenMask elegantly adapts DiT for segmentation by generating masks directly.

Researcher Notes

Intern-S1-Pro at 1T parameters is a landmark release. While we've seen trillion-scale language models before, this is the first claiming multimodal scientific capabilities at this scale. The 9-comment engagement on HuggingFace (second only to Trace2Skill today) signals genuine community interest. The key question is whether the scientific specialization justifies the scale — or whether smaller, domain-adapted models remain more practical. Watch for benchmark comparisons against specialized scientific models.

Trace2Skill's 13 comments make it today's most-discussed paper, and for good reason. The problem it addresses — how to automatically extract reusable, transferable skills from agent trajectories without overfitting to trajectory-local lessons — is one of the core bottlenecks in building practical LLM agents. Combined with yesterday's MetaClaw (meta-learning agents) and AVO (evolutionary search agents), we're seeing a clear research arc toward agents that improve themselves systematically rather than through brute-force prompting.

The 'coding agent' cluster tells a nuanced story. Composer 2 (Cursor) demonstrates that RL-trained specialized models can match frontier general-purpose models on real software engineering. But SlopCodeBench provides the necessary counterpoint: current coding agents degrade progressively over iterative tasks, a failure mode invisible to single-shot benchmarks like SWE-bench. Meanwhile, Learning to Commit identifies 'organicity' — adherence to project conventions, API reuse, and architectural consistency — as the real barrier to PR acceptance. These three papers together paint a picture of a field that's getting good at isolated tasks but struggling with the sustained, convention-aware work that real software engineering demands.

MedOpenClaw's performance paradox is genuinely surprising. State-of-the-art VLMs (Gemini 3.1 Pro, GPT-5.4) actually perform worse when given access to professional medical imaging tools (3D Slicer). The authors attribute this to lack of precise spatial grounding — models can reason about pre-selected 2D slices but fail when they need to navigate full 3D volumes. This challenges the assumption that 'more tools = better agents' and suggests tool-use training needs fundamental rethinking for spatial domains.

The Qwen 3.5 ecosystem continues to dominate HuggingFace trending models. Jackrong's Claude 4.6 Opus reasoning distillations maintain their grip with multiple GGUF variants in the top 20. New entries include Nvidia's Nemotron-Cascade-2-30B-A3B (hybrid architecture), Cohere's first speech recognition model, and Mistral's Voxtral TTS — the model landscape is diversifying beyond text generation into speech and multimodal pipelines.

Themes & Trends

Agent Skill Learning & Engineering

rising

Multiple papers converge on how to make LLM agents learn, retain, and transfer skills — from trajectory distillation (Trace2Skill) to portable harness design (NLAHs) and code convention learning (Learning to Commit).

Coding Agent Maturation

rising

Specialized coding models (Composer 2, Kernel-Smith), organicity-focused evaluation, and long-horizon degradation benchmarks (SlopCodeBench) signal the field moving beyond simple code generation to sustained software engineering.

Medical AI Agents

rising

MedOpenClaw reveals tool-use paradoxes in medical VLMs while Medical AI Scientist demonstrates autonomous clinical research. Both highlight the gap between general AI capabilities and domain-specific requirements.

Diffusion Transformer Optimization

stable

Calibri demonstrates parameter-efficient DiT enhancement via learned scaling, GenMask repurposes DiT for segmentation, and multiple papers push the boundaries of diffusion-based generation.

Reasoning Transparency & Safety

stable

Lie to Me exposes a stark gap between models' internal reasoning acknowledgment and their final outputs, raising concerns about CoT faithfulness as a safety mechanism.

Trillion-Scale Scientific Models

rising

Intern-S1-Pro marks the arrival of trillion-parameter models specifically targeting scientific understanding, while PRBench shows current agents still fail at end-to-end physics paper reproduction.

Trending Papers (14)

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

High Relevance

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou

Introduces a framework for distilling reusable, transferable skills from LLM agent trajectories. Addresses the scalability bottleneck of manual skill authoring and the fragility of automated skill generation that overfits to trajectory-local lessons.

Key Findings

  • Overcomes shallow parametric knowledge limitations in automated skill generation
  • Produces transferable skills that generalize beyond specific trajectory contexts
  • Most-discussed paper of the day with 13 HuggingFace comments
agents · skill-learning · trajectories · transfer-learning
0 upvotes
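
The core idea, as the summary frames it, can be caricatured in a few lines: extract a candidate lesson from each trajectory, then keep only lessons that transfer beyond the task they came from. The sketch below is a hypothetical illustration; both hooks (`extract_lesson`, `helps_on`) stand in for LLM-driven steps in the actual framework, and none of the names come from the paper.

```python
def distill_skills(trajectories, extract_lesson, helps_on):
    """Keep only lessons that generalize: a candidate lesson is promoted to a
    skill when it helps on tasks other than the one it was distilled from,
    filtering out trajectory-local hacks."""
    skills = []
    for traj in trajectories:
        lesson = extract_lesson(traj)
        other_tasks = [t["task"] for t in trajectories if t is not traj]
        if lesson not in skills and all(helps_on(lesson, task) for task in other_tasks):
            skills.append(lesson)  # transferable, not trajectory-local
    return skills


# Toy stubs standing in for LLM judgment:
trajectories = [
    {"task": "csv-report", "lesson": "always validate input schema"},
    {"task": "pdf-report", "lesson": "always validate input schema"},
    {"task": "csv-report", "lesson": "hardcode column 3 as the date"},
]
extract = lambda t: t["lesson"]
helps = lambda lesson, task: "hardcode" not in lesson  # local hacks don't transfer
print(distill_skills(trajectories, extract, helps))  # ['always validate input schema']
```

The cross-task check is what separates this from naive lesson logging: a lesson that only works on its source trajectory never enters the skill library.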

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

High Relevance

Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou

First one-trillion-parameter scientific multimodal foundation model. Delivers comprehensive enhancement across general and scientific domains with advanced agent capabilities and scientific expertise at unprecedented scale.

Key Findings

  • First trillion-parameter model specifically targeting scientific multimodal understanding
  • Combines stronger reasoning, image-text understanding, and agent capabilities
  • Scientific expertise augmented beyond general-purpose improvements
foundation-models · scientific-ai · multimodal · trillion-scale
0 upvotes

Composer 2 Technical Report

High Relevance

Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger (Cursor Research)

Specialized coding model for agentic software engineering trained via continued pretraining and large-scale RL. Achieves frontier-level performance on real software engineering problems in large codebases.

Key Findings

  • 61.3 on CursorBench, 61.7 on Terminal-Bench, 73.7 on SWE-bench Multilingual
  • Two-phase training: continued pretraining + large-scale reinforcement learning
  • Strong long-term planning and coding intelligence for interactive and agentic use
coding-agents · reinforcement-learning · software-engineering · language-models
0 upvotes

MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

High Relevance

Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu

Proposes the MedOpenClaw runtime enabling VLMs to operate within standard medical tools (3D Slicer) and the MedFlowBench benchmark for full-study medical imaging evaluation. Reveals that models degrade when given professional tool access due to lack of precise spatial grounding.

Key Findings

  • State-of-the-art VLMs (Gemini 3.1 Pro, GPT-5.4) perform worse with professional tool access
  • Performance paradox attributed to lack of precise spatial grounding in 3D volumes
  • First benchmark evaluating VLMs on uncurated full medical imaging studies
medical-ai · agents · tool-use · vision-language · benchmarks
0 upvotes

Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

High Relevance

Danil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev

Uncovers hidden potential of Diffusion Transformers by demonstrating that a single learned scaling parameter can significantly improve DiT block performance. Proposes a parameter-efficient calibration approach for enhancing generative tasks.

Key Findings

  • Single learned scaling parameter yields significant performance gains in DiT blocks
  • Parameter-efficient approach — minimal additional parameters needed
  • Applicable across generative tasks without architectural redesign
diffusion-transformers · parameter-efficiency · image-generation · calibration
0 upvotes
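
The central claim, one learned scalar per block, is simple enough to sketch. The toy below (plain NumPy, not the paper's code or architecture) fits a single calibration scale on the output of a frozen stand-in "block" by gradient descent on a synthetic target; the block, the target, and the optimizer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dit_block(x):
    """Placeholder for a frozen, pretrained DiT block (not the real architecture)."""
    return np.tanh(x)

x = rng.normal(size=(128, 8))
target = 1.7 * dit_block(x)  # pretend the "ideal" output is a rescaled block output
scale = 1.0                  # the single learned calibration parameter
lr = 0.1
for _ in range(200):
    out = scale * dit_block(x)
    # d/d(scale) of mean squared error between calibrated output and target:
    grad = 2 * np.mean((out - target) * dit_block(x))
    scale -= lr * grad

print(round(scale, 2))  # converges toward 1.7
```

The appeal is the parameter count: one extra scalar per block, trainable with the backbone frozen, which is why the paper can bill it as parameter-efficient calibration rather than fine-tuning.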

Natural-Language Agent Harnesses

High Relevance

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

Introduces Natural-Language Agent Harnesses (NLAHs) that externalize agent control logic as portable natural-language artifacts with explicit contracts and lightweight adapters. Proposes Intelligent Harness Runtime (IHR) for shared execution.

Key Findings

  • Harness behavior expressed in editable natural language rather than buried in controller code
  • Portable across runtimes with explicit contracts and lightweight adapters
  • Evaluated on coding and computer-use benchmarks
agents · harness-engineering · portability · natural-language
0 upvotes
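
To make "portable natural-language artifact" concrete, here is a minimal hypothetical sketch: the harness's control logic lives in an editable string, and the contract is an explicit, machine-checkable field. The schema and all field names are assumptions for illustration, not the paper's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class HarnessArtifact:
    """Sketch of an NLAH-style artifact: behavior is expressed in editable
    natural language instead of controller code, with an explicit contract
    a runtime (or adapter) can check before execution."""
    name: str
    control_logic: str                            # natural-language policy, editable
    contract: dict = field(default_factory=dict)  # explicit required inputs

    def validate(self, inputs: dict) -> bool:
        """Check that provided inputs satisfy the declared contract."""
        return all(k in inputs for k in self.contract.get("requires", []))


harness = HarnessArtifact(
    name="code-review-harness",
    control_logic=(
        "On each turn: read the diff, run the test suite, and only approve "
        "when all tests pass and the change follows repository conventions."
    ),
    contract={"requires": ["diff", "test_command"]},
)
print(harness.validate({"diff": "...", "test_command": "pytest"}))  # True
print(harness.validate({"diff": "..."}))                            # False
```

Because the artifact is plain data plus prose, it can move between runtimes; only the thin adapter that interprets `control_logic` is runtime-specific, which is the portability argument.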

Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

High Relevance

Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu

Proposes Learning to Commit framework using Online Repository Memory to improve LLM coding agents' code organicity — adherence to project conventions, API reuse, and architectural consistency. Agents perform supervised contrastive reflection on historical commits.

Key Findings

  • Identifies 'organicity' as the root cause of PR rejection, not functional correctness
  • Supervised contrastive reflection on historical commits distills project-specific patterns
  • Improves organicity on held-out PRs that were genuinely merged later
coding-agents · pull-requests · code-conventions · software-engineering
0 upvotes
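
A minimal sketch of what an online repository memory might look like, assuming a simple per-topic note store; the contrastive step that distills each note from a merged/rejected commit pair would be LLM-driven in the actual framework and is represented here only by the method signature.

```python
from collections import defaultdict

class RepositoryMemory:
    """Toy stand-in for an online repository memory: accumulate project-specific
    convention notes from historical commits and surface them when drafting a
    new PR. Names and structure are assumptions, not the paper's design."""

    def __init__(self):
        self.conventions = defaultdict(list)

    def reflect(self, merged_diff: str, rejected_diff: str, note: str, topic: str):
        """Record a lesson distilled by contrasting a merged commit with a
        rejected one (the contrast itself would be done by an LLM)."""
        self.conventions[topic].append(note)

    def retrieve(self, topic: str) -> list:
        """Surface stored conventions relevant to the change being drafted."""
        return self.conventions.get(topic, [])


memory = RepositoryMemory()
memory.reflect(
    merged_diff="uses utils.http_get(...)",
    rejected_diff="re-implements HTTP fetching inline",
    note="Reuse utils.http_get instead of writing new HTTP code.",
    topic="networking",
)
print(memory.retrieve("networking"))
```

The "online" part is the point: the memory grows as the repository does, so convention knowledge stays current rather than being baked in at fine-tuning time.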

Towards a Medical AI Scientist

High Relevance

Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao

Introduces Medical AI Scientist, the first autonomous research framework for clinical research with clinician-engineer co-reasoning mechanism. Operates in three modes: paper reproduction, literature-inspired innovation, and task-driven exploration.

Key Findings

  • First autonomous research framework specifically for clinical research
  • Generated research ideas of substantially higher quality than commercial LLMs across 171 cases
  • Generated manuscripts approach MICCAI-level quality
medical-ai · autonomous-research · clinical · scientific-agents
0 upvotes

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

High Relevance

Richard J. Young

Tests 12 open-weight reasoning models (7B-685B) on 498 questions with six hint categories. Reveals stark gap between thinking-token acknowledgment (87.5%) and answer-text acknowledgment (28.6%), suggesting models suppress acknowledgment in final outputs.

Key Findings

  • Faithfulness ranges from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale)
  • 87.5% thinking-token vs 28.6% answer-text acknowledgment reveals systematic suppression
  • First large-scale faithfulness study on open-weight reasoning models
reasoning · chain-of-thought · faithfulness · safety · evaluation
0 upvotes
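
The headline numbers reduce to two simple rates over hint-acknowledgment labels. A toy sketch, assuming a per-question record schema that is not the paper's actual data format:

```python
# For each hinted question, record whether the model acknowledged the hint in
# its thinking tokens and whether it did so in the final answer text.
records = [
    {"ack_in_thinking": True,  "ack_in_answer": False},
    {"ack_in_thinking": True,  "ack_in_answer": True},
    {"ack_in_thinking": False, "ack_in_answer": False},
    {"ack_in_thinking": True,  "ack_in_answer": False},
]

def ack_rate(records, field):
    """Fraction of questions where the hint was acknowledged in the given channel."""
    return sum(r[field] for r in records) / len(records)

thinking_rate = ack_rate(records, "ack_in_thinking")  # 0.75 on this toy data
answer_rate = ack_rate(records, "ack_in_answer")      # 0.25
suppression_gap = thinking_rate - answer_rate         # 0.5
print(thinking_rate, answer_rate, suppression_gap)
```

The paper's reported gap (87.5% thinking-token vs 28.6% answer-text) is exactly this kind of channel-wise difference, which is why the authors read it as suppression rather than ignorance of the hint.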

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

High Relevance

He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai

Framework combining evolutionary algorithms with post-training RL for GPU kernel generation. Achieves state-of-the-art on KernelBench and outperforms Gemini-3.0-pro and Claude-4.6-opus. Produces upstream contributions to SGLang and LMDeploy.

Key Findings

  • State-of-the-art on KernelBench (Triton backend)
  • Outperforms Gemini-3.0-pro and Claude-4.6-opus on kernel optimization
  • Validated on MetaX MACA backend; upstream contributions to SGLang and LMDeploy
kernel-optimization · evolutionary-algorithms · gpu · code-generation
0 upvotes
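
The evolutionary half of the recipe can be caricatured in a few lines. The sketch below evolves a single toy "tile size" against a synthetic latency curve; the real system compiles and benchmarks actual Triton kernels and adds RL post-training, none of which is modelled here.

```python
import random

def measure_latency(tile_size: int) -> float:
    """Toy stand-in for compiling and timing a generated kernel.
    The synthetic optimum is tile_size = 64."""
    return (tile_size - 64) ** 2 + 1.0

def evolve(pool, generations: int = 30) -> int:
    """Minimal evolutionary loop: rank candidates by measured latency,
    keep the fastest half, and mutate the survivors."""
    for _ in range(generations):
        pool.sort(key=measure_latency)
        survivors = pool[: len(pool) // 2]
        children = [max(1, s + random.choice([-16, -8, 8, 16])) for s in survivors]
        pool = survivors + children
    return min(pool, key=measure_latency)

random.seed(0)
best = evolve([4, 8, 16, 32, 128, 256, 512, 1024])
print(best, measure_latency(best))
```

Because the best candidate always survives selection, measured latency is monotonically non-increasing across generations, the property that makes this loop a safe outer wrapper around an expensive, noisy kernel benchmark.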

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu

Proposes a causal multi-shot architecture enabling interactive storytelling and efficient on-the-fly frame generation. Reformulates multi-shot video generation as next-shot generation conditioned on historical context.

Key Findings

  • Causal architecture enables streaming generation — no need to wait for full sequence
  • Interactive storytelling with on-the-fly frame generation
  • Overcomes latency and interactivity limitations of bidirectional architectures
video-generation · storytelling · streaming · causal-models
0 upvotes
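
The "next-shot generation conditioned on historical context" reformulation is essentially an autoregressive loop over shots. A hedged structural sketch, with a trivial stub standing in for the video model (the stub and all names are assumptions):

```python
def generate_story(prompts, generate_shot):
    """Causal multi-shot loop: each shot is generated from its prompt plus only
    the shots already produced, so frames can stream out as soon as each shot
    finishes instead of waiting for the full bidirectional sequence."""
    history = []
    for prompt in prompts:
        shot = generate_shot(prompt, history)  # causal: only past shots visible
        history.append(shot)
        yield shot


def generate_shot(prompt, history):
    # Stub standing in for the video model: a "shot" is just a labelled string.
    return f"{prompt}|ctx={len(history)}"

shots = list(generate_story(["intro", "chase", "finale"], generate_shot))
print(shots)
```

The causal structure is what buys interactivity: a user can inspect or redirect the story after any shot, which a bidirectional architecture that attends over the whole sequence cannot offer mid-generation.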

GenMask: Adapting DiT for Segmentation via Direct Mask

Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai

Trains DiT to generate both black-and-white segmentation masks and RGB images. Introduces timestep sampling strategy emphasizing extreme noise for segmentation and moderate noise for generation.

Key Findings

  • State-of-the-art on referring and reasoning segmentation benchmarks
  • Removes need for specialized feature extraction pipelines
  • Elegant adaptation of generative models for discriminative tasks
segmentation · diffusion-transformers · vision · generation-for-discrimination
0 upvotes
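
The task-dependent timestep strategy is easy to illustrate. The sampler below biases segmentation toward the extreme-noise end of the schedule and generation toward moderate noise; the specific distributions are assumptions for illustration, not the paper's actual schedule.

```python
import random

def sample_timestep(task: str, t_max: int = 1000) -> int:
    """Task-dependent timestep sampling in the spirit of the summary:
    emphasize extreme noise for segmentation, moderate noise for generation."""
    if task == "segmentation":
        # Power transform pushes mass toward the high-noise end of the schedule.
        return int(t_max * random.random() ** 0.25)
    elif task == "generation":
        # Triangular distribution centred mid-schedule.
        return int(t_max * random.triangular(0.0, 1.0, 0.5))
    raise ValueError(f"unknown task: {task}")


random.seed(42)
seg_mean = sum(sample_timestep("segmentation") for _ in range(2000)) / 2000
gen_mean = sum(sample_timestep("generation") for _ in range(2000)) / 2000
print(seg_mean > gen_mean)  # True: segmentation samples sit deeper in the noise
```

The intuition for the split: a binary mask's coarse shape is decided early (at heavy noise), while RGB synthesis still needs the mid-schedule steps where texture forms, so each task trains hardest where its signal lives.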

RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu

Introduces benchmark with 2,800+ instances grounded in authentic datasets for evaluating chart-to-code generation. First benchmark evaluating generation from large-scale raw data and iterative code refinement.

Key Findings

  • Significant performance degradation on complex multi-panel charts with real data
  • 14 leading VLMs evaluated — all struggle with authentic data complexity
  • First benchmark for iterative code refinement in chart reproduction
benchmarks · code-generation · visualization · vision-language
0 upvotes

PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu

Benchmark of 30 expert-curated physics research tasks requiring AI agents to comprehend paper methodology, implement algorithms, and produce quantitative results. Best performer (GPT-5.3-Codex) achieves 34% with zero end-to-end success.

Key Findings

  • Zero end-to-end success rate across all tested coding agents
  • Best performer (GPT-5.3-Codex) achieves only 34% overall score
  • Systematic failures in formula implementation and numerical simulation debugging
benchmarks · scientific-ai · code-generation · physics · reproduction
0 upvotes

Trending Models (10)

Reasoning-distilled version of Qwen3.5-27B using Claude 4.6 Opus traces. Continues to dominate HuggingFace trending with multiple GGUF variants.

reasoning · distillation · qwen3.5 · multimodal
309.4K downloads · 1.7K likes
Qwen3.5-9B

Qwen · image-text-to-text · 9B

Official Qwen 3.5 9B model. Most downloaded model on HuggingFace with 4.5M downloads, serving as the base for numerous community fine-tunes.

foundation · multimodal · qwen3.5
4.5M downloads · 1.1K likes
Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

HauhauCS · image-text-to-text · 35B-A3B (MoE)

Uncensored MoE variant of Qwen3.5 with 35B total / 3B active parameters. Popular GGUF release for local deployment.

moe · uncensored · gguf · qwen3.5
569.0K downloads · 1.1K likes
LTX-2.3

Lightricks · image-to-video · N/A

Video generation model supporting image-to-video, text-to-video, and video-to-video tasks. Second most downloaded trending model.

video-generation · diffusers · multimodal
1.4M downloads · 840 likes
Qianfan-OCR

Baidu · image-text-to-text · N/A

Vision-language model specialized for OCR tasks based on InternVL architecture. Strong engagement with 652 likes.

ocr · vision-language · document-understanding
16.3K downloads · 652 likes
cohere-transcribe-03-2026

Cohere Labs · automatic-speech-recognition · N/A

Cohere's first dedicated speech recognition model. New entry in the trending models, signaling Cohere's expansion beyond text.

asr · speech · audio
28.2K downloads · 570 likes
OmniCoder-9B

Tesslate · text-generation · 9B

Code-focused model built on Qwen3.5 architecture with image-text-to-text capabilities. Combines coding and multimodal understanding.

code · qwen3.5 · multimodal
28.2K downloads · 537 likes
Voxtral-4B-TTS-2603

Mistral AI · text-to-speech · 4B

Expressive multilingual text-to-speech model generating natural speech from 3 seconds of reference audio. Companion to the Voxtral TTS paper also trending today.

tts · speech · multilingual
2.9K downloads · 528 likes
Nemotron-Cascade-2-30B-A3B

NVIDIA · text-generation · 30B-A3B

Hybrid architecture model from NVIDIA with 30B total / 3B active parameters. Notable for cascade/MoE design targeting efficiency.

nvidia · hybrid · efficient
78.2K downloads · 421 likes
daVinci-MagiHuman

GAIR · image-to-video · N/A

Multimodal generation model supporting text-to-video, image-to-video, and text-to-audio. Unique in combining video and audio generation.

video-generation · audio · multimodal
540 downloads · 266 likes

Sources Checked