Tuesday, March 31, 2026
Trillion-parameter scientific foundation model arrives; Agent skill distillation from trajectories gains traction; Coding agents get specialized models and organicity benchmarks
Executive Summary
Today's AI/ML landscape is shaped by three converging trends. First, scale meets science: Intern-S1-Pro debuts as the first trillion-parameter scientific multimodal foundation model, promising comprehensive enhancement across general and scientific domains with advanced agent capabilities. Second, agent engineering matures rapidly: Trace2Skill introduces a principled framework for distilling reusable skills from agent trajectories, Natural-Language Agent Harnesses proposes externalizing agent control logic as portable natural-language artifacts, and Learning to Commit tackles the overlooked 'organicity' problem where LLM-generated PRs get rejected despite being functionally correct. Third, coding agents and benchmarks evolve in tandem: Cursor's Composer 2 achieves frontier-level agentic software engineering through RL at scale, while SlopCodeBench reveals how coding agents degrade over iterative tasks — a critical blind spot in current evaluation.
On the medical AI front, MedOpenClaw exposes a striking paradox where VLMs actually perform worse when given professional tools, and Medical AI Scientist demonstrates the first autonomous clinical research framework. In computer vision, Calibri shows that a single learned scaling parameter can significantly enhance Diffusion Transformers, and GenMask elegantly adapts DiT for segmentation by generating masks directly.
Researcher Notes
Intern-S1-Pro at 1T parameters is a landmark release. While we've seen trillion-scale language models before, this is the first claiming multimodal scientific capabilities at this scale. The 9-comment engagement on HuggingFace (highest today alongside Trace2Skill) signals genuine community interest. The key question is whether the scientific specialization justifies the scale — or whether smaller, domain-adapted models remain more practical. Watch for benchmark comparisons against specialized scientific models.
Trace2Skill's 13 comments make it today's most-discussed paper, and for good reason. The problem it addresses — how to automatically extract reusable, transferable skills from agent trajectories without overfitting to trajectory-local lessons — is one of the core bottlenecks in building practical LLM agents. Combined with yesterday's MetaClaw (meta-learning agents) and AVO (evolutionary search agents), we're seeing a clear research arc toward agents that improve themselves systematically rather than through brute-force prompting.
The 'coding agent' cluster tells a nuanced story. Composer 2 (Cursor) demonstrates that RL-trained specialized models can match frontier general-purpose models on real software engineering. But SlopCodeBench provides the necessary counterpoint: current coding agents degrade progressively over iterative tasks, a failure mode invisible to single-shot benchmarks like SWE-bench. Meanwhile, Learning to Commit identifies 'organicity' — adherence to project conventions, API reuse, and architectural consistency — as the real barrier to PR acceptance. These three papers together paint a picture of a field that's getting good at isolated tasks but struggling with the sustained, convention-aware work that real software engineering demands.
MedOpenClaw's performance paradox is genuinely surprising. State-of-the-art VLMs (Gemini 3.1 Pro, GPT-5.4) actually perform worse when given access to professional medical imaging tools (3D Slicer). The authors attribute this to lack of precise spatial grounding — models can reason about pre-selected 2D slices but fail when they need to navigate full 3D volumes. This challenges the assumption that 'more tools = better agents' and suggests tool-use training needs fundamental rethinking for spatial domains.
The Qwen 3.5 ecosystem continues to dominate HuggingFace trending models. Jackrong's Claude 4.6 Opus reasoning distillations maintain their grip with multiple GGUF variants in the top 20. New entries include Nvidia's Nemotron-Cascade-2-30B-A3B (hybrid architecture), Cohere's first speech recognition model, and Mistral's Voxtral TTS — the model landscape is diversifying beyond text generation into speech and multimodal pipelines.
Themes & Trends
Agent Skill Learning & Engineering
risingMultiple papers converge on how to make LLM agents learn, retain, and transfer skills — from trajectory distillation (Trace2Skill) to portable harness design (NLAHs) and code convention learning (Learning to Commit).
Coding Agent Maturation
risingSpecialized coding models (Composer 2, Kernel-Smith), organicity-focused evaluation, and long-horizon degradation benchmarks (SlopCodeBench) signal the field moving beyond simple code generation to sustained software engineering.
Medical AI Agents
risingMedOpenClaw reveals tool-use paradoxes in medical VLMs while Medical AI Scientist demonstrates autonomous clinical research. Both highlight the gap between general AI capabilities and domain-specific requirements.
Diffusion Transformer Optimization
stableCalibri demonstrates parameter-efficient DiT enhancement via learned scaling, GenMask repurposes DiT for segmentation, and multiple papers push the boundaries of diffusion-based generation.
Reasoning Transparency & Safety
stableLie to Me exposes a stark gap between models' internal reasoning acknowledgment and their final outputs, raising concerns about CoT faithfulness as a safety mechanism.
Trillion-Scale Scientific Models
risingIntern-S1-Pro marks the arrival of trillion-parameter models specifically targeting scientific understanding, while PRBench shows current agents still fail at end-to-end physics paper reproduction.
Trending Papers (14)
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
High RelevanceJingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou
Introduces a framework for distilling reusable, transferable skills from LLM agent trajectories. Addresses the scalability bottleneck of manual skill authoring and the fragility of automated skill generation that overfits to trajectory-local lessons.
Key Findings
- •Overcomes shallow parametric knowledge limitations in automated skill generation
- •Produces transferable skills that generalize beyond specific trajectory contexts
- •Most-discussed paper of the day with 13 HuggingFace comments
Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
High RelevanceYicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou
First one-trillion-parameter scientific multimodal foundation model. Delivers comprehensive enhancement across general and scientific domains with advanced agent capabilities and scientific expertise at unprecedented scale.
Key Findings
- •First trillion-parameter model specifically targeting scientific multimodal understanding
- •Combines stronger reasoning, image-text understanding, and agent capabilities
- •Scientific expertise augmented beyond general-purpose improvements
Composer 2 Technical Report
High RelevanceCursor Research, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger — Cursor
Specialized coding model for agentic software engineering trained via continued pretraining and large-scale RL. Achieves frontier-level performance on real software engineering problems in large codebases.
Key Findings
- •61.3 on CursorBench, 61.7 on Terminal-Bench, 73.7 on SWE-bench Multilingual
- •Two-phase training: continued pretraining + large-scale reinforcement learning
- •Strong long-term planning and coding intelligence for interactive and agentic use
MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies
High RelevanceWeixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu
Proposes MEDOPENCLAW runtime enabling VLMs to operate within standard medical tools (3D Slicer) and MEDFLOWBENCH benchmark for full-study medical imaging evaluation. Reveals that models degrade when given professional tool access due to lack of precise spatial grounding.
Key Findings
- •State-of-the-art VLMs (Gemini 3.1 Pro, GPT-5.4) perform worse with professional tool access
- •Performance paradox attributed to lack of precise spatial grounding in 3D volumes
- •First benchmark evaluating VLMs on uncurated full medical imaging studies
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration
High RelevanceDanil Tokhchukov, Aysel Mirzoeva, Andrey Kuznetsov, Konstantin Sobolev
Uncovers hidden potential of Diffusion Transformers by demonstrating that a single learned scaling parameter can significantly improve DiT block performance. Proposes a parameter-efficient calibration approach for enhancing generative tasks.
Key Findings
- •Single learned scaling parameter yields significant performance gains in DiT blocks
- •Parameter-efficient approach — minimal additional parameters needed
- •Applicable across generative tasks without architectural redesign
Natural-Language Agent Harnesses
High RelevanceLinyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng
Introduces Natural-Language Agent Harnesses (NLAHs) that externalize agent control logic as portable natural-language artifacts with explicit contracts and lightweight adapters. Proposes Intelligent Harness Runtime (IHR) for shared execution.
Key Findings
- •Harness behavior expressed in editable natural language rather than buried in controller code
- •Portable across runtimes with explicit contracts and lightweight adapters
- •Evaluated on coding and computer-use benchmarks
Learning to Commit: Generating Organic Pull Requests via Online Repository Memory
High RelevanceMo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu
Proposes Learning to Commit framework using Online Repository Memory to improve LLM coding agents' code organicity — adherence to project conventions, API reuse, and architectural consistency. Agents perform supervised contrastive reflection on historical commits.
Key Findings
- •Identifies 'organicity' as the root cause of PR rejection, not functional correctness
- •Supervised contrastive reflection on historical commits distills project-specific patterns
- •Improved organicity on genuinely future merged PRs
Towards a Medical AI Scientist
High RelevanceHongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao
Introduces Medical AI Scientist, the first autonomous research framework for clinical research with clinician-engineer co-reasoning mechanism. Operates in three modes: paper reproduction, literature-inspired innovation, and task-driven exploration.
Key Findings
- •First autonomous research framework specifically for clinical research
- •Generated research ideas of substantially higher quality than commercial LLMs across 171 cases
- •Generated manuscripts approach MICCAI-level quality
Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
High RelevanceRichard J. Young
Tests 12 open-weight reasoning models (7B-685B) on 498 questions with six hint categories. Reveals stark gap between thinking-token acknowledgment (87.5%) and answer-text acknowledgment (28.6%), suggesting models suppress acknowledgment in final outputs.
Key Findings
- •Faithfulness ranges from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale)
- •87.5% thinking-token vs 28.6% answer-text acknowledgment reveals systematic suppression
- •First large-scale faithfulness study on open-weight reasoning models
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
High RelevanceHe Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai
Framework combining evolutionary algorithms with post-training RL for GPU kernel generation. Achieves state-of-the-art on KernelBench and outperforms Gemini-3.0-pro and Claude-4.6-opus. Produces upstream contributions to SGLang and LMDeploy.
Key Findings
- •State-of-the-art on KernelBench (Triton backend)
- •Outperforms Gemini-3.0-pro and Claude-4.6-opus on kernel optimization
- •Validated on MetaX MACA backend; upstream contributions to SGLang and LMDeploy
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu
Proposes a causal multi-shot architecture enabling interactive storytelling and efficient on-the-fly frame generation. Reformulates multi-shot video generation as next-shot generation conditioned on historical context.
Key Findings
- •Causal architecture enables streaming generation — no need to wait for full sequence
- •Interactive storytelling with on-the-fly frame generation
- •Overcomes latency and interactivity limitations of bidirectional architectures
GenMask: Adapting DiT for Segmentation via Direct Mask
Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai
Trains DiT to generate both black-and-white segmentation masks and RGB images. Introduces timestep sampling strategy emphasizing extreme noise for segmentation and moderate noise for generation.
Key Findings
- •State-of-the-art on referring and reasoning segmentation benchmarks
- •Removes need for specialized feature extraction pipelines
- •Elegant adaptation of generative models for discriminative tasks
RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu
Introduces benchmark with 2,800+ instances grounded in authentic datasets for evaluating chart-to-code generation. First benchmark evaluating generation from large-scale raw data and iterative code refinement.
Key Findings
- •Significant performance degradation on complex multi-panel charts with real data
- •14 leading VLMs evaluated — all struggle with authentic data complexity
- •First benchmark for iterative code refinement in chart reproduction
PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu
Benchmark of 30 expert-curated physics research tasks requiring AI agents to comprehend paper methodology, implement algorithms, and produce quantitative results. Best performer (GPT-5.3-Codex) achieves 34% with zero end-to-end success.
Key Findings
- •Zero end-to-end success rate across all tested coding agents
- •Best performer (GPT-5.3-Codex) achieves only 34% overall score
- •Systematic failures in formula implementation and numerical simulation debugging
Trending Models (10)
Jackrong · image-text-to-text · 27B
Reasoning-distilled version of Qwen3.5-27B using Claude 4.6 Opus traces. Continues to dominate HuggingFace trending with multiple GGUF variants.
Qwen · image-text-to-text · 9B
Official Qwen 3.5 9B model. Most downloaded model on HuggingFace with 4.5M downloads, serving as the base for numerous community fine-tunes.
HauhauCS · image-text-to-text · 35B-A3B (MoE)
Uncensored MoE variant of Qwen3.5 with 35B total / 3B active parameters. Popular GGUF release for local deployment.
Lightricks · image-to-video · N/A
Video generation model supporting image-to-video, text-to-video, and video-to-video tasks. Second most downloaded trending model.
Baidu · image-text-to-text · N/A
Vision-language model specialized for OCR tasks based on InternVL architecture. Strong engagement with 652 likes.
Cohere Labs · automatic-speech-recognition · N/A
Cohere's first dedicated speech recognition model. New entry in the trending models, signaling Cohere's expansion beyond text.
Tesslate · text-generation · 9B
Code-focused model built on Qwen3.5 architecture with image-text-to-text capabilities. Combines coding and multimodal understanding.
Mistral AI · text-to-speech · 4B
Expressive multilingual text-to-speech model generating natural speech from 3 seconds of reference audio. Companion to the Voxtral TTS paper also trending today.
NVIDIA · text-generation · 30B-A3B
Hybrid architecture model from NVIDIA with 30B total / 3B active parameters. Notable for cascade/MoE design targeting efficiency.
GAIR · image-to-video · N/A
Multimodal generation model supporting text-to-video, image-to-video, and text-to-audio. Unique in combining video and audio generation.