Agent safety and benchmark proliferation dominate the day; LLM reasoning robustness under context pressure emerges as a critical concern; distillation and efficient scaling techniques show surprising gains

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

High Relevance

Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin — MiroMind AI, National University of Singapore, Nanyang Technological University

MiroEval is a benchmark targeting both the research process and final output quality of deep research agents, addressing a critical gap where existing benchmarks only evaluate final reports. It provides multimodal coverage, real-world query complexity, and a refreshable knowledge design. This directly addresses the disconnect between benchmark performance and real user needs.

Key Findings

•
Existing deep research benchmarks fail to evaluate the underlying research process
•
Most benchmarks have limited multimodal coverage and rely on synthetic tasks
•
MiroEval supports knowledge refreshability to avoid benchmark contamination over time

benchmarkdeep-researchmultimodalagentsevaluation

52 upvotes

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

High Relevance

Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang — Tsinghua University, Meituan, University of Hong Kong, Chinese Academy of Sciences

ViGoR-Bench exposes a 'logical desert' within modern AIGC models: despite impressive visual fidelity, they fail at physical, causal, and complex spatial reasoning. The benchmark fills a gap left by superficial metrics and fragmented evaluations. The findings have direct implications for reliability of vision-language and generative image models in real applications.

Key Findings

•
Modern AIGC models exhibit a 'logical desert' — high visual fidelity masking broken causal and spatial reasoning
•
Current evaluations rely on superficial metrics that miss reasoning failures
•
ViGoR-Bench provides unified zero-shot visual reasoning evaluation across physical, causal, and spatial dimensions

benchmarkvisual-reasoningAIGCmultimodalspatial-reasoning

36 upvotes

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu — Zhipu AI, Tsinghua University

Vision2Web is a hierarchical benchmark spanning UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development with agent verification. It fills the gap in systematic evaluation of end-to-end website development by coding agents. The benchmark addresses real-world complexity that existing coding benchmarks miss.

Key Findings

•
Existing coding benchmarks do not capture complex end-to-end website development tasks
•
Vision2Web spans three levels: static UI-to-code, interactive multi-page frontend, and full-stack development
•
Agent verification is integrated to assess functional correctness beyond syntactic quality

benchmarkcode-generationweb-developmentagentsUI

33 upvotes

Reasoning Shift: How Context Silently Shortens LLM Reasoning

High Relevance

Gleb Rodionov — Yandex Research

This paper documents that as context grows, LLMs silently produce shorter reasoning chains — not from efficiency gains but as a behavioral shift. The finding poses a direct challenge to test-time scaling assumptions, where longer context was assumed to enable richer reasoning. Three systematic evaluation scenarios are analyzed.

Key Findings

•
Context silently shortens LLM reasoning chains in a systematic, scenario-dependent manner
•
The robustness of test-time scaling reasoning behaviors is underexplored
•
Reasoning shortening is a behavioral shift, not a sign of improved efficiency

LLM-reasoningtest-time-scalingcontext-lengthrobustnessevaluation

22 upvotes

QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue, Zhaoyang Zhu, Wei Zhang, Rongyao Cai, Rui Wang — Ant Group

QuitoBench introduces a regime-balanced benchmark for time series forecasting, covering eight trend-seasonality-forecastability regimes across finance, healthcare, and cloud computing domains. The benchmark addresses the lack of diversity in existing time series evaluations. Its regime-balanced design ensures models are tested across a full spectrum of real-world temporal patterns.

Key Findings

•
Existing time series benchmarks lack regime diversity across trend, seasonality, and forecastability combinations
•
QuitoBench covers eight distinct regimes relevant to finance, healthcare, and cloud computing
•
The benchmark is designed as an open, high-quality evaluation standard for the field

time-seriesforecastingbenchmarkfinancehealthcare

25 upvotes

Brevity Constraints Reverse Performance Hierarchies in Language Models

High Relevance

MD Azizul Hakim — Bangladesh Sweden Polytechnic Institute

Through evaluation of 31 models (0.5B–405B parameters) across 1,485 problems, this paper finds that larger LLMs underperform smaller ones by 28.4 percentage points on 7.7% of benchmark problems. The mechanism is spontaneous scale-dependent verbosity — larger models violate brevity constraints that smaller models respect. This challenges the universal scaling assumption.

Key Findings

•
Larger models underperform smaller models by 28.4 pp on 7.7% of benchmark problems
•
The mechanism is spontaneous scale-dependent verbosity that violates output constraints
•
Performance hierarchy reversals are reproducible across 31 models and 5 datasets

LLM-scalingverbosityevaluationperformance-degradationbenchmarks

16 upvotes

Embarrassingly Simple Self-Distillation Improves Code Generation

High Relevance

Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert — Apple

Self-Supervised Distillation (SSD) improves LLM code generation using only the model's own raw outputs, with no verifier, teacher model, or reinforcement learning. Applied to Qwen3-30B-Instruct, it achieves a +12.9 pp improvement on LiveCodeBench v6 (42.4% to 55.3% pass@1). The simplicity and effectiveness of the approach challenges the necessity of complex feedback pipelines.

Key Findings

•
SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6
•
No verifier, teacher model, or RL signal is required — only raw model outputs
•
The approach is described as 'embarrassingly simple' yet outperforms more complex methods

code-generationself-distillationLLMself-improvementtraining

10 upvotes

Universal YOCO for Efficient Depth Scaling

High Relevance

Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang — Microsoft

YOCO-U combines the YOCO decoder-decoder architecture with recursive computation to achieve efficient depth scaling without KV cache inflation. This directly addresses the memory and compute cost bottleneck of test-time scaling in LLMs. The work is highly relevant as the field pushes toward deeper reasoning chains at inference time.

Key Findings

•
YOCO-U achieves depth scaling without KV cache inflation via recursive computation
•
The approach enables test-time compute scaling with controlled memory costs
•
Builds on the YOCO decoder-decoder architecture with universal applicability

efficient-inferencedepth-scalingKV-cachearchitecturetest-time-compute

11 upvotes

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

High Relevance

Jack Young — Independent Researcher

S0 Tuning adapts hybrid recurrent-attention models by tuning a single initial state matrix per recurrent layer, with zero inference overhead. It outperforms LoRA by +10.8 pp on HumanEval and improves Qwen3.5-4B (GatedDeltaNet hybrid) by +23.6 pp on greedy pass@1. The approach is highly practical for efficient fine-tuning of the emerging class of recurrent-attention hybrid architectures.

Key Findings

•
S0 tuning outperforms LoRA by +10.8 pp on HumanEval with zero inference overhead
•
Qwen3.5-4B (GatedDeltaNet hybrid) improves +23.6 pp on greedy pass@1
•
Only one initial state matrix per recurrent layer needs tuning

PEFTrecurrent-modelshybrid-architecturefine-tuningefficiency

1 upvotes

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Shaoxuan Li, Zhixuan Zhao, Hanze Deng, Zirun Ma, Shulin Tian — Tsinghua University, University of Washington, Nanyang Technological University

PerceptionComp is a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning requiring multiple temporally separated visual evidence and compositional constraints. It targets a gap where current video understanding models fail at multi-step temporal reasoning. The manual annotation and compositional design make it more robust to shortcut learning than automated benchmarks.

Key Findings

•
Long-horizon video reasoning requiring temporally separated evidence is poorly evaluated by existing benchmarks
•
PerceptionComp uses manual annotation to ensure compositional reasoning requirements
•
The benchmark reveals limitations of current video models in perception-centric reasoning

video-understandingbenchmarktemporal-reasoningperceptionmultimodal

14 upvotes

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu — Zhejiang University, Meituan, Tsinghua University

SKILL0 demonstrates that skills can be internalized into model parameters via agentic RL, eliminating the retrieval noise and token overhead of inference-time skill augmentation. This represents a shift from skill-as-context to skill-as-weights, improving both efficiency and reliability. The work connects agentic RL, knowledge distillation, and tool-use optimization.

Key Findings

•
Skills can be internalized into model weights via agentic RL, not just retrieved at inference time
•
Internalization eliminates retrieval noise and token overhead from skill augmentation
•
Agentic RL provides an effective mechanism for learning reusable skills from interaction

agentic-RLskill-learningknowledge-distillationagentsreinforcement-learning

2 upvotes

Proactive Agent Research Environment (PARE)

Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan, Yinfei Yang — Apple

PARE is a framework for building and evaluating proactive agents that anticipate user needs and autonomously execute tasks as digital assistants. It addresses the gap between reactive agents (that respond to explicit instructions) and truly proactive digital assistants. The Apple affiliation signals major industry investment in this direction.

Key Findings

•
Proactive agents that anticipate user needs require dedicated evaluation frameworks beyond reactive agent benchmarks
•
PARE provides both a building framework and evaluation suite for proactive digital assistants
•
Anticipatory task execution is identified as a core gap in current agent capabilities

proactive-agentsdigital-assistantsevaluationtask-automationagents

6 upvotes

AgentWatcher: A Rule-based Prompt Injection Monitor

Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia — Pennsylvania State University

AgentWatcher is a rule-based prompt injection detection system that maintains effectiveness as context length increases, using explicit rules to define injection patterns. Unlike learned classifiers, the rule-based approach provides interpretability and consistent performance across context lengths. This complements the ClawKeeper and related agent security work appearing on the same day.

Key Findings

•
Rule-based prompt injection detection maintains effectiveness as context length scales
•
Explicit rules provide interpretability that learned classifiers lack
•
The approach addresses a critical vulnerability in agentic systems

prompt-injectionagent-securityrule-basedsafetyLLM-security

0 upvotes

When Users Change Their Mind: Evaluating Interruptible Agents

Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou — University of Illinois Chicago, McGill University, MBZUAI, UC Santa Barbara, University of Southern California

The first systematic study of interruptible agents in long-horizon web navigation, where users add requirements or revise goals mid-task. Most current agents are designed for static goal completion and fail to gracefully handle dynamic user intent. This work opens a new evaluation dimension for practical agentic deployment.

Key Findings

•
Interruptible long-horizon web navigation is the first systematic study of mid-task goal revision
•
Current agents are not designed to handle dynamic user intent during task execution
•
New evaluation methodology is proposed for assessing graceful interruption handling

agentsweb-navigationuser-intentevaluationinterruptibility

1 upvotes

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Trending Models (12)

Jackrong (Community) · text-generation · 27B

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

A Qwen3.5-27B model distilled from Claude 4.6 Opus reasoning traces, representing the growing community practice of distilling frontier model reasoning into open weights. Available in both safetensors and GGUF variants with very high download counts.

reasoningdistillationqwen3.5unslothopen-weights

428.8K downloads2.1K likes

HauhauCS (Community) · text-generation · 35B (MoE, A3B active)

cohere-transcribe-03-2026

An uncensored GGUF variant of the Qwen3.5-35B MoE model, targeting users seeking unfiltered outputs. High download count (621k+) and 1151 likes reflect significant community demand for uncensored open-weight models with vision capabilities.

ggufuncensoredmoevisionqwen3.5

622.0K downloads1.2K likes

CohereLabs · automatic-speech-recognition · Unknown

Cohere's latest automatic speech recognition model released March 2026, with 71k downloads and 733 likes indicating strong industry adoption. Represents Cohere's expansion into audio modalities.

ASRaudiospeech-recognitiontransformers

71.0K downloads733 likes

Voxtral-4B-TTS-2603

Mistral AI · text-to-speech · 4B

Mistral's 4B text-to-speech model supporting English and French, representing a significant expansion of Mistral's capabilities into audio synthesis. The vllm and mistral-common tags suggest optimized inference support.

TTSaudiomistralmultilingualvllm

4.3K downloads635 likes

Qianfan-OCR

Baidu · feature-extraction · Unknown

Baidu's vision-language OCR model built on the InternVL Chat architecture, with 19k downloads and 811 likes. Positioned as a feature-extraction model for document understanding, reflecting growing enterprise demand for accurate OCR in Chinese and multilingual settings.

OCRvision-languageInternVLdocument-understandingChinese

19.1K downloads811 likes

gemma-4-31B-it

Google · image-text-to-text · 31B

Nemotron-Cascade-2-30B-A3B

Google's instruction-tuned Gemma 4 31B model, part of the new Gemma 4 family. With 29k downloads and 370 likes, it is the most downloaded of the three Gemma 4 variants trending today, supporting image-text-to-text tasks.

gemma4multimodalinstruction-tunedconversational

29.0K downloads370 likes

NVIDIA · text-generation · 30B (MoE, A3B active)

NVIDIA's 30B MoE text generation model from the Nemotron Cascade 2 series, with 3B active parameters. 114k downloads and 454 likes signal strong adoption for enterprise inference workloads, leveraging the NemotronH architecture.

MoEnvidianemotronefficient-inferenceenterprise

114.5K downloads454 likes

Bonsai-8B-gguf

Prism ML · text-generation · 8B (1-bit quantized)

A 1-bit quantized 8B model in GGUF format from Prism ML, enabling extreme compression for edge deployment. Appearing in both GGUF and MLX variants in trending signals growing interest in 1-bit inference across hardware platforms.

1-bitquantizationedge-inferenceggufefficiency

13.8K downloads319 likes

context-1

ChromaDB · text-generation · Unknown

ChromaDB's text generation model built on the GPT OSS architecture, representing the vector database company's entry into model development. With 2820 downloads and 357 likes, it suggests ChromaDB is building tightly integrated retrieval-generation capabilities.

text-generationRAGconversationalretrieval

2.8K downloads357 likes

harrier-oss-v1-0.6b

Microsoft · feature-extraction · 0.6B

Microsoft's compact 0.6B embedding model built on Qwen3 for feature extraction and MTEB evaluation, achieving strong sentence embedding performance at minimal size. Signals continued investment in efficient embedding models for RAG and retrieval pipelines.

embeddingsMTEBsentence-transformersqwen3efficient

2.8K downloads140 likes

Holo3-35B-A3B

Hcompany · image-text-to-text · 35B (MoE, A3B active)

A 35B multimodal MoE model from Hcompany built on the Qwen3.5 MoE architecture with 3B active parameters, targeting image-text-to-text tasks. With 603 downloads and 184 likes, it represents a new entrant in the multimodal MoE space.

multimodalMoEqwen3.5image-textvision-language

603 downloads184 likes

LFM2.5-350M

Liquid AI · text-generation · 350M

siddharthvaddem/openscreen

Liquid AI's compact 350M parameter text generation model from the LFM2.5 series, representing the liquid neural network architecture at small scale. 7703 downloads and 201 likes signal growing interest in alternative architectures beyond transformers.

liquidalternative-architecturesmall-modelefficient

7.7K downloads201 likes

Trending GitHub Repos (10)

Yeachan-Heo/oh-my-codex

High RelevanceGitHub

OmX extends OpenAI Codex with hooks, agent teams, HUDs, and modular enhancements. The top trending repo today by stars earned (2867) reflects the explosion of developer tooling built on top of commercial coding agents.

coding-agentscodexdeveloper-toolsagent-tooling

TypeScript12.1K+2.9K today1.1K

sherlock-project/sherlock

An open-source alternative to Screen Studio for creating stunning product demos and screencasts. Trending with 2573 stars today, likely driven by developer communities seeking free alternatives to commercial screen recording tools.

developer-toolsopen-sourcescreen-recordingdemos

TypeScript16.3K+2.6K today1.1K

google-research/timesfm

High RelevanceGitHub

Google Research's Time Series Foundation Model, surging to 1176 stars today — likely correlated with the QuitoBench paper also trending. TimesFM represents Google's foundational approach to time series forecasting across diverse domains.

time-seriesforecastingfoundation-modelgoogle-research

Python13.5K+1.2K today1.1K

asgeirtj/system_prompts_leaks

A Python tool for hunting down social media accounts by username across social networks. Trending at 827 stars today, this OSINT tool has sustained popularity and relevance for security research and social graph analysis.

OSINTsecuritysocial-mediaidentity

Python77.4K+827 today9.1K

roboflow/supervision

High RelevanceGitHub

Roboflow's reusable computer vision toolkit providing annotation utilities, visualization tools, and model-agnostic interfaces. At 535 stars today, it remains a consistently trending CV utility library widely used in production pipelines.

computer-visionobject-detectionannotationutilities

Python37.5K+535 today3.3K

yusufkaraaslan/Skill_Seekers

A collection of extracted system prompts from ChatGPT, Claude, Gemini, Grok, Perplexity, and other AI systems. Trending at 306 stars today, reflecting ongoing public interest in AI system transparency and the security implications of prompt extraction.

system-promptsAI-transparencysecurityprompt-extraction

36.6K+306 today6.0K

High RelevanceGitHub

Converts documentation websites, GitHub repos, and PDFs into Claude AI skills for direct integration. Trending at 264 stars today, it exemplifies the growing ecosystem of tools that wrap commercial AI APIs to create domain-specific agent capabilities.

claudeagent-skillsdocumentationRAGAI-tools

Python12.2K+264 today1.2K

zai-org/GLM-OCR

High RelevanceGitHub

GLM-OCR delivers accurate, fast, and comprehensive optical character recognition powered by the GLM architecture. Trending at 237 stars today, it joins Baidu's Qianfan-OCR in signaling strong current momentum around AI-powered document understanding.

OCRdocument-understandingGLMvision-language

Python5.3K+237 today460

MervinPraison/PraisonAI

High RelevanceGitHub

PraisonAI is a low-code multi-agent AI framework that automates complex workflows with a 24/7 AI employee team concept. Trending at 107 stars today, it represents the broader democratization of multi-agent orchestration for non-expert users.

multi-agentautomationlow-codeworkflowagents

Python6.3K+107 today923

Alishahryar1/free-claude-code