Medical AI gets its ImageNet moment with 1000+ dataset survey; Terminal-only agents challenge complex enterprise frameworks; Pretraining science matures with daVinci-LLM scaling laws

daVinci-LLM: Towards the Science of Pretraining

High Relevance

Yiwei Qin, Yixiu Liu, Tiantian Mi et al. — GAIR

A systematic study of pretraining decisions — data mixing, learning rates, architecture choices — that moves LLM pretraining from alchemy toward engineering. Provides principled recipes and scaling insights for practitioners.
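The summary does not give daVinci-LLM's fitted laws. The sketch below only illustrates the general idea behind a compute scaling law: fit loss ≈ a·C^(−b) by least squares in log-log space, then extrapolate to a larger budget. The function names `fit_power_law` and `predict`, the constants and the synthetic points are all hypothetical. Published fits usually also include an irreducible-loss term, which is omitted here for simplicity.

```python
import math
from typing import List, Tuple


def fit_power_law(compute: List[float], loss: List[float]) -> Tuple[float, float]:
    """Fit loss ≈ a * compute**(-b) by ordinary least squares on (log C, log L)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(y) for y in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx            # slope of the log-log line equals -b
    intercept = my - slope * mx  # intercept equals log(a)
    return math.exp(intercept), -slope


def predict(a: float, b: float, compute: float) -> float:
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)


if __name__ == "__main__":
    # Synthetic, noise-free points generated from a = 20, b = 0.05 (illustrative only)
    budgets = [1e18, 1e19, 1e20, 1e21]  # training FLOPs
    losses = [predict(20.0, 0.05, c) for c in budgets]
    a, b = fit_power_law(budgets, losses)
    print(f"a = {a:.3f}, b = {b:.4f}")
    print(f"extrapolated loss at 1e22 FLOPs: {predict(a, b, 1e22):.4f}")
```

Working in log space turns the power law into a straight line, so plain linear regression is enough. Adding an irreducible term, L = E + a·C^(−b), would require a nonlinear fit instead.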

Key Findings

• Systematic analysis of pretraining hyperparameter interactions
• Identifies critical decision points that determine model capability ceilings
• Provides reproducible pretraining recipes for various compute budgets

pretraining · scaling-laws · llm-training · optimization

24 upvotes

MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

High Relevance

Han Wang, Yifan Sun, Brian Ko et al. — University of Illinois Urbana-Champaign, University of Washington, UC San Diego

Introduces MonitorBench, the first comprehensive benchmark for evaluating whether LLM chains of thought faithfully reflect their actual reasoning processes. Tests whether CoT is causally responsible for model outputs or merely confabulated.
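The summary does not spell out MonitorBench's metrics. Earlier CoT-faithfulness work offers one common way to probe whether a chain of thought is causally used: truncate the trace and check whether the final answer changes. The sketch below implements that truncation test against a hypothetical `answer_fn(question, cot)` interface, with two toy models standing in for a real LLM. It illustrates the idea only and is not MonitorBench's protocol.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: returns the model's final answer given a question and
# a (possibly edited) chain of thought.
AnswerFn = Callable[[str, str], str]


def truncation_sensitivity(answer_fn: AnswerFn,
                           examples: List[Tuple[str, str]],
                           keep_fraction: float = 0.0) -> float:
    """Fraction of examples whose final answer changes when the CoT is cut
    down to its first `keep_fraction` of characters.

    Values near 1 suggest the answer depends on the trace; values near 0
    suggest the trace may be a post-hoc rationalization."""
    changed = 0
    for question, cot in examples:
        full_answer = answer_fn(question, cot)
        truncated = cot[: int(len(cot) * keep_fraction)]
        if answer_fn(question, truncated) != full_answer:
            changed += 1
    return changed / len(examples)


def faithful_toy(question: str, cot: str) -> str:
    # Toy "model" whose answer is read off the last word of its reasoning
    words = cot.split()
    return words[-1] if words else "unknown"


def post_hoc_toy(question: str, cot: str) -> str:
    # Toy "model" that gives the same answer whatever the reasoning says
    return "42"


if __name__ == "__main__":
    examples = [("What is 2 + 2?", "2 plus 2 is 4"),
                ("What is 3 * 3?", "3 times 3 is 9")]
    print(truncation_sensitivity(faithful_toy, examples))  # 1.0
    print(truncation_sensitivity(post_hoc_toy, examples))  # 0.0
```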

Key Findings

• First comprehensive benchmark specifically for CoT monitorability
• Reveals significant gaps between displayed reasoning and actual model computation
• Provides metrics for evaluating faithfulness of reasoning traces

ai-safety · chain-of-thought · interpretability · benchmark

17 upvotes

Terminal Agents Suffice for Enterprise Automation

High Relevance

Patrice Bechard, Orlando Marquez Ayala, Emily Chen et al. — ServiceNow, Mila - Quebec AI Institute, Université de Montréal

Argues that coding agents equipped only with a terminal can match or exceed complex MCP-based and web-agent systems for enterprise automation tasks, at significantly lower cost and operational overhead.
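To make the terminal-only setup concrete, here is a minimal sketch of such an agent loop. The only tool is a function that runs a shell command and returns its output, and a policy picks the next command from the transcript. The `policy` argument is a hypothetical stand-in for an LLM call, and `run_terminal` and `terminal_agent` are illustrative names, not the paper's harness. Model-chosen commands run with `shell=True` should only ever execute inside a sandbox.

```python
import subprocess
from typing import Callable, List, Optional


def run_terminal(command: str, timeout: float = 30.0) -> str:
    """Run one shell command; return stdout, stderr and the exit code as text."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"
    return f"{proc.stdout}{proc.stderr}[exit {proc.returncode}]"


# Hypothetical policy: reads the transcript so far and returns the next shell
# command, or None when it considers the task finished. In practice an LLM call.
Policy = Callable[[List[str]], Optional[str]]


def terminal_agent(policy: Policy, max_steps: int = 10) -> List[str]:
    """Alternate policy decisions and command executions until done or out of steps."""
    transcript: List[str] = []
    for _ in range(max_steps):
        command = policy(transcript)
        if command is None:
            break
        transcript.append(f"$ {command}")
        transcript.append(run_terminal(command))
    return transcript


if __name__ == "__main__":
    # Scripted stand-in for a model: run one command, then stop.
    script = iter(["echo hello", None])
    for line in terminal_agent(lambda transcript: next(script)):
        print(line)
```

The single text-in, text-out tool is the point of the design. Every capability (files, APIs, CLIs) is reached through commands the model already knows, rather than through a per-system adapter.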

Key Findings

• Terminal-only agents achieve comparable or better enterprise task completion
• Complex agentic frameworks add cost without proportional capability gains
• Simplicity of terminal interface reduces failure modes and debugging complexity

agentic-ai · enterprise-automation · terminal-agents · simplicity

14 upvotes

Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

High Relevance

Jaeik Kim, Woojin Kim, Jihwan Hong et al. — Seoul National University

The first masked-diffusion-based omnimodal foundation model unifying text, image, and speech understanding and generation, plus video understanding, within a single architecture. Demonstrates competitive performance across all modalities.
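Masked-diffusion language models do not generate left to right. They start from an all-mask sequence and fill positions in over several denoising steps. The sketch below shows one common decoding schedule, in the spirit of MaskGIT-style unmasking: at each step, commit the most confident predictions. `toy_denoiser` is a hypothetical stand-in for the real network, and nothing here reflects Dynin-Omni's actual sampler or token vocabulary.

```python
import math
from typing import Callable, List, Optional, Sequence

MASK = None  # marker for a still-masked position

# Hypothetical denoiser: maps a partially masked sequence to one probability
# distribution over the vocabulary per position.
Denoiser = Callable[[Sequence[Optional[int]]], List[List[float]]]


def masked_diffusion_decode(denoiser: Denoiser, length: int, steps: int) -> List[int]:
    """Start fully masked and, at every step, commit the most confident predictions."""
    seq: List[Optional[int]] = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        if not masked:
            break
        probs = denoiser(seq)
        # (confidence, position, token) for the argmax token at each masked position
        candidates = []
        for i in masked:
            row = probs[i]
            tok = max(range(len(row)), key=row.__getitem__)
            candidates.append((row[tok], i, tok))
        # Spread the remaining positions evenly over the remaining steps,
        # so the final step always unmasks everything that is left.
        k = math.ceil(len(masked) / (steps - step))
        for _, i, tok in sorted(candidates, reverse=True)[:k]:
            seq[i] = tok
    assert all(tok is not MASK for tok in seq), "steps must be >= 1"
    return [int(tok) for tok in seq]


TARGET = [3, 1, 4, 1, 5]


def toy_denoiser(seq: Sequence[Optional[int]], vocab: int = 6) -> List[List[float]]:
    """Toy model that ignores its input and always predicts TARGET;
    earlier positions get higher confidence."""
    out = []
    for i in range(len(seq)):
        conf = 0.9 - 0.1 * i
        row = [(1.0 - conf) / (vocab - 1)] * vocab
        row[TARGET[i]] = conf
        out.append(row)
    return out


if __name__ == "__main__":
    print(masked_diffusion_decode(toy_denoiser, length=5, steps=3))
```

Because positions are committed in parallel, the number of network calls is set by `steps` rather than by the sequence length. That trade-off is one reason diffusion decoding is attractive for a model that must emit long image or speech token sequences.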

Key Findings

• First masked-diffusion omnimodal model covering text, image, speech, and video
• Unified architecture eliminates need for separate modality-specific models
• Competitive performance across understanding and generation tasks

multimodal · diffusion · omnimodal · foundation-model

11 upvotes

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Yan Li, Zezi Zeng, Ziwei Zhou et al. — Microsoft, Shanghai Jiao Tong University, Xi'an Jiaotong University, Fudan University

Introduces a systematic benchmark for evaluating image generation models on practical commercial visual content creation tasks, addressing the gap between aesthetic benchmarks and real-world business applications.

Key Findings

• First systematic benchmark for commercial visual content generation
• Existing models show significant gaps on business-specific visual requirements
• The framework bridges the gap between aesthetic evaluation and practical utility

image-generation · benchmark · commercial · visual-content

10 upvotes

Falcon Perception

Aviraj Bevli, Sofian Chaybouti, Yasser Dahou et al. — TII

Presents a unified perception system that replaces modular encoder-decoder vision pipelines with a single foundation backbone for multiple perception tasks including detection, segmentation, and depth estimation.

Key Findings

• Unified backbone matches or exceeds modular pipelines across perception tasks
• Single model handles detection, segmentation, and depth estimation
• Reduces system complexity while maintaining competitive accuracy

computer-vision · perception · unified-architecture · foundation-model

7 upvotes

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Yubo Li, Lu Zhang, Tianchong Jiang et al. — Carnegie Mellon University

Demonstrates that LLMs systematically fail when salient surface cues conflict with unstated feasibility constraints, using a diagnose-measure-bridge-treat framework to analyze and mitigate these failures.

Key Findings

• LLMs prioritize surface-level cues over implicit physical or logical constraints
• Failures are systematic and predictable via causal-behavioral analysis
• Proposes a diagnostic framework for identifying and treating heuristic biases

llm-reasoning · heuristics · bias · evaluation

5 upvotes

PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi et al. — University of Oxford, Ukrainian Catholic University

A scalable pipeline using diffusion models to generate photorealistic labeled datasets for 3D human mesh estimation, addressing the bottleneck of acquiring annotated 3D human data from monocular images.

Key Findings

• Diffusion-based pipeline generates photorealistic human data with 3D annotations
• Scales dataset creation beyond manual annotation bottlenecks
• Generated data improves downstream 3D mesh estimation performance

human-pose · diffusion · data-generation · 3d-reconstruction

4 upvotes

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Analysis

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang et al. — National Taiwan University, NVIDIA, Academia Sinica

Investigates how much auditory knowledge LLMs encode through text-only pretraining and how this prior knowledge shapes the capabilities of Large Audio Language Models built on top of them.

Key Findings

• LLMs encode significant auditory knowledge from text-only pretraining
• This implicit knowledge significantly shapes audio model capabilities
• Backbone selection matters more than previously assumed for audio LMs

audio · multimodal · llm · auditory-knowledge

4 upvotes

RawGen: Learning Camera Raw Image Generation

Dongyoung Kim, Junyong Lee, Abhijith Punnappurath et al. — Samsung AI Center Toronto, Yonsei University

A generative framework for synthesizing camera raw images to address the scarcity of raw training data for low-level vision tasks, decoupled from specific camera hardware.

Key Findings

• First diffusion framework designed specifically for raw image synthesis
• Generated raw data is hardware-agnostic and improves downstream tasks
• Addresses a major bottleneck in low-level vision research

image-generation · raw-images · diffusion · low-level-vision

3 upvotes

Distilling Human-Aligned Privacy Sensitivity Assessment from Large Language Models

Gabriel Loiseau, Damien Sileo, Damien Riquet et al. — Hornetsecurity, Université de Lille, Inria, CNRS

Proposes distilling privacy sensitivity assessment capabilities from large LLMs into smaller models, enabling scalable and human-aligned privacy evaluation of textual data.
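The summary does not say which distillation objective the authors use. One standard option, shown here as an assumption, is Hinton-style soft-label distillation. The small student is trained to match the large teacher's temperature-softened distribution over sensitivity levels. The three-level label set and the logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    # Hypothetical sensitivity levels: [low, medium, high]
    teacher = [2.0, 0.5, -1.0]
    print(distillation_loss([2.0, 0.5, -1.0], teacher))  # student matches teacher
    print(distillation_loss([0.0, 0.0, 0.0], teacher))   # uninformed student
```

Matching soft distributions passes on how unsure the teacher is between adjacent sensitivity levels, which hard labels discard. That matters when the target is graded human judgment rather than a single correct class.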

Key Findings

• LLMs can serve as reliable privacy sensitivity assessors
• Distilled smaller models retain assessment quality at lower compute cost
• Alignment with human privacy judgments validated across diverse text types

privacy · distillation · nlp · safety

3 upvotes

Trending Models (12)

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Jackrong · image-text-to-text · 27B

Qwen 3.5 27B fine-tuned on Claude 4.6 Opus reasoning traces, aiming to bring frontier-style chain-of-thought reasoning to a locally deployable model.

reasoning · chain-of-thought · qwen3.5 · distillation

353.2K downloads · 2.0K likes

Cohere Transcribe 03-2026

CohereLabs · automatic-speech-recognition · Unknown

State-of-the-art automatic speech recognition model supporting 20+ languages, topping the Hugging Face ASR leaderboard.

asr · multilingual · speech-recognition

58.7K downloads · 697 likes

Voxtral 4B TTS

Mistral AI · text-to-speech · 4B

Mistral's 4B-parameter text-to-speech model supporting 10+ languages, built on the Ministral 3B base architecture.

tts · multilingual · speech-synthesis

3.9K downloads · 603 likes

Qianfan-OCR

Baidu · image-text-to-text · Unknown

Baidu's vision-language model specialized for OCR and document intelligence, built on InternVL architecture.

ocr · document-intelligence · vision-language

17.8K downloads · 779 likes

Context-1

ChromaDB · text-generation · 20B

ChromaDB's 20B-parameter conversational model fine-tuned from OpenAI's GPT-OSS, targeting context-heavy retrieval and generation tasks.

text-generation · conversational · retrieval

2.5K downloads · 339 likes

Bonsai 8B

Prism ML · text-generation · 8B (1-bit)

1-bit quantized 8B model optimized for on-device inference via llama.cpp, pushing the boundary of extreme compression for local deployment.
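Bonsai's exact 1-bit format is not described here. As a rough illustration of what "1-bit" can mean, the sketch below uses the classic sign-plus-scale binarization from the XNOR-Net line of work. Each weight row is stored as signs plus a single floating-point scale, so a matrix-vector product reduces to additions and subtractions. The function names are hypothetical. Production formats such as llama.cpp's quantization types pack values into bits and store scales per block of weights rather than per row.

```python
from typing import List, Tuple

Row = Tuple[float, List[int]]  # (per-row scale, sign of each weight)


def binarize_row(weights: List[float]) -> Row:
    """1-bit quantization: keep only each weight's sign plus one scale per row.
    For fixed signs, the mean absolute value is the scale that minimizes the
    squared reconstruction error."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize_row(row: Row) -> List[float]:
    """Reconstruct approximate weights from signs and scale."""
    scale, signs = row
    return [scale * s for s in signs]


def binary_matvec(rows: List[Row], x: List[float]) -> List[float]:
    """y = W x: each row needs only additions/subtractions, then one multiply."""
    out = []
    for scale, signs in rows:
        acc = sum(xi if s > 0 else -xi for s, xi in zip(signs, x))
        out.append(scale * acc)
    return out


if __name__ == "__main__":
    W = [[0.5, -0.25, 0.75, -0.5],
         [-1.0, 1.0, -1.0, 1.0]]
    rows = [binarize_row(r) for r in W]
    print(rows)
    print(binary_matvec(rows, [1.0, 2.0, 3.0, 4.0]))
```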

1-bit · on-device · quantization · edge

1.5K downloads · 213 likes

LFM2.5-350M

LiquidAI · text-generation · 350M

LiquidAI's 350M-parameter edge model using their novel LFM2 architecture, supporting 10+ languages for on-device conversational AI.

edge · liquid · efficient · multilingual

3.8K downloads · 167 likes

Nemotron Cascade 2 30B-A3B

NVIDIA · text-generation · 30B (3B active)

NVIDIA's 30B mixture-of-experts reasoning model with 3B active parameters, optimized for general-purpose text generation with RL-trained reasoning.
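The "30B total, 3B active" figure comes from sparse mixture-of-experts routing. A router scores every expert for each token, but only the top-k experts actually run. The sketch below shows generic top-k routing with gates renormalized over the selected experts. The expert count, k, router weights and toy experts are all hypothetical and say nothing about Nemotron Cascade 2's real configuration.

```python
import math
from typing import Callable, List, Tuple

Expert = Callable[[List[float]], List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: List[float], router: List[List[float]],
                experts: List[Expert], k: int = 2) -> Tuple[List[float], List[int]]:
    """Route one token through its top-k experts and mix their outputs."""
    # One router logit per expert: dot product of the token with that expert's router row
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    top = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:k]
    gates = softmax([logits[e] for e in top])  # renormalize over the chosen experts
    out = [0.0] * len(x)
    for gate, e in zip(gates, top):
        y = experts[e](x)  # only these k experts' parameters are used for this token
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top


def make_scaling_expert(c: float) -> Expert:
    # Toy expert: multiplies the input by a constant
    return lambda v: [c * vi for vi in v]


if __name__ == "__main__":
    experts = [make_scaling_expert(c) for c in (1.0, 2.0, 3.0, 4.0)]
    router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    out, active = moe_forward([1.0, 2.0], router, experts, k=2)
    print("active experts:", active)
    print("output:", out)
```

Per-token compute scales with k rather than with the total expert count. That is why a 30B-parameter MoE can run at roughly the cost of a much smaller dense model, although every expert must still be held in memory.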

moe · reasoning · nvidia · efficient

89.6K downloads · 443 likes

Holo3 35B-A3B

Hcompany · image-text-to-text · 35B (3B active)

Multimodal agent model specialized for computer use and GUI automation, built on the Qwen 3.5 MoE architecture.

agent · computer-use · gui-automation · moe

44 downloads · 131 likes

daVinci MagiHuman

GAIR · image-to-video · Unknown

Multimodal generative model for human-centric video, audio, and image synthesis from text and image inputs.

video-generation · audio-generation · multimodal · human-centric

617 downloads · 285 likes

SAM 3.1

Facebook/Meta · mask-generation · Unknown

Meta's updated Segment Anything Model for video segmentation, extending SAM to temporal mask generation.

segmentation · video · sam · foundation-model

3.5K downloads · 107 likes

Harrier OSS v1 0.6B

Microsoft · feature-extraction · 0.6B

Microsoft's 600M-parameter multilingual embedding model supporting 100+ languages, built on Qwen3 with sentence-transformers.

embeddings · multilingual · sentence-transformers

493 downloads · 119 likes

Trending GitHub Repos (12)

anthropics/claude-code

High Relevance · GitHub

Agentic coding tool that runs in the terminal, understands codebases, and helps execute tasks through natural language. Exploding in popularity with 10K+ stars today.

ai-agents · coding-assistant · cli · llm

Shell · 102.1K stars (+10.7K today) · 15.9K forks

luongnv89/claude-howto

High Relevance · GitHub

Visual, example-driven guide to Claude Code from basic concepts to advanced agents, with copy-paste templates. Top trending with 3.3K stars today.

claude · ai-agents · tutorial · developer-tools

Python · 15.9K stars (+3.3K today) · 1.8K forks

openai/codex

High Relevance · GitHub

Lightweight coding agent that runs in the terminal, built in Rust. OpenAI's entry in terminal-based agentic coding, up 2.4K stars today.

ai-agents · coding-assistant · cli · openai

Rust · 71.9K stars (+2.4K today) · 10.1K forks

microsoft/VibeVoice

High Relevance · GitHub

Open-source frontier voice AI from Microsoft. Rapidly gaining traction with 1.7K stars today, signaling strong demand for open voice models.

voice-ai · speech · open-source · microsoft

Python · 34.6K stars (+1.7K today) · 3.9K forks

NousResearch/hermes-agent

High Relevance · GitHub

An adaptive AI agent framework from NousResearch that grows with the user. Trending strongly with 1.5K stars today.

ai-agents · framework · llm · nous-research

Python · 21.8K stars (+1.5K today) · 2.7K forks

PaddlePaddle/PaddleOCR

Powerful OCR toolkit supporting 100+ languages that converts PDFs and images into structured data for LLMs. Steady growth at 686 stars today.

ocr · document-ai · computer-vision · nlp

Python · 74.6K stars (+686 today) · 10.2K forks

google-research/timesfm

High Relevance · GitHub

Google Research's pretrained time-series foundation model for forecasting. Reflects growing interest in foundation models beyond NLP.

time-series · foundation-models · forecasting · google

Python · 12.3K stars (+380 today) · 1.0K forks

sansan0/TrendRadar

AI-driven public opinion and trend monitor with multi-platform aggregation, RSS feeds, and smart alerts.

trend-analysis · nlp · monitoring · ai-tools

Python · 50.5K stars (+258 today) · 22.9K forks

OpenBMB/ChatDev

High Relevance · GitHub

ChatDev 2.0: full software development through LLM-powered multi-agent collaboration. Demonstrates maturation of agent-based development.

multi-agent · software-development · llm · automation

Python · 32.6K stars (+247 today) · 4.0K forks

allenai/OLMo-core

PyTorch building blocks for the OLMo open language model ecosystem from AI2. Part of the push for fully open LLM training.

open-source-llm · pretraining · pytorch · ai2

Python · 1.1K stars (+66 today) · 213 forks

LMCache/LMCache

Self-described as the fastest KV cache layer for LLM inference. Key infrastructure for reducing latency and cost in LLM serving.
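LMCache's own API is not shown here. The sketch below only illustrates the general idea that prefix-reuse KV caching builds on. KV data is stored per fixed-size chunk of tokens, and each chunk is keyed by a hash of the entire prefix up to it, because attention keys and values depend on everything before them. A new request then reuses the longest cached prefix. `PrefixKVCache` is a hypothetical class, and strings stand in for real KV tensors.

```python
import hashlib
from collections import OrderedDict
from typing import Any, List, Sequence, Tuple


class PrefixKVCache:
    """Toy prefix cache: one KV payload per full chunk of tokens, LRU eviction."""

    def __init__(self, chunk_size: int = 256, capacity: int = 1024):
        self.chunk_size = chunk_size  # arbitrary default for this sketch
        self.capacity = capacity      # maximum number of cached chunks
        self._store: "OrderedDict[str, Any]" = OrderedDict()

    def _chunk_keys(self, tokens: Sequence[int]) -> List[str]:
        # Key i hashes every token up to the end of chunk i, so a chunk is only
        # reused when the whole prefix before it matches as well.
        full = len(tokens) - len(tokens) % self.chunk_size  # ignore a trailing partial chunk
        h = hashlib.sha256()
        keys = []
        for start in range(0, full, self.chunk_size):
            chunk = list(tokens[start:start + self.chunk_size])
            h.update(repr(chunk).encode())
            keys.append(h.hexdigest())  # digest so far; h keeps accumulating
        return keys

    def lookup(self, tokens: Sequence[int]) -> Tuple[int, List[Any]]:
        """Return (number of prefix tokens covered, cached payloads in order)."""
        hits = []
        for key in self._chunk_keys(tokens):
            if key not in self._store:
                break
            self._store.move_to_end(key)  # mark as recently used
            hits.append(self._store[key])
        return len(hits) * self.chunk_size, hits

    def store(self, tokens: Sequence[int], payloads: Sequence[Any]) -> None:
        """Cache payloads[i] as the KV data for full chunk i of `tokens`."""
        for key, payload in zip(self._chunk_keys(tokens), payloads):
            self._store[key] = payload
            self._store.move_to_end(key)
            if len(self._store) > self.capacity:
                self._store.popitem(last=False)  # evict the least recently used chunk


if __name__ == "__main__":
    cache = PrefixKVCache(chunk_size=4)
    cache.store([1, 2, 3, 4, 5, 6, 7, 8], ["kv0", "kv1"])
    print(cache.lookup([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # shared system prompt
    print(cache.lookup([1, 2, 3, 4, 99, 6, 7, 8]))        # diverges in chunk 2
```

Only the uncovered suffix then needs a prefill pass. This is where the latency and cost savings on long shared prompts come from.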

llm-inference · kv-cache · optimization · serving

Python · 7.8K stars (+30 today) · 1.1K forks

NVIDIA/Model-Optimizer