GBQA benchmark reveals frontier LLMs catch under half of game bugs autonomously; ThinkTwice unifies reasoning and self-refinement via GRPO; Gemma 4 family dominates HuggingFace trending with six model variants

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo — Seoul National University, NAVER

A unified framework that generates garment-transferred human animation videos from a single image, garment images, and pose guidance, eliminating the identity drift and garment distortion of two-stage pipelines.

Key Findings

• Single-stage unified approach eliminates cascading errors from separate try-on and animation stages
• Synthetic triplet supervision enables training without paired ground-truth animation data
• Achieves coherent front-back consistency and identity preservation across frames

virtual-try-on · video-generation · human-animation · fashion-tech · computer-vision

31 upvotes

Watch Before You Answer: Learning from Visually Grounded Post-Training

High Relevance

Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia — Meta AI, University of Illinois Urbana-Champaign

Reveals that 40-60% of long video understanding benchmark questions can be answered with text cues alone, and proposes visually grounded post-training to force genuine visual reasoning in VLMs.

Key Findings

• 40-60% of long video benchmark questions are solvable from text cues without watching any video
• Current VLM evaluation conflates language reasoning with visual understanding
• Visually grounded post-training significantly improves genuine visual reasoning fidelity

video-understanding · VLM · benchmark-critique · visual-grounding · post-training

26 upvotes

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

High Relevance

Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye — University of Notre Dame, Lehigh University

A memory-centric system that trains 100B+ parameter LLMs at full precision on a single GPU by storing parameters and optimizer states in host memory and treating GPUs as transient compute engines.

Key Findings

• Enables full-precision 100B+ training on a single GPU via CPU-GPU memory orchestration
• Micro-pipeline scheduling and adaptive memory management overcome the bandwidth bottleneck
• Democratizes large-scale training for researchers without multi-node clusters

efficient-training · single-gpu · memory-optimization · large-scale · systems

24 upvotes
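
As a rough back-of-the-envelope check on why a system like MegaTrain would keep parameters and optimizer states in host memory, the sketch below estimates the training state of a 100B-parameter model. The 4-byte fp32 format, the Adam-style optimizer with two moments per parameter, and the 80 GB GPU capacity are illustrative assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope memory estimate (illustrative assumptions, not from the paper).
BYTES_FP32 = 4        # assumption: "full precision" means 32-bit floats
NUM_PARAMS = 100e9    # 100B parameters, the scale named in the summary
GPU_MEMORY_GB = 80    # assumption: a single 80 GB accelerator


def training_state_gb(num_params: float, bytes_per_value: int = BYTES_FP32) -> dict:
    """Estimate memory in GB (1 GB = 1e9 bytes) for weights, gradients, and Adam moments."""
    per_tensor = num_params * bytes_per_value / 1e9
    return {
        "weights": per_tensor,
        "gradients": per_tensor,
        "adam_moments": 2 * per_tensor,  # assumption: two moment tensors per parameter
        "total": 4 * per_tensor,
    }


estimate = training_state_gb(NUM_PARAMS)
print(estimate)
print(f"total / GPU memory = {estimate['total'] / GPU_MEMORY_GB:.0f}x")
```

Under these assumptions the persistent state is many times larger than one GPU's memory, and that gap is what streaming from host memory would have to cover.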

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

High Relevance

Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang — MIT CSAIL, MIT

Formally benchmarks LLM agent skill usage under realistic conditions where agents must search, select, and compose skills from large pools rather than being handed task-specific tools.

Key Findings

• Performance degrades significantly when agents must self-select skills from large pools
• Current agents struggle with skill composition and multi-step skill chains
• Gap between idealized skill-provided benchmarks and realistic self-serve settings is substantial

agent-skills · benchmark · tool-use · LLM-agents · realistic-evaluation

24 upvotes

General Multimodal Protein Design Enables DNA-Encoding of Chemistry

High Relevance

Jarrid Rector-Brooks, Théophile Lambert, Marta Skreta, Daniel Roth, Yueming Long — Mila, Université de Montréal, University of Toronto

DISCO co-designs protein sequence and 3D structure around arbitrary biomolecules using diffusion, creating enzymes without pre-specifying catalytic residues — a first for generative protein design.

Key Findings

• First generative model to design enzymes without pre-specified catalytic residues
• Co-designs protein sequence and 3D structure simultaneously around arbitrary ligands
• Inference-time scaling methods optimize designs for stability and binding affinity

protein-design · enzyme-engineering · diffusion-models · structural-biology · drug-discovery

21 upvotes

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal — Mohamed bin Zayed University of AI, Stanford University, Amazon

A multi-agent LLM system for automated research discovery and analysis that reduces the effort to find, assess, organize, and synthesize relevant scientific papers.

Key Findings

• Multi-agent architecture distributes search, evaluation, and synthesis across specialized LLM agents
• Open-source framework enables customizable research workflows
• Demonstrates significant reduction in manual literature review effort

research-automation · multi-agent · literature-review · scientific-discovery · open-source

20 upvotes

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

High Relevance

Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi — Tsinghua University, Harbin Institute of Technology

The first unified post-training framework for diffusion language models, consolidating RL objectives, rollout implementations, and evaluation across the fragmented dLLM ecosystem.

Key Findings

• Unifies reinforcement learning objectives for diffusion language models under one framework
• Standardizes rollout and evaluation pipelines across previously incompatible dLLM codebases
• Enables systematic comparison of alignment approaches for non-autoregressive generation

diffusion-LLM · alignment · reinforcement-learning · post-training · framework

17 upvotes

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

High Relevance

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao — Microsoft Research, University of Washington

A benchmark for evaluating LLM agents in realistic productivity settings with five high-fidelity mock services covering email, scheduling, and document management workflows.

Key Findings

• Existing benchmarks fail to capture stateful, multi-service productivity workflows
• Five mock services simulate realistic email, calendar, and document interactions
• Reveals significant capability and safety gaps in current LLM productivity agents

benchmark · productivity-agents · safety · multi-service · workspace-simulation

16 upvotes

In-Place Test-Time Training

High Relevance

Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He — Tsinghua University, Microsoft Research

Enables LLMs to dynamically update their parameters during inference, breaking the static train-then-deploy paradigm to handle continuous streams of new information and distribution shifts.

Key Findings

• LLMs can update parameters in-place during inference for dynamic adaptation
• Addresses architectural incompatibility and computational inefficiency of prior TTT methods
• Significant improvements on long-context and distribution-shift tasks

test-time-training · adaptive-inference · long-context · dynamic-adaptation · LLM

14 upvotes

MedGemma 1.5 Technical Report

Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil — Google Health, Google DeepMind

Expands MedGemma with support for CT/MRI volumes, histopathology whole slide images, anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding.

Key Findings

• Single 4B architecture handles diverse high-dimensional medical imaging modalities
• Adds bounding-box anatomical localization and multi-timepoint X-ray analysis
• Improved medical document understanding for lab reports and electronic health records

medical-AI · multimodal · radiology · pathology · clinical-NLP

9 upvotes

Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents

Ádám Kovács — ETH Zurich

Addresses the problem of coding agents consuming excessively long tool observations by introducing task-conditioned pruning that returns only the smallest relevant evidence block.

Key Findings

• Fine-tuned Qwen 3B achieves strong pruning accuracy on 11,477 SWE-bench-derived examples
• Reduces agent context consumption by extracting minimal verbatim evidence blocks
• Manually curated 618-example test set validates real-world pruning quality

coding-agents · context-efficiency · tool-use · SWE-bench · pruning

4 upvotes

Trending Models (11)

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Jackrong · text-generation · 27B

A 27B Qwen3.5 model distilled from Claude 4.6 Opus reasoning traces, optimized for chain-of-thought and logical inference tasks.

reasoning-distillation · qwen3.5 · open-weights

560.8K downloads · 2.5K likes

Gemma 4 31B Instruct

Google · image-text-to-text · 31B

Google's flagship 31B instruction-tuned Gemma 4 model with multimodal image-text-to-text capabilities and conversational fine-tuning.

gemma4 · multimodal · instruction-tuned

1.1M downloads · 1.5K likes

Qianfan-OCR

Baidu · feature-extraction · undisclosed

Vision-language model specialized for OCR and document understanding, built on InternVL architecture with strong feature extraction for text-heavy images.

OCR · vision-language · document-understanding

41.7K downloads · 1.1K likes

Gemma 4 26B-A4B Instruct

Google · image-text-to-text · 26B (4B active)

Mixture-of-experts Gemma 4 variant with 26B total parameters but only 4B active, hitting a sweet spot for efficient local deployment with multimodal capabilities.

gemma4 · MoE · efficient-deployment

835.8K downloads · 541 likes

Gemma-4-31B-JANG_4M-CRACK

DealignAI · text-generation · 31B

Abliterated (uncensored) version of Gemma 4 31B in MLX format, targeting local deployment without safety restrictions.

abliterated · uncensored · MLX

44.2K downloads · 792 likes

GLM-5.1

ZAI (Zhipu AI) · text-generation · undisclosed

Latest GLM series model with MoE architecture, continuing Zhipu AI's competitive Chinese-English bilingual LLM line.

GLM · MoE · bilingual

1.3K downloads · 745 likes

Void Model

Netflix · video-inpainting · undisclosed

Video inpainting and object removal model based on CogVideoX diffusion architecture, enabling seamless video editing workflows.

video-editing · inpainting · object-removal

0 downloads · 647 likes

Bonsai-8B-gguf

Prism ML · text-generation · 8B (1-bit)

1-bit quantized 8B model in GGUF format optimized for llama.cpp and CUDA, pushing extreme compression for on-device deployment.

1-bit · GGUF · extreme-quantization

59.6K downloads · 521 likes

Gemma 4 E4B Instruct

Google · any-to-any · 4B

Compact 4B Gemma 4 variant with any-to-any multimodal capabilities, optimized for edge and mobile deployment.

gemma4 · edge-deployment · any-to-any

623.0K downloads · 509 likes

VoxCPM2

OpenBMB · text-to-speech · undisclosed

Multilingual text-to-speech model with zero-shot voice cloning capabilities, part of the CPM model family from Tsinghua University's OpenBMB lab.

TTS · multilingual · voice-cloning

605 downloads · 463 likes

OmniVoice

k2-fsa (Next-gen Kaldi) · text-to-speech · undisclosed

Zero-shot multilingual voice cloning model from the Kaldi successor project, enabling high-quality speech synthesis across languages with minimal reference audio.

voice-cloning · multilingual · zero-shot

144.9K downloads · 398 likes

Trending GitHub Repos (13)

NousResearch/hermes-agent

High Relevance · GitHub

Full-featured agentic framework built on the Hermes model family, providing extensible agent capabilities that grow with user needs. Continues explosive growth from prior days.

agent-framework · hermes · extensible-agents

Python · 38.0K · +5.8K today · 4.8K

obra/superpowers

High Relevance · GitHub

An agentic skills framework and software development methodology providing reusable skill components for AI-assisted coding workflows.

agentic-skills · dev-methodology · coding-agents

Shell · 141.7K · +2.0K today · 12.1K

HKUDS/DeepTutor

High Relevance · GitHub

Agent-native personalized learning assistant that adapts to individual student needs through multi-agent architecture and intelligent tutoring strategies.

education-AI · personalized-learning · tutoring-agent

Python · 13.7K · +1.3K today · 1.9K

abhigyanpatwari/GitNexus

High Relevance · GitHub

Client-side knowledge graph engine that runs in-browser, converting GitHub repos or ZIP files into interactive knowledge graphs with a built-in Graph RAG agent for code exploration.

knowledge-graph · code-exploration · graph-RAG

TypeScript · 25.3K · +980 today · 2.8K

google-ai-edge/gallery

High Relevance · GitHub

Showcase gallery for on-device ML and GenAI use cases, allowing users to try and run models locally on mobile and edge devices.

on-device-ML · edge-inference · mobile-AI

Kotlin · 19.5K · +853 today · 1.8K

forrestchang/andrej-karpathy-skills

Collection of AI/ML skills and knowledge distilled from Andrej Karpathy's teachings, packaged for use in agentic coding workflows.

skills-collection · karpathy · education

9.1K · +702 today · 629

TheCraigHewitt/seomachine

Specialized Claude Code workspace for creating long-form SEO-optimized blog content, integrating research, writing, analysis, and optimization in a single agent pipeline.

SEO · content-generation · claude-code

Python · 4.6K · +649 today · 714

NVIDIA/personaplex

High Relevance · GitHub

NVIDIA's persona management and multi-agent system for creating and orchestrating diverse AI personas across workflows.

persona-management · multi-agent · NVIDIA

Python · 8.5K · +586 today · 1.2K

google-ai-edge/LiteRT-LM

High Relevance · GitHub

Lightweight runtime for running language models on edge devices, part of Google's on-device AI infrastructure push.

edge-inference · lightweight-runtime · on-device-LLM

C++ · 3.0K · +501 today · 283

atilaahmettaner/tradingview-mcp

MCP server for AI-powered market analysis with TradingView integration, supporting real-time crypto and stock screening, technical indicators, and candlestick pattern detection.

MCP · trading · market-analysis

Python · 1.3K · +447 today · 302

microsoft/BitNet

High Relevance · GitHub

Official inference framework for 1-bit LLMs from Microsoft Research, enabling extreme model compression while maintaining generation quality.

1-bit-LLM · inference · compression

Python · 37.9K · +388 today · 3.4K

HKUDS/AI-Trader