Thursday, May 28, 2026

ResearchMath-14K introduces largest research-level math dataset with multi-agent curation from Seoul National University; NEO-ov pioneers native one-vision VLMs for multi-image and video understanding; Understand-Anything leads GitHub with 4,465 stars/day as agent skills ecosystem explodes

research-level-mathematical-reasoning · native-vision-language-models · agent-skills-ecosystem · efficient-video-generation · agent-governance-and-security · compact-model-deployment

Executive Summary

Wednesday's HuggingFace Daily Papers features four submissions spanning research-level mathematical reasoning, vision-language model evaluation, native multimodal architectures, and efficient video generation. The top paper is ResearchMath-14K from Seoul National University (8 upvotes), which constructs the largest collection of research-level math problems (14,056) via a multi-agent pipeline, along with 220K teacher trajectories that reveal concerning fabrication behaviors in newer LLMs. Chartographer tackles the perennial shortcut problem in chart QA by introducing counterfactual charts that force genuine visual reasoning. NEO-ov proposes a native foundation model that eliminates the traditional encoder-decoder stitching approach for VLMs, learning cross-frame representations from raw pixels. OSP-Next combines sparse attention, HiF8 quantization, and reinforcement learning for efficient video generation.

The model landscape sees continued dominance by DeepSeek-V4-Pro (5M downloads, 4,360 likes), with notable gains for ByteDance Lance (924 likes, up from 866) in multimodal generation. HauhauCS's uncensored Qwen3.6-35B variant ranks second among the listed models at 1.6M downloads, while OpenBMB's MiniCPM5-1B (418 likes) and MiniCPM-V-4.6 (1,013 likes) demonstrate strong demand for efficient compact models. New entrants include Meituan's LongCat-Video-Avatar-1.5 (345 likes) for audio-driven video avatars and CohereLabs' command-a-plus quantized vision model.

GitHub trending is dominated by the agent skills and tooling ecosystem. Understand-Anything continues its explosive growth (40K stars, 4,465 today), while ECC (196K stars), Superpowers (210K stars), and taste-skill (24K stars, 2,715 today) represent the maturing agent behavioral alignment space. MoneyPrinterTurbo surges with 1,742 stars today for AI video generation, and DigitalPlatDev/FreeDomain (169K stars, 2,222 today) leads in raw star velocity among non-AI repos.
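Raw daily star counts favor repos that are already large, so it can help to normalize today's gain by the total. The sketch below does this for the repos whose exact daily figures appear in this digest. The function names are illustrative, and the totals are the rounded values shown on the repo cards, so the results are approximate.

```python
# Stars gained today as a share of total stars, using figures quoted in this digest.
# Totals are the rounded card values (e.g. 40.1K), so results are approximate.
REPOS = {
    "Lum1104/Understand-Anything": (4_465, 40_100),
    "Leonxlnx/taste-skill": (2_715, 24_400),
    "DigitalPlatDev/FreeDomain": (2_222, 169_200),
    "harry0703/MoneyPrinterTurbo": (1_742, 62_400),
    "obra/superpowers": (1_511, 209_600),
    "microsoft/agent-governance-toolkit": (472, 2_975),
}


def relative_velocity(stars_today: int, total_stars: int) -> float:
    """Return today's stars as a fraction of the current total."""
    if total_stars <= 0:
        raise ValueError("total_stars must be positive")
    return stars_today / total_stars


def rank_by_relative_velocity(repos: dict) -> list:
    """Sort (name, fraction) pairs from fastest to slowest relative growth."""
    scored = [(name, relative_velocity(t, s)) for name, (t, s) in repos.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, frac in rank_by_relative_velocity(REPOS):
        print(f"{name:40s} {frac:6.1%}")
```

On these numbers the small agent-governance-toolkit grows fastest in relative terms (about 16% of its total in one day), Understand-Anything and taste-skill both sit near 11%, and Superpowers is under 1%, which fits the "explosive" versus "maturing" framing above.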

Researcher Notes

ResearchMath-14K's most provocative finding is not the dataset itself but the fabrication behavior analysis. Across eight open-weight models, newer generations produce 5.6x more references and 5.0x more fake references per reasoning trace. This is a concrete, quantified demonstration of a widely suspected failure mode: as models become more capable at generating plausible-looking mathematical reasoning, they simultaneously become more prone to hallucinating citations and mathematical results. The multi-agent curation pipeline (sourcing from academic papers, filtering, and generating teacher trajectories) is a pragmatic approach to a real bottleneck — research-level math problems are scarce because they require domain expertise to formulate. The 9.2-point average improvement from filtered open-problem attempts on Qwen3 (4B-30B) shows that even imperfect supervision on hard problems provides useful signal, which challenges the conventional wisdom that training data must be fully correct.
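The two multipliers can be combined into one more number: how the share of references that are fake changes between generations. This is plain arithmetic on the figures quoted above, assuming both multipliers are measured against the same older-generation baseline. The names below are illustrative.

```python
# Ratios reported for newer vs. older model generations (per reasoning trace).
REFERENCES_RATIO = 5.6       # total references
FAKE_REFERENCES_RATIO = 5.0  # fabricated references


def fake_share_multiplier(fake_ratio: float, total_ratio: float) -> float:
    """Factor by which the fake fraction (fake / total) changes.

    If the old share is f / r, the new share is (fake_ratio * f) / (total_ratio * r),
    so the share is multiplied by fake_ratio / total_ratio.
    """
    return fake_ratio / total_ratio


multiplier = fake_share_multiplier(FAKE_REFERENCES_RATIO, REFERENCES_RATIO)
print(f"fake share changes by a factor of {multiplier:.2f}")
```

Under that assumption the fake share per citation stays roughly flat (a factor of about 0.89), while the absolute number of fake references per trace still rises fivefold because newer models cite far more often.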

NEO-ov's native pixel-to-language approach represents a genuine architectural bet. Current VLMs use a modular pipeline: image encoder produces features, adapter maps them to language space, decoder generates text. This works but fragments information — pixel-level signals are compressed through the encoder bottleneck before they can interact with language tokens. NEO-ov learns cross-frame representations directly from raw pixels, which in principle preserves more fine-grained visual information. The extension to multi-image, video understanding, and spatial intelligence is where this architecture could show its advantages, since temporal and spatial relationships across frames are precisely what gets lost in the encode-then-align paradigm. Whether the training cost of learning visual representations from scratch (rather than leveraging pre-trained encoders) is justified by the quality gains remains the key question.
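To make the contrast concrete, here is a toy sketch of a "native" input path: frames are cut into raw pixel patches and placed in the same sequence as text tokens, with no separate pre-trained encoder in between. This only illustrates the general idea. The patch size, the `("img", frame_index, block)` tagging, and the function names are assumptions, not details taken from the NEO-ov paper.

```python
from typing import List, Tuple

Frame = List[List[int]]  # a grayscale frame as rows of pixel values


def patchify(frame: Frame, patch: int) -> List[Tuple[int, ...]]:
    """Split a frame into non-overlapping patch x patch blocks, row-major."""
    height, width = len(frame), len(frame[0])
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = tuple(
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            patches.append(block)
    return patches


def build_sequence(frames: List[Frame], words: List[str], patch: int) -> list:
    """Place raw patches from every frame and the text in ONE token sequence."""
    sequence = []
    for index, frame in enumerate(frames):
        for block in patchify(frame, patch):
            # The frame index is kept so a model could relate patches across frames.
            sequence.append(("img", index, block))
    sequence.extend(("txt", word) for word in words)
    return sequence


if __name__ == "__main__":
    frame_a = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    frame_b = [[v + 100 for v in row] for row in frame_a]
    seq = build_sequence([frame_a, frame_b], ["what", "moved", "?"], patch=2)
    print(len(seq), seq[0])
```

In a modular pipeline, the `("img", ...)` entries would instead be features emitted by a frozen or separately trained encoder and passed through an adapter before reaching the language model.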

Chartographer addresses a fundamental evaluation problem that extends beyond charts. The core insight — that models can answer chart questions correctly by exploiting shortcuts or prior knowledge rather than actually reading the chart — applies to virtually every VLM benchmark. By creating counterfactual charts where the visual content changes but the question structure remains fixed, Chartographer isolates visual reasoning from everything else. This is methodologically rigorous but also practically important: if we can't distinguish models that truly read charts from those that pattern-match, we can't trust them in production settings where the charts contain novel data.
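The counterfactual idea can be sketched in a few lines: keep the question fixed, change the chart's underlying data, recompute the ground-truth answer from the new data, and check whether a model's answer follows the chart. The chart spec format, the "which category is largest" question, the made-up values, and the `prior_only_model` stand-in are all hypothetical. Chartographer's actual specifications and rendering are not reproduced here.

```python
import random
from typing import Callable, Dict, List

ChartSpec = Dict[str, float]  # category label -> bar height


def answer_largest(spec: ChartSpec) -> str:
    """Ground truth for the fixed question: which category has the largest bar?"""
    return max(spec, key=lambda label: spec[label])


def make_counterfactual(spec: ChartSpec, rng: random.Random) -> ChartSpec:
    """Shuffle values across labels until the answer differs from the original."""
    labels = list(spec)
    values = list(spec.values())
    original = answer_largest(spec)
    for _ in range(100):
        rng.shuffle(values)
        candidate = dict(zip(labels, values))
        if answer_largest(candidate) != original:
            return candidate
    raise ValueError("could not build a counterfactual with a different answer")


def accuracy(model: Callable[[ChartSpec], str], charts: List[ChartSpec]) -> float:
    """Fraction of charts where the model's answer matches the chart's data."""
    return sum(model(c) == answer_largest(c) for c in charts) / len(charts)


def prior_only_model(spec: ChartSpec) -> str:
    """Stand-in for a model that ignores the chart and answers from memory."""
    return "A"


if __name__ == "__main__":
    rng = random.Random(0)
    original = {"A": 30.0, "B": 20.0, "C": 10.0}  # made-up illustrative values
    counterfactuals = [make_counterfactual(original, rng) for _ in range(5)]
    print("original:", accuracy(prior_only_model, [original]))
    print("counterfactual:", accuracy(prior_only_model, counterfactuals))
```

A model that answers from memory scores perfectly on the original chart and fails on every counterfactual, which is the gap this style of evaluation is designed to expose.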

The GitHub trending data reveals the agent skills ecosystem reaching a new phase. The combined scale is remarkable: Superpowers (210K stars), ECC (196K stars), Anthropic Skills (142K stars), and taste-skill (24K stars, gaining 2,715/day). But the more interesting signal is the diversification from generic agent harnesses toward behavioral alignment tools. taste-skill (preventing generic output), stop-slop (removing AI tells from prose), and Anthropic-Cybersecurity-Skills (structured security capabilities) all focus on what agents should and shouldn't do, rather than how to make them run. This is a maturation signal: the community has moved past 'can we build agents' to 'how do we make agents that produce quality output.' Microsoft's agent-governance-toolkit (2,975 stars, 472 today) adds the enterprise governance layer.

The video generation space is converging on a common efficiency playbook. OSP-Next combines sparse attention, quantization (HiF8), and RL-based optimization — which is essentially the same toolkit that language model inference has used, now applied to Diffusion Transformers. Meituan's LongCat-Video-Avatar-1.5 (345 likes with zero downloads, suggesting it just launched) targets audio-driven avatar generation, while SulphurAI's Sulphur-2-base (1.4M downloads) continues to lead in open text-to-video. The pattern is clear: video generation is following the language model trajectory of capability → efficiency → specialization, compressed into a much shorter timeline.

Themes & Trends

↑

Research-Level Mathematical Reasoning

rising

ResearchMath-14K tackles the scarcity of research-level math problems for LLM training with a multi-agent curation pipeline, while revealing that newer models increasingly fabricate mathematical references — a critical finding for the field's reliability.

↑

Agent Skills and Behavioral Alignment Ecosystem

rising

The agent ecosystem has shifted from raw capability to behavioral alignment, with taste-skill (2,715 stars/day), stop-slop, Anthropic-Cybersecurity-Skills, and Anthropic Skills (142K stars) all focusing on what agents should and shouldn't do rather than how to make them run.

↑

Native Vision-Language Architectures

rising

NEO-ov's native pixel-to-language approach and Chartographer's rigorous evaluation methodology address two separate weaknesses in current VLM practice: information loss through encoder bottlenecks in modular pipelines, and evaluation shortcuts that mask true visual reasoning.

↑

Video Generation Efficiency and Specialization

rising

OSP-Next applies the language model efficiency playbook (sparse attention, quantization, RL) to Diffusion Transformers, while Meituan's LongCat and SulphurAI's Sulphur-2 represent the specialization phase of video generation toward avatars and general text-to-video respectively.

↑

Agent Governance and Enterprise Security

rising

Microsoft's agent-governance-toolkit (472 stars/day) and Anthropic-Cybersecurity-Skills (886 stars/day) signal that agent safety and governance are becoming first-class enterprise concerns, with structured frameworks mapped to established security standards.

→

Compact Model Deployment

stable

OpenBMB's MiniCPM5-1B and MiniCPM-V-4.6, Sapient's HRM-Text-1B, and the Qwen3.6 GGUF quantizations demonstrate sustained demand for efficient, edge-deployable models that balance capability with resource constraints.

Trending Papers (4)

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

High Relevance

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu — Seoul National University

Introduces ResearchMath-14K, the largest collection of research-level mathematical problems (14,056), curated via a multi-agent pipeline, along with ResearchMath-Reasoning, which contains 220K teacher trajectories. Reveals that newer LLM generations produce 5.6x more references and 5.0x more fabricated references per trace, while showing that filtered open-problem attempts still improve Qwen3 models by 9.2 points on average.

Key Findings

• Multi-agent pipeline curates 14,056 research-level math problems from academic sources, making it the largest such dataset to date
• Newer LLM generations produce 5.6x more references and 5.0x more fake references per reasoning trace, quantifying hallucination trends in mathematical reasoning
• Filtered open-problem attempts improve Qwen3 models (4B-30B) by 9.2 points on average, demonstrating useful supervision from imperfect reasoning traces

mathematical-reasoning · dataset · multi-agent · hallucination-analysis · fine-tuning

8 upvotes

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

High Relevance

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi — University of Waterloo, Layer 6 AI

Proposes counterfactual chart generation to rigorously evaluate visual reasoning in VLMs by fixing the chart-question task while varying the underlying chart data and answers. Introduces a framework to reverse-engineer charts into executable specifications and regenerate them with modified data, isolating genuine visual reasoning from shortcuts and prior knowledge.

Key Findings

• Models can answer chart questions correctly via shortcuts or prior familiarity rather than visual reasoning, undermining benchmark validity
• Counterfactual charts that fix question structure while varying visual content isolate genuine visual reasoning ability
• Framework reverse-engineers charts into executable specifications enabling systematic counterfactual generation

vlm-evaluation · chart-qa · counterfactual · visual-reasoning · benchmark

3 upvotes

From Pixels to Words -- Towards Native One-Vision Models at Scale

High Relevance

Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu — Peking University, Shanghai AI Laboratory

Introduces NEO-ov, a native foundation model that eliminates the traditional encoder-decoder stitching approach for vision-language models. NEO-ov learns cross-frame representations directly from raw pixels, enabling multi-image, video understanding, and spatial intelligence without the information loss inherent in modular pipelines.

Key Findings

• Native pixel-to-language architecture avoids the information fragmentation caused by separate encoder-adapter-decoder pipelines
• Cross-frame representation learning from raw pixels preserves fine-grained visual information for temporal and spatial reasoning
• Extends native VLM capabilities beyond single images to multi-image, video understanding, and spatial intelligence

vision-language-model · native-architecture · multi-image · video-understanding · spatial-intelligence

1 upvote

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

High Relevance

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu — Microsoft Research

OSP-Next integrates sparse attention, sparse sequence parallelism, HiF8 quantization, and reinforcement learning for efficient text-to-video generation. The hybrid full-sparse attention architecture uses Skiparse-2D Attention with token-wise and group-wise sparse patterns along spatial and temporal dimensions to reduce the quadratic cost of full attention in Diffusion Transformers.
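As a rough illustration of why skip-style sparsity cuts attention cost, the sketch below builds two generic strided patterns over a 1D token sequence: a token-wise one (every k-th token) and a group-wise one (every k-th group of g tokens). These are simplified stand-ins chosen for illustration, not the paper's Skiparse-2D definition, and the parameters `k` and `g` are assumptions.

```python
from typing import Callable, Set


def token_skip_partners(i: int, n: int, k: int) -> Set[int]:
    """Token-wise pattern: token i attends to tokens with the same index mod k."""
    return {j for j in range(n) if j % k == i % k}


def group_skip_partners(i: int, n: int, k: int, g: int) -> Set[int]:
    """Group-wise pattern: tokens form runs of g; token i attends to every token
    whose group index is congruent to its own group index mod k."""
    return {j for j in range(n) if (j // g) % k == (i // g) % k}


def attended_pairs(n: int, partners: Callable[[int], Set[int]]) -> int:
    """Count (query, key) pairs, a simple proxy for attention compute."""
    return sum(len(partners(i)) for i in range(n))


if __name__ == "__main__":
    n, k, g = 16, 4, 2
    full = n * n
    token_wise = attended_pairs(n, lambda i: token_skip_partners(i, n, k))
    group_wise = attended_pairs(n, lambda i: group_skip_partners(i, n, k, g))
    print(full, token_wise, group_wise)
```

When n is divisible by k (and by k times g for the group-wise case), both patterns keep n²/k query-key pairs, so the cost is still quadratic in n but smaller by a factor of k.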

Key Findings

• Hybrid full-sparse attention architecture with Skiparse-2D Attention reduces quadratic attention cost in Diffusion Transformers
• HiF8 quantization combined with sparse sequence parallelism enables practical deployment of high-quality video generation
• Reinforcement learning-based optimization further improves generation quality while maintaining efficiency gains

video-generation · sparse-attention · quantization · diffusion-transformers · reinforcement-learning

0 upvotes

Trending Models (12)

DeepSeek-V4-Pro

DeepSeek AI · text-generation · unknown

The dominant open-weight large language model, maintaining its position as the most-downloaded model on HuggingFace with over 5 million downloads and massive community adoption for conversational tasks.

conversational · text-generation · deepseek

5.0M downloads · 4.4K likes

Anima

Circlestone Labs · image-generation · unknown

A leading open diffusion model compatible with ComfyUI, continuing strong traction as a community-favored image generation model with single-file distribution and growing likes.

diffusion · comfyui · image-generation

690.2K downloads · 1.6K likes

Sulphur-2-base

SulphurAI · text-to-video · unknown

A leading open text-to-video generation model available in both diffusers and GGUF formats, maintaining high download volume of 1.4M for video generation workloads.

text-to-video · diffusers · video-generation

1.4M downloads · 1.4K likes

Hy-MT2-1.8B

Tencent · translation · 1.8B

A specialized 1.8B-parameter translation model from Tencent's Hunyuan family, demonstrating sustained community interest in dedicated translation models with over 1,000 likes.

translation · hunyuan · multilingual

7.5K downloads · 1.1K likes

MiniCPM-V-4.6

OpenBMB · image-text-to-text · unknown

An efficient multimodal vision-language model continuing the MiniCPM-V series, with strong community adoption for on-device and edge deployment scenarios at over 355K downloads.

multimodal · vision-language · efficient

355.0K downloads · 1.0K likes

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

HauhauCS · text-generation · 35B-A3B (MoE)

A community-produced uncensored variant of Qwen3.6-35B using mixture-of-experts architecture (3B active parameters), distributed in GGUF format with vision capabilities. Ranks second in download volume among the listed models at 1.6M.

qwen3.6 · moe · gguf · uncensored · vision

1.6M downloads · 947 likes

Lance

ByteDance Research · image-generation · unknown

ByteDance's multimodal generation model targeting both image and video generation, showing rising community engagement with 924 likes, up from 866 yesterday.

multimodal · image-generation · video-generation

1.9K downloads · 924 likes

supertonic-3

Supertone · text-to-speech · unknown

A text-to-speech and speech synthesis model using ONNX format, reflecting continued interest in high-quality open TTS solutions with 713 likes.

tts · speech-synthesis · onnx

48.1K downloads · 713 likes

Qwen3.6-27B-MTP-GGUF

Unsloth · text-generation · 27B

Unsloth's GGUF quantization of Qwen3.6-27B with Multi-Token Prediction support, enabling efficient local inference with 735K downloads.

gguf · quantized · qwen · mtp

735.3K downloads · 519 likes

MiniCPM5-1B

OpenBMB · text-generation · 1B

The latest 1B-parameter entry in the MiniCPM series, a compact language model suitable for edge deployment that has gained significant traction with 418 likes.

compact · minicpm · edge-deployment

2.4K downloads · 418 likes

HRM-Text-1B

Sapient Inc · text-generation · 1B

A compact 1B-parameter text generation model with high download volume of 103K, suggesting strong utility for lightweight text generation use cases.

text-generation · compact · hrm

103.0K downloads · 394 likes

LongCat-Video-Avatar-1.5

Meituan · audio-text-to-video · unknown

A new audio-driven video avatar generation model supporting audio-text-to-video and audio-image-text-to-video pipelines, representing Meituan's entry into the avatar generation space with 345 likes.

video-avatar · audio-driven · diffusers

0 downloads · 345 likes

Trending GitHub Repos (15)

Lum1104/Understand-Anything

High Relevance

Turns any codebase into an interactive knowledge graph for exploration, search, and Q&A. Leading in AI-related star velocity at 4,465 stars/day, up to 40K total stars.

knowledge-graph · code-understanding · developer-tools

TypeScript · 40.1K · +4.5K today · 3.2K

Leonxlnx/taste-skill

High Relevance

A skill file that gives AI coding agents 'good taste' by preventing generic output generation, leading the behavioral alignment category with 2,715 stars today.

agent-skills · behavioral-alignment · quality-control

Shell · 24.4K · +2.7K today · 1.9K

DigitalPlatDev/FreeDomain

Free domain service for everyone, leading non-AI repos in raw star velocity at 2,222 stars/day and 169K total stars, indicating massive community interest.

free-domain · developer-tools · infrastructure

HTML · 169.2K · +2.2K today · 3.2K

affaan-m/ECC

High Relevance

The agent harness performance optimization system with skills, instincts, memory, security, and research-first development, continuing its massive growth to 196K stars.

agent-harness · coding-agents · developer-tools

JavaScript · 196.1K · +2.1K today · 30.2K

harry0703/MoneyPrinterTurbo

High Relevance

AI-powered one-click short video generation tool using LLMs, surging with 1,742 stars today to 62K total. Combines AI content generation with automated video production.

video-generation · content-creation · llm-application

Python · 62.4K · +1.7K today · 9.1K

rohitg00/ai-engineering-from-scratch

High Relevance

Comprehensive learning resource for AI engineering covering the full lifecycle, surging with 1,739 stars today to 22.6K total.

education · ai-engineering · learning-resource

Python · 22.6K · +1.7K today · 3.7K

obra/superpowers

High Relevance

An agentic skills framework and software development methodology that works, now the largest agent framework on GitHub at 210K stars with sustained momentum of 1,511 stars/day.

agent-framework · skills · methodology

Shell · 209.6K · +1.5K today · 18.7K

mukul975/Anthropic-Cybersecurity-Skills

High Relevance

754 structured cybersecurity skills mapped to 5 frameworks (MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND, NIST AI RMF) for AI agents, continuing strong adoption at 11K stars with 886 today.

cybersecurity · agent-skills · security-frameworks

Python · 11.0K · +886 today · 1.3K

anthropics/knowledge-work-plugins

High Relevance

Anthropic's open-source plugins for knowledge workers using Claude Cowork, maintaining strong growth at 17K stars with 695 new stars today.

plugins · knowledge-work · anthropic

Python · 17.3K · +695 today · 2.0K

anthropics/skills

High Relevance

Anthropic's public repository for agent skills, the official skills ecosystem for Claude with 142K stars and sustained daily growth of 686 stars.

agent-skills · anthropic · claude

Python · 142.0K · +686 today · 16.8K

twentyhq/twenty

The open alternative to Salesforce designed for AI, with 47K stars and 519 new stars today, representing the growing AI-native business software trend.

crm · ai-native · salesforce-alternative

TypeScript · 47.4K · +519 today · 6.7K

microsoft/agent-governance-toolkit

High Relevance

Microsoft's AI agent governance toolkit with policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering covering all OWASP Agentic Top 10 risks.

agent-governance · security · microsoft · zero-trust

Python · 3.0K · +472 today · 468

shiyu-coder/Kronos

High Relevance

A foundation model for the language of financial markets, continuing its strong growth at 27K stars with 401 new stars today.

financial-ai · foundation-model · markets

Python · 26.9K · +401 today · 4.7K

iii-hq/iii

A Rust-based platform to compose, extend, and observe services in real-time, gaining 376 stars today with 17K total stars.

service-composition · observability · rust

Rust · 16.9K · +376 today · 1.1K