Saturday, May 30, 2026

AgentDoG 1.5 proposes lightweight safety alignment for open-world AI agents with 81 upvotes; Qwen-VLA unifies manipulation and navigation across robot embodiments; VoxCPM surges 1,815 stars/day for tokenizer-free multilingual TTS

agent-safety-alignment · embodied-foundation-models · unified-retrieval-systems · speech-synthesis-renaissance · efficient-lora-techniques · video-world-models

Executive Summary

Friday's research landscape is dominated by agent safety and embodied intelligence. The top paper, AgentDoG 1.5 (81 upvotes), introduces a lightweight and scalable alignment framework for AI agent safety, updating the safety taxonomy to address emergent risks from frontier models like Codex that drastically lower attack barriers. Qwen-VLA (74 upvotes) presents a unified vision-language-action model that bridges manipulation, navigation, and other embodied tasks across different robot platforms — a significant step toward general-purpose embodied foundation models. OmniRetrieval (54 upvotes) tackles the fragmented retrieval landscape by unifying access across text, tables, knowledge graphs, and property graphs without collapsing structural affordances.

The model ecosystem shows continued momentum in compact and efficient architectures. DeepSeek-V4-Pro maintains its dominant position at 5.8M downloads and 4,439 likes. SulphurAI's Sulphur-2-base reaches 1.5M downloads with 1,441 likes for text-to-video generation. Tencent's Hy-MT2 translation models debut strongly with the 1.8B variant gaining 1,088 likes and the 30B MoE version earning 425 likes, signaling serious competition in neural machine translation. ByteDance Lance (974 likes) continues climbing for multimodal any-to-any generation, while NVIDIA LocateAnything-3B (389 likes) introduces visual grounding at scale.

GitHub trending reveals a speech synthesis renaissance alongside the maturing agent tooling ecosystem. VoxCPM explodes with 1,815 stars/day (22.2K total) for tokenizer-free multilingual TTS, and MOSS-TTS gains 355 stars/day for its comprehensive speech generation family. MoneyPrinterTurbo leads with 3,567 stars/day for AI video generation. The agent ecosystem remains massive, led by ECC (198.6K stars) and Anthropic Skills (143.6K stars), while taste-skill (28.2K stars, 2,062/day) represents the quality-alignment movement in AI output.

Researcher Notes

AgentDoG 1.5's framing of the agent safety problem is timely and important. The paper correctly identifies that modern open-world agents like OpenClaw have powerful cross-environment execution capabilities, but the current alignment frameworks are inadequate because frontier AI models have dramatically lowered the barrier to attack. The lightweight and scalable approach is pragmatically sound: heavyweight alignment methods that add significant inference overhead or require model-specific tuning won't survive contact with the rapid deployment cycles of agent frameworks. At 81 upvotes, this is the highest-engagement paper of the day, ahead of Qwen-VLA at 74, reflecting growing community anxiety about agent safety as agent deployment scales.

Qwen-VLA represents an architecturally ambitious attempt at embodied unification that deserves close attention. Most embodied AI research remains fragmented — manipulation models know nothing about navigation, tabletop policies don't transfer to mobile robots, and indoor models fail outdoors. Qwen-VLA extends Qwen's vision-language stack from perception into action, attempting to handle heterogeneous embodied decision-making within a single model. The key question is whether a single VLA model can genuinely achieve competitive performance across diverse tasks and embodiments, or whether the unification comes at the cost of specialist performance. At 74 upvotes, the community is clearly interested in the answer.

The LoRA research thread is growing more sophisticated. Two papers today approach LoRA from different angles: CollectionLoRA (49 upvotes) tackles the practical deployment problem of managing many effect LoRAs by distilling 50 effects into a single adapter via multi-teacher on-policy distillation, while How LoRA Remembers (20 upvotes) establishes a quantitative parametric memory law for LoRA fine-tuning. The former addresses the immediate pain of LoRA proliferation in production systems; the latter provides theoretical foundations for understanding capacity limits. Together, they suggest the field is moving from 'LoRA works' to 'LoRA understood', a maturation signal.
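
For readers who want the mechanics behind "capacity limits", here is a minimal sketch of the standard LoRA update, in which a frozen weight is augmented by a low-rank product. It is not taken from either paper: the dimensions, rank, scaling value, and function names are illustrative assumptions, and NumPy stands in for a real training framework.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus the low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in one adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4          # illustrative sizes, not from the papers
W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))            # trainable up-projection, zero-initialised

x = rng.normal(size=(5, d_in))
y = lora_forward(x, W, A, B, alpha=8.0)  # equals x @ W.T while B is zero

# 384 trainable adapter parameters versus 2048 frozen ones in W
print(lora_param_count(d_out, d_in, r), W.size)
```

Because the update `B @ A` has rank at most `r`, the adapter's trainable parameter count grows linearly with the rank, which is one intuitive reason to expect a measurable ceiling on how much an adapter can store.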

The video world model space is heating up with minWM and YoCausal addressing complementary gaps. minWM (40 upvotes) provides a full-stack open-source framework for real-time interactive video world models, spanning the entire pipeline from data construction through streaming inference. YoCausal (32 upvotes) asks the harder question of whether video diffusion models truly understand causality or merely overfit to statistical temporal patterns. The VoE (Violation of Expectation) paradigm borrowed from cognitive science is clever — using temporally reversed real-world videos as zero-cost counterfactual samples is both elegant and scalable. The complementarity is clear: minWM gives you the engineering to build world models; YoCausal gives you the evaluation to know if they're actually world models.
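
The reversed-video idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not YoCausal's benchmark code: the "clip" is a 1-D height trajectory rather than frames, and `toy_plausibility` is a hypothetical stand-in for whatever likelihood or score a real video model would provide.

```python
import numpy as np

def reverse_time(video):
    """Return the clip played backwards; axis 0 is assumed to be time."""
    return video[::-1]

def toy_plausibility(heights):
    """Hypothetical stand-in for a video model's score.

    Rewards clips in which a dropped object's height never increases.
    """
    steps = np.diff(heights)
    return float(np.mean(steps <= 0))

def prefers_forward(clip, score_fn):
    """VoE-style check: is the real clip scored above its reversed copy?"""
    return score_fn(clip) > score_fn(reverse_time(clip))

# A falling object: height decreases over eight frames.
falling = np.array([10.0, 9.5, 8.5, 7.0, 5.0, 2.5, 0.5, 0.0])

print(prefers_forward(falling, toy_plausibility))  # True for this toy clip
```

The appeal of the construction is that the counterfactual costs nothing to produce: any real clip yields its own physically implausible twin by reversing the time axis.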

The GitHub trending data shows speech synthesis entering a new phase of open-source maturity. VoxCPM (1,815 stars/day) from OpenBMB offers tokenizer-free TTS for multilingual speech generation, creative voice design, and true-to-life cloning. MOSS-TTS (355 stars/day) from OpenMOSS covers the full spectrum from stable long-form speech to multi-speaker dialogue and real-time streaming. Combined with Supertone's supertonic-3 model (738 likes on HuggingFace), we're seeing a convergence of high-quality open-source TTS options that could significantly lower the barrier to voice-enabled applications.

Themes & Trends

Agent Safety and Alignment · rising

Growing focus on safety frameworks for increasingly capable open-world AI agents, with AgentDoG 1.5 leading at 81 upvotes and reflecting community anxiety about deployment risks.

Embodied Foundation Models · rising

Convergence toward unified models that handle diverse embodied tasks across robot platforms, with Qwen-VLA representing the most ambitious unification attempt at 74 upvotes.

LoRA Understanding and Scaling · rising

Maturation from empirical LoRA usage to theoretical understanding and practical scaling, with CollectionLoRA solving deployment overhead and the parametric memory law establishing capacity limits.

Video World Model Evaluation · rising

Dual thrust in video world models: minWM provides full-stack engineering while YoCausal introduces cognitive-science-inspired evaluation for causal understanding.

Speech Synthesis Renaissance · rising

Open-source TTS reaching new maturity with VoxCPM (1,815 stars/day), MOSS-TTS, and Supertone's supertonic-3 converging to dramatically lower the barrier to voice-enabled applications.

AI Output Quality Alignment · rising

Growing demand for AI agents that produce authentic, non-generic output, driven by taste-skill (2,062 stars/day) and stop-slop (617 stars/day) on GitHub.

Trending Papers (14)

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

High Relevance

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen — Tsinghua University, Institute of Automation, CAS

Proposes a lightweight and scalable agent safety alignment framework that updates the agent safety taxonomy to accommodate emergent risks from frontier AI models. Addresses the inadequacy of current alignment frameworks for open-world agents like OpenClaw that exhibit powerful cross-environment execution capabilities.

Key Findings

• Updates the agent safety taxonomy to cover emergent risks from frontier models that lower attack barriers
• Provides a lightweight alignment framework that scales across diverse agent architectures without prohibitive overhead
• Demonstrates effectiveness against broad safety risk sources introduced by modern open-world agents

agent-safety · alignment · security · open-world-agents · safety-taxonomy

81 upvotes

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

High Relevance

Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie — Alibaba Group, Tsinghua University

Presents Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception to action. Handles heterogeneous embodied decision-making across manipulation, navigation, and other tasks within a single vision-language-action model.

Key Findings

• Unifies heterogeneous embodied decision-making problems within a single VLA model across tasks, environments, and robot embodiments
• Extends Qwen's vision-language stack from perception to actionable embodied intelligence
• Demonstrates generalization across manipulation, navigation, and diverse robot platforms

embodied-ai · vision-language-action · robotics · foundation-model · qwen

74 upvotes

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

High Relevance

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang — KAIST, Google DeepMind

Introduces OmniRetrieval, a framework for unified retrieval across structurally diverse knowledge sources including unstructured text, relational tables, knowledge graphs, and property graphs, without collapsing structural affordances into a shared space.

Key Findings

• Unifies retrieval across text, tables, knowledge graphs, and property graphs without erasing structural affordances
• Avoids the naive approach of collapsing diverse sources into a shared space, which loses structural query capabilities
• Addresses the fragmented retrieval landscape where existing retrievers operate over one source at a time

information-retrieval · knowledge-graphs · unified-retrieval · heterogeneous-data · rag

54 upvotes

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

High Relevance

Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang — Peking University, ByteDance

Addresses the deployment overhead of managing numerous effect LoRAs by distilling 50 visual effects into a single LoRA adapter using multi-teacher on-policy distillation, eliminating parameter interference when cascading with acceleration modules.

Key Findings

• Distills 50 distinct visual effects into a single LoRA adapter via multi-teacher on-policy distillation
• Eliminates severe parameter interference and concept bleeding when cascading effect LoRAs with acceleration modules
• Dramatically reduces deployment overhead from storing and dynamically loading numerous individual LoRA adapters

lora · diffusion-models · image-editing · distillation · efficient-deployment

49 upvotes

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

High Relevance

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen — Shanghai Jiao Tong University, Ant Group

Presents minWM, a full-stack open-source framework for building real-time interactive video world models, covering the entire pipeline from data construction and controllable fine-tuning through autoregressive training, few-step distillation, and streaming inference.

Key Findings

• Provides a complete open-source pipeline spanning data construction, controllable fine-tuning, autoregressive training, distillation, and streaming inference
• Addresses the gap between high-quality video generation and real-time interactive controllability
• Enables controllable, causal, and low-latency rollout required for interactive world model deployment

world-models · video-generation · interactive · real-time · open-source

40 upvotes

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

High Relevance

You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu — National Yang Ming Chiao Tung University, MediaTek Research

Presents YoCausal, a two-level benchmark inspired by the Violation of Expectation paradigm from cognitive science, using temporally reversed real-world videos as zero-cost counterfactual samples to evaluate whether video diffusion models truly understand causality.

Key Findings

• Applies the Violation of Expectation (VoE) paradigm from cognitive science to evaluate causal understanding in video models
• Uses temporally reversed real-world videos as natural counterfactual samples at zero data collection cost
• Reveals whether video diffusion models understand causality or merely overfit to statistical temporal patterns

video-generation · world-models · causality · benchmark · cognitive-science

32 upvotes

GenClaw: Code-Driven Agentic Image Generation

High Relevance

Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang — Huazhong University of Science and Technology, ByteDance

Proposes GenClaw, a code-driven agentic image generation system where LLMs serve as a genuine brush for precise visual construction, breaking free from the repetitive prompt-rewriting cycle of existing agents by enabling direct canvas manipulation through code.

Key Findings

• Enables LLMs to directly manipulate the image canvas through code rather than iterative prompt rewriting
• Breaks existing agents free from the black-box image model dependency cycle
• Demonstrates that code-driven generation provides precise control that prompt-based approaches cannot achieve

image-generation · agentic-ai · code-generation · visual-construction · llm-tools

25 upvotes

EarlyTom: Early Token Compression Completes Fast Video Understanding

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen — University of Electronic Science and Technology of China, Eastern Institute of Technology

Proposes EarlyTom, which performs token compression at early stages of the vision encoder rather than at the late prefilling stage, optimizing efficiency throughout the entire Video-LLM pipeline rather than just the language model portion.

Key Findings

• Moves token compression upstream to the vision encoder stage, reducing computation throughout the entire pipeline
• Achieves extremely low token retention ratios while maintaining accuracy comparable to full-token baselines
• Addresses the previously unoptimized efficiency bottleneck in the vision encoder itself

video-understanding · token-compression · efficiency · video-llm · vision-encoder

23 upvotes

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

High Relevance

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang — Alibaba Group, Zhejiang University

Establishes a quantitative parametric memory law for LoRA fine-tuning by using LoRA as a controlled memory capacity probe within the latent space, systematically quantifying exact capacity limits and underlying dynamics of parametric memory in LLMs.

Key Findings

• Derives a quantitative law governing how LoRA stores and retrieves parametric memory
• Uses LoRA as a controlled probe to systematically measure exact parametric memory capacity limits
• Bridges the gap between qualitative downstream evaluations and quantitative understanding of LoRA's memory dynamics

lora · fine-tuning · parametric-memory · capacity-analysis · llm-theory

20 upvotes

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang — Renmin University of China, Kuaishou Technology

Proposes UniSteer, a text-guided activation flow matching model that learns conditional dynamics in activation space for versatile LLM steering, overcoming the limitations of fixed steering directions and task-specific intervention modules.

Key Findings

• Learns conditional dynamics in activation space via flow matching, enabling text-guided behavioral control
• Overcomes limitations of fixed steering directions and task-specific intervention modules
• Enables fine-grained concept-level and compositional constraint-based LLM control during inference

llm-steering · activation-engineering · flow-matching · inference-control · representation-intervention

19 upvotes
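
To make the "fixed steering directions" that UniSteer aims to move beyond more concrete, the sketch below shows the common difference-of-means baseline: one vector, computed once, added to every hidden state. This is not UniSteer's method. The activations are synthetic, and all names, sizes, and the strength value are illustrative assumptions.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Unit-norm difference of mean activations between two prompt sets."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, strength):
    """Shift every hidden state by the same fixed direction."""
    return hidden + strength * v

rng = np.random.default_rng(0)
d = 16                                # illustrative hidden size
concept = np.zeros(d)
concept[0] = 1.0                      # synthetic "concept" axis

# Synthetic activations for prompts that do / do not express the concept.
pos = rng.normal(size=(200, d)) + 3.0 * concept
neg = rng.normal(size=(200, d)) - 3.0 * concept

v = steering_vector(pos, neg)
hidden = rng.normal(size=(4, d))
steered = steer(hidden, v, strength=2.0)

# Each state's projection onto v moves by exactly the chosen strength.
print(steered @ v - hidden @ v)
```

The limitation is visible in the code: the same vector is applied regardless of the input or the instruction, which is the rigidity a text-conditioned model of activation dynamics is meant to address.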

Native Audio-Visual Alignment for Generation

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu — Baidu, ERNIE Research

Proposes a Native Audio-Visual Alignment framework for joint audio-video generation using an Align-then-Fuse MMDiT architecture that first establishes audio-video correspondence in a dedicated interaction space, then conditions joint denoising with external context.

Key Findings

• Align-then-Fuse MMDiT avoids weaknesses of both dual-tower and unified tri-modal designs for audio-video generation
• Introduces Timbre-in-Context Conditioning for controllable speech timbre generation
• Achieves competitive quality at 6.3B parameters, significantly smaller than many unified approaches

audio-visual-generation · multimodal · diffusion-transformer · speech-synthesis · video-generation

18 upvotes

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

High Relevance

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter — Yonsei University, Georgia Institute of Technology

Introduces LaRA, a layer-wise representation analysis framework for detecting data contamination in RL post-trained LLMs using three complementary metrics: perturbation sensitivity, directional collapse, and local representation rigidity.

Key Findings

• Output-level contamination detection methods become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards
• Contamination produces progressive geometric deviations across layers including amplified perturbation sensitivity and directional collapse
• Representation-level detection outperforms output-level baselines for contamination detection in RL-trained reasoning models

reinforcement-learning · data-contamination · representation-analysis · llm-evaluation · post-training

17 upvotes

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang — Shanghai AI Laboratory, Tsinghua University

Identifies and addresses 'carrier sensitivity' in VLMs where replacing text with rendered-image equivalents causes dramatic performance degradation. Proposes local modality substitution to achieve deeper vision-language fusion beyond surface-level alignment.

Key Findings

• Identifies carrier sensitivity: replacing textual questions with rendered-image equivalents causes dramatic VLM performance drops
• Attributes the issue to inherent bias in current training paradigms that treat modalities asymmetrically
• Local modality substitution achieves deeper fusion by forcing the model to be invariant to the carrier modality

vision-language-models · multimodal-fusion · modality-invariance · vlm-robustness · training-paradigms

17 upvotes

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su — Ohio State University, Seoul National University

Introduces a representation-level analysis framework using minimal contrastive pairs to reveal that VLMs consistently entangle vertical position with depth — objects that are far away are represented as 'up' — questioning whether benchmark performance reflects genuine 3D understanding.

Key Findings

• VLMs consistently entangle vertical position with distance: far objects are represented as spatially 'up'
• Strong benchmark performance may reflect statistical shortcuts rather than structured 3D understanding
• Minimal contrastive pair analysis reveals spatial axes are not properly disentangled in VLM embeddings

spatial-reasoning · vision-language-models · representation-analysis · 3d-understanding · benchmark-analysis

20 upvotes
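
One way to picture a contrastive-pair probe of this kind is sketched below. It is a rough illustration on synthetic embeddings, not the paper's framework: the entanglement is built into the toy data on purpose, and the axis names, dimensions, and the 0.8 leakage factor are all assumptions made for the example.

```python
import numpy as np

def pair_direction(emb_a, emb_b):
    """Unit mean difference over pairs that differ in a single attribute."""
    d = (emb_b - emb_a).mean(axis=0)
    return d / np.linalg.norm(d)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
dim, n = 32, 100                      # illustrative embedding size and pair count
up_axis = np.zeros(dim)
up_axis[0] = 1.0
depth_axis = np.zeros(dim)
depth_axis[1] = 1.0

base = rng.normal(size=(n, dim))
# Pairs that differ only in vertical position.
higher = base + up_axis + 0.05 * rng.normal(size=(n, dim))
# Pairs that differ only in depth, with synthetic leakage onto the "up" axis.
farther = base + depth_axis + 0.8 * up_axis + 0.05 * rng.normal(size=(n, dim))

up_dir = pair_direction(base, higher)
far_dir = pair_direction(base, farther)

# Near 0 would mean disentangled axes; clearly positive means "far" leans "up".
print(cosine(up_dir, far_dir))
```

With cleanly disentangled axes the two directions would be close to orthogonal, so a clearly positive cosine is the kind of signal such an analysis would flag.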

Trending Models (12)

DeepSeek-V4-Pro

DeepSeek AI · text-generation · unknown

DeepSeek's latest flagship language model with state-of-the-art performance across reasoning and generation tasks, maintaining dominant community adoption.

conversational · reasoning · deepseek

5.8M downloads · 4.4K likes

Sulphur-2-base

SulphurAI · text-to-video · unknown

Open-source text-to-video generation model with strong community adoption, available in both diffusers and GGUF formats for broad deployment flexibility.

text-to-video · diffusers · gguf

1.5M downloads · 1.4K likes

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

HauhauCS · text-generation · 35B (3B active)

Community fine-tuned uncensored Qwen3.6 35B MoE model with 3B active parameters, optimized for unrestricted generation with vision capabilities.

uncensored · qwen3.6 · moe · vision

2.1M downloads · 1.1K likes

Hy-MT2-1.8B

Tencent · translation · 1.8B

Compact 1.8B-parameter neural machine translation model from Tencent's Hunyuan team, rapidly gaining community adoption for efficient multilingual translation.

translation · hunyuan · multilingual

15.8K downloads · 1.1K likes

Lance

ByteDance Research · multimodal-generation · unknown

Multimodal any-to-any generation model supporting image and video generation, continuing rapid community growth.

multimodal · image-generation · video-generation

2.7K downloads · 974 likes

supertonic-3

Supertone · text-to-speech · unknown

Third-generation text-to-speech and speech synthesis model with high-quality voice generation capabilities in ONNX format.

text-to-speech · speech-synthesis · tts · onnx

53.8K downloads · 738 likes

MiniCPM5-1B

OpenBMB · text-generation · 1B

Compact 1B-parameter multimodal model from the MiniCPM series, designed for edge deployment with strong vision-language capabilities relative to its size.

minicpm · compact · edge-deployment

23.6K downloads · 556 likes

Qwen3.6-27B-MTP-GGUF

Unsloth · text-generation · 27B

Quantized GGUF variant of Qwen3.6-27B with Multi-Token Prediction support, optimized by Unsloth for efficient local inference via llama.cpp.

gguf · quantized · unsloth · local-inference

841.1K downloads · 549 likes

Marlin-2B

NemoStation · video-captioning · 2B

Compact 2B-parameter multimodal model specialized in video captioning and understanding tasks.

video · multimodal · video-captioning

14.7K downloads · 446 likes

Hy-MT2-30B-A3B

Tencent · translation · 30B (3B active)

Large 30B MoE translation model from Tencent with 3B active parameters, offering high-quality translation with efficient inference via mixture-of-experts architecture.

translation · moe · hunyuan · multilingual

3.1K downloads · 425 likes

HRM-Text-1B

Sapient Inc · text-generation · 1B

1B-parameter text generation model from Sapient. Its 131.8K downloads are high relative to its like count, which may point to real-world use beyond casual experimentation.

hrm · text-generation · compact

131.8K downloads · 407 likes

LongCat-Video-Avatar-1.5

Meituan · audio-text-to-video · unknown

Audio-text-to-video model from Meituan for generating video avatars from audio and text inputs, enabling realistic talking head generation.

video-avatar · audio-to-video · talking-head

0 downloads · 395 likes

Trending GitHub Repos (15)

harry0703/MoneyPrinterTurbo

High Relevance

AI-powered short video generation tool that creates high-definition videos with one click using LLMs. Surging with 3,567 stars today, reflecting strong demand for automated video content creation.

video-generation · ai-automation · content-creation

Python · 69.8K · +3.6K today · 10.1K

Leonxlnx/taste-skill

High Relevance

AI skill file that gives coding agents aesthetic judgment, preventing generation of boring, generic output. Leading the AI output quality alignment movement with 2,062 stars today.

agent-skills · output-quality · ai-alignment

Shell · 28.2K · +2.1K today · 2.1K

microsoft/markitdown

High Relevance

Microsoft's Python tool for converting files and office documents to Markdown, essential infrastructure for LLM document processing pipelines.

document-processing · markdown · llm-tools

Python · 130.0K · +1.9K today · 8.9K

OpenBMB/VoxCPM

High Relevance

Tokenizer-free TTS system for multilingual speech generation, creative voice design, and true-to-life voice cloning. Exploding with 1,815 stars today.

text-to-speech · voice-cloning · multilingual · speech-synthesis

Python · 22.2K · +1.8K today · 2.6K

affaan-m/ECC

High Relevance

Comprehensive agent harness performance optimization system with skills, instincts, memory, security, and research-first development for Claude Code, Codex, Cursor, and beyond.

agent-harness · performance-optimization · developer-tools

JavaScript · 198.6K · +1.4K today · 30.5K

anthropics/skills

High Relevance

Official public repository for Agent Skills from Anthropic, providing the standardized skill interface for the Claude agent ecosystem.

agent-skills · claude · anthropic · agent-ecosystem

Python · 143.6K · +945 today · 16.9K

run-llama/liteparse

High Relevance

Fast, open-source document parser built in Rust from the LlamaIndex team, optimized for converting documents into structured data for LLM consumption.

document-parsing · rust · llm-tools · llamaindex

Rust · 7.4K · +701 today · 446

hardikpandya/stop-slop

Skill file for removing AI tells from prose, complementing taste-skill in the growing AI output quality alignment movement.

ai-quality · writing · output-alignment

7.0K · +617 today · 493

twentyhq/twenty

Open alternative to Salesforce designed for AI, gaining 578 stars today as AI-native business tools continue to gain traction.

crm · ai-native · salesforce-alternative

TypeScript · 48.4K · +578 today · 6.9K

anthropics/claude-code

High Relevance

Anthropic's agentic coding tool that lives in the terminal, understands codebases, and handles git workflows through natural language commands.

coding-agent · cli · anthropic · developer-tools

Python · 127.9K · +395 today · 20.9K

galilai-group/stable-worldmodel

High Relevance

Platform for reproducible world model research and evaluation, gaining 362 stars today as interest in world models accelerates.

world-models · research · evaluation · reproducibility

Python · 1.3K · +362 today · 149

OpenMOSS/MOSS-TTS

High Relevance

Open-source speech and sound generation model family covering stable long-form speech, multi-speaker dialogue, voice design, environmental sound effects, and real-time streaming TTS.

text-to-speech · speech-synthesis · multi-speaker · streaming

Python · 2.5K · +355 today · 229

EveryInc/compound-engineering-plugin

Official Compound Engineering plugin for Claude Code, Codex, Cursor, and more, representing the growing plugin ecosystem for AI coding agents.

agent-plugin · developer-tools · compound-engineering

TypeScript · 18.2K · +353 today · 1.4K

NVlabs/Eagle

High Relevance

NVIDIA's frontier vision-language model using data-centric strategies, gaining 250 stars today as a strong open VLM contender.

vision-language-model · nvidia · data-centric

Python · 1.5K · +250 today · 103