Alex Yang

yyf86

upvoted 2 papers 3 months ago

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19 • 56

AToken: A Unified Tokenizer for Vision

Paper • 2509.14476 • Published Sep 17 • 36

upvoted 2 papers 9 months ago

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Paper • 2503.13111 • Published Mar 17 • 7

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Paper • 2503.10618 • Published Mar 13 • 18

upvoted a paper 11 months ago

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Paper • 2501.07301 • Published Jan 13 • 99

upvoted a paper 12 months ago

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 74

upvoted 6 papers about 1 year ago

Multimodal Autoregressive Pre-training of Large Vision Encoders

Paper • 2411.14402 • Published Nov 21, 2024 • 47

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Paper • 2410.08159 • Published Oct 10, 2024 • 26

Progressive Autoregressive Video Diffusion Models

Paper • 2410.08151 • Published Oct 10, 2024 • 16

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Paper • 2410.02073 • Published Oct 2, 2024 • 41

upvoted 2 papers over 1 year ago

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2, 2024 • 24

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Paper • 2404.05719 • Published Apr 8, 2024 • 83