Overview

HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the SEED Think 14B line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.


Basic Information

  • Architecture: Transformer-based vision-language model (VLM), dense
  • Parameters: 32B
  • Input Format: Text/Image/Video
  • Output Format: Text
  • Context Length: 128K
  • Knowledge Cutoff: May 2025

Benchmarks

(Benchmark results figure from the technical report.)

  • General Knowledge (Korean Text): KoBalt, CLIcK, HAERAE Bench 1.0
  • Vision Understanding: ChartVQA, TextVQA, K-MMBench, K-DTCBench
  • Agentic Tasks: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom

Examples

  • Solving a 2026 Korean CSAT math problem
  • Understanding text layout

Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

  • Inputs: Text, Image, Video
  • Outputs: Text

Requirements

  • 4x NVIDIA A100 80GB
  • Docker & Docker Compose
  • NVIDIA Driver 525+, CUDA 12.1+

Installation

# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~60GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
    --local-dir ./models/HyperCLOVAX-SEED-Think-32B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Think-32B \
    --output ./track_a \
    --track a

# Configure environment
cp .env.example .env
# Edit .env:
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B

# Build and run
docker compose --profile track-a build
docker compose --profile track-a up -d

# Wait for model loading (~5 minutes)
docker compose logs -f vlm
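
After the containers start, the server needs a few minutes to load the model before it accepts requests. Below is a minimal readiness-check sketch in Python; it only uses the POST /a/v1/chat/completions endpoint shown in this card, and the retry count and interval are arbitrary.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/a/v1", api_key="not-needed")

# Poll the chat completions endpoint with a tiny text-only request until
# the model has finished loading and starts answering.
for attempt in range(60):
    try:
        client.chat.completions.create(
            model="track_a_model",
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=8,
            extra_body={"chat_template_kwargs": {"thinking": False}},
        )
        print("Server is ready.")
        break
    except Exception as exc:
        print(f"Not ready yet ({type(exc).__name__}), retrying in 10 seconds...")
        time.sleep(10)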

Basic Usage

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/a/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(response.choices[0].message.content)

Reasoning Mode

Enable chain-of-thought reasoning for complex tasks:

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
    ],
    max_tokens=1024,
    extra_body={
        "thinking_token_budget": 500,
        "chat_template_kwargs": {"thinking": True}
    }
)

# Response includes <think>...</think> with reasoning process
print(response.choices[0].message.content)
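
When thinking is enabled, the reasoning trace and the final answer arrive in a single string. A minimal post-processing sketch, continuing from the response above and assuming a single <think>...</think> block as shown:

import re

content = response.choices[0].message.content

# Separate the reasoning trace from the final answer.
match = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print("Reasoning:", reasoning)
print("Answer:", answer)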

More Examples

Video Understanding

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

Base64 Image Input

import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

Using curl

curl -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 512,
    "extra_body": {"chat_template_kwargs": {"thinking": false}}
  }'

Model Capabilities

Input          Output
Text           Text
Image          Text
Video          Text
Image + Text   Text
Video + Text   Text

Features:

  • Reasoning mode with <think>...</think> output
  • Multi-turn conversation support (see the sketch after this list)
  • Image/Video understanding
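
Multi-turn conversations use the standard OpenAI chat format: previous turns are passed back in messages. A minimal sketch (the image URL and follow-up question are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/a/v1", api_key="not-needed")

# First turn: ask about an image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
first = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

# Second turn: append the assistant reply and ask a follow-up question.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "What stands out the most, and why?"})
second = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(second.choices[0].message.content)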

Architecture

                         User Request
                       (Image/Video/Text)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            OmniServe                                    │
│                  POST /a/v1/chat/completions                            │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     [1] INPUT ENCODING                           │   │
│  │                                                                  │   │
│  │                   ┌─────────────────┐                            │   │
│  │                   │  Vision Encoder │                            │   │
│  │                   └────────┬────────┘                            │   │
│  │                            │ embeddings                          │   │
│  └────────────────────────────┼─────────────────────────────────────┘   │
│                               ▼                                         │
│                       ┌──────────────┐                                  │
│                       │  LLM (32B)   │◀──── text                        │
│                       └──────┬───────┘                                  │
│                              │                                          │
│                              ▼                                          │
│                        Text Response                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                           Response
                            (Text)

Hardware Requirements

Component        GPU   VRAM
Vision Encoder   1x    ~8GB
LLM (32B)        2x    ~60GB
Total            3x    ~68GB

Key Parameters

Parameter                       Description            Default
chat_template_kwargs.thinking   Enable reasoning       false
thinking_token_budget           Max reasoning tokens   500
max_tokens                      Max output tokens      -
temperature                     Sampling temperature   0.7
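
These parameters can be combined in one request. A short sketch, reusing the client from Basic Usage, that enables reasoning with a bounded trace and lower-temperature sampling (the specific values are arbitrary examples):

response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Solve step by step: 12 * (7 + 5) / 4"}],
    max_tokens=1024,      # cap on generated output tokens
    temperature=0.2,      # lower temperature for more deterministic arithmetic
    extra_body={
        "thinking_token_budget": 300,                # cap on reasoning tokens
        "chat_template_kwargs": {"thinking": True}   # enable reasoning mode
    }
)

print(response.choices[0].message.content)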

For more details, see the OmniServe documentation.


Citation

TBU (Technical Report)


Questions

For any other questions, please feel free to contact us at [email protected].


License

The model is licensed under the HyperCLOVA X SEED 32B Think Model License Agreement.
