---
language:
  - ja
  - en
base_model:
  - sbintuitions/sarashina2.2-3b-instruct-v0.1
license: mit
tags:
  - multimodal
  - vision-language
pipeline_tag: image-to-text
library_name: transformers
---

# Sarashina2.2-Vision-3B

Sarashina2.2-Vision-3B is a Japanese large vision-language model trained by SB Intuitions.

It combines the Sarashina2.2-3B-Instruct language model with the image encoder of SigLIP.

## Model Performance

### Japanese Performance

| Model | Params (B) | BusinessSlide VQA *1 | Heron-Bench *1 | JDocQA *1 | JMMMU |
|---|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 3.932 | 3.214 | 3.327 | 0.486 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 3.516 | 2.000 | 3.019 | 0.450 |
| Qwen3-VL-4B-Instruct | 4.4 | 4.105 | 2.330 | 3.596 | 0.493 |
| InternVL3_5-4B | 4.7 | 3.311 | 1.893 | 2.626 | 0.437 |
| Sarashina2-Vision-14B | 14.4 | 3.110 | 2.184 | - *2 | 0.432 |
| Stockmark-2-VL-100B-beta | 96.5 | 3.973 | 2.563 | 3.168 | - *2 |

*1. gpt-oss-120b was used as the LLM-as-a-Judge.

*2. These scores could not be measured because some inputs exceed the model's `max_position_embeddings`.

### English Performance

| Model | Params (B) | DocVQA | InfoVQA | RealWorldQA |
|---|---|---|---|---|
| Sarashina2.2-Vision-3B | 3.8 | 0.831 | 0.567 | 0.625 |
| Qwen2.5-VL-3B-Instruct | 3.8 | 0.924 | 0.750 | 0.586 |
| Qwen3-VL-4B-Instruct | 4.4 | 0.948 | 0.798 | 0.712 |
| InternVL3_5-4B | 4.7 | 0.823 | 0.541 | 0.553 |
| Sarashina2-Vision-14B | 14.4 | 0.729 | 0.490 | 0.519 |

## How to use

### 1. Install dependencies

```sh
pip install transformers==4.57.1 torch torchvision pillow protobuf sentencepiece accelerate
```

### 2. Inference

The following script loads the model and runs inference on an image.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, set_seed

# Define the model path
model_path = "sbintuitions/sarashina2.2-vision-3b"

# Load the model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
set_seed(42)

image_url = "https://huggingface.co/sbintuitions/sarashina2.2-vision-3b/resolve/main/sample.jpg"
message = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_url,
            },
            {
                "type": "text",
                "text": "これはどこで撮った写真ですか?",  # "Where was this photo taken?"
            },
        ],
    }
]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <|user|><|prefix|><|file|><|suffix|>これはどこで撮った写真ですか?</s><|assistant|>"""

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output. Sampling must be enabled explicitly,
# otherwise temperature/top_p are ignored and decoding is greedy.
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.2,
)
# Strip the prompt tokens from each generated sequence
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
# Example output (Japanese; the model identifies the location as
# Dogo Onsen Honkan in Matsuyama, Ehime Prefecture):
"""
この写真は、**道後温泉本館(どうごおんせんほんかん)** の入り口を夜景で撮影した写真です。

---
 場所の詳細:
- **名称**:道後温泉本館(Dogo Onsen Honkan)
- **所在地**:〒790-0842 愛媛県松山市道後湯之町1丁目3番5号
- **アクセス**:JR松山駅から市内電車「道後温泉駅」下車すぐ
- **特徴**:日本最古の温泉の一つとして知られる「道後温泉」の中心的な施設。国の重要文化財にも指定されています。

---
 写真の特徴から判断した理由:
- 建物の屋根や装飾が伝統的な和風建築で、「道後温泉」の看板が目立つ。
- 入口の垂れ幕には「道後」「道後」と書かれており、白い鳳凰の模様が描かれている → 道後温泉の象徴的デザイン。
- 夜の照明と石灯籠、提灯風の灯りが日本の温泉地らしい雰囲気を醸し出している。
- 看板に「道後温泉」の文字が明確に表示されている。

---
 補足情報:
道後温泉本館は、夏目漱石の小説『坊っちゃん』の舞台としても有名で、多くの観光客が訪れる人気スポットです。また、2020年にリニューアルされ、現代的な設備も導入されていますが、外観は伝統を残しています。

---
よって、この写真は **愛媛県松山市にある「道後温泉本館」の夜景** です。
"""
```

## Training

Sarashina2.2-Vision-3B was created through the following five-stage training process:

### Pre-training

1. Projector warmup: bridges the gap between the text and image embedding spaces within the LLM
2. Vision encoder pre-training: enhances image comprehension, especially of Japan-specific images and text
3. Full-model pre-training: strengthens the model's unified understanding of images and language using interleaved data

### Post-training

4. Supervised Fine-Tuning (SFT): improves the model's ability to follow instructions and respond appropriately to user prompts
5. Mixed Preference Optimization (MPO): aligns the model's outputs with user preferences so that it generates more desirable responses

## Limitations

This model has undergone only limited safety training. It may therefore generate meaningless sequences, inaccurate content, or biased/objectionable outputs. Before deploying it, developers should tune the model based on human preferences and safety considerations.

## License

MIT License