<div align="center">
<h1>
Yuan 3.0 Multimodal Foundation Model
</h1>
</div>

<hr>
<div align="center" style="line-height: 1;">
<a href="https://github.com/Yuan-lab-LLM/Yuan3.0"><img alt="GitHub"
src="https://img.shields.io/badge/GitHub-Yuan%203.0%20Repo-181717?logo=github&logoColor=white"/></a>
<a href="https://www.modelscope.cn/profile/Yuanlab"><img alt="ModelScope"
src="https://img.shields.io/badge/💾%20ModelScope-Yuan3.0-6b4fbb?color=6b4fbb&logoColor=white"/></a>
<a href="https://x.com/Yuanlabai"><img alt="Twitter Follow"
src="https://img.shields.io/badge/Twitter-Yuanlabai-white?logo=x&logoColor=white"/></a>
<a href="https://github.com/Yuan-lab-LLM/Yuan3.0/blob/main/docs/YUAN3.0_FLASH-paper.pdf"><img alt="arXiv"
src="https://img.shields.io/badge/arXiv-Yuan3.0%20Paper-b31b1b?logo=arxiv&logoColor=white"/></a>
</div>

-----

## Latest Updates 🎉🎉

* **[2025-12-30]** **Released Yuan3.0 Flash, a high-performance 40B multimodal large language model for enterprise-grade application scenarios**

## 1. Introduction

Yuan 3.0 Flash, developed by the **YuanLab.ai team**, is a **40B-parameter multimodal foundation model** built on a Mixture-of-Experts (MoE) architecture that activates only about **3.7B parameters** per inference step. Through an innovative reinforcement learning training method (RAPO), it significantly reduces inference token consumption while improving reasoning accuracy, exploring a "less computation, higher intelligence" path for large language models. We have also released the <a href="https://github.com/Yuan-lab-LLM/Yuan3.0/blob/main/docs/YUAN3.0_FLASH-paper.pdf" target="_blank">**technical report**</a> for the Yuan3.0 model, where you can find more detailed technical information and evaluation results.

<div align="center">
<img src="https://huggingface.co/YuanLabAI/Yuan3.0-Flash-4bit/resolve/main/docs/Yuan3.0-architecture.png" width="80%" />
Fig.1: Yuan3.0 Multimodal Large Language Model Architecture
</div>

### Core Features

- 🚀 **Efficient Inference**: Reduces inference token consumption by up to 75%, significantly lowering costs
- 🎯 **Enterprise-Grade Optimization**: Deeply optimized for enterprise scenarios such as RAG, document understanding, and table analysis
- 🎨 **Multimodal Support**: Accepts text, image, table, document, and other multimodal inputs
- 📚 **Long Context**: Supports a 128K context length, achieving 100% accuracy on "Needle in a Haystack" tests
- ⚡ **Ready-to-Use Intelligence**: The default inference mode meets the needs of most enterprise scenarios

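A minimal text-generation sketch (the `transformers` loading path shown here is an assumption; check the model card for the officially supported API and any `trust_remote_code` requirement):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID taken from the download table below.
model_id = "YuanLabAI/Yuan3.0-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Summarize the key risks in this contract: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
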
## 2. Performance

Yuan 3.0 Flash outperforms GPT-5.1 on enterprise-grade RAG, multimodal retrieval, table understanding, summary generation, and other tasks. With 40B parameters, it matches the reasoning accuracy of 235B/671B-scale models while cutting token consumption by 50%-75%, giving enterprises a high-performance, low-cost large language model solution.

<div align="center">
<img src="https://huggingface.co/YuanLabAI/Yuan3.0-Flash-4bit/resolve/main/docs/Yuan3.0-benchmarks.png" width="80%" />
Fig.2: Yuan3.0 Flash benchmark results
</div>

## 3. Core Technology

### RAPO Reinforcement Learning Algorithm

The innovative **Reflection-aware Adaptive Policy Optimization (RAPO)** algorithm uses a Reflection Inhibition Reward Mechanism (RIRM) that:

- ✅ Identifies the point at which the correct answer is first obtained
- 🎯 Suppresses subsequent redundant reasoning behavior
- 📉 Improves accuracy while cutting inference token count by approximately 75% (see the reward sketch after the table below)

| Training Method | AIME 2024 Accuracy | AIME 2024 Avg Output Length | MATH-500 Accuracy | MATH-500 Avg Output Length |
|---------|------------------|--------------|-----------------|--------------|
| Yuan3.0 Flash (40B) SFT | 31.45% | 13,656 tokens | 83.20% | 3,362 tokens |
| RL + DAPO length penalty | 46.35% | 13,781 tokens | 89.06% | 3,974 tokens |
| **RL + RIRM** | **47.92%** | **7,505 tokens** | **89.47%** | **1,777 tokens** |

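The core idea admits a compact illustration. The sketch below is one reading of the three bullets above, not the published RAPO implementation; `extract_answers` and both weights are hypothetical:

```python
import re

def extract_answers(response: str) -> list[tuple[int, str]]:
    """Hypothetical helper: (char offset, answer) for each \\boxed{...}
    answer attempt in the model's chain of thought."""
    return [(m.start(), m.group(1)) for m in re.finditer(r"\\boxed\{(.+?)\}", response)]

def rirm_style_reward(response: str, gold: str,
                      base_reward: float = 1.0,
                      redundancy_weight: float = 0.5) -> float:
    """Reward a correct answer, but discount the fraction of the response
    generated *after* the correct answer was first reached."""
    attempts = extract_answers(response)
    first_correct = next((pos for pos, ans in attempts if ans == gold), None)
    if first_correct is None:
        return 0.0  # no correct answer: nothing to shape
    # Fraction of the response spent re-verifying an already-correct answer.
    redundant_frac = (len(response) - first_correct) / max(len(response), 1)
    return base_reward - redundancy_weight * redundant_frac
```
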
## 4. Model Download

**We provide download links for multiple model formats:**

| Model | Parameters | Precision | Sequence Length | Model Format | Download Link |
| :----------: | :------: | :------: | :------: | :-------: | :---------------------------: |
| Yuan3.0 Flash | 40B | 16bit | 128K | HuggingFace | [ModelScope](https://modelscope.cn/models/Yuanlab/Yuan3.0-Flash) \| [HuggingFace](https://huggingface.co/YuanLabAI/Yuan3.0-Flash) \| [WiseModel](https://www.wisemodel.cn/models/YuanLabAI/Yuan3.0-Flash) |
| Yuan3.0 Flash 4bit | 40B | 4bit | 128K | HuggingFace | [ModelScope](https://modelscope.cn/models/Yuanlab/Yuan3.0-Flash-int4) \| [HuggingFace](https://huggingface.co/YuanLabAI/Yuan3.0-Flash-4bit) \| [WiseModel](https://www.wisemodel.cn/models/YuanLab/Yuan3.0-Flash-4bit) |

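One way to fetch the weights programmatically from the Hugging Face mirror, using `huggingface_hub` (the ModelScope and WiseModel mirrors above offer equivalent tooling):

```python
from huggingface_hub import snapshot_download

# Downloads the full repo to the local HF cache and returns its path.
local_dir = snapshot_download("YuanLabAI/Yuan3.0-Flash-4bit")  # or "YuanLabAI/Yuan3.0-Flash"
print(local_dir)
```
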
## 5. Evaluation Results

**5.1 Text-based RAG Evaluation: ChatRAG** 🏆

Yuan 3.0 Flash leads DeepSeek-V3, DeepSeek-R1, and other large language models in average accuracy across the 10 evaluation tasks of the industry-standard RAG benchmark ChatRAG.

**Model Average Accuracy Comparison**

| Models | Avg All | D2D | QuAC | QReCC | CoQA | DoQA | CFQA | SQA | TCQA | HDial | INSCIT |
|--------|---------|-----|------|-------|------|------|------|-----|------|-------|--------|
| **DeepSeek-V3** | 50.47 | 31.59 | 28.86 | 49.31 | 76.98 | 26.11 | 83.49 | 82.13 | 46.69 | 47.43 | 32.08 |
| **DeepSeek-V3.2** | 49.67 | 34.30 | 28.09 | 49.97 | 77.29 | 29.46 | 72.85 | 79.48 | 44.64 | 47.99 | 32.64 |
| **OpenAI GPT-4o** | 50.54 | 32.76 | 26.56 | 49.30 | 76.11 | 28.78 | 81.85 | 81.14 | 49.75 | 41.29 | 26.69 |
| **OpenAI GPT-o3** | 44.06 | 23.05 | 20.82 | 40.42 | 69.42 | 18.56 | 67.75 | 86.71 | 45.85 | 41.29 | 26.69 |
| **DeepSeek-R1** | 39.42 | 21.46 | 22.23 | 42.41 | 62.53 | 24.68 | 81.48 | 82.06 | 30.74 | 37.97 | 28.68 |
| **OpenAI GPT-5.1** | 46.10 | 28.24 | 23.16 | 45.43 | 68.84 | 20.88 | 73.05 | 81.32 | 44.70 | 45.39 | 29.95 |
| **Yuan3.0 Flash** | **64.47** | 49.82 | 53.79 | 57.08 | 90.93 | 59.99 | 74.40 | 87.52 | 66.31 | 68.45 | 36.40 |

*<small>
• **Long Context Tests** (D2D, QuAC, QReCC)
• **Wikipedia Retrieval Tests** (TCQA, INSCIT)
• **Short Text & Structured Context Tests** (CoQA, DoQA, CFQA, SQA, HDial)
</small>*

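These tasks share one pattern: retrieved passages go into the prompt and the model must answer from them. A minimal sketch of that prompt construction (the wording and `[Doc n]` markers are illustrative, not the ChatRAG harness):

```python
def build_rag_prompt(question: str, passages: list[str]) -> list[dict]:
    """Pack retrieved passages into a chat-style message list."""
    context = "\n\n".join(f"[Doc {i + 1}] {p}" for i, p in enumerate(passages))
    return [
        {"role": "system", "content": "Answer strictly from the provided documents."},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_rag_prompt(
    "What was Q3 revenue?",
    ["Q3 revenue rose 12% year over year to $4.1B.", "Q2 revenue was $3.8B."],
)
```
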
---

**5.2 Multimodal RAG Evaluation: Docmatix** 🏆

Yuan3.0 Flash leads Claude3.5-Sonnet, OpenAI GPT-4o, o3, GPT-5.1, and other models on the multimodal RAG benchmark Docmatix, achieving the highest average accuracy among the models evaluated.

**Model Average Accuracy Comparison**

| Models | Avg. |
|--------|:---------:|
| **Qwen2.5-VL-72B-Instruct** | 59.75 |
| **InternVL3-78B** | 42.99 |
| **Claude3.5-Sonnet** | 42.55 |
| **OpenAI GPT-4o** | 56.79 |
| **OpenAI GPT-o3** | 45.57 |
| **OpenAI GPT-4V** | 60.10 |
| **OpenAI GPT-5.1** | 48.52 |
| **Yuan3.0 Flash** | **65.07** |

*<small>**Docmatix** - Evaluates the model's ability to retrieve information, correlate it, and accurately answer questions across text, tables, images, and other multimodal content in multi-page complex documents.</small>*

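A hedged sketch of this kind of multi-page document QA, assuming the checkpoint ships a `transformers`-compatible processor and an image-aware chat template (confirm the supported multimodal API against the repo):

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "YuanLabAI/Yuan3.0-Flash"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

# Page images of the document under question.
pages = [Image.open("report_page1.png"), Image.open("report_page2.png")]
messages = [{
    "role": "user",
    "content": [{"type": "image"} for _ in pages]
             + [{"type": "text", "text": "Which table lists the 2024 operating margin?"}],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=pages, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
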
---

**5.3 Multimodal Complex Table Content Analysis Evaluation: MMTab** 🏆

Multimodal table understanding is an important application scenario in enterprise office automation. Yuan3.0 Flash achieves the leading average accuracy across the evaluation tasks of the industry-standard multimodal complex table understanding benchmark MMTab, surpassing OpenAI's GPT-5.1.

**Model Average Accuracy Comparison**

| Models | Avg. | TABMWP | WTQ | WTQ | HiTab | TAT-QA | FeTaQA | TabFact | InfoTabs | HiTab_T2T | Rotowire | WikiBIO | TSD_Row | TSD_Col | TCE | TCL | MCD | RCE |
|--------|:----:|:------:|:---:|:---:|:-----:|:------:|:------:|:-------:|:--------:|:---------:|:--------:|:-------:|:-------:|:-------:|:---:|:---:|:---:|:---:|
| **Zhipu GLM-4.5V** | 52.00 | 88.21 | 77.42 | 51.52 | 62.69 | 5.25 | 89.44 | 79.48 | 5.17 | 4.48 | 2.69 | 47.40 | 89.70 | 52.74 | 50.84 | 43.47 | 50.77 | 82.79 |
| **OpenAI GPT-4V** | 29.90 | 60.50 | 48.00 | 27.50 | 32.50 | 11.04 | 45.50 | 65.60 | 2.98 | 4.23 | 1.94 | 19.00 | 38.00 | 14.36 | 27.91 | 3.50 | 48.52 | 57.14 |
| **OpenAI GPT-5.1** | 55.15 | 64.95 | 60.77 | 77.77 | 61.37 | 8.70 | 52.81 | 64.30 | 44.16 | 17.81 | 11.95 | 96.60 | 62.10 | 86.43 | 44.66 | 72.46 | 53.58 | 57.20 |
| **Yuan3.0 Flash** | **58.29** | 95.09 | 68.23 | 69.80 | 69.17 | 28.42 | 87.32 | 83.50 | 13.30 | 14.74 | 17.26 | 46.60 | 82.80 | 56.77 | 56.98 | 65.20 | 62.07 | 73.67 |

---

**5.4 Text Summarization Generation Evaluation: SummEval** 🏆

Summarization is a core requirement for compressing historical context in intelligent agent applications. Yuan 3.0 achieves the leading average score on the industry-standard summarization benchmark SummEval across three capability axes: lexical overlap, semantic similarity, and factual consistency, surpassing the DeepSeek-V3 large language model.

**Model Average Accuracy Comparison**

| Models | Avg. | Lexical Overlap<br>ROUGE-1 | Lexical Overlap<br>ROUGE-2 | Semantic Similarity<br>BERTScore | Factual Consistency<br>SummaC |
|--------|:---------:|:-----------:|:-----------:|:--------------:|:------------:|
| **DeepSeek-V3** | 59.28 | 25.50 | 9.20 | 86.30 | 68.20 |
| **DeepSeek-V3.2** | 51.36 | 33.30 | 11.92 | 85.61 | 41.76 |
| **Gemini-2.0-Flash** | 45.35 | 24.80 | 8.70 | 85.70 | 29.50 |
| **Claude-3.5-Sonnet** | 45.43 | 24.10 | 8.30 | 85.20 | 30.70 |
| **OpenAI GPT-4o** | 46.53 | 25.00 | 8.90 | 85.90 | 32.50 |
| **OpenAI GPT-5.1** | 49.44 | 27.48 | 10.16 | 84.63 | 40.50 |
| **Yuan3.0 Flash** | **59.31** | 51.32 | 28.32 | 89.99 | 45.34 |

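For reference, the ROUGE and BERTScore columns above are the metric types typically computed with the standard `rouge-score` and `bert-score` packages (the SummaC column comes from the separate `summac` package); a minimal reproduction of the two:

```python
from rouge_score import rouge_scorer
from bert_score import score as bertscore

reference = "The board approved the merger on Tuesday."
candidate = "On Tuesday the board approved the merger."

# Lexical overlap: ROUGE-1 / ROUGE-2 F-measures.
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = rouge.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)

# Semantic similarity: BERTScore F1.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(F1.mean().item())
```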