PenPaperKeyCode committed
Commit f5e4236 · 0 parent(s)

Init history
.gitattributes ADDED
@@ -0,0 +1,39 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ eval_output_dir_if_think/IFBench_multi-turn_input_response_data_hf.jsonl filter=lfs diff=lfs merge=lfs -text
37
+ eval_output_dir_dev_think/IFBench_multi-turn_input_response_data_hf.jsonl filter=lfs diff=lfs merge=lfs -text
38
+ eval_output_dir_dev_think/humaneval_plus_prediction.jsonl filter=lfs diff=lfs merge=lfs -text
39
+ eval_output_dir_if/IFBench_multi-turn_input_response_data_hf.jsonl filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,121 @@
1
+ HyperCLOVA X SEED 32B Think Model License Agreement
2
+
3
+ Model Release Date: December 29, 2025
4
+
5
+ This HyperCLOVA X SEED 32B Think Model License Agreement (the “Agreement”) is a legal agreement between you and NAVER Corporation (“Naver Corp.”) and NAVER Cloud Corporation (“Naver Cloud Corp.”) (Naver Corp. and Naver Cloud Corp. are collectively referred to as “NAVER”) and governs your use of the Models that NAVER provides to You under this Agreement.
6
+
7
+ NAVER Corp., as the holder of the intellectual property of the Model, and its affiliate, NAVER Cloud Corp., as the exclusive business operator of HyperCLOVA X, enter into this Agreement with you. NAVER and you are each a “party” and collectively the “parties.”
8
+
9
+ By using, reproducing, modifying, distributing, performing or displaying any portion or element of the Model or Derivative Model, or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement. You represent to us that you are lawfully able to enter into contracts, and if you are entering into this Agreement for an entity, that you have legal authority to bind that entity.
10
+
11
+ 1. Definitions.
12
+
13
+ 1.1. "Affiliate” means any entity directly or indirectly controlling, controlled by or under common control with either party, where “control” means the possession, directly or indirectly, of the power to independently direct or cause the direction of the management and policies of an entity, whether through ownership of more than fifty percent (50%) of the stock or other equity interests entitled to vote for representation on its board of directors, or body performing similar functions, by contract or otherwise.
14
+
15
+ 1.2. “Derivative Model” means all (i) modifications to the Model, (ii) works based on the Model, or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of the Model, to that model in order to cause that model to perform similarly to the Model, including distillation methods that use intermediate data representations or methods based on the generation of synthetic data Outputs by the Model for training that Model. For clarity, Outputs are not deemed Derivative Model.
16
+
17
+ 1.3. “Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
18
+
19
+ 1.4. “Model” means the foundational large language models and software and algorithms, including machine-learning model code and trained model weights distributed by NAVER.
20
+
21
+
22
+ 1.5. “Output” means the information content output of the Model or a Derivative Model that results from operating or otherwise using the Model or Derivative Model.
23
+
24
+ 2. Conditions for Use, License Grant and Restrictions
25
+
26
+ 2.1. Conditions for Use. The Model and any Derivative Model are subject to the terms of this Agreement and govern your use. If You institute copyright or patent litigation against any entity (including a crossclaim or counterclaim in a lawsuit) alleging that the Model or Derivative Model constitutes direct or contributory copyright or patent infringement, then any license granted to you under this Agreement for that Model or Derivative Model will terminate as of the date such litigation is filed. NAVER may update this Agreement to comply with legal and regulatory requirements any time and You agree to either comply with any updated license or cease your copying, use, and distribution of the Model and any Derivative Model.
27
+
28
+ 2.2. License Grant. Subject to the terms and conditions of this Agreement, NAVER hereby grants to you a non-exclusive, worldwide, non-transferable, revocable and royalty-free limited license under NAVER’s intellectual property or other rights owned by NAVER embodied in the Model to access, download, install, copy, use, reproduce, distribute, create derivative works of, and make modifications to the Model.
29
+
30
+ 2.3. Prohibited Use Policy. NAVER is committed to ensuring safety, trust, and transparency in the development and use of AI technologies. Accordingly, your use of the Model and any Derivative Models is subject to the following conditions:
31
+ (i) You must ensure that any product or service you develop, use, offer as a service, or distribute complies with all applicable laws and regulations, and is operated appropriately for the relevant industry or use case.
32
+ (ii) You must comply with the Acceptable Use Policy applicable to the Model and any Derivative Models, which is attached hereto as Addendum A and incorporated by reference into this Agreement.
33
+ (iii) NAVER expressly prohibits the use of its products or services for any purpose in violation of applicable law and regulation, including but not limited to:
34
+ (a) illegal surveillance,
35
+ (b) illegal collection or processing of biometric information without the consent of the subject which is required under applicable law, or
36
+ (c) illegal harassment, abuse, threatening or bullying of individuals or groups of individuals or intentionally misleading or deceiving others.
37
+ (iv) You must take reasonable measures to address unintended bias and to mitigate harm to others, including underrepresented or vulnerable groups.
38
+
39
+
40
+ 3. Redistribution.
41
+
42
+ 3.1. You may reproduce, distribute or make available the Model or Derivative Models thereof, or a product or service (including another AI model) that contains any of them, if you meet all of the following conditions: you must (i) include the Prohibited Use Policy referenced in Section 2.3. as an enforceable provision in any agreement (e.g., license agreement, terms of use, etc.) governing the use and/or distribution of the Model or Derivative Model and you must provide notice to subsequent users to whom you distribute that the Model or Derivative Models are subject to the use restrictions in Section 2.3., (ii) provide all third party recipients of the Model or Derivative Models a copy of this Agreement, (iii) cause any modified files to carry prominent notices stating that you modified the files; (iv) include the following attribution notice within a “Notice” text file distributed as part of such copies: “HyperCLOVA X SEED 32B Think Model is licensed under the HyperCLOVA X SEED 32B Think Model License Agreement, Copyright © NAVER Corp. All Rights Reserved.”, and (v) prominently display “Powered by HyperCLOVA X” on a related website, user interface, blogpost, about page, or product documentation. If you use the Model or any Outputs of the Model to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “HyperCLOVA X” at the beginning of any such AI model name.
43
+ 3.2. You may add your own copyright statement to your modifications and, except as set forth in this Section, may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such Derivative Models as a whole, provided your use, reproduction, and distribution of the Model or Derivative Models otherwise comply with the terms and conditions stated in this Agreement. Any additional or different terms and conditions you impose must not conflict with the terms of this Agreement.
44
+
45
+ 4. Additional Commercial Terms. If (i) as of the Model Release Date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s Affiliates, is greater than 10 million monthly active users in the preceding calendar month, or (ii) the Licensee or its Affiliate distributes or makes available any product or service, which is substantially similar to or directly competes with any product and service provided by NAVER, then the Licensee must request a license from NAVER. Such a license may be granted by NAVER at its sole discretion, and the Licensee is not authorized to exercise any rights under this Agreement unless and until NAVER expressly grants you such rights.
46
+
47
+ 5. Generated Output. NAVER claims no rights in Outputs you generate using the Model. You and your use are solely responsible for Outputs and their subsequent uses.
48
+
49
+ 6. DISCLAIMER OF WARRANTY. UNLESS REQUIRED BY APPLICABLE LAW, THE MODEL AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND NAVER DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE MODEL, DERIVATIVE MODELS, OUTPUTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE MODEL AND ANY OUTPUTS AND RESULTS AND YOUR EXERCISE OF PERMISSION UNDER THIS AGREEMENT.
50
+
51
+ 7. LIMITATION OF LIABILITY. IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, UNLESS REQUIRED BY APPLICABLE LAW (SUCH AS IN CASES OF DELIBERATE AND GROSSLY NEGLIGENT ACTS), WILL NAVER BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY, OR PUNITIVE DAMAGES, OR LOST PROFITS OF ANY KIND, ARISING FROM OR RELATED TO THIS AGREEMENT, OR RESULTING FROM THE USE OR INABILITY TO USE THE MODEL, DERIVATIVE MODELS OR, OUTPUTS (INCLUDING, BUT NOT LIMITED TO, DAMAGES FOR LOSS OF GOODWILL, WORK STOPPAGES, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF NAVER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
52
+
53
+ 8. Indemnity. You will indemnify and hold harmless NAVER from and against any claim by any third party arising out of or related to your use or distribution of the Model, Derivative Model or Outputs.
54
+
55
+ 9. Intellectual Property.
56
+
57
+ 9.1. This Agreement does not grant permission to use the trade names, trademarks, service marks, or product names of NAVER, except as required for reasonable and customary use in describing the origin of the Model and reproducing the content of the “Notice” text file.
58
+
59
+ 9.2. NAVER Corp. owns the Model and any Derivative Model created by NAVER Corp. Except as expressly granted in this Agreement, NAVER Corp. reserves all rights, interests and remedies in connection with the Model and Derivative Model created by NAVER Corp. and no other license or right is granted to you by implication, estoppel or otherwise. Subject to NAVER Corp.’s ownership of the Model and any Derivative Model made by or for NAVER Corp., with respect to any derivative works and modifications of the Model that are made by you, as between you and NAVER Corp., you are and will be the owner of such derivative works and modifications.
60
+
61
+ 10. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Model and will continue in full force and effect until terminated in accordance with the terms and conditions of this Agreement. NAVER may terminate this Agreement if you breach any of the terms or conditions of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Model and Derivative Model. Section 5, 6, 7 and 10 shall survive the termination of this Agreement.
62
+
63
+ 11. Governing Law and Jurisdiction.
64
+
65
+ 11.1. This Agreement will be governed by and construed in accordance with the laws of the Republic of Korea, without regard to its conflicts of laws principles.
66
+
67
+ 11.2. Any disputes, controversies, or claims arising out of or relating to this Agreement, including its existence, validity, interpretation, performance, breach, or termination, shall be referred to and finally resolved by arbitration administered by the Korean Commercial Arbitration Board (KCAB) in accordance with the International Arbitration Rules of the Korean Commercial Arbitration Board in force at the time of the commencement of the arbitration. The seat of arbitration shall be Seoul, Republic of Korea. The tribunal shall consist of one arbitrator. The language of the arbitration shall be English. Either party may seek interim or provisional relief from a court of competent jurisdiction and doing so shall not be considered a waiver of any provision in this section. The arbitral tribunal also has the authority to issue orders for interim or provisional relief.
68
+
69
+ 12. Modifications. NAVER reserves the right to modify or amend this Agreement at any time, in its sole discretion. Any modifications will be effective upon posting the updated Agreement on our website or through other means of communication. You are responsible for reviewing the Agreement periodically for changes.
70
+
71
+ 13. No Waiver. NAVER will not be treated as having waived any rights by not exercising (or delaying the exercise of) any rights under this Agreement.
72
+
73
+
74
+
75
+ Addendum A – Acceptable Use Policy
76
+
77
+ NAVER is committed to promoting safe and responsible use of its AI technologies, including the HyperCLOVA X SEED 32B Think Model (the “Model”). By accessing or using the Model and Derivative Model (Defined in the Model License Agreement) (the Model and Derivative Model are collectively referred to as the “Models”), you agree to this Acceptable Use Policy (“Policy”).
78
+
79
+ We want everyone to use the Models safely, legally, and ethically. You agree that you will not use, or allow others to use, the Models to:
80
+
81
+ 1. Violate applicable laws or the rights of others, including by:
82
+ a. Engaging in, promoting, contributing to, encouraging, planning, inciting, or furthering illegal or unlawful activity or content, such as:
83
+  - Violence or terrorism
+  - Exploitation or harm to children, including the creation or dissemination of child exploitative content
+  - Human trafficking, exploitation, or sexual violence
+  - The unlawful distribution of obscene or harmful material to minors, or failure to apply legally required age restrictions
+  - Sexual solicitation or sexually exploitative behavior
+  - Any other criminal activity
89
+ b. Engaging in, promoting, inciting, or facilitating the harassment, abuse, threatening, or bullying of individuals or groups
90
+ c. Engaging in, promoting, inciting, or facilitating discrimination or other unlawful or harmful conduct in the provision of employment, credit, housing, or access to essential goods and services
91
+ d. Providing unauthorized or unlicensed professional services, including but not limited to financial, legal, medical/health, or related services
92
+ e. Collecting, processing, disclosing, generating, or inferring private or sensitive personal information, including identity, health, or demographic data, unless lawfully permitted under applicable laws
93
+ f. Infringing, misappropriating, or otherwise violating third-party rights, including through the generation or use of outputs derived from the Models
94
+ g. Creating, generating, or facilitating malicious code, malware, or computer viruses, or interfering with the functioning, security, or integrity of a website, application, or system
95
+ h. Intentionally bypassing or disabling usage restrictions, safety measures, or access controls imposed by NAVER
96
+
97
+ 2. Engage in or promote use cases that may pose a risk of death, bodily harm, or significant safety hazard to individuals, including use of the Models in connection with:
98
+ a. Military, warfare, nuclear technology or espionage
99
+ b. The development or distribution of firearms or illegal weapons
100
+ c. Illegal drugs or regulated controlled substances
101
+ d. Operation of critical infrastructure, transportation systems, or heavy machinery
102
+ e. Content promoting self-harm, including suicide, or eating disorders
103
+ f. Any other use intended to incite or cause physical harm
104
+
105
+ 3. Intentionally deceive or mislead others, including by:
106
+ a. Generating, promoting, or disseminating fraudulent or misleading content
107
+ b. Creating or sharing defamatory content
108
+ c. Generating or distributing spam
109
+ d. Impersonating another individual or entity without proper authorization
110
+ e. Representing Model output as human-generated
111
+ f. Generating or enabling fake online engagement, such as fake reviews or fake users
112
+
113
+ 4. Fail to disclose to end users any known risks or limitations of an AI system that incorporates the Models.
114
+
115
+ 5. Use the Models in conjunction with third-party tools, models, or software designed to generate unlawful content or conduct, or falsely represent outputs from such tools as associated with NAVER or HyperCLOVA X.
116
+
117
+ If you become aware of a violation of this Policy, a bug, or any behavior that could result in a breach of this Policy, please report it to us:
118
+
119
+ Reporting risky outputs: [email protected]
120
+ Reporting policy violations or unauthorized use: [email protected]
121
+
README.md ADDED
@@ -0,0 +1,289 @@
1
+ ---
2
+ license: other
3
+ license_name: hyperclovax
4
+ license_link: LICENSE
5
+ library_name: transformers
6
+ ---
7
+
8
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/2wkHd-bv3M9Zsma_ykIf8.png)
9
+
10
+ # Overview
11
+ HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the [SEED Think 14B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B) line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
12
+
13
+ # Basic Information
14
+
15
+ - **Architecture**: Transformer-based vision-language model (VLM) architecture (Dense Model)
16
+ - **Parameters**: 32B
17
+ - **Input Format**: Text/Image/Video
18
+ - **Output Format**: Text
19
+ - **Context Length**: 128K
20
+
21
+ # Benchmarks
22
+
23
+ ![Technical report benchmark chart](https://cdn-uploads.huggingface.co/production/uploads/646acf46086023e36edce4c4/qfIKiKlFVJWyCx3Dl1qN0.png)
24
+
25
+ - General Knowledge (Korean Text): KoBalt, CLIcK, HAERAE Bench 1.0
26
+ - Vision Understanding: ChartVQA, TextVQA, K-MMBench, K-DTCBench
27
+ - Agentic Tasks: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom
28
+
29
+
30
+ # Examples
31
+ - Solving 2026 Korean CSAT Math Problem
32
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/LPU8kNbYQ8FN_piQ_p6Je.jpeg" style="width: 640px;">
33
+ - Understanding Text Layout
34
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/Y8lHa7s1TmJcS6F82d41L.jpeg" style="width: 640px;">
35
+ <!-- - Understanding Charts
36
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/zoH2Lh6CSkgdzvXz7JaHo.jpeg" style="width: 640px;"> -->
37
+
38
+ # Inference
39
+
40
+ We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
41
+
42
+ ## Capabilities
43
+
44
+ - **Inputs**: Text, Image
45
+ - **Outputs**: Text
46
+
47
+ ## Requirements
48
+
49
+ - 4x NVIDIA A100 80GB
50
+ - Docker & Docker Compose
51
+ - NVIDIA Driver 525+, CUDA 12.1+
52
+
53
+ ## Installation
54
+
55
+ ```bash
56
+ # Clone OmniServe
57
+ git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
58
+ cd OmniServe
59
+
60
+ # Install dependencies
61
+ pip install huggingface_hub safetensors torch openai easydict
62
+
63
+ # Download model (~60GB)
64
+ huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
65
+ --local-dir ./models/HyperCLOVAX-SEED-Think-32B
66
+
67
+ # Convert model to component format
68
+ python convert_model.py \
69
+ --input ./models/HyperCLOVAX-SEED-Think-32B \
70
+ --output ./track_a \
71
+ --track a
72
+
73
+ # Configure environment
74
+ cp .env.example .env
75
+ # Edit .env:
76
+ # VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
77
+ # VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B
78
+
79
+ # Build and run
80
+ docker compose --profile track-a build
81
+ docker compose --profile track-a up -d
82
+
83
+ # Wait for model loading (~5 minutes)
84
+ docker compose logs -f vlm
85
+ ```
86
+
87
+ ## Basic Usage
88
+
89
+ ```python
90
+ from openai import OpenAI
91
+
92
+ client = OpenAI(
93
+ base_url="http://localhost:8000/a/v1",
94
+ api_key="not-needed"
95
+ )
96
+
97
+ # Image understanding
98
+ response = client.chat.completions.create(
99
+ model="track_a_model",
100
+ messages=[
101
+ {
102
+ "role": "user",
103
+ "content": [
104
+ {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
105
+ {"type": "text", "text": "Describe this image."}
106
+ ]
107
+ }
108
+ ],
109
+ max_tokens=512,
110
+ extra_body={"chat_template_kwargs": {"thinking": False}}
111
+ )
112
+
113
+ print(response.choices[0].message.content)
114
+ ```
115
+
116
+ ## Reasoning Mode
117
+
118
+ Enable chain-of-thought reasoning for complex tasks:
119
+
120
+ ```python
121
+ response = client.chat.completions.create(
122
+ model="track_a_model",
123
+ messages=[
124
+ {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
125
+ ],
126
+ max_tokens=1024,
127
+ extra_body={
128
+ "thinking_token_budget": 500,
129
+ "chat_template_kwargs": {"thinking": True}
130
+ }
131
+ )
132
+
133
+ # Response includes <think>...</think> with reasoning process
134
+ print(response.choices[0].message.content)
135
+ ```
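+
+ Since the reasoning is returned inline in the message content, a small helper (a sketch, not part of OmniServe) can separate the `<think>` block from the final answer:
+
+ ```python
+ def split_thinking(text: str):
+     """Split '<think>...</think>answer' into (reasoning, answer), per the format above."""
+     if "</think>" in text:
+         reasoning, _, answer = text.partition("</think>")
+         return reasoning.replace("<think>", "").strip(), answer.strip()
+     return "", text.strip()
+
+ reasoning, answer = split_thinking(response.choices[0].message.content)
+ print("Reasoning:", reasoning)
+ print("Answer:", answer)
+ ```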
136
+
137
+ ## More Examples
138
+
139
+ <details>
140
+ <summary>Video Understanding</summary>
141
+
142
+ ```python
143
+ response = client.chat.completions.create(
144
+ model="track_a_model",
145
+ messages=[
146
+ {
147
+ "role": "user",
148
+ "content": [
149
+ {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
150
+ {"type": "text", "text": "Describe this video."}
151
+ ]
152
+ }
153
+ ],
154
+ max_tokens=512,
155
+ extra_body={"chat_template_kwargs": {"thinking": False}}
156
+ )
157
+ ```
158
+
159
+ </details>
160
+
161
+ <details>
162
+ <summary>Base64 Image Input</summary>
163
+
164
+ ```python
165
+ import base64
166
+
167
+ with open("image.png", "rb") as f:
168
+ image_b64 = base64.b64encode(f.read()).decode()
169
+
170
+ response = client.chat.completions.create(
171
+ model="track_a_model",
172
+ messages=[
173
+ {
174
+ "role": "user",
175
+ "content": [
176
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
177
+ {"type": "text", "text": "What is in this image?"}
178
+ ]
179
+ }
180
+ ],
181
+ max_tokens=512,
182
+ extra_body={"chat_template_kwargs": {"thinking": False}}
183
+ )
184
+ ```
185
+
186
+ </details>
187
+
188
+ <details>
189
+ <summary>Using curl</summary>
190
+
191
+ ```bash
192
+ curl -X POST http://localhost:8000/a/v1/chat/completions \
193
+ -H "Content-Type: application/json" \
194
+ -d '{
195
+ "model": "track_a_model",
196
+ "messages": [
197
+ {
198
+ "role": "user",
199
+ "content": [
200
+ {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
201
+ {"type": "text", "text": "Describe this image."}
202
+ ]
203
+ }
204
+ ],
205
+ "max_tokens": 512,
206
+ "extra_body": {"chat_template_kwargs": {"thinking": false}}
207
+ }'
208
+ ```
209
+
210
+ </details>
211
+
212
+ ## Model Capabilities
213
+
214
+ | Input | Output |
215
+ |-------|--------|
216
+ | Text | Text |
217
+ | Image | Text |
218
+ | Video | Text |
219
+ | Image + Text | Text |
220
+ | Video + Text | Text |
221
+
222
+ **Features:**
223
+ - Reasoning mode with `<think>...</think>` output
224
+ - Multi-turn conversation support (see the sketch below)
225
+ - Image/Video understanding
226
+
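+ Multi-turn conversations use the standard OpenAI chat format: append the assistant's previous reply to `messages` before sending a follow-up. A minimal sketch, reusing the `client` from Basic Usage (the questions are illustrative):
+
+ ```python
+ # First turn
+ messages = [{"role": "user", "content": "What is the capital of Korea?"}]
+ first = client.chat.completions.create(
+     model="track_a_model",
+     messages=messages,
+     max_tokens=256,
+     extra_body={"chat_template_kwargs": {"thinking": False}},
+ )
+
+ # Append the assistant reply, then ask a follow-up in the same conversation
+ messages.append({"role": "assistant", "content": first.choices[0].message.content})
+ messages.append({"role": "user", "content": "What is its approximate population?"})
+
+ second = client.chat.completions.create(
+     model="track_a_model",
+     messages=messages,
+     max_tokens=256,
+     extra_body={"chat_template_kwargs": {"thinking": False}},
+ )
+ print(second.choices[0].message.content)
+ ```
+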
227
+ ## Architecture
228
+
229
+ ```
230
+ User Request
231
+ (Image/Video/Text)
232
+
233
+
234
+ ┌─────────────────────────────────────────────────────────────────────────┐
235
+ │ OmniServe │
236
+ │ POST /a/v1/chat/completions │
237
+ │ │
238
+ │ ┌──────────────────────────────────────────────────────────────────┐ │
239
+ │ │ [1] INPUT ENCODING │ │
240
+ │ │ │ │
241
+ │ │ ┌─────────────────┐ │ │
242
+ │ │ │ Vision Encoder │ │ │
243
+ │ │ └────────┬────────┘ │ │
244
+ │ │ │ embeddings │ │
245
+ │ └────────────────────────────┼─────────────────────────────────────┘ │
246
+ │ ▼ │
247
+ │ ┌──────────────┐ │
248
+ │ │ LLM (32B) │◀──── text │
249
+ │ └──────┬───────┘ │
250
+ │ │ │
251
+ │ ▼ │
252
+ │ Text Response │
253
+ │ │
254
+ └─────────────────────────────────────────────────────────────────────────┘
255
+
256
+
257
+ Response
258
+ (Text)
259
+ ```
260
+
261
+ ## Hardware Requirements
262
+
263
+ | Component | GPU | VRAM |
264
+ |-----------|-----|------|
265
+ | Vision Encoder | 1x | ~8GB |
266
+ | LLM (32B) | 2x | ~60GB |
267
+ | **Total** | **3x** | **~68GB** |
268
+
269
+ ## Key Parameters
270
+
271
+ | Parameter | Description | Default |
272
+ |-----------|-------------|---------|
273
+ | `chat_template_kwargs.thinking` | Enable reasoning | `false` |
274
+ | `thinking_token_budget` | Max reasoning tokens | 500 |
275
+ | `max_tokens` | Max output tokens | - |
276
+ | `temperature` | Sampling temperature | 0.7 |
277
+
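+ As a quick reference, here is a sketch combining the parameters above in a single request (values are illustrative; `client` is the one from Basic Usage):
+
+ ```python
+ response = client.chat.completions.create(
+     model="track_a_model",
+     messages=[{"role": "user", "content": "Compare quicksort and mergesort."}],
+     max_tokens=1024,      # max output tokens
+     temperature=0.7,      # sampling temperature
+     extra_body={
+         "thinking_token_budget": 500,                # cap on reasoning tokens
+         "chat_template_kwargs": {"thinking": True},  # enable the <think> block
+     },
+ )
+ print(response.choices[0].message.content)
+ ```
+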
278
+ For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).
279
+
280
+
281
+ # Citation
282
+ TBU (Technical Report)
283
+
284
+ # Questions
285
+ For any other questions, please feel free to contact us at [email protected].
286
+
287
+
288
+ # License
289
+ The model is licensed under the [HyperCLOVA X SEED 32B Think Model License Agreement](./LICENSE).
added_tokens.json ADDED
@@ -0,0 +1,258 @@
1
+ {
2
+ "</arg_key>": 128045,
3
+ "</arg_value>": 128047,
4
+ "</think>": 128041,
5
+ "</tool_call>": 128043,
6
+ "</tool_response>": 128049,
7
+ "</tools>": 128051,
8
+ "<EMAIL>": 128037,
9
+ "<KEY>": 128038,
10
+ "<NAME>": 128036,
11
+ "<PASSWORD>": 128039,
12
+ "<arg_key>": 128044,
13
+ "<arg_value>": 128046,
14
+ "<code_to_intermediate>": 128018,
15
+ "<empty_output>": 128017,
16
+ "<file_sep>": 128008,
17
+ "<intermediate_to_code>": 128019,
18
+ "<issue_closed>": 128011,
19
+ "<issue_comment>": 128010,
20
+ "<issue_start>": 128009,
21
+ "<jupyter_code>": 128014,
22
+ "<jupyter_output>": 128015,
23
+ "<jupyter_script>": 128016,
24
+ "<jupyter_start>": 128012,
25
+ "<jupyter_text>": 128013,
26
+ "<pr>": 128020,
27
+ "<pr_base>": 128023,
28
+ "<pr_base_code>": 128025,
29
+ "<pr_comment>": 128028,
30
+ "<pr_diff>": 128026,
31
+ "<pr_diff_hunk>": 128027,
32
+ "<pr_diff_hunk_comment_line>": 128035,
33
+ "<pr_event_id>": 128029,
34
+ "<pr_file>": 128024,
35
+ "<pr_in_reply_to_comment_id>": 128034,
36
+ "<pr_in_reply_to_review_id>": 128033,
37
+ "<pr_is_merged>": 128022,
38
+ "<pr_review>": 128030,
39
+ "<pr_review_comment>": 128032,
40
+ "<pr_review_state>": 128031,
41
+ "<pr_status>": 128021,
42
+ "<repo_name>": 128007,
43
+ "<think>": 128040,
44
+ "<tool_call>": 128042,
45
+ "<tool_response>": 128048,
46
+ "<tools>": 128050,
47
+ "<|IMAGE_PAD|>": 128060,
48
+ "<|VIDEO_PAD|>": 128061,
49
+ "<|_placeholder_067|>": 128067,
50
+ "<|_placeholder_068|>": 128068,
51
+ "<|_placeholder_069|>": 128069,
52
+ "<|_placeholder_070|>": 128070,
53
+ "<|_placeholder_071|>": 128071,
54
+ "<|_placeholder_072|>": 128072,
55
+ "<|_placeholder_073|>": 128073,
56
+ "<|_placeholder_074|>": 128074,
57
+ "<|_placeholder_075|>": 128075,
58
+ "<|_placeholder_076|>": 128076,
59
+ "<|_placeholder_077|>": 128077,
60
+ "<|_placeholder_078|>": 128078,
61
+ "<|_placeholder_079|>": 128079,
62
+ "<|_placeholder_080|>": 128080,
63
+ "<|_placeholder_081|>": 128081,
64
+ "<|_placeholder_082|>": 128082,
65
+ "<|_placeholder_083|>": 128083,
66
+ "<|_placeholder_084|>": 128084,
67
+ "<|_placeholder_085|>": 128085,
68
+ "<|_placeholder_086|>": 128086,
69
+ "<|_placeholder_087|>": 128087,
70
+ "<|_placeholder_088|>": 128088,
71
+ "<|_placeholder_089|>": 128089,
72
+ "<|_placeholder_090|>": 128090,
73
+ "<|_placeholder_091|>": 128091,
74
+ "<|_placeholder_092|>": 128092,
75
+ "<|_placeholder_093|>": 128093,
76
+ "<|_placeholder_094|>": 128094,
77
+ "<|_placeholder_095|>": 128095,
78
+ "<|_placeholder_096|>": 128096,
79
+ "<|_placeholder_097|>": 128097,
80
+ "<|_placeholder_098|>": 128098,
81
+ "<|_placeholder_099|>": 128099,
82
+ "<|_placeholder_100|>": 128100,
83
+ "<|_placeholder_101|>": 128101,
84
+ "<|_placeholder_102|>": 128102,
85
+ "<|_placeholder_103|>": 128103,
86
+ "<|_placeholder_104|>": 128104,
87
+ "<|_placeholder_105|>": 128105,
88
+ "<|_placeholder_106|>": 128106,
89
+ "<|_placeholder_107|>": 128107,
90
+ "<|_placeholder_108|>": 128108,
91
+ "<|_placeholder_109|>": 128109,
92
+ "<|_placeholder_110|>": 128110,
93
+ "<|_placeholder_111|>": 128111,
94
+ "<|_placeholder_112|>": 128112,
95
+ "<|_placeholder_113|>": 128113,
96
+ "<|_placeholder_114|>": 128114,
97
+ "<|_placeholder_115|>": 128115,
98
+ "<|_placeholder_116|>": 128116,
99
+ "<|_placeholder_117|>": 128117,
100
+ "<|_placeholder_118|>": 128118,
101
+ "<|_placeholder_119|>": 128119,
102
+ "<|_placeholder_120|>": 128120,
103
+ "<|_placeholder_121|>": 128121,
104
+ "<|_placeholder_122|>": 128122,
105
+ "<|_placeholder_123|>": 128123,
106
+ "<|_placeholder_124|>": 128124,
107
+ "<|_placeholder_125|>": 128125,
108
+ "<|_placeholder_126|>": 128126,
109
+ "<|_placeholder_127|>": 128127,
110
+ "<|_placeholder_128|>": 128128,
111
+ "<|_placeholder_129|>": 128129,
112
+ "<|_placeholder_130|>": 128130,
113
+ "<|_placeholder_131|>": 128131,
114
+ "<|_placeholder_132|>": 128132,
115
+ "<|_placeholder_133|>": 128133,
116
+ "<|_placeholder_134|>": 128134,
117
+ "<|_placeholder_135|>": 128135,
118
+ "<|_placeholder_136|>": 128136,
119
+ "<|_placeholder_137|>": 128137,
120
+ "<|_placeholder_138|>": 128138,
121
+ "<|_placeholder_139|>": 128139,
122
+ "<|_placeholder_140|>": 128140,
123
+ "<|_placeholder_141|>": 128141,
124
+ "<|_placeholder_142|>": 128142,
125
+ "<|_placeholder_143|>": 128143,
126
+ "<|_placeholder_144|>": 128144,
127
+ "<|_placeholder_145|>": 128145,
128
+ "<|_placeholder_146|>": 128146,
129
+ "<|_placeholder_147|>": 128147,
130
+ "<|_placeholder_148|>": 128148,
131
+ "<|_placeholder_149|>": 128149,
132
+ "<|_placeholder_150|>": 128150,
133
+ "<|_placeholder_151|>": 128151,
134
+ "<|_placeholder_152|>": 128152,
135
+ "<|_placeholder_153|>": 128153,
136
+ "<|_placeholder_154|>": 128154,
137
+ "<|_placeholder_155|>": 128155,
138
+ "<|_placeholder_156|>": 128156,
139
+ "<|_placeholder_157|>": 128157,
140
+ "<|_placeholder_158|>": 128158,
141
+ "<|_placeholder_159|>": 128159,
142
+ "<|_placeholder_160|>": 128160,
143
+ "<|_placeholder_161|>": 128161,
144
+ "<|_placeholder_162|>": 128162,
145
+ "<|_placeholder_163|>": 128163,
146
+ "<|_placeholder_164|>": 128164,
147
+ "<|_placeholder_165|>": 128165,
148
+ "<|_placeholder_166|>": 128166,
149
+ "<|_placeholder_167|>": 128167,
150
+ "<|_placeholder_168|>": 128168,
151
+ "<|_placeholder_169|>": 128169,
152
+ "<|_placeholder_170|>": 128170,
153
+ "<|_placeholder_171|>": 128171,
154
+ "<|_placeholder_172|>": 128172,
155
+ "<|_placeholder_173|>": 128173,
156
+ "<|_placeholder_174|>": 128174,
157
+ "<|_placeholder_175|>": 128175,
158
+ "<|_placeholder_176|>": 128176,
159
+ "<|_placeholder_177|>": 128177,
160
+ "<|_placeholder_178|>": 128178,
161
+ "<|_placeholder_179|>": 128179,
162
+ "<|_placeholder_180|>": 128180,
163
+ "<|_placeholder_181|>": 128181,
164
+ "<|_placeholder_182|>": 128182,
165
+ "<|_placeholder_183|>": 128183,
166
+ "<|_placeholder_184|>": 128184,
167
+ "<|_placeholder_185|>": 128185,
168
+ "<|_placeholder_186|>": 128186,
169
+ "<|_placeholder_187|>": 128187,
170
+ "<|_placeholder_188|>": 128188,
171
+ "<|_placeholder_189|>": 128189,
172
+ "<|_placeholder_190|>": 128190,
173
+ "<|_placeholder_191|>": 128191,
174
+ "<|_placeholder_192|>": 128192,
175
+ "<|_placeholder_193|>": 128193,
176
+ "<|_placeholder_194|>": 128194,
177
+ "<|_placeholder_195|>": 128195,
178
+ "<|_placeholder_196|>": 128196,
179
+ "<|_placeholder_197|>": 128197,
180
+ "<|_placeholder_198|>": 128198,
181
+ "<|_placeholder_199|>": 128199,
182
+ "<|_placeholder_200|>": 128200,
183
+ "<|_placeholder_201|>": 128201,
184
+ "<|_placeholder_202|>": 128202,
185
+ "<|_placeholder_203|>": 128203,
186
+ "<|_placeholder_204|>": 128204,
187
+ "<|_placeholder_205|>": 128205,
188
+ "<|_placeholder_206|>": 128206,
189
+ "<|_placeholder_207|>": 128207,
190
+ "<|_placeholder_208|>": 128208,
191
+ "<|_placeholder_209|>": 128209,
192
+ "<|_placeholder_210|>": 128210,
193
+ "<|_placeholder_211|>": 128211,
194
+ "<|_placeholder_212|>": 128212,
195
+ "<|_placeholder_213|>": 128213,
196
+ "<|_placeholder_214|>": 128214,
197
+ "<|_placeholder_215|>": 128215,
198
+ "<|_placeholder_216|>": 128216,
199
+ "<|_placeholder_217|>": 128217,
200
+ "<|_placeholder_218|>": 128218,
201
+ "<|_placeholder_219|>": 128219,
202
+ "<|_placeholder_220|>": 128220,
203
+ "<|_placeholder_221|>": 128221,
204
+ "<|_placeholder_222|>": 128222,
205
+ "<|_placeholder_223|>": 128223,
206
+ "<|_placeholder_224|>": 128224,
207
+ "<|_placeholder_225|>": 128225,
208
+ "<|_placeholder_226|>": 128226,
209
+ "<|_placeholder_227|>": 128227,
210
+ "<|_placeholder_228|>": 128228,
211
+ "<|_placeholder_229|>": 128229,
212
+ "<|_placeholder_230|>": 128230,
213
+ "<|_placeholder_231|>": 128231,
214
+ "<|_placeholder_232|>": 128232,
215
+ "<|_placeholder_233|>": 128233,
216
+ "<|_placeholder_234|>": 128234,
217
+ "<|_placeholder_235|>": 128235,
218
+ "<|_placeholder_236|>": 128236,
219
+ "<|_placeholder_237|>": 128237,
220
+ "<|_placeholder_238|>": 128238,
221
+ "<|_placeholder_239|>": 128239,
222
+ "<|_placeholder_240|>": 128240,
223
+ "<|_placeholder_241|>": 128241,
224
+ "<|_placeholder_242|>": 128242,
225
+ "<|_placeholder_243|>": 128243,
226
+ "<|_placeholder_244|>": 128244,
227
+ "<|_placeholder_245|>": 128245,
228
+ "<|_placeholder_246|>": 128246,
229
+ "<|_placeholder_247|>": 128247,
230
+ "<|_placeholder_248|>": 128248,
231
+ "<|_placeholder_249|>": 128249,
232
+ "<|_placeholder_250|>": 128250,
233
+ "<|_placeholder_251|>": 128251,
234
+ "<|_placeholder_252|>": 128252,
235
+ "<|_placeholder_253|>": 128253,
236
+ "<|_placeholder_254|>": 128254,
237
+ "<|_placeholder_255|>": 128255,
238
+ "<|back_translation|>": 128065,
239
+ "<|code_switching|>": 128064,
240
+ "<|document_end|>": 128055,
241
+ "<|document_start|>": 128054,
242
+ "<|endofturn|>": 128003,
243
+ "<|fim_middle|>": 128005,
244
+ "<|fim_prefix|>": 128004,
245
+ "<|fim_suffix|>": 128006,
246
+ "<|im_end|>": 128001,
247
+ "<|im_start|>": 128000,
248
+ "<|image_end|>": 128057,
249
+ "<|image_start|>": 128056,
250
+ "<|instruction_pretraining|>": 128066,
251
+ "<|mime_end|>": 128053,
252
+ "<|mime_start|>": 128052,
253
+ "<|stop|>": 128002,
254
+ "<|video_end|>": 128059,
255
+ "<|video_start|>": 128058,
256
+ "<|vision_aux_end|>": 128063,
257
+ "<|vision_aux_start|>": 128062
258
+ }
chat_template.jinja ADDED
@@ -0,0 +1,147 @@
1
+ {%- set ns_img = namespace(count=0) %}
2
+ {%- set ns_vid = namespace(count=0) %}
3
+ {%- if tools %}
4
+ {{- '<|im_start|>system\n' }}
5
+ {%- if messages[0].role == 'system' %}
6
+ {%- if messages[0].content is string %}
7
+ {{- messages[0].content + '\n\n' }}
8
+ {%- elif messages[0].content is sequence %}
9
+ {%- for content_part in messages[0].content %}
10
+ {%- if content_part.type == 'text' %}
11
+ {{- content_part.text + '\n\n' }}
12
+ {%- endif %}
13
+ {%- endfor %}
14
+ {%- endif %}
15
+ {%- endif %}
16
+ {{- '# Tools\n\n' }}
17
+ {{- 'You may call one or more functions to assist with the user query.\n\n' }}
18
+ {{- 'You are provided with function signatures within <tools></tools> XML tags:\n' }}
19
+ {{- '<tools>\n' }}
20
+ {%- for tool in tools %}
21
+ {{- tool | tojson(ensure_ascii=False) }}
22
+ {%- endfor %}
23
+ {{- '\n</tools>\n\n' }}
24
+ {{- 'For each function call, output the function name and arguments within the following XML format:\n' }}
25
+ {{- '<tool_call>{function-name}\n' }}
26
+ {{- '<arg_key>{arg-key-1}</arg_key>\n' }}
27
+ {{- '<arg_value>{arg-value-1}</arg_value>\n' }}
28
+ {{- '<arg_key>{arg-key-2}</arg_key>\n' }}
29
+ {{- '<arg_value>{arg-value-2}</arg_value>\n' }}
30
+ {{- '...\n' }}
31
+ {{- '</tool_call><|im_end|>\n' }}
32
+ {%- else %}
33
+ {%- if messages[0].role == 'system' %}
34
+ {{- '<|im_start|>system\n' }}
35
+ {%- if messages[0].content is string %}
36
+ {{- messages[0].content }}
37
+ {%- elif messages[0].content is sequence %}
38
+ {%- for content_part in messages[0].content %}
39
+ {%- if content_part.type == 'text' %}
40
+ {{- content_part.text }}
41
+ {%- endif %}
42
+ {%- endfor %}
43
+ {%- endif %}
44
+ {{- '<|im_end|>\n' }}
45
+ {%- endif %}
46
+ {%- endif %}
47
+ {%- set ns = namespace(last_user_index=-1) %}
48
+ {%- for m in messages %}
49
+ {%- if m.role == 'user' %}
50
+ {%- set ns.last_user_index = loop.index0 %}
51
+ {%- endif %}
52
+ {%- endfor %}
53
+ {%- for message in messages %}
54
+ {%- set content = message.content %}
55
+ {%- if (message.role == 'system' and not loop.first) %}
56
+ {{- '<|im_start|>' + message.role + '\n' }}
57
+ {%- if content is string %}
58
+ {{- content }}
59
+ {%- elif content is sequence %}
60
+ {%- for content_part in content %}
61
+ {%- if content_part.type == 'text' %}
62
+ {{- content_part.text }}
63
+ {%- endif %}
64
+ {%- endfor %}
65
+ {%- endif %}
66
+ {{- '<|im_end|>' + '\n' }}
67
+ {%- elif message.role == 'user' %}
68
+ {{- '<|im_start|>user\n' }}
69
+ {%- if message['content'] is string %}
70
+ {{- message['content'] + '<|im_end|>\n' }}
71
+ {%- elif message['content'] is sequence %}
72
+ {%- for content in message['content'] %}
73
+ {%- if not loop.first %}
74
+ {{- '\n' }}
75
+ {%- endif %}
76
+ {%- if content['type'] == 'image' %}
77
+ {%- set image_id = 'image_%02d' % ns_img.count %}
78
+ {%- set ns_img.count = ns_img.count + 1 %}
79
+ {{- '<|mime_start|>{"id": "' + image_id + '", "type": "image/jpeg", "filename": "' + content.get('filename', "a.jpg") + '"}<|mime_end|>\n' }}
80
+ {{- '<|image_start|><|IMAGE_PAD|><|image_end|>' }}
81
+ {%- elif content['type'] == 'video' %}
82
+ {%- set video_id = 'video_%02d' % ns_vid.count %}
83
+ {%- set ns_vid.count = ns_vid.count + 1 %}
84
+ {{- '<|mime_start|>{"id": "' + video_id + '", "type": "video/mp4", "filename": "' + content.get('filename', "a.mp4") + '"}<|mime_end|>\n' }}
85
+ {{- '<|video_aux_start|>다음 중 video_duration은 비디오 길이 정보입니다. 참고하여 답변하세요. {"video_duration": ' + (content.get('video_duration') | tojson if content.get('video_duration') else '<|video_duration|>') + '}<|video_aux_end|>\n'}}
86
+ {{- '<|video_start|><|VIDEO_PAD|><|video_end|>\n'}}
87
+ {%- elif content['type'] == 'text' %}
88
+ {{- content['text'] }}
89
+ {%- endif %}
90
+ {%- endfor %}
91
+ {{- '<|im_end|>\n'}}
92
+ {%- endif %}
93
+ {%- elif message.role == 'assistant' %}
94
+ {%- set reasoning_content = '' %}
95
+ {%- if message.reasoning_content is string %}
96
+ {%- set reasoning_content = message.reasoning_content %}
97
+ {%- else %}
98
+ {%- if '</think>' in content %}
99
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
100
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
101
+ {%- endif %}
102
+ {%- endif %}
103
+ {%- if loop.index0 > ns.last_user_index %}
104
+ {%- if loop.last or reasoning_content %}
105
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
106
+ {%- else %}
107
+ {{- '<|im_start|>' + message.role + '\n' + content }}
108
+ {%- endif %}
109
+ {%- else %}
110
+ {{- '<|im_start|>' + message.role + '\n' + content }}
111
+ {%- endif %}
112
+ {%- if message.tool_calls %}
113
+ {%- for tool_call in message.tool_calls %}
114
+ {%- if not loop.first or content %}
115
+ {{- '\n' }}
116
+ {%- endif %}
117
+ {%- if tool_call.function %}
118
+ {%- set tool_call = tool_call.function %}
119
+ {%- endif %}
120
+ {{- '<tool_call>' + tool_call.name + '\n' }}
121
+ {%- set _args = tool_call.arguments %}
122
+ {%- for k, v in _args.items() %}
123
+ {{- '<arg_key>' + k + '</arg_key>\n' }}
124
+ {{- '<arg_value>' + (v | tojson(ensure_ascii=False) if v is not string else v) + '</arg_value>\n' }}
125
+ {%- endfor %}
126
+ {{- '</tool_call>' }}
127
+ {%- endfor %}
128
+ {%- endif %}
129
+ {{- '<|im_end|>\n' }}
130
+ {%- elif message.role == 'tool' %}
131
+ {%- if loop.first or (messages[loop.index0 - 1].role != 'tool') %}
132
+ {{- '<|im_start|>tool' }}
133
+ {%- endif %}
134
+ {{- '\n<tool_response>' + message.get('name', '') + '\n' }}
135
+ {{- content }}
136
+ {{- '\n</tool_response>' }}
137
+ {%- if loop.last or (messages[loop.index0 + 1].role != 'tool') %}
138
+ {{- '<|im_end|>\n' }}
139
+ {%- endif %}
140
+ {%- endif %}
141
+ {%- endfor %}
142
+ {%- if add_generation_prompt %}
143
+ {{- '<|im_start|>assistant\n<think>\n' }}
144
+ {%- if skip_reasoning is defined and skip_reasoning is true %}
145
+ {{- '\n</think>\n\n' }}
146
+ {%- endif %}
147
+ {%- endif %}
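
The template above drives prompt construction: with `add_generation_prompt` it opens an assistant turn as `<|im_start|>assistant` followed by an open `<think>` block, and it closes that block immediately when `skip_reasoning` is true (the serving-layer `thinking` flag in the README presumably maps onto this). The Korean instruction string in the video branch roughly translates to "the video_duration below gives the video length; refer to it when answering." A minimal local-rendering sketch with Hugging Face `transformers`, assuming the tokenizer in this repo loads as published and that extra keyword arguments are forwarded to the template (as in recent `transformers` releases):

```python
from transformers import AutoTokenizer

# trust_remote_code may be needed for the custom code shipped in this repo.
tok = AutoTokenizer.from_pretrained(
    "naver-hyperclovax/HyperCLOVAX-SEED-Think-32B", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve step by step: 3x + 7 = 22"},
]

# Leave the <think> block open so the model generates its reasoning first.
prompt_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Close the <think> block up front to skip reasoning (the skip_reasoning branch above).
prompt_direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, skip_reasoning=True
)

print(prompt_thinking)
print(prompt_direct)
```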
config.json ADDED
@@ -0,0 +1,229 @@
1
+ {
2
+ "anyres": false,
3
+ "architectures": [
4
+ "HCXVisionV2ForCausalLM"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_vlm.HCXVisionConfig",
8
+ "AutoModelForCausalLM": "modeling_vlm.HCXVisionForCausalLM",
9
+ "AutoModelForSequenceClassification": "modeling_vlm.HCXVisionForSequenceClassification",
10
+ "AutoModelForTokenClassification": "modeling_vlm.HCXVisionForTokenClassification"
11
+ },
12
+ "bos_token_id": 0,
13
+ "eos_token_id": 128001,
14
+ "freeze_before_sampler": false,
15
+ "freeze_decoder": false,
16
+ "freeze_encoder": true,
17
+ "freeze_mm_projector": false,
18
+ "hidden_size": 5120,
19
+ "ignore_index": -100,
20
+ "image_token_id": 128060,
21
+ "img_start_id": 128060,
22
+ "is_safetensor_save": true,
23
+ "max_num_grids": -1,
24
+ "mm_projector_type": "linear",
25
+ "model_type": "vlm",
26
+ "num_queries_vis_abstractor": -1,
27
+ "pad_token_id": 0,
28
+ "possible_resolutions": [],
29
+ "proj_pos_emb": true,
30
+ "proj_prenorm": false,
31
+ "q_former_model_name_or_path": null,
32
+ "text_config": {
33
+ "add_cross_attention": false,
34
+ "architectures": [
35
+ "HyperCLOVAXForCausalLM"
36
+ ],
37
+ "attention_bias": false,
38
+ "attention_dropout": 0.0,
39
+ "attention_multiplier": 0.08838834764831845,
40
+ "auto_map": {
41
+ "AutoConfig": "configuration_hyperclovax.HyperCLOVAXConfig",
42
+ "AutoModel": "modeling_hyperclovax.HyperCLOVAXModel",
43
+ "AutoModelForCausalLM": "modeling_hyperclovax.HyperCLOVAXForCausalLM"
44
+ },
45
+ "bad_words_ids": null,
46
+ "begin_suppress_tokens": null,
47
+ "bos_token_id": 128000,
48
+ "chunk_size_feed_forward": 0,
49
+ "cross_attention_hidden_size": null,
50
+ "decoder_start_token_id": null,
51
+ "diversity_penalty": 0.0,
52
+ "do_sample": false,
53
+ "dtype": "bfloat16",
54
+ "early_stopping": false,
55
+ "embedding_multiplier": 1.0,
56
+ "encoder_no_repeat_ngram_size": 0,
57
+ "end_token_id": 128001,
58
+ "eos_token_id": 128001,
59
+ "exponential_decay_length_penalty": null,
60
+ "finetuning_task": null,
61
+ "forced_bos_token_id": null,
62
+ "forced_eos_token_id": null,
63
+ "head_dim": 128,
64
+ "hidden_act": "silu",
65
+ "hidden_size": 5120,
66
+ "id2label": {
67
+ "0": "LABEL_0",
68
+ "1": "LABEL_1"
69
+ },
70
+ "initializer_range": 0.006,
71
+ "intermediate_size": 24192,
72
+ "is_decoder": false,
73
+ "is_encoder_decoder": false,
74
+ "label2id": {
75
+ "LABEL_0": 0,
76
+ "LABEL_1": 1
77
+ },
78
+ "length_penalty": 1.0,
79
+ "logits_scaling": 1.0,
80
+ "max_length": 20,
81
+ "max_position_embeddings": 131072,
82
+ "min_length": 0,
83
+ "mlp_bias": false,
84
+ "model_type": "hyperclovax",
85
+ "no_repeat_ngram_size": 0,
86
+ "num_attention_heads": 40,
87
+ "num_beam_groups": 1,
88
+ "num_beams": 1,
89
+ "num_hidden_layers": 72,
90
+ "num_key_value_heads": 8,
91
+ "num_return_sequences": 1,
92
+ "output_attentions": false,
93
+ "output_hidden_states": false,
94
+ "output_scores": false,
95
+ "pad_token_id": 0,
96
+ "prefix": null,
97
+ "pretraining_tp": 1,
98
+ "problem_type": null,
99
+ "pruned_heads": {},
100
+ "remove_invalid_values": false,
101
+ "repetition_penalty": 1.0,
102
+ "resid_pdrop": 0.2,
103
+ "residual_multiplier": 1.0,
104
+ "return_dict": true,
105
+ "return_dict_in_generate": false,
106
+ "rms_norm_eps": 1e-05,
107
+ "rope_scaling": null,
108
+ "rope_theta": 50000000,
109
+ "sep_token_id": null,
110
+ "suppress_tokens": null,
111
+ "task_specific_params": null,
112
+ "temperature": 1.0,
113
+ "tf_legacy_loss": false,
114
+ "tie_encoder_decoder": false,
115
+ "tie_word_embeddings": false,
116
+ "tokenizer_class": null,
117
+ "top_k": 50,
118
+ "top_p": 1.0,
119
+ "torch_dtype": "float32",
120
+ "torchscript": false,
121
+ "typical_p": 1.0,
122
+ "use_bfloat16": false,
123
+ "use_cache": false,
124
+ "use_post_norm": false,
125
+ "vocab_size": 128256,
126
+ "_name_or_path": "naver-hyperclovax/HyperCLOVAX-SEED-Think-32B"
127
+ },
128
+ "text_model_name_or_path": null,
129
+ "torch_dtype": "float32",
130
+ "transformers_version": "4.52.4",
131
+ "unpad": false,
132
+ "use_1x1_grid": false,
133
+ "use_nth_layer": -2,
134
+ "video_first_last_frames_slows": null,
135
+ "video_max_num_frames": null,
136
+ "video_num_queries_fast": null,
137
+ "video_num_queries_slow": null,
138
+ "video_start_id": 128061,
139
+ "video_token_id": 128061,
140
+ "vision_config": {
141
+ "add_cross_attention": false,
142
+ "anyres": false,
143
+ "architectures": [
144
+ "Qwen2_5_VisionTransformerPretrainedModel"
145
+ ],
146
+ "bad_words_ids": null,
147
+ "begin_suppress_tokens": null,
148
+ "bos_token_id": null,
149
+ "chunk_size_feed_forward": 0,
150
+ "cross_attention_hidden_size": null,
151
+ "decoder_start_token_id": null,
152
+ "depth": 32,
153
+ "diversity_penalty": 0.0,
154
+ "do_sample": false,
155
+ "early_stopping": false,
156
+ "encoder_no_repeat_ngram_size": 0,
157
+ "eos_token_id": null,
158
+ "exponential_decay_length_penalty": null,
159
+ "finetuning_task": null,
160
+ "forced_bos_token_id": null,
161
+ "forced_eos_token_id": null,
162
+ "fullatt_block_indexes": [
163
+ 7,
164
+ 15,
165
+ 23,
166
+ 31
167
+ ],
168
+ "hidden_act": "silu",
169
+ "hidden_size": 1280,
170
+ "id2label": {
171
+ "0": "LABEL_0",
172
+ "1": "LABEL_1"
173
+ },
174
+ "in_channels": 3,
175
+ "in_chans": 3,
176
+ "initializer_range": 0.02,
177
+ "intermediate_size": 3456,
178
+ "is_decoder": false,
179
+ "is_encoder_decoder": false,
180
+ "label2id": {
181
+ "LABEL_0": 0,
182
+ "LABEL_1": 1
183
+ },
184
+ "length_penalty": 1.0,
185
+ "max_length": 20,
186
+ "max_num_grids": -1,
187
+ "min_length": 0,
188
+ "model_type": "qwen2_5_vl",
189
+ "no_repeat_ngram_size": 0,
190
+ "num_beam_groups": 1,
191
+ "num_beams": 1,
192
+ "num_heads": 16,
193
+ "num_return_sequences": 1,
194
+ "out_hidden_size": 5120,
195
+ "output_attentions": false,
196
+ "output_hidden_states": false,
197
+ "output_scores": false,
198
+ "pad_token_id": null,
199
+ "patch_size": 14,
200
+ "prefix": null,
201
+ "problem_type": null,
202
+ "pruned_heads": {},
203
+ "remove_invalid_values": false,
204
+ "repetition_penalty": 1.0,
205
+ "return_dict": true,
206
+ "return_dict_in_generate": false,
207
+ "sep_token_id": null,
208
+ "spatial_merge_size": 2,
209
+ "spatial_patch_size": 14,
210
+ "suppress_tokens": null,
211
+ "task_specific_params": null,
212
+ "temperature": 1.0,
213
+ "temporal_patch_size": 2,
214
+ "tf_legacy_loss": false,
215
+ "tie_encoder_decoder": false,
216
+ "tie_word_embeddings": true,
217
+ "tokenizer_class": null,
218
+ "tokens_per_second": 2,
219
+ "top_k": 50,
220
+ "top_p": 1.0,
221
+ "torch_dtype": "float32",
222
+ "torchscript": false,
223
+ "typical_p": 1.0,
224
+ "use_bfloat16": false,
225
+ "window_size": 112
226
+ },
227
+ "vision_input_chunk_size": null,
228
+ "vision_model_name_or_path": null
229
+ }
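
For reference, a quick sketch of pulling a few architecture facts out of this file (it is plain JSON, so no custom code is needed just to inspect it; the printed values restate fields shown above):

```python
import json

# Load the config.json from this repository.
with open("config.json") as f:
    cfg = json.load(f)

print(cfg["model_type"])                               # "vlm"
print(cfg["text_config"]["num_hidden_layers"])         # 72 decoder layers
print(cfg["text_config"]["num_key_value_heads"])       # 8 (grouped-query attention)
print(cfg["text_config"]["max_position_embeddings"])   # 131072 (~128K context)
print(cfg["vision_config"]["model_type"])              # "qwen2_5_vl" vision tower
```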
configuration_hyperclovax.py ADDED
@@ -0,0 +1,228 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """LLaMA model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+
24
+ # from transformers.modeling_rope_utils import rope_config_validation
25
+ # from transformers import PretrainedConfig, rope_config_validation
26
+
27
+
28
+ class HyperCLOVAXConfig(PretrainedConfig):
29
+ r"""
30
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
31
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
32
+ defaults will yield a similar configuration to that of the LLaMA-7B.
33
+
34
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
35
+ documentation from [`PretrainedConfig`] for more information.
36
+
37
+
38
+ Args:
39
+ vocab_size (`int`, *optional*, defaults to 32000):
40
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
41
+ `inputs_ids` passed when calling [`LlamaModel`]
42
+ hidden_size (`int`, *optional*, defaults to 4096):
43
+ Dimension of the hidden representations.
44
+ intermediate_size (`int`, *optional*, defaults to 11008):
45
+ Dimension of the MLP representations.
46
+ num_hidden_layers (`int`, *optional*, defaults to 32):
47
+ Number of hidden layers in the Transformer decoder.
48
+ num_attention_heads (`int`, *optional*, defaults to 32):
49
+ Number of attention heads for each attention layer in the Transformer decoder.
50
+ num_key_value_heads (`int`, *optional*):
51
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
52
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
53
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
54
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
55
+ by meanpooling all the original heads within that group. For more details checkout [this
56
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
57
+ `num_attention_heads`.
58
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
59
+ The non-linear activation function (function or string) in the decoder.
60
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
61
+ The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
62
+ Llama 2 up to 4096, CodeLlama up to 16384.
63
+ initializer_range (`float`, *optional*, defaults to 0.02):
64
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
65
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
66
+ The epsilon used by the rms normalization layers.
67
+ use_cache (`bool`, *optional*, defaults to `True`):
68
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
69
+ relevant if `config.is_decoder=True`.
70
+ pad_token_id (`int`, *optional*):
71
+ Padding token id.
72
+ bos_token_id (`int`, *optional*, defaults to 1):
73
+ Beginning of stream token id.
74
+ eos_token_id (`int`, *optional*, defaults to 2):
75
+ End of stream token id.
76
+ pretraining_tp (`int`, *optional*, defaults to 1):
77
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
78
+ document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
79
+ understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
80
+ results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
81
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
82
+ Whether to tie weight embeddings
83
+ rope_theta (`float`, *optional*, defaults to 10000.0):
84
+ The base period of the RoPE embeddings.
85
+ rope_scaling (`Dict`, *optional*):
86
+ Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type
87
+ and expect the model to work with a longer `max_position_embeddings`, we recommend updating this value
88
+ accordingly.
89
+ Expected contents:
90
+ `rope_type` (`str`):
91
+ The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
92
+ 'llama3'], with 'default' being the original RoPE implementation.
93
+ `factor` (`float`, *optional*):
94
+ Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
95
+ most scaling types, a `factor` of x will enable the model to handle sequences of length x *
96
+ original maximum pre-trained length.
97
+ `original_max_position_embeddings` (`int`, *optional*):
98
+ Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
99
+ pretraining.
100
+ `attention_factor` (`float`, *optional*):
101
+ Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
102
+ computation. If unspecified, it defaults to value recommended by the implementation, using the
103
+ `factor` field to infer the suggested value.
104
+ `beta_fast` (`float`, *optional*):
105
+ Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
106
+ ramp function. If unspecified, it defaults to 32.
107
+ `beta_slow` (`float`, *optional*):
108
+ Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
109
+ ramp function. If unspecified, it defaults to 1.
110
+ `short_factor` (`List[float]`, *optional*):
111
+ Only used with 'longrope'. The scaling factor to be applied to short contexts (<
112
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
113
+ size divided by the number of attention heads divided by 2
114
+ `long_factor` (`List[float]`, *optional*):
115
+ Only used with 'longrope'. The scaling factor to be applied to long contexts (>
116
+ `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
117
+ size divided by the number of attention heads divided by 2
118
+ `low_freq_factor` (`float`, *optional*):
119
+ Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
120
+ `high_freq_factor` (`float`, *optional*):
121
+ Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
122
+ attention_bias (`bool`, *optional*, defaults to `False`):
123
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
124
+ attention_dropout (`float`, *optional*, defaults to 0.0):
125
+ The dropout ratio for the attention probabilities.
126
+ mlp_bias (`bool`, *optional*, defaults to `False`):
127
+ Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
128
+ head_dim (`int`, *optional*):
129
+ The attention head dimension. If None, it will default to hidden_size // num_heads
130
+
131
+ ```python
132
+ >>> from transformers import LlamaModel, LlamaConfig
133
+
134
+ >>> # Initializing a LLaMA llama-7b style configuration
135
+ >>> configuration = LlamaConfig()
136
+
137
+ >>> # Initializing a model from the llama-7b style configuration
138
+ >>> model = LlamaModel(configuration)
139
+
140
+ >>> # Accessing the model configuration
141
+ >>> configuration = model.config
142
+ ```"""
143
+
144
+ model_type = "hyperclovax"
145
+ keys_to_ignore_at_inference = ["past_key_values"]
146
+
147
+ def __init__(
148
+ self,
149
+ vocab_size=32000,
150
+ hidden_size=4096,
151
+ intermediate_size=11008,
152
+ num_hidden_layers=32,
153
+ num_attention_heads=32,
154
+ num_key_value_heads=None,
155
+ hidden_act="silu",
156
+ max_position_embeddings=2048,
157
+ initializer_range=0.02,
158
+ rms_norm_eps=1e-6,
159
+ use_cache=True,
160
+ pad_token_id=None,
161
+ bos_token_id=1,
162
+ eos_token_id=2,
163
+ pretraining_tp=1,
164
+ tie_word_embeddings=False,
165
+ rope_theta=10000.0,
166
+ rope_scaling=None,
167
+ attention_bias=False,
168
+ attention_dropout=0.0,
169
+ mlp_bias=False,
170
+ head_dim=None,
171
+ embedding_multiplier=1.0, # mup
172
+ logits_scaling=1.0, # mup
173
+ attention_multiplier=1.0, # mup
174
+ residual_multiplier=1.0, # mup
175
+ use_post_norm=False, # post-norm
176
+ auto_map={
177
+ "AutoConfig": "configuration_hyperclovax.HyperCLOVAXConfig",
178
+ "AutoModel": "modeling_hyperclovax.HyperCLOVAXModel",
179
+ "AutoModelForCausalLM": "modeling_hyperclovax.HyperCLOVAXForCausalLM",
180
+ },
181
+ **kwargs,
182
+ ):
183
+ self.vocab_size = vocab_size
184
+ self.max_position_embeddings = max_position_embeddings
185
+ self.hidden_size = hidden_size
186
+ self.intermediate_size = intermediate_size
187
+ self.num_hidden_layers = num_hidden_layers
188
+ self.num_attention_heads = num_attention_heads
189
+
190
+ # for backward compatibility
191
+ if num_key_value_heads is None:
192
+ num_key_value_heads = num_attention_heads
193
+
194
+ self.num_key_value_heads = num_key_value_heads
195
+ self.hidden_act = hidden_act
196
+ self.initializer_range = initializer_range
197
+ self.rms_norm_eps = rms_norm_eps
198
+ self.pretraining_tp = pretraining_tp
199
+ self.use_cache = use_cache
200
+ self.rope_theta = rope_theta
201
+ self.rope_scaling = rope_scaling
202
+ self.attention_bias = attention_bias
203
+ self.attention_dropout = attention_dropout
204
+ self.mlp_bias = mlp_bias
205
+ self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
206
+ # Validate the correctness of rotary position embeddings parameters
207
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
208
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
209
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
210
+ # rope_config_validation(self)
211
+
212
+ # mup
213
+ self.embedding_multiplier = embedding_multiplier
214
+ self.logits_scaling = logits_scaling
215
+ self.attention_multiplier = attention_multiplier
216
+ self.residual_multiplier = residual_multiplier
217
+
218
+ # post-norm (dual-norm)
219
+ self.use_post_norm = use_post_norm
220
+
221
+ super().__init__(
222
+ pad_token_id=pad_token_id,
223
+ bos_token_id=bos_token_id,
224
+ eos_token_id=eos_token_id,
225
+ tie_word_embeddings=tie_word_embeddings,
226
+ auto_map=auto_map,
227
+ **kwargs,
228
+ )
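
A minimal usage sketch for the configuration class above, assuming `configuration_hyperclovax.py` is importable from the working directory; every value below is an illustrative placeholder rather than the released checkpoint's setting.

from configuration_hyperclovax import HyperCLOVAXConfig

# Toy configuration; the released checkpoint ships its own config.json.
config = HyperCLOVAXConfig(
    vocab_size=32000,
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=4,
    num_attention_heads=8,
    num_key_value_heads=2,            # GQA: 8 query heads share 2 KV heads
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
    use_post_norm=True,               # enable the dual-norm residual path
)
print(config.head_dim)                # 128 = hidden_size // num_attention_heads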
configuration_vlm.py ADDED
@@ -0,0 +1,111 @@
1
+ import transformers
2
+ from transformers import AutoConfig, PretrainedConfig
3
+
4
+
5
+ class HCXVisionConfig(PretrainedConfig):
6
+ model_type = "vlm"
7
+ keys_to_ignore_at_inference = ["past_key_values"]
8
+
9
+ def __init__(
10
+ self,
11
+ text_config=None,
12
+ vision_config=None,
13
+ text_model_name_or_path=None,
14
+ vision_model_name_or_path=None,
15
+ q_former_model_name_or_path=None,
16
+ mm_projector_type="mlp",
17
+ use_nth_layer=-2,
18
+ img_start_id=100271, # <|IMAGE_PAD|>
19
+ video_start_id=100270, # <|VIDEO_PAD|>
20
+ freeze_encoder=False,
21
+ freeze_decoder=False,
22
+ freeze_mm_projector=False,
23
+ anyres=False,
24
+ unpad=False,
25
+ max_num_grids=-1,
26
+ num_queries_vis_abstractor=-1,
27
+ video_num_queries_fast=None,
28
+ video_num_queries_slow=None,
29
+ video_first_last_frames_slows=None,
30
+ video_max_num_frames=None,
31
+ ignore_index=-100,
32
+ proj_pos_emb=True,
33
+ proj_prenorm=False,
34
+ use_1x1_grid=False,
35
+ possible_resolutions=[],
36
+ **kwargs,
37
+ ):
38
+ from transformers import CONFIG_MAPPING
39
+
40
+ if kwargs.get("language_config", None) is not None: # for bc
41
+ text_config = CONFIG_MAPPING[kwargs["language_config"]["model_type"]](**kwargs["language_config"])
42
+ elif text_config is None and text_model_name_or_path is not None:
43
+ text_config = AutoConfig.from_pretrained(text_model_name_or_path, trust_remote_code=True)
44
+ if vision_config is None and vision_model_name_or_path is not None:
45
+ vision_config = AutoConfig.from_pretrained(vision_model_name_or_path, trust_remote_code=True)
46
+
47
+ if isinstance(text_config, dict):
48
+ text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
49
+
50
+ if isinstance(vision_config, dict):
51
+ if vision_config["model_type"] == "qwen2_5_vl":
52
+ vision_config["model_type"] = "qwen2_5_vl_visual"
53
+ assert transformers.__version__ >= "4.52.4", "please upgrade transformers to 4.52.4 or higher"
54
+ vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
55
+
56
+ self.text_config = text_config
57
+ self.vision_config = vision_config
58
+
59
+ if text_config is not None:
60
+ # DeepSpeed ZeRO-3 determines memory allocation automatically from the config's hidden_size.
61
+ self.hidden_size = text_config.hidden_size if hasattr(text_config, "hidden_size") else text_config.n_embd
62
+ # add VLM configs
63
+ self.text_model_name_or_path = text_model_name_or_path
64
+ self.vision_model_name_or_path = vision_model_name_or_path
65
+ self.q_former_model_name_or_path = q_former_model_name_or_path
66
+ self.mm_projector_type = mm_projector_type
67
+ self.use_nth_layer = use_nth_layer
68
+ self.freeze_encoder = freeze_encoder
69
+ self.freeze_decoder = freeze_decoder
70
+ self.freeze_mm_projector = freeze_mm_projector
71
+ self.anyres = anyres
72
+ self.unpad = unpad
73
+ self.max_num_grids = max_num_grids
74
+ self.num_queries_vis_abstractor = num_queries_vis_abstractor
75
+ self.video_num_queries_fast = video_num_queries_fast
76
+ self.video_num_queries_slow = video_num_queries_slow
77
+ self.video_first_last_frames_slows = video_first_last_frames_slows
78
+ self.video_max_num_frames = video_max_num_frames
79
+ self.img_start_id = img_start_id
80
+ self.image_token_id = img_start_id
81
+ self.video_start_id = video_start_id
82
+ self.video_token_id = video_start_id
83
+ self.ignore_index = ignore_index
84
+ self.proj_pos_emb = proj_pos_emb
85
+ self.proj_prenorm = proj_prenorm
86
+ self.use_1x1_grid = use_1x1_grid
87
+ self.possible_resolutions = possible_resolutions
88
+ super().__init__(**kwargs)
89
+ if self.text_config is not None: # needed for HCXVisionForSequenceClassification
90
+ self.pad_token_id = self.text_config.pad_token_id
91
+
92
+
93
+ AutoConfig.register("vlm", HCXVisionConfig)
94
+ try:
95
+ from .configuration_hyperclovax import HyperCLOVAXConfig
96
+
97
+ AutoConfig.register("hyperclovax", HyperCLOVAXConfig)
98
+ except Exception:  # best-effort registration; the relative import may fail outside a package
99
+ pass
100
+ try:
101
+ from transformers import CONFIG_MAPPING, MODEL_MAPPING
102
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
103
+ Qwen2_5_VisionTransformerPretrainedModel,
104
+ Qwen2_5_VLPatchMerger,
105
+ Qwen2_5_VLVisionConfig,
106
+ )
107
+
108
+ MODEL_MAPPING.register(Qwen2_5_VLVisionConfig, Qwen2_5_VisionTransformerPretrainedModel)
109
+ CONFIG_MAPPING.register("qwen2_5_vl_visual", Qwen2_5_VLVisionConfig)
110
+ except Exception:  # qwen2_5_vl visual classes are optional; skip if this transformers version lacks them
111
+ pass
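
A sketch of composing the vision-language config above from a plain text sub-config dict, assuming `configuration_vlm.py` is importable locally; `"llama"` stands in for any model_type already present in `CONFIG_MAPPING`, and the numbers are placeholders.

from configuration_vlm import HCXVisionConfig

vlm_config = HCXVisionConfig(
    text_config={"model_type": "llama", "hidden_size": 1024,
                 "num_hidden_layers": 4, "num_attention_heads": 8},
    mm_projector_type="mlp",
    use_nth_layer=-2,                 # take the second-to-last vision layer
    img_start_id=100271,              # <|IMAGE_PAD|>
    video_start_id=100270,            # <|VIDEO_PAD|>
)
print(vlm_config.hidden_size)         # 1024, copied from the text sub-config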
generation_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 128000,
4
+ "eos_token_id": 128001,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.52.4",
7
+ "use_cache": false
8
+ }
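
The file above only stores generation defaults; they are read back through `GenerationConfig` at load time. A short sketch, with `path/to/checkpoint` as a placeholder for the local snapshot directory:

from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("path/to/checkpoint")
print(gen_cfg.bos_token_id, gen_cfg.eos_token_id)   # 128000 128001
gen_cfg.max_new_tokens = 512    # per-call overrides can still be passed to generate()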
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6ce534529a94e7ec564597d53844faa017ecc8df3976a019dfa199326850245d
3
+ size 4950209576
model-00002-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17d38e682e956b69951fe2ac43aaeaf7e71041de1b6d63d2b5c2202493feacdc
3
+ size 4770540848
model-00003-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1bf73569b760f3c1c7d75b911f26c8d18997052907d5f90e76b59d1630f6804b
3
+ size 4819716376
model-00004-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b8358e640759c2b251d38d3f0b39275d26ece4500d090e7f94e781f835479643
3
+ size 4867797112
model-00005-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:22983b512ee1d72c8b656a9c40f901b623ebfa8340cfd7e5e977fb64277339ed
3
+ size 4983871936
model-00006-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8da16a0bc916243c672232ee00e51f0dd928a8f9c4586622896075b1ccab3ce7
3
+ size 4782422992
model-00007-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3618232ec5fd2dac853df8d3e692e077b573f20ae62a5545856831c45e0928a3
3
+ size 4977005552
model-00008-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3a5eb08aba7d9623e01c92a142927524270ae28c699d44f5eb19936ee49d8039
3
+ size 4909855088
model-00009-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7643f4d4dcbbe4915ab161ff09badafbc575c9e500ae6c7edc7a1e27d0d71793
3
+ size 4858416608
model-00010-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3fcaa327125a328f34bb53f06cea5ab319f6fb9e9b02772a0eace5a3cf0580c7
3
+ size 4937737048
model-00011-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cddf84315c9c0d793223ce7643a670e920791d76ff7438145878bfa8a21a6a83
3
+ size 4817347272
model-00012-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e2f015222bf0fef5b9f078b6290ad161b1cb450da7ac0cf481e59c29b4105f26
3
+ size 4824605160
model-00013-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2641a78bd72c036ed15f6e074b3e434a7f72c851854931bed431f3c06a9ac6fd
3
+ size 4976322456
model-00014-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42af553c74487cd4b9c1344dc37b09f254bcc6ff7dc6b60ec1a414982460338e
3
+ size 3151109544
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_hyperclovax.py ADDED
@@ -0,0 +1,1866 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ import math
21
+ from typing import List, Optional, Tuple, Union
22
+
23
+ import torch
24
+ import torch.nn.functional as F
25
+ import torch.utils.checkpoint
26
+ from torch import nn
27
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
28
+ from transformers.activations import ACT2FN
29
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
30
+ from transformers.generation import GenerationMixin
31
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
32
+ from transformers.modeling_flash_attention_utils import _flash_attention_forward
33
+ from transformers.modeling_outputs import (
34
+ BaseModelOutputWithPast,
35
+ CausalLMOutputWithPast,
36
+ QuestionAnsweringModelOutput,
37
+ SequenceClassifierOutputWithPast,
38
+ TokenClassifierOutput,
39
+ )
40
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS
41
+ from transformers.modeling_utils import PreTrainedModel
42
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
43
+ from transformers.utils import (
44
+ add_start_docstrings,
45
+ add_start_docstrings_to_model_forward,
46
+ is_flash_attn_greater_or_equal_2_10,
47
+ is_torchdynamo_compiling,
48
+ logging,
49
+ replace_return_docstrings,
50
+ )
51
+
52
+ from .configuration_hyperclovax import HyperCLOVAXConfig
53
+
54
+ logger = logging.get_logger(__name__)
55
+
56
+ _CONFIG_FOR_DOC = "HyperCLOVAXConfig"
57
+
58
+
59
+ def _prepare_4d_causal_attention_mask_with_cache_position(
60
+ attention_mask: torch.Tensor,
61
+ sequence_length: int,
62
+ target_length: int,
63
+ dtype: torch.dtype,
64
+ device: torch.device,
65
+ min_dtype: float,
66
+ cache_position: torch.Tensor,
67
+ batch_size: int,
68
+ ):
69
+ """
70
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
71
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
72
+
73
+ Args:
74
+ attention_mask (`torch.Tensor`):
75
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
76
+ sequence_length (`int`):
77
+ The sequence length being processed.
78
+ target_length (`int`):
79
+ The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
80
+ dtype (`torch.dtype`):
81
+ The dtype to use for the 4D attention mask.
82
+ device (`torch.device`):
83
+ The device to place the 4D attention mask on.
84
+ min_dtype (`float`):
85
+ The minimum value representable with the dtype `dtype`.
86
+ cache_position (`torch.Tensor`):
87
+ Indices depicting the position of the input sequence tokens in the sequence.
88
+ batch_size (`torch.Tensor`):
89
+ Batch size.
90
+ """
91
+ if attention_mask is not None and attention_mask.dim() == 4:
92
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
93
+ causal_mask = attention_mask
94
+ else:
95
+ causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
96
+ if sequence_length != 1:
97
+ causal_mask = torch.triu(causal_mask, diagonal=1)
98
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
99
+ causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
100
+ if attention_mask is not None:
101
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
102
+ mask_length = attention_mask.shape[-1]
103
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
104
+ padding_mask = padding_mask == 0
105
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(padding_mask, min_dtype)
106
+
107
+ return causal_mask
108
+
109
+
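
As a quick illustration of the helper above (an editorial sketch with toy tensors; the import assumes the repository files are importable, so adjust it to your setup): a 2D padding mask is expanded into an additive 4D causal mask whose blocked entries hold `min_dtype`.

import torch
from modeling_hyperclovax import _prepare_4d_causal_attention_mask_with_cache_position  # assumed import path

mask_2d = torch.tensor([[0, 1, 1], [1, 1, 1]])   # 0 marks a padded position
causal = _prepare_4d_causal_attention_mask_with_cache_position(
    attention_mask=mask_2d,
    sequence_length=3,
    target_length=3,
    dtype=torch.float32,
    device=torch.device("cpu"),
    min_dtype=torch.finfo(torch.float32).min,
    cache_position=torch.arange(3),
    batch_size=2,
)
print(causal.shape)   # torch.Size([2, 1, 3, 3])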
110
+ class HyperCLOVAXRMSNorm(nn.Module):
111
+ def __init__(self, hidden_size, eps=1e-6):
112
+ """
113
+ HyperCLOVAXRMSNorm is equivalent to T5LayerNorm
114
+ """
115
+ super().__init__()
116
+ self.weight = nn.Parameter(torch.ones(hidden_size))
117
+ self.variance_epsilon = eps
118
+
119
+ def forward(self, hidden_states):
120
+ input_dtype = hidden_states.dtype
121
+ hidden_states = hidden_states.to(torch.float32)
122
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
123
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
124
+ return self.weight * hidden_states.to(input_dtype)
125
+
126
+ def extra_repr(self):
127
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
128
+
129
+
130
+ ALL_LAYERNORM_LAYERS.append(HyperCLOVAXRMSNorm)
131
+
132
+
133
+ class HyperCLOVAXRotaryEmbedding(nn.Module):
134
+ def __init__(
135
+ self,
136
+ dim=None,
137
+ max_position_embeddings=2048,
138
+ base=10000,
139
+ device=None,
140
+ scaling_factor=1.0,
141
+ rope_type="default",
142
+ config: Optional[HyperCLOVAXConfig] = None,
143
+ ):
144
+ super().__init__()
145
+ # TODO (joao): remove the `if` below, only used for BC
146
+ self.rope_kwargs = {}
147
+ if config is None:
148
+ logger.warning_once(
149
+ "`HyperCLOVAXRotaryEmbedding` can now be fully parameterized by passing the model config through the "
150
+ "`config` argument. All other arguments will be removed in v4.46"
151
+ )
152
+ self.rope_kwargs = {
153
+ "rope_type": rope_type,
154
+ "factor": scaling_factor,
155
+ "dim": dim,
156
+ "base": base,
157
+ "max_position_embeddings": max_position_embeddings,
158
+ }
159
+ self.rope_type = rope_type
160
+ self.max_seq_len_cached = max_position_embeddings
161
+ self.original_max_seq_len = max_position_embeddings
162
+ else:
163
+ # BC: "rope_type" was originally "type"
164
+ if config.rope_scaling is not None:
165
+ self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
166
+ else:
167
+ self.rope_type = "default"
168
+ self.max_seq_len_cached = config.max_position_embeddings
169
+ self.original_max_seq_len = config.max_position_embeddings
170
+
171
+ self.config = config
172
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
173
+
174
+ inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, **self.rope_kwargs)
175
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
176
+ self.original_inv_freq = self.inv_freq
177
+
178
+ def _dynamic_frequency_update(self, position_ids, device):
179
+ """
180
+ dynamic RoPE layers should recompute `inv_freq` in the following situations:
181
+ 1 - growing beyond the cached sequence length (allow scaling)
182
+ 2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
183
+ """
184
+ seq_len = torch.max(position_ids) + 1
185
+ if seq_len > self.max_seq_len_cached: # growth
186
+ inv_freq, self.attention_scaling = self.rope_init_fn(
187
+ self.config, device, seq_len=seq_len, **self.rope_kwargs
188
+ )
189
+ self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: may break with compilation
190
+ self.max_seq_len_cached = seq_len
191
+
192
+ if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len: # reset
193
+ self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
194
+ self.max_seq_len_cached = self.original_max_seq_len
195
+
196
+ @torch.no_grad()
197
+ def forward(self, x, position_ids):
198
+ if "dynamic" in self.rope_type:
199
+ self._dynamic_frequency_update(position_ids, device=x.device)
200
+
201
+ # Core RoPE block
202
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
203
+ position_ids_expanded = position_ids[:, None, :].float()
204
+ # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
205
+ device_type = x.device.type
206
+ device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
207
+ with torch.autocast(device_type=device_type, enabled=False):
208
+ freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
209
+ emb = torch.cat((freqs, freqs), dim=-1)
210
+ cos = emb.cos()
211
+ sin = emb.sin()
212
+
213
+ # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
214
+ cos = cos * self.attention_scaling
215
+ sin = sin * self.attention_scaling
216
+
217
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
218
+
219
+
220
+ class HyperCLOVAXLinearScalingRotaryEmbedding(HyperCLOVAXRotaryEmbedding):
221
+ """HyperCLOVAXRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
222
+
223
+ def __init__(self, *args, **kwargs):
224
+ logger.warning_once(
225
+ "`HyperCLOVAXLinearScalingRotaryEmbedding` is deprecated and will be removed in v4.46. Please use "
226
+ "`HyperCLOVAXRotaryEmbedding`, which now also does linear scaling (simply pass the model config to __init__)."
227
+ )
228
+ kwargs["rope_type"] = "linear"
229
+ super().__init__(*args, **kwargs)
230
+
231
+
232
+ class HyperCLOVAXDynamicNTKScalingRotaryEmbedding(HyperCLOVAXRotaryEmbedding):
233
+ """HyperCLOVAXRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
234
+
235
+ def __init__(self, *args, **kwargs):
236
+ logger.warning_once(
237
+ "`HyperCLOVAXDynamicNTKScalingRotaryEmbedding` is deprecated and will be removed in v4.46. Please use "
238
+ "`HyperCLOVAXRotaryEmbedding`, which now also does dynamic ntk scaling (simply pass the model config to "
239
+ "__init__)."
240
+ )
241
+ kwargs["rope_type"] = "dynamic"
242
+ super().__init__(*args, **kwargs)
243
+
244
+
245
+ def rotate_half(x):
246
+ """Rotates half the hidden dims of the input."""
247
+ x1 = x[..., : x.shape[-1] // 2]
248
+ x2 = x[..., x.shape[-1] // 2 :]
249
+ return torch.cat((-x2, x1), dim=-1)
250
+
251
+
252
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
253
+ """Applies Rotary Position Embedding to the query and key tensors.
254
+
255
+ Args:
256
+ q (`torch.Tensor`): The query tensor.
257
+ k (`torch.Tensor`): The key tensor.
258
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
259
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
260
+ position_ids (`torch.Tensor`, *optional*):
261
+ Deprecated and unused.
262
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
263
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
264
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
265
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
266
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
267
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
268
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
269
+ Returns:
270
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
271
+ """
272
+ cos = cos.unsqueeze(unsqueeze_dim)
273
+ sin = sin.unsqueeze(unsqueeze_dim)
274
+ q_embed = (q * cos) + (rotate_half(q) * sin)
275
+ k_embed = (k * cos) + (rotate_half(k) * sin)
276
+ return q_embed, k_embed
277
+
278
+
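
A small shape-level sketch of `apply_rotary_pos_emb` with the default `unsqueeze_dim=1` layout described in its docstring; the frequencies mirror what `HyperCLOVAXRotaryEmbedding.forward` computes (toy sizes, editorial illustration only; adjust the import to however the file is importable in your setup).

import torch
from modeling_hyperclovax import apply_rotary_pos_emb  # assumed import path

batch, heads, seq_len, head_dim = 1, 2, 4, 8
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, head_dim // 2)
emb = torch.cat((freqs, freqs), dim=-1)                        # (seq_len, head_dim)
cos, sin = emb.cos()[None], emb.sin()[None]                    # (1, seq_len, head_dim)

q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)            # cos/sin broadcast over the heads dim
assert q_rot.shape == q.shape and k_rot.shape == k.shape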
279
+ class HyperCLOVAXMLP(nn.Module):
280
+ def __init__(self, config):
281
+ super().__init__()
282
+ self.config = config
283
+ self.hidden_size = config.hidden_size
284
+ self.intermediate_size = config.intermediate_size
285
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
286
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
287
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
288
+ self.act_fn = ACT2FN[config.hidden_act]
289
+
290
+ def forward(self, x):
291
+ if self.config.pretraining_tp > 1:
292
+ slice = self.intermediate_size // self.config.pretraining_tp
293
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
294
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
295
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
296
+
297
+ gate_proj = torch.cat([F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
298
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
299
+
300
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
301
+ down_proj = [
302
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
303
+ ]
304
+ down_proj = sum(down_proj)
305
+ else:
306
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
307
+
308
+ return down_proj
309
+
310
+
311
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
312
+ """
313
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
314
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
315
+ """
316
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
317
+ if n_rep == 1:
318
+ return hidden_states
319
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
320
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
321
+
322
+
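
The equivalence claimed in the `repeat_kv` docstring can be checked directly (editorial sketch, toy sizes; adjust the import to your setup).

import torch
from modeling_hyperclovax import repeat_kv  # assumed import path

kv = torch.randn(2, 4, 16, 64)        # (batch, num_key_value_heads, seq_len, head_dim)
expanded = repeat_kv(kv, n_rep=3)     # -> (2, 12, 16, 64)
assert torch.equal(expanded, torch.repeat_interleave(kv, repeats=3, dim=1))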
323
+ class HyperCLOVAXAttention(nn.Module):
324
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
325
+
326
+ def __init__(self, config: HyperCLOVAXConfig, layer_idx: Optional[int] = None):
327
+ super().__init__()
328
+ self.config = config
329
+ self.layer_idx = layer_idx
330
+ if layer_idx is None:
331
+ logger.warning_once(
332
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
333
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
334
+ "when creating this class."
335
+ )
336
+
337
+ self.attention_dropout = config.attention_dropout
338
+ self.hidden_size = config.hidden_size
339
+ self.num_heads = config.num_attention_heads
340
+ self.head_dim = getattr(config, "head_dim", self.hidden_size // self.num_heads)
341
+ self.num_key_value_heads = config.num_key_value_heads
342
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
343
+ self.max_position_embeddings = config.max_position_embeddings
344
+ self.rope_theta = config.rope_theta
345
+ self.is_causal = True
346
+
347
+ self.scaling = config.attention_multiplier
348
+
349
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
350
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
351
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
352
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
353
+
354
+ # TODO (joao): remove in v4.46 (RoPE is computed in the model, not in the decoder layers)
355
+ self.rotary_emb = HyperCLOVAXRotaryEmbedding(config=self.config)
356
+
357
+ def forward(
358
+ self,
359
+ hidden_states: torch.Tensor,
360
+ attention_mask: Optional[torch.Tensor] = None,
361
+ position_ids: Optional[torch.LongTensor] = None,
362
+ past_key_value: Optional[Cache] = None,
363
+ output_attentions: bool = False,
364
+ use_cache: bool = False,
365
+ cache_position: Optional[torch.LongTensor] = None,
366
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
367
+ **kwargs,
368
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
369
+ bsz, q_len, _ = hidden_states.size()
370
+
371
+ if self.config.pretraining_tp > 1:
372
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
373
+ query_slices = self.q_proj.weight.split(
374
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
375
+ )
376
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
377
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
378
+
379
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
380
+ query_states = torch.cat(query_states, dim=-1)
381
+
382
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
383
+ key_states = torch.cat(key_states, dim=-1)
384
+
385
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
386
+ value_states = torch.cat(value_states, dim=-1)
387
+
388
+ else:
389
+ query_states = self.q_proj(hidden_states)
390
+ key_states = self.k_proj(hidden_states)
391
+ value_states = self.v_proj(hidden_states)
392
+
393
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
394
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
395
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
396
+
397
+ if position_embeddings is None:
398
+ logger.warning_once(
399
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
400
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
401
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
402
+ "removed and `position_embeddings` will be mandatory."
403
+ )
404
+ cos, sin = self.rotary_emb(value_states, position_ids)
405
+ else:
406
+ cos, sin = position_embeddings
407
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
408
+
409
+ if past_key_value is not None:
410
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
411
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
412
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
413
+
414
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
415
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
416
+ # attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scaling / math.sqrt(self.head_dim)
417
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scaling
418
+
419
+ if attention_mask is not None: # no matter the length, we just slice it
420
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
421
+ attn_weights = attn_weights + causal_mask
422
+
423
+ # upcast attention to fp32
424
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
425
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
426
+ attn_output = torch.matmul(attn_weights, value_states)
427
+
428
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
429
+ raise ValueError(
430
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
431
+ f" {attn_output.size()}"
432
+ )
433
+
434
+ attn_output = attn_output.transpose(1, 2).contiguous()
435
+
436
+ attn_output = attn_output.reshape(bsz, q_len, -1)
437
+
438
+ if self.config.pretraining_tp > 1:
439
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
440
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
441
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
442
+ else:
443
+ attn_output = self.o_proj(attn_output)
444
+
445
+ if not output_attentions:
446
+ attn_weights = None
447
+
448
+ return attn_output, attn_weights, past_key_value
449
+
450
+
451
+ class HyperCLOVAXFlashAttention2(HyperCLOVAXAttention):
452
+ """
453
+ HyperCLOVAX flash attention module. This module inherits from `HyperCLOVAXAttention` as the weights of the module stay
454
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
455
+ flash attention and deal with padding tokens in case the input contains any of them.
456
+ """
457
+
458
+ def __init__(self, *args, **kwargs):
459
+ super().__init__(*args, **kwargs)
460
+
461
+ # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
462
+ # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
463
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
464
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
465
+
466
+ def forward(
467
+ self,
468
+ hidden_states: torch.Tensor,
469
+ attention_mask: Optional[torch.LongTensor] = None,
470
+ position_ids: Optional[torch.LongTensor] = None,
471
+ past_key_value: Optional[Cache] = None,
472
+ output_attentions: bool = False,
473
+ use_cache: bool = False,
474
+ cache_position: Optional[torch.LongTensor] = None,
475
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
476
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
477
+ if isinstance(past_key_value, StaticCache):
478
+ raise ValueError(
479
+ "`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` "
480
+ "make sure to use `sdpa` in the mean time, and open an issue at https://github.com/huggingface/transformers"
481
+ )
482
+
483
+ output_attentions = False
484
+
485
+ bsz, q_len, _ = hidden_states.size()
486
+
487
+ query_states = self.q_proj(hidden_states)
488
+ key_states = self.k_proj(hidden_states)
489
+ value_states = self.v_proj(hidden_states)
490
+
491
+ # Flash attention requires the input to have the shape
492
+ # batch_size x seq_length x num_heads x head_dim
493
+ # therefore we just need to keep the original shape
494
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
495
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
496
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
497
+
498
+ if position_embeddings is None:
499
+ logger.warning_once(
500
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
501
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
502
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
503
+ "removed and `position_embeddings` will be mandatory."
504
+ )
505
+ cos, sin = self.rotary_emb(value_states, position_ids)
506
+ else:
507
+ cos, sin = position_embeddings
508
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
509
+
510
+ if past_key_value is not None:
511
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
512
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
513
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
514
+
515
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
516
+ # to be able to avoid many of these transpose/reshape/view.
517
+ query_states = query_states.transpose(1, 2)
518
+ key_states = key_states.transpose(1, 2)
519
+ value_states = value_states.transpose(1, 2)
520
+
521
+ dropout_rate = self.attention_dropout if self.training else 0.0
522
+
523
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
524
+ # therefore the input hidden states get silently cast to float32. Hence, we need to
525
+ # cast them back in the correct dtype just to be sure everything works as expected.
526
+ # This might slow down training & inference, so it is recommended not to cast the LayerNorms
527
+ # in fp32. (HyperCLOVAXRMSNorm handles it correctly)
528
+
529
+ input_dtype = query_states.dtype
530
+ if input_dtype == torch.float32:
531
+ if torch.is_autocast_enabled():
532
+ target_dtype = torch.get_autocast_gpu_dtype()
533
+ # Handle the case where the model is quantized
534
+ elif hasattr(self.config, "_pre_quantization_dtype"):
535
+ target_dtype = self.config._pre_quantization_dtype
536
+ else:
537
+ target_dtype = self.q_proj.weight.dtype
538
+
539
+ logger.warning_once(
540
+ f"The input hidden states seem to be silently cast to float32; this might be related to"
541
+ f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
542
+ f" {target_dtype}."
543
+ )
544
+
545
+ query_states = query_states.to(target_dtype)
546
+ key_states = key_states.to(target_dtype)
547
+ value_states = value_states.to(target_dtype)
548
+
549
+ attn_output = _flash_attention_forward(
550
+ query_states,
551
+ key_states,
552
+ value_states,
553
+ attention_mask,
554
+ q_len,
555
+ position_ids=position_ids,
556
+ dropout=dropout_rate,
557
+ softmax_scale=self.scaling, # mup
558
+ sliding_window=getattr(self, "sliding_window", None),
559
+ use_top_left_mask=self._flash_attn_uses_top_left_mask,
560
+ is_causal=self.is_causal,
561
+ )
562
+
563
+ attn_output = attn_output.reshape(bsz, q_len, -1).contiguous()
564
+ attn_output = self.o_proj(attn_output)
565
+
566
+ if not output_attentions:
567
+ attn_weights = None
568
+
569
+ return attn_output, attn_weights, past_key_value
570
+
571
+
572
+ class HyperCLOVAXSdpaAttention(HyperCLOVAXAttention):
573
+ """
574
+ HyperCLOVAX attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
575
+ `HyperCLOVAXAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
576
+ SDPA API.
577
+ """
578
+
579
+ # Adapted from HyperCLOVAXAttention.forward
580
+ def forward(
581
+ self,
582
+ hidden_states: torch.Tensor,
583
+ attention_mask: Optional[torch.Tensor] = None,
584
+ position_ids: Optional[torch.LongTensor] = None,
585
+ past_key_value: Optional[Cache] = None,
586
+ output_attentions: bool = False,
587
+ use_cache: bool = False,
588
+ cache_position: Optional[torch.LongTensor] = None,
589
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
590
+ **kwargs,
591
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
592
+ if output_attentions:
593
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
594
+ logger.warning_once(
595
+ "HyperCLOVAXModel is using HyperCLOVAXSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
596
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
597
+ )
598
+ return super().forward(
599
+ hidden_states=hidden_states,
600
+ attention_mask=attention_mask,
601
+ position_ids=position_ids,
602
+ past_key_value=past_key_value,
603
+ output_attentions=output_attentions,
604
+ use_cache=use_cache,
605
+ cache_position=cache_position,
606
+ position_embeddings=position_embeddings,
607
+ )
608
+
609
+ bsz, q_len, _ = hidden_states.size()
610
+
611
+ query_states = self.q_proj(hidden_states)
612
+ key_states = self.k_proj(hidden_states)
613
+ value_states = self.v_proj(hidden_states)
614
+
615
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
616
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
617
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
618
+
619
+ if position_embeddings is None:
620
+ logger.warning_once(
621
+ "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
622
+ "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
623
+ "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
624
+ "removed and `position_embeddings` will be mandatory."
625
+ )
626
+ cos, sin = self.rotary_emb(value_states, position_ids)
627
+ else:
628
+ cos, sin = position_embeddings
629
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
630
+
631
+ if past_key_value is not None:
632
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
633
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
634
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
635
+
636
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
637
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
638
+
639
+ causal_mask = attention_mask
640
+ if attention_mask is not None:
641
+ causal_mask = causal_mask[:, :, :, : key_states.shape[-2]]
642
+
643
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
644
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
645
+ if query_states.device.type == "cuda" and causal_mask is not None:
646
+ query_states = query_states.contiguous()
647
+ key_states = key_states.contiguous()
648
+ value_states = value_states.contiguous()
649
+
650
+ # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
651
+ # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
652
+ is_causal = True if causal_mask is None and q_len > 1 else False
653
+
654
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
655
+ query_states,
656
+ key_states,
657
+ value_states,
658
+ attn_mask=causal_mask,
659
+ dropout_p=self.attention_dropout if self.training else 0.0,
660
+ is_causal=is_causal,
661
+ scale=self.scaling, # mup
662
+ )
663
+
664
+ attn_output = attn_output.transpose(1, 2).contiguous()
665
+ attn_output = attn_output.view(bsz, q_len, -1)
666
+
667
+ attn_output = self.o_proj(attn_output)
668
+
669
+ return attn_output, None, past_key_value
670
+
671
+
672
+ HyperCLOVAX_ATTENTION_CLASSES = {
673
+ "eager": HyperCLOVAXAttention,
674
+ "flash_attention_2": HyperCLOVAXFlashAttention2,
675
+ "sdpa": HyperCLOVAXSdpaAttention,
676
+ }
677
+
678
+
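
Which of the three classes above a decoder layer receives is driven by `config._attn_implementation`, which `from_pretrained` sets from its `attn_implementation` argument. A hedged sketch, assuming the usual Llama-style `model.layers` layout and using `path/to/checkpoint` as a placeholder:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/checkpoint",
    attn_implementation="sdpa",       # or "eager" / "flash_attention_2"
    trust_remote_code=True,           # pulls in this repository's remote-code classes
)
print(type(model.model.layers[0].self_attn).__name__)   # HyperCLOVAXSdpaAttention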
679
+ class HyperCLOVAXDecoderLayer(nn.Module):
680
+ def __init__(self, config: HyperCLOVAXConfig, layer_idx: int):
681
+ super().__init__()
682
+ self.hidden_size = config.hidden_size
683
+
684
+ self.self_attn = HyperCLOVAX_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
685
+
686
+ self.mlp = HyperCLOVAXMLP(config)
687
+ self.input_layernorm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
688
+ self.post_attention_layernorm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
689
+
690
+ # post-norm (dual-norm)
691
+ self.use_post_norm = config.use_post_norm
692
+ if self.use_post_norm:
693
+ self.post_norm1 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
694
+ self.post_norm2 = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
695
+
696
+ self.residual_multiplier = config.residual_multiplier # mup
697
+
698
+ def forward(
699
+ self,
700
+ hidden_states: torch.Tensor,
701
+ attention_mask: Optional[torch.Tensor] = None,
702
+ position_ids: Optional[torch.LongTensor] = None,
703
+ past_key_value: Optional[Cache] = None,
704
+ output_attentions: Optional[bool] = False,
705
+ use_cache: Optional[bool] = False,
706
+ cache_position: Optional[torch.LongTensor] = None,
707
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # will become mandatory in v4.46
708
+ **kwargs,
709
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
710
+ """
711
+ Args:
712
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
713
+ attention_mask (`torch.FloatTensor`, *optional*):
714
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
715
+ query_sequence_length, key_sequence_length)` if default attention is used.
716
+ output_attentions (`bool`, *optional*):
717
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
718
+ returned tensors for more detail.
719
+ use_cache (`bool`, *optional*):
720
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
721
+ (see `past_key_values`).
722
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
723
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
724
+ Indices depicting the position of the input sequence tokens in the sequence
725
+ position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
726
+ Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
727
+ with `head_dim` being the embedding dimension of each attention head.
728
+ kwargs (`dict`, *optional*):
729
+ Arbitrary kwargs to be ignored, used for FSDP and other methods that injects code
730
+ into the model
731
+ """
732
+ residual = hidden_states
733
+
734
+ hidden_states = self.input_layernorm(hidden_states)
735
+
736
+ # Self Attention
737
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
738
+ hidden_states=hidden_states,
739
+ attention_mask=attention_mask,
740
+ position_ids=position_ids,
741
+ past_key_value=past_key_value,
742
+ output_attentions=output_attentions,
743
+ use_cache=use_cache,
744
+ cache_position=cache_position,
745
+ position_embeddings=position_embeddings,
746
+ **kwargs,
747
+ )
748
+
749
+ if self.use_post_norm:
750
+ hidden_states = self.post_norm1(hidden_states)
751
+
752
+ hidden_states = residual + hidden_states * self.residual_multiplier # mup
753
+
754
+ # Fully Connected
755
+ residual = hidden_states
756
+ hidden_states = self.post_attention_layernorm(hidden_states)
757
+ hidden_states = self.mlp(hidden_states)
758
+
759
+ if self.use_post_norm:
760
+ hidden_states = self.post_norm2(hidden_states)
761
+
762
+ hidden_states = residual + hidden_states * self.residual_multiplier # mup
763
+
764
+ outputs = (hidden_states,)
765
+
766
+ if output_attentions:
767
+ outputs += (self_attn_weights,)
768
+
769
+ if use_cache:
770
+ outputs += (present_key_value,)
771
+
772
+ return outputs
773
+
774
+
775
+ HyperCLOVAX_START_DOCSTRING = r"""
776
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
777
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
778
+ etc.)
779
+
780
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
781
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
782
+ and behavior.
783
+
784
+ Parameters:
785
+ config ([`HyperCLOVAXConfig`]):
786
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
787
+ load the weights associated with the model, only the configuration. Check out the
788
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
789
+ """
790
+
791
+
792
+ @add_start_docstrings(
793
+ "The bare HyperCLOVAX Model outputting raw hidden-states without any specific head on top.",
794
+ HyperCLOVAX_START_DOCSTRING,
795
+ )
796
+ class HyperCLOVAXPreTrainedModel(PreTrainedModel):
797
+ config_class = HyperCLOVAXConfig
798
+ base_model_prefix = "model"
799
+ supports_gradient_checkpointing = True
800
+ _no_split_modules = ["HyperCLOVAXDecoderLayer"]
801
+ _skip_keys_device_placement = ["past_key_values"]
802
+ _supports_flash_attn_2 = True
803
+ _supports_sdpa = True
804
+ _supports_cache_class = True
805
+ _supports_quantized_cache = True
806
+ _supports_static_cache = True
807
+
808
+ def _init_weights(self, module):
809
+ std = self.config.initializer_range
810
+ if isinstance(module, nn.Linear):
811
+ module.weight.data.normal_(mean=0.0, std=std)
812
+ if module.bias is not None:
813
+ module.bias.data.zero_()
814
+ elif isinstance(module, nn.Embedding):
815
+ module.weight.data.normal_(mean=0.0, std=std)
816
+ if module.padding_idx is not None:
817
+ module.weight.data[module.padding_idx].zero_()
818
+
819
+
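+ # Usage sketch (added for illustration, not part of the original file): given the
+ # backend flags declared on HyperCLOVAXPreTrainedModel (_supports_flash_attn_2,
+ # _supports_sdpa, _no_split_modules), a sharded load could look roughly like:
+ #
+ #   model = HyperCLOVAXForCausalLM.from_pretrained(
+ #       YOUR_DIR,                                 # placeholder path, as in the example docstring below
+ #       torch_dtype=torch.bfloat16,
+ #       device_map="auto",                        # keeps each HyperCLOVAXDecoderLayer on one device
+ #       attn_implementation="flash_attention_2",  # or "sdpa"
+ #   )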
820
+ HyperCLOVAX_INPUTS_DOCSTRING = r"""
821
+ Args:
822
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
823
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
824
+ it.
825
+
826
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
827
+ [`PreTrainedTokenizer.__call__`] for details.
828
+
829
+ [What are input IDs?](../glossary#input-ids)
830
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
831
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
832
+
833
+ - 1 for tokens that are **not masked**,
834
+ - 0 for tokens that are **masked**.
835
+
836
+ [What are attention masks?](../glossary#attention-mask)
837
+
838
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
839
+ [`PreTrainedTokenizer.__call__`] for details.
840
+
841
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
842
+ `past_key_values`).
843
+
844
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
845
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
846
+ information on the default strategy.
847
+
848
+ - 1 indicates the head is **not masked**,
849
+ - 0 indicates the head is **masked**.
850
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
851
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
852
+ config.n_positions - 1]`.
853
+
854
+ [What are position IDs?](../glossary#position-ids)
855
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
856
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
857
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
858
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
859
+
860
+ Two formats are allowed:
861
+ - a [`~cache_utils.Cache`] instance, see our
862
+ [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache);
863
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
864
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
865
+ cache format.
866
+
867
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
868
+ legacy cache format will be returned.
869
+
870
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
871
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
872
+ of shape `(batch_size, sequence_length)`.
873
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
874
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
875
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
876
+ model's internal embedding lookup matrix.
877
+ use_cache (`bool`, *optional*):
878
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
879
+ `past_key_values`).
880
+ output_attentions (`bool`, *optional*):
881
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
882
+ tensors for more detail.
883
+ output_hidden_states (`bool`, *optional*):
884
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
885
+ more detail.
886
+ return_dict (`bool`, *optional*):
887
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
888
+ cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
889
+ Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`,
890
+ this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
891
+ the complete sequence length.
892
+ """
893
+
894
+
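+ # Usage sketch (added for illustration, not part of the original file): the
+ # `past_key_values` argument documented above accepts a `Cache` instance, e.g.
+ #
+ #   from transformers import DynamicCache
+ #   past = DynamicCache()
+ #   out = model(input_ids, use_cache=True, past_key_values=past)
+ #   out = model(next_token_ids, use_cache=True, past_key_values=out.past_key_values)
+ #
+ # (`next_token_ids` stands for the newly sampled ids.) Passing a legacy tuple of
+ # tuples still works but triggers the deprecation warning in the forward below.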
895
+ @add_start_docstrings(
896
+ "The bare HyperCLOVAX Model outputting raw hidden-states without any specific head on top.",
897
+ HyperCLOVAX_START_DOCSTRING,
898
+ )
899
+ class HyperCLOVAXModel(HyperCLOVAXPreTrainedModel):
900
+ """
901
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`HyperCLOVAXDecoderLayer`]
902
+
903
+ Args:
904
+ config: HyperCLOVAXConfig
905
+ """
906
+
907
+ def __init__(self, config: HyperCLOVAXConfig):
908
+ super().__init__(config)
909
+ self.padding_idx = config.pad_token_id
910
+ self.vocab_size = config.vocab_size
911
+
912
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
913
+ self.layers = nn.ModuleList(
914
+ [HyperCLOVAXDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
915
+ )
916
+ self.norm = HyperCLOVAXRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
917
+ self.rotary_emb = HyperCLOVAXRotaryEmbedding(config=config)
918
+ self.gradient_checkpointing = False
919
+
920
+ # Initialize weights and apply final processing
921
+ self.post_init()
922
+
923
+ # mup
924
+ self.embedding_multiplier = config.embedding_multiplier
925
+
926
+ def get_input_embeddings(self):
927
+ return self.embed_tokens
928
+
929
+ def set_input_embeddings(self, value):
930
+ self.embed_tokens = value
931
+
932
+ @add_start_docstrings_to_model_forward(HyperCLOVAX_INPUTS_DOCSTRING)
933
+ def forward(
934
+ self,
935
+ input_ids: torch.LongTensor = None,
936
+ attention_mask: Optional[torch.Tensor] = None,
937
+ position_ids: Optional[torch.LongTensor] = None,
938
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
939
+ inputs_embeds: Optional[torch.FloatTensor] = None,
940
+ use_cache: Optional[bool] = None,
941
+ output_attentions: Optional[bool] = None,
942
+ output_hidden_states: Optional[bool] = None,
943
+ return_dict: Optional[bool] = None,
944
+ cache_position: Optional[torch.LongTensor] = None,
945
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
946
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
947
+ output_hidden_states = (
948
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
949
+ )
950
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
951
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
952
+
953
+ if (input_ids is None) ^ (inputs_embeds is not None):
954
+ raise ValueError(
955
+ "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
956
+ )
957
+
958
+ if self.gradient_checkpointing and self.training and use_cache:
959
+ logger.warning_once(
960
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
961
+ )
962
+ use_cache = False
963
+
964
+ if inputs_embeds is None:
965
+ inputs_embeds = self.embed_tokens(input_ids)
966
+
967
+ inputs_embeds = inputs_embeds * self.embedding_multiplier # mup
968
+
969
+ # kept for BC (non `Cache` `past_key_values` inputs)
970
+ return_legacy_cache = False
971
+ if use_cache and not isinstance(past_key_values, Cache):
972
+ return_legacy_cache = True
973
+ if past_key_values is None:
974
+ past_key_values = DynamicCache()
975
+ else:
976
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
977
+ logger.warning_once(
978
+ "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
979
+ "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
980
+ "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
981
+ )
982
+
983
+ if cache_position is None:
984
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
985
+ cache_position = torch.arange(
986
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
987
+ )
988
+ if position_ids is None:
989
+ position_ids = cache_position.unsqueeze(0)
990
+
991
+ causal_mask = self._update_causal_mask(
992
+ attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
993
+ )
994
+ hidden_states = inputs_embeds
995
+
996
+ # create position embeddings to be shared across the decoder layers
997
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
998
+
999
+ # decoder layers
1000
+ all_hidden_states = () if output_hidden_states else None
1001
+ all_self_attns = () if output_attentions else None
1002
+ next_decoder_cache = None
1003
+
1004
+ for decoder_layer in self.layers:
1005
+ if output_hidden_states:
1006
+ all_hidden_states += (hidden_states,)
1007
+
1008
+ if self.gradient_checkpointing and self.training:
1009
+ layer_outputs = self._gradient_checkpointing_func(
1010
+ decoder_layer.__call__,
1011
+ hidden_states,
1012
+ causal_mask,
1013
+ position_ids,
1014
+ past_key_values,
1015
+ output_attentions,
1016
+ use_cache,
1017
+ cache_position,
1018
+ position_embeddings,
1019
+ )
1020
+ else:
1021
+ layer_outputs = decoder_layer(
1022
+ hidden_states,
1023
+ attention_mask=causal_mask,
1024
+ position_ids=position_ids,
1025
+ past_key_value=past_key_values,
1026
+ output_attentions=output_attentions,
1027
+ use_cache=use_cache,
1028
+ cache_position=cache_position,
1029
+ position_embeddings=position_embeddings,
1030
+ )
1031
+
1032
+ hidden_states = layer_outputs[0]
1033
+
1034
+ if use_cache:
1035
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1036
+
1037
+ if output_attentions:
1038
+ all_self_attns += (layer_outputs[1],)
1039
+
1040
+ hidden_states = self.norm(hidden_states)
1041
+
1042
+ # add hidden states from the last decoder layer
1043
+ if output_hidden_states:
1044
+ all_hidden_states += (hidden_states,)
1045
+
1046
+ next_cache = next_decoder_cache if use_cache else None
1047
+ if return_legacy_cache:
1048
+ next_cache = next_cache.to_legacy_cache()
1049
+
1050
+ if not return_dict:
1051
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1052
+ return BaseModelOutputWithPast(
1053
+ last_hidden_state=hidden_states,
1054
+ past_key_values=next_cache,
1055
+ hidden_states=all_hidden_states,
1056
+ attentions=all_self_attns,
1057
+ )
1058
+
1059
+ def _update_causal_mask(
1060
+ self,
1061
+ attention_mask: torch.Tensor,
1062
+ input_tensor: torch.Tensor,
1063
+ cache_position: torch.Tensor,
1064
+ past_key_values: Cache,
1065
+ output_attentions: bool,
1066
+ ):
1067
+ if self.config._attn_implementation == "flash_attention_2":
1068
+ if attention_mask is not None and 0.0 in attention_mask:
1069
+ return attention_mask
1070
+ return None
1071
+
1072
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
1073
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
1074
+ # to infer the attention mask.
1075
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1076
+ using_static_cache = isinstance(past_key_values, StaticCache)
1077
+
1078
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
1079
+ if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
1080
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
1081
+ attention_mask,
1082
+ inputs_embeds=input_tensor,
1083
+ past_key_values_length=past_seen_tokens,
1084
+ is_training=self.training,
1085
+ ):
1086
+ return None
1087
+
1088
+ dtype, device = input_tensor.dtype, input_tensor.device
1089
+ min_dtype = torch.finfo(dtype).min
1090
+ sequence_length = input_tensor.shape[1]
1091
+ if using_static_cache:
1092
+ target_length = past_key_values.get_max_length()
1093
+ else:
1094
+ target_length = (
1095
+ attention_mask.shape[-1]
1096
+ if isinstance(attention_mask, torch.Tensor)
1097
+ else past_seen_tokens + sequence_length + 1
1098
+ )
1099
+
1100
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
1101
+ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
1102
+ attention_mask,
1103
+ sequence_length=sequence_length,
1104
+ target_length=target_length,
1105
+ dtype=dtype,
1106
+ device=device,
1107
+ min_dtype=min_dtype,
1108
+ cache_position=cache_position,
1109
+ batch_size=input_tensor.shape[0],
1110
+ )
1111
+
1112
+ if (
1113
+ self.config._attn_implementation == "sdpa"
1114
+ and attention_mask is not None
1115
+ and attention_mask.device.type == "cuda"
1116
+ and not output_attentions
1117
+ ):
1118
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
1119
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
1120
+ # Details: https://github.com/pytorch/pytorch/issues/110213
1121
+ causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
1122
+
1123
+ return causal_mask
1124
+
1125
+
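+ # Note (added for clarity, not part of the original file): `_update_causal_mask`
+ # above is backend-dependent. Under flash_attention_2 it passes the 2D padding mask
+ # through (or returns None when nothing is masked); under SDPA without a static
+ # cache it may return None so the kernel's `is_causal=True` path can be used; in
+ # every other case it expands the mask to a 4D additive mask whose masked positions
+ # are filled with `torch.finfo(dtype).min`.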
1126
+ class HyperCLOVAXForCausalLM(HyperCLOVAXPreTrainedModel, GenerationMixin):
1127
+ _tied_weights_keys = ["lm_head.weight"]
1128
+
1129
+ def __init__(self, config):
1130
+ super().__init__(config)
1131
+ self.model = HyperCLOVAXModel(config)
1132
+ self.vocab_size = config.vocab_size
1133
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1134
+
1135
+ # Initialize weights and apply final processing
1136
+ self.post_init()
1137
+
1138
+ def _get_apply_liger_kernel_converter(self):
1139
+ return _apply_liger_kernel_to_instance
1140
+
1141
+ def get_input_embeddings(self):
1142
+ return self.model.embed_tokens
1143
+
1144
+ def set_input_embeddings(self, value):
1145
+ self.model.embed_tokens = value
1146
+
1147
+ def get_output_embeddings(self):
1148
+ return self.lm_head
1149
+
1150
+ def set_output_embeddings(self, new_embeddings):
1151
+ self.lm_head = new_embeddings
1152
+
1153
+ def set_decoder(self, decoder):
1154
+ self.model = decoder
1155
+
1156
+ def get_decoder(self):
1157
+ return self.model
1158
+
1159
+ @add_start_docstrings_to_model_forward(HyperCLOVAX_INPUTS_DOCSTRING)
1160
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1161
+ def forward(
1162
+ self,
1163
+ input_ids: torch.LongTensor = None,
1164
+ attention_mask: Optional[torch.Tensor] = None,
1165
+ position_ids: Optional[torch.LongTensor] = None,
1166
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1167
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1168
+ labels: Optional[torch.LongTensor] = None,
1169
+ use_cache: Optional[bool] = None,
1170
+ output_attentions: Optional[bool] = None,
1171
+ output_hidden_states: Optional[bool] = None,
1172
+ return_dict: Optional[bool] = None,
1173
+ cache_position: Optional[torch.LongTensor] = None,
1174
+ num_logits_to_keep: int = 0,
1175
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1176
+ r"""
1177
+ Args:
1178
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1179
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1180
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1181
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1182
+
1183
+ num_logits_to_keep (`int`, *optional*):
1184
+ Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
1185
+ `input_ids` (special case). Only the last token's logits are needed for generation, and computing them only for
1186
+ that token saves memory, which becomes significant for long sequences or a large vocabulary size.
1187
+
1188
+ Returns:
1189
+
1190
+ Example:
1191
+
1192
+ ```python
1193
+ >>> from transformers import AutoTokenizer, HyperCLOVAXForCausalLM
1194
+
1195
+ >>> model = HyperCLOVAXForCausalLM.from_pretrained(YOUR_DIR)
1196
+ >>> tokenizer = AutoTokenizer.from_pretrained(YOUR_DIR)
1197
+
1198
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1199
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1200
+
1201
+ >>> # Generate
1202
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1203
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1204
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1205
+ ```"""
1206
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1207
+ output_hidden_states = (
1208
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1209
+ )
1210
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1211
+
1212
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1213
+ outputs = self.model(
1214
+ input_ids=input_ids,
1215
+ attention_mask=attention_mask,
1216
+ position_ids=position_ids,
1217
+ past_key_values=past_key_values,
1218
+ inputs_embeds=inputs_embeds,
1219
+ use_cache=use_cache,
1220
+ output_attentions=output_attentions,
1221
+ output_hidden_states=output_hidden_states,
1222
+ return_dict=return_dict,
1223
+ cache_position=cache_position,
1224
+ )
1225
+
1226
+ hidden_states = outputs[0]
1227
+ if self.config.pretraining_tp > 1:
1228
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1229
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1230
+ logits = torch.cat(logits, dim=-1)
1231
+ else:
1232
+ if labels is None and not is_torchdynamo_compiling():
1233
+ logger.warning_once(
1234
+ "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
1235
+ )
1236
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
1237
+ # TODO: remove the float() operation in v4.46
1238
+ logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
1239
+
1240
+ logits = logits * self.config.logits_scaling # mup
1241
+
1242
+ loss = None
1243
+ if labels is not None:
1244
+ # Upcast to float if we need to compute the loss to avoid potential precision issues
1245
+ logits = logits.float()
1246
+ # Shift so that tokens < n predict n
1247
+ shift_logits = logits[..., :-1, :].contiguous()
1248
+ shift_labels = labels[..., 1:].contiguous()
1249
+ # Flatten the tokens
1250
+ loss_fct = CrossEntropyLoss()
1251
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1252
+ shift_labels = shift_labels.view(-1)
1253
+ # Enable model parallelism
1254
+ shift_labels = shift_labels.to(shift_logits.device)
1255
+ loss = loss_fct(shift_logits, shift_labels)
1256
+
1257
+ if not return_dict:
1258
+ output = (logits,) + outputs[1:]
1259
+ return (loss,) + output if loss is not None else output
1260
+
1261
+ return CausalLMOutputWithPast(
1262
+ loss=loss,
1263
+ logits=logits,
1264
+ past_key_values=outputs.past_key_values,
1265
+ hidden_states=outputs.hidden_states,
1266
+ attentions=outputs.attentions,
1267
+ )
1268
+
1269
+ def prepare_inputs_for_generation(
1270
+ self,
1271
+ input_ids,
1272
+ past_key_values=None,
1273
+ attention_mask=None,
1274
+ inputs_embeds=None,
1275
+ cache_position=None,
1276
+ position_ids=None,
1277
+ use_cache=True,
1278
+ num_logits_to_keep=None,
1279
+ **kwargs,
1280
+ ):
1281
+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
1282
+ # Exception 1: when passing input_embeds, input_ids may be missing entries
1283
+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
1284
+ if past_key_values is not None:
1285
+ if inputs_embeds is not None: # Exception 1
1286
+ input_ids = input_ids[:, -cache_position.shape[0] :]
1287
+ elif input_ids.shape[1] != cache_position.shape[0]: # Default case (the "else", a no op, is Exception 2)
1288
+ input_ids = input_ids[:, cache_position]
1289
+
1290
+ if attention_mask is not None and position_ids is None:
1291
+ # create position_ids on the fly for batch generation
1292
+ position_ids = attention_mask.long().cumsum(-1) - 1
1293
+ position_ids.masked_fill_(attention_mask == 0, 1)
1294
+ if past_key_values:
1295
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1296
+
1297
+ # This `clone` call is needed to avoid recapturing cuda graphs with `torch.compile`'s `mode="reduce-overhead`, as otherwise the input `position_ids` would have various stride during the decoding. Here, simply using `.contiguous()` is not sufficient as in the batch size = 1 case, `position_ids` is already contiguous but with varying stride which retriggers a capture.
1298
+ position_ids = position_ids.clone(memory_format=torch.contiguous_format)
1299
+
1300
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1301
+ if inputs_embeds is not None and cache_position[0] == 0:
1302
+ model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
1303
+ else:
1304
+ # The clone here is for the same reason as for `position_ids`.
1305
+ model_inputs = {"input_ids": input_ids.clone(memory_format=torch.contiguous_format), "inputs_embeds": None}
1306
+
1307
+ if isinstance(past_key_values, StaticCache) and attention_mask.ndim == 2:
1308
+ if model_inputs["inputs_embeds"] is not None:
1309
+ batch_size, sequence_length, _ = model_inputs["inputs_embeds"].shape
1310
+ device = model_inputs["inputs_embeds"].device
1311
+ else:
1312
+ batch_size, sequence_length = model_inputs["input_ids"].shape
1313
+ device = model_inputs["input_ids"].device
1314
+
1315
+ dtype = self.lm_head.weight.dtype
1316
+ min_dtype = torch.finfo(dtype).min
1317
+
1318
+ attention_mask = _prepare_4d_causal_attention_mask_with_cache_position(
1319
+ attention_mask,
1320
+ sequence_length=sequence_length,
1321
+ target_length=past_key_values.get_max_length(),
1322
+ dtype=dtype,
1323
+ device=device,
1324
+ min_dtype=min_dtype,
1325
+ cache_position=cache_position,
1326
+ batch_size=batch_size,
1327
+ )
1328
+
1329
+ if num_logits_to_keep is not None:
1330
+ model_inputs["num_logits_to_keep"] = num_logits_to_keep
1331
+
1332
+ model_inputs.update(
1333
+ {
1334
+ "position_ids": position_ids,
1335
+ "cache_position": cache_position,
1336
+ "past_key_values": past_key_values,
1337
+ "use_cache": use_cache,
1338
+ "attention_mask": attention_mask,
1339
+ }
1340
+ )
1341
+ return model_inputs
1342
+
1343
+
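+ # Note (added for clarity, not part of the original file): during cached decoding,
+ # `prepare_inputs_for_generation` above keeps only the tokens the cache has not seen
+ # yet. For example, after a prefill of 10 prompt tokens, `cache_position` is `[10]`
+ # and `input_ids[:, cache_position]` selects just the newly generated token.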
1344
+ @add_start_docstrings(
1345
+ """
1346
+ The HyperCLOVAX Model transformer with a sequence classification head on top (linear layer).
1347
+
1348
+ [`HyperCLOVAXForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1349
+ (e.g. GPT-2) do.
1350
+
1351
+ Since it does classification on the last token, it requires to know the position of the last token. If a
1352
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1353
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1354
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1355
+ each row of the batch).
1356
+ """,
1357
+ HyperCLOVAX_START_DOCSTRING,
1358
+ )
1359
+ class HyperCLOVAXForSequenceClassification(HyperCLOVAXPreTrainedModel):
1360
+ def __init__(self, config):
1361
+ super().__init__(config)
1362
+ self.num_labels = config.num_labels
1363
+ self.model = HyperCLOVAXModel(config)
1364
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1365
+
1366
+ # Initialize weights and apply final processing
1367
+ self.post_init()
1368
+
1369
+ def get_input_embeddings(self):
1370
+ return self.model.embed_tokens
1371
+
1372
+ def set_input_embeddings(self, value):
1373
+ self.model.embed_tokens = value
1374
+
1375
+ @add_start_docstrings_to_model_forward(HyperCLOVAX_INPUTS_DOCSTRING)
1376
+ def forward(
1377
+ self,
1378
+ input_ids: Optional[torch.LongTensor] = None,
1379
+ attention_mask: Optional[torch.Tensor] = None,
1380
+ position_ids: Optional[torch.LongTensor] = None,
1381
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1382
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1383
+ labels: Optional[torch.LongTensor] = None,
1384
+ use_cache: Optional[bool] = None,
1385
+ output_attentions: Optional[bool] = None,
1386
+ output_hidden_states: Optional[bool] = None,
1387
+ return_dict: Optional[bool] = None,
1388
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1389
+ r"""
1390
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1391
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1392
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1393
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1394
+ """
1395
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1396
+
1397
+ transformer_outputs = self.model(
1398
+ input_ids,
1399
+ attention_mask=attention_mask,
1400
+ position_ids=position_ids,
1401
+ past_key_values=past_key_values,
1402
+ inputs_embeds=inputs_embeds,
1403
+ use_cache=use_cache,
1404
+ output_attentions=output_attentions,
1405
+ output_hidden_states=output_hidden_states,
1406
+ return_dict=return_dict,
1407
+ )
1408
+ hidden_states = transformer_outputs[0]
1409
+ logits = self.score(hidden_states)
1410
+
1411
+ if input_ids is not None:
1412
+ batch_size = input_ids.shape[0]
1413
+ else:
1414
+ batch_size = inputs_embeds.shape[0]
1415
+
1416
+ if self.config.pad_token_id is None and batch_size != 1:
1417
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1418
+ if self.config.pad_token_id is None:
1419
+ sequence_lengths = -1
1420
+ else:
1421
+ if input_ids is not None:
1422
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
1423
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
1424
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
1425
+ sequence_lengths = sequence_lengths.to(logits.device)
1426
+ else:
1427
+ sequence_lengths = -1
1428
+
1429
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1430
+
1431
+ loss = None
1432
+ if labels is not None:
1433
+ labels = labels.to(logits.device)
1434
+ if self.config.problem_type is None:
1435
+ if self.num_labels == 1:
1436
+ self.config.problem_type = "regression"
1437
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1438
+ self.config.problem_type = "single_label_classification"
1439
+ else:
1440
+ self.config.problem_type = "multi_label_classification"
1441
+
1442
+ if self.config.problem_type == "regression":
1443
+ loss_fct = MSELoss()
1444
+ if self.num_labels == 1:
1445
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1446
+ else:
1447
+ loss = loss_fct(pooled_logits, labels)
1448
+ elif self.config.problem_type == "single_label_classification":
1449
+ loss_fct = CrossEntropyLoss()
1450
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1451
+ elif self.config.problem_type == "multi_label_classification":
1452
+ loss_fct = BCEWithLogitsLoss()
1453
+ loss = loss_fct(pooled_logits, labels)
1454
+ if not return_dict:
1455
+ output = (pooled_logits,) + transformer_outputs[1:]
1456
+ return ((loss,) + output) if loss is not None else output
1457
+
1458
+ return SequenceClassifierOutputWithPast(
1459
+ loss=loss,
1460
+ logits=pooled_logits,
1461
+ past_key_values=transformer_outputs.past_key_values,
1462
+ hidden_states=transformer_outputs.hidden_states,
1463
+ attentions=transformer_outputs.attentions,
1464
+ )
1465
+
1466
+
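+ # Note (added for clarity, not part of the original file): the last-token pooling
+ # described in the class docstring above reduces to a simple index lookup. With
+ # `pad_token_id = 0` and `input_ids = [[5, 6, 7, 0, 0]]`,
+ # `torch.eq(input_ids, 0).int().argmax(-1) - 1` yields 2, so the logits at position
+ # 2 (the last non-padding token) are pooled for classification.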
1467
+ @add_start_docstrings(
1468
+ """
1469
+ The HyperCLOVAX Model transformer with a span classification head on top for extractive question-answering tasks like
1470
+ SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
1471
+ """,
1472
+ HyperCLOVAX_START_DOCSTRING,
1473
+ )
1474
+ class HyperCLOVAXForQuestionAnswering(HyperCLOVAXPreTrainedModel):
1475
+ base_model_prefix = "transformer"
1476
+
1477
+ # Copied from transformers.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->HyperCLOVAX
1478
+ def __init__(self, config):
1479
+ super().__init__(config)
1480
+ self.transformer = HyperCLOVAXModel(config)
1481
+ self.qa_outputs = nn.Linear(config.hidden_size, 2)
1482
+
1483
+ # Initialize weights and apply final processing
1484
+ self.post_init()
1485
+
1486
+ def get_input_embeddings(self):
1487
+ return self.transformer.embed_tokens
1488
+
1489
+ def set_input_embeddings(self, value):
1490
+ self.transformer.embed_tokens = value
1491
+
1492
+ @add_start_docstrings_to_model_forward(HyperCLOVAX_INPUTS_DOCSTRING)
1493
+ def forward(
1494
+ self,
1495
+ input_ids: Optional[torch.LongTensor] = None,
1496
+ attention_mask: Optional[torch.FloatTensor] = None,
1497
+ position_ids: Optional[torch.LongTensor] = None,
1498
+ past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
1499
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1500
+ start_positions: Optional[torch.LongTensor] = None,
1501
+ end_positions: Optional[torch.LongTensor] = None,
1502
+ output_attentions: Optional[bool] = None,
1503
+ output_hidden_states: Optional[bool] = None,
1504
+ return_dict: Optional[bool] = None,
1505
+ ) -> Union[Tuple, QuestionAnsweringModelOutput]:
1506
+ r"""
1507
+ start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1508
+ Labels for position (index) of the start of the labelled span for computing the token classification loss.
1509
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1510
+ are not taken into account for computing the loss.
1511
+ end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1512
+ Labels for position (index) of the end of the labelled span for computing the token classification loss.
1513
+ Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
1514
+ are not taken into account for computing the loss.
1515
+ """
1516
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1517
+
1518
+ outputs = self.transformer(
1519
+ input_ids,
1520
+ attention_mask=attention_mask,
1521
+ position_ids=position_ids,
1522
+ past_key_values=past_key_values,
1523
+ inputs_embeds=inputs_embeds,
1524
+ output_attentions=output_attentions,
1525
+ output_hidden_states=output_hidden_states,
1526
+ return_dict=return_dict,
1527
+ )
1528
+
1529
+ sequence_output = outputs[0]
1530
+
1531
+ logits = self.qa_outputs(sequence_output)
1532
+ start_logits, end_logits = logits.split(1, dim=-1)
1533
+ start_logits = start_logits.squeeze(-1).contiguous()
1534
+ end_logits = end_logits.squeeze(-1).contiguous()
1535
+
1536
+ total_loss = None
1537
+ if start_positions is not None and end_positions is not None:
1538
+ # If we are on multi-GPU, splitting adds a dimension
1539
+ if len(start_positions.size()) > 1:
1540
+ start_positions = start_positions.squeeze(-1).to(start_logits.device)
1541
+ if len(end_positions.size()) > 1:
1542
+ end_positions = end_positions.squeeze(-1).to(end_logits.device)
1543
+ # sometimes the start/end positions are outside our model inputs, we ignore these terms
1544
+ ignored_index = start_logits.size(1)
1545
+ start_positions = start_positions.clamp(0, ignored_index)
1546
+ end_positions = end_positions.clamp(0, ignored_index)
1547
+
1548
+ loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
1549
+ start_loss = loss_fct(start_logits, start_positions)
1550
+ end_loss = loss_fct(end_logits, end_positions)
1551
+ total_loss = (start_loss + end_loss) / 2
1552
+
1553
+ if not return_dict:
1554
+ output = (start_logits, end_logits) + outputs[2:]
1555
+ return ((total_loss,) + output) if total_loss is not None else output
1556
+
1557
+ return QuestionAnsweringModelOutput(
1558
+ loss=total_loss,
1559
+ start_logits=start_logits,
1560
+ end_logits=end_logits,
1561
+ hidden_states=outputs.hidden_states,
1562
+ attentions=outputs.attentions,
1563
+ )
1564
+
1565
+
1566
+ @add_start_docstrings(
1567
+ """
1568
+ The HyperCLOVAX Model transformer with a token classification head on top (a linear layer on top of the hidden-states
1569
+ output) e.g. for Named-Entity-Recognition (NER) tasks.
1570
+ """,
1571
+ HyperCLOVAX_START_DOCSTRING,
1572
+ )
1573
+ class HyperCLOVAXForTokenClassification(HyperCLOVAXPreTrainedModel):
1574
+ def __init__(self, config):
1575
+ super().__init__(config)
1576
+ self.num_labels = config.num_labels
1577
+ self.model = HyperCLOVAXModel(config)
1578
+ if getattr(config, "classifier_dropout", None) is not None:
1579
+ classifier_dropout = config.classifier_dropout
1580
+ elif getattr(config, "hidden_dropout", None) is not None:
1581
+ classifier_dropout = config.hidden_dropout
1582
+ else:
1583
+ classifier_dropout = 0.1
1584
+ self.dropout = nn.Dropout(classifier_dropout)
1585
+ self.score = nn.Linear(config.hidden_size, config.num_labels)
1586
+
1587
+ # Initialize weights and apply final processing
1588
+ self.post_init()
1589
+
1590
+ def get_input_embeddings(self):
1591
+ return self.model.embed_tokens
1592
+
1593
+ def set_input_embeddings(self, value):
1594
+ self.model.embed_tokens = value
1595
+
1596
+ @add_start_docstrings_to_model_forward(HyperCLOVAX_INPUTS_DOCSTRING)
1597
+ def forward(
1598
+ self,
1599
+ input_ids: Optional[torch.LongTensor] = None,
1600
+ attention_mask: Optional[torch.Tensor] = None,
1601
+ position_ids: Optional[torch.LongTensor] = None,
1602
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1603
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1604
+ labels: Optional[torch.LongTensor] = None,
1605
+ use_cache: Optional[bool] = None,
1606
+ output_attentions: Optional[bool] = None,
1607
+ output_hidden_states: Optional[bool] = None,
1608
+ return_dict: Optional[bool] = None,
1609
+ ) -> Union[Tuple, TokenClassifierOutput]:
1610
+ r"""
1611
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1612
+ Labels for computing the token classification loss. Indices should be in `[0, ...,
1613
+ config.num_labels - 1]`.
1615
+ """
1616
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1617
+
1618
+ outputs = self.model(
1619
+ input_ids,
1620
+ attention_mask=attention_mask,
1621
+ position_ids=position_ids,
1622
+ past_key_values=past_key_values,
1623
+ inputs_embeds=inputs_embeds,
1624
+ use_cache=use_cache,
1625
+ output_attentions=output_attentions,
1626
+ output_hidden_states=output_hidden_states,
1627
+ return_dict=return_dict,
1628
+ )
1629
+ sequence_output = outputs[0]
1630
+ sequence_output = self.dropout(sequence_output)
1631
+ logits = self.score(sequence_output)
1632
+
1633
+ loss = None
1634
+ if labels is not None:
1635
+ loss_fct = CrossEntropyLoss()
1636
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1637
+
1638
+ if not return_dict:
1639
+ output = (logits,) + outputs[2:]
1640
+ return ((loss,) + output) if loss is not None else output
1641
+
1642
+ return TokenClassifierOutput(
1643
+ loss=loss,
1644
+ logits=logits,
1645
+ hidden_states=outputs.hidden_states,
1646
+ attentions=outputs.attentions,
1647
+ )
1648
+
1649
+
1650
+ ################################################################################################
1651
+ ################################################################################################
1652
+ """
1653
+ liger kernel monkey patching
1654
+ https://github.com/linkedin/Liger-Kernel/blob/v0.5.2/src/liger_kernel/transformers/monkey_patch.py
1655
+ """
1656
+
1657
+ import inspect
1658
+ import logging
1659
+ from functools import partial
1660
+ from typing import TYPE_CHECKING, Callable, List, Optional, Tuple, Union
1661
+
1662
+ import torch
1663
+ import torch.nn.functional as F
1664
+ import transformers
1665
+ from packaging import version
1666
+ from torch.nn import CrossEntropyLoss
1667
+ from transformers import PreTrainedModel
1668
+
1669
+ if TYPE_CHECKING:
1670
+ from transformers.cache_utils import Cache
1671
+
1672
+ import sys
1673
+
1674
+ from packaging.version import parse
1675
+
1676
+ if sys.version_info < (3, 8):
1677
+ import importlib_metadata
1678
+ else:
1679
+ import importlib.metadata as importlib_metadata
1680
+
1681
+ try:
1682
+ from liger_kernel.transformers.cross_entropy import LigerCrossEntropyLoss
1683
+ from liger_kernel.transformers.functional import liger_cross_entropy
1684
+ from liger_kernel.transformers.fused_linear_cross_entropy import (
1685
+ LigerFusedLinearCrossEntropyLoss,
1686
+ )
1687
+ from liger_kernel.transformers.rms_norm import LigerRMSNorm
1688
+ from liger_kernel.transformers.rope import liger_rotary_pos_emb
1689
+ from liger_kernel.transformers.swiglu import LigerSwiGLUMLP
1690
+
1691
+ _is_liger_kernel_available = True
1692
+
1693
+ LIGER_KERNEL_MATCHING_VERSION = parse("0.5.2")
1694
+ liger_kernel_version = parse(importlib_metadata.version("liger_kernel"))
1695
+ _is_liger_kernel_version_matching = (
1696
+ liger_kernel_version.major,
1697
+ liger_kernel_version.minor,
1698
+ liger_kernel_version.release[-1],
1699
+ ) == (
1700
+ LIGER_KERNEL_MATCHING_VERSION.major,
1701
+ LIGER_KERNEL_MATCHING_VERSION.minor,
1702
+ LIGER_KERNEL_MATCHING_VERSION.release[-1],
1703
+ )
1704
+ except Exception:
1705
+ _is_liger_kernel_available = False
1706
+ _is_liger_kernel_version_matching = False
1707
+
1708
+
1709
+ def lce_forward_deprecated(
1710
+ self,
1711
+ input_ids: torch.LongTensor = None,
1712
+ attention_mask: Optional[torch.Tensor] = None,
1713
+ position_ids: Optional[torch.LongTensor] = None,
1714
+ past_key_values: Optional[Union["Cache", List[torch.FloatTensor]]] = None,
1715
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1716
+ labels: Optional[torch.LongTensor] = None,
1717
+ use_cache: Optional[bool] = None,
1718
+ output_attentions: Optional[bool] = None,
1719
+ output_hidden_states: Optional[bool] = None,
1720
+ return_dict: Optional[bool] = None,
1721
+ cache_position: Optional[torch.LongTensor] = None,
1722
+ num_logits_to_keep: int = 0,
1723
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1724
+
1725
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1726
+ output_hidden_states = (
1727
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1728
+ )
1729
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1730
+
1731
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1732
+ outputs = self.model(
1733
+ input_ids=input_ids,
1734
+ attention_mask=attention_mask,
1735
+ position_ids=position_ids,
1736
+ past_key_values=past_key_values,
1737
+ inputs_embeds=inputs_embeds,
1738
+ use_cache=use_cache,
1739
+ output_attentions=output_attentions,
1740
+ output_hidden_states=output_hidden_states,
1741
+ return_dict=return_dict,
1742
+ cache_position=cache_position,
1743
+ )
1744
+ hidden_states = outputs[0]
1745
+
1746
+ loss = None
1747
+ logits = None
1748
+
1749
+ if self.training and (labels is not None):
1750
+ if num_logits_to_keep != 0:
1751
+ hidden_states = hidden_states[:, -num_logits_to_keep:, :] # not sure whether this has a bug
1752
+ hidden_states = hidden_states * self.config.logits_scaling ## muP
1753
+
1754
+ shift_hidden_states = hidden_states[..., :-1, :].contiguous()
1755
+ shift_labels = labels[..., 1:].contiguous()
1756
+
1757
+ # flatten tokens
1758
+ shift_hidden_states = shift_hidden_states.view(-1, self.config.hidden_size)
1759
+ shift_labels = shift_labels.view(-1)
1760
+
1761
+ lce = LigerFusedLinearCrossEntropyLoss()
1762
+ loss = lce(self.lm_head.weight, shift_hidden_states, shift_labels)
1763
+
1764
+ else:
1765
+ assert self.config.pretraining_tp == 1, "not supported"
1766
+ logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
1767
+ logits = logits * self.config.logits_scaling ## muP
1768
+
1769
+ if labels is not None:
1770
+ # Upcast to float if we need to compute the loss to avoid potential precision issues
1771
+ logits = logits.float()
1772
+ # Shift so that tokens < n predict n
1773
+ shift_logits = logits[..., :-1, :].contiguous()
1774
+ shift_labels = labels[..., 1:].contiguous()
1775
+ # Flatten the tokens
1776
+ loss_fct = CrossEntropyLoss()
1777
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1778
+ shift_labels = shift_labels.view(-1)
1779
+ # Enable model parallelism
1780
+ shift_labels = shift_labels.to(shift_logits.device)
1781
+ loss = loss_fct(shift_logits, shift_labels)
1782
+
1783
+ if not return_dict:
1784
+ output = (logits,) + outputs[1:]
1785
+ return (loss,) + output if loss is not None else output
1786
+
1787
+ return CausalLMOutputWithPast(
1788
+ loss=loss,
1789
+ logits=logits,
1790
+ past_key_values=outputs.past_key_values,
1791
+ hidden_states=outputs.hidden_states,
1792
+ attentions=outputs.attentions,
1793
+ )
1794
+
1795
+
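+ # Note (added for clarity, not part of the original file): in the training branch of
+ # `lce_forward_deprecated` above, the full `(batch, seq, vocab)` logits tensor is never
+ # materialized; LigerFusedLinearCrossEntropyLoss consumes `lm_head.weight` and the
+ # shifted hidden states directly, which is the main memory saving of the fused path,
+ # and the muP `logits_scaling` factor is applied to the hidden states instead.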
1796
+ def _bind_method_to_module(module, method_name: str, new_method: Callable):
1797
+ # Binds a new method to a module instance so that self is passed as the first argument
1798
+ module.__dict__[method_name] = new_method.__get__(module, module.__class__)
1799
+
1800
+
1801
+ def _patch_rms_norm_module(module, offset=0.0, eps=1e-6, casting_mode="llama", in_place=True):
1802
+ module.offset = offset
1803
+ module.casting_mode = casting_mode
1804
+ module.variance_epsilon = getattr(module, "variance_epsilon", None) or getattr(module, "eps", None) or eps
1805
+ module.in_place = in_place
1806
+ _bind_method_to_module(module, "forward", LigerRMSNorm.forward)
1807
+ _bind_method_to_module(module, "extra_repr", LigerRMSNorm.extra_repr)
1808
+
1809
+
1810
+ def apply_liger_kernel_to_hyperclovax(
1811
+ rope: bool = True,
1812
+ cross_entropy: bool = False,
1813
+ fused_linear_cross_entropy: bool = True,
1814
+ rms_norm: bool = True,
1815
+ swiglu: bool = True,
1816
+ model: PreTrainedModel = None,
1817
+ ) -> None:
1818
+
1819
+ assert not cross_entropy, "not supported"
1820
+ if rope:
1821
+ apply_rotary_pos_emb = liger_rotary_pos_emb
1822
+ if rms_norm:
1823
+ HyperCLOVAXRMSNorm = LigerRMSNorm
1824
+ if swiglu:
1825
+ HyperCLOVAXMLP = LigerSwiGLUMLP
1826
+ # to use VLM forward in VLM repo
1827
+ # if fused_linear_cross_entropy:
1828
+ # HyperCLOVAXForCausalLM.forward = lce_forward_deprecated
1829
+
1830
+ if model is not None:
1831
+ # The model instance already exists, so we need to additionally patch the
1832
+ # instance variables that reference already-instantiated modules (e.g. LlamaRMSNorm or LlamaMLP)
1833
+
1834
+ # get the base model from the model instance
1835
+ base_model: HyperCLOVAXModel = getattr(model, model.base_model_prefix, model)
1836
+
1837
+ if rms_norm:
1838
+ _patch_rms_norm_module(base_model.norm)
1839
+
1840
+ for decoder_layer in base_model.layers:
1841
+ if swiglu:
1842
+ _bind_method_to_module(decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward)
1843
+ if rms_norm:
1844
+ _patch_rms_norm_module(decoder_layer.input_layernorm)
1845
+ _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
1846
+ if decoder_layer.use_post_norm:
1847
+ _patch_rms_norm_module(decoder_layer.post_norm1)
1848
+ _patch_rms_norm_module(decoder_layer.post_norm2)
1849
+
1850
+
1851
+ def _apply_liger_kernel_to_instance(model: PreTrainedModel, **kwargs) -> None:
1852
+ model_type = getattr(model, "config", None) and getattr(model.config, "model_type", None)
1853
+ assert model_type == "hyperclovax"
1854
+ apply_fn = apply_liger_kernel_to_hyperclovax
1855
+ apply_fn_signature = inspect.signature(apply_fn)
1856
+
1857
+ # Filter out the keyword arguments that are not supported by the apply function
1858
+ applicable_kwargs = {key: value for key, value in kwargs.items() if key in apply_fn_signature.parameters}
1859
+ logger.info(
1860
+ f"Applying Liger kernels to model instance with model type: {model_type} with kwargs: {applicable_kwargs}"
1861
+ )
1862
+ apply_fn(model=model, **applicable_kwargs)
1863
+
1864
+
1865
+ ################################################################################################
1866
+ ################################################################################################
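+ # Usage sketch (added for illustration, not part of the original file): with
+ # liger_kernel==0.5.2 installed, the instance-level patching defined above could be
+ # applied to an already-loaded model roughly as follows:
+ #
+ #   model = HyperCLOVAXForCausalLM.from_pretrained(YOUR_DIR)
+ #   if _is_liger_kernel_available and _is_liger_kernel_version_matching:
+ #       _apply_liger_kernel_to_instance(model, rope=True, rms_norm=True, swiglu=True)
+ #
+ # which rebinds the LigerRMSNorm and LigerSwiGLUMLP forwards onto the existing
+ # submodules without re-instantiating the model.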
modeling_vlm.py ADDED
@@ -0,0 +1,1913 @@
1
+ import contextlib
2
+ import math
3
+ import os
4
+ from functools import partial
5
+ from itertools import chain
6
+ from typing import List, Optional, Tuple, Union
7
+
8
+ import torch
9
+ import torch.distributed as dist
10
+ import torch.nn as nn
11
+
12
+ try:
13
+ from einops import rearrange
14
+ from timm.layers import LayerNorm, LayerNorm2d
15
+ from timm.models.regnet import RegStage
16
+ except ImportError:
17
+ print("packages needed for anyres are not imported")
18
+ from transformers import (
19
+ AutoConfig,
20
+ AutoModel,
21
+ AutoModelForCausalLM,
22
+ AutoTokenizer,
23
+ PreTrainedModel,
24
+ )
25
+ from transformers.cache_utils import Cache
26
+ from transformers.generation import GenerationMixin
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutputWithPast,
29
+ CausalLMOutputWithPast,
30
+ SequenceClassifierOutputWithPast,
31
+ TokenClassifierOutput
32
+ )
33
+ from transformers.modeling_utils import no_init_weights
34
+
35
+ from .configuration_vlm import HCXVisionConfig
36
+
37
+
38
+ def get_rank():
39
+ if dist.is_initialized():
40
+ return dist.get_rank()
41
+ return 0
42
+
43
+
44
+ def is_ampere_or_newer():
45
+ if not torch.cuda.is_available():
46
+ return False
47
+
48
+ gpu_name = torch.cuda.get_device_name()
49
+
50
+ ampere_keywords = [
51
+ "RTX 30",
52
+ "RTX 40",
53
+ "A100",
54
+ "H100",
55
+ "A6000",
56
+ "A5000",
57
+ "A4000",
58
+ "A3000",
59
+ "A2000",
60
+ "A1000",
61
+ ]
62
+
63
+ return any(keyword in gpu_name for keyword in ampere_keywords)
64
+
65
+
66
+ EOT = "<|endofturn|>"
67
+ IMG_LOC = "<|IMAGE_PAD|>"
68
+
69
+
70
+ # https://github.com/huggingface/transformers/blob/42fe769928b505158bc6a0342f47b10693b81927/src/transformers/models/llama/modeling_llama.py#L315-L330
71
+ class HCXVisionPreTrainedModel(PreTrainedModel):
72
+ config_class = HCXVisionConfig
73
+ base_model_prefix = "model"
74
+ vision_model_name = "vision_model"
75
+ _no_split_modules = [
76
+ "CLIPAttention",
77
+ "SiglipVisionModel",
78
+ # "Qwen2_5_VLVisionBlock",
79
+ # "Qwen2_5_VLVisionModel",
80
+ # "Qwen2_5_VisionTransformerPretrainedModel",
81
+ ] # the vision attention modules are not split in LlavaNext either
82
+ supports_gradient_checkpointing = True
83
+ _skip_keys_device_placement = "past_key_values"
84
+ _supports_flash_attn_2 = True
85
+ _supports_sdpa = True
86
+ _supports_flex_attn = True
87
+ _supports_cache_class = True
88
+ _supports_quantized_cache = True
89
+ _supports_static_cache = True
90
+ _supports_attention_backend = True
91
+
92
+ def _init_weights(self, module):
93
+ # copied from https://github.com/kakaobrain/honeybee/blob/main/honeybee/common_layers.py#L55
94
+ if (
95
+ isinstance(module, nn.Conv2d) # noqa: SIM101
96
+ or isinstance(module, nn.Embedding)
97
+ or isinstance(module, nn.Linear)
98
+ ):
99
+ module.weight.data.normal_(mean=0.0, std=0.02)
100
+ if hasattr(module, "bias") and module.bias is not None:
101
+ module.bias.data.zero_()
102
+
103
+ elif isinstance(module, nn.LayerNorm):
104
+ module.bias.data.zero_()
105
+ module.weight.data.fill_(1.0)
106
+ elif isinstance(module, nn.Parameter):
107
+ embed_std = 1 / torch.sqrt(torch.tensor(module.size(0), dtype=torch.float)).to(module.dtype)
108
+ module.data.normal_(mean=0.0, std=embed_std)
109
+
110
+
111
+ class HCXVisionModel(HCXVisionPreTrainedModel):
112
+ def __init__(
113
+ self,
114
+ config: HCXVisionConfig,
115
+ without_llm=False,
116
+ **kwargs,
117
+ ):
118
+ super().__init__(config)
119
+
120
+ self.flag_changed_max_position_embeddings = False
121
+ self.without_llm = without_llm
122
+
123
+ vision_model_type = config.vision_config.model_type
124
+
125
+ self.is_qwen_visual = False
126
+ if vision_model_type == "qwen2_5_vl_visual":
127
+ self.is_qwen_visual = True
128
+
129
+ self.freeze_before_sampler = kwargs.pop("freeze_before_sampler", False)
130
+
131
+ vision_config = config.vision_config
132
+ vision_config.anyres = config.anyres
133
+ vision_config.max_num_grids = config.max_num_grids
134
+ vision_config.update({"torch_dtype": config.torch_dtype})
135
+ self.vision_config = vision_config
136
+ if config.anyres:
137
+ if not getattr(config, "possible_resolutions", []):
138
+ possible_resolutions = []
139
+ if config.anyres:
140
+ assert config.max_num_grids > 0
141
+ for i in range(1, config.max_num_grids + 1):
142
+ for j in range(1, config.max_num_grids + 1):
143
+ if i == 1 and j == 1 and not config.use_1x1_grid:
144
+ continue
145
+ if i * j <= config.max_num_grids:
146
+ possible_resolutions.append([i, j])
147
+
148
+ possible_resolutions = [
149
+ [ys * vision_config.image_size, xs * vision_config.image_size]
150
+ for ys, xs in possible_resolutions
151
+ ]
152
+ self.config.possible_resolutions = possible_resolutions
153
+ else:
154
+ self.config.possible_resolutions = config.possible_resolutions
155
+
156
+ if without_llm:
157
+ # if vision_config.vision_module_type not in ["officialllava", "cream2"]:
158
+ # In the serving setup, the path in "vision_model_name_or_path" must follow a custom path rather than the default path kept in vuclip_name2save_path.
159
+ vision_config.vison_pretrained_name_or_path = config.vision_model_name_or_path
160
+ with no_init_weights():
161
+ if self.is_qwen_visual and is_ampere_or_newer():
162
+ vision_config._attn_implementation = "flash_attention_2"
163
+ self.vision_model = AutoModel.from_config(
164
+ vision_config, trust_remote_code=True
165
+ ) # weight will be loaded in from_pretrained
166
+ self.vision_model.gradient_checkpointing_enable()
167
+ if config.mm_projector_type == "qwen_merger":
168
+
169
+ import torch.nn.functional as F
170
+
171
+ def new_forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor) -> torch.Tensor:
172
+ """
173
+ Args:
174
+ hidden_states (`torch.Tensor` of shape `(seq_len, hidden_size)`):
175
+ The final hidden states of the model.
176
+ grid_thw (`torch.Tensor` of shape `(num_images_or_videos, 3)`):
177
+ The temporal, height and width of feature shape of each image in LLM.
178
+
179
+ Returns:
180
+ `torch.Tensor`: hidden_states.
181
+ """
182
+ hidden_states = self.patch_embed(hidden_states)
183
+ rotary_pos_emb = self.rot_pos_emb(grid_thw)
184
+ window_index, cu_window_seqlens = self.get_window_index(grid_thw)
185
+ cu_window_seqlens = torch.tensor(
186
+ cu_window_seqlens,
187
+ device=hidden_states.device,
188
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
189
+ )
190
+ cu_window_seqlens = torch.unique_consecutive(cu_window_seqlens)
191
+
192
+ seq_len, _ = hidden_states.size()
193
+ hidden_states = hidden_states.reshape(
194
+ seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1
195
+ )
196
+ hidden_states = hidden_states[window_index, :, :]
197
+ hidden_states = hidden_states.reshape(seq_len, -1)
198
+ rotary_pos_emb = rotary_pos_emb.reshape(
199
+ seq_len // self.spatial_merge_unit, self.spatial_merge_unit, -1
200
+ )
201
+ rotary_pos_emb = rotary_pos_emb[window_index, :, :]
202
+ rotary_pos_emb = rotary_pos_emb.reshape(seq_len, -1)
203
+ emb = torch.cat((rotary_pos_emb, rotary_pos_emb), dim=-1)
204
+ position_embeddings = (emb.cos(), emb.sin())
205
+
206
+ cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
207
+ dim=0,
208
+ # Select dtype based on the following factors:
209
+ # - FA2 requires that cu_seqlens_q must have dtype int32
210
+ # - torch.onnx.export requires that cu_seqlens_q must have same dtype as grid_thw
211
+ # See https://github.com/huggingface/transformers/pull/34852 for more information
212
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
213
+ )
214
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
215
+
216
+ for layer_num, blk in enumerate(self.blocks):
217
+ if layer_num in self.fullatt_block_indexes:
218
+ cu_seqlens_now = cu_seqlens
219
+ else:
220
+ cu_seqlens_now = cu_window_seqlens
221
+ if self.gradient_checkpointing and self.training:
222
+ hidden_states = self._gradient_checkpointing_func(
223
+ blk.__call__, hidden_states, cu_seqlens_now, None, position_embeddings
224
+ )
225
+ else:
226
+ hidden_states = blk(
227
+ hidden_states, cu_seqlens=cu_seqlens_now, position_embeddings=position_embeddings
228
+ )
229
+
230
+ # hidden_states = self.merger(hidden_states)
231
+ # reverse_indices = torch.argsort(window_index)
232
+ # hidden_states = hidden_states[reverse_indices, :]
233
+
234
+ return hidden_states, window_index
235
+
236
+ import types
237
+
238
+ self.vision_model.forward = types.MethodType(new_forward, self.vision_model)
239
+ self.vision_model.merger = nn.Identity()
240
+
241
+ if hasattr(config, "text_config") and config.text_config is not None:
242
+ text_config = config.text_config
243
+ else:
244
+ raise ValueError("text_config is not defined")
245
+ text_config.update({"torch_dtype": config.torch_dtype})
246
+ if config.text_config.model_type in ["llama", "hyperclovax", "gpt2"]:
247
+ text_config._attn_implementation = config._attn_implementation
248
+ if text_config.model_type != "hyperclovax":
249
+ text_config.logits_scaling = 1.0
250
+
251
+ text_config.vocab_size = (
252
+ text_config.padded_vocab_size if hasattr(text_config, "padded_vocab_size") else text_config.vocab_size
253
+ )
254
+
255
+ if not without_llm:
256
+ with no_init_weights():
257
+ self.language_model = AutoModelForCausalLM.from_config(text_config, trust_remote_code=True)
258
+
259
+ if config.text_config.model_type in ["llama", "hyperclovax", "gpt2"]:
260
+ self.language_model.gradient_checkpointing_enable()
261
+ self.num_queries_vis_abstractor = config.num_queries_vis_abstractor
262
+
263
+ # mm_projector (== connector); vision_model_hidden_size -> LLM embedding size
264
+ input_hidden_size = vision_config.hidden_size
265
+ if vision_config.model_type == "qwen2_5_vl_visual":
266
+ input_hidden_size = vision_config.out_hidden_size
267
+ if config.mm_projector_type == "linear":
268
+ self.mm_projector = nn.Linear(input_hidden_size, text_config.hidden_size)
269
+
270
+ elif config.mm_projector_type == "cabstractor":
271
+ self.mm_projector = CAbstractor(
272
+ num_queries=self.num_queries_vis_abstractor,
273
+ num_input_tokens=(self.vision_config.image_size // self.vision_config.patch_size) ** 2,
274
+ encoder_hidden_size=input_hidden_size,
275
+ hidden_size=input_hidden_size,
276
+ output_hidden_size=text_config.hidden_size,
277
+ pos_emb=config.proj_pos_emb,
278
+ prenorm=config.proj_prenorm,
279
+ )
280
+ self.mm_projector.pos_emb.to(config.torch_dtype)
281
+ elif config.mm_projector_type == "qwen_merger":
282
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
283
+ Qwen2_5_VLPatchMerger,
284
+ )
285
+
286
+ self.mm_projector = Qwen2_5_VLPatchMerger(dim=text_config.hidden_size, context_dim=input_hidden_size)
287
+
288
+ def new_forward(self, inputs) -> torch.Tensor:
289
+ x, window_index = inputs
290
+ x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
291
+ reverse_indices = torch.argsort(window_index)
292
+ x = x[reverse_indices, :]
293
+ return x
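+ # Illustrative sketch: for any tensor t, t[window_index][argsort(window_index)] == t, so the
+ # argsort above inverts the window permutation applied in the patched vision forward. With a
+ # hypothetical window_index = [2, 0, 1], reverse_indices = [1, 2, 0] and x[reverse_indices]
+ # restores the original row order of the merged features.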
294
+
295
+ self.mm_projector.forward = types.MethodType(new_forward, self.mm_projector)
296
+
297
+ else:
298
+ self.mm_projector = VLM_Mlp(
299
+ config.mm_projector_type,
300
+ input_hidden_size,
301
+ hidden_features=input_hidden_size, # TODO: as in llava, use the LLM embedding size instead of input_hidden_size here
302
+ out_features=text_config.hidden_size,
303
+ )
304
+ self.use_nth_layer = config.use_nth_layer
305
+ self.model_parallel = False
306
+ self.device_map = None
307
+ self.vision_model_use_no_grad = None
308
+
309
+ self.text_config = text_config
310
+
311
+ self.anyres = config.anyres
312
+ self.unpad = config.unpad
313
+ self.vision_input_chunk_size = kwargs.pop("vision_input_chunk_size", None)
314
+ if self.anyres:
315
+ self.image_newline = nn.Parameter(torch.empty(text_config.hidden_size, dtype=self.dtype))
316
+
317
+ self.is_safetensor_save = kwargs.get("is_safetensor_save", True)
318
+ self._backward_compatibility_gradient_checkpointing() # part of self.post_init(); checks whether gradient checkpointing is possible and enables it
319
+ self.mm_projector.to(config.torch_dtype)
320
+
321
+ def forward(
322
+ self,
323
+ input_ids: Optional[torch.LongTensor] = None,
324
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
325
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
326
+ attention_mask: Optional[torch.FloatTensor] = None,
327
+ position_ids: Optional[torch.LongTensor] = None,
328
+ inputs_embeds: Optional[torch.FloatTensor] = None,
329
+ use_cache: Optional[bool] = None,
330
+ output_attentions: Optional[bool] = None,
331
+ output_hidden_states: Optional[bool] = None,
332
+ return_dict: Optional[bool] = True,
333
+ image_sizes: Optional[List[List[List[int]]]] = None,
334
+ vision_query_lengths: Optional[List[List[int]]] = None,
335
+ non_vision_query_lengths: Optional[List[List[int]]] = None,
336
+ img_start_ids_list: Optional[List[List[int]]] = None,
337
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
338
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
339
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
340
+ is_videos: Optional[List[List[bool]]] = None,
341
+ image_grid_thw: Optional[torch.LongTensor] = None,
342
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
343
+ video_grid_thw: Optional[torch.LongTensor] = None,
344
+ **kwargs,
345
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
346
+ """
347
+ :param input_ids: torch.int64 : torch.Size([batchsize, variable]) : system prompt and question text token indices for the tokenizer.
348
+ In positions where images are inputted, the value is replaced by config.img_start_id, which is a vocabulary index used to indicate the start of image data.
349
+ :param pixel_values: List of List of 4D tensor (torch.float32)
350
+ Each outer list corresponds to a batch and contains inner lists, each holding tensors for images in a sample. The structure accounts for samples with multiple images.
351
+ :param past_key_values: None
352
+ :param inputs_embeds: None
353
+ :param use_cache: None
354
+ :param output_attentions: Optional[bool] : get attention weights of each layer of the transformer network (True: included in the output, False: not included)
355
+ :param output_hidden_states: Optional[bool] : get hidden states of each layer of the transformer network (True: included in the output, False: not included)
356
+ :param image_sizes: Stacked as a List of List, representing image sizes (width, height).
357
+ In cases where a sample contains no images, a single dummy image is included.
358
+ :param vision_query_lengths: A List of List that stores the lengths when each image is converted into visual tokens for LLM input.
359
+ In cases where a sample does not contain any images, an empty list is included.
360
+ :param non_vision_query_lengths: contains the lengths of text tokens (excluding visual tokens) for each sample in a batch.
361
+ :img_start_ids_list: contains the indices of the img_start_id tokens for each sample.
362
+ :num_queries_vis_abstractors: A List of List that contains the number of visual tokens for each image grid.
363
+ :num_queries_vis_abstractors_slow: A List of List that contains the number of visual tokens for the slow part when applying the slowfast algorithm to video frames. If the slowfast algorithm is not applied, it will have a value of None.
364
+ :first_last_frames_slows: A List of List that contains the only first and last frames slow mode for each sample in a batch.
365
+ :is_videos: A List of List that contains the boolean value indicating whether each sample in a batch is a video.
366
+ :image_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
367
+ :pixel_values_videos: A 2D tensor (torch.float32) for qwen2.5-vl visual encoder.
368
+ :video_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
369
+ :return:
370
+ """
371
+ output_attentions = (
372
+ output_attentions if output_attentions is not None else self.config.vision_config.output_attentions
373
+ )
374
+ output_hidden_states = (
375
+ output_hidden_states if output_hidden_states is not None else self.config.vision_config.output_hidden_states
376
+ )
377
+
378
+ if inputs_embeds is None and past_key_values is None:
379
+ inputs_embeds = self.extract_inputs_embeds(
380
+ input_ids=input_ids,
381
+ pixel_values=pixel_values,
382
+ past_key_values=past_key_values,
383
+ image_sizes=image_sizes,
384
+ vision_query_lengths=vision_query_lengths,
385
+ non_vision_query_lengths=non_vision_query_lengths,
386
+ img_start_ids_list=img_start_ids_list,
387
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
388
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
389
+ first_last_frames_slows=first_last_frames_slows,
390
+ is_videos=is_videos,
391
+ image_grid_thw=image_grid_thw,
392
+ pixel_values_videos=pixel_values_videos,
393
+ video_grid_thw=video_grid_thw,
394
+ )
395
+
396
+ if inputs_embeds is not None:
397
+ input_ids = None
398
+
399
+ outputs = self.language_model.base_model(
400
+ input_ids=input_ids,
401
+ inputs_embeds=inputs_embeds,
402
+ attention_mask=attention_mask,
403
+ position_ids=position_ids,
404
+ past_key_values=past_key_values,
405
+ use_cache=use_cache,
406
+ output_attentions=output_attentions,
407
+ output_hidden_states=output_hidden_states,
408
+ return_dict=return_dict,
409
+ )
410
+ return outputs
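+ # Illustrative call sketch (hypothetical tensors, not part of the original code):
+ #     out = model(input_ids=ids,                    # (B, T) with img_start_id placeholders
+ #                 pixel_values=[[img_0], [img_1]],  # one inner list of image tensors per sample
+ #                 image_sizes=[[[w, h]], [[w, h]]])
+ #     out.last_hidden_state                         # hidden states returned by the LM backbone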
411
+
412
+ def determine_non_vision_query_lengths(self, input_ids, pad_id, img_start_id):
413
+ """non_vision_query_lengths 를 계산하는 함수
414
+ input_ids 가 collate 될때, 오른쪽에 pad_id 가 채워지기 때문에 이 값을 찾는 방식을 통해 계산됨
415
+ 또한 img_start_id 는 visual token 이 들어서는 자리이기 때문에, 해당 indices 은 제거
416
+ """
417
+ non_vision_query_lengths = []
418
+ batch_size, len_seq = input_ids.size(0), input_ids.size(1)
419
+
420
+ for i in range(batch_size):
421
+ temp_idx = (input_ids[i] == pad_id).nonzero()
422
+ eos_idx = temp_idx[0, 0].item() if len(temp_idx) > 0 else len_seq
423
+ num_imgs = (input_ids[i] == img_start_id).sum().item()
424
+ non_vision_query_lengths.append(eos_idx - num_imgs)
425
+
426
+ if all([pad_id in input_id for input_id in input_ids.tolist()]):
427
+ non_vision_query_lengths = [
428
+ non_vision_query_length + 1 for non_vision_query_length in non_vision_query_lengths
429
+ ]
430
+
431
+ return non_vision_query_lengths
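+ # Worked example (hypothetical ids): with pad_id = 0 and img_start_id = 7, the row
+ #     [5, 7, 9, 9, 0, 0]
+ # has its first pad at index 4 and one img_start token, so its text length is 4 - 1 = 3
+ # (plus 1 if every row in the batch contains pad_id).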
432
+
433
+ def determine_vision_query_lengths(self, image_features, image_cnts):
434
+ """vision_query_lengths 를 계산하는 함수
435
+ image_features tensor 의 shape 을 통해 계산된다.
436
+ 이미지가 1장도 없는 sample 의 경우 dummy image 1장이 들어가기 때문에, 따로 빈 list 처리 또한 추가
437
+ """
438
+ vision_query_lengths = [
439
+ [image_feature.size(0) for image_feature in image_feature_list] for image_feature_list in image_features
440
+ ]
441
+
442
+ for i, image_cnt in enumerate(image_cnts):
443
+ if image_cnt == 0:
444
+ assert len(vision_query_lengths[i]) == 1 # currently a single black dummy image is present
445
+ vision_query_lengths[i] = [] # convert to an empty list
446
+
447
+ return vision_query_lengths
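+ # Worked example (hypothetical shapes): image_features = [[(81, D)], [(81, D), (9, D)]]
+ # yields vision_query_lengths = [[81], [81, 9]]; a sample with image_cnt == 0 holds only the
+ # dummy image, so its entry is replaced by an empty list.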
448
+
449
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings
450
+ def get_input_embeddings(self):
451
+ if self.without_llm:
452
+ return None
453
+ else:
454
+ return self.language_model.get_input_embeddings()
455
+
456
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_input_embeddings
457
+ def set_input_embeddings(self, value):
458
+ self.language_model.set_input_embeddings(value)
459
+
460
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_output_embeddings
461
+ def get_output_embeddings(self):
462
+ if self.without_llm:
463
+ return None
464
+ else:
465
+ return self.language_model.get_output_embeddings()
466
+
467
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_output_embeddings
468
+ def set_output_embeddings(self, new_embeddings):
469
+ self.language_model.set_output_embeddings(new_embeddings)
470
+
471
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_decoder
472
+ def set_decoder(self, decoder):
473
+ self.language_model.set_decoder(decoder)
474
+
475
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_decoder
476
+ def get_decoder(self):
477
+ return self.language_model.get_decoder()
478
+
479
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.tie_weights
480
+ def tie_weights(self):
481
+ if self.without_llm:
482
+ return None
483
+ else:
484
+ return self.language_model.tie_weights()
485
+
486
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.resize_token_embeddings
487
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
488
+ model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
489
+ # update vocab size
490
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
491
+ self.vocab_size = model_embeds.num_embeddings
492
+ return model_embeds
493
+
494
+ def extract_inputs_embeds(
495
+ self,
496
+ input_ids: Optional[torch.LongTensor] = None,
497
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None, # list of list of 4D tensors
498
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
499
+ image_sizes: Optional[List[List[List[int]]]] = None,
500
+ vision_query_lengths: Optional[List[List[int]]] = None,
501
+ non_vision_query_lengths: Optional[List[int]] = None,
502
+ img_start_ids_list: Optional[List[List[int]]] = None,
503
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
504
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
505
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
506
+ is_videos: Optional[List[List[bool]]] = None,
507
+ image_grid_thw: Optional[torch.LongTensor] = None,
508
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
509
+ video_grid_thw: Optional[torch.LongTensor] = None,
510
+ ):
511
+ """
512
+ :param input_ids: torch.int64 : torch.Size([batchsize, variable]) : system prompt and question text token indices for the tokenizer.
513
+ In positions where images are inputted, the value is replaced by config.img_start_id, which is a vocabulary index used to indicate the start of image data.
514
+ In cases where a sample contains no images, a single dummy image is included.
515
+ :param pixel_values: List of List of 4D tensor (torch.float32)
516
+ Each outer list corresponds to a batch and contains inner lists, each holding tensors for images in a sample. The structure accounts for samples with multiple images.
517
+ :param past_key_values: None : (batch_size, num_heads, sequence_length - 1, embed_size_per_head): Contains precomputed key and value hidden-states of the attention blocks. Can be used to speed up
518
+ :param image_sizes: Stacked as a List of List, representing image sizes (width, height).
519
+ In cases where a sample contains no images, a single dummy image is included.
520
+ :param vision_query_lengths: A List of List that stores the lengths when each image is converted into visual tokens for LLM input.
521
+ In cases where a sample does not contain any images, an empty list is included.
522
+ :param non_vision_query_lengths: contains the lengths of text tokens (excluding visual tokens) for each sample in a batch.
523
+ :img_start_ids_list: contains the indices of the img_start_id tokens for each sample.
524
+ :num_queries_vis_abstractors: A List of List that contains the number of visual tokens for each image grid.
525
+ :num_queries_vis_abstractors_slow: A List of List that contains the number of visual tokens for the slow part when applying the slowfast algorithm to video frames. If the slowfast algorithm is not applied, it will have a value of None.
526
+ :first_last_frames_slows: A List of bool that contains the information of whether the slowfast algorithm is applied to the first or last frames of the video.
527
+ :is_videos: A List of List that contains the boolean value indicating whether each sample in a batch is a video.
528
+ :image_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
529
+ :pixel_values_videos: A 2D tensor (torch.float32) for qwen2.5-vl visual encoder.
530
+ :video_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
531
+ :return:
532
+ """
533
+ inputs_embeds = None
534
+ if past_key_values:
535
+ pass
536
+ else:
537
+ if self.is_qwen_visual:
538
+ inputs_embeds = self.get_input_embeddings()(input_ids)
539
+ context_vision_model = torch.no_grad() if self.config.freeze_encoder else contextlib.nullcontext()
540
+
541
+ if pixel_values is not None:
542
+ with context_vision_model:
543
+ image_features = self.vision_model(pixel_values, grid_thw=image_grid_thw)
544
+ image_features = self.mm_projector(image_features)
545
+
546
+ if img_start_ids_list is None:
547
+ image_cnts = (input_ids == self.config.img_start_id).sum(dim=1).tolist()
548
+ else:
549
+ image_cnts = [len(img_start_ids) for img_start_ids in img_start_ids_list]
550
+
551
+ mask = input_ids.eq(self.config.img_start_id)
552
+ positions = mask.nonzero(as_tuple=False)
553
+
554
+ batch_idx = positions[:, 0]
555
+ seq_idx = positions[:, 1]
556
+
557
+ if sum(image_cnts) == 0:
558
+ image_features = image_features[0:0] # trick for sft1 data
559
+ inputs_embeds[batch_idx, seq_idx, :] = image_features.to(device=inputs_embeds.device)
560
+
561
+ if pixel_values_videos is not None:
562
+ with context_vision_model:
563
+ video_features = self.vision_model(pixel_values_videos, grid_thw=video_grid_thw)
564
+ video_features = self.mm_projector(video_features)
565
+
566
+ video_cnts = (input_ids == self.config.video_start_id).sum(dim=1).tolist()
567
+ mask = input_ids.eq(self.config.video_start_id)
568
+ positions = mask.nonzero(as_tuple=False)
569
+
570
+ batch_idx = positions[:, 0]
571
+ seq_idx = positions[:, 1]
572
+
573
+ if sum(video_cnts) == 0:
574
+ video_features = video_features[0:0] # trick for no video batch
575
+ inputs_embeds[batch_idx, seq_idx, :] = video_features.to(device=inputs_embeds.device)
576
+ else:
577
+ # CLIP and the connector encode the flattened inputs, then the features are converted back into a list of lists
578
+ len_pixel_values = [len(pixel_value) for pixel_value in pixel_values]
579
+ concat_pixel_values = torch.cat(list(chain(*pixel_values)), dim=0) # list of list of 4D Tensor
580
+ visual_token_idx = 0 if "siglip" in self.vision_config.model_type else 1
581
+
582
+ # check whether the adaptive anyres path should be taken:
583
+ # num_queries_vis_abstractors is not None, and
584
+ # at least one of its values differs from self.num_queries_vis_abstractor
585
+ is_adaptive_anyres = num_queries_vis_abstractors is not None and any(
586
+ self.num_queries_vis_abstractor != num_queries_vis_abstractor
587
+ for sublist in num_queries_vis_abstractors
588
+ for num_queries_vis_abstractor in sublist
589
+ )
590
+ if not is_adaptive_anyres:
591
+ image_sizes = list(chain(*image_sizes))
592
+ if is_videos is not None:
593
+ is_videos = list(chain(*is_videos))
594
+ else:
595
+ is_videos = [False] * len(image_sizes)
596
+
597
+ group_ids = None
598
+ else:
599
+ # adaptive anyres is implemented only for CAbstractor; CAbstractor may be wrapped in a CheckpointWrapper
600
+ # assert isinstance(self.mm_projector, CAbstractor)
601
+ is_cabstractor = False
602
+ for submodule in self.mm_projector.modules():
603
+ if isinstance(submodule, CAbstractor):
604
+ is_cabstractor = True
605
+ break
606
+ assert is_cabstractor
607
+
608
+ assert num_queries_vis_abstractors_slow is not None
609
+
610
+ num_queries_vis_abstractors, num_grids, image_sizes, is_videos, group_ids = (
611
+ self.compute_adaptive_params(
612
+ pixel_values,
613
+ num_queries_vis_abstractors,
614
+ num_queries_vis_abstractors_slow,
615
+ image_sizes,
616
+ is_videos,
617
+ first_last_frames_slows,
618
+ )
619
+ )
620
+
621
+ # check whether all parameters of the vision encoder have requires_grad=False.
622
+ if torch.is_grad_enabled():
623
+ if self.vision_model_use_no_grad is None:
624
+ self.vision_model_use_no_grad = all(
625
+ not p.requires_grad for p in self.vision_model.vision_model.encoder.parameters()
626
+ )
627
+ context_vision_model = torch.no_grad() if self.vision_model_use_no_grad else contextlib.nullcontext()
628
+ if self.vision_input_chunk_size is not None:
629
+ # compute n_chunks (how many iterations of the for loop are needed)
630
+ chunk_size = self.vision_input_chunk_size
631
+
632
+ local_batch_size = torch.tensor([concat_pixel_values.size(0)], device=concat_pixel_values.device)
633
+ gathered_batch_sizes = [
634
+ torch.zeros_like(local_batch_size) for _ in range(torch.distributed.get_world_size())
635
+ ]
636
+ torch.distributed.all_gather(gathered_batch_sizes, local_batch_size)
637
+ gathered_batch_sizes = torch.stack(gathered_batch_sizes)
638
+ max_batch_size = gathered_batch_sizes.max().item()
639
+
640
+ n_chunks = math.ceil(max_batch_size / chunk_size)
641
+
642
+ if is_adaptive_anyres:
643
+ chunk_num_queries_vis_abstractors, chunk_num_grids, chunk_is_splits = (
644
+ self.split_adaptive_params(
645
+ num_queries_vis_abstractors,
646
+ num_grids,
647
+ chunk_size,
648
+ n_chunks,
649
+ )
650
+ )
651
+
652
+ # create a dummy tensor based on the shape of concat_pixel_values
653
+ dummy_shape = (1,) + tuple(concat_pixel_values.shape[1:])
654
+ dummy = torch.zeros(
655
+ dummy_shape, dtype=concat_pixel_values.dtype, device=concat_pixel_values.device
656
+ ).to(self.vision_model.dtype)
657
+
658
+ else:
659
+ # no chunking; process the original input as a single batch
660
+ chunk_size = concat_pixel_values.size(0)
661
+ n_chunks = 1
662
+
663
+ image_forward_outs = []
664
+
665
+ for i in range(n_chunks):
666
+ start = i * chunk_size
667
+ end = (i + 1) * chunk_size
668
+ # current chunk slice (may be an empty tensor if there is no data left)
669
+ chunk = concat_pixel_values[start:end].to(self.vision_model.dtype)
670
+ current_chunk_size = chunk.size(0)
671
+
672
+ # if the current chunk is empty, forward the dummy data instead
673
+ if current_chunk_size == 0:
674
+ chunk = dummy
675
+
676
+ # pass the chunk through the vision model (handled according to use_nth_layer)
677
+ if self.use_nth_layer == -1:
678
+ # replace post_layernorm, the final layer's post-processing, with Identity
679
+ self.vision_model.vision_model.post_layernorm = nn.Identity()
680
+ with context_vision_model:
681
+ outs = self.vision_model(chunk)
682
+ outs = outs.last_hidden_state[:, visual_token_idx:]
683
+ else:
684
+ with context_vision_model:
685
+ outs = self.vision_model(chunk, output_hidden_states=True)
686
+ outs = outs.hidden_states[self.use_nth_layer][:, visual_token_idx:]
687
+ if self.vision_model_use_no_grad:
688
+ outs = outs.detach().requires_grad_(True)
689
+ if not is_adaptive_anyres:
690
+ if self.freeze_before_sampler and self.training:
691
+ outs = self.mm_projector(outs, freeze_before_sampler=True)
692
+ else:
693
+ outs = self.mm_projector(outs)
694
+ if current_chunk_size > 0:
695
+ image_forward_outs.append(outs)
696
+ else:
697
+ if n_chunks != 1:
698
+ current_num_queries_vis_abstractors = chunk_num_queries_vis_abstractors[i]
699
+ current_num_grids = chunk_num_grids[i]
700
+ else:
701
+ current_num_queries_vis_abstractors = num_queries_vis_abstractors
702
+ current_num_grids = num_grids
703
+ if self.freeze_before_sampler and self.training:
704
+ outs = self.mm_projector(
705
+ outs,
706
+ num_queries_vis_abstractors=current_num_queries_vis_abstractors,
707
+ num_grids=current_num_grids,
708
+ freeze_before_sampler=True,
709
+ )
710
+ else:
711
+ outs = self.mm_projector(
712
+ outs,
713
+ num_queries_vis_abstractors=current_num_queries_vis_abstractors,
714
+ num_grids=current_num_grids,
715
+ )
716
+ if current_chunk_size > 0:
717
+ if i > 0 and chunk_is_splits[i - 1]:
718
+ # merge the first element into the previous result
719
+ image_forward_outs[-1] = torch.cat([image_forward_outs[-1], outs[0]], dim=0)
720
+ image_forward_outs.extend(outs[1:])
721
+ else:
722
+ image_forward_outs.extend(outs)
723
+ # concatenate the results of all chunks
724
+ if not is_adaptive_anyres:
725
+ # if not adaptive anyres, merge all results into a single tensor
726
+ # for adaptive anyres, the results are kept and used as a list
727
+ image_forward_outs = torch.cat(image_forward_outs, dim=0).to(image_forward_outs[0].dtype)
728
+
729
+ if img_start_ids_list is None:
730
+ image_cnts = (input_ids == self.config.img_start_id).sum(dim=1).tolist()
731
+ else:
732
+ image_cnts = [len(img_start_ids) for img_start_ids in img_start_ids_list]
733
+
734
+ if self.anyres:
735
+ split_sizes = [pixel_value.shape[0] for pixel_value in chain(*pixel_values)]
736
+
737
+ # if not is_adaptive_anyres:
738
+ # image_features = anyres_postprocessing(
739
+ # image_forward_outs=image_forward_outs,
740
+ # split_sizes=split_sizes,
741
+ # image_sizes=image_sizes,
742
+ # num_queries_vis_abstractor=self.num_queries_vis_abstractor,
743
+ # unpad=self.unpad,
744
+ # is_videos=is_videos,
745
+ # patch_size=self.vision_model.config.patch_size,
746
+ # grid_size=self.vision_model.config.image_size,
747
+ # image_newline=self.image_newline,
748
+ # possible_resolutions=self.config.possible_resolutions,
749
+ # )
750
+ # else:
751
+ # image_features = adaptive_anyres_postprocessing(
752
+ # image_forward_outs=image_forward_outs,
753
+ # image_sizes=image_sizes,
754
+ # num_queries_vis_abstractors=num_queries_vis_abstractors,
755
+ # unpad=self.unpad,
756
+ # is_videos=is_videos,
757
+ # patch_size=self.vision_model.config.patch_size,
758
+ # grid_size=self.vision_model.config.image_size,
759
+ # image_newline=self.image_newline,
760
+ # possible_resolutions=self.config.possible_resolutions,
761
+ # group_ids=group_ids,
762
+ # )
763
+ else:
764
+ if not is_adaptive_anyres:
765
+ image_features = [image_forward_out for image_forward_out in image_forward_outs]
766
+ else:
767
+ image_features = [image_forward_out.unsqueeze(0) for image_forward_out in image_forward_outs]
768
+
769
+ image_features = [
770
+ image_features[sum(len_pixel_values[:i]) : sum(len_pixel_values[: i + 1])]
771
+ for i in range(len(len_pixel_values))
772
+ ]
773
+
774
+ # when running inference without the LLM, the prompt is assembled outside, since its composition differs from training.
775
+ if self.without_llm:
776
+ return image_features
777
+
778
+ batch_size = input_ids.size(0)
779
+ image_feature_dim = image_features[0][0].size(1)
780
+ image_feature_dtype = image_features[0][0].dtype
781
+
782
+ if img_start_ids_list is None:
783
+ image_cnts = (input_ids == self.config.img_start_id).sum(dim=1).tolist()
784
+ else:
785
+ image_cnts = [len(img_start_ids) for img_start_ids in img_start_ids_list]
786
+
787
+ if non_vision_query_lengths is None:
788
+ non_vision_query_lengths = self.determine_non_vision_query_lengths(
789
+ input_ids, self.config.text_config.pad_token_id, self.config.img_start_id
790
+ )
791
+
792
+ if vision_query_lengths is None:
793
+ vision_query_lengths = self.determine_vision_query_lengths(image_features, image_cnts)
794
+
795
+ # slicing is faster than concat
796
+ len_inputs_embeds = max(
797
+ [
798
+ sum(vision_query_length) + non_vision_query_length
799
+ for non_vision_query_length, vision_query_length in zip(
800
+ non_vision_query_lengths, vision_query_lengths
801
+ )
802
+ ]
803
+ )
804
+
805
+ inputs_embeds = torch.zeros(
806
+ [batch_size, len_inputs_embeds, image_feature_dim],
807
+ dtype=image_feature_dtype,
808
+ device=self.device,
809
+ requires_grad=True,
810
+ ).clone()
811
+
812
+ # temp_embeds : torch.bfloat16 : [batchsize, 174, 3072]
813
+ temp_embeds = self.get_input_embeddings()(input_ids)
814
+
815
+ # the assembled form is <PROMPT><USER_PREFIX><VISION_QUERIES>Sentence
816
+ for batch_idx, sample in enumerate(input_ids):
817
+ # slice after concatenating with the visual tokens
818
+ non_vision_query_length = non_vision_query_lengths[batch_idx]
819
+ # to be safe, slice after concatenating with the visual tokens
820
+ sample = sample[: non_vision_query_length + image_cnts[batch_idx]]
821
+
822
+ if image_cnts[batch_idx] == 0: # text-only instruction data: no image feature is inserted
823
+ temp_idx = 0
824
+ # Reference: https://github.com/haotian-liu/LLaVA/commit/44e0562f9497fb79f042427307472a87d266d90a#diff-4477387d506ccb1897a13972cba26c9da3fad4d3e1c32ec4b8bd8ff7acd3f292
825
+ # https://github.com/intel/intel-extension-for-transformers/issues/1201#issuecomment-1915875119
826
+ inputs_embeds[batch_idx, :non_vision_query_length] = temp_embeds[batch_idx][
827
+ :non_vision_query_length
828
+ ]
829
+ inputs_embeds[batch_idx, temp_idx:temp_idx] = image_features[batch_idx][0][
830
+ 0:0
831
+ ] # the first image (dummy image) of sample batch_idx
832
+ else:
833
+ if img_start_ids_list is None:
834
+ img_start_ids = (sample == self.config.img_start_id).nonzero()
835
+ else:
836
+ img_start_ids = img_start_ids_list[batch_idx]
837
+ assert len(img_start_ids) == image_cnts[batch_idx] == len(image_features[batch_idx])
838
+ # initialize the start positions for the input and temporary embeddings
839
+ input_start, temp_start = 0, 0
840
+
841
+ # iterate over each image start position within the sample
842
+ for multi_img_idx, img_start_idx in enumerate(img_start_ids):
843
+ # compute the token length up to the current image start position
844
+ token_len = img_start_idx - temp_start
845
+
846
+ # copy the text tokens into inputs_embeds
847
+ inputs_embeds[batch_idx, input_start : input_start + token_len] = temp_embeds[
848
+ batch_idx, temp_start : temp_start + token_len
849
+ ]
850
+
851
+ # compute the insertion position and insert the image_features
852
+ inputs_embeds[
853
+ batch_idx,
854
+ input_start
855
+ + token_len : input_start
856
+ + token_len
857
+ + vision_query_lengths[batch_idx][multi_img_idx],
858
+ ] = image_features[batch_idx][multi_img_idx]
859
+
860
+ # update the start positions for processing the next tokens
861
+ input_start += token_len + vision_query_lengths[batch_idx][multi_img_idx]
862
+ temp_start += token_len + 1 # add 1 to step over the image start token
863
+
864
+ # handle the tokens after the last image token
865
+ token_len = min(sample[temp_start:].size(0), inputs_embeds.size(1) - input_start)
866
+ inputs_embeds[batch_idx, input_start : input_start + token_len] = temp_embeds[
867
+ batch_idx, temp_start : temp_start + token_len
868
+ ]
869
+ return inputs_embeds
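+ # Layout sketch (illustrative): for a sample "<sys> <img_start> question", the returned row is
+ #     [emb(<sys>), image_features[b][0] (vision_query_length rows), emb(question ...)]
+ # i.e. each img_start position is expanded into its block of visual tokens.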
870
+
871
+ @classmethod
872
+ def from_pretrained(
873
+ cls,
874
+ pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
875
+ *model_args,
876
+ **kwargs,
877
+ ):
878
+ model = super().from_pretrained(
879
+ pretrained_model_name_or_path,
880
+ *model_args,
881
+ **kwargs,
882
+ )
883
+
884
+ model.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
885
+ return model
886
+
887
+ def save_pretrained(
888
+ self,
889
+ save_directory: Union[str, os.PathLike],
890
+ *args,
891
+ **kwargs,
892
+ ):
893
+ super().register_for_auto_class("AutoModel")
894
+ self.config.register_for_auto_class()
895
+ super().save_pretrained(save_directory, *args, **kwargs)
896
+
897
+ def compute_adaptive_params(
898
+ self,
899
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
900
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
901
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
902
+ image_sizes: Optional[List[List[List[int]]]] = None,
903
+ is_videos: Optional[List[List[bool]]] = None,
904
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
905
+ ):
906
+ # verify that every inner value is a non-negative integer
907
+ assert all(
908
+ all(isinstance(value, int) and value >= 0 for value in sublist) for sublist in num_queries_vis_abstractors
909
+ ), "All values in num_queries_vis_abstractors must be integers >= 0."
910
+
911
+ assert all(
912
+ all(isinstance(value, int) and value >= 0 for value in sublist)
913
+ for sublist in num_queries_vis_abstractors_slow
914
+ ), "All values in num_queries_vis_abstractors_slow must be integers >= 0."
915
+
916
+ assert is_videos is not None
917
+
918
+ # is this the first or last image? (needed to apply slowfast to videos)
919
+ is_first_images = []
920
+ is_last_images = []
921
+ for is_video in is_videos:
922
+ for idx, is_video_item in enumerate(is_video):
923
+ if idx == 0:
924
+ is_first_images.append(True)
925
+ else:
926
+ is_first_images.append(False)
927
+ if idx == len(is_video) - 1:
928
+ is_last_images.append(True)
929
+ else:
930
+ is_last_images.append(False)
931
+
932
+ num_queries_vis_abstractors = list(chain(*num_queries_vis_abstractors))
933
+ num_queries_vis_abstractors_slow = list(chain(*num_queries_vis_abstractors_slow))
934
+ image_sizes = list(chain(*image_sizes))
935
+ is_videos = list(chain(*is_videos))
936
+ first_last_frames_slows = list(chain(*first_last_frames_slows))
937
+
938
+ # use slowfast mode if any entry in num_queries_vis_abstractors_slow is greater than 0
939
+ use_slowfast = any([num_query > 0 for num_query in num_queries_vis_abstractors_slow])
940
+
941
+ num_grids = [pixel_value.shape[0] for pixel_value in chain(*pixel_values)]
942
+ num_grids = [0] + num_grids
943
+ group_ids = []
944
+
945
+ if use_slowfast:
946
+ new_num_grids = [num_grids[0]]
947
+ new_num_queries = []
948
+ new_image_sizes = []
949
+ new_is_videos = []
950
+
951
+ # when slowfast is used, split more finely:
952
+ # the 0th local grid is the slow frame, the remaining local grids are fast frames
953
+ for (
954
+ num_query,
955
+ num_query_slow,
956
+ num_grid,
957
+ image_size,
958
+ is_video,
959
+ first_last_frames_slow,
960
+ is_first_image,
961
+ is_last_image,
962
+ ) in zip(
963
+ num_queries_vis_abstractors,
964
+ num_queries_vis_abstractors_slow,
965
+ num_grids[1:],
966
+ image_sizes,
967
+ is_videos,
968
+ first_last_frames_slows,
969
+ is_first_images,
970
+ is_last_images,
971
+ ):
972
+
973
+ if not first_last_frames_slow and num_query_slow > 0: # process all frames in slowfast mode
974
+ assert is_video is True # slowfast mode applies only to videos
975
+
976
+ this_group_ids = [group_ids[-1][-1] + 1 if group_ids else 0]
977
+
978
+ # slow frame (the very first grid)
979
+ new_num_grids.append(new_num_grids[-1] + 1)
980
+ new_num_queries.append(num_query_slow)
981
+ new_image_sizes.append(image_size)
982
+ new_is_videos.append(is_video)
983
+
984
+ if num_grid >= 2:
985
+ # fast frames
986
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
987
+ new_num_queries.append(num_query)
988
+ new_image_sizes.append(image_size)
989
+ new_is_videos.append(is_video)
990
+ this_group_ids.append(this_group_ids[-1] + 1)
991
+
992
+ group_ids.append(this_group_ids)
993
+ elif (
994
+ first_last_frames_slow and num_query_slow > 0 and (is_first_image or is_last_image)
995
+ ): # Process only first/last image in slowfast mode
996
+ # slow-frame case where only the first and last frames are treated specially.
997
+ assert is_video is True # slowfast mode applies only to videos
998
+
999
+ this_group_ids = [group_ids[-1][-1] + 1 if group_ids else 0]
1000
+
1001
+ if num_grid == 1:
1002
+ # only a single grid here, so handling it as slow is all that is needed.
1003
+ new_num_grids.append(new_num_grids[-1] + 1)
1004
+ new_num_queries.append(num_query_slow)
1005
+ new_image_sizes.append(image_size)
1006
+ new_is_videos.append(is_video)
1007
+
1008
+ if num_grid >= 2:
1009
+ if is_first_image: # also covers the frame being both first and last.
1010
+ # slow frame (the very first grid)
1011
+ new_num_grids.append(new_num_grids[-1] + 1)
1012
+ new_num_queries.append(num_query_slow)
1013
+ new_image_sizes.append(image_size)
1014
+ new_is_videos.append(is_video)
1015
+ # fast frames
1016
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
1017
+ new_num_queries.append(num_query)
1018
+ new_image_sizes.append(image_size)
1019
+ new_is_videos.append(is_video)
1020
+ this_group_ids.append(this_group_ids[-1] + 1)
1021
+ elif is_last_image:
1022
+ # fast frames
1023
+ new_num_grids.append(new_num_grids[-1] + num_grid - 1)
1024
+ new_num_queries.append(num_query)
1025
+ new_image_sizes.append(image_size)
1026
+ new_is_videos.append(is_video)
1027
+ # slow frame (the very last grid)
1028
+ new_num_grids.append(new_num_grids[-1] + 1)
1029
+ new_num_queries.append(num_query_slow)
1030
+ new_image_sizes.append(image_size)
1031
+ new_is_videos.append(is_video)
1032
+ this_group_ids.append(this_group_ids[-1] + 1)
1033
+ else:
1034
+ raise Exception("This case should not be reached.")
1035
+ group_ids.append(this_group_ids)
1036
+
1037
+ else:
1038
+ # not slowfast mode: reduce every grid to num_query tokens (fast)
1039
+ new_num_grids.append(new_num_grids[-1] + num_grid)
1040
+ new_num_queries.append(num_query)
1041
+ new_image_sizes.append(image_size)
1042
+ new_is_videos.append(is_video)
1043
+
1044
+ start_group_id = group_ids[-1][-1] + 1 if group_ids else 0
1045
+ group_ids.append([start_group_id])
1046
+
1047
+ num_grids = new_num_grids
1048
+ num_queries_vis_abstractors = new_num_queries
1049
+ image_sizes = new_image_sizes
1050
+ is_videos = new_is_videos
1051
+ else:
1052
+ num_grids = [sum(num_grids[:i]) for i in range(1, len(num_grids) + 1)]
1053
+ group_ids = [[group_id] for group_id in range(len(is_videos))]
1054
+
1055
+ return num_queries_vis_abstractors, num_grids, image_sizes, is_videos, group_ids
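+ # Worked example (hypothetical values): a video entry with num_grid = 4, num_query = 9,
+ # num_query_slow = 81 and first_last_frames_slow = False is split into a slow part
+ # (1 grid, 81 queries) and a fast part (3 grids, 9 queries), and its group_ids entry
+ # becomes [k, k + 1] so the two parts can be regrouped later.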
1056
+
1057
+ def split_adaptive_params(
1058
+ self, num_queries_vis_abstractors, num_grids, chunk_size: int, n_chunks: int # num_queries: len = n; num_grids: len = n + 1, first value 0
1059
+ ):
1060
+ """
1061
+ Split num_grids/num_queries into chunks of chunk_size, producing at most n_chunks chunks.
1062
+ If there is not enough real data, the remaining chunks are filled with a dummy ([0, 1]).
1063
+
1064
+ Returns
1065
+ -------
1066
+ chunk_qs : List[List[int]]
1067
+ chunk_grids: List[List[int]]
1068
+ Corresponding elements have the same length, and the total number of chunks is exactly n_chunks.
1069
+ """
1070
+ total_len = num_grids[-1] # position of the last grid
1071
+ chunk_qs, chunk_grids, is_splits = [], [], []
1072
+
1073
+ # (start, end) = (0,chunk_size), (chunk_size,2*chunk_size), ...
1074
+ # but only n_chunks of them are created.
1075
+ slices = list(zip(num_grids[:-1], num_grids[1:], num_queries_vis_abstractors))
1076
+ slice_idx = 0 # index of the slice currently being examined
1077
+
1078
+ for chunk_idx in range(n_chunks):
1079
+ start = chunk_idx * chunk_size
1080
+ end = start + chunk_size # [start, end)
1081
+
1082
+ # 1) the input has already been fully consumed: emit a dummy chunk (a single grid)
1083
+ if start >= total_len:
1084
+ chunk_grids.append([0, 1]) # minimum-length dummy of size 1
1085
+ chunk_qs.append([num_queries_vis_abstractors[-1]])
1086
+ is_splits.append(False)
1087
+ continue
1088
+
1089
+ grids_in_chunk = [0] # always starts from 0
1090
+ qs_in_chunk = []
1091
+
1092
+ # skip all slices that do not overlap the current chunk
1093
+ while slice_idx < len(slices) and slices[slice_idx][1] <= start:
1094
+ slice_idx += 1
1095
+
1096
+ is_split = False
1097
+ j = slice_idx
1098
+ while j < len(slices) and slices[j][0] < end:
1099
+ s, e, q = slices[j]
1100
+
1101
+ # boundaries inside the chunk
1102
+ left = max(s, start)
1103
+ right = min(e, end)
1104
+ off = right - start # chunk local offset
1105
+
1106
+ if off not in grids_in_chunk:
1107
+ grids_in_chunk.append(off)
1108
+ qs_in_chunk.append(q)
1109
+ if right == end and e != end:
1110
+ is_split = True # a segment that was not split in the original num_grids has been cut here.
1111
+
1112
+ # if the slice extends beyond the chunk, continue it in the next chunk
1113
+ if e > end:
1114
+ break
1115
+ j += 1
1116
+ slice_idx = j
1117
+
1118
+ # if the last offset differs from the chunk end (or the end of the actual data), correct it
1119
+ final_off = min(end, total_len) - start
1120
+ if grids_in_chunk[-1] != final_off:
1121
+ grids_in_chunk.append(final_off)
1122
+ qs_in_chunk.append(qs_in_chunk[-1] if qs_in_chunk else num_queries_vis_abstractors[-1])
1123
+ # record that a split occurred
1124
+ is_split = True
1125
+
1126
+ chunk_grids.append(grids_in_chunk)
1127
+ chunk_qs.append(qs_in_chunk)
1128
+ is_splits.append(is_split)
1129
+
1130
+ return chunk_qs, chunk_grids, is_splits
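+ # Worked example (hypothetical values): num_grids = [0, 3, 6], num_queries = [81, 9],
+ # chunk_size = 4, n_chunks = 2 gives
+ #     chunk 0 (grids [0, 4)): chunk_grids = [0, 3, 4], chunk_qs = [81, 9], is_split = True
+ #     chunk 1 (grids [4, 6)): chunk_grids = [0, 2],    chunk_qs = [9],     is_split = False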
1131
+
1132
+
1133
+ class HCXVisionForCausalLM(HCXVisionPreTrainedModel, GenerationMixin):
1134
+ def __init__(
1135
+ self,
1136
+ config: HCXVisionConfig,
1137
+ without_llm=False,
1138
+ **kwargs,
1139
+ ):
1140
+ super().__init__(config, without_llm=without_llm, **kwargs)
1141
+ text_config = config.get_text_config()
1142
+ self.model = HCXVisionModel(config=config, **kwargs)
1143
+
1144
+ def forward(
1145
+ self,
1146
+ input_ids: Optional[torch.LongTensor] = None,
1147
+ pixel_values: Optional[List[List[torch.FloatTensor]]] = None,
1148
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
1149
+ attention_mask: Optional[torch.FloatTensor] = None,
1150
+ position_ids: Optional[torch.LongTensor] = None,
1151
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1152
+ labels: Optional[torch.LongTensor] = None,
1153
+ use_cache: Optional[bool] = None,
1154
+ output_attentions: Optional[bool] = None,
1155
+ output_hidden_states: Optional[bool] = None,
1156
+ return_dict: Optional[bool] = True,
1157
+ image_sizes: Optional[List[List[List[int]]]] = None,
1158
+ vision_query_lengths: Optional[List[List[int]]] = None,
1159
+ non_vision_query_lengths: Optional[List[List[int]]] = None,
1160
+ img_start_ids_list: Optional[List[List[int]]] = None,
1161
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1162
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1163
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
1164
+ is_videos: Optional[List[List[bool]]] = None,
1165
+ image_grid_thw: Optional[torch.LongTensor] = None,
1166
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
1167
+ video_grid_thw: Optional[torch.LongTensor] = None,
1168
+ logits_to_keep: Union[int, torch.Tensor] = 0,
1169
+ **kwargs,
1170
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1171
+ """
1172
+ :param input_ids: torch.int64 : torch.Size([batchsize, variable]) : system prompt and question text token indices for the tokenizer.
1173
+ In positions where images are inputted, the value is replaced by config.img_start_id, which is a vocabulary index used to indicate the start of image data.
1174
+ :param pixel_values: List of List of 4D tensor (torch.float32)
1175
+ Each outer list corresponds to a batch and contains inner lists, each holding tensors for images in a sample. The structure accounts for samples with multiple images.
1176
+ :param past_key_values: None
1177
+ :param inputs_embeds: None
1178
+ :param labels: Optional[torch.int64] : [batchsize, variable (input_ids.size(1) + num visual tokens)]; all visual tokens are set to IGNORE_INDEX
1179
+ :param use_cache: None
1180
+ :param output_attentions: Optional[bool] : get attention weights of each layer of the transformer network (True: included in the output, False: not included)
1181
+ :param output_hidden_states: Optional[bool] : get hidden states of each layer of the transformer network (True: included in the output, False: not included)
1182
+ :param image_sizes: Stacked as a List of List, representing image sizes (width, height).
1183
+ In cases where a sample contains no images, a single dummy image is included.
1184
+ :param vision_query_lengths: A List of List that stores the lengths when each image is converted into visual tokens for LLM input.
1185
+ In cases where a sample does not contain any images, an empty list is included.
1186
+ :param non_vision_query_lengths: contains the lengths of text tokens (excluding visual tokens) for each sample in a batch.
1187
+ :img_start_ids_list: contains the indices of the img_start_id tokens for each sample.
1188
+ :num_queries_vis_abstractors: A List of List that contains the number of visual tokens for each image grid.
1189
+ :num_queries_vis_abstractors_slow: A List of List that contains the number of visual tokens for the slow part when applying the slowfast algorithm to video frames. If the slowfast algorithm is not applied, it will have a value of None.
1190
+ :first_last_frames_slows: A List of List that contains the only first and last frames slow mode for each sample in a batch.
1191
+ :is_videos: A List of List that contains the boolean value indicating whether each sample in a batch is a video.
1192
+ :image_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
1193
+ :pixel_values_videos: A 2D tensor (torch.float32) for qwen2.5-vl visual encoder.
1194
+ :video_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
1195
+ :return:
1196
+ """
1197
+ loss = None
1198
+ logits = None
1199
+ outputs = self.model.forward(
1200
+ input_ids=input_ids,
1201
+ pixel_values=pixel_values,
1202
+ past_key_values=past_key_values,
1203
+ attention_mask=attention_mask,
1204
+ position_ids=position_ids,
1205
+ inputs_embeds=inputs_embeds,
1206
+ use_cache=use_cache,
1207
+ output_attentions=output_attentions,
1208
+ output_hidden_states=output_hidden_states,
1209
+ return_dict=return_dict,
1210
+ image_sizes=image_sizes,
1211
+ vision_query_lengths=vision_query_lengths,
1212
+ non_vision_query_lengths=non_vision_query_lengths,
1213
+ img_start_ids_list=img_start_ids_list,
1214
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1215
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
1216
+ first_last_frames_slows=first_last_frames_slows,
1217
+ is_videos=is_videos,
1218
+ image_grid_thw=image_grid_thw,
1219
+ pixel_values_videos=pixel_values_videos,
1220
+ video_grid_thw=video_grid_thw,
1221
+ )
1222
+ hidden_states = outputs.last_hidden_state
1223
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
1224
+ logits = self.model.language_model.lm_head(hidden_states[:, slice_indices, :]) * getattr(
1225
+ self.config.text_config, "logits_scaling", 1
1226
+ )
1227
+
1228
+ loss = None
1229
+ if labels is not None:
1230
+ loss = self.loss_function(
1231
+ logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
1232
+ )
1233
+ return CausalLMOutputWithPast(
1234
+ loss=loss,
1235
+ logits=logits,
1236
+ past_key_values=outputs.past_key_values,
1237
+ hidden_states=outputs.hidden_states,
1238
+ attentions=outputs.attentions,
1239
+ )
1240
+
1241
+ @torch.no_grad()
1242
+ def inference(
1243
+ self,
1244
+ input_ids: Optional[torch.LongTensor] = None,
1245
+ pixel_values: Optional[
1246
+ Union[List[List[torch.FloatTensor]], torch.FloatTensor]
1247
+ ] = None, # torch.FloatTensor for qwen2.5-vl visual encoder
1248
+ image_sizes: Optional[List[List[List[int]]]] = None,
1249
+ vision_query_lengths: Optional[List[List[int]]] = None,
1250
+ non_vision_query_lengths: Optional[List[int]] = None,
1251
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1252
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1253
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
1254
+ is_videos: Optional[List[List[bool]]] = None,
1255
+ img_start_ids_list: Optional[List[List[int]]] = None,
1256
+ image_grid_thw: Optional[torch.LongTensor] = None,
1257
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
1258
+ video_grid_thw: Optional[torch.LongTensor] = None,
1259
+ max_length: int = 196,
1260
+ min_length: int = 2,
1261
+ do_sample: bool = True,
1262
+ num_beams: int = 1,
1263
+ top_p: float = 0.6,
1264
+ top_k: int = 0,
1265
+ temperature: float = 0.5,
1266
+ repetition_penalty: float = 1.0,
1267
+ length_penalty: int = 1,
1268
+ early_stopping: Union[bool, str] = False,
1269
+ use_cache: bool = True,
1270
+ **kwargs,
1271
+ ):
1272
+ """
1273
+ :param input_ids: torch.int64 : torch.Size([batchsize, variable]) : system prompt and question text token indices for the tokenizer.
1274
+ In positions where images are inputted, the value is replaced by config.img_start_id, which is a vocabulary index used to indicate the start of image data.
1275
+ In cases where a sample contains no images, a single dummy image is included.
1276
+ :param pixel_values: List of List of 4D tensor (torch.float32)
1277
+ Each outer list corresponds to a batch and contains inner lists, each holding tensors for images in a sample. The structure accounts for samples with multiple images.
1278
+ :param attention_mask: not used
1279
+ :param max_length: int : The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens.
1280
+ :param min_length: int : The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens.
1281
+ :param num_beams: int : Number of beams for beam search. 1 means no beam search.
1282
+ :param top_k: int : The number of highest probability vocabulary tokens to keep for top-k-filtering.
1283
+ :param temperature: float : The value used to modulate the next token probabilities. ( scores / self.temperature )
1284
+ :param repetition_penalty: float : The parameter for repetition penalty.
1285
+ :param length_penalty: int : It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence.
1286
+ :param early_stopping: Union[bool, str] : True, where the generation stops as soon as there are num_beams complete candidates;
1287
+ False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates;
1288
+ "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm)
1289
+ :param use_cache: bool : Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
1290
+ :param verbose: bool : print debug messages
1291
+ :param image_sizes: Stacked as a List of List, representing image sizes (width, height).
1292
+ In cases where a sample contains no images, a single dummy image is included.
1293
+ :param vision_query_lengths: A List of List that stores the lengths when each image is converted into visual tokens for LLM input.
1294
+ In cases where a sample does not contain any images, an empty list is included.
1295
+ :param non_vision_query_lengths: contains the lengths of text tokens (excluding visual tokens) for each sample in a batch.
1296
+ :param num_queries_vis_abstractors: A List of List that contains the number of visual tokens for each image grid.
1297
+ :param num_queries_vis_abstractors_slow: A List of List that contains the number of visual tokens for the slow part when applying the slowfast algorithm to video frames. If the slowfast algorithm is not applied, it will have a value of None.
1298
+ :param first_last_frames_slows: A List of List that stores the only first and last frames slow mode for each sample in a batch.
1299
+ :param is_videos: A List of List that stores the boolean value indicating whether each sample in a batch is a video.
1300
+ :image_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
1301
+ :pixel_values_videos: A 2D tensor (torch.float32) for qwen2.5-vl visual encoder.
1302
+ :video_grid_thw: A 3D tensor (torch.int64) for qwen2.5-vl visual encoder.
1303
+ :param kwargs:
1304
+ :return:
1305
+ """
1306
+ # inputs_embeds: torch.bfloat16 : [batchsize, variable (including visual tokens, text tokens, and the system prompt)]
1308
+ # attention_mask: torch.float32 : [batchsize, variable (same as above)]
1308
+ inputs_embeds = self.model.extract_inputs_embeds(
1309
+ input_ids=input_ids,
1310
+ pixel_values=self.to_vision_model_device(pixel_values),
1311
+ image_sizes=image_sizes,
1312
+ vision_query_lengths=vision_query_lengths,
1313
+ non_vision_query_lengths=non_vision_query_lengths,
1314
+ img_start_ids_list=img_start_ids_list,
1315
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1316
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
1317
+ first_last_frames_slows=first_last_frames_slows,
1318
+ is_videos=is_videos,
1319
+ image_grid_thw=image_grid_thw,
1320
+ pixel_values_videos=pixel_values_videos,
1321
+ video_grid_thw=video_grid_thw,
1322
+ )
1323
+ # since only inference is required, everything is assumed to be in eval mode. Also, inputs_embeds is a list of lists of tensors: [batchsize, [num_images, [num_sequence, num_channels]]]
1324
+ # inputs_embeds = inputs_embeds.detach()
1325
+ # inputs_embeds.requires_grad = False
1326
+
1327
+ # when running inference without the LLM, this is the image_feature value.
1328
+ # the GPU device assigned to self.vision_model may differ from the one assigned to the LLM
1329
+ if self.without_llm:
1330
+ inputs_embeds = (
1331
+ inputs_embeds.to(self.vision_model.device) if isinstance(inputs_embeds, torch.Tensor) else inputs_embeds
1332
+ )
1333
+ return inputs_embeds
1334
+
1335
+ inputs_embeds = (
1336
+ inputs_embeds.to(self.base_model.device) if isinstance(inputs_embeds, torch.Tensor) else inputs_embeds
1337
+ )
1338
+
1339
+ # pred : torch.int64 : [batchsize, generated token_length]
1340
+ pred = self.language_model.generate( # <|im_end|>
1341
+ inputs_embeds=inputs_embeds,
1342
+ pad_token_id=self.config.text_config.pad_token_id,
1343
+ eos_token_id=self.config.text_config.eos_token_id,
1344
+ bad_words_ids=[
1345
+ [
1346
+ self.config.text_config.bos_token_id,
1347
+ ],
1348
+ [
1349
+ self.config.text_config.eos_token_id,
1350
+ ],
1351
+ ],
1352
+ max_new_tokens=max_length,
1353
+ min_length=min_length,
1354
+ num_beams=num_beams,
1355
+ do_sample=False if temperature == 0.0 else do_sample, # set do_sample=False if invalid temperature
1356
+ top_k=top_k,
1357
+ top_p=top_p,
1358
+ temperature=temperature,
1359
+ repetition_penalty=repetition_penalty,
1360
+ length_penalty=length_penalty,
1361
+ early_stopping=False if num_beams <= 1 else True, # set early_stopping=False when not beam_search
1362
+ use_cache=use_cache,
1363
+ )
1364
+ return pred
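+ # Illustrative usage (hypothetical paths and inputs, not part of the original code):
+ #     model = HCXVisionForCausalLM.from_pretrained("path/to/checkpoint", trust_remote_code=True)
+ #     pred = model.inference(input_ids=ids, pixel_values=pixel_values,
+ #                            image_sizes=image_sizes, max_length=196, temperature=0.5)
+ #     text = model.tokenizer.batch_decode(pred, skip_special_tokens=True)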
1365
+
1366
+ def to_vision_model_device(self, input_tensor):
1367
+ if isinstance(input_tensor, list): # if the input is a list
1368
+ return [self.to_vision_model_device(item) for item in input_tensor] # recursively apply to each element
1369
+ elif isinstance(input_tensor, torch.Tensor): # if the input is a tensor
1370
+ return input_tensor.to(self.vision_model.device)
1371
+ else:
1372
+ raise TypeError(
1373
+ "Unsupported data type. Only tensors and lists are allowed."
1374
+ ) # error handling for unsupported data types
1375
+
1376
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings
1377
+ def get_input_embeddings(self):
1378
+ if self.without_llm:
1379
+ return None
1380
+ else:
1381
+ return self.language_model.get_input_embeddings()
1382
+
1383
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_input_embeddings
1384
+ def set_input_embeddings(self, value):
1385
+ self.language_model.set_input_embeddings(value)
1386
+
1387
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_output_embeddings
1388
+ def get_output_embeddings(self):
1389
+ if self.without_llm:
1390
+ return None
1391
+ else:
1392
+ return self.language_model.get_output_embeddings()
1393
+
1394
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_output_embeddings
1395
+ def set_output_embeddings(self, new_embeddings):
1396
+ self.language_model.set_output_embeddings(new_embeddings)
1397
+
1398
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_decoder
1399
+ def set_decoder(self, decoder):
1400
+ self.language_model.set_decoder(decoder)
1401
+
1402
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_decoder
1403
+ def get_decoder(self):
1404
+ return self.language_model.get_decoder()
1405
+
1406
+ # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.tie_weights
1407
+ def tie_weights(self):
1408
+ if self.without_llm:
1409
+ return None
1410
+ else:
1411
+ return self.language_model.tie_weights()
1412
+
1413
+ @classmethod
1414
+ def from_pretrained(
1415
+ cls,
1416
+ pretrained_model_name_or_path: Optional[Union[str, os.PathLike]],
1417
+ *model_args,
1418
+ **kwargs,
1419
+ ):
1420
+ model = super().from_pretrained(
1421
+ pretrained_model_name_or_path,
1422
+ *model_args,
1423
+ **kwargs,
1424
+ )
1425
+
1426
+ model.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
1427
+ return model
1428
+
1429
+ def save_pretrained(
1430
+ self,
1431
+ save_directory: Union[str, os.PathLike],
1432
+ *args,
1433
+ **kwargs,
1434
+ ):
1435
+ super().register_for_auto_class("AutoModelForCausalLM")
1436
+ self.config.register_for_auto_class()
1437
+ super().save_pretrained(save_directory, *args, **kwargs)
1438
+ self.config.architectures = ["HCXVisionV2ForCausalLM"]
1439
+ self.config.auto_map["AutoModelForCausalLM"] = "modeling_vlm.HCXVisionForCausalLM"
1440
+ self.config.auto_map["AutoModelForSequenceClassification"] = "modeling_vlm.HCXVisionForSequenceClassification"
1441
+ self.config.save_pretrained(save_directory)
1442
+
1443
+ # https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/models/llava/modeling_llava.py#L379-L390
1444
+ @property
1445
+ def is_qwen_visual(self):
1446
+ return self.model.is_qwen_visual
1447
+
1448
+ @property
1449
+ def language_model(self):
1450
+ return self.model.language_model
1451
+
1452
+ @property
1453
+ def vision_model(self):
1454
+ return self.model.vision_model
1455
+
1456
+ @property
1457
+ def text_config(self):
1458
+ return self.model.text_config
1459
+
1460
+ @property
1461
+ def vision_config(self):
1462
+ return self.model.vision_config
1463
+
1464
+ @property
1465
+ def mm_projector(self):
1466
+ return self.model.mm_projector
1467
+
1468
+ @property
1469
+ def anyres(self):
1470
+ return self.model.anyres
1471
+
1472
+ @property
1473
+ def is_safetensor_save(self):
1474
+ return self.model.is_safetensor_save
1475
+
1476
+ @property
1477
+ def without_llm(self):
1478
+ return self.model.without_llm
1479
+
1480
+ @property
1481
+ def image_newline(self):
1482
+ return self.model.image_newline
1483
+
1484
+
1485
+ class HCXVisionForSequenceClassification(HCXVisionPreTrainedModel):
1486
+ """
1487
+ HCX Vision model for sequence classification tasks.
1488
+ """
1489
+
1490
+ def __init__(self, config, **kwargs):
1491
+ super().__init__(config, without_llm=True, **kwargs)
1492
+ self.num_labels = config.num_labels if hasattr(config, "num_labels") else 2
1493
+ self.model = HCXVisionModel(config=config, **kwargs)
1494
+ self.score = nn.Linear(config.text_config.hidden_size, self.num_labels, bias=False)
1495
+ self.post_init()
1496
+
1497
+ def forward(
1498
+ self,
1499
+ pixel_values: Optional[torch.FloatTensor] = None,
1500
+ input_ids: Optional[torch.LongTensor] = None,
1501
+ attention_mask: Optional[torch.Tensor] = None,
1502
+ position_ids: Optional[torch.LongTensor] = None,
1503
+ past_key_values: Optional[Cache] = None,
1504
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1505
+ labels: Optional[torch.LongTensor] = None,
1506
+ use_cache: Optional[bool] = None,
1507
+ output_attentions: Optional[bool] = None,
1508
+ output_hidden_states: Optional[bool] = None,
1509
+ return_dict: Optional[bool] = True,
1510
+ image_sizes: Optional[List[List[List[int]]]] = None,
1511
+ vision_query_lengths: Optional[List[List[int]]] = None,
1512
+ non_vision_query_lengths: Optional[List[List[int]]] = None,
1513
+ img_start_ids_list: Optional[List[List[int]]] = None,
1514
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1515
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1516
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
1517
+ is_videos: Optional[List[List[bool]]] = None,
1518
+ image_grid_thw: Optional[torch.LongTensor] = None,
1519
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
1520
+ video_grid_thw: Optional[torch.LongTensor] = None,
1521
+ ) -> SequenceClassifierOutputWithPast:
1522
+ """
1523
+ Forward pass for sequence classification.
1524
+ """
1525
+ transformer_outputs: BaseModelOutputWithPast = self.model(
1526
+ pixel_values=pixel_values,
1527
+ input_ids=input_ids,
1528
+ attention_mask=attention_mask,
1529
+ position_ids=position_ids,
1530
+ past_key_values=past_key_values,
1531
+ inputs_embeds=inputs_embeds,
1532
+ use_cache=use_cache,
1533
+ output_attentions=output_attentions,
1534
+ output_hidden_states=output_hidden_states,
1535
+ return_dict=return_dict,
1536
+ image_sizes=image_sizes,
1537
+ vision_query_lengths=vision_query_lengths,
1538
+ non_vision_query_lengths=non_vision_query_lengths,
1539
+ img_start_ids_list=img_start_ids_list,
1540
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1541
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
1542
+ first_last_frames_slows=first_last_frames_slows,
1543
+ is_videos=is_videos,
1544
+ image_grid_thw=image_grid_thw,
1545
+ pixel_values_videos=pixel_values_videos,
1546
+ video_grid_thw=video_grid_thw,
1547
+ )
1548
+ hidden_states = transformer_outputs.last_hidden_state
1549
+ logits = self.score(hidden_states)
1550
+
1551
+ if input_ids is not None:
1552
+ batch_size = input_ids.shape[0]
1553
+ else:
1554
+ batch_size = inputs_embeds.shape[0]
1555
+
1556
+ if self.config.pad_token_id is None and batch_size != 1:
1557
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1558
+ if self.config.pad_token_id is None:
1559
+ last_non_pad_token = -1
1560
+ elif input_ids is not None:
1561
+ # To handle both left- and right- padding, we take the rightmost token that is not equal to pad_token_id
1562
+ non_pad_mask = (input_ids != self.config.pad_token_id).to(logits.device, torch.int32)
1563
+ token_indices = torch.arange(input_ids.shape[-1], device=logits.device, dtype=torch.int32)
1564
+ last_non_pad_token = (token_indices * non_pad_mask).argmax(-1)
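# Worked example (editor's note, not part of the committed file): with
# input_ids = [[PAD, a, b, PAD, PAD]], non_pad_mask is [0, 1, 1, 0, 0] and
# token_indices * non_pad_mask is [0, 1, 2, 0, 0], so argmax(-1) returns 2,
# i.e. the rightmost non-pad position, regardless of left or right padding.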
1565
+ else:
1566
+ last_non_pad_token = -1
1567
+
1568
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), last_non_pad_token]
1569
+
1570
+ loss = None
1571
+ if labels is not None:
1572
+ loss = self.loss_function(logits=logits, labels=labels, pooled_logits=pooled_logits, config=self.config)
1573
+
1574
+ return SequenceClassifierOutputWithPast(
1575
+ loss=loss,
1576
+ logits=pooled_logits,
1577
+ past_key_values=transformer_outputs.past_key_values,
1578
+ hidden_states=transformer_outputs.hidden_states,
1579
+ attentions=transformer_outputs.attentions,
1580
+ )
1581
+
1582
+ def save_pretrained(
1583
+ self,
1584
+ save_directory: Union[str, os.PathLike],
1585
+ *args,
1586
+ **kwargs,
1587
+ ):
1588
+ super().register_for_auto_class("AutoModelForSequenceClassification")
1589
+ self.config.register_for_auto_class()
1590
+ super().save_pretrained(save_directory, *args, **kwargs)
1591
+
1592
+
1593
+ class HCXVisionForTokenClassification(HCXVisionPreTrainedModel):
1594
+ """
1595
+ HCX Vision model for token classification tasks (e.g., per-token value prediction for PPO critic).
1596
+ Returns logits for each token instead of pooled output.
1597
+ """
1598
+
1599
+ def __init__(self, config, **kwargs):
1600
+ super().__init__(config, without_llm=True, **kwargs)
1601
+ self.num_labels = config.num_labels if hasattr(config, "num_labels") else 1
1602
+ self.model = HCXVisionModel(config=config, **kwargs)
1603
+
1604
+ # Dropout for regularization
1605
+ if getattr(config, "classifier_dropout", None) is not None:
1606
+ classifier_dropout = config.classifier_dropout
1607
+ elif getattr(config.text_config, "hidden_dropout", None) is not None:
1608
+ classifier_dropout = config.text_config.hidden_dropout
1609
+ else:
1610
+ classifier_dropout = 0.1
1611
+ self.dropout = nn.Dropout(classifier_dropout)
1612
+
1613
+ # Token classification head - projects each token's hidden state to num_labels
1614
+ self.score = nn.Linear(config.text_config.hidden_size, self.num_labels, bias=False)
1615
+ self.post_init()
1616
+
1617
+ def forward(
1618
+ self,
1619
+ pixel_values: Optional[torch.FloatTensor] = None,
1620
+ input_ids: Optional[torch.LongTensor] = None,
1621
+ attention_mask: Optional[torch.Tensor] = None,
1622
+ position_ids: Optional[torch.LongTensor] = None,
1623
+ past_key_values: Optional[Cache] = None,
1624
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1625
+ labels: Optional[torch.LongTensor] = None,
1626
+ use_cache: Optional[bool] = None,
1627
+ output_attentions: Optional[bool] = None,
1628
+ output_hidden_states: Optional[bool] = None,
1629
+ return_dict: Optional[bool] = True,
1630
+ image_sizes: Optional[List[List[List[int]]]] = None,
1631
+ vision_query_lengths: Optional[List[List[int]]] = None,
1632
+ non_vision_query_lengths: Optional[List[List[int]]] = None,
1633
+ img_start_ids_list: Optional[List[List[int]]] = None,
1634
+ num_queries_vis_abstractors: Optional[List[List[int]]] = None,
1635
+ num_queries_vis_abstractors_slow: Optional[List[List[int]]] = None,
1636
+ first_last_frames_slows: Optional[List[List[bool]]] = None,
1637
+ is_videos: Optional[List[List[bool]]] = None,
1638
+ image_grid_thw: Optional[torch.LongTensor] = None,
1639
+ pixel_values_videos: Optional[torch.FloatTensor] = None,
1640
+ video_grid_thw: Optional[torch.LongTensor] = None,
1641
+ ) -> TokenClassifierOutput:
1642
+ """
1643
+ Forward pass for token classification.
1644
+
1645
+ Returns:
1646
+ TokenClassifierOutput with logits of shape [batch_size, sequence_length, num_labels]
1647
+ """
1648
+ transformer_outputs: BaseModelOutputWithPast = self.model(
1649
+ pixel_values=pixel_values,
1650
+ input_ids=input_ids,
1651
+ attention_mask=attention_mask,
1652
+ position_ids=position_ids,
1653
+ past_key_values=past_key_values,
1654
+ inputs_embeds=inputs_embeds,
1655
+ use_cache=use_cache,
1656
+ output_attentions=output_attentions,
1657
+ output_hidden_states=output_hidden_states,
1658
+ return_dict=return_dict,
1659
+ image_sizes=image_sizes,
1660
+ vision_query_lengths=vision_query_lengths,
1661
+ non_vision_query_lengths=non_vision_query_lengths,
1662
+ img_start_ids_list=img_start_ids_list,
1663
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1664
+ num_queries_vis_abstractors_slow=num_queries_vis_abstractors_slow,
1665
+ first_last_frames_slows=first_last_frames_slows,
1666
+ is_videos=is_videos,
1667
+ image_grid_thw=image_grid_thw,
1668
+ pixel_values_videos=pixel_values_videos,
1669
+ video_grid_thw=video_grid_thw,
1670
+ )
1671
+
1672
+ # Get hidden states for all tokens
1673
+ hidden_states = transformer_outputs.last_hidden_state # [batch_size, seq_len, hidden_size]
1674
+
1675
+ # Project to num_labels for each token
1676
+ logits = self.score(hidden_states) # [batch_size, seq_len, num_labels]
1677
+
1678
+ return TokenClassifierOutput(
1679
+ loss=None,
1680
+ logits=logits, # [batch_size, seq_len, num_labels]: per-token logits, not pooled
1681
+ hidden_states=transformer_outputs.hidden_states,
1682
+ attentions=transformer_outputs.attentions,
1683
+ )
1684
+
1685
+ def save_pretrained(
1686
+ self,
1687
+ save_directory: Union[str, os.PathLike],
1688
+ *args,
1689
+ **kwargs,
1690
+ ):
1691
+ super().register_for_auto_class("AutoModelForTokenClassification")
1692
+ self.config.register_for_auto_class()
1693
+ super().save_pretrained(save_directory, *args, **kwargs)
1694
+
1695
+
1696
+
1697
+ class VLM_Mlp(nn.Module):
1698
+ """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
1699
+
1700
+ def __init__(
1701
+ self,
1702
+ mm_projector_type,
1703
+ in_features,
1704
+ hidden_features=None,
1705
+ out_features=None,
1706
+ act_layer=nn.GELU,
1707
+ ):
1708
+ super().__init__()
1709
+ out_features = out_features or in_features
1710
+ hidden_features = hidden_features or in_features
1711
+ self.mm_projector_type = mm_projector_type
1712
+ if self.mm_projector_type == "mlp":
1713
+ self.fc1 = nn.Linear(in_features, hidden_features)
1714
+ self.act = act_layer()
1715
+ self.fc2 = nn.Linear(hidden_features, out_features)
1716
+ elif self.mm_projector_type == "inverted_mlp":
1717
+ self.fc1 = nn.Linear(in_features, 2 * hidden_features)
1718
+ self.act = act_layer()
1719
+ self.fc2 = nn.Linear(2 * hidden_features, out_features)
1720
+ else:
1721
+ raise NotImplementedError("{} is not implemented".format(self.mm_projector_type))
1722
+
1723
+ def forward(self, x):
1724
+ x = self.fc1(x)
1725
+ x = self.act(x)
1726
+ x = self.fc2(x)
1727
+ return x
1728
+
1729
+
1730
+ class Projector(nn.Module):
1731
+ """Base projector class"""
1732
+
1733
+ def __init__(
1734
+ self,
1735
+ num_queries: int,
1736
+ num_input_tokens: int,
1737
+ encoder_hidden_size: int,
1738
+ hidden_size: int,
1739
+ output_hidden_size: int,
1740
+ pos_emb=True,
1741
+ prenorm=False,
1742
+ ):
1743
+ super().__init__()
1744
+ self.num_input_tokens = num_input_tokens
1745
+ self.output_hidden_size = output_hidden_size
1746
+
1747
+ # pos emb
1748
+ if pos_emb:
1749
+ self.pos_emb = torch.nn.Parameter(torch.zeros(1, num_input_tokens, encoder_hidden_size))
1750
+ # nn.init.trunc_normal_(self.pos_emb, mean=0.0, std=0.02)
1751
+ self.pos_emb.data.normal_(mean=0.0, std=0.02)
1752
+ else:
1753
+ self.pos_emb = None
1754
+
1755
+ if prenorm:
1756
+ self.prenorm = LayerNorm(encoder_hidden_size)
1757
+ else:
1758
+ self.prenorm = None
1759
+
1760
+ self.build_net(num_queries, encoder_hidden_size, hidden_size, output_hidden_size)
1761
+
1762
+ def build_net(self):
1763
+ raise NotImplementedError()
1764
+
1765
+ def _forward(
1766
+ self,
1767
+ x,
1768
+ num_queries_vis_abstractors: Optional[List[int]] = None,
1769
+ num_grids: Optional[List[int]] = None,
1770
+ freeze_before_sampler: bool = False,
1771
+ ):
1772
+ raise NotImplementedError()
1773
+
1774
+ def forward(
1775
+ self,
1776
+ x: torch.Tensor,
1777
+ num_queries_vis_abstractors: Optional[List[int]] = None,
1778
+ num_grids: Optional[List[int]] = None,
1779
+ freeze_before_sampler: bool = False,
1780
+ ) -> torch.Tensor:
1781
+ """
1782
+ Args:
1783
+ x: (B, L, encoder_hidden_size) tensor from the visual backbone (CLIP visual encoder), including cls token.
1784
+ """
1785
+ if self.prenorm is not None:
1786
+ x = self.prenorm(x)
1787
+
1788
+ if self.pos_emb is not None:
1789
+ x = x + self.pos_emb
1790
+
1791
+ x = self._forward(
1792
+ x,
1793
+ num_queries_vis_abstractors=num_queries_vis_abstractors,
1794
+ num_grids=num_grids,
1795
+ freeze_before_sampler=freeze_before_sampler,
1796
+ ) # (B, L, output_hidden_size)
1797
+
1798
+ return x
1799
+
1800
+
1801
+ class ConvProjector(Projector):
1802
+ def _forward(
1803
+ self,
1804
+ x,
1805
+ num_queries_vis_abstractors: Optional[List[int]] = None,
1806
+ num_grids: Optional[List[int]] = None,
1807
+ freeze_before_sampler: bool = False,
1808
+ ):
1809
+ # x: [B, L, dim]
1810
+ hw = int(x.size(1) ** 0.5)
1811
+ x = rearrange(x, "b (h w) d -> b d h w", h=hw, w=hw)
1812
+
1813
+ if num_queries_vis_abstractors is not None:
1814
+ assert num_grids is not None
1815
+
1816
+ return self._forward_adaptive_num_query(x, num_queries_vis_abstractors, num_grids, freeze_before_sampler)
1817
+
1818
+ if freeze_before_sampler:
1819
+ with torch.no_grad():
1820
+ x = self.net[0](x)
1821
+ x = self.net[1](x)
1822
+ x = self.net[2](x)
1823
+ else:
1824
+ x = self.net(x)
1825
+ x = rearrange(x, "b d h w -> b (h w) d")
1826
+ x = self.readout(x)
1827
+
1828
+ return x
1829
+
1830
+ def _forward_adaptive_num_query(
1831
+ self,
1832
+ x,
1833
+ num_queries_vis_abstractors: Optional[List[int]] = None,
1834
+ num_grids: Optional[List[int]] = None,
1835
+ freeze_before_sampler: bool = False,
1836
+ ):
1837
+ # self.net consists of three stages (s1, sampler, s2)
1838
+ # here the sampler (self.net[1]) is replaced with adaptive pooling sized per grid
1839
+ assert len(self.net) == 3
1840
+
1841
+ if freeze_before_sampler:
1842
+ with torch.no_grad():
1843
+ x = self.net[0](x)
1844
+ else:
1845
+ x = self.net[0](x)
1846
+
1847
+ new_x = []
1848
+ for i, num_queries in enumerate(num_queries_vis_abstractors):
1849
+ hw = int(num_queries**0.5)
1850
+ sampler = nn.AdaptiveAvgPool2d((hw, hw))
1851
+ out = sampler(x[num_grids[i] : num_grids[i + 1], :])
1852
+ out = self.net[2](out)
1853
+
1854
+ out = rearrange(out, "b d h w -> b (h w) d")
1855
+ out = self.readout(out)
1856
+
1857
+ new_x.append(out)
1858
+
1859
+ return new_x
1860
+
1861
+
1862
+ class CAbstractor(ConvProjector):
1863
+ """C-Abstractor"""
1864
+
1865
+ def build_net(self, n_queries, encoder_hidden_size, hidden_size, output_hidden_size, depth=3, mlp_depth=2):
1866
+ assert (n_queries**0.5).is_integer(), "n_queries must be square number"
1867
+ hw = int(n_queries**0.5)
1868
+
1869
+ # RegBlock = ResBlock + SE
1870
+ RegBlock = partial(
1871
+ RegStage,
1872
+ stride=1,
1873
+ dilation=1,
1874
+ act_layer=nn.SiLU,
1875
+ norm_layer=LayerNorm2d,
1876
+ )
1877
+
1878
+ s1 = RegBlock(
1879
+ depth,
1880
+ encoder_hidden_size,
1881
+ hidden_size,
1882
+ )
1883
+ sampler = nn.AdaptiveAvgPool2d((hw, hw))
1884
+ s2 = RegBlock(
1885
+ depth,
1886
+ hidden_size,
1887
+ hidden_size,
1888
+ )
1889
+
1890
+ self.net = nn.Sequential(s1, sampler, s2)
1891
+
1892
+ self.readout = self.build_mlp(mlp_depth, hidden_size, output_hidden_size)
1893
+
1894
+ def build_mlp(self, depth, hidden_size, output_hidden_size):
1895
+ layers = [nn.Linear(hidden_size, output_hidden_size)]
1896
+ for _ in range(1, depth):
1897
+ layers.append(nn.SiLU())
1898
+ layers.append(nn.Linear(output_hidden_size, output_hidden_size))
1899
+ return nn.Sequential(*layers)
1900
+
1901
+
1902
+ AutoConfig.register("vlm", HCXVisionConfig)
1903
+ try:
1904
+ from .configuration_hyperclovax import HyperCLOVAXConfig
1905
+ from .modeling_hyperclovax import HyperCLOVAXForCausalLM
1906
+
1907
+ AutoConfig.register("hyperclovax", HyperCLOVAXConfig)
1908
+ AutoModelForCausalLM.register(
1909
+ HyperCLOVAXConfig,
1910
+ HyperCLOVAXForCausalLM,
1911
+ )
1912
+ except Exception: # the optional hyperclovax modules may be missing or already registered
1913
+ pass
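Editor's note: `save_pretrained` above registers `modeling_vlm.HCXVisionForCausalLM` and `modeling_vlm.HCXVisionForSequenceClassification` in the config's `auto_map`, so a checkpoint saved with this code is meant to be loaded through the Auto classes with remote code enabled. A minimal loading sketch; the path is a placeholder, not a real repository id:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "<path-or-repo-id>",      # placeholder for the saved checkpoint directory or Hub repo
    trust_remote_code=True,   # needed so the repo's modeling_vlm / processing_vlm code is used
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("<path-or-repo-id>", trust_remote_code=True)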
preprocessor_config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_vlm.HCXVisionV2Processor"
4
+ },
5
+ "do_convert_rgb": true,
6
+ "do_normalize": true,
7
+ "do_rescale": true,
8
+ "do_resize": true,
9
+ "image_mean": [
10
+ 0.48145466,
11
+ 0.4578275,
12
+ 0.40821073
13
+ ],
14
+ "image_processor_type": "Qwen2VLImageProcessor",
15
+ "image_std": [
16
+ 0.26862954,
17
+ 0.26130258,
18
+ 0.27577711
19
+ ],
20
+ "max_pixels": 2073600,
21
+ "merge_size": 2,
22
+ "min_pixels": 3136,
23
+ "patch_size": 14,
24
+ "processor_class": "HCXVisionV2Processor",
25
+ "resample": 3,
26
+ "rescale_factor": 0.00392156862745098,
27
+ "size": {
28
+ "longest_edge": 2073600,
29
+ "shortest_edge": 3136
30
+ },
31
+ "temporal_patch_size": 2
32
+ }
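Editor's note: the `patch_size`, `merge_size`, `min_pixels`, and `max_pixels` fields above control how many image tokens the language model receives per image with this Qwen2VL-style image processor. A minimal sketch of that relationship (an illustration with an assumed helper name, not code from this repository):

def approx_image_tokens(resized_height: int, resized_width: int,
                        patch_size: int = 14, merge_size: int = 2) -> int:
    # The processor has already clamped total pixels to [min_pixels=3136, max_pixels=2073600]
    # and snapped the sides to multiples of patch_size * merge_size before this point.
    patches = (resized_height // patch_size) * (resized_width // patch_size)
    # The spatial merge turns every merge_size x merge_size patch block into one token.
    return patches // (merge_size * merge_size)

# e.g. approx_image_tokens(1092, 1932) == 78 * 138 // 4 == 2691 image tokens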
processing_vlm.py ADDED
@@ -0,0 +1,823 @@
1
+ import copy
2
+ import math
3
+ import os
4
+ from typing import Dict, List, Optional, Union
5
+
6
+ import numpy as np
7
+ import torch
8
+ from PIL import Image
9
+ from transformers import Qwen2_5_VLProcessor
10
+ from transformers.image_processing_utils import (
11
+ BaseImageProcessor,
12
+ BatchFeature,
13
+ get_size_dict,
14
+ )
15
+ from transformers.image_transforms import (
16
+ convert_to_rgb,
17
+ get_resize_output_image_size,
18
+ resize,
19
+ to_channel_dimension_format,
20
+ )
21
+ from transformers.image_utils import (
22
+ OPENAI_CLIP_MEAN,
23
+ OPENAI_CLIP_STD,
24
+ ChannelDimension,
25
+ ImageInput,
26
+ PILImageResampling,
27
+ get_image_size,
28
+ infer_channel_dimension_format,
29
+ is_scaled_image,
30
+ make_list_of_images,
31
+ to_numpy_array,
32
+ valid_images,
33
+ )
34
+ from transformers.models.qwen2_5_vl.processing_qwen2_5_vl import (
35
+ Qwen2_5_VLProcessorKwargs,
36
+ )
37
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
38
+ from transformers.utils import TensorType, logging
39
+ from transformers.video_utils import VideoInput
40
+ from typing_extensions import Unpack
41
+
42
+ logger = logging.get_logger(__name__)
43
+
44
+
45
+ def determine_possible_resolutions(anyres: bool, max_num_grids: int, grid_size: int, use_1x1_grid: bool = False):
46
+ """Find and return all possible resolution combinations with at most max_num_grids grids in total.
47
+ For example, with max_num_grids=4 the possible grid layouts are [1x1, 1x2, 1x3, 1x4, 2x1, 2x2, 3x1, 4x1], so the result is computed as follows.
48
+ >>> possible_resolutions = determine_possible_resolutions(anyres=True, max_num_grids=4, grid_size=336)
49
+ >>> print(possible_resolutions)
50
+ [[336, 336], [336, 672], [336, 1008], [336, 1344], [672, 336], [672, 672], [1008, 336], [1344, 336]]
51
+ """
52
+ possible_resolutions = []
53
+ if anyres:
54
+ assert max_num_grids > 0
55
+ for i in range(1, max_num_grids + 1):
56
+ for j in range(1, max_num_grids + 1):
57
+ if i == 1 and j == 1 and not use_1x1_grid:
58
+ continue
59
+ if i * j <= max_num_grids:
60
+ possible_resolutions.append([i, j])
61
+
62
+ possible_resolutions = [[ys * grid_size, xs * grid_size] for ys, xs in possible_resolutions]
63
+
64
+ return possible_resolutions
65
+
66
+
67
+ def divide_to_grids(image: np.array, grid_size: int, input_data_format=None) -> List[np.array]:
68
+ """Divide a local image into (grid_size x grid_size) grids."""
69
+ grids = []
70
+ height, width = get_image_size(image, channel_dim=input_data_format)
71
+ for i in range(0, height, grid_size):
72
+ for j in range(0, width, grid_size):
73
+ if input_data_format == ChannelDimension.LAST:
74
+ grid = image[i : i + grid_size, j : j + grid_size]
75
+ else:
76
+ grid = image[:, i : i + grid_size, j : j + grid_size]
77
+ grids.append(grid)
78
+
79
+ return grids
80
+
81
+
82
+ def pad(image: np.array, target_size: tuple, background_color=(127, 127, 127), input_data_format=None) -> np.array:
83
+ """Center the image on a (target_height, target_width) canvas, padding the surrounding area."""
84
+ target_height, target_width = target_size
85
+ height, width = get_image_size(image, channel_dim=input_data_format)
86
+
87
+ # result = np.ones((target_height, target_width, image.shape[2]), dtype=image.dtype) * background_color
88
+ result = np.empty((target_height, target_width, image.shape[2]), dtype=image.dtype)
89
+ for i in range(image.shape[2]):
90
+ result[..., i].fill(background_color[i])
91
+
92
+ paste_x = (target_width - width) // 2
93
+ paste_y = (target_height - height) // 2
94
+
95
+ result[paste_y : paste_y + height, paste_x : paste_x + width, :] = image
96
+
97
+ return result
98
+
99
+
100
+ def expand2square(
101
+ image: np.array, bboxes_dict=None, background_color=(127, 127, 127), input_data_format=None
102
+ ) -> np.array:
103
+ """
104
+ 새로운 canvas 를 만들어 두고, 거기에 이미지를 붙여넣는 방식으로 이미지를 정사각형으로 만드는 함수
105
+ 유의할 사항은, 이미지를 붙여 넣을 때 중앙으로 붙여넣는다는 점. 양옆 또는 위아래로 PADDING 이 들어가는 형태
106
+ Args:
107
+ pil_img: numpy array
108
+ bboxes_dict: dict, {"ocr": NDArray shape (N, 4, 2), "html": NDArray shape (N, 4, 2), ... }
109
+ `[[xtl, ytl], [xtr, ytr], [xbr, ybr], [xbl, ybl]]` 형태로 박스 형태는 통일. OCR, HTML 등 다양한 박스들을 한번에 처리 가능
110
+ background_color: tuple, RGB
111
+ # >>> _img = np.ones((80, 100), dtype=np.uint8) * 100
112
+ # >>> _bboxes_dict = {"words": np.array([[[10, 10], [20, 10], [20, 20], [10, 20]],
113
+ # ... [[30, 30], [40, 30], [40, 40], [30, 40]]])}
114
+ # >>> _img, _bboxes_dict = expand2square(_img, _bboxes_dict, (255, 255, 255))
115
+ # >>> _img.shape
116
+ # (100, 100)
117
+ # >>> guessed_ocr_bboxes = np.array([[[20, 10], [30, 10], [30, 20], [20, 20]],
118
+ # ... [[40, 30], [50, 30], [50, 40], [40, 40]]])
119
+ # >>> np.testing.assert_array_almost_equal(_bboxes_dict["words"], guessed_ocr_bboxes) is None
120
+ # True
121
+ """
122
+ height, width = get_image_size(image, channel_dim=input_data_format)
123
+ if width == height:
124
+ return image, bboxes_dict
125
+ elif width > height:
126
+ # result = np.ones((width, width, image.shape[2]), dtype=image.dtype) * background_color
127
+ result = np.empty((width, width, image.shape[2]), dtype=image.dtype)
128
+ for i in range(image.shape[2]):
129
+ result[..., i].fill(background_color[i])
130
+
131
+ result[(width - height) // 2 : (width - height) // 2 + height, :] = image
132
+ if bboxes_dict is not None:
133
+ for key in bboxes_dict:
134
+ bboxes_dict[key][:, :, 1] += (width - height) // 2
135
+ return result, bboxes_dict
136
+ else:
137
+ # result = np.ones((height, height, image.shape[2]), dtype=image.dtype) * background_color
138
+ result = np.empty((height, height, image.shape[2]), dtype=image.dtype)
139
+ for i in range(image.shape[2]):
140
+ result[..., i].fill(background_color[i])
141
+
142
+ result[:, (height - width) // 2 : (height - width) // 2 + width] = image
143
+ if bboxes_dict is not None:
144
+ for key in bboxes_dict:
145
+ bboxes_dict[key][:, :, 0] += (height - width) // 2
146
+ return result, bboxes_dict
147
+
148
+
149
+ def resize_longside(
150
+ image: np.array,
151
+ size: int,
152
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
153
+ data_format: Optional[Union[str, ChannelDimension]] = None,
154
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
155
+ ):
156
+ """
157
+ 장축 길이를 size 에 맞게 resize
158
+ """
159
+ height, width = get_image_size(image, channel_dim=input_data_format)
160
+
161
+ if width == height:
162
+ target_height, target_width = size, size
163
+ elif width > height:
164
+ target_width = size
165
+ target_height = math.ceil(height / width * size)
166
+ else:
167
+ target_width = math.ceil(width / height * size)
168
+ target_height = size
169
+
170
+ return resize(
171
+ image,
172
+ size=(target_height, target_width),
173
+ resample=resample,
174
+ data_format=data_format,
175
+ input_data_format=input_data_format,
176
+ )
177
+
178
+
179
+ def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
180
+ """From LLaVA-Next (https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/models/llava_next/image_processing_llava_next.py)
181
+ Selects the best resolution from a list of possible resolutions based on the original size.
182
+ This is done by calculating the effective and wasted resolution for each possible resolution.
183
+ The best fit resolution is the one that maximizes the effective resolution and minimizes the wasted resolution.
184
+
185
+ Args:
186
+ original_size (tuple):
187
+ The original size of the image in the format (height, width).
188
+ possible_resolutions (list):
189
+ A list of possible resolutions in the format [(height1, width1), (height2, width2), ...].
190
+
191
+ Returns:
192
+ tuple: The best fit resolution in the format (height, width).
193
+ """
194
+ original_height, original_width = original_size
195
+ best_fit = None
196
+ max_effective_resolution = 0
197
+ min_wasted_resolution = float("inf")
198
+
199
+ for height, width in possible_resolutions:
200
+ scale = min(width / original_width, height / original_height)
201
+ downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
202
+ effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
203
+ wasted_resolution = (width * height) - effective_resolution
204
+
205
+ if effective_resolution > max_effective_resolution or (
206
+ effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution
207
+ ):
208
+ max_effective_resolution = effective_resolution
209
+ min_wasted_resolution = wasted_resolution
210
+ best_fit = (height, width)
211
+
212
+ return best_fit
213
+
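# Worked example (editor's note, not part of the committed file): for a 600x800 (h, w) image and
# candidates [[336, 672], [672, 336], [672, 672]], the 672x672 canvas preserves the most effective
# pixels (the image downscales to 504x672 inside it), which is the primary criterion, so
# select_best_resolution((600, 800), [[336, 672], [672, 336], [672, 672]]) returns (672, 672).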
214
+
215
+ def _get_local_grids_output_size(image: np.array, target_resolution: tuple, input_data_format=None):
216
+ original_height, original_width = get_image_size(image, channel_dim=input_data_format)
217
+ target_height, target_width = target_resolution
218
+
219
+ scale_w = target_width / original_width
220
+ scale_h = target_height / original_height
221
+
222
+ if scale_w < scale_h:
223
+ new_width = target_width
224
+ new_height = min(math.ceil(original_height * scale_w), target_height)
225
+ else:
226
+ new_height = target_height
227
+ new_width = min(math.ceil(original_width * scale_h), target_width)
228
+
229
+ return new_height, new_width
230
+
231
+
232
+ def determine_anyres_num_vision_patches(
233
+ num_grids,
234
+ image_size,
235
+ grid_size,
236
+ patch_size,
237
+ possible_resolutions,
238
+ anyres=False,
239
+ unpad=True,
240
+ num_queries_vis_abstractor=0,
241
+ num_queries_vis_abstractor_slow=0,
242
+ video=False,
243
+ first_last_frames_slow=False,
244
+ is_first_or_last_frames=False,
245
+ ):
246
+ """visual tokens 수를 계산해주는 함수"""
247
+ if not anyres:
248
+ return num_queries_vis_abstractor if num_queries_vis_abstractor > 0 else (grid_size // patch_size) ** 2
249
+
250
+ if num_queries_vis_abstractor > 0:
251
+ num_patch_per_grid = int(num_queries_vis_abstractor**0.5)
252
+ else:
253
+ num_patch_per_grid = grid_size // patch_size
254
+
255
+ num_global_per_grid = num_patch_per_grid
256
+
257
+ # anyres는 global image가 있어서 2개 이상이지만, video에는 global image가 없어서, 1개가 들어올 수 있어서 주석 처리
258
+ # assert num_grids > 1
259
+
260
+ # patch 수 계산
261
+ height, width = select_best_resolution(image_size, possible_resolutions)
262
+
263
+ num_patch_height = (height // grid_size) * num_patch_per_grid
264
+ num_patch_width = (width // grid_size) * num_patch_per_grid
265
+
266
+ # local images
267
+ if unpad:
268
+ original_height, original_width = image_size
269
+
270
+ original_aspect_ratio = original_width / original_height
271
+ current_aspect_ratio = num_patch_width / num_patch_height
272
+
273
+ if original_aspect_ratio > current_aspect_ratio:
274
+ scale_factor = num_patch_width / original_width
275
+ new_height = int(original_height * scale_factor)
276
+ padding = (num_patch_height - new_height) // 2
277
+ num_patch_height = num_patch_height - padding * 2
278
+ else:
279
+ scale_factor = num_patch_height / original_height
280
+ new_width = int(original_width * scale_factor)
281
+ padding = (num_patch_width - new_width) // 2
282
+ num_patch_width = num_patch_width - padding * 2
283
+
284
+ num_patches = num_patch_width * num_patch_height + num_patch_height
285
+ else:
286
+ num_patches = num_patch_width * num_patch_height
287
+
288
+ # slow는 첫프레임 마지막 프레임 적용 전략일때는 첫프레임과 마지막 프레임만 적용
289
+ if num_queries_vis_abstractor_slow > 0:
290
+ if first_last_frames_slow:
291
+ if is_first_or_last_frames:
292
+ num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
293
+ else:
294
+ num_patches += num_queries_vis_abstractor_slow - num_queries_vis_abstractor
295
+ # slowfast 기능은 unpad False 에만 적용
296
+ assert unpad is False
297
+
298
+ # video 에는 global image 가 포함되지 않음
299
+ if not video:
300
+ num_patches += num_global_per_grid**2
301
+
302
+ return num_patches
303
+
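# Quick sanity check (editor's note, not part of the committed file): with anyres=False, no
# resampler (num_queries_vis_abstractor == 0), grid_size=336 and patch_size=14, a single image
# yields (336 // 14) ** 2 == 576 visual tokens via the early-return branch above.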
304
+
305
+ class HCXVisionImageProcessor(BaseImageProcessor):
306
+ r"""
307
+ Constructs a VLM image processor. Based on [`CLIPImageProcessor`], incorporating additional techniques for processing high-resolution images.
308
+
309
+ Args:
310
+ anyres: (bool) whether to enable the anyres feature
311
+ unpad: (bool) when anyres is used, whether to enable unpadding (visual tokens that correspond to pure padding are dropped from the LLM input)
312
+ num_queries_vis_abstractor: (int) number of visual queries per grid when a resampler is used
313
+ possible_resolutions: (List) candidate resolutions when anyres is used, e.g. [[336, 336], [336, 672], [672, 336]]
314
+ patch_size: (int) ViT patch size
315
+ pad_to_square: (bool) whether to pad the image to a square; if False the image is not square, so it is center-cropped before being fed to the ViT
316
+ """
317
+
318
+ model_input_names = ["pixel_values"]
319
+
320
+ def __init__(
321
+ self,
322
+ do_resize: bool = True,
323
+ size: Dict[str, int] = None,
324
+ anyres: bool = False,
325
+ unpad: bool = False,
326
+ num_queries_vis_abstractor: int = 0,
327
+ possible_resolutions: List = [],
328
+ patch_size: int = 14,
329
+ pad_to_square: bool = True,
330
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
331
+ do_center_crop: bool = True,
332
+ crop_size: Dict[str, int] = None,
333
+ do_rescale: bool = True,
334
+ rescale_factor: Union[int, float] = 1 / 255,
335
+ do_normalize: bool = True,
336
+ image_mean: Optional[Union[float, List[float]]] = None,
337
+ image_std: Optional[Union[float, List[float]]] = None,
338
+ do_convert_rgb: bool = True,
339
+ **kwargs,
340
+ ) -> None:
341
+ super().__init__(**kwargs)
342
+ size = size if size is not None else {"shortest_edge": 336}
343
+ size = get_size_dict(size, default_to_square=False)
344
+ crop_size = crop_size if crop_size is not None else {"height": 336, "width": 336}
345
+ crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
346
+
347
+ self.do_resize = do_resize
348
+ self.size = size
349
+ self.anyres = anyres
350
+ self.unpad = unpad
351
+ self.num_queries_vis_abstractor = num_queries_vis_abstractor
352
+ self.possible_resolutions = [_resolution for _resolution in possible_resolutions]
353
+ self.patch_size = patch_size
354
+ self.pad_to_square = pad_to_square
355
+ self.resample = resample
356
+ self.do_center_crop = do_center_crop
357
+ self.crop_size = crop_size
358
+ self.do_rescale = do_rescale
359
+ self.rescale_factor = rescale_factor
360
+ self.do_normalize = do_normalize
361
+ self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
362
+ self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
363
+ self.do_convert_rgb = do_convert_rgb
364
+
365
+ def resize(
366
+ self,
367
+ image: np.ndarray,
368
+ size: Dict[str, int],
369
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
370
+ data_format: Optional[Union[str, ChannelDimension]] = None,
371
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
372
+ **kwargs,
373
+ ) -> np.ndarray:
374
+ default_to_square = True
375
+ if "shortest_edge" in size:
376
+ size = size["shortest_edge"]
377
+ default_to_square = False
378
+ elif "height" in size and "width" in size:
379
+ size = (size["height"], size["width"])
380
+ else:
381
+ raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")
382
+
383
+ output_size = get_resize_output_image_size(
384
+ image,
385
+ size=size,
386
+ default_to_square=default_to_square,
387
+ input_data_format=input_data_format,
388
+ )
389
+
390
+ return resize(
391
+ image,
392
+ size=output_size,
393
+ resample=resample,
394
+ data_format=data_format,
395
+ input_data_format=input_data_format,
396
+ **kwargs,
397
+ )
398
+
399
+ def _preprocess(
400
+ self,
401
+ images: ImageInput,
402
+ do_resize: bool = None,
403
+ size: Dict[str, int] = None,
404
+ resample: PILImageResampling = None,
405
+ do_center_crop: bool = None,
406
+ crop_size: int = None,
407
+ do_rescale: bool = None,
408
+ rescale_factor: float = None,
409
+ do_normalize: bool = None,
410
+ image_mean: Optional[Union[float, List[float]]] = None,
411
+ image_std: Optional[Union[float, List[float]]] = None,
412
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
413
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
414
+ ) -> Image.Image:
415
+ images = make_list_of_images(images)
416
+
417
+ if do_resize:
418
+ images = [
419
+ self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
420
+ for image in images
421
+ ]
422
+
423
+ if do_center_crop:
424
+ images = [
425
+ self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
426
+ ]
427
+
428
+ if do_rescale:
429
+ images = [
430
+ self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) for image in images
431
+ ]
432
+
433
+ if do_normalize:
434
+ images = [
435
+ self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
436
+ for image in images
437
+ ]
438
+
439
+ images = [
440
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
441
+ ]
442
+
443
+ return images
444
+
445
+ def _resize_for_local_grids(
446
+ self, image: np.array, target_resolution: tuple, resample, input_data_format: ChannelDimension
447
+ ) -> np.array:
448
+ new_height, new_width = _get_local_grids_output_size(image, target_resolution, input_data_format)
449
+
450
+ # Resize the image
451
+ resized_image = resize(image, (new_height, new_width), resample=resample, input_data_format=input_data_format)
452
+
453
+ return resized_image
454
+
455
+ def _pad_for_patching(
456
+ self, image: np.array, target_resolution: tuple, input_data_format: ChannelDimension
457
+ ) -> np.array:
458
+ """
459
+ Pad an image to a target resolution while maintaining aspect ratio.
460
+ """
461
+ target_height, target_width = target_resolution
462
+
463
+ background_color = tuple(int(x * 255) for x in self.image_mean)
464
+ padded_image = pad(
465
+ image,
466
+ target_size=(target_height, target_width),
467
+ background_color=background_color,
468
+ input_data_format=input_data_format,
469
+ )
470
+
471
+ return padded_image
472
+
473
+ def get_image_grids(
474
+ self,
475
+ image: np.array,
476
+ possible_resolutions,
477
+ grid_size: int,
478
+ resample: PILImageResampling,
479
+ data_format: ChannelDimension,
480
+ input_data_format: ChannelDimension,
481
+ ) -> List[np.array]:
482
+ if not isinstance(possible_resolutions, list):
483
+ raise ValueError("possible_resolutions must be a list of possible resolutions.")
484
+
485
+ image_size = get_image_size(image, channel_dim=input_data_format)
486
+ best_resolution = select_best_resolution(image_size, possible_resolutions)
487
+ resized_image = self._resize_for_local_grids(
488
+ image, best_resolution, resample=resample, input_data_format=input_data_format
489
+ )
490
+ padded_image = self._pad_for_patching(resized_image, best_resolution, input_data_format=input_data_format)
491
+ local_grids = divide_to_grids(padded_image, grid_size=grid_size, input_data_format=input_data_format)
492
+
493
+ # make sure that all patches are in the input data format
494
+ local_grids = [
495
+ to_channel_dimension_format(grid, channel_dim=data_format, input_channel_dim=input_data_format)
496
+ for grid in local_grids
497
+ ]
498
+
499
+ return local_grids
500
+
501
+ def preprocess(
502
+ self,
503
+ images: ImageInput,
504
+ do_resize: bool = None,
505
+ size: Dict[str, int] = None,
506
+ anyres: bool = None,
507
+ unpad: bool = None,
508
+ video: bool = None,
509
+ num_queries_vis_abstractor: int = None,
510
+ possible_resolutions: List = None,
511
+ patch_size: int = None,
512
+ pad_to_square: bool = None,
513
+ resample: PILImageResampling = None,
514
+ do_center_crop: bool = None,
515
+ crop_size: int = None,
516
+ do_rescale: bool = None,
517
+ rescale_factor: float = None,
518
+ do_normalize: bool = None,
519
+ image_mean: Optional[Union[float, List[float]]] = None,
520
+ image_std: Optional[Union[float, List[float]]] = None,
521
+ do_convert_rgb: bool = None,
522
+ return_tensors: Optional[Union[str, TensorType]] = None,
523
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
524
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
525
+ return_dummy_image: bool = False,
526
+ num_queries_vis_abstractor_slow: int = 0,
527
+ first_last_frames_slow: bool = False,
528
+ is_first_or_last_frames: bool = False,
529
+ ):
530
+ """
531
+ HCXVisionImageProcessor 로 image tensor, original image size (width, height), visual tokens
532
+
533
+ :return pixel_values: List of 4D tensor 로 image tensor
534
+ :return image_sizes: List of Dict 로 image width, height [{"width": image 1 의 width, "height": image 1 의 height}, {"width": image 2 의 width, "height": image 2 의 height}, ...]
535
+ :return vision_query_lengths: List of int 로 각 image 가 LLM 입력으로 전달될때 변환되는 visual token 수
536
+ """
537
+ do_resize = do_resize if do_resize is not None else self.do_resize
538
+ size = size if size is not None else self.size
539
+ size = get_size_dict(size, param_name="size", default_to_square=False)
540
+ anyres = anyres if anyres is not None else self.anyres
541
+ unpad = unpad if unpad is not None else self.unpad
542
+ if video:
543
+ unpad = False
544
+ num_queries_vis_abstractor = (
545
+ num_queries_vis_abstractor if num_queries_vis_abstractor is not None else self.num_queries_vis_abstractor
546
+ )
547
+ possible_resolutions = possible_resolutions if possible_resolutions is not None else self.possible_resolutions
548
+ patch_size = patch_size if patch_size is not None else self.patch_size
549
+ pad_to_square = pad_to_square if pad_to_square is not None else self.pad_to_square
550
+ resample = resample if resample is not None else self.resample
551
+ do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
552
+ crop_size = crop_size if crop_size is not None else self.crop_size
553
+ crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
554
+ do_rescale = do_rescale if do_rescale is not None else self.do_rescale
555
+ rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
556
+ do_normalize = do_normalize if do_normalize is not None else self.do_normalize
557
+ image_mean = image_mean if image_mean is not None else self.image_mean
558
+ image_std = image_std if image_std is not None else self.image_std
559
+ do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
560
+
561
+ if return_dummy_image:
562
+ images = Image.new("RGB", (224, 224), (0, 0, 0))
563
+
564
+ images = make_list_of_images(images)
565
+
566
+ if not valid_images(images):
567
+ raise ValueError(
568
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
569
+ "torch.Tensor, tf.Tensor or jax.ndarray."
570
+ )
571
+
572
+ if do_convert_rgb:
573
+ images = [convert_to_rgb(image) for image in images]
574
+
575
+ # All transformations expect numpy arrays.
576
+ images = [to_numpy_array(image) for image in images]
577
+
578
+ if is_scaled_image(images[0]) and do_rescale:
579
+ logger.warning_once(
580
+ "It looks like you are trying to rescale already rescaled images. If the input"
581
+ " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
582
+ )
583
+
584
+ if input_data_format is None:
585
+ # We assume that all images have the same channel dimension format.
586
+ input_data_format = infer_channel_dimension_format(images[0])
587
+
588
+ new_images = []
589
+ image_sizes = [get_image_size(image, channel_dim=input_data_format) for image in images]
590
+ vision_query_lengths = []
591
+
592
+ assert crop_size["height"] == crop_size["width"]
593
+
594
+ # global image 의 padding 연산은, image original width, height 가 클 때 bottleneck 이 될 수 있음
595
+ # 장축의 길이를 size["shortest_edge"] 로 resize 를 먼저 한 뒤에, padding
596
+ if anyres:
597
+ anyres_global_images = copy.deepcopy(images)
598
+ if pad_to_square:
599
+ background_color = tuple(int(x * 255) for x in self.image_mean)
600
+ anyres_global_images = [
601
+ resize_longside(copy.deepcopy(image), size["shortest_edge"], resample, input_data_format)
602
+ for image in anyres_global_images
603
+ ]
604
+ anyres_global_images = [
605
+ expand2square(image, background_color=background_color, input_data_format=input_data_format)[0]
606
+ for image in anyres_global_images
607
+ ]
608
+ else:
609
+ anyres_global_images = [
610
+ self.resize(
611
+ image=image,
612
+ size={"height": size["shortest_edge"], "width": size["shortest_edge"]},
613
+ resample=resample,
614
+ input_data_format=input_data_format,
615
+ )
616
+ for image in anyres_global_images
617
+ ]
618
+ else:
619
+ anyres_global_images = [None for _ in range(len(images))]
620
+ if pad_to_square:
621
+ background_color = tuple(int(x * 255) for x in self.image_mean)
622
+ images = [
623
+ resize_longside(image, size["shortest_edge"], resample, input_data_format) for image in images
624
+ ]
625
+ images = [
626
+ expand2square(image, background_color=background_color, input_data_format=input_data_format)[0]
627
+ for image in images
628
+ ]
629
+
630
+ for image, anyres_global_image, image_size in zip(images, anyres_global_images, image_sizes):
631
+ if anyres:
632
+ # convert image into a list of grids
633
+ # we intentionally use the same data format as the input data format
634
+ image_grids = self.get_image_grids(
635
+ image,
636
+ possible_resolutions,
637
+ grid_size=crop_size["height"],
638
+ resample=resample,
639
+ data_format=input_data_format,
640
+ input_data_format=input_data_format,
641
+ )
642
+ # video 에 대해서는 global image (thumbnail) 를 사용하지 않음
643
+ if not video:
644
+ image_grids = [anyres_global_image] + image_grids
645
+ else:
646
+ image_grids = [image]
647
+
648
+ pixel_values = self._preprocess(
649
+ image_grids,
650
+ do_resize=do_resize,
651
+ size=size,
652
+ resample=resample,
653
+ do_center_crop=do_center_crop,
654
+ crop_size=crop_size,
655
+ do_rescale=do_rescale,
656
+ rescale_factor=rescale_factor,
657
+ do_normalize=do_normalize,
658
+ image_mean=image_mean,
659
+ image_std=image_std,
660
+ data_format=data_format,
661
+ input_data_format=input_data_format,
662
+ )
663
+
664
+ pixel_values = np.array(pixel_values)
665
+ new_images.append(pixel_values)
666
+
667
+ num_grids = pixel_values.shape[0]
668
+
669
+ vision_query_length = determine_anyres_num_vision_patches(
670
+ num_grids=num_grids,
671
+ image_size=image_size,
672
+ grid_size=crop_size["height"],
673
+ patch_size=patch_size,
674
+ possible_resolutions=possible_resolutions,
675
+ anyres=anyres,
676
+ unpad=unpad,
677
+ num_queries_vis_abstractor=num_queries_vis_abstractor,
678
+ num_queries_vis_abstractor_slow=num_queries_vis_abstractor_slow,
679
+ video=video,
680
+ first_last_frames_slow=first_last_frames_slow,
681
+ is_first_or_last_frames=is_first_or_last_frames,
682
+ )
683
+
684
+ vision_query_lengths.append(vision_query_length)
685
+
686
+ if return_dummy_image:
687
+ vision_query_lengths = []
688
+
689
+ data = {
690
+ "pixel_values": [torch.tensor(new_image) for new_image in new_images],
691
+ "image_sizes": [{"width": image_size[1], "height": image_size[0]} for image_size in image_sizes],
692
+ "vision_query_lengths": vision_query_lengths,
693
+ }
694
+
695
+ return BatchFeature(data=data)
696
+
697
+ def save_pretrained(
698
+ self,
699
+ save_directory: Union[str, os.PathLike],
700
+ *args,
701
+ **kwargs,
702
+ ):
703
+ self.register_for_auto_class()
704
+ super().save_pretrained(save_directory, *args, **kwargs)
705
+
706
+
707
+ class HCXVisionV2Processor(Qwen2_5_VLProcessor):
708
+ attributes = ["image_processor", "tokenizer", "video_processor"]
709
+ image_processor_class = "AutoImageProcessor"
710
+ video_processor_class = "AutoVideoProcessor"
711
+ tokenizer_class = ("GPT2Tokenizer", "GPT2TokenizerFast", "PreTrainedTokenizer", "PreTrainedTokenizerFast")
712
+
713
+ def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
714
+ self.tokenizer = tokenizer
715
+ super().__init__(image_processor, tokenizer, video_processor, chat_template=self.tokenizer.chat_template)
716
+
717
+ def save_pretrained(
718
+ self,
719
+ save_directory: Union[str, os.PathLike],
720
+ *args,
721
+ **kwargs,
722
+ ):
723
+ self.register_for_auto_class()
724
+ super().save_pretrained(save_directory, *args, **kwargs)
725
+
726
+ def __call__(
727
+ self,
728
+ images: ImageInput = None,
729
+ text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
730
+ videos: VideoInput = None,
731
+ **kwargs: Unpack[Qwen2_5_VLProcessorKwargs],
732
+ ) -> BatchFeature:
733
+ """
734
+ Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
735
+ and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
736
+ the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwrags` arguments to
737
+ Qwen2VLImageProcessor's [`~Qwen2VLImageProcessor.__call__`] if `vision_infos` is not `None`.
738
+
739
+ Args:
740
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
741
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
742
+ tensor. Both channels-first and channels-last formats are supported.
743
+ text (`str`, `list[str]`, `list[list[str]]`):
744
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
745
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
746
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
747
+ videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
748
+ The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
749
+ tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
750
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
751
+ If set, will return tensors of a particular framework. Acceptable values are:
752
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
753
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
754
+ - `'np'`: Return NumPy `np.ndarray` objects.
755
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
756
+
757
+ Returns:
758
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
759
+
760
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
761
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
762
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
763
+ `None`).
764
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
765
+ - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
766
+ - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
767
+ - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
768
+ """
769
+ output_kwargs = self._merge_kwargs(
770
+ Qwen2_5_VLProcessorKwargs,
771
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
772
+ **kwargs,
773
+ )
774
+
775
+ image_inputs = videos_inputs = {}
776
+ if images is not None:
777
+ image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
778
+ image_grid_thw = image_inputs["image_grid_thw"]
779
+
780
+ if videos is not None:
781
+ videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
782
+ video_grid_thw = videos_inputs["video_grid_thw"]
783
+
784
+ if not isinstance(text, list):
785
+ text = [text]
786
+
787
+ text = text.copy() # below lines change text in-place
788
+
789
+ if images is not None:
790
+ merge_length = self.image_processor.merge_size**2
791
+ index = 0
792
+ for i in range(len(text)):
793
+ while self.image_token in text[i]:
794
+ num_image_tokens = image_grid_thw[index].prod() // merge_length
795
+ text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
796
+ text[i] = text[i].replace(
797
+ '{"resolution": [w, h]}', '{"resolution": ' + str(list(images[i].size)) + "}"
798
+ )
799
+ index += 1
800
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
801
+
802
+ if videos is not None:
803
+ merge_length = self.video_processor.merge_size**2
804
+ index = 0
805
+ for i in range(len(text)):
806
+ while self.video_token in text[i]:
807
+ num_video_tokens = video_grid_thw[index].prod() // merge_length
808
+ text[i] = text[i].replace(self.video_token, "<|placeholder|>" * num_video_tokens, 1)
809
+ index += 1
810
+ text[i] = text[i].replace("<|placeholder|>", self.video_token)
811
+
812
+ return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
813
+ return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", False)
814
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"], return_tensors=None)
815
+ self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])
816
+
817
+ if return_mm_token_type_ids:
818
+ array_ids = np.array(text_inputs["input_ids"])
819
+ mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
820
+ mm_token_type_ids[array_ids == self.image_token_id] = 1
821
+ text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
822
+
823
+ return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
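Editor's note: `__call__` above expands each image placeholder into the right number of image tokens (computed from `image_grid_thw` and `merge_size`) and substitutes the actual resolution into any `{"resolution": [w, h]}` marker in the prompt. A minimal usage sketch; the checkpoint path is a placeholder and the chat-message layout is an assumption about the bundled chat template, not something verified against this repository:

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("<path-or-repo-id>", trust_remote_code=True)

messages = [
    {"role": "user", "content": [
        {"type": "image"},  # the chat template is expected to emit the image token here
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

inputs = processor(images=[Image.open("example.jpg")], text=[prompt], return_tensors="pt")
# inputs carries input_ids / attention_mask plus pixel_values and image_grid_thw for the model.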
processor_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_vlm.HCXVisionV2Processor"
4
+ },
5
+ "processor_class": "HCXVisionV2Processor"
6
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|endoftext|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|im_end|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "image_token": "<|IMAGE_PAD|>",
17
+ "pad_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "sep_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ },
31
+ "unk_token": {
32
+ "content": "<|endoftext|>",
33
+ "lstrip": false,
34
+ "normalized": false,
35
+ "rstrip": false,
36
+ "single_word": false
37
+ },
38
+ "video_token": "<|VIDEO_PAD|>"
39
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,2079 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "128000": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "128001": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "128002": {
29
+ "content": "<|stop|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "128003": {
37
+ "content": "<|endofturn|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "128004": {
45
+ "content": "<|fim_prefix|>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "128005": {
53
+ "content": "<|fim_middle|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "128006": {
61
+ "content": "<|fim_suffix|>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "128007": {
69
+ "content": "<repo_name>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "128008": {
77
+ "content": "<file_sep>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "128009": {
85
+ "content": "<issue_start>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "128010": {
93
+ "content": "<issue_comment>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "128011": {
101
+ "content": "<issue_closed>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "128012": {
109
+ "content": "<jupyter_start>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "128013": {
117
+ "content": "<jupyter_text>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": true
123
+ },
124
+ "128014": {
125
+ "content": "<jupyter_code>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": true
131
+ },
132
+ "128015": {
133
+ "content": "<jupyter_output>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": true
139
+ },
140
+ "128016": {
141
+ "content": "<jupyter_script>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": true
147
+ },
148
+ "128017": {
149
+ "content": "<empty_output>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": true
155
+ },
156
+ "128018": {
157
+ "content": "<code_to_intermediate>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": true
163
+ },
164
+ "128019": {
165
+ "content": "<intermediate_to_code>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": true
171
+ },
172
+ "128020": {
173
+ "content": "<pr>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": true
179
+ },
180
+ "128021": {
181
+ "content": "<pr_status>",
182
+ "lstrip": false,
183
+ "normalized": false,
184
+ "rstrip": false,
185
+ "single_word": false,
186
+ "special": true
187
+ },
188
+ "128022": {
189
+ "content": "<pr_is_merged>",
190
+ "lstrip": false,
191
+ "normalized": false,
192
+ "rstrip": false,
193
+ "single_word": false,
194
+ "special": true
195
+ },
196
+ "128023": {
197
+ "content": "<pr_base>",
198
+ "lstrip": false,
199
+ "normalized": false,
200
+ "rstrip": false,
201
+ "single_word": false,
202
+ "special": true
203
+ },
204
+ "128024": {
205
+ "content": "<pr_file>",
206
+ "lstrip": false,
207
+ "normalized": false,
208
+ "rstrip": false,
209
+ "single_word": false,
210
+ "special": true
211
+ },
212
+ "128025": {
213
+ "content": "<pr_base_code>",
214
+ "lstrip": false,
215
+ "normalized": false,
216
+ "rstrip": false,
217
+ "single_word": false,
218
+ "special": true
219
+ },
220
+ "128026": {
221
+ "content": "<pr_diff>",
222
+ "lstrip": false,
223
+ "normalized": false,
224
+ "rstrip": false,
225
+ "single_word": false,
226
+ "special": true
227
+ },
228
+ "128027": {
229
+ "content": "<pr_diff_hunk>",
230
+ "lstrip": false,
231
+ "normalized": false,
232
+ "rstrip": false,
233
+ "single_word": false,
234
+ "special": true
235
+ },
236
+ "128028": {
237
+ "content": "<pr_comment>",
238
+ "lstrip": false,
239
+ "normalized": false,
240
+ "rstrip": false,
241
+ "single_word": false,
242
+ "special": true
243
+ },
244
+ "128029": {
245
+ "content": "<pr_event_id>",
246
+ "lstrip": false,
247
+ "normalized": false,
248
+ "rstrip": false,
249
+ "single_word": false,
250
+ "special": true
251
+ },
252
+ "128030": {
253
+ "content": "<pr_review>",
254
+ "lstrip": false,
255
+ "normalized": false,
256
+ "rstrip": false,
257
+ "single_word": false,
258
+ "special": true
259
+ },
260
+ "128031": {
261
+ "content": "<pr_review_state>",
262
+ "lstrip": false,
263
+ "normalized": false,
264
+ "rstrip": false,
265
+ "single_word": false,
266
+ "special": true
267
+ },
268
+ "128032": {
269
+ "content": "<pr_review_comment>",
270
+ "lstrip": false,
271
+ "normalized": false,
272
+ "rstrip": false,
273
+ "single_word": false,
274
+ "special": true
275
+ },
276
+ "128033": {
277
+ "content": "<pr_in_reply_to_review_id>",
278
+ "lstrip": false,
279
+ "normalized": false,
280
+ "rstrip": false,
281
+ "single_word": false,
282
+ "special": true
283
+ },
284
+ "128034": {
285
+ "content": "<pr_in_reply_to_comment_id>",
286
+ "lstrip": false,
287
+ "normalized": false,
288
+ "rstrip": false,
289
+ "single_word": false,
290
+ "special": true
291
+ },
292
+ "128035": {
293
+ "content": "<pr_diff_hunk_comment_line>",
294
+ "lstrip": false,
295
+ "normalized": false,
296
+ "rstrip": false,
297
+ "single_word": false,
298
+ "special": true
299
+ },
300
+ "128036": {
301
+ "content": "<NAME>",
302
+ "lstrip": false,
303
+ "normalized": false,
304
+ "rstrip": false,
305
+ "single_word": false,
306
+ "special": true
307
+ },
308
+ "128037": {
309
+ "content": "<EMAIL>",
310
+ "lstrip": false,
311
+ "normalized": false,
312
+ "rstrip": false,
313
+ "single_word": false,
314
+ "special": true
315
+ },
316
+ "128038": {
317
+ "content": "<KEY>",
318
+ "lstrip": false,
319
+ "normalized": false,
320
+ "rstrip": false,
321
+ "single_word": false,
322
+ "special": true
323
+ },
324
+ "128039": {
325
+ "content": "<PASSWORD>",
326
+ "lstrip": false,
327
+ "normalized": false,
328
+ "rstrip": false,
329
+ "single_word": false,
330
+ "special": true
331
+ },
332
+ "128040": {
333
+ "content": "<think>",
334
+ "lstrip": false,
335
+ "normalized": false,
336
+ "rstrip": false,
337
+ "single_word": false,
338
+ "special": false
339
+ },
340
+ "128041": {
341
+ "content": "</think>",
342
+ "lstrip": false,
343
+ "normalized": false,
344
+ "rstrip": false,
345
+ "single_word": false,
346
+ "special": false
347
+ },
348
+ "128042": {
349
+ "content": "<tool_call>",
350
+ "lstrip": false,
351
+ "normalized": false,
352
+ "rstrip": false,
353
+ "single_word": false,
354
+ "special": false
355
+ },
356
+ "128043": {
357
+ "content": "</tool_call>",
358
+ "lstrip": false,
359
+ "normalized": false,
360
+ "rstrip": false,
361
+ "single_word": false,
362
+ "special": false
363
+ },
364
+ "128044": {
365
+ "content": "<arg_key>",
366
+ "lstrip": false,
367
+ "normalized": false,
368
+ "rstrip": false,
369
+ "single_word": false,
370
+ "special": false
371
+ },
372
+ "128045": {
373
+ "content": "</arg_key>",
374
+ "lstrip": false,
375
+ "normalized": false,
376
+ "rstrip": false,
377
+ "single_word": false,
378
+ "special": false
379
+ },
380
+ "128046": {
381
+ "content": "<arg_value>",
382
+ "lstrip": false,
383
+ "normalized": false,
384
+ "rstrip": false,
385
+ "single_word": false,
386
+ "special": false
387
+ },
388
+ "128047": {
389
+ "content": "</arg_value>",
390
+ "lstrip": false,
391
+ "normalized": false,
392
+ "rstrip": false,
393
+ "single_word": false,
394
+ "special": false
395
+ },
396
+ "128048": {
397
+ "content": "<tool_response>",
398
+ "lstrip": false,
399
+ "normalized": false,
400
+ "rstrip": false,
401
+ "single_word": false,
402
+ "special": false
403
+ },
404
+ "128049": {
405
+ "content": "</tool_response>",
406
+ "lstrip": false,
407
+ "normalized": false,
408
+ "rstrip": false,
409
+ "single_word": false,
410
+ "special": false
411
+ },
412
+ "128050": {
413
+ "content": "<tools>",
414
+ "lstrip": false,
415
+ "normalized": false,
416
+ "rstrip": false,
417
+ "single_word": false,
418
+ "special": false
419
+ },
420
+ "128051": {
421
+ "content": "</tools>",
422
+ "lstrip": false,
423
+ "normalized": false,
424
+ "rstrip": false,
425
+ "single_word": false,
426
+ "special": false
427
+ },
428
+ "128052": {
429
+ "content": "<|mime_start|>",
430
+ "lstrip": false,
431
+ "normalized": false,
432
+ "rstrip": false,
433
+ "single_word": false,
434
+ "special": true
435
+ },
436
+ "128053": {
437
+ "content": "<|mime_end|>",
438
+ "lstrip": false,
439
+ "normalized": false,
440
+ "rstrip": false,
441
+ "single_word": false,
442
+ "special": true
443
+ },
444
+ "128054": {
445
+ "content": "<|document_start|>",
446
+ "lstrip": false,
447
+ "normalized": false,
448
+ "rstrip": false,
449
+ "single_word": false,
450
+ "special": true
451
+ },
452
+ "128055": {
453
+ "content": "<|document_end|>",
454
+ "lstrip": false,
455
+ "normalized": false,
456
+ "rstrip": false,
457
+ "single_word": false,
458
+ "special": true
459
+ },
460
+ "128056": {
461
+ "content": "<|image_start|>",
462
+ "lstrip": false,
463
+ "normalized": false,
464
+ "rstrip": false,
465
+ "single_word": false,
466
+ "special": true
467
+ },
468
+ "128057": {
469
+ "content": "<|image_end|>",
470
+ "lstrip": false,
471
+ "normalized": false,
472
+ "rstrip": false,
473
+ "single_word": false,
474
+ "special": true
475
+ },
476
+ "128058": {
477
+ "content": "<|video_start|>",
478
+ "lstrip": false,
479
+ "normalized": false,
480
+ "rstrip": false,
481
+ "single_word": false,
482
+ "special": true
483
+ },
484
+ "128059": {
485
+ "content": "<|video_end|>",
486
+ "lstrip": false,
487
+ "normalized": false,
488
+ "rstrip": false,
489
+ "single_word": false,
490
+ "special": true
491
+ },
492
+ "128060": {
493
+ "content": "<|IMAGE_PAD|>",
494
+ "lstrip": false,
495
+ "normalized": false,
496
+ "rstrip": false,
497
+ "single_word": false,
498
+ "special": true
499
+ },
500
+ "128061": {
501
+ "content": "<|VIDEO_PAD|>",
502
+ "lstrip": false,
503
+ "normalized": false,
504
+ "rstrip": false,
505
+ "single_word": false,
506
+ "special": true
507
+ },
508
+ "128062": {
509
+ "content": "<|vision_aux_start|>",
510
+ "lstrip": false,
511
+ "normalized": false,
512
+ "rstrip": false,
513
+ "single_word": false,
514
+ "special": true
515
+ },
516
+ "128063": {
517
+ "content": "<|vision_aux_end|>",
518
+ "lstrip": false,
519
+ "normalized": false,
520
+ "rstrip": false,
521
+ "single_word": false,
522
+ "special": true
523
+ },
524
+ "128064": {
525
+ "content": "<|code_switching|>",
526
+ "lstrip": false,
527
+ "normalized": false,
528
+ "rstrip": false,
529
+ "single_word": false,
530
+ "special": true
531
+ },
532
+ "128065": {
533
+ "content": "<|back_translation|>",
534
+ "lstrip": false,
535
+ "normalized": false,
536
+ "rstrip": false,
537
+ "single_word": false,
538
+ "special": true
539
+ },
540
+ "128066": {
541
+ "content": "<|instruction_pretraining|>",
542
+ "lstrip": false,
543
+ "normalized": false,
544
+ "rstrip": false,
545
+ "single_word": false,
546
+ "special": true
547
+ },
548
+ "128067": {
549
+ "content": "<|_placeholder_067|>",
550
+ "lstrip": false,
551
+ "normalized": false,
552
+ "rstrip": false,
553
+ "single_word": false,
554
+ "special": true
555
+ },
556
+ "128068": {
557
+ "content": "<|_placeholder_068|>",
558
+ "lstrip": false,
559
+ "normalized": false,
560
+ "rstrip": false,
561
+ "single_word": false,
562
+ "special": true
563
+ },
564
+ "128069": {
565
+ "content": "<|_placeholder_069|>",
566
+ "lstrip": false,
567
+ "normalized": false,
568
+ "rstrip": false,
569
+ "single_word": false,
570
+ "special": true
571
+ },
572
+ "128070": {
573
+ "content": "<|_placeholder_070|>",
574
+ "lstrip": false,
575
+ "normalized": false,
576
+ "rstrip": false,
577
+ "single_word": false,
578
+ "special": true
579
+ },
580
+ "128071": {
581
+ "content": "<|_placeholder_071|>",
582
+ "lstrip": false,
583
+ "normalized": false,
584
+ "rstrip": false,
585
+ "single_word": false,
586
+ "special": true
587
+ },
588
+ "128072": {
589
+ "content": "<|_placeholder_072|>",
590
+ "lstrip": false,
591
+ "normalized": false,
592
+ "rstrip": false,
593
+ "single_word": false,
594
+ "special": true
595
+ },
596
+ "128073": {
597
+ "content": "<|_placeholder_073|>",
598
+ "lstrip": false,
599
+ "normalized": false,
600
+ "rstrip": false,
601
+ "single_word": false,
602
+ "special": true
603
+ },
604
+ "128074": {
605
+ "content": "<|_placeholder_074|>",
606
+ "lstrip": false,
607
+ "normalized": false,
608
+ "rstrip": false,
609
+ "single_word": false,
610
+ "special": true
611
+ },
612
+ "128075": {
613
+ "content": "<|_placeholder_075|>",
614
+ "lstrip": false,
615
+ "normalized": false,
616
+ "rstrip": false,
617
+ "single_word": false,
618
+ "special": true
619
+ },
620
+ "128076": {
621
+ "content": "<|_placeholder_076|>",
622
+ "lstrip": false,
623
+ "normalized": false,
624
+ "rstrip": false,
625
+ "single_word": false,
626
+ "special": true
627
+ },
628
+ "128077": {
629
+ "content": "<|_placeholder_077|>",
630
+ "lstrip": false,
631
+ "normalized": false,
632
+ "rstrip": false,
633
+ "single_word": false,
634
+ "special": true
635
+ },
636
+ "128078": {
637
+ "content": "<|_placeholder_078|>",
638
+ "lstrip": false,
639
+ "normalized": false,
640
+ "rstrip": false,
641
+ "single_word": false,
642
+ "special": true
643
+ },
644
+ "128079": {
645
+ "content": "<|_placeholder_079|>",
646
+ "lstrip": false,
647
+ "normalized": false,
648
+ "rstrip": false,
649
+ "single_word": false,
650
+ "special": true
651
+ },
652
+ "128080": {
653
+ "content": "<|_placeholder_080|>",
654
+ "lstrip": false,
655
+ "normalized": false,
656
+ "rstrip": false,
657
+ "single_word": false,
658
+ "special": true
659
+ },
660
+ "128081": {
661
+ "content": "<|_placeholder_081|>",
662
+ "lstrip": false,
663
+ "normalized": false,
664
+ "rstrip": false,
665
+ "single_word": false,
666
+ "special": true
667
+ },
668
+ "128082": {
669
+ "content": "<|_placeholder_082|>",
670
+ "lstrip": false,
671
+ "normalized": false,
672
+ "rstrip": false,
673
+ "single_word": false,
674
+ "special": true
675
+ },
676
+ "128083": {
677
+ "content": "<|_placeholder_083|>",
678
+ "lstrip": false,
679
+ "normalized": false,
680
+ "rstrip": false,
681
+ "single_word": false,
682
+ "special": true
683
+ },
684
+ "128084": {
685
+ "content": "<|_placeholder_084|>",
686
+ "lstrip": false,
687
+ "normalized": false,
688
+ "rstrip": false,
689
+ "single_word": false,
690
+ "special": true
691
+ },
692
+ "128085": {
693
+ "content": "<|_placeholder_085|>",
694
+ "lstrip": false,
695
+ "normalized": false,
696
+ "rstrip": false,
697
+ "single_word": false,
698
+ "special": true
699
+ },
700
+ "128086": {
701
+ "content": "<|_placeholder_086|>",
702
+ "lstrip": false,
703
+ "normalized": false,
704
+ "rstrip": false,
705
+ "single_word": false,
706
+ "special": true
707
+ },
708
+ "128087": {
709
+ "content": "<|_placeholder_087|>",
710
+ "lstrip": false,
711
+ "normalized": false,
712
+ "rstrip": false,
713
+ "single_word": false,
714
+ "special": true
715
+ },
716
+ "128088": {
717
+ "content": "<|_placeholder_088|>",
718
+ "lstrip": false,
719
+ "normalized": false,
720
+ "rstrip": false,
721
+ "single_word": false,
722
+ "special": true
723
+ },
724
+ "128089": {
725
+ "content": "<|_placeholder_089|>",
726
+ "lstrip": false,
727
+ "normalized": false,
728
+ "rstrip": false,
729
+ "single_word": false,
730
+ "special": true
731
+ },
732
+ "128090": {
733
+ "content": "<|_placeholder_090|>",
734
+ "lstrip": false,
735
+ "normalized": false,
736
+ "rstrip": false,
737
+ "single_word": false,
738
+ "special": true
739
+ },
740
+ "128091": {
741
+ "content": "<|_placeholder_091|>",
742
+ "lstrip": false,
743
+ "normalized": false,
744
+ "rstrip": false,
745
+ "single_word": false,
746
+ "special": true
747
+ },
748
+ "128092": {
749
+ "content": "<|_placeholder_092|>",
750
+ "lstrip": false,
751
+ "normalized": false,
752
+ "rstrip": false,
753
+ "single_word": false,
754
+ "special": true
755
+ },
756
+ "128093": {
757
+ "content": "<|_placeholder_093|>",
758
+ "lstrip": false,
759
+ "normalized": false,
760
+ "rstrip": false,
761
+ "single_word": false,
762
+ "special": true
763
+ },
764
+ "128094": {
765
+ "content": "<|_placeholder_094|>",
766
+ "lstrip": false,
767
+ "normalized": false,
768
+ "rstrip": false,
769
+ "single_word": false,
770
+ "special": true
771
+ },
772
+ "128095": {
773
+ "content": "<|_placeholder_095|>",
774
+ "lstrip": false,
775
+ "normalized": false,
776
+ "rstrip": false,
777
+ "single_word": false,
778
+ "special": true
779
+ },
780
+ "128096": {
781
+ "content": "<|_placeholder_096|>",
782
+ "lstrip": false,
783
+ "normalized": false,
784
+ "rstrip": false,
785
+ "single_word": false,
786
+ "special": true
787
+ },
788
+ "128097": {
789
+ "content": "<|_placeholder_097|>",
790
+ "lstrip": false,
791
+ "normalized": false,
792
+ "rstrip": false,
793
+ "single_word": false,
794
+ "special": true
795
+ },
796
+ "128098": {
797
+ "content": "<|_placeholder_098|>",
798
+ "lstrip": false,
799
+ "normalized": false,
800
+ "rstrip": false,
801
+ "single_word": false,
802
+ "special": true
803
+ },
804
+ "128099": {
805
+ "content": "<|_placeholder_099|>",
806
+ "lstrip": false,
807
+ "normalized": false,
808
+ "rstrip": false,
809
+ "single_word": false,
810
+ "special": true
811
+ },
812
+ "128100": {
813
+ "content": "<|_placeholder_100|>",
814
+ "lstrip": false,
815
+ "normalized": false,
816
+ "rstrip": false,
817
+ "single_word": false,
818
+ "special": true
819
+ },
820
+ "128101": {
821
+ "content": "<|_placeholder_101|>",
822
+ "lstrip": false,
823
+ "normalized": false,
824
+ "rstrip": false,
825
+ "single_word": false,
826
+ "special": true
827
+ },
828
+ "128102": {
829
+ "content": "<|_placeholder_102|>",
830
+ "lstrip": false,
831
+ "normalized": false,
832
+ "rstrip": false,
833
+ "single_word": false,
834
+ "special": true
835
+ },
836
+ "128103": {
837
+ "content": "<|_placeholder_103|>",
838
+ "lstrip": false,
839
+ "normalized": false,
840
+ "rstrip": false,
841
+ "single_word": false,
842
+ "special": true
843
+ },
844
+ "128104": {
845
+ "content": "<|_placeholder_104|>",
846
+ "lstrip": false,
847
+ "normalized": false,
848
+ "rstrip": false,
849
+ "single_word": false,
850
+ "special": true
851
+ },
852
+ "128105": {
853
+ "content": "<|_placeholder_105|>",
854
+ "lstrip": false,
855
+ "normalized": false,
856
+ "rstrip": false,
857
+ "single_word": false,
858
+ "special": true
859
+ },
860
+ "128106": {
861
+ "content": "<|_placeholder_106|>",
862
+ "lstrip": false,
863
+ "normalized": false,
864
+ "rstrip": false,
865
+ "single_word": false,
866
+ "special": true
867
+ },
868
+ "128107": {
869
+ "content": "<|_placeholder_107|>",
870
+ "lstrip": false,
871
+ "normalized": false,
872
+ "rstrip": false,
873
+ "single_word": false,
874
+ "special": true
875
+ },
876
+ "128108": {
877
+ "content": "<|_placeholder_108|>",
878
+ "lstrip": false,
879
+ "normalized": false,
880
+ "rstrip": false,
881
+ "single_word": false,
882
+ "special": true
883
+ },
884
+ "128109": {
885
+ "content": "<|_placeholder_109|>",
886
+ "lstrip": false,
887
+ "normalized": false,
888
+ "rstrip": false,
889
+ "single_word": false,
890
+ "special": true
891
+ },
892
+ "128110": {
893
+ "content": "<|_placeholder_110|>",
894
+ "lstrip": false,
895
+ "normalized": false,
896
+ "rstrip": false,
897
+ "single_word": false,
898
+ "special": true
899
+ },
900
+ "128111": {
901
+ "content": "<|_placeholder_111|>",
902
+ "lstrip": false,
903
+ "normalized": false,
904
+ "rstrip": false,
905
+ "single_word": false,
906
+ "special": true
907
+ },
908
+ "128112": {
909
+ "content": "<|_placeholder_112|>",
910
+ "lstrip": false,
911
+ "normalized": false,
912
+ "rstrip": false,
913
+ "single_word": false,
914
+ "special": true
915
+ },
916
+ "128113": {
917
+ "content": "<|_placeholder_113|>",
918
+ "lstrip": false,
919
+ "normalized": false,
920
+ "rstrip": false,
921
+ "single_word": false,
922
+ "special": true
923
+ },
924
+ "128114": {
925
+ "content": "<|_placeholder_114|>",
926
+ "lstrip": false,
927
+ "normalized": false,
928
+ "rstrip": false,
929
+ "single_word": false,
930
+ "special": true
931
+ },
932
+ "128115": {
933
+ "content": "<|_placeholder_115|>",
934
+ "lstrip": false,
935
+ "normalized": false,
936
+ "rstrip": false,
937
+ "single_word": false,
938
+ "special": true
939
+ },
940
+ "128116": {
941
+ "content": "<|_placeholder_116|>",
942
+ "lstrip": false,
943
+ "normalized": false,
944
+ "rstrip": false,
945
+ "single_word": false,
946
+ "special": true
947
+ },
948
+ "128117": {
949
+ "content": "<|_placeholder_117|>",
950
+ "lstrip": false,
951
+ "normalized": false,
952
+ "rstrip": false,
953
+ "single_word": false,
954
+ "special": true
955
+ },
956
+ "128118": {
957
+ "content": "<|_placeholder_118|>",
958
+ "lstrip": false,
959
+ "normalized": false,
960
+ "rstrip": false,
961
+ "single_word": false,
962
+ "special": true
963
+ },
964
+ "128119": {
965
+ "content": "<|_placeholder_119|>",
966
+ "lstrip": false,
967
+ "normalized": false,
968
+ "rstrip": false,
969
+ "single_word": false,
970
+ "special": true
971
+ },
972
+ "128120": {
973
+ "content": "<|_placeholder_120|>",
974
+ "lstrip": false,
975
+ "normalized": false,
976
+ "rstrip": false,
977
+ "single_word": false,
978
+ "special": true
979
+ },
980
+ "128121": {
981
+ "content": "<|_placeholder_121|>",
982
+ "lstrip": false,
983
+ "normalized": false,
984
+ "rstrip": false,
985
+ "single_word": false,
986
+ "special": true
987
+ },
988
+ "128122": {
989
+ "content": "<|_placeholder_122|>",
990
+ "lstrip": false,
991
+ "normalized": false,
992
+ "rstrip": false,
993
+ "single_word": false,
994
+ "special": true
995
+ },
996
+ "128123": {
997
+ "content": "<|_placeholder_123|>",
998
+ "lstrip": false,
999
+ "normalized": false,
1000
+ "rstrip": false,
1001
+ "single_word": false,
1002
+ "special": true
1003
+ },
1004
+ "128124": {
1005
+ "content": "<|_placeholder_124|>",
1006
+ "lstrip": false,
1007
+ "normalized": false,
1008
+ "rstrip": false,
1009
+ "single_word": false,
1010
+ "special": true
1011
+ },
1012
+ "128125": {
1013
+ "content": "<|_placeholder_125|>",
1014
+ "lstrip": false,
1015
+ "normalized": false,
1016
+ "rstrip": false,
1017
+ "single_word": false,
1018
+ "special": true
1019
+ },
1020
+ "128126": {
1021
+ "content": "<|_placeholder_126|>",
1022
+ "lstrip": false,
1023
+ "normalized": false,
1024
+ "rstrip": false,
1025
+ "single_word": false,
1026
+ "special": true
1027
+ },
1028
+ "128127": {
1029
+ "content": "<|_placeholder_127|>",
1030
+ "lstrip": false,
1031
+ "normalized": false,
1032
+ "rstrip": false,
1033
+ "single_word": false,
1034
+ "special": true
1035
+ },
1036
+ "128128": {
1037
+ "content": "<|_placeholder_128|>",
1038
+ "lstrip": false,
1039
+ "normalized": false,
1040
+ "rstrip": false,
1041
+ "single_word": false,
1042
+ "special": true
1043
+ },
1044
+ "128129": {
1045
+ "content": "<|_placeholder_129|>",
1046
+ "lstrip": false,
1047
+ "normalized": false,
1048
+ "rstrip": false,
1049
+ "single_word": false,
1050
+ "special": true
1051
+ },
1052
+ "128130": {
1053
+ "content": "<|_placeholder_130|>",
1054
+ "lstrip": false,
1055
+ "normalized": false,
1056
+ "rstrip": false,
1057
+ "single_word": false,
1058
+ "special": true
1059
+ },
1060
+ "128131": {
1061
+ "content": "<|_placeholder_131|>",
1062
+ "lstrip": false,
1063
+ "normalized": false,
1064
+ "rstrip": false,
1065
+ "single_word": false,
1066
+ "special": true
1067
+ },
1068
+ "128132": {
1069
+ "content": "<|_placeholder_132|>",
1070
+ "lstrip": false,
1071
+ "normalized": false,
1072
+ "rstrip": false,
1073
+ "single_word": false,
1074
+ "special": true
1075
+ },
1076
+ "128133": {
1077
+ "content": "<|_placeholder_133|>",
1078
+ "lstrip": false,
1079
+ "normalized": false,
1080
+ "rstrip": false,
1081
+ "single_word": false,
1082
+ "special": true
1083
+ },
1084
+ "128134": {
1085
+ "content": "<|_placeholder_134|>",
1086
+ "lstrip": false,
1087
+ "normalized": false,
1088
+ "rstrip": false,
1089
+ "single_word": false,
1090
+ "special": true
1091
+ },
1092
+ "128135": {
1093
+ "content": "<|_placeholder_135|>",
1094
+ "lstrip": false,
1095
+ "normalized": false,
1096
+ "rstrip": false,
1097
+ "single_word": false,
1098
+ "special": true
1099
+ },
1100
+ "128136": {
1101
+ "content": "<|_placeholder_136|>",
1102
+ "lstrip": false,
1103
+ "normalized": false,
1104
+ "rstrip": false,
1105
+ "single_word": false,
1106
+ "special": true
1107
+ },
1108
+ "128137": {
1109
+ "content": "<|_placeholder_137|>",
1110
+ "lstrip": false,
1111
+ "normalized": false,
1112
+ "rstrip": false,
1113
+ "single_word": false,
1114
+ "special": true
1115
+ },
1116
+ "128138": {
1117
+ "content": "<|_placeholder_138|>",
1118
+ "lstrip": false,
1119
+ "normalized": false,
1120
+ "rstrip": false,
1121
+ "single_word": false,
1122
+ "special": true
1123
+ },
1124
+ "128139": {
1125
+ "content": "<|_placeholder_139|>",
1126
+ "lstrip": false,
1127
+ "normalized": false,
1128
+ "rstrip": false,
1129
+ "single_word": false,
1130
+ "special": true
1131
+ },
1132
+ "128140": {
1133
+ "content": "<|_placeholder_140|>",
1134
+ "lstrip": false,
1135
+ "normalized": false,
1136
+ "rstrip": false,
1137
+ "single_word": false,
1138
+ "special": true
1139
+ },
1140
+ "128141": {
1141
+ "content": "<|_placeholder_141|>",
1142
+ "lstrip": false,
1143
+ "normalized": false,
1144
+ "rstrip": false,
1145
+ "single_word": false,
1146
+ "special": true
1147
+ },
1148
+ "128142": {
1149
+ "content": "<|_placeholder_142|>",
1150
+ "lstrip": false,
1151
+ "normalized": false,
1152
+ "rstrip": false,
1153
+ "single_word": false,
1154
+ "special": true
1155
+ },
1156
+ "128143": {
1157
+ "content": "<|_placeholder_143|>",
1158
+ "lstrip": false,
1159
+ "normalized": false,
1160
+ "rstrip": false,
1161
+ "single_word": false,
1162
+ "special": true
1163
+ },
1164
+ "128144": {
1165
+ "content": "<|_placeholder_144|>",
1166
+ "lstrip": false,
1167
+ "normalized": false,
1168
+ "rstrip": false,
1169
+ "single_word": false,
1170
+ "special": true
1171
+ },
1172
+ "128145": {
1173
+ "content": "<|_placeholder_145|>",
1174
+ "lstrip": false,
1175
+ "normalized": false,
1176
+ "rstrip": false,
1177
+ "single_word": false,
1178
+ "special": true
1179
+ },
1180
+ "128146": {
1181
+ "content": "<|_placeholder_146|>",
1182
+ "lstrip": false,
1183
+ "normalized": false,
1184
+ "rstrip": false,
1185
+ "single_word": false,
1186
+ "special": true
1187
+ },
1188
+ "128147": {
1189
+ "content": "<|_placeholder_147|>",
1190
+ "lstrip": false,
1191
+ "normalized": false,
1192
+ "rstrip": false,
1193
+ "single_word": false,
1194
+ "special": true
1195
+ },
1196
+ "128148": {
1197
+ "content": "<|_placeholder_148|>",
1198
+ "lstrip": false,
1199
+ "normalized": false,
1200
+ "rstrip": false,
1201
+ "single_word": false,
1202
+ "special": true
1203
+ },
1204
+ "128149": {
1205
+ "content": "<|_placeholder_149|>",
1206
+ "lstrip": false,
1207
+ "normalized": false,
1208
+ "rstrip": false,
1209
+ "single_word": false,
1210
+ "special": true
1211
+ },
1212
+ "128150": {
1213
+ "content": "<|_placeholder_150|>",
1214
+ "lstrip": false,
1215
+ "normalized": false,
1216
+ "rstrip": false,
1217
+ "single_word": false,
1218
+ "special": true
1219
+ },
1220
+ "128151": {
1221
+ "content": "<|_placeholder_151|>",
1222
+ "lstrip": false,
1223
+ "normalized": false,
1224
+ "rstrip": false,
1225
+ "single_word": false,
1226
+ "special": true
1227
+ },
1228
+ "128152": {
1229
+ "content": "<|_placeholder_152|>",
1230
+ "lstrip": false,
1231
+ "normalized": false,
1232
+ "rstrip": false,
1233
+ "single_word": false,
1234
+ "special": true
1235
+ },
1236
+ "128153": {
1237
+ "content": "<|_placeholder_153|>",
1238
+ "lstrip": false,
1239
+ "normalized": false,
1240
+ "rstrip": false,
1241
+ "single_word": false,
1242
+ "special": true
1243
+ },
1244
+ "128154": {
1245
+ "content": "<|_placeholder_154|>",
1246
+ "lstrip": false,
1247
+ "normalized": false,
1248
+ "rstrip": false,
1249
+ "single_word": false,
1250
+ "special": true
1251
+ },
1252
+ "128155": {
1253
+ "content": "<|_placeholder_155|>",
1254
+ "lstrip": false,
1255
+ "normalized": false,
1256
+ "rstrip": false,
1257
+ "single_word": false,
1258
+ "special": true
1259
+ },
1260
+ "128156": {
1261
+ "content": "<|_placeholder_156|>",
1262
+ "lstrip": false,
1263
+ "normalized": false,
1264
+ "rstrip": false,
1265
+ "single_word": false,
1266
+ "special": true
1267
+ },
1268
+ "128157": {
1269
+ "content": "<|_placeholder_157|>",
1270
+ "lstrip": false,
1271
+ "normalized": false,
1272
+ "rstrip": false,
1273
+ "single_word": false,
1274
+ "special": true
1275
+ },
1276
+ "128158": {
1277
+ "content": "<|_placeholder_158|>",
1278
+ "lstrip": false,
1279
+ "normalized": false,
1280
+ "rstrip": false,
1281
+ "single_word": false,
1282
+ "special": true
1283
+ },
1284
+ "128159": {
1285
+ "content": "<|_placeholder_159|>",
1286
+ "lstrip": false,
1287
+ "normalized": false,
1288
+ "rstrip": false,
1289
+ "single_word": false,
1290
+ "special": true
1291
+ },
1292
+ "128160": {
1293
+ "content": "<|_placeholder_160|>",
1294
+ "lstrip": false,
1295
+ "normalized": false,
1296
+ "rstrip": false,
1297
+ "single_word": false,
1298
+ "special": true
1299
+ },
1300
+ "128161": {
1301
+ "content": "<|_placeholder_161|>",
1302
+ "lstrip": false,
1303
+ "normalized": false,
1304
+ "rstrip": false,
1305
+ "single_word": false,
1306
+ "special": true
1307
+ },
1308
+ "128162": {
1309
+ "content": "<|_placeholder_162|>",
1310
+ "lstrip": false,
1311
+ "normalized": false,
1312
+ "rstrip": false,
1313
+ "single_word": false,
1314
+ "special": true
1315
+ },
1316
+ "128163": {
1317
+ "content": "<|_placeholder_163|>",
1318
+ "lstrip": false,
1319
+ "normalized": false,
1320
+ "rstrip": false,
1321
+ "single_word": false,
1322
+ "special": true
1323
+ },
1324
+ "128164": {
1325
+ "content": "<|_placeholder_164|>",
1326
+ "lstrip": false,
1327
+ "normalized": false,
1328
+ "rstrip": false,
1329
+ "single_word": false,
1330
+ "special": true
1331
+ },
1332
+ "128165": {
1333
+ "content": "<|_placeholder_165|>",
1334
+ "lstrip": false,
1335
+ "normalized": false,
1336
+ "rstrip": false,
1337
+ "single_word": false,
1338
+ "special": true
1339
+ },
1340
+ "128166": {
1341
+ "content": "<|_placeholder_166|>",
1342
+ "lstrip": false,
1343
+ "normalized": false,
1344
+ "rstrip": false,
1345
+ "single_word": false,
1346
+ "special": true
1347
+ },
1348
+ "128167": {
1349
+ "content": "<|_placeholder_167|>",
1350
+ "lstrip": false,
1351
+ "normalized": false,
1352
+ "rstrip": false,
1353
+ "single_word": false,
1354
+ "special": true
1355
+ },
1356
+ "128168": {
1357
+ "content": "<|_placeholder_168|>",
1358
+ "lstrip": false,
1359
+ "normalized": false,
1360
+ "rstrip": false,
1361
+ "single_word": false,
1362
+ "special": true
1363
+ },
1364
+ "128169": {
1365
+ "content": "<|_placeholder_169|>",
1366
+ "lstrip": false,
1367
+ "normalized": false,
1368
+ "rstrip": false,
1369
+ "single_word": false,
1370
+ "special": true
1371
+ },
1372
+ "128170": {
1373
+ "content": "<|_placeholder_170|>",
1374
+ "lstrip": false,
1375
+ "normalized": false,
1376
+ "rstrip": false,
1377
+ "single_word": false,
1378
+ "special": true
1379
+ },
1380
+ "128171": {
1381
+ "content": "<|_placeholder_171|>",
1382
+ "lstrip": false,
1383
+ "normalized": false,
1384
+ "rstrip": false,
1385
+ "single_word": false,
1386
+ "special": true
1387
+ },
1388
+ "128172": {
1389
+ "content": "<|_placeholder_172|>",
1390
+ "lstrip": false,
1391
+ "normalized": false,
1392
+ "rstrip": false,
1393
+ "single_word": false,
1394
+ "special": true
1395
+ },
1396
+ "128173": {
1397
+ "content": "<|_placeholder_173|>",
1398
+ "lstrip": false,
1399
+ "normalized": false,
1400
+ "rstrip": false,
1401
+ "single_word": false,
1402
+ "special": true
1403
+ },
1404
+ "128174": {
1405
+ "content": "<|_placeholder_174|>",
1406
+ "lstrip": false,
1407
+ "normalized": false,
1408
+ "rstrip": false,
1409
+ "single_word": false,
1410
+ "special": true
1411
+ },
1412
+ "128175": {
1413
+ "content": "<|_placeholder_175|>",
1414
+ "lstrip": false,
1415
+ "normalized": false,
1416
+ "rstrip": false,
1417
+ "single_word": false,
1418
+ "special": true
1419
+ },
1420
+ "128176": {
1421
+ "content": "<|_placeholder_176|>",
1422
+ "lstrip": false,
1423
+ "normalized": false,
1424
+ "rstrip": false,
1425
+ "single_word": false,
1426
+ "special": true
1427
+ },
1428
+ "128177": {
1429
+ "content": "<|_placeholder_177|>",
1430
+ "lstrip": false,
1431
+ "normalized": false,
1432
+ "rstrip": false,
1433
+ "single_word": false,
1434
+ "special": true
1435
+ },
1436
+ "128178": {
1437
+ "content": "<|_placeholder_178|>",
1438
+ "lstrip": false,
1439
+ "normalized": false,
1440
+ "rstrip": false,
1441
+ "single_word": false,
1442
+ "special": true
1443
+ },
1444
+ "128179": {
1445
+ "content": "<|_placeholder_179|>",
1446
+ "lstrip": false,
1447
+ "normalized": false,
1448
+ "rstrip": false,
1449
+ "single_word": false,
1450
+ "special": true
1451
+ },
1452
+ "128180": {
1453
+ "content": "<|_placeholder_180|>",
1454
+ "lstrip": false,
1455
+ "normalized": false,
1456
+ "rstrip": false,
1457
+ "single_word": false,
1458
+ "special": true
1459
+ },
1460
+ "128181": {
1461
+ "content": "<|_placeholder_181|>",
1462
+ "lstrip": false,
1463
+ "normalized": false,
1464
+ "rstrip": false,
1465
+ "single_word": false,
1466
+ "special": true
1467
+ },
1468
+ "128182": {
1469
+ "content": "<|_placeholder_182|>",
1470
+ "lstrip": false,
1471
+ "normalized": false,
1472
+ "rstrip": false,
1473
+ "single_word": false,
1474
+ "special": true
1475
+ },
1476
+ "128183": {
1477
+ "content": "<|_placeholder_183|>",
1478
+ "lstrip": false,
1479
+ "normalized": false,
1480
+ "rstrip": false,
1481
+ "single_word": false,
1482
+ "special": true
1483
+ },
1484
+ "128184": {
1485
+ "content": "<|_placeholder_184|>",
1486
+ "lstrip": false,
1487
+ "normalized": false,
1488
+ "rstrip": false,
1489
+ "single_word": false,
1490
+ "special": true
1491
+ },
1492
+ "128185": {
1493
+ "content": "<|_placeholder_185|>",
1494
+ "lstrip": false,
1495
+ "normalized": false,
1496
+ "rstrip": false,
1497
+ "single_word": false,
1498
+ "special": true
1499
+ },
1500
+ "128186": {
1501
+ "content": "<|_placeholder_186|>",
1502
+ "lstrip": false,
1503
+ "normalized": false,
1504
+ "rstrip": false,
1505
+ "single_word": false,
1506
+ "special": true
1507
+ },
1508
+ "128187": {
1509
+ "content": "<|_placeholder_187|>",
1510
+ "lstrip": false,
1511
+ "normalized": false,
1512
+ "rstrip": false,
1513
+ "single_word": false,
1514
+ "special": true
1515
+ },
1516
+ "128188": {
1517
+ "content": "<|_placeholder_188|>",
1518
+ "lstrip": false,
1519
+ "normalized": false,
1520
+ "rstrip": false,
1521
+ "single_word": false,
1522
+ "special": true
1523
+ },
1524
+ "128189": {
1525
+ "content": "<|_placeholder_189|>",
1526
+ "lstrip": false,
1527
+ "normalized": false,
1528
+ "rstrip": false,
1529
+ "single_word": false,
1530
+ "special": true
1531
+ },
1532
+ "128190": {
1533
+ "content": "<|_placeholder_190|>",
1534
+ "lstrip": false,
1535
+ "normalized": false,
1536
+ "rstrip": false,
1537
+ "single_word": false,
1538
+ "special": true
1539
+ },
1540
+ "128191": {
1541
+ "content": "<|_placeholder_191|>",
1542
+ "lstrip": false,
1543
+ "normalized": false,
1544
+ "rstrip": false,
1545
+ "single_word": false,
1546
+ "special": true
1547
+ },
1548
+ "128192": {
1549
+ "content": "<|_placeholder_192|>",
1550
+ "lstrip": false,
1551
+ "normalized": false,
1552
+ "rstrip": false,
1553
+ "single_word": false,
1554
+ "special": true
1555
+ },
1556
+ "128193": {
1557
+ "content": "<|_placeholder_193|>",
1558
+ "lstrip": false,
1559
+ "normalized": false,
1560
+ "rstrip": false,
1561
+ "single_word": false,
1562
+ "special": true
1563
+ },
1564
+ "128194": {
1565
+ "content": "<|_placeholder_194|>",
1566
+ "lstrip": false,
1567
+ "normalized": false,
1568
+ "rstrip": false,
1569
+ "single_word": false,
1570
+ "special": true
1571
+ },
1572
+ "128195": {
1573
+ "content": "<|_placeholder_195|>",
1574
+ "lstrip": false,
1575
+ "normalized": false,
1576
+ "rstrip": false,
1577
+ "single_word": false,
1578
+ "special": true
1579
+ },
1580
+ "128196": {
1581
+ "content": "<|_placeholder_196|>",
1582
+ "lstrip": false,
1583
+ "normalized": false,
1584
+ "rstrip": false,
1585
+ "single_word": false,
1586
+ "special": true
1587
+ },
1588
+ "128197": {
1589
+ "content": "<|_placeholder_197|>",
1590
+ "lstrip": false,
1591
+ "normalized": false,
1592
+ "rstrip": false,
1593
+ "single_word": false,
1594
+ "special": true
1595
+ },
1596
+ "128198": {
1597
+ "content": "<|_placeholder_198|>",
1598
+ "lstrip": false,
1599
+ "normalized": false,
1600
+ "rstrip": false,
1601
+ "single_word": false,
1602
+ "special": true
1603
+ },
1604
+ "128199": {
1605
+ "content": "<|_placeholder_199|>",
1606
+ "lstrip": false,
1607
+ "normalized": false,
1608
+ "rstrip": false,
1609
+ "single_word": false,
1610
+ "special": true
1611
+ },
1612
+ "128200": {
1613
+ "content": "<|_placeholder_200|>",
1614
+ "lstrip": false,
1615
+ "normalized": false,
1616
+ "rstrip": false,
1617
+ "single_word": false,
1618
+ "special": true
1619
+ },
1620
+ "128201": {
1621
+ "content": "<|_placeholder_201|>",
1622
+ "lstrip": false,
1623
+ "normalized": false,
1624
+ "rstrip": false,
1625
+ "single_word": false,
1626
+ "special": true
1627
+ },
1628
+ "128202": {
1629
+ "content": "<|_placeholder_202|>",
1630
+ "lstrip": false,
1631
+ "normalized": false,
1632
+ "rstrip": false,
1633
+ "single_word": false,
1634
+ "special": true
1635
+ },
1636
+ "128203": {
1637
+ "content": "<|_placeholder_203|>",
1638
+ "lstrip": false,
1639
+ "normalized": false,
1640
+ "rstrip": false,
1641
+ "single_word": false,
1642
+ "special": true
1643
+ },
1644
+ "128204": {
1645
+ "content": "<|_placeholder_204|>",
1646
+ "lstrip": false,
1647
+ "normalized": false,
1648
+ "rstrip": false,
1649
+ "single_word": false,
1650
+ "special": true
1651
+ },
1652
+ "128205": {
1653
+ "content": "<|_placeholder_205|>",
1654
+ "lstrip": false,
1655
+ "normalized": false,
1656
+ "rstrip": false,
1657
+ "single_word": false,
1658
+ "special": true
1659
+ },
1660
+ "128206": {
1661
+ "content": "<|_placeholder_206|>",
1662
+ "lstrip": false,
1663
+ "normalized": false,
1664
+ "rstrip": false,
1665
+ "single_word": false,
1666
+ "special": true
1667
+ },
1668
+ "128207": {
1669
+ "content": "<|_placeholder_207|>",
1670
+ "lstrip": false,
1671
+ "normalized": false,
1672
+ "rstrip": false,
1673
+ "single_word": false,
1674
+ "special": true
1675
+ },
1676
+ "128208": {
1677
+ "content": "<|_placeholder_208|>",
1678
+ "lstrip": false,
1679
+ "normalized": false,
1680
+ "rstrip": false,
1681
+ "single_word": false,
1682
+ "special": true
1683
+ },
1684
+ "128209": {
1685
+ "content": "<|_placeholder_209|>",
1686
+ "lstrip": false,
1687
+ "normalized": false,
1688
+ "rstrip": false,
1689
+ "single_word": false,
1690
+ "special": true
1691
+ },
1692
+ "128210": {
1693
+ "content": "<|_placeholder_210|>",
1694
+ "lstrip": false,
1695
+ "normalized": false,
1696
+ "rstrip": false,
1697
+ "single_word": false,
1698
+ "special": true
1699
+ },
1700
+ "128211": {
1701
+ "content": "<|_placeholder_211|>",
1702
+ "lstrip": false,
1703
+ "normalized": false,
1704
+ "rstrip": false,
1705
+ "single_word": false,
1706
+ "special": true
1707
+ },
1708
+ "128212": {
1709
+ "content": "<|_placeholder_212|>",
1710
+ "lstrip": false,
1711
+ "normalized": false,
1712
+ "rstrip": false,
1713
+ "single_word": false,
1714
+ "special": true
1715
+ },
1716
+ "128213": {
1717
+ "content": "<|_placeholder_213|>",
1718
+ "lstrip": false,
1719
+ "normalized": false,
1720
+ "rstrip": false,
1721
+ "single_word": false,
1722
+ "special": true
1723
+ },
1724
+ "128214": {
1725
+ "content": "<|_placeholder_214|>",
1726
+ "lstrip": false,
1727
+ "normalized": false,
1728
+ "rstrip": false,
1729
+ "single_word": false,
1730
+ "special": true
1731
+ },
1732
+ "128215": {
1733
+ "content": "<|_placeholder_215|>",
1734
+ "lstrip": false,
1735
+ "normalized": false,
1736
+ "rstrip": false,
1737
+ "single_word": false,
1738
+ "special": true
1739
+ },
1740
+ "128216": {
1741
+ "content": "<|_placeholder_216|>",
1742
+ "lstrip": false,
1743
+ "normalized": false,
1744
+ "rstrip": false,
1745
+ "single_word": false,
1746
+ "special": true
1747
+ },
1748
+ "128217": {
1749
+ "content": "<|_placeholder_217|>",
1750
+ "lstrip": false,
1751
+ "normalized": false,
1752
+ "rstrip": false,
1753
+ "single_word": false,
1754
+ "special": true
1755
+ },
1756
+ "128218": {
1757
+ "content": "<|_placeholder_218|>",
1758
+ "lstrip": false,
1759
+ "normalized": false,
1760
+ "rstrip": false,
1761
+ "single_word": false,
1762
+ "special": true
1763
+ },
1764
+ "128219": {
1765
+ "content": "<|_placeholder_219|>",
1766
+ "lstrip": false,
1767
+ "normalized": false,
1768
+ "rstrip": false,
1769
+ "single_word": false,
1770
+ "special": true
1771
+ },
1772
+ "128220": {
1773
+ "content": "<|_placeholder_220|>",
1774
+ "lstrip": false,
1775
+ "normalized": false,
1776
+ "rstrip": false,
1777
+ "single_word": false,
1778
+ "special": true
1779
+ },
1780
+ "128221": {
1781
+ "content": "<|_placeholder_221|>",
1782
+ "lstrip": false,
1783
+ "normalized": false,
1784
+ "rstrip": false,
1785
+ "single_word": false,
1786
+ "special": true
1787
+ },
1788
+ "128222": {
1789
+ "content": "<|_placeholder_222|>",
1790
+ "lstrip": false,
1791
+ "normalized": false,
1792
+ "rstrip": false,
1793
+ "single_word": false,
1794
+ "special": true
1795
+ },
1796
+ "128223": {
1797
+ "content": "<|_placeholder_223|>",
1798
+ "lstrip": false,
1799
+ "normalized": false,
1800
+ "rstrip": false,
1801
+ "single_word": false,
1802
+ "special": true
1803
+ },
1804
+ "128224": {
1805
+ "content": "<|_placeholder_224|>",
1806
+ "lstrip": false,
1807
+ "normalized": false,
1808
+ "rstrip": false,
1809
+ "single_word": false,
1810
+ "special": true
1811
+ },
1812
+ "128225": {
1813
+ "content": "<|_placeholder_225|>",
1814
+ "lstrip": false,
1815
+ "normalized": false,
1816
+ "rstrip": false,
1817
+ "single_word": false,
1818
+ "special": true
1819
+ },
1820
+ "128226": {
1821
+ "content": "<|_placeholder_226|>",
1822
+ "lstrip": false,
1823
+ "normalized": false,
1824
+ "rstrip": false,
1825
+ "single_word": false,
1826
+ "special": true
1827
+ },
1828
+ "128227": {
1829
+ "content": "<|_placeholder_227|>",
1830
+ "lstrip": false,
1831
+ "normalized": false,
1832
+ "rstrip": false,
1833
+ "single_word": false,
1834
+ "special": true
1835
+ },
1836
+ "128228": {
1837
+ "content": "<|_placeholder_228|>",
1838
+ "lstrip": false,
1839
+ "normalized": false,
1840
+ "rstrip": false,
1841
+ "single_word": false,
1842
+ "special": true
1843
+ },
1844
+ "128229": {
1845
+ "content": "<|_placeholder_229|>",
1846
+ "lstrip": false,
1847
+ "normalized": false,
1848
+ "rstrip": false,
1849
+ "single_word": false,
1850
+ "special": true
1851
+ },
1852
+ "128230": {
1853
+ "content": "<|_placeholder_230|>",
1854
+ "lstrip": false,
1855
+ "normalized": false,
1856
+ "rstrip": false,
1857
+ "single_word": false,
1858
+ "special": true
1859
+ },
1860
+ "128231": {
1861
+ "content": "<|_placeholder_231|>",
1862
+ "lstrip": false,
1863
+ "normalized": false,
1864
+ "rstrip": false,
1865
+ "single_word": false,
1866
+ "special": true
1867
+ },
1868
+ "128232": {
1869
+ "content": "<|_placeholder_232|>",
1870
+ "lstrip": false,
1871
+ "normalized": false,
1872
+ "rstrip": false,
1873
+ "single_word": false,
1874
+ "special": true
1875
+ },
1876
+ "128233": {
1877
+ "content": "<|_placeholder_233|>",
1878
+ "lstrip": false,
1879
+ "normalized": false,
1880
+ "rstrip": false,
1881
+ "single_word": false,
1882
+ "special": true
1883
+ },
1884
+ "128234": {
1885
+ "content": "<|_placeholder_234|>",
1886
+ "lstrip": false,
1887
+ "normalized": false,
1888
+ "rstrip": false,
1889
+ "single_word": false,
1890
+ "special": true
1891
+ },
1892
+ "128235": {
1893
+ "content": "<|_placeholder_235|>",
1894
+ "lstrip": false,
1895
+ "normalized": false,
1896
+ "rstrip": false,
1897
+ "single_word": false,
1898
+ "special": true
1899
+ },
1900
+ "128236": {
1901
+ "content": "<|_placeholder_236|>",
1902
+ "lstrip": false,
1903
+ "normalized": false,
1904
+ "rstrip": false,
1905
+ "single_word": false,
1906
+ "special": true
1907
+ },
1908
+ "128237": {
1909
+ "content": "<|_placeholder_237|>",
1910
+ "lstrip": false,
1911
+ "normalized": false,
1912
+ "rstrip": false,
1913
+ "single_word": false,
1914
+ "special": true
1915
+ },
1916
+ "128238": {
1917
+ "content": "<|_placeholder_238|>",
1918
+ "lstrip": false,
1919
+ "normalized": false,
1920
+ "rstrip": false,
1921
+ "single_word": false,
1922
+ "special": true
1923
+ },
1924
+ "128239": {
1925
+ "content": "<|_placeholder_239|>",
1926
+ "lstrip": false,
1927
+ "normalized": false,
1928
+ "rstrip": false,
1929
+ "single_word": false,
1930
+ "special": true
1931
+ },
1932
+ "128240": {
1933
+ "content": "<|_placeholder_240|>",
1934
+ "lstrip": false,
1935
+ "normalized": false,
1936
+ "rstrip": false,
1937
+ "single_word": false,
1938
+ "special": true
1939
+ },
1940
+ "128241": {
1941
+ "content": "<|_placeholder_241|>",
1942
+ "lstrip": false,
1943
+ "normalized": false,
1944
+ "rstrip": false,
1945
+ "single_word": false,
1946
+ "special": true
1947
+ },
1948
+ "128242": {
1949
+ "content": "<|_placeholder_242|>",
1950
+ "lstrip": false,
1951
+ "normalized": false,
1952
+ "rstrip": false,
1953
+ "single_word": false,
1954
+ "special": true
1955
+ },
1956
+ "128243": {
1957
+ "content": "<|_placeholder_243|>",
1958
+ "lstrip": false,
1959
+ "normalized": false,
1960
+ "rstrip": false,
1961
+ "single_word": false,
1962
+ "special": true
1963
+ },
1964
+ "128244": {
1965
+ "content": "<|_placeholder_244|>",
1966
+ "lstrip": false,
1967
+ "normalized": false,
1968
+ "rstrip": false,
1969
+ "single_word": false,
1970
+ "special": true
1971
+ },
1972
+ "128245": {
1973
+ "content": "<|_placeholder_245|>",
1974
+ "lstrip": false,
1975
+ "normalized": false,
1976
+ "rstrip": false,
1977
+ "single_word": false,
1978
+ "special": true
1979
+ },
1980
+ "128246": {
1981
+ "content": "<|_placeholder_246|>",
1982
+ "lstrip": false,
1983
+ "normalized": false,
1984
+ "rstrip": false,
1985
+ "single_word": false,
1986
+ "special": true
1987
+ },
1988
+ "128247": {
1989
+ "content": "<|_placeholder_247|>",
1990
+ "lstrip": false,
1991
+ "normalized": false,
1992
+ "rstrip": false,
1993
+ "single_word": false,
1994
+ "special": true
1995
+ },
1996
+ "128248": {
1997
+ "content": "<|_placeholder_248|>",
1998
+ "lstrip": false,
1999
+ "normalized": false,
2000
+ "rstrip": false,
2001
+ "single_word": false,
2002
+ "special": true
2003
+ },
2004
+ "128249": {
2005
+ "content": "<|_placeholder_249|>",
2006
+ "lstrip": false,
2007
+ "normalized": false,
2008
+ "rstrip": false,
2009
+ "single_word": false,
2010
+ "special": true
2011
+ },
2012
+ "128250": {
2013
+ "content": "<|_placeholder_250|>",
2014
+ "lstrip": false,
2015
+ "normalized": false,
2016
+ "rstrip": false,
2017
+ "single_word": false,
2018
+ "special": true
2019
+ },
2020
+ "128251": {
2021
+ "content": "<|_placeholder_251|>",
2022
+ "lstrip": false,
2023
+ "normalized": false,
2024
+ "rstrip": false,
2025
+ "single_word": false,
2026
+ "special": true
2027
+ },
2028
+ "128252": {
2029
+ "content": "<|_placeholder_252|>",
2030
+ "lstrip": false,
2031
+ "normalized": false,
2032
+ "rstrip": false,
2033
+ "single_word": false,
2034
+ "special": true
2035
+ },
2036
+ "128253": {
2037
+ "content": "<|_placeholder_253|>",
2038
+ "lstrip": false,
2039
+ "normalized": false,
2040
+ "rstrip": false,
2041
+ "single_word": false,
2042
+ "special": true
2043
+ },
2044
+ "128254": {
2045
+ "content": "<|_placeholder_254|>",
2046
+ "lstrip": false,
2047
+ "normalized": false,
2048
+ "rstrip": false,
2049
+ "single_word": false,
2050
+ "special": true
2051
+ },
2052
+ "128255": {
2053
+ "content": "<|_placeholder_255|>",
2054
+ "lstrip": false,
2055
+ "normalized": false,
2056
+ "rstrip": false,
2057
+ "single_word": false,
2058
+ "special": true
2059
+ }
2060
+ },
2061
+ "auto_map": {
2062
+ "AutoProcessor": "processing_vlm.HCXVisionV2Processor"
2063
+ },
2064
+ "bos_token": "<|endoftext|>",
2065
+ "clean_up_tokenization_spaces": true,
2066
+ "eos_token": "<|im_end|>",
2067
+ "extra_special_tokens": {
2068
+ "image_token": "<|IMAGE_PAD|>",
2069
+ "video_token": "<|VIDEO_PAD|>"
2070
+ },
2071
+ "image_token": "<|IMAGE_PAD|>",
2072
+ "model_max_length": 1000000000000000019884624838656,
2073
+ "pad_token": "<|endoftext|>",
2074
+ "processor_class": "HCXVisionV2Processor",
2075
+ "sep_token": "<|endoftext|>",
2076
+ "tokenizer_class": "GPT2Tokenizer",
2077
+ "unk_token": "<|endoftext|>",
2078
+ "video_token": "<|VIDEO_PAD|>"
2079
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,89 @@
+ {
+   "_valid_kwargs_names": [
+     "do_convert_rgb",
+     "do_resize",
+     "size",
+     "size_divisor",
+     "default_to_square",
+     "resample",
+     "do_rescale",
+     "rescale_factor",
+     "do_normalize",
+     "image_mean",
+     "image_std",
+     "do_pad",
+     "do_center_crop",
+     "crop_size",
+     "data_format",
+     "input_data_format",
+     "device",
+     "min_pixels",
+     "max_pixels",
+     "patch_size",
+     "temporal_patch_size",
+     "merge_size"
+   ],
+   "auto_map": {
+     "AutoProcessor": "processing_vlm.HCXVisionV2Processor"
+   },
+   "crop_size": null,
+   "data_format": "channels_first",
+   "default_to_square": true,
+   "device": null,
+   "do_center_crop": null,
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_pad": null,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "Qwen2VLImageProcessor",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "input_data_format": null,
+   "max_pixels": 12845056,
+   "merge_size": 2,
+   "min_pixels": 3136,
+   "model_valid_processing_keys": [
+     "do_convert_rgb",
+     "do_resize",
+     "size",
+     "size_divisor",
+     "default_to_square",
+     "resample",
+     "do_rescale",
+     "rescale_factor",
+     "do_normalize",
+     "image_mean",
+     "image_std",
+     "do_pad",
+     "do_center_crop",
+     "crop_size",
+     "data_format",
+     "input_data_format",
+     "device",
+     "min_pixels",
+     "max_pixels",
+     "patch_size",
+     "temporal_patch_size",
+     "merge_size"
+   ],
+   "patch_size": 14,
+   "processor_class": "HCXVisionV2Processor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "longest_edge": 12845056,
+     "shortest_edge": 3136
+   },
+   "size_divisor": null,
+   "temporal_patch_size": 2,
+   "video_processor_type": "Qwen2VLVideoProcessor"
+ }
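video_preprocessor_config.json mirrors the Qwen2VL-style pipeline (Qwen2VLImageProcessor / Qwen2VLVideoProcessor): frames are resized so their pixel area stays between min_pixels (3136 = 56x56) and max_pixels (12845056 = 3584x3584), cut into 14x14 patches, and 2x2 groups of patches are merged into one visual token. A rough, back-of-the-envelope sketch of the resulting per-frame token budget (an approximation of the usual Qwen2VL accounting, not code from this repository; the real resize also snaps dimensions to multiples of patch_size * merge_size):

def approx_visual_tokens(height, width, patch_size=14, merge_size=2,
                         min_pixels=3136, max_pixels=12845056):
    """Approximate visual-token count for one frame under the config above."""
    area = height * width
    # scale the frame so its area lands inside [min_pixels, max_pixels]
    scale = 1.0
    if area > max_pixels:
        scale = (max_pixels / area) ** 0.5
    elif area < min_pixels:
        scale = (min_pixels / area) ** 0.5
    h, w = height * scale, width * scale
    grid_h, grid_w = round(h / patch_size), round(w / patch_size)
    return (grid_h * grid_w) // (merge_size ** 2)

print(approx_visual_tokens(560, 840))  # 40x60 patches -> ~600 tokens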
vocab.json ADDED
The diff for this file is too large to render. See raw diff