asnassar commited on
Commit
22024b7
·
1 Parent(s): f3e9d3a

Pre release update.

Browse files
README.md CHANGED
@@ -17,26 +17,63 @@ Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two k
17
  Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:
18
 
19
  - 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
20
- - 🧮 Enhanceed Inline Equations: Better inline math recognition
21
- - 📸 Robust OCR for Documents in the wild: Accurately extracts text from handheld scans, photos, and low-quality images
22
- - 🗝️ Key-Value Pair Extraction: Identifies structured key-value relationships (e.g., forms, receipts)
23
  - 🧘 Improved Stability: Tends to avoid infinite loops more effectively
 
 
24
  - 🌍 Japanese, Arabic and Chinese support (_experimental_)
25
 
26
 
27
  ## Evaluations
28
 
29
- | | smoldocling-256m-preview | granite-docling-258m |
30
- |------|------|-------|
31
- | | OCR | | |
32
- | | Layout | | |
33
- | | Code | | |
34
- | | Equation | | |
35
- | | Table | | |
36
- | | Loop | | |
37
- | | MMStar | | |
38
- | | OCRBench | | |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
39
 
 
40
 
41
  ## Getting started
42
 
@@ -134,7 +171,7 @@ from pathlib import Path
134
  MODEL_PATH = "ibm-granite/granite-docling-258M"
135
  IMAGE_DIR = "img/" # Place your page images here
136
  OUTPUT_DIR = "out/"
137
- PROMPT_TEXT = "Convert page to Docling."
138
 
139
  # Ensure output directory exists
140
  os.makedirs(OUTPUT_DIR, exist_ok=True)
@@ -154,9 +191,13 @@ image_names = []
154
  for img_file in sorted(os.listdir(IMAGE_DIR)):
155
  if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
156
  img_path = os.path.join(IMAGE_DIR, img_file)
157
- image = Image.open(img_path).convert("RGB")
 
158
 
159
- prompt = f"<|start_of_role|>user:<image>{PROMPT_TEXT}<end_of_utterance>\nassistant:"
 
 
 
160
  batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
161
  image_names.append(os.path.splitext(img_file)[0])
162
 
@@ -242,14 +283,14 @@ print(f"Total time: {time.time() - start_time:.2f} sec")
242
 
243
  The architecture of granite-docling-258m consists of the following components:
244
 
245
- (1) Vision encoder: siglip2-base-patch16-512 (https://huggingface.co/google/siglip2-base-patch16-512).
246
 
247
  (2) Vision-language connector: pixel shuffle projector (as in idefics3)
248
 
249
  (3) Large language model: Granite 165M.
250
 
251
- We built upon Idefics3 (https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM’s supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
252
-
253
  # Training Data:
254
 
255
  Overall, our training data is largely comprised of two key sources: (1) publicly available datasets (2) internally created synthetic data targeting specific capabilities.
 
17
  Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:
18
 
19
  - 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
20
+ - 🧩 Flexible Inference Modes: Choose between full-page inference, bbox-guided region inference
 
 
21
  - 🧘 Improved Stability: Tends to avoid infinite loops more effectively
22
+ - 🧮 Enhanceed Inline Equations: Better inline math recognition
23
+ - 🧾 Document Element QA: Answer questions about a document’s structure such as the presence and order of document elements
24
  - 🌍 Japanese, Arabic and Chinese support (_experimental_)
25
 
26
 
27
  ## Evaluations
28
 
29
+ <table>
30
+ <thead>
31
+ <tr>
32
+ <th></th>
33
+ <th><b>smoldocling-256m-preview</b></th>
34
+ <th><b>granite-docling-258m</b></th>
35
+ </tr>
36
+ </thead>
37
+ <tbody>
38
+ <tr><td colspan="3"><b>Layout</b></td></tr>
39
+ <tr><td>MAP ↑</td><td>0.21</td><td><b>0.28</b></td></tr>
40
+ <tr><td>F1 ↑</td><td>0.79</td><td><b>0.85</b></td></tr>
41
+ <tr><td>Precision ↑</td><td>0.86</td><td><b>0.87</b></td></tr>
42
+ <tr><td>Recall ↑</td><td>0.82</td><td><b>0.89</b></td></tr>
43
+
44
+ <tr><td colspan="3"><b>Full Page OCR</b></td></tr>
45
+ <tr><td>Edit-distance ↓</td><td>0.48 (0.46)</td><td><b>0.46</b> (<b>0.44</b>)</td></tr>
46
+ <tr><td>F1 ↑</td><td><b>0.80</b> (0.76)</td><td>0.75 (<b>0.78</b>)</td></tr>
47
+ <tr><td>Precision ↑</td><td><b>0.89</b> (0.85)</td><td>0.81 (0.85)</td></tr>
48
+ <tr><td>Recall ↑</td><td><b>0.79</b> (0.74)</td><td>0.73 (<b>0.77</b>)</td></tr>
49
+ <tr><td>BLEU ↑</td><td><b>0.58</b> (0.54)</td><td>0.56 (<b>0.59</b>)</td></tr>
50
+ <tr><td>Meteor ↑</td><td>0.67 (0.67)</td><td>0.67 (<b>0.70</b>)</td></tr>
51
+
52
+ <tr><td colspan="3"><b>Code Recognition</b></td></tr>
53
+ <tr><td>Edit-distance ↓</td><td>0.114</td><td><b>0.013</b></td></tr>
54
+ <tr><td>F1 ↑</td><td>0.915</td><td><b>0.988</b></td></tr>
55
+ <tr><td>Precision ↑</td><td>0.94</td><td><b>0.99</b></td></tr>
56
+ <tr><td>Recall ↑</td><td>0.909</td><td><b>0.988</b></td></tr>
57
+ <tr><td>BLEU ↑</td><td>0.875</td><td><b>0.983</b></td></tr>
58
+ <tr><td>Meteor ↑</td><td>0.889</td><td><b>0.986</b></td></tr>
59
+
60
+ <tr><td colspan="3"><b>Equation Recognition</b></td></tr>
61
+ <tr><td>Edit-distance ↓</td><td>0.119</td><td><b>0.073</b></td></tr>
62
+ <tr><td>F1 ↑</td><td>0.947</td><td><b>0.968</b></td></tr>
63
+ <tr><td>Precision ↑</td><td>0.959</td><td><b>0.968</b></td></tr>
64
+ <tr><td>Recall ↑</td><td>0.941</td><td><b>0.969</b></td></tr>
65
+ <tr><td>BLEU ↑</td><td>0.824</td><td><b>0.893</b></td></tr>
66
+ <tr><td>Meteor ↑</td><td>0.878</td><td><b>0.927</b></td></tr>
67
+
68
+ <tr><td colspan="3"><b>Table Recognition (FinTabNet 150dpi)</b></td></tr>
69
+ <tr><td>TEDS (structure) ↑</td><td>0.82</td><td><b>0.97</b></td></tr>
70
+ <tr><td>TEDS (w/content) ↑</td><td>0.76</td><td><b>0.96</b></td></tr>
71
+ <tr><td colspan="3"><b>Other Benchmarks</b></td></tr>
72
+ <tr><td>MMStar ↑</td><td>0.17</td><td><b>0.3</b></td></tr>
73
+ <tr><td>OCRBench ↑</td><td>338</td><td><b>500</b></td></tr>
74
+
75
 
76
+ </table>
77
 
78
  ## Getting started
79
 
 
171
  MODEL_PATH = "ibm-granite/granite-docling-258M"
172
  IMAGE_DIR = "img/" # Place your page images here
173
  OUTPUT_DIR = "out/"
174
+ PROMPT_TEXT = "Convert page to docling."
175
 
176
  # Ensure output directory exists
177
  os.makedirs(OUTPUT_DIR, exist_ok=True)
 
191
  for img_file in sorted(os.listdir(IMAGE_DIR)):
192
  if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
193
  img_path = os.path.join(IMAGE_DIR, img_file)
194
+ with Image.open(img_path) as im:
195
+ image = im.convert("RGB")
196
 
197
+ prompt = (
198
+ f"<|start_of_role|>user<|end_of_role|><image>{PROMPT_TEXT}<|end_of_text|>\n"
199
+ f"<|start_of_role|>assistant<|end_of_role|>"
200
+ )
201
  batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
202
  image_names.append(os.path.splitext(img_file)[0])
203
 
 
283
 
284
  The architecture of granite-docling-258m consists of the following components:
285
 
286
+ (1) Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512).
287
 
288
  (2) Vision-language connector: pixel shuffle projector (as in idefics3)
289
 
290
  (3) Large language model: Granite 165M.
291
 
292
+ We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM’s supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
293
+ The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models
294
  # Training Data:
295
 
296
  Overall, our training data is largely comprised of two key sources: (1) publicly available datasets (2) internally created synthetic data targeting specific capabilities.
added_tokens.json CHANGED
@@ -1,38 +1,3 @@
1
  {
2
- "<row_1_col_1>": 100352,
3
- "<row_1_col_2>": 100353,
4
- "<row_1_col_3>": 100354,
5
- "<row_1_col_4>": 100355,
6
- "<row_1_col_5>": 100356,
7
- "<row_1_col_6>": 100357,
8
- "<row_2_col_1>": 100358,
9
- "<row_2_col_2>": 100359,
10
- "<row_2_col_3>": 100360,
11
- "<row_2_col_4>": 100361,
12
- "<row_2_col_5>": 100362,
13
- "<row_2_col_6>": 100363,
14
- "<row_3_col_1>": 100364,
15
- "<row_3_col_2>": 100365,
16
- "<row_3_col_3>": 100366,
17
- "<row_3_col_4>": 100367,
18
- "<row_3_col_5>": 100368,
19
- "<row_3_col_6>": 100369,
20
- "<row_4_col_1>": 100370,
21
- "<row_4_col_2>": 100371,
22
- "<row_4_col_3>": 100372,
23
- "<row_4_col_4>": 100373,
24
- "<row_4_col_5>": 100374,
25
- "<row_4_col_6>": 100375,
26
- "<row_5_col_1>": 100376,
27
- "<row_5_col_2>": 100377,
28
- "<row_5_col_3>": 100378,
29
- "<row_5_col_4>": 100379,
30
- "<row_5_col_5>": 100380,
31
- "<row_5_col_6>": 100381,
32
- "<row_6_col_1>": 100382,
33
- "<row_6_col_2>": 100383,
34
- "<row_6_col_3>": 100384,
35
- "<row_6_col_4>": 100385,
36
- "<row_6_col_5>": 100386,
37
- "<row_6_col_6>": 100387
38
  }
 
1
  {
2
+ "<end_of_utterance>": 100352
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3
  }
chat_template.jinja CHANGED
@@ -1,2 +1,21 @@
1
- <|start_of_role|>{% for message in messages %}{{message['role']}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
2
- {% endfor %}{% if add_generation_prompt %}{{ 'assistant:' }}{% endif %}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {%- for message in messages -%}
2
+ {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
3
+ {%- if message['content'] is string -%}
4
+ {{- message['content'] -}}
5
+ {%- else -%}
6
+ {%- for part in message['content'] -%}
7
+ {%- if part['type'] == 'text' -%}
8
+ {{- part['text'] -}}
9
+ {%- elif part['type'] == 'image' -%}
10
+ {{- '<image>' -}}
11
+ {%- endif -%}
12
+ {%- endfor -%}
13
+ {%- endif -%}
14
+ {{- '<|end_of_text|>
15
+ ' -}}
16
+ {%- endfor -%}
17
+ {%- if add_generation_prompt -%}
18
+ {{- '<|start_of_role|>assistant' -}}
19
+ {%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
20
+ {{- '<|end_of_role|>' -}}
21
+ {%- endif -%}
config.json CHANGED
@@ -1,52 +1,24 @@
1
  {
2
- "_flash_attn_2_enabled": true,
3
  "architectures": [
4
  "Idefics3ForConditionalGeneration"
5
  ],
6
- "attention_bias": false,
7
- "attention_dropout": 0.0,
8
  "bos_token_id": 100264,
9
- "eos_token_id": 100338,
10
- "head_dim": 64,
11
- "hidden_act": "silu",
12
- "hidden_size": 576,
13
  "image_token_id": 100270,
14
- "initializer_range": 0.02,
15
- "intermediate_size": 1536,
16
- "max_position_embeddings": 8192,
17
- "mlp_bias": false,
18
  "model_type": "idefics3",
19
- "neftune_noise_alpha": 0.0,
20
- "num_attention_heads": 9,
21
- "num_hidden_layers": 30,
22
- "num_key_value_heads": 3,
23
- "pad_token_id": 128002,
24
- "perceiver_config": {
25
- "attention_dropout": 0.0,
26
- "hidden_act": "silu",
27
- "model_type": "vllama3",
28
- "num_key_value_heads": 1,
29
- "qk_layer_norms_perceiver": false,
30
- "resampler_depth": 6,
31
- "resampler_head_dim": 96,
32
- "resampler_n_heads": 16,
33
- "resampler_n_latents": 64
34
- },
35
- "pixel_shuffle_factor": 4,
36
- "pretraining_tp": 1,
37
- "qk_layer_norms": false,
38
- "rms_norm_eps": 1e-05,
39
- "rope_scaling": null,
40
- "rope_theta": 100000.0,
41
  "scale_factor": 4,
42
  "text_config": {
 
43
  "architectures": [
44
- "VLlama3ForCausalLM"
45
  ],
46
  "attention_bias": false,
47
  "attention_dropout": 0.0,
48
  "bos_token_id": 100264,
49
- "eos_token_id": 100338,
 
50
  "head_dim": 64,
51
  "hidden_act": "silu",
52
  "hidden_size": 576,
@@ -58,20 +30,18 @@
58
  "num_attention_heads": 9,
59
  "num_hidden_layers": 30,
60
  "num_key_value_heads": 3,
 
61
  "pretraining_tp": 1,
62
  "rms_norm_eps": 1e-05,
63
  "rope_scaling": null,
64
- "rope_theta": 10000.0,
65
  "tie_word_embeddings": true,
66
- "torch_dtype": "bfloat16",
67
- "use_cache": true,
68
- "vocab_size": 100480
69
  },
70
  "tie_word_embeddings": true,
71
- "torch_dtype": "bfloat16",
72
- "transformers_version": "4.53.0.dev0",
73
  "use_cache": true,
74
- "use_resampler": false,
75
  "vision_config": {
76
  "attention_dropout": 0.0,
77
  "hidden_act": "gelu_pytorch_tanh",
@@ -89,11 +59,8 @@
89
  "num_hidden_layers": 12,
90
  "patch_size": 16,
91
  "size": {
92
- "longest_edge": 2048
93
- },
94
- "tie_word_embeddings": false,
95
- "torch_dtype": "bfloat16",
96
- "use_base_siglip": true
97
  },
98
- "vocab_size": 100480
99
  }
 
1
  {
 
2
  "architectures": [
3
  "Idefics3ForConditionalGeneration"
4
  ],
 
 
5
  "bos_token_id": 100264,
6
+ "dtype": "bfloat16",
7
+ "eos_token_id": 100257,
 
 
8
  "image_token_id": 100270,
 
 
 
 
9
  "model_type": "idefics3",
10
+ "pad_token_id": 100257,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  "scale_factor": 4,
12
  "text_config": {
13
+ "_name_or_path": "models/granitev06_hf_ai4k_sft_data_v4",
14
  "architectures": [
15
+ "LlamaForCausalLM"
16
  ],
17
  "attention_bias": false,
18
  "attention_dropout": 0.0,
19
  "bos_token_id": 100264,
20
+ "dtype": "bfloat16",
21
+ "eos_token_id": 100257,
22
  "head_dim": 64,
23
  "hidden_act": "silu",
24
  "hidden_size": 576,
 
30
  "num_attention_heads": 9,
31
  "num_hidden_layers": 30,
32
  "num_key_value_heads": 3,
33
+ "pad_token_id": 100257,
34
  "pretraining_tp": 1,
35
  "rms_norm_eps": 1e-05,
36
  "rope_scaling": null,
37
+ "rope_theta": 100000.0,
38
  "tie_word_embeddings": true,
39
+ "use_cache": false,
40
+ "vocab_size": 100352
 
41
  },
42
  "tie_word_embeddings": true,
43
+ "transformers_version": "4.56.1",
 
44
  "use_cache": true,
 
45
  "vision_config": {
46
  "attention_dropout": 0.0,
47
  "hidden_act": "gelu_pytorch_tanh",
 
59
  "num_hidden_layers": 12,
60
  "patch_size": 16,
61
  "size": {
62
+ "longest_edge": 512
63
+ }
 
 
 
64
  },
65
+ "vocab_size": 100352
66
  }
generation_config.json CHANGED
@@ -1,7 +1,8 @@
1
  {
2
  "_from_model_config": true,
3
  "bos_token_id": 100264,
4
- "eos_token_id": 100338,
5
- "pad_token_id": 128002,
6
- "transformers_version": "4.53.0.dev0"
 
7
  }
 
1
  {
2
  "_from_model_config": true,
3
  "bos_token_id": 100264,
4
+ "eos_token_id": 100257,
5
+ "pad_token_id": 100257,
6
+ "transformers_version": "4.56.1",
7
+ "use_cache": false
8
  }
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:da69b13b7e145c8ece853ab911e3009dd5f4348388aa52ce9da9103d023fd78c
3
- size 515240560
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1cdad234deb1cde18ee6a586f849057f19851daf1fedce2e40aff791dbe46f61
3
+ size 515093104
special_tokens_map.json CHANGED
@@ -23,7 +23,7 @@
23
  }
24
  ],
25
  "bos_token": {
26
- "content": "<|end_of_text|>",
27
  "lstrip": false,
28
  "normalized": false,
29
  "rstrip": false,
@@ -36,13 +36,7 @@
36
  "rstrip": false,
37
  "single_word": false
38
  },
39
- "pad_token": {
40
- "content": "<|pad|>",
41
- "lstrip": false,
42
- "normalized": false,
43
- "rstrip": false,
44
- "single_word": false
45
- },
46
  "unk_token": {
47
  "content": "<|unk|>",
48
  "lstrip": false,
 
23
  }
24
  ],
25
  "bos_token": {
26
+ "content": "<|start_of_role|>",
27
  "lstrip": false,
28
  "normalized": false,
29
  "rstrip": false,
 
36
  "rstrip": false,
37
  "single_word": false
38
  },
39
+ "pad_token": "<|end_of_text|>",
 
 
 
 
 
 
40
  "unk_token": {
41
  "content": "<|unk|>",
42
  "lstrip": false,
tokenizer.json CHANGED
@@ -23,7 +23,7 @@
23
  },
24
  {
25
  "id": 100258,
26
- "content": "<|fim_prefix|>",
27
  "single_word": false,
28
  "lstrip": false,
29
  "rstrip": false,
@@ -32,7 +32,7 @@
32
  },
33
  {
34
  "id": 100259,
35
- "content": "<|fim_middle|>",
36
  "single_word": false,
37
  "lstrip": false,
38
  "rstrip": false,
@@ -41,7 +41,7 @@
41
  },
42
  {
43
  "id": 100260,
44
- "content": "<|fim_suffix|>",
45
  "single_word": false,
46
  "lstrip": false,
47
  "rstrip": false,
@@ -50,7 +50,7 @@
50
  },
51
  {
52
  "id": 100261,
53
- "content": "<|fim_pad|>",
54
  "single_word": false,
55
  "lstrip": false,
56
  "rstrip": false,
@@ -59,7 +59,7 @@
59
  },
60
  {
61
  "id": 100262,
62
- "content": "<|filename|>",
63
  "single_word": false,
64
  "lstrip": false,
65
  "rstrip": false,
@@ -68,7 +68,7 @@
68
  },
69
  {
70
  "id": 100263,
71
- "content": "<|reponame|>",
72
  "single_word": false,
73
  "lstrip": false,
74
  "rstrip": false,
@@ -95,7 +95,7 @@
95
  },
96
  {
97
  "id": 100266,
98
- "content": "<|tool_call|>",
99
  "single_word": false,
100
  "lstrip": false,
101
  "rstrip": false,
@@ -104,7 +104,7 @@
104
  },
105
  {
106
  "id": 100267,
107
- "content": "<|start_of_plugin|>",
108
  "single_word": false,
109
  "lstrip": false,
110
  "rstrip": false,
@@ -113,7 +113,7 @@
113
  },
114
  {
115
  "id": 100268,
116
- "content": "<|end_of_plugin|>",
117
  "single_word": false,
118
  "lstrip": false,
119
  "rstrip": false,
@@ -122,7 +122,7 @@
122
  },
123
  {
124
  "id": 100269,
125
- "content": "<|unk|>",
126
  "single_word": false,
127
  "lstrip": false,
128
  "rstrip": false,
@@ -554,7 +554,7 @@
554
  },
555
  {
556
  "id": 100317,
557
- "content": "<loc_",
558
  "single_word": false,
559
  "lstrip": false,
560
  "rstrip": false,
@@ -563,7 +563,7 @@
563
  },
564
  {
565
  "id": 100318,
566
- "content": "<paragraph",
567
  "single_word": false,
568
  "lstrip": false,
569
  "rstrip": false,
@@ -662,7 +662,7 @@
662
  },
663
  {
664
  "id": 100329,
665
- "content": "<text_break>",
666
  "single_word": false,
667
  "lstrip": false,
668
  "rstrip": false,
@@ -743,7 +743,7 @@
743
  },
744
  {
745
  "id": 100338,
746
- "content": "<end_of_utterance>",
747
  "single_word": false,
748
  "lstrip": false,
749
  "rstrip": false,
@@ -770,7 +770,7 @@
770
  },
771
  {
772
  "id": 100341,
773
- "content": "<|unused_72|>",
774
  "single_word": false,
775
  "lstrip": false,
776
  "rstrip": false,
@@ -779,7 +779,7 @@
779
  },
780
  {
781
  "id": 100342,
782
- "content": "<|unused_73|>",
783
  "single_word": false,
784
  "lstrip": false,
785
  "rstrip": false,
@@ -788,7 +788,7 @@
788
  },
789
  {
790
  "id": 100343,
791
- "content": "<|unused_74|>",
792
  "single_word": false,
793
  "lstrip": false,
794
  "rstrip": false,
@@ -797,7 +797,7 @@
797
  },
798
  {
799
  "id": 100344,
800
- "content": "<|unused_75|>",
801
  "single_word": false,
802
  "lstrip": false,
803
  "rstrip": false,
@@ -806,7 +806,7 @@
806
  },
807
  {
808
  "id": 100345,
809
- "content": "<|unused_76|>",
810
  "single_word": false,
811
  "lstrip": false,
812
  "rstrip": false,
@@ -815,7 +815,7 @@
815
  },
816
  {
817
  "id": 100346,
818
- "content": "<|unused_77|>",
819
  "single_word": false,
820
  "lstrip": false,
821
  "rstrip": false,
@@ -824,7 +824,7 @@
824
  },
825
  {
826
  "id": 100347,
827
- "content": "<|unused_78|>",
828
  "single_word": false,
829
  "lstrip": false,
830
  "rstrip": false,
@@ -833,7 +833,7 @@
833
  },
834
  {
835
  "id": 100348,
836
- "content": "<|unused_79|>",
837
  "single_word": false,
838
  "lstrip": false,
839
  "rstrip": false,
@@ -842,7 +842,7 @@
842
  },
843
  {
844
  "id": 100349,
845
- "content": "<|unused_80|>",
846
  "single_word": false,
847
  "lstrip": false,
848
  "rstrip": false,
@@ -851,7 +851,7 @@
851
  },
852
  {
853
  "id": 100350,
854
- "content": "<|unused_81|>",
855
  "single_word": false,
856
  "lstrip": false,
857
  "rstrip": false,
@@ -860,7 +860,7 @@
860
  },
861
  {
862
  "id": 100351,
863
- "content": "<|unused_82|>",
864
  "single_word": false,
865
  "lstrip": false,
866
  "rstrip": false,
@@ -869,322 +869,7 @@
869
  },
870
  {
871
  "id": 100352,
872
- "content": "<row_1_col_1>",
873
- "single_word": false,
874
- "lstrip": false,
875
- "rstrip": false,
876
- "normalized": false,
877
- "special": true
878
- },
879
- {
880
- "id": 100353,
881
- "content": "<row_1_col_2>",
882
- "single_word": false,
883
- "lstrip": false,
884
- "rstrip": false,
885
- "normalized": false,
886
- "special": true
887
- },
888
- {
889
- "id": 100354,
890
- "content": "<row_1_col_3>",
891
- "single_word": false,
892
- "lstrip": false,
893
- "rstrip": false,
894
- "normalized": false,
895
- "special": true
896
- },
897
- {
898
- "id": 100355,
899
- "content": "<row_1_col_4>",
900
- "single_word": false,
901
- "lstrip": false,
902
- "rstrip": false,
903
- "normalized": false,
904
- "special": true
905
- },
906
- {
907
- "id": 100356,
908
- "content": "<row_1_col_5>",
909
- "single_word": false,
910
- "lstrip": false,
911
- "rstrip": false,
912
- "normalized": false,
913
- "special": true
914
- },
915
- {
916
- "id": 100357,
917
- "content": "<row_1_col_6>",
918
- "single_word": false,
919
- "lstrip": false,
920
- "rstrip": false,
921
- "normalized": false,
922
- "special": true
923
- },
924
- {
925
- "id": 100358,
926
- "content": "<row_2_col_1>",
927
- "single_word": false,
928
- "lstrip": false,
929
- "rstrip": false,
930
- "normalized": false,
931
- "special": true
932
- },
933
- {
934
- "id": 100359,
935
- "content": "<row_2_col_2>",
936
- "single_word": false,
937
- "lstrip": false,
938
- "rstrip": false,
939
- "normalized": false,
940
- "special": true
941
- },
942
- {
943
- "id": 100360,
944
- "content": "<row_2_col_3>",
945
- "single_word": false,
946
- "lstrip": false,
947
- "rstrip": false,
948
- "normalized": false,
949
- "special": true
950
- },
951
- {
952
- "id": 100361,
953
- "content": "<row_2_col_4>",
954
- "single_word": false,
955
- "lstrip": false,
956
- "rstrip": false,
957
- "normalized": false,
958
- "special": true
959
- },
960
- {
961
- "id": 100362,
962
- "content": "<row_2_col_5>",
963
- "single_word": false,
964
- "lstrip": false,
965
- "rstrip": false,
966
- "normalized": false,
967
- "special": true
968
- },
969
- {
970
- "id": 100363,
971
- "content": "<row_2_col_6>",
972
- "single_word": false,
973
- "lstrip": false,
974
- "rstrip": false,
975
- "normalized": false,
976
- "special": true
977
- },
978
- {
979
- "id": 100364,
980
- "content": "<row_3_col_1>",
981
- "single_word": false,
982
- "lstrip": false,
983
- "rstrip": false,
984
- "normalized": false,
985
- "special": true
986
- },
987
- {
988
- "id": 100365,
989
- "content": "<row_3_col_2>",
990
- "single_word": false,
991
- "lstrip": false,
992
- "rstrip": false,
993
- "normalized": false,
994
- "special": true
995
- },
996
- {
997
- "id": 100366,
998
- "content": "<row_3_col_3>",
999
- "single_word": false,
1000
- "lstrip": false,
1001
- "rstrip": false,
1002
- "normalized": false,
1003
- "special": true
1004
- },
1005
- {
1006
- "id": 100367,
1007
- "content": "<row_3_col_4>",
1008
- "single_word": false,
1009
- "lstrip": false,
1010
- "rstrip": false,
1011
- "normalized": false,
1012
- "special": true
1013
- },
1014
- {
1015
- "id": 100368,
1016
- "content": "<row_3_col_5>",
1017
- "single_word": false,
1018
- "lstrip": false,
1019
- "rstrip": false,
1020
- "normalized": false,
1021
- "special": true
1022
- },
1023
- {
1024
- "id": 100369,
1025
- "content": "<row_3_col_6>",
1026
- "single_word": false,
1027
- "lstrip": false,
1028
- "rstrip": false,
1029
- "normalized": false,
1030
- "special": true
1031
- },
1032
- {
1033
- "id": 100370,
1034
- "content": "<row_4_col_1>",
1035
- "single_word": false,
1036
- "lstrip": false,
1037
- "rstrip": false,
1038
- "normalized": false,
1039
- "special": true
1040
- },
1041
- {
1042
- "id": 100371,
1043
- "content": "<row_4_col_2>",
1044
- "single_word": false,
1045
- "lstrip": false,
1046
- "rstrip": false,
1047
- "normalized": false,
1048
- "special": true
1049
- },
1050
- {
1051
- "id": 100372,
1052
- "content": "<row_4_col_3>",
1053
- "single_word": false,
1054
- "lstrip": false,
1055
- "rstrip": false,
1056
- "normalized": false,
1057
- "special": true
1058
- },
1059
- {
1060
- "id": 100373,
1061
- "content": "<row_4_col_4>",
1062
- "single_word": false,
1063
- "lstrip": false,
1064
- "rstrip": false,
1065
- "normalized": false,
1066
- "special": true
1067
- },
1068
- {
1069
- "id": 100374,
1070
- "content": "<row_4_col_5>",
1071
- "single_word": false,
1072
- "lstrip": false,
1073
- "rstrip": false,
1074
- "normalized": false,
1075
- "special": true
1076
- },
1077
- {
1078
- "id": 100375,
1079
- "content": "<row_4_col_6>",
1080
- "single_word": false,
1081
- "lstrip": false,
1082
- "rstrip": false,
1083
- "normalized": false,
1084
- "special": true
1085
- },
1086
- {
1087
- "id": 100376,
1088
- "content": "<row_5_col_1>",
1089
- "single_word": false,
1090
- "lstrip": false,
1091
- "rstrip": false,
1092
- "normalized": false,
1093
- "special": true
1094
- },
1095
- {
1096
- "id": 100377,
1097
- "content": "<row_5_col_2>",
1098
- "single_word": false,
1099
- "lstrip": false,
1100
- "rstrip": false,
1101
- "normalized": false,
1102
- "special": true
1103
- },
1104
- {
1105
- "id": 100378,
1106
- "content": "<row_5_col_3>",
1107
- "single_word": false,
1108
- "lstrip": false,
1109
- "rstrip": false,
1110
- "normalized": false,
1111
- "special": true
1112
- },
1113
- {
1114
- "id": 100379,
1115
- "content": "<row_5_col_4>",
1116
- "single_word": false,
1117
- "lstrip": false,
1118
- "rstrip": false,
1119
- "normalized": false,
1120
- "special": true
1121
- },
1122
- {
1123
- "id": 100380,
1124
- "content": "<row_5_col_5>",
1125
- "single_word": false,
1126
- "lstrip": false,
1127
- "rstrip": false,
1128
- "normalized": false,
1129
- "special": true
1130
- },
1131
- {
1132
- "id": 100381,
1133
- "content": "<row_5_col_6>",
1134
- "single_word": false,
1135
- "lstrip": false,
1136
- "rstrip": false,
1137
- "normalized": false,
1138
- "special": true
1139
- },
1140
- {
1141
- "id": 100382,
1142
- "content": "<row_6_col_1>",
1143
- "single_word": false,
1144
- "lstrip": false,
1145
- "rstrip": false,
1146
- "normalized": false,
1147
- "special": true
1148
- },
1149
- {
1150
- "id": 100383,
1151
- "content": "<row_6_col_2>",
1152
- "single_word": false,
1153
- "lstrip": false,
1154
- "rstrip": false,
1155
- "normalized": false,
1156
- "special": true
1157
- },
1158
- {
1159
- "id": 100384,
1160
- "content": "<row_6_col_3>",
1161
- "single_word": false,
1162
- "lstrip": false,
1163
- "rstrip": false,
1164
- "normalized": false,
1165
- "special": true
1166
- },
1167
- {
1168
- "id": 100385,
1169
- "content": "<row_6_col_4>",
1170
- "single_word": false,
1171
- "lstrip": false,
1172
- "rstrip": false,
1173
- "normalized": false,
1174
- "special": true
1175
- },
1176
- {
1177
- "id": 100386,
1178
- "content": "<row_6_col_5>",
1179
- "single_word": false,
1180
- "lstrip": false,
1181
- "rstrip": false,
1182
- "normalized": false,
1183
- "special": true
1184
- },
1185
- {
1186
- "id": 100387,
1187
- "content": "<row_6_col_6>",
1188
  "single_word": false,
1189
  "lstrip": false,
1190
  "rstrip": false,
@@ -101479,18 +101164,18 @@
101479
  "ĠConveyor": 100255,
101480
  "<|pad|>": 100256,
101481
  "<|end_of_text|>": 100257,
101482
- "<|fim_prefix|>": 100258,
101483
- "<|fim_middle|>": 100259,
101484
- "<|fim_suffix|>": 100260,
101485
- "<|fim_pad|>": 100261,
101486
- "<|filename|>": 100262,
101487
- "<|reponame|>": 100263,
101488
  "<|start_of_role|>": 100264,
101489
  "<|end_of_role|>": 100265,
101490
- "<|tool_call|>": 100266,
101491
- "<|start_of_plugin|>": 100267,
101492
- "<|end_of_plugin|>": 100268,
101493
- "<|unk|>": 100269,
101494
  "<image>": 100270,
101495
  "<caption>": 100271,
101496
  "</caption>": 100272,
@@ -101538,8 +101223,8 @@
101538
  "<page_break>": 100314,
101539
  "<smiles>": 100315,
101540
  "</smiles>": 100316,
101541
- "<loc_": 100317,
101542
- "<paragraph": 100318,
101543
  "</paragraph>": 100319,
101544
  "<references>": 100320,
101545
  "</references>": 100321,
@@ -101550,7 +101235,7 @@
101550
  "<group>": 100326,
101551
  "<doctag>": 100327,
101552
  "</doctag>": 100328,
101553
- "<text_break>": 100329,
101554
  "<fcel>": 100330,
101555
  "<ecel>": 100331,
101556
  "<lcel>": 100332,
@@ -101559,20 +101244,20 @@
101559
  "<nl>": 100335,
101560
  "<ched>": 100336,
101561
  "<rhed>": 100337,
101562
- "<end_of_utterance>": 100338,
101563
  "<fake_token_around_image>": 100339,
101564
  "<global-img>": 100340,
101565
- "<|unused_72|>": 100341,
101566
- "<|unused_73|>": 100342,
101567
- "<|unused_74|>": 100343,
101568
- "<|unused_75|>": 100344,
101569
- "<|unused_76|>": 100345,
101570
- "<|unused_77|>": 100346,
101571
- "<|unused_78|>": 100347,
101572
- "<|unused_79|>": 100348,
101573
- "<|unused_80|>": 100349,
101574
- "<|unused_81|>": 100350,
101575
- "<|unused_82|>": 100351
101576
  },
101577
  "merges": [
101578
  [
 
23
  },
24
  {
25
  "id": 100258,
26
+ "content": "<row_1_col_1>",
27
  "single_word": false,
28
  "lstrip": false,
29
  "rstrip": false,
 
32
  },
33
  {
34
  "id": 100259,
35
+ "content": "<row_1_col_2>",
36
  "single_word": false,
37
  "lstrip": false,
38
  "rstrip": false,
 
41
  },
42
  {
43
  "id": 100260,
44
+ "content": "<text>",
45
  "single_word": false,
46
  "lstrip": false,
47
  "rstrip": false,
 
50
  },
51
  {
52
  "id": 100261,
53
+ "content": "<row_1_col_3>",
54
  "single_word": false,
55
  "lstrip": false,
56
  "rstrip": false,
 
59
  },
60
  {
61
  "id": 100262,
62
+ "content": "<row_1_col_4>",
63
  "single_word": false,
64
  "lstrip": false,
65
  "rstrip": false,
 
68
  },
69
  {
70
  "id": 100263,
71
+ "content": "<row_2_col_1>",
72
  "single_word": false,
73
  "lstrip": false,
74
  "rstrip": false,
 
95
  },
96
  {
97
  "id": 100266,
98
+ "content": "</title>",
99
  "single_word": false,
100
  "lstrip": false,
101
  "rstrip": false,
 
104
  },
105
  {
106
  "id": 100267,
107
+ "content": "<row_2_col_2>",
108
  "single_word": false,
109
  "lstrip": false,
110
  "rstrip": false,
 
113
  },
114
  {
115
  "id": 100268,
116
+ "content": "<row_2_col_3>",
117
  "single_word": false,
118
  "lstrip": false,
119
  "rstrip": false,
 
122
  },
123
  {
124
  "id": 100269,
125
+ "content": "<title>",
126
  "single_word": false,
127
  "lstrip": false,
128
  "rstrip": false,
 
554
  },
555
  {
556
  "id": 100317,
557
+ "content": "</text>",
558
  "single_word": false,
559
  "lstrip": false,
560
  "rstrip": false,
 
563
  },
564
  {
565
  "id": 100318,
566
+ "content": "<paragraph>",
567
  "single_word": false,
568
  "lstrip": false,
569
  "rstrip": false,
 
662
  },
663
  {
664
  "id": 100329,
665
+ "content": "<rec_",
666
  "single_word": false,
667
  "lstrip": false,
668
  "rstrip": false,
 
743
  },
744
  {
745
  "id": 100338,
746
+ "content": "<|unk|>",
747
  "single_word": false,
748
  "lstrip": false,
749
  "rstrip": false,
 
770
  },
771
  {
772
  "id": 100341,
773
+ "content": "<row_2_col_4>",
774
  "single_word": false,
775
  "lstrip": false,
776
  "rstrip": false,
 
779
  },
780
  {
781
  "id": 100342,
782
+ "content": "<row_3_col_1>",
783
  "single_word": false,
784
  "lstrip": false,
785
  "rstrip": false,
 
788
  },
789
  {
790
  "id": 100343,
791
+ "content": "<row_3_col_2>",
792
  "single_word": false,
793
  "lstrip": false,
794
  "rstrip": false,
 
797
  },
798
  {
799
  "id": 100344,
800
+ "content": "<row_3_col_3>",
801
  "single_word": false,
802
  "lstrip": false,
803
  "rstrip": false,
 
806
  },
807
  {
808
  "id": 100345,
809
+ "content": "<row_3_col_4>",
810
  "single_word": false,
811
  "lstrip": false,
812
  "rstrip": false,
 
815
  },
816
  {
817
  "id": 100346,
818
+ "content": "<row_4_col_1>",
819
  "single_word": false,
820
  "lstrip": false,
821
  "rstrip": false,
 
824
  },
825
  {
826
  "id": 100347,
827
+ "content": "<row_4_col_2>",
828
  "single_word": false,
829
  "lstrip": false,
830
  "rstrip": false,
 
833
  },
834
  {
835
  "id": 100348,
836
+ "content": "<row_4_col_3>",
837
  "single_word": false,
838
  "lstrip": false,
839
  "rstrip": false,
 
842
  },
843
  {
844
  "id": 100349,
845
+ "content": "<row_4_col_4>",
846
  "single_word": false,
847
  "lstrip": false,
848
  "rstrip": false,
 
851
  },
852
  {
853
  "id": 100350,
854
+ "content": "<code>",
855
  "single_word": false,
856
  "lstrip": false,
857
  "rstrip": false,
 
860
  },
861
  {
862
  "id": 100351,
863
+ "content": "</code>",
864
  "single_word": false,
865
  "lstrip": false,
866
  "rstrip": false,
 
869
  },
870
  {
871
  "id": 100352,
872
+ "content": "<end_of_utterance>",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
873
  "single_word": false,
874
  "lstrip": false,
875
  "rstrip": false,
 
101164
  "ĠConveyor": 100255,
101165
  "<|pad|>": 100256,
101166
  "<|end_of_text|>": 100257,
101167
+ "<row_1_col_1>": 100258,
101168
+ "<row_1_col_2>": 100259,
101169
+ "<text>": 100260,
101170
+ "<row_1_col_3>": 100261,
101171
+ "<row_1_col_4>": 100262,
101172
+ "<row_2_col_1>": 100263,
101173
  "<|start_of_role|>": 100264,
101174
  "<|end_of_role|>": 100265,
101175
+ "</title>": 100266,
101176
+ "<row_2_col_2>": 100267,
101177
+ "<row_2_col_3>": 100268,
101178
+ "<title>": 100269,
101179
  "<image>": 100270,
101180
  "<caption>": 100271,
101181
  "</caption>": 100272,
 
101223
  "<page_break>": 100314,
101224
  "<smiles>": 100315,
101225
  "</smiles>": 100316,
101226
+ "</text>": 100317,
101227
+ "<paragraph>": 100318,
101228
  "</paragraph>": 100319,
101229
  "<references>": 100320,
101230
  "</references>": 100321,
 
101235
  "<group>": 100326,
101236
  "<doctag>": 100327,
101237
  "</doctag>": 100328,
101238
+ "<rec_": 100329,
101239
  "<fcel>": 100330,
101240
  "<ecel>": 100331,
101241
  "<lcel>": 100332,
 
101244
  "<nl>": 100335,
101245
  "<ched>": 100336,
101246
  "<rhed>": 100337,
101247
+ "<|unk|>": 100338,
101248
  "<fake_token_around_image>": 100339,
101249
  "<global-img>": 100340,
101250
+ "<row_2_col_4>": 100341,
101251
+ "<row_3_col_1>": 100342,
101252
+ "<row_3_col_2>": 100343,
101253
+ "<row_3_col_3>": 100344,
101254
+ "<row_3_col_4>": 100345,
101255
+ "<row_4_col_1>": 100346,
101256
+ "<row_4_col_2>": 100347,
101257
+ "<row_4_col_3>": 100348,
101258
+ "<row_4_col_4>": 100349,
101259
+ "<code>": 100350,
101260
+ "</code>": 100351
101261
  },
101262
  "merges": [
101263
  [
tokenizer_config.json CHANGED
@@ -19,7 +19,7 @@
19
  "special": true
20
  },
21
  "100258": {
22
- "content": "<|fim_prefix|>",
23
  "lstrip": false,
24
  "normalized": false,
25
  "rstrip": false,
@@ -27,7 +27,7 @@
27
  "special": true
28
  },
29
  "100259": {
30
- "content": "<|fim_middle|>",
31
  "lstrip": false,
32
  "normalized": false,
33
  "rstrip": false,
@@ -35,7 +35,7 @@
35
  "special": true
36
  },
37
  "100260": {
38
- "content": "<|fim_suffix|>",
39
  "lstrip": false,
40
  "normalized": false,
41
  "rstrip": false,
@@ -43,7 +43,7 @@
43
  "special": true
44
  },
45
  "100261": {
46
- "content": "<|fim_pad|>",
47
  "lstrip": false,
48
  "normalized": false,
49
  "rstrip": false,
@@ -51,7 +51,7 @@
51
  "special": true
52
  },
53
  "100262": {
54
- "content": "<|filename|>",
55
  "lstrip": false,
56
  "normalized": false,
57
  "rstrip": false,
@@ -59,7 +59,7 @@
59
  "special": true
60
  },
61
  "100263": {
62
- "content": "<|reponame|>",
63
  "lstrip": false,
64
  "normalized": false,
65
  "rstrip": false,
@@ -83,7 +83,7 @@
83
  "special": true
84
  },
85
  "100266": {
86
- "content": "<|tool_call|>",
87
  "lstrip": false,
88
  "normalized": false,
89
  "rstrip": false,
@@ -91,7 +91,7 @@
91
  "special": true
92
  },
93
  "100267": {
94
- "content": "<|start_of_plugin|>",
95
  "lstrip": false,
96
  "normalized": false,
97
  "rstrip": false,
@@ -99,7 +99,7 @@
99
  "special": true
100
  },
101
  "100268": {
102
- "content": "<|end_of_plugin|>",
103
  "lstrip": false,
104
  "normalized": false,
105
  "rstrip": false,
@@ -107,7 +107,7 @@
107
  "special": true
108
  },
109
  "100269": {
110
- "content": "<|unk|>",
111
  "lstrip": false,
112
  "normalized": false,
113
  "rstrip": false,
@@ -491,7 +491,7 @@
491
  "special": true
492
  },
493
  "100317": {
494
- "content": "<loc_",
495
  "lstrip": false,
496
  "normalized": false,
497
  "rstrip": false,
@@ -499,7 +499,7 @@
499
  "special": true
500
  },
501
  "100318": {
502
- "content": "<paragraph",
503
  "lstrip": false,
504
  "normalized": false,
505
  "rstrip": false,
@@ -587,7 +587,7 @@
587
  "special": true
588
  },
589
  "100329": {
590
- "content": "<text_break>",
591
  "lstrip": false,
592
  "normalized": false,
593
  "rstrip": false,
@@ -659,7 +659,7 @@
659
  "special": true
660
  },
661
  "100338": {
662
- "content": "<end_of_utterance>",
663
  "lstrip": false,
664
  "normalized": false,
665
  "rstrip": false,
@@ -683,7 +683,7 @@
683
  "special": true
684
  },
685
  "100341": {
686
- "content": "<|unused_72|>",
687
  "lstrip": false,
688
  "normalized": false,
689
  "rstrip": false,
@@ -691,7 +691,7 @@
691
  "special": true
692
  },
693
  "100342": {
694
- "content": "<|unused_73|>",
695
  "lstrip": false,
696
  "normalized": false,
697
  "rstrip": false,
@@ -699,7 +699,7 @@
699
  "special": true
700
  },
701
  "100343": {
702
- "content": "<|unused_74|>",
703
  "lstrip": false,
704
  "normalized": false,
705
  "rstrip": false,
@@ -707,7 +707,7 @@
707
  "special": true
708
  },
709
  "100344": {
710
- "content": "<|unused_75|>",
711
  "lstrip": false,
712
  "normalized": false,
713
  "rstrip": false,
@@ -715,7 +715,7 @@
715
  "special": true
716
  },
717
  "100345": {
718
- "content": "<|unused_76|>",
719
  "lstrip": false,
720
  "normalized": false,
721
  "rstrip": false,
@@ -723,7 +723,7 @@
723
  "special": true
724
  },
725
  "100346": {
726
- "content": "<|unused_77|>",
727
  "lstrip": false,
728
  "normalized": false,
729
  "rstrip": false,
@@ -731,7 +731,7 @@
731
  "special": true
732
  },
733
  "100347": {
734
- "content": "<|unused_78|>",
735
  "lstrip": false,
736
  "normalized": false,
737
  "rstrip": false,
@@ -739,7 +739,7 @@
739
  "special": true
740
  },
741
  "100348": {
742
- "content": "<|unused_79|>",
743
  "lstrip": false,
744
  "normalized": false,
745
  "rstrip": false,
@@ -747,7 +747,7 @@
747
  "special": true
748
  },
749
  "100349": {
750
- "content": "<|unused_80|>",
751
  "lstrip": false,
752
  "normalized": false,
753
  "rstrip": false,
@@ -755,7 +755,7 @@
755
  "special": true
756
  },
757
  "100350": {
758
- "content": "<|unused_81|>",
759
  "lstrip": false,
760
  "normalized": false,
761
  "rstrip": false,
@@ -763,7 +763,7 @@
763
  "special": true
764
  },
765
  "100351": {
766
- "content": "<|unused_82|>",
767
  "lstrip": false,
768
  "normalized": false,
769
  "rstrip": false,
@@ -771,287 +771,7 @@
771
  "special": true
772
  },
773
  "100352": {
774
- "content": "<row_1_col_1>",
775
- "lstrip": false,
776
- "normalized": false,
777
- "rstrip": false,
778
- "single_word": false,
779
- "special": true
780
- },
781
- "100353": {
782
- "content": "<row_1_col_2>",
783
- "lstrip": false,
784
- "normalized": false,
785
- "rstrip": false,
786
- "single_word": false,
787
- "special": true
788
- },
789
- "100354": {
790
- "content": "<row_1_col_3>",
791
- "lstrip": false,
792
- "normalized": false,
793
- "rstrip": false,
794
- "single_word": false,
795
- "special": true
796
- },
797
- "100355": {
798
- "content": "<row_1_col_4>",
799
- "lstrip": false,
800
- "normalized": false,
801
- "rstrip": false,
802
- "single_word": false,
803
- "special": true
804
- },
805
- "100356": {
806
- "content": "<row_1_col_5>",
807
- "lstrip": false,
808
- "normalized": false,
809
- "rstrip": false,
810
- "single_word": false,
811
- "special": true
812
- },
813
- "100357": {
814
- "content": "<row_1_col_6>",
815
- "lstrip": false,
816
- "normalized": false,
817
- "rstrip": false,
818
- "single_word": false,
819
- "special": true
820
- },
821
- "100358": {
822
- "content": "<row_2_col_1>",
823
- "lstrip": false,
824
- "normalized": false,
825
- "rstrip": false,
826
- "single_word": false,
827
- "special": true
828
- },
829
- "100359": {
830
- "content": "<row_2_col_2>",
831
- "lstrip": false,
832
- "normalized": false,
833
- "rstrip": false,
834
- "single_word": false,
835
- "special": true
836
- },
837
- "100360": {
838
- "content": "<row_2_col_3>",
839
- "lstrip": false,
840
- "normalized": false,
841
- "rstrip": false,
842
- "single_word": false,
843
- "special": true
844
- },
845
- "100361": {
846
- "content": "<row_2_col_4>",
847
- "lstrip": false,
848
- "normalized": false,
849
- "rstrip": false,
850
- "single_word": false,
851
- "special": true
852
- },
853
- "100362": {
854
- "content": "<row_2_col_5>",
855
- "lstrip": false,
856
- "normalized": false,
857
- "rstrip": false,
858
- "single_word": false,
859
- "special": true
860
- },
861
- "100363": {
862
- "content": "<row_2_col_6>",
863
- "lstrip": false,
864
- "normalized": false,
865
- "rstrip": false,
866
- "single_word": false,
867
- "special": true
868
- },
869
- "100364": {
870
- "content": "<row_3_col_1>",
871
- "lstrip": false,
872
- "normalized": false,
873
- "rstrip": false,
874
- "single_word": false,
875
- "special": true
876
- },
877
- "100365": {
878
- "content": "<row_3_col_2>",
879
- "lstrip": false,
880
- "normalized": false,
881
- "rstrip": false,
882
- "single_word": false,
883
- "special": true
884
- },
885
- "100366": {
886
- "content": "<row_3_col_3>",
887
- "lstrip": false,
888
- "normalized": false,
889
- "rstrip": false,
890
- "single_word": false,
891
- "special": true
892
- },
893
- "100367": {
894
- "content": "<row_3_col_4>",
895
- "lstrip": false,
896
- "normalized": false,
897
- "rstrip": false,
898
- "single_word": false,
899
- "special": true
900
- },
901
- "100368": {
902
- "content": "<row_3_col_5>",
903
- "lstrip": false,
904
- "normalized": false,
905
- "rstrip": false,
906
- "single_word": false,
907
- "special": true
908
- },
909
- "100369": {
910
- "content": "<row_3_col_6>",
911
- "lstrip": false,
912
- "normalized": false,
913
- "rstrip": false,
914
- "single_word": false,
915
- "special": true
916
- },
917
- "100370": {
918
- "content": "<row_4_col_1>",
919
- "lstrip": false,
920
- "normalized": false,
921
- "rstrip": false,
922
- "single_word": false,
923
- "special": true
924
- },
925
- "100371": {
926
- "content": "<row_4_col_2>",
927
- "lstrip": false,
928
- "normalized": false,
929
- "rstrip": false,
930
- "single_word": false,
931
- "special": true
932
- },
933
- "100372": {
934
- "content": "<row_4_col_3>",
935
- "lstrip": false,
936
- "normalized": false,
937
- "rstrip": false,
938
- "single_word": false,
939
- "special": true
940
- },
941
- "100373": {
942
- "content": "<row_4_col_4>",
943
- "lstrip": false,
944
- "normalized": false,
945
- "rstrip": false,
946
- "single_word": false,
947
- "special": true
948
- },
949
- "100374": {
950
- "content": "<row_4_col_5>",
951
- "lstrip": false,
952
- "normalized": false,
953
- "rstrip": false,
954
- "single_word": false,
955
- "special": true
956
- },
957
- "100375": {
958
- "content": "<row_4_col_6>",
959
- "lstrip": false,
960
- "normalized": false,
961
- "rstrip": false,
962
- "single_word": false,
963
- "special": true
964
- },
965
- "100376": {
966
- "content": "<row_5_col_1>",
967
- "lstrip": false,
968
- "normalized": false,
969
- "rstrip": false,
970
- "single_word": false,
971
- "special": true
972
- },
973
- "100377": {
974
- "content": "<row_5_col_2>",
975
- "lstrip": false,
976
- "normalized": false,
977
- "rstrip": false,
978
- "single_word": false,
979
- "special": true
980
- },
981
- "100378": {
982
- "content": "<row_5_col_3>",
983
- "lstrip": false,
984
- "normalized": false,
985
- "rstrip": false,
986
- "single_word": false,
987
- "special": true
988
- },
989
- "100379": {
990
- "content": "<row_5_col_4>",
991
- "lstrip": false,
992
- "normalized": false,
993
- "rstrip": false,
994
- "single_word": false,
995
- "special": true
996
- },
997
- "100380": {
998
- "content": "<row_5_col_5>",
999
- "lstrip": false,
1000
- "normalized": false,
1001
- "rstrip": false,
1002
- "single_word": false,
1003
- "special": true
1004
- },
1005
- "100381": {
1006
- "content": "<row_5_col_6>",
1007
- "lstrip": false,
1008
- "normalized": false,
1009
- "rstrip": false,
1010
- "single_word": false,
1011
- "special": true
1012
- },
1013
- "100382": {
1014
- "content": "<row_6_col_1>",
1015
- "lstrip": false,
1016
- "normalized": false,
1017
- "rstrip": false,
1018
- "single_word": false,
1019
- "special": true
1020
- },
1021
- "100383": {
1022
- "content": "<row_6_col_2>",
1023
- "lstrip": false,
1024
- "normalized": false,
1025
- "rstrip": false,
1026
- "single_word": false,
1027
- "special": true
1028
- },
1029
- "100384": {
1030
- "content": "<row_6_col_3>",
1031
- "lstrip": false,
1032
- "normalized": false,
1033
- "rstrip": false,
1034
- "single_word": false,
1035
- "special": true
1036
- },
1037
- "100385": {
1038
- "content": "<row_6_col_4>",
1039
- "lstrip": false,
1040
- "normalized": false,
1041
- "rstrip": false,
1042
- "single_word": false,
1043
- "special": true
1044
- },
1045
- "100386": {
1046
- "content": "<row_6_col_5>",
1047
- "lstrip": false,
1048
- "normalized": false,
1049
- "rstrip": false,
1050
- "single_word": false,
1051
- "special": true
1052
- },
1053
- "100387": {
1054
- "content": "<row_6_col_6>",
1055
  "lstrip": false,
1056
  "normalized": false,
1057
  "rstrip": false,
@@ -1066,11 +786,11 @@
1066
  ],
1067
  "bos_token": "<|start_of_role|>",
1068
  "clean_up_tokenization_spaces": false,
1069
- "eos_token": "<|end_of_utterance|>",
1070
  "errors": "replace",
1071
  "extra_special_tokens": {},
1072
  "model_max_length": 8192,
1073
- "pad_token": "<|pad|>",
1074
  "padding_side": "left",
1075
  "processor_class": "Idefics3Processor",
1076
  "tokenizer_class": "GPT2Tokenizer",
 
19
  "special": true
20
  },
21
  "100258": {
22
+ "content": "<row_1_col_1>",
23
  "lstrip": false,
24
  "normalized": false,
25
  "rstrip": false,
 
27
  "special": true
28
  },
29
  "100259": {
30
+ "content": "<row_1_col_2>",
31
  "lstrip": false,
32
  "normalized": false,
33
  "rstrip": false,
 
35
  "special": true
36
  },
37
  "100260": {
38
+ "content": "<text>",
39
  "lstrip": false,
40
  "normalized": false,
41
  "rstrip": false,
 
43
  "special": true
44
  },
45
  "100261": {
46
+ "content": "<row_1_col_3>",
47
  "lstrip": false,
48
  "normalized": false,
49
  "rstrip": false,
 
51
  "special": true
52
  },
53
  "100262": {
54
+ "content": "<row_1_col_4>",
55
  "lstrip": false,
56
  "normalized": false,
57
  "rstrip": false,
 
59
  "special": true
60
  },
61
  "100263": {
62
+ "content": "<row_2_col_1>",
63
  "lstrip": false,
64
  "normalized": false,
65
  "rstrip": false,
 
83
  "special": true
84
  },
85
  "100266": {
86
+ "content": "</title>",
87
  "lstrip": false,
88
  "normalized": false,
89
  "rstrip": false,
 
91
  "special": true
92
  },
93
  "100267": {
94
+ "content": "<row_2_col_2>",
95
  "lstrip": false,
96
  "normalized": false,
97
  "rstrip": false,
 
99
  "special": true
100
  },
101
  "100268": {
102
+ "content": "<row_2_col_3>",
103
  "lstrip": false,
104
  "normalized": false,
105
  "rstrip": false,
 
107
  "special": true
108
  },
109
  "100269": {
110
+ "content": "<title>",
111
  "lstrip": false,
112
  "normalized": false,
113
  "rstrip": false,
 
491
  "special": true
492
  },
493
  "100317": {
494
+ "content": "</text>",
495
  "lstrip": false,
496
  "normalized": false,
497
  "rstrip": false,
 
499
  "special": true
500
  },
501
  "100318": {
502
+ "content": "<paragraph>",
503
  "lstrip": false,
504
  "normalized": false,
505
  "rstrip": false,
 
587
  "special": true
588
  },
589
  "100329": {
590
+ "content": "<rec_",
591
  "lstrip": false,
592
  "normalized": false,
593
  "rstrip": false,
 
659
  "special": true
660
  },
661
  "100338": {
662
+ "content": "<|unk|>",
663
  "lstrip": false,
664
  "normalized": false,
665
  "rstrip": false,
 
683
  "special": true
684
  },
685
  "100341": {
686
+ "content": "<row_2_col_4>",
687
  "lstrip": false,
688
  "normalized": false,
689
  "rstrip": false,
 
691
  "special": true
692
  },
693
  "100342": {
694
+ "content": "<row_3_col_1>",
695
  "lstrip": false,
696
  "normalized": false,
697
  "rstrip": false,
 
699
  "special": true
700
  },
701
  "100343": {
702
+ "content": "<row_3_col_2>",
703
  "lstrip": false,
704
  "normalized": false,
705
  "rstrip": false,
 
707
  "special": true
708
  },
709
  "100344": {
710
+ "content": "<row_3_col_3>",
711
  "lstrip": false,
712
  "normalized": false,
713
  "rstrip": false,
 
715
  "special": true
716
  },
717
  "100345": {
718
+ "content": "<row_3_col_4>",
719
  "lstrip": false,
720
  "normalized": false,
721
  "rstrip": false,
 
723
  "special": true
724
  },
725
  "100346": {
726
+ "content": "<row_4_col_1>",
727
  "lstrip": false,
728
  "normalized": false,
729
  "rstrip": false,
 
731
  "special": true
732
  },
733
  "100347": {
734
+ "content": "<row_4_col_2>",
735
  "lstrip": false,
736
  "normalized": false,
737
  "rstrip": false,
 
739
  "special": true
740
  },
741
  "100348": {
742
+ "content": "<row_4_col_3>",
743
  "lstrip": false,
744
  "normalized": false,
745
  "rstrip": false,
 
747
  "special": true
748
  },
749
  "100349": {
750
+ "content": "<row_4_col_4>",
751
  "lstrip": false,
752
  "normalized": false,
753
  "rstrip": false,
 
755
  "special": true
756
  },
757
  "100350": {
758
+ "content": "<code>",
759
  "lstrip": false,
760
  "normalized": false,
761
  "rstrip": false,
 
763
  "special": true
764
  },
765
  "100351": {
766
+ "content": "</code>",
767
  "lstrip": false,
768
  "normalized": false,
769
  "rstrip": false,
 
771
  "special": true
772
  },
773
  "100352": {
774
+ "content": "<end_of_utterance>",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
775
  "lstrip": false,
776
  "normalized": false,
777
  "rstrip": false,
 
786
  ],
787
  "bos_token": "<|start_of_role|>",
788
  "clean_up_tokenization_spaces": false,
789
+ "eos_token": "<|end_of_text|>",
790
  "errors": "replace",
791
  "extra_special_tokens": {},
792
  "model_max_length": 8192,
793
+ "pad_token": "<|end_of_text|>",
794
  "padding_side": "left",
795
  "processor_class": "Idefics3Processor",
796
  "tokenizer_class": "GPT2Tokenizer",
vocab.json CHANGED
The diff for this file is too large to render. See raw diff