---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
pipeline_tag: feature-extraction
library_name: transformers
tags:
- code
---
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;">
        <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;">
        <span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span>
    </a>
</div>



# A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings)

**C2LLMs (Code Contrastive Large Language Models)** are powerful new models for generating code embeddings, designed to capture the deep semantics of source code. 

#### Key Features

- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**, which lets the model dynamically focus on the most critical parts of the code and produces a more informative, robust embedding (see the sketch after this list).
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search. It supports Text2Code, Code2Code, and Code2Text tasks.
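
For intuition, here is a minimal sketch of PMA-style pooling, where a single learnable seed query cross-attends over the token hidden states to produce one embedding. The class name, head count, and masking convention below are illustrative assumptions, not the exact C2LLM module:

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Illustrative PMA-style pooling: a learnable query attends over tokens."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        # One learnable "seed" query vector summarizes the whole sequence.
        self.seed = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        query = self.seed.expand(hidden_states.size(0), -1, -1)
        # key_padding_mask is True where a position should be ignored.
        pooled, _ = self.attn(query, hidden_states, hidden_states,
                              key_padding_mask=(attention_mask == 0))
        return pooled.squeeze(1)  # (batch, hidden): one embedding per input
```

Unlike mean pooling, the pooling weights here are learned, so the embedding can emphasize semantically salient tokens (e.g., identifiers and API calls) over boilerplate.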

C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main). 

# Model Details

- **Base model**: [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
- **Pipeline**: feature-extraction (code embeddings)
- **Languages**: Chinese, English
- **License**: Apache-2.0

# How to use

## Usage (**HuggingFace Transformers**)

```python
from transformers import AutoModel
import torch

model_path = "codefuse-ai/C2LLM-7B"

# Load the model (trust_remote_code is required for the custom encode() method)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Prepare your custom instruction (replace with the instruction for your task)
instruction = "xxxxx"
# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''	
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Prepend the instruction to each input
sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
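
Once you have embeddings, retrieval reduces to a nearest-neighbor search. The snippet below is a minimal sketch of ranking the two documents above against a query by cosine similarity; it assumes `model.encode` returns one vector per input, and the query string is an arbitrary example:

```python
import torch
import torch.nn.functional as F

# Encode a query with the same instruction prefix as the documents.
query = instruction + "compare two derived keys in constant time"
query_emb = torch.as_tensor(model.encode([query]))
doc_embs = torch.as_tensor(model.encode(sentences))

# Rank documents by cosine similarity to the query.
scores = F.cosine_similarity(query_emb, doc_embs, dim=-1)
best = scores.argmax().item()
print(f"Best match: document {best} (score {scores[best].item():.3f})")
```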

## Usage (**Sentence-Transformers**)

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-7B", trust_remote_code=True, tokenizer_kwargs={"padding_side": "left"})

# Prepare your custom instruction (replace with the instruction for your task)
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''	
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Prepend the instruction to each input
sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
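
With the Sentence-Transformers API you can also score pairs directly. A short sketch, assuming Sentence-Transformers v3+, where `SentenceTransformer.similarity` defaults to cosine similarity:

```python
# Pairwise similarity between the two code snippets encoded above.
similarities = model.similarity(embeddings, embeddings)
print(similarities)  # 2x2 matrix; diagonal entries are ~1.0
```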

## Evaluation (**MTEB**)

```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-7B"

# Load the model (falls back to SentenceTransformer(model_name) if the model
# is not registered in MTEB)
model = mteb.get_model(model_name)

# Select the code-retrieval tasks
tasks = mteb.get_tasks(tasks=[
    "AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval",
    "CodeSearchNetRetrieval", "CodeFeedbackMT", "CodeFeedbackST",
    "CodeTransOceanContest", "CodeTransOceanDL", "COIRCodeSearchNetRetrieval",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
])

# Cache the results
cache = ResultCache("./c2llm_results")

# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```
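
To inspect the outcome, you can iterate over the returned results. A short sketch, assuming the MTEB v2 API where each task result exposes `task_name` and `get_score()` (adjust to your installed version):

```python
# Print the main score per task (attribute names assume the mteb v2 API).
for res in results:
    print(res.task_name, res.get_score())
```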

## Support Us

If you find this project helpful, please give it a star. It means a lot to us!

[![GitHub stars](https://img.shields.io/github/stars/codefuse-ai/CodeFuse-Embeddings?style=social)](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main)

## Citation

```bibtex
@article{2025C2LLM,
  title      = {C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
  author     = {Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2512.21332},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2512.21332},
  doi        = {10.48550/ARXIV.2512.21332},
  eprinttype = {arXiv},
  eprint     = {2512.21332}
}
```