---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
pipeline_tag: feature-extraction
library_name: transformers
tags:
- code
---
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;">
<a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;">
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;">
<span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span>
</a>
</div>
# A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings)
**C2LLMs (Code Contrastive Large Language Models)** are powerful new models for generating code embeddings, designed to capture the deep semantics of source code.
#### Key Features
- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**. This lets the model dynamically focus on the most critical parts of the code, producing a more informative and robust embedding (a minimal sketch of the mechanism follows this list).
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search across Text2Code, Code2Code, and Code2Text tasks.
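The exact PMA architecture used in C2LLM is described in the paper; purely as an illustration of the mechanism (module layout and hyperparameters below are assumptions, not the released configuration), a learnable seed query can cross-attend over the backbone's token states to produce the pooled embedding:

```python
import torch
import torch.nn as nn
from typing import Optional

class PMAPooling(nn.Module):
    """Minimal sketch of Pooling by Multi-head Attention (PMA).

    A small set of learnable "seed" queries cross-attends over the token
    embeddings produced by the backbone, so the pooled vector can weight the
    most informative tokens instead of averaging them uniformly. The sizes
    here are illustrative only.
    """

    def __init__(self, hidden_size: int, num_heads: int = 8, num_seeds: int = 1):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(num_seeds, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(
        self,
        token_embeddings: torch.Tensor,            # (batch, seq_len, hidden)
        padding_mask: Optional[torch.Tensor] = None,  # (batch, seq_len), True = padding
    ) -> torch.Tensor:
        batch = token_embeddings.size(0)
        # Broadcast the learnable seed query across the batch
        query = self.seed.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(query, token_embeddings, token_embeddings,
                              key_padding_mask=padding_mask)
        return pooled.squeeze(1)  # (batch, hidden) when num_seeds == 1
```

Because the seed query is trained jointly with the contrastive objective, tokens that matter for retrieval can receive higher attention weight than they would under uniform mean pooling.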
C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main).
# Model Details
# How to use
## Usage (**HuggingFace Transformers**)
```python
from transformers import AutoModel
import torch
model_path = "codefuse-ai/C2LLM-7B"
# Load the model
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
# Prepare your custom instruction
instruction = "xxxxx"
# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;
byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);
if (derived0.length != derived1.length) return false;
int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']
sentences = [instruction+sentence for sentence in sentences]
# Get the embeddings
embeddings = model.encode(sentences)
```
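The return type of `encode` depends on the remote code shipped with the checkpoint; assuming it yields a 2-D tensor or array of shape `(num_sentences, hidden_dim)`, the embeddings can be compared with cosine similarity, e.g. to rank candidate code snippets against a query:

```python
import torch
import torch.nn.functional as F

# Assumption: `encode` returned a 2-D array/tensor of shape (num_sentences, hidden_dim)
emb = F.normalize(torch.as_tensor(embeddings, dtype=torch.float32), p=2, dim=1)
scores = emb @ emb.T
print(scores)  # scores[i, j] is the cosine similarity between sentences i and j
```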
## Usage (**Sentence-Transformers**)
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-7B", trust_remote_code=True, tokenizer_kwargs={"padding_side":"left"})
# Prepare your custom instruction
instruction = "xxxxx"
# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;
byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);
if (derived0.length != derived1.length) return false;
int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']
sentences = [instruction+sentence for sentence in sentences]
# Get the embeddings
embeddings = model.encode(sentences)
```
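With recent `sentence-transformers` releases (3.x and later), the loaded model also exposes a `similarity` helper, which defaults to cosine similarity:

```python
# scores[i][j] is the similarity between sentences[i] and sentences[j]
scores = model.similarity(embeddings, embeddings)
print(scores)
```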
## Evaluation (**MTEB**)
```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-7B"
# Load the model (falls back to SentenceTransformer(model_name) if the model is not registered in MTEB)
model = mteb.get_model(model_name)
# Select tasks
tasks = mteb.get_tasks(tasks=[
    "AppsRetrieval",
    "CodeSearchNetCCRetrieval",
    "CodeEditSearchRetrieval",
    "CodeSearchNetRetrieval",
    "CodeFeedbackMT",
    "CodeFeedbackST",
    "CodeTransOceanContest",
    "CodeTransOceanDL",
    "COIRCodeSearchNetRetrieval",
    "CosQA",
    "StackOverflowQA",
    "SyntheticText2SQL",
])
# Cache the result
cache = ResultCache("./c2llm_results")
# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```
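To inspect the scores afterwards, you can read the JSON files that MTEB writes into the cache directory. The exact layout and schema depend on the installed MTEB version, so treat the following as a rough sketch rather than a stable API:

```python
import glob
import json
import os

# Each task typically gets one JSON result file somewhere under the cache folder.
for path in glob.glob(os.path.join("./c2llm_results", "**", "*.json"), recursive=True):
    with open(path) as f:
        data = json.load(f)
    scores = data.get("scores")
    if not scores:
        continue  # skip files that are not per-task score reports
    # Per split, the first score entry usually carries a "main_score" field.
    main = {split: entries[0].get("main_score") for split, entries in scores.items() if entries}
    print(data.get("task_name", os.path.basename(path)), main)
```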
## Support Us
If you find this project helpful, please give it a star. It means a lot to us!
[](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main)
## Citation
@article{2025C2LLM,
  title = {C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
  author = {Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
  journal = {CoRR},
  volume = {abs/2512.21332},
  year = {2025},
  url = {https://doi.org/10.48550/arXiv.2512.21332},
  doi = {10.48550/ARXIV.2512.21332},
  eprinttype = {arXiv},
  eprint = {2512.21332}
}