---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
pipeline_tag: feature-extraction
library_name: transformers
tags:
- code
---
|
|
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;"> |
|
|
<a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;"> |
|
|
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;"> |
|
|
<span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span> |
|
|
</a> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
# C2LLM: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
|
|
|
|
|
[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings) |
|
|
|
|
|
**C2LLMs (Code Contrastive Large Language Models)** are powerful new models for generating code embeddings, designed to capture the deep semantics of source code. |
|
|
|
|
|
#### Key Features |
|
|
|
|
|
- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities. |
|
|
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**, which lets the model dynamically focus on the most critical parts of the code and produce a more informative, robust embedding (see the sketch after this list).
|
|
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search across Text2Code, Code2Code, and Code2Text tasks.
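
To make PMA concrete, here is a minimal PyTorch sketch of attention-based pooling in that spirit: a learnable seed query cross-attends over the encoder's token hidden states, so the pooled vector can weight salient tokens instead of averaging them uniformly. The class name, head count, and wiring below are illustrative assumptions, not the exact C2LLM implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Sketch of PMA-style pooling: a single learnable query attends
    over all token hidden states (hyperparameters are illustrative)."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, hidden_size))  # learnable query
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq), 1 = real token
        query = self.seed.expand(hidden_states.size(0), -1, -1)
        pooled, _ = self.attn(query, hidden_states, hidden_states,
                              key_padding_mask=attention_mask == 0)  # ignore padding
        return pooled.squeeze(1)  # (batch, hidden) embedding
```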
|
|
|
|
|
C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main). |
|
|
|
|
|
# Model Details

- **Base model**: [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
- **Pooling**: PMA (Pooling by Multi-head Attention) rather than mean or last-token pooling
- **Training data**: 3 million query-document pairs
- **Supported tasks**: Text2Code, Code2Code, and Code2Text retrieval
- **Languages**: Chinese and English
|
|
|
|
|
# How to Use
|
|
|
|
|
## Usage (**HuggingFace Transformers**) |
|
|
|
|
|
```python
from transformers import AutoModel
import torch

model_path = "codefuse-ai/C2LLM-7B"

# Load the model
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Prepare your custom instruction (placeholder below; replace with your task prompt)
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Prepend the instruction to every input
sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
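
Once you have `embeddings`, retrieval is a nearest-neighbor lookup, typically by cosine similarity. A hypothetical follow-up (the query string is invented for illustration, and `model.encode` output is assumed to be tensor-convertible):

```python
import torch
import torch.nn.functional as F

# Hypothetical retrieval step: embed a query and rank the code snippets above.
query_emb = model.encode([instruction + "constant-time comparison of derived keys"])
scores = F.cosine_similarity(torch.as_tensor(query_emb), torch.as_tensor(embeddings))
print(scores)                  # one similarity score per snippet
print(scores.argmax().item())  # index of the best-matching snippet
```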
|
|
|
|
|
## Usage (**Sentence-Transformers**) |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer

# Load the model (note the left padding for this decoder-based encoder)
model = SentenceTransformer("codefuse-ai/C2LLM-7B", trust_remote_code=True, tokenizer_kwargs={"padding_side": "left"})

# Prepare your custom instruction (placeholder below; replace with your task prompt)
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Prepend the instruction to every input
sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
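
Recent versions of Sentence-Transformers (v3+) also ship a `similarity` helper, so the same retrieval step can be written as below; the query text is again only an illustration:

```python
# Hypothetical query against the embedded code snippets.
query_emb = model.encode([instruction + "swap range bounds when from > to"])
scores = model.similarity(query_emb, embeddings)  # cosine similarity by default
print(scores)  # shape: (1, number of snippets)
```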
|
|
|
|
|
## Evaluation (**MTEB**) |
|
|
|
|
|
```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-7B"

# Load the model; if it has no dedicated MTEB implementation this is
# equivalent to SentenceTransformer(model_name)
model = mteb.get_model(model_name)

# Select tasks
tasks = mteb.get_tasks(tasks=[
    "AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval",
    "CodeSearchNetRetrieval", "CodeFeedbackMT", "CodeFeedbackST",
    "CodeTransOceanContest", "CodeTransOceanDL", "COIRCodeSearchNetRetrieval",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
])

# Cache the results
cache = ResultCache("./c2llm_results")

# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```
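
The returned results can then be inspected per task. The exact result API differs across MTEB versions, so treat this as a sketch assuming `results` iterates over per-task result objects:

```python
# Sketch: print one aggregate score per task (field names may vary by MTEB version).
for res in results:
    print(res.task_name, res.get_score())
```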
|
|
|
|
|
## Support Us |
|
|
|
|
|
If you find this project helpful, please give it a star. It means a lot to us! |
|
|
|
|
|
[](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main) |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@article{2025C2LLM,
  title      = {C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
  author     = {Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2512.21332},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2512.21332},
  doi        = {10.48550/ARXIV.2512.21332},
  eprinttype = {arXiv},
  eprint     = {2512.21332}
}
```