---
license: apache-2.0
language:
- en
- zh
library_name: transformers
tags:
- code
---
# Introduction

C2LLM: Advanced Code Embeddings for Deep Semantic Understanding

**C2LLM (Code Contrastive Large Language Model)** is a powerful new model for generating code embeddings, designed to capture the deep semantics of source code.

#### Key Features

- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**, which lets the model dynamically focus on the most critical parts of the code and produce a more informative, robust embedding.
- **Trained for Retrieval**: C2LLM was fine-tuned on a massive dataset of **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search. It supports Text2Code, Code2Code, and Code2Text tasks.

C2LLM is designed to be a go-to model for tasks such as code search and Retrieval-Augmented Generation (RAG).
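
The PMA idea above can be sketched in a few lines: a learnable query vector attends over all token hidden states, so the pooled embedding weights tokens by relevance instead of averaging them. This is an illustrative sketch only (layer names, head count, and shapes are assumptions, not C2LLM's actual implementation):

```python
import torch
import torch.nn as nn

class PMAPooling(nn.Module):
    """Illustrative PMA: one learnable query attends over token hidden states."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        # A single learnable "seed" query that summarizes the sequence
        self.query = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states, attention_mask=None):
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = keep
        batch = hidden_states.size(0)
        q = self.query.expand(batch, -1, -1)
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        pooled, _ = self.attn(q, hidden_states, hidden_states,
                              key_padding_mask=key_padding_mask)
        return pooled.squeeze(1)  # (batch, hidden)

# Toy check with random hidden states
pool = PMAPooling(hidden_size=64, num_heads=4)
h = torch.randn(2, 10, 64)
mask = torch.ones(2, 10, dtype=torch.long)
emb = pool(h, mask)
print(emb.shape)
```

Unlike mean pooling, the attention weights here are input-dependent, so semantically important tokens can dominate the final embedding.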

# Model Details

# How to use

## Usage (**HuggingFace Transformers**)

```python
from transformers import AutoModel
import torch

model_path = "codefuse-ai/C2LLM-7B"

# Load the model
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Get the embeddings
embeddings = model.encode(sentences)
```
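
The `encode` call returns one embedding per input snippet. These vectors are typically compared with cosine similarity. The sketch below uses small dummy vectors in place of real model outputs (the vector values and the helper name are assumptions for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy embeddings standing in for model.encode(...) outputs
emb_a = np.array([0.2, 0.7, -0.1, 0.5])
emb_b = np.array([0.1, 0.6, -0.2, 0.4])

score = cosine_similarity(emb_a, emb_b)
print(round(score, 3))
```

A higher score means the two code snippets are closer in the embedding space, which is the signal used for Code2Code search.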

## Usage (**Sentence-Transformers**)

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-7B", trust_remote_code=True)

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

# Get the embeddings
embeddings = model.encode(sentences)
```
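
For a Text2Code search scenario, you would encode a natural-language query and a pool of candidate snippets with `model.encode`, then rank candidates by cosine similarity. The sketch below shows only the ranking step, with dummy embeddings standing in for real model outputs (all vector values are assumptions):

```python
import numpy as np

# Dummy embeddings standing in for model.encode(...) outputs:
# one query vector and three candidate code-snippet vectors.
query = np.array([0.9, 0.1, 0.0])
candidates = np.array([
    [0.8, 0.2, 0.1],   # deliberately close to the query
    [0.0, 1.0, 0.0],
    [0.1, 0.0, 0.9],
])

# L2-normalize so dot products equal cosine similarities
q = query / np.linalg.norm(query)
c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)

scores = c @ q                 # cosine similarity per candidate
ranking = np.argsort(-scores)  # best match first
print(ranking[0])  # -> 0, the first candidate ranks highest
```

Normalizing once up front keeps ranking a single matrix-vector product, which scales well to large candidate pools.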

## Evaluation (**MTEB**)

```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-7B"

# Load the model
model = mteb.get_model(model_name)  # if the model is not implemented in MTEB, this is equivalent to SentenceTransformer(model_name)

# Select tasks
tasks = mteb.get_tasks(tasks=["AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval", "CodeSearchNetRetrieval", "CodeFeedbackMT", "CodeFeedbackST", "CodeTransOceanContest", "CodeTransOceanDL", "COIRCodeSearchNetRetrieval", "CosQA", "StackOverflowQA", "SyntheticText2SQL"])

# Cache the results
cache = ResultCache("./c2llm_results")

# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```

## Correspondence to

Jin Qin ([email protected]), Zihan Liao ([email protected]), Ziyin Zhang ([email protected]), Hang Yu ([email protected]), Peng Di ([email protected])