Mxode/Pythia-70m-C-Language-KnowledgeExtract

Mxode committed on Oct 5, 2023

Commit e77286d · 1 Parent(s): 9c32014

Update README.md

Files changed (1)

README.md +88 -0

@@ -1,3 +1,91 @@
 ---
 license: apache-2.0
+language:
+- en
+tags:
+- code
+- knowledge extraction
+- tiny
+- small
 ---
+A model that **extracts the knowledge points** involved in a given piece of **C language code**.
+The base model is [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m). It was fine-tuned for 10 epochs with the [QLoRA](https://github.com/artidoro/qlora) method on my own training set.
+A usage example follows. First, load the model and prepare the code:
+```python
+import torch
+from transformers import GPTNeoXForCausalLM, AutoTokenizer
+model_name_or_path = 'Mxode/Pythia-70m-C-Language-KnowledgeExtract'
+device = 'cuda' if torch.cuda.is_available() else 'cpu'	# fall back to CPU when no GPU is available
+model = GPTNeoXForCausalLM.from_pretrained(model_name_or_path).to(device)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+instruction = '[Summarize the knowledge points in the code below]\n'	# instruction template
+# any C snippet works here, including partial functions or single statements
+input_content = '''```c
+int partition(int arr[], int low, int high) {
+    int pivot = arr[high];
+    int i = (low - 1);
+    for (int j = low; j <= high - 1; j++) {
+        if (arr[j] < pivot) {
+            i++;
+            swap(&arr[i], &arr[j]);
+        }
+    }
+    swap(&arr[i + 1], &arr[high]);
+    return (i + 1);
+}
+void quickSort(int arr[], int low, int high) {
+    if (low < high) {
+        int pi = partition(arr, low, high);
+        quickSort(arr, low, pi - 1);
+        quickSort(arr, pi + 1, high);
+    }
+}
+```'''
+text = instruction + input_content
+```
+Then generate:
+```python
+inputs = tokenizer(text, return_tensors="pt").to(device)
+tokens = model.generate(
+    **inputs,
+    pad_token_id=tokenizer.eos_token_id,
+    max_new_tokens=32,
+)
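+# keep only the text after the prompt's closing code fence, cut at the first '<' (where a special token such as the end-of-text marker begins)
+response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]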
+```
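The one-line post-processing above is compact but easy to misread. Below is a minimal sketch of the same parsing written as a named helper and run on a hand-written stand-in for the decoded text. The helper name `extract_answer` and the sample string are illustrative and not part of the model card; the sample assumes the end-of-text marker is `<|endoftext|>`, as in the base Pythia tokenizer.

```python
FENCE = "`" * 3  # the three-backtick fence that wraps the C snippet in the prompt


def extract_answer(decoded: str) -> str:
    """Return the text after the last code fence, cut at the first '<'.

    Mirrors the one-liner in the model card: the echoed prompt ends with the
    closing fence, and a special token such as <|endoftext|> starts with '<'.
    """
    after_prompt = decoded.split(FENCE)[-1]  # drop the echoed prompt
    return after_prompt.split("<")[0]        # drop the special token and anything after it


# Hand-written stand-in for tokenizer.decode(tokens[0]); not real model output.
sample = (
    "[Summarize the knowledge points in the code below]\n"
    + FENCE + "c\nint x = 1;\n" + FENCE
    + "Quick sort<|endoftext|>"
)
print(extract_answer(sample))  # prints: Quick sort
```

One consequence of cutting at the first `<` is that an answer which itself contains `<` would be truncated at that point.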
+In practice, it's recommended to run inference several times with sampling to get more diverse answers, then keep the most frequent ones. The model is small, so repeated inference takes little time:
+```python
+ans_dict = {}
+def increment_insert(key):
+    ans_dict[key] = ans_dict.get(key, 0) + 1
+inputs = tokenizer(text, return_tensors="pt").to(device)	# tokenize once; the prompt is the same for every run
+for _ in range(30):		# 20 runs or fewer may be enough
+    tokens = model.generate(
+        **inputs,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=32,
+        do_sample=True,
+        temperature=2.0,                       # high temperature for diversity
+        top_p=0.95,
+        top_k=30,
+    )
+    response = tokenizer.decode(tokens[0]).split('```')[-1].split('<')[0]
+    increment_insert(response)
+print(ans_dict)
+# Example output; the high-frequency answers are the ones to keep:
+# {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
+```
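To pick the high-frequency answers out of a tally like the one above, one option is `collections.Counter`. The sketch below reuses the example counts printed in the model card. The helper name `top_answers` and the `min_count` threshold are assumptions made for illustration, not part of the original README.

```python
from collections import Counter


def top_answers(counts: dict, min_count: int = 2) -> list:
    """Return (answer, count) pairs seen at least `min_count` times, most frequent first."""
    ranked = Counter(counts).most_common()  # sorted by count, descending
    return [(answer, n) for answer, n in ranked if n >= min_count]


# Example tally copied from the output shown above.
ans_dict = {'Backtracking': 1, 'Heap': 1, 'Quick sort': 25, 'Recurrence': 2, 'Queue': 1}
print(top_answers(ans_dict))  # [('Quick sort', 25), ('Recurrence', 2)]
```

A threshold on the count is only one possible rule; taking a fixed number of top entries with `most_common(k)` would work as well.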