---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- securebert
- IR
- docembedding
- generated_from_trainer
- dataset_size:35705
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
    What is the primary responsibility of the Information Security Oversight
    Committee in an organization?
  sentences:
  - Least privilege
  - By searching for repeating ciphertext sequences at fixed displacements.
  - >-
    Ensuring and supporting information protection awareness and training
    programs
---

# Model Card for cisco-ai/SecureBERT2.0-biencoder

The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base).  
It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.

---

## Model Details

### Model Description

- **Developed by:** Cisco AI  
- **Model type:** Bi-Encoder (Sentence Transformer)  
- **Architecture:** ModernBERT backbone with dual encoders  
- **Max sequence length:** 1024 tokens  
- **Output dimension:** 768  
- **Language:** English  
- **License:** Apache-2.0  
- **Finetuned from:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

---

## Uses

### Direct Use

- **Semantic search** and **document similarity** in cybersecurity corpora  
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes  
- **Document embedding** for retrieval-augmented generation (RAG) and clustering  

### Downstream Use

- Threat intelligence knowledge graph construction  
- Cybersecurity QA and reasoning systems  
- Security operations center (SOC) data mining  

### Out-of-Scope Use

- Non-technical or general-domain text similarity  
- Generative or conversational tasks  

---

## Model Architecture

The Bi-Encoder encodes queries and documents **independently** into a joint vector space.  
This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking.
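The ranking step can be illustrated with a minimal NumPy sketch (toy 4-dimensional vectors stand in for the model's 768-dimensional encoder outputs; in practice the embeddings come from `model.encode(...)`):

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy embeddings standing in for encoder outputs.
query_emb = np.array([[1.0, 0.0, 1.0, 0.0]])
doc_embs = np.array([
    [0.9, 0.1, 0.8, 0.0],  # semantically close to the query
    [0.0, 1.0, 0.0, 1.0],  # unrelated
])

scores = cos_sim(query_emb, doc_embs)[0]
ranking = np.argsort(-scores)  # document indices, best match first
```

At corpus scale, the same cosine scores are approximated by an ANN index (e.g. FAISS or HNSW) built once over the precomputed document embeddings, so only queries need to be encoded at search time.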

---

## Datasets

### Fine-Tuning Datasets

| Dataset Category | Number of Records |
|:-----------------|:-----------------:|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |

#### Dataset Descriptions

- **Cybersecurity QA corpus:** 43,000 question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.  
- **Security governance QA corpus:** 60,000 expert-curated governance and compliance QA pairs emphasizing clear, validated responses.  
- **Cybersecurity instruction–response corpus:** 25,000 instructional pairs enabling reasoning and instruction following.  
- **Cybersecurity rules corpus:** 5,000 structured policy and guideline records used for evaluation.

---

## How to Get Started with the Model

### Using Sentence Transformers

```bash
pip install -U sentence-transformers
```

### Encode Sentences

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
```

### Compute Similarity

```python
from sentence_transformers import util

similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
---

## Framework Versions

* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0
 
---


## Training Details

### Training Dataset

The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.

- **Dataset Size:** 35,705 samples  
- **Columns:** `sentence_0`, `sentence_1`, `label`  


#### Example Schema

| Field | Type | Description |
|:------|:------|:------------|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |

#### Example Samples

| sentence_0 | sentence_1 | label |
|:------------|:-----------|:------:|
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |

---

### Training Objective and Loss

The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.

- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)  

#### Loss Parameters
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
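Concretely, MultipleNegativesRankingLoss treats each `(sentence_0, sentence_1)` pair in a batch as the positive and every other in-batch `sentence_1` as a negative, applying cross-entropy over the scaled cosine-similarity matrix. The sketch below reproduces that computation in plain NumPy (illustrative only; the actual implementation lives in `sentence_transformers.losses`):

```python
import numpy as np

def mnr_loss(query_embs: np.ndarray, doc_embs: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives cross-entropy over a scaled cosine-similarity matrix."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = scale * (q @ d.T)  # entry (i, j): query i vs. document j
    # Row i's correct class is column i: its paired document on the diagonal.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
aligned_docs = queries + 0.05 * rng.normal(size=(4, 8))  # positives match row-wise
shuffled_docs = aligned_docs[::-1]                       # positives misaligned

low = mnr_loss(queries, aligned_docs)
high = mnr_loss(queries, shuffled_docs)
```

Training therefore pushes each query toward its paired document and away from all other documents in the batch, which is why correctly aligned pairs yield a much lower loss than misaligned ones.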

## Reference

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
---

## Model Card Authors

Cisco AI

## Model Card Contact
For inquiries, please contact [[email protected]](mailto:[email protected])