---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- securebert
- IR
- docembedding
- generated_from_trainer
- dataset_size:35705
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
What is the primary responsibility of the Information Security Oversight
Committee in an organization?
sentences:
- Least privilege
- By searching for repeating ciphertext sequences at fixed displacements.
- >-
Ensuring and supporting information protection awareness and training
programs
---
# Model Card for cisco-ai/SecureBERT2.0-biencoder
The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base).
It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
---
## Model Details
### Model Description
- **Developed by:** Cisco AI
- **Model type:** Bi-Encoder (Sentence Transformer)
- **Architecture:** ModernBERT backbone with dual encoders
- **Max sequence length:** 1024 tokens
- **Output dimension:** 768
- **Language:** English
- **License:** Apache-2.0
- **Finetuned from:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)
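The sequence length and embedding dimensionality reported above can be checked locally through the standard Sentence Transformers attributes; a minimal sketch, assuming the published configuration matches these values:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Maximum number of tokens encoded per input before truncation
print(model.max_seq_length)                      # expected: 1024

# Dimensionality of the dense embeddings produced by the pooling layer
print(model.get_sentence_embedding_dimension())  # expected: 768
```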
---
## Uses
### Direct Use
- **Semantic search** and **document similarity** in cybersecurity corpora
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
- **Document embedding** for retrieval-augmented generation (RAG) and clustering
### Downstream Use
- Threat intelligence knowledge graph construction
- Cybersecurity QA and reasoning systems
- Security operations center (SOC) data mining
### Out-of-Scope Use
- Non-technical or general-domain text similarity
- Generative or conversational tasks
---
## Model Architecture
The Bi-Encoder encodes queries and documents **independently** into a joint vector space.
This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking.
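To illustrate the independent encoding, the sketch below embeds a small corpus once and scores a query against it with the library's exact-search helper; the corpus snippets and query are invented for the example, and a production deployment would typically replace the exact search with an ANN index such as FAISS or HNSW:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Documents are embedded once, independently of any future query (hypothetical corpus)
corpus = [
    "Advisory describing a heap overflow in a TLS library.",
    "Playbook for containing a ransomware outbreak in an enterprise network.",
    "Guidance on configuring least-privilege IAM policies.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Queries are embedded separately and matched against the precomputed corpus vectors
query_embeddings = model.encode(
    ["How should we respond to a ransomware infection?"], convert_to_tensor=True
)

# Exact cosine-similarity search; swap in an ANN index for large corpora
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```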
---
## Datasets
### Fine-Tuning Datasets
| Dataset Category | Number of Records |
|:-----------------|:-----------------:|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |
#### Dataset Descriptions
- **Cybersecurity QA corpus:** 43,000 question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- **Security governance QA corpus:** 60,000 expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- **Cybersecurity instruction–response corpus:** 25,000 instructional pairs enabling reasoning and instruction-following.
- **Cybersecurity rules corpus:** 5,000 structured policy and guideline records used for evaluation.
---
## How to Get Started with the Model
### Using Sentence Transformers
```bash
pip install -U sentence-transformers
```
### Run Model to Encode
```python
from sentence_transformers import SentenceTransformer

# Load the bi-encoder from the Hugging Face Hub
model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

# Encode each sentence into a 768-dimensional dense vector
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
### Compute Similarity
```python
from sentence_transformers import util

# Pairwise cosine similarity between the embeddings computed above
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
---
## Framework Versions
* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0
---
## Training Details
### Training Dataset
The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
- **Dataset Size:** 35,705 samples
- **Columns:** `sentence_0`, `sentence_1`, `label`
#### Example Schema
| Field | Type | Description |
|:------|:------|:------------|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |
#### Example Samples
| sentence_0 | sentence_1 | label |
|:------------|:-----------|:------:|
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
---
### Training Objective and Loss
The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
#### Loss Parameters
```json
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
```
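For reference, a minimal fine-tuning sketch with this loss is shown below; the pairs are toy examples standing in for the 35,705-sample training set, and the trainer defaults are assumptions rather than the exact recipe used for this model:
```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("cisco-ai/SecureBERT2.0-base")

# Toy (sentence_0, sentence_1) pairs; in-batch negatives come from the other pairs in each batch
train_dataset = Dataset.from_dict({
    "sentence_0": [
        "What is the principle of least privilege?",
        "How does Kerberoasting escalate privileges?",
    ],
    "sentence_1": [
        "Least privilege grants users only the access required for their role.",
        "Kerberoasting requests service tickets and cracks them offline to recover credentials.",
    ],
})

# MultipleNegativesRankingLoss with the scale listed above (cosine similarity is the default)
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```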
## Reference
```bibtex
@article{aghaei2025securebert,
title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
journal={arXiv preprint arXiv:2510.00240},
year={2025}
}
```
---
## Model Card Authors
Cisco AI
## Model Card Contact
For inquiries, please contact [[email protected]](mailto:[email protected])