Update README.md
README.md
CHANGED
|
@@ -1,9 +1,19 @@
|
|
| 1 |
---
|
| 2 |
tags:
|
| 3 |
- sentence-transformers
|
| 4 |
- sentence-similarity
|
| 5 |
- feature-extraction
|
| 6 |
- dense
|
| 7 |
- generated_from_trainer
|
| 8 |
- dataset_size:35705
|
| 9 |
- loss:MultipleNegativesRankingLoss
|
|
@@ -17,163 +27,178 @@ widget:
|
|
| 17 |
- >-
|
| 18 |
Ensuring and supporting information protection awareness and training
|
| 19 |
programs
|
| 20 |
-
|
| 21 |
-
pipeline_tag: sentence-similarity
|
| 22 |
-
library_name: sentence-transformers
|
| 23 |
-
base_model:
|
| 24 |
-
- CiscoAITeam/SecureBERT2.0-base
|
| 25 |
---
|
| 26 |
|
| 27 |
-
#
|
| 28 |
|
| 29 |
-
|
| 30 |
|
|
|
|
| 31 |
|
|
|
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
-
|
|
|
|
| 38 |
|
| 39 |
---
|
| 40 |
|
| 41 |
-
##
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
|
|
|
|
|
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
|
|
|
|
|
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
|
| 60 |
-
|
| 61 |
-
~43k records of question–answer pairs, incident reports, and domain knowledge across subdomains such as network security, malware analysis, cryptography, and cloud security. Provides varied technical text for both precise retrieval and contextual understanding.
|
| 62 |
|
| 63 |
-
-
|
| 64 |
-
~60k curated QA pairs on governance, compliance, vulnerability management, and exploit analysis. Emphasizes concise, expert-validated answers for robust semantic generalization.
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
|
| 69 |
-
|
| 70 |
-
~5k structured cybersecurity policies, guidelines, and best practices. Used as an evaluation benchmark for assessing retrieval quality against standards and compliance rules.
|
| 71 |
|
| 72 |
|
| 73 |
-
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
```bash
|
| 80 |
pip install -U sentence-transformers
|
| 81 |
```
|
| 82 |
-
|
| 83 |
-
Then you can load this model and run inference.
|
| 84 |
-
|
| 85 |
-
#### Semantic Similarity:
|
| 86 |
```python
|
| 87 |
from sentence_transformers import SentenceTransformer
|
| 88 |
|
| 89 |
-
# Download from the 🤗 Hub
|
| 90 |
model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
|
| 91 |
-
|
| 92 |
sentences = [
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
]
|
|
|
|
| 97 |
embeddings = model.encode(sentences)
|
| 98 |
print(embeddings.shape)
|
| 99 |
-
# [3, 768]
|
| 100 |
-
|
| 101 |
-
# Get the similarity scores for the embeddings
|
| 102 |
-
similarities = model.similarity(embeddings, embeddings)
|
| 103 |
-
print(similarities)
|
| 104 |
-
# tensor([[ 1.0000, 0.8653, 0.0078],
|
| 105 |
-
# [ 0.8653, 1.0000, -0.0407],
|
| 106 |
-
# [ 0.0078, -0.0407, 1.0000]])
|
| 107 |
```
|
| 108 |
|
| 109 |
-
|
|
|
|
| 110 |
|
| 111 |
-
|
| 112 |
-
from sentence_transformers import CrossEncoder, util
|
| 113 |
-
from transformers import AutoTokenizer, TFAutoModel
|
| 114 |
|
| 115 |
-
|
| 116 |
-
tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder"")
|
| 117 |
-
model = TFAutoModel.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder"")
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
|
| 123 |
-
# Compute similarity
|
| 124 |
-
import tensorflow as tf
|
| 125 |
-
similarity = tf.keras.losses.cosine_similarity(query_embedding, doc_embedding)
|
| 126 |
-
print(similarity)
|
| 127 |
-
```
|
| 128 |
|
| 129 |
## Training Details
|
| 130 |
|
| 131 |
### Training Dataset
|
| 132 |
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
| <code>Under what circumstances does attribution bias distort intrusion linking, and what countermeasures can mitigate it?</code> | <code>Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents. This distortion manifests through several mechanisms: confirmation bias leads analysts to interpret ambiguous indicators as evidence supporting their initial hypotheses; availability heuristic causes over-reliance on memorable or recent attacks; and anchoring effect results in disproportionate weight given to early intelligence assessments.\n\nThe MITRE ATT&CK framework's Groups matrix exemplifies how attribution bias emerges when linking tactics, techniques, and procedures (TTPs) across campaigns. Analysts may incorrectly associate similar tools or methodologies without sufficient evidence of operational connection, particularly when adversaries employ publicly available malware or common techniques across multiple threat groups. This challenge intensifies during high...</code> | <code>1.0</code> |
|
| 146 |
-
| <code>How can you identify store buffer bypass speculation artifacts?</code> | <code>Store buffer bypass speculation artifacts represent sophisticated side-channel vulnerabilities that exploit microarchitectural features in modern processors. These vulnerabilities occur when speculative execution accesses memory locations before stores are committed, creating observable timing differences that leak sensitive information.\n\n**Detection Methodologies:**\n\nStatic analysis tools can identify potential store buffer bypass vulnerabilities by examining assembly code patterns where load instructions follow store operations without proper memory barriers. Tools like Intel's Pin or LLVM-based analyzers can detect missing lfence/stfence instructions that prevent speculative execution from bypassing stores.\n\nDynamic analysis approaches involve precise timing measurements using high-resolution performance counters. Researchers measure memory access latencies to identify anomalous patterns where load operations complete faster than expected, indicating potential speculation thro...</code> | <code>1.0</code> |
|
| 147 |
-
| <code>How would you develop methods for analyzing attacks targeting industrial control systems?</code> | <code>Developing comprehensive methods for analyzing attacks targeting Industrial Control Systems (ICS) requires a multi-layered approach integrating specialized frameworks and industry standards.\n\n**Framework Integration**: Begin with NIST's Cybersecurity Framework, particularly the Identify function, to catalog ICS assets, data flows, and criticality levels. The Detect function provides guidance on implementing continuous monitoring capabilities tailored for operational technology environments. MITRE ATT&CK for ICS offers crucial tactical intelligence, mapping adversary behaviors specific to industrial contexts through techniques like \\\"Inhibit Response Function\\\" (T0803) or \\\"Modify Parameter\\\" (T0832).\n\n**Technical Analysis Methodology**: Establish baseline behavioral profiles for normal ICS operations using network traffic analysis, protocol inspection, and system state monitoring. Deploy specialized tools capable of deep packet inspection for industrial protocols (Modbus, D...</code> | <code>1.0</code> |
|
| 148 |
-
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
|
| 149 |
-
```json
|
| 150 |
-
{
|
| 151 |
-
"scale": 20.0,
|
| 152 |
-
"similarity_fct": "cos_sim"
|
| 153 |
-
}
|
| 154 |
-
```
|
| 155 |
-
|
| 156 |
-
### Training Hyperparameters
|
| 157 |
-
#### Non-Default Hyperparameters
|
| 158 |
-
|
| 159 |
-
- `eval_strategy`: steps
|
| 160 |
-
- `per_device_train_batch_size`: 32
|
| 161 |
-
- `per_device_eval_batch_size`: 32
|
| 162 |
-
- `num_train_epochs`: 20
|
| 163 |
-
- `multi_dataset_batch_sampler`: round_robin
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
### Framework Versions
|
| 168 |
-
- Python: 3.10.10
|
| 169 |
-
- Sentence Transformers: 5.0.0
|
| 170 |
-
- Transformers: 4.52.4
|
| 171 |
-
- PyTorch: 2.7.0+cu128
|
| 172 |
-
- Accelerate: 1.9.0
|
| 173 |
-
- Datasets: 3.6.0
|
| 174 |
-
- Tokenizers: 0.21.1
|
| 175 |
|
|
|
|
| 176 |
|
| 177 |
## Reference
|
| 178 |
|
| 179 |
```
|
|
@@ -184,3 +209,11 @@ print(similarity)
|
|
| 184 |
year={2025}
|
| 185 |
}
|
| 186 |
```
|
| 1 |
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
base_model:
|
| 6 |
+
- CiscoAITeam/SecureBERT2.0-base
|
| 7 |
+
pipeline_tag: sentence-similarity
|
| 8 |
+
library_name: sentence-transformers
|
| 9 |
tags:
|
| 10 |
- sentence-transformers
|
| 11 |
- sentence-similarity
|
| 12 |
- feature-extraction
|
| 13 |
- dense
|
| 14 |
+
- securebert
|
| 15 |
+
- IR
|
| 16 |
+
- docembedding
|
| 17 |
- generated_from_trainer
|
| 18 |
- dataset_size:35705
|
| 19 |
- loss:MultipleNegativesRankingLoss
|
|
|
|
| 27 |
- >-
|
| 28 |
Ensuring and supporting information protection awareness and training
|
| 29 |
programs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
---
|
| 31 |
|
| 32 |
+
# Model Card for CiscoAITeam/SecureBERT2.0-biencoder
|
| 33 |
|
| 34 |
+
The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base).
|
| 35 |
+
It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
|
| 39 |
+
## Model Details
|
| 40 |
|
| 41 |
+
### Model Description
|
| 42 |
|
| 43 |
+
- **Developed by:** Cisco AI Team
|
| 44 |
+
- **Model type:** Bi-Encoder (Sentence Transformer)
|
| 45 |
+
- **Architecture:** ModernBERT backbone with dual encoders
|
| 46 |
+
- **Max sequence length:** 1024 tokens
|
| 47 |
+
- **Output dimension:** 768
|
| 48 |
+
- **Language:** English
|
| 49 |
+
- **License:** Apache-2.0
|
| 50 |
+
- **Finetuned from:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
|
| 51 |
|
| 52 |
+
### Model Sources
|
| 53 |
+
|
| 54 |
+
- **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder](https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder)
|
| 55 |
+
- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)
|
| 56 |
|
| 57 |
---
|
| 58 |
|
| 59 |
+
## Uses
|
| 60 |
|
| 61 |
+
### Direct Use
|
| 62 |
|
| 63 |
+
- **Semantic search** and **document similarity** in cybersecurity corpora
|
| 64 |
+
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
|
| 65 |
+
- **Document embedding** for retrieval-augmented generation (RAG) and clustering
|
| 66 |
|
| 67 |
+
### Downstream Use
|
| 68 |
|
| 69 |
+
- Threat intelligence knowledge graph construction
|
| 70 |
+
- Cybersecurity QA and reasoning systems
|
| 71 |
+
- Security operations center (SOC) data mining
|
| 72 |
|
| 73 |
+
### Out-of-Scope Use
|
| 74 |
+
|
| 75 |
+
- Non-technical or general-domain text similarity
|
| 76 |
+
- Generative or conversational tasks
|
| 77 |
+
|
| 78 |
+
---
|
| 79 |
|
| 80 |
+
## Model Architecture
|
| 81 |
+
|
| 82 |
+
The Bi-Encoder encodes queries and documents **independently** into a joint vector space.
|
| 83 |
+
This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking.
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
|
| 87 |
+
## Datasets
|
|
|
|
| 88 |
|
| 89 |
+
### Fine-Tuning Datasets
|
|
|
|
| 90 |
|
| 91 |
+
| Dataset Category | Number of Records |
|
| 92 |
+
|:-----------------|:-----------------:|
|
| 93 |
+
| Cybersecurity QA corpus | 43 000 |
|
| 94 |
+
| Security governance QA corpus | 60 000 |
|
| 95 |
+
| Cybersecurity instruction–response corpus | 25 000 |
|
| 96 |
+
| Cybersecurity rules corpus (evaluation) | 5 000 |
|
| 97 |
|
| 98 |
+
#### Dataset Descriptions
|
|
|
|
| 99 |
|
| 100 |
+
- **Cybersecurity QA corpus:** 43 k question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
|
| 101 |
+
- **Security governance QA corpus:** 60 k expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
|
| 102 |
+
- **Cybersecurity instruction–response corpus:** 25 k instructional pairs enabling reasoning and instruction-following.
|
| 103 |
+
- **Cybersecurity rules corpus:** 5 k structured policy and guideline records used for evaluation.
|
| 104 |
|
| 105 |
+
---
|
| 106 |
|
| 107 |
+
## How to Get Started with the Model
|
| 108 |
|
| 109 |
+
### Using Sentence Transformers
|
| 110 |
|
| 111 |
```bash
|
| 112 |
pip install -U sentence-transformers
|
| 113 |
```
|
| 114 |
+
### Run Model to Encode
|
|
|
|
|
|
|
|
|
|
| 115 |
```python
|
| 116 |
from sentence_transformers import SentenceTransformer
|
| 117 |
|
|
|
|
| 118 |
model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
|
| 119 |
+
|
| 120 |
sentences = [
|
| 121 |
+
"How would you use Amcache analysis to detect fileless malware?",
|
| 122 |
+
"Amcache analysis provides forensic artifacts for detecting fileless malware ...",
|
| 123 |
+
"To capture and display network traffic"
|
| 124 |
]
|
| 125 |
+
|
| 126 |
embeddings = model.encode(sentences)
|
| 127 |
print(embeddings.shape)
|
| 128 |
```
|
| 129 |
+
### Compute Similarity
|
| 130 |
+
```python
|
| 131 |
+
from sentence_transformers import util
|
| 132 |
+
similarity = util.cos_sim(embeddings, embeddings)
|
| 133 |
+
print(similarity)
|
| 134 |
|
| 135 |
+
```
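
As a rough illustration of what the similarity step computes, the sketch below reimplements cosine similarity with NumPy and ranks a few documents against a query. The three-dimensional vectors are made-up stand-ins for the 768-dimensional model embeddings, and the helper names `cosine_similarity_matrix` and `rank_documents` are hypothetical, not part of sentence-transformers.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalize each row to unit length so the dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = cosine_similarity_matrix([query_vec], doc_vecs)[0]
    order = np.argsort(-scores)  # descending by score
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors (assumed for illustration only, not real model output).
query = [1.0, 0.0, 0.0]
docs = [
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0],  # partially aligned
]

ranking = rank_documents(query, docs)
print(ranking)
```

In practice the rows would come from `model.encode(...)`, with documents encoded once and reused across queries, which is what makes the bi-encoder setup suitable for large-scale retrieval.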
|
| 136 |
+
---
|
| 137 |
|
| 138 |
+
## Framework Versions
|
|
|
|
|
|
|
| 139 |
|
| 140 |
+
Python: 3.10.10
|
|
|
|
|
|
|
| 141 |
|
| 142 |
+
Sentence Transformers: 5.0.0
|
| 143 |
+
|
| 144 |
+
Transformers: 4.52.4
|
| 145 |
+
|
| 146 |
+
PyTorch: 2.7.0+cu128
|
| 147 |
+
|
| 148 |
+
Accelerate: 1.9.0
|
| 149 |
+
|
| 150 |
+
Datasets: 3.6.0
|
| 151 |
+
|
| 152 |
+
Tokenizers: 0.21.1
|
| 153 |
|
| 154 |
|
| 155 |
## Training Details
|
| 156 |
|
| 157 |
### Training Dataset
|
| 158 |
|
| 159 |
+
The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
|
| 160 |
+
|
| 161 |
+
- **Dataset Size:** 35,705 samples
|
| 162 |
+
- **Columns:** `sentence_0`, `sentence_1`, `label`
|
| 163 |
+
|
| 164 |
+
#### Statistics (first 1000 samples)
|
| 165 |
+
|
| 166 |
+
| Field | Type | Mean Tokens | Min | Max |
|
| 167 |
+
|:------|:-----|:-----------:|:---:|:---:|
|
| 168 |
+
| sentence_0 | string | 20.14 | 9 | 103 |
|
| 169 |
+
| sentence_1 | string | 293.14 | 3 | 934 |
|
| 170 |
+
| label | float | 1.0 | 1.0 | 1.0 |
|
| 171 |
|
| 172 |
+
#### Example Schema
|
| 173 |
|
| 174 |
+
| Field | Type | Description |
|
| 175 |
+
|:------|:------|:------------|
|
| 176 |
+
| sentence_0 | string | Query or short text input |
|
| 177 |
+
| sentence_1 | string | Candidate or document text |
|
| 178 |
+
| label | float | Similarity score (1.0 = relevant) |
|
| 179 |
+
|
| 180 |
+
#### Example Samples
|
| 181 |
+
|
| 182 |
+
| sentence_0 | sentence_1 | label |
|
| 183 |
+
|:------------|:-----------|:------:|
|
| 184 |
+
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
|
| 185 |
+
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
### Training Objective and Loss
|
| 190 |
+
|
| 191 |
+
The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
|
| 192 |
+
|
| 193 |
+
- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
|
| 194 |
+
|
| 195 |
+
#### Loss Parameters
|
| 196 |
+
```json
|
| 197 |
+
{
|
| 198 |
+
"scale": 20.0,
|
| 199 |
+
"similarity_fct": "cos_sim"
|
| 200 |
+
}
|
| 201 |
+
```
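
To make the loss parameters above concrete, here is a minimal NumPy sketch of how a multiple-negatives ranking objective is commonly computed: each `sentence_0` is scored against every `sentence_1` in the batch using cosine similarity multiplied by `scale`, and cross-entropy pushes the matching pair (the diagonal) above the in-batch negatives. The function name `mnr_loss` and the toy vectors are assumptions for illustration; this is not the sentence-transformers implementation itself.

```python
import numpy as np


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = np.asarray(anchors, dtype=np.float64)
    p = np.asarray(positives, dtype=np.float64)
    # Unit-normalize so that the dot product is the cosine similarity ("cos_sim").
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # shape: (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i (its own positive).
    return float(-np.mean(np.diag(log_probs)))


# Toy batch (assumed): two anchors with perfectly aligned positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the positives swapped, so every pair is mismatched.
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

A larger `scale` sharpens the softmax, so well-separated pairs drive the loss toward zero while mismatched pairs are penalized more heavily.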
|
| 202 |
## Reference
|
| 203 |
|
| 204 |
```
|
|
|
|
| 209 |
year={2025}
|
| 210 |
}
|
| 211 |
```
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
+
## Model Card Authors
|
| 215 |
+
|
| 216 |
+
Cisco AI Team
|
| 217 |
+
|
| 218 |
+
## Model Card Contact
|
| 219 |
+
For inquiries, please contact [Cisco AI Team]([email protected])
|