|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- cisco-ai/SecureBERT2.0-base |
|
|
pipeline_tag: sentence-similarity |
|
|
library_name: sentence-transformers |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
- dense |
|
|
- securebert |
|
|
- IR |
|
|
- docembedding |
|
|
- generated_from_trainer |
|
|
- dataset_size:35705 |
|
|
- loss:MultipleNegativesRankingLoss |
|
|
widget: |
|
|
- source_sentence: >- |
|
|
What is the primary responsibility of the Information Security Oversight |
|
|
Committee in an organization? |
|
|
sentences: |
|
|
- Least privilege |
|
|
- By searching for repeating ciphertext sequences at fixed displacements. |
|
|
- >- |
|
|
Ensuring and supporting information protection awareness and training |
|
|
programs |
|
|
--- |
|
|
|
|
|
# Model Card for cisco-ai/SecureBERT2.0-biencoder |
|
|
|
|
|
The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base). |
|
|
It encodes queries and documents independently into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Cisco AI |
|
|
- **Model type:** Bi-Encoder (Sentence Transformer) |
|
|
- **Architecture:** ModernBERT backbone with dual encoders |
|
|
- **Max sequence length:** 1024 tokens |
|
|
- **Output dimension:** 768 |
|
|
- **Language:** English |
|
|
- **License:** Apache-2.0 |
|
|
- **Finetuned from:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
- **Semantic search** and **document similarity** in cybersecurity corpora |
|
|
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes |
|
|
- **Document embedding** for retrieval-augmented generation (RAG) and clustering |
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
- Threat intelligence knowledge graph construction |
|
|
- Cybersecurity QA and reasoning systems |
|
|
- Security operations center (SOC) data mining |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- Non-technical or general-domain text similarity |
|
|
- Generative or conversational tasks |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The Bi-Encoder encodes queries and documents **independently** into a joint vector space. |
|
|
This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking. |
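
A minimal sketch of this flow, assuming cosine similarity as the scoring function (the query and corpus below are illustrative placeholders; a production system would swap the exhaustive comparison for an ANN index):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Documents are encoded once, independently of any future query
corpus = [
    "Advisory: remote code execution flaw in a VPN gateway ...",
    "Playbook: triaging phishing emails reported by end users ...",
]
doc_embeddings = model.encode(corpus)

# The query is encoded separately, then scored against all documents
query_embedding = model.encode("How do I triage a reported phishing email?")
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # 1 x 2 tensor of cosine similarities
```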
|
|
|
|
|
--- |
|
|
|
|
|
## Datasets |
|
|
|
|
|
### Fine-Tuning Datasets |
|
|
|
|
|
| Dataset Category | Number of Records |
|:-----------------|:-----------------:|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |
|
|
|
|
|
#### Dataset Descriptions |
|
|
|
|
|
- **Cybersecurity QA corpus:** 43,000 question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- **Security governance QA corpus:** 60,000 expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- **Cybersecurity instruction–response corpus:** 25,000 instructional pairs supporting reasoning and instruction-following.
- **Cybersecurity rules corpus:** 5,000 structured policy and guideline records used for evaluation.
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
### Using Sentence Transformers |
|
|
|
|
|
```bash
pip install -U sentence-transformers
```
|
|
### Encode Sentences
|
|
```python
from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# A query, a relevant passage, and an unrelated passage
sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

# Encode into 768-dimensional embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
|
|
### Compute Similarity |
|
|
```python
from sentence_transformers import util

# Pairwise cosine similarity between all embeddings (a 3 x 3 matrix)
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
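
### Semantic Search

For retrieval over a document collection, `util.semantic_search` ranks candidates by cosine similarity. A minimal sketch (the corpus is an illustrative placeholder; at scale the exhaustive search would typically be replaced by an approximate nearest-neighbor index):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Illustrative corpus; in practice, advisories, reports, or vulnerability notes
corpus = [
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
    "Least privilege restricts each account to the minimum permissions required.",
]
corpus_embeddings = model.encode(corpus)

query_embedding = model.encode("How would you use Amcache analysis to detect fileless malware?")

# Return the top 2 matches with their cosine similarity scores
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```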
|
|
--- |
|
|
|
|
|
## Framework Versions |
|
|
|
|
|
* Python: 3.10.10
* Sentence Transformers: 5.0.0
* Transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* Accelerate: 1.9.0
* Datasets: 3.6.0
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Dataset |
|
|
|
|
|
The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning. |
|
|
|
|
|
- **Dataset Size:** 35,705 samples |
|
|
- **Columns:** `sentence_0`, `sentence_1`, `label` |
|
|
|
|
|
|
|
|
#### Example Schema |
|
|
|
|
|
| Field | Type | Description |
|:------|:------|:------------|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |
|
|
|
|
|
#### Example Samples |
|
|
|
|
|
| sentence_0 | sentence_1 | label |
|:------------|:-----------|:------:|
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
|
|
|
|
|
--- |
|
|
|
|
|
### Training Objective and Loss |
|
|
|
|
|
The model was optimized with contrastive learning: relevant cybersecurity text pairs are pulled together in embedding space while the remaining in-batch examples are treated as negatives.
|
|
|
|
|
- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) |
|
|
|
|
|
#### Loss Parameters |
|
|
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
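
A hedged fine-tuning sketch with this loss using the `SentenceTransformerTrainer` API (the two training rows are illustrative samples from the table above, not the actual training pipeline; MultipleNegativesRankingLoss needs only positive pairs, since the other `sentence_1` entries in a batch serve as negatives):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Start from the base checkpoint (pooling defaults may differ from the released model)
model = SentenceTransformer("cisco-ai/SecureBERT2.0-base")

# Illustrative (query, relevant passage) pairs following the schema above
train_dataset = Dataset.from_dict({
    "sentence_0": [
        "Under what circumstances does attribution bias distort intrusion linking?",
        "How can you identify store buffer bypass speculation artifacts?",
    ],
    "sentence_1": [
        "Attribution bias in intrusion linking occurs when analysts allow preconceived notions ...",
        "Store buffer bypass speculation artifacts represent side-channel vulnerabilities ...",
    ],
})

# scale=20.0 and cosine similarity match the loss parameters listed above
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```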
|
|
## Reference |
|
|
|
|
|
```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Cisco AI |
|
|
|
|
|
## Model Card Contact |
|
|
For inquiries, please contact [[email protected]](mailto:[email protected]) |