---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- securebert
- IR
- docembedding
- generated_from_trainer
- dataset_size:35705
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
What is the primary responsibility of the Information Security Oversight
Committee in an organization?
sentences:
- Least privilege
- By searching for repeating ciphertext sequences at fixed displacements.
- >-
Ensuring and supporting information protection awareness and training
programs
---
# Model Card for cisco-ai/SecureBERT2.0-biencoder
The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base).
It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
---
## Model Details
### Model Description
- **Developed by:** Cisco AI
- **Model type:** Bi-Encoder (Sentence Transformer)
- **Architecture:** ModernBERT backbone with dual encoders
- **Max sequence length:** 1024 tokens
- **Output dimension:** 768
- **Language:** English
- **License:** Apache-2.0
- **Finetuned from:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)
---
## Uses
### Direct Use
- **Semantic search** and **document similarity** in cybersecurity corpora
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
- **Document embedding** for retrieval-augmented generation (RAG) and clustering
### Downstream Use
- Threat intelligence knowledge graph construction
- Cybersecurity QA and reasoning systems
- Security operations center (SOC) data mining
### Out-of-Scope Use
- Non-technical or general-domain text similarity
- Generative or conversational tasks
---
## Model Architecture
The Bi-Encoder encodes queries and documents **independently** into a joint vector space, so document embeddings can be precomputed and indexed ahead of query time.
This architecture enables scalable **approximate nearest-neighbor (ANN) search** for candidate retrieval and semantic ranking.
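A minimal sketch of this pattern, assuming `faiss-cpu` is installed (a flat inner-product index stands in for a true ANN index such as IVF or HNSW, which follow the same add/search flow; the documents and query are illustrative):
```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Documents are embedded once, independently of any future query
docs = [
    "Amcache analysis provides forensic artifacts for detecting fileless malware.",
    "Least privilege restricts each account to the permissions it needs.",
]
doc_vecs = model.encode(docs).astype(np.float32)
faiss.normalize_L2(doc_vecs)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # 768-dimensional index
index.add(doc_vecs)

# Queries are embedded separately and searched against the prebuilt index
query = model.encode(["How can fileless malware be detected forensically?"]).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)
print(docs[ids[0][0]], scores[0][0])
```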
---
## Datasets
### Fine-Tuning Datasets
| Dataset Category | Number of Records |
|:-----------------|:-----------------:|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |
#### Dataset Descriptions
- **Cybersecurity QA corpus:** 43,000 question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- **Security governance QA corpus:** 60,000 expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- **Cybersecurity instruction–response corpus:** 25,000 instructional pairs enabling reasoning and instruction-following.
- **Cybersecurity rules corpus:** 5,000 structured policy and guideline records used for evaluation.
---
## How to Get Started with the Model
### Using Sentence Transformers
```bash
pip install -U sentence-transformers
```
### Run Model to Encode
```python
from sentence_transformers import SentenceTransformer

# Load the bi-encoder from the Hugging Face Hub
model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

# Each sentence is mapped to a 768-dimensional dense vector
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
### Compute Similarity
```python
from sentence_transformers import util

# Pairwise cosine similarities as a 3x3 matrix; row 0 (the question) should
# score higher against the relevant answer (column 1) than the distractor (column 2)
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```
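### Run Semantic Search
Continuing from the snippets above, `util.semantic_search` ranks candidate embeddings against a query by cosine similarity (the `top_k` value here is illustrative):
```python
# Embed the question and rank the two candidate answers (sentences 1 and 2)
query_embedding = model.encode("How would you use Amcache analysis to detect fileless malware?")
hits = util.semantic_search(query_embedding, embeddings[1:], top_k=2)
for hit in hits[0]:
    print(sentences[1 + hit["corpus_id"]], round(hit["score"], 3))
```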
---
## Framework Versions
* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0
---
## Training Details
### Training Dataset
The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
- **Dataset Size:** 35,705 samples
- **Columns:** `sentence_0`, `sentence_1`, `label`
#### Example Schema
| Field | Type | Description |
|:------|:------|:------------|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |
#### Example Samples
| sentence_0 | sentence_1 | label |
|:------------|:-----------|:------:|
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
---
### Training Objective and Loss
The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
#### Loss Parameters
```json
{
"scale": 20.0,
"similarity_fct": "cos_sim"
}
```
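As a minimal sketch, the same objective can be set up with the Sentence Transformers trainer API; the two pairs below are taken from the example samples above, while the default hyperparameters are illustrative rather than the actual training configuration:
```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("cisco-ai/SecureBERT2.0-base")

# (anchor, positive) pairs: every other positive in a batch acts as an
# in-batch negative under MultipleNegativesRankingLoss
train_dataset = Dataset.from_dict({
    "anchor": [
        "Under what circumstances does attribution bias distort intrusion linking?",
        "How can you identify store buffer bypass speculation artifacts?",
    ],
    "positive": [
        "Attribution bias in intrusion linking occurs when analysts allow preconceived notions ...",
        "Store buffer bypass speculation artifacts represent side-channel vulnerabilities ...",
    ],
})

# scale=20.0 with cosine similarity matches the loss parameters listed above
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```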
## Reference
```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```
---
## Model Card Authors
Cisco AI
## Model Card Contact
For inquiries, please contact [[email protected]](mailto:[email protected])