cisco-ai/SecureBERT2.0-biencoder

@@ -1,9 +1,19 @@
 ---
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
 - dense
 - generated_from_trainer
 - dataset_size:35705
 - loss:MultipleNegativesRankingLoss
@@ -17,163 +27,178 @@ widget:
   - >-
     Ensuring and supporting information protection awareness and training
     programs
-pipeline_tag: sentence-similarity
-library_name: sentence-transformers
-base_model:
-- CiscoAITeam/SecureBERT2.0-base
 ---
-# SecureBERT 2.0 for Document Embedding and Similarity Search Model (bi-encoder)
-This is a **Bi-Encoder** model fine-tuned on top of [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-code-vuln-detection), a cybersecurity domain-specific Model. It computes similarity scores for pairs of texts, which can be used for text ranking, semantic search, documnet embedding or other cybersecurity-related natural language tasks.
-Document embeddings are central to modern cybersecurity pipelines, enabling efficient use of large and complex text corpora. They power applications such as **Retrieval-Augmented Generation (RAG)**, semantic search, ranking, and threat intelligence retrieval.
-- In **RAG**, embeddings retrieve contextually relevant documents that improve generation accuracy.
-- Embedding-based ranking prioritizes advisories, vulnerability reports, and incident descriptions.
-- Unlike keyword-based search, embedding-driven retrieval supports **semantic matching** for tasks such as threat hunting, compliance checking, and knowledge management.
 ---
-## Architecture
-- **Bi-encoder** encode queries and documents independently into a shared vector space, enabling scalable approximate nearest-neighbor retrieval for initial candidate selection.
----
-## Datasets
-We fine-tuned embedding models using multiple cybersecurity-specific datasets:
-| Dataset Category | Number of Records |
-|-----------------|-----------------|
-| Cybersecurity QA corpus | 43,000 |
-| Security governance QA corpus | 60,000 |
-| Cybersecurity instruction–response corpus | 25,000 |
-| Cybersecurity rules corpus (evaluation) | 5,000 |
-### Dataset Descriptions
-- **Cybersecurity QA corpus:**
-  ~43k records of question–answer pairs, incident reports, and domain knowledge across subdomains such as network security, malware analysis, cryptography, and cloud security. Provides varied technical text for both precise retrieval and contextual understanding.
-- **Security governance QA corpus:**
-  ~60k curated QA pairs on governance, compliance, vulnerability management, and exploit analysis. Emphasizes concise, expert-validated answers for robust semantic generalization.
-- **Cybersecurity instruction–response corpus:**
-  ~25k instruction–response pairs (e.g., “Describe mitigation techniques for cross-site scripting”), designed for instruction-following and contextual reasoning. Supports improved reranking and semantic search.
-- **Cybersecurity rules corpus (evaluation):**
-  ~5k structured cybersecurity policies, guidelines, and best practices. Used as an evaluation benchmark for assessing retrieval quality against standards and compliance rules.
-## Usage
-### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
-#### Semantic Similarity:
 ```python
 from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
 model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
-# Run inference
 sentences = [
-    'How would you use Amcache analysis to detect fileless malware that drops temporary components for initial system compromise?',
-    'Amcache analysis provides critical forensic artifacts for detecting fileless malware employing temporary component deployment during initial system compromise, aligning with MITRE ATT&CK techniques T1055 (Process Injection) and T1620 (Reflective Code Loading).\\n\\n**Amcache Artifact Analysis Framework:**\\n\\nThe Amcache.hve registry hive maintains comprehensive application execution metadata, including file paths, hashes, and execution timestamps. For fileless malware detection, focus on:\\n\\n1. **Temporary File Creation Patterns**: Analyze entries with suspicious temporal clustering in the \\\\\\"Programs\\\\\\" key, particularly executables stored in system directories (C:\\\\\\\\Windows\\\\\\\\Temp, C:\\\\\\\\Users\\\\\\\\[User]\\\\\\\\AppData\\\\\\\\Local\\\\\\\\Temp). Legitimate applications typically exhibit predictable installation patterns, while malicious components often manifest as isolated, recently-created executables.\\n\\n2. **Hash-Based Indicators**: Cross-reference SHA-1 hashes against threat intelligence feeds and known malware signatures. Fileless malware frequently employs legitimate system binaries for process hollowing (T1055.012) or reflective DLL loading (T1620), making hash analysis crucial for identifying repurposed executables.\\n\\n3. **Execution Chain Analysis**: Examine parent-child relationships within Amcache entries to identify anomalous process spawning patterns. 
Fileless malware often exhibits unusual execution chains, particularly when temporary components spawn from unexpected parent processes or system services.\\n\\n**NIST CSF Implementation Strategy:**\\n\\nUnder the Detect (DE) function, specifically DE.AE-2 (Detected events are analyzed), implement continuous Amcache monitoring through:\\n\\n- **Baseline Establishment**: Create organizational baselines for normal temporary file creation patterns and execution behaviors\\n- **Anomaly Detection**: Deploy automated analysis tools to identify deviations from established baselines\\n- **Correlation Analysis**: Integrate Amcache findings with network traffic analysis and endpoint detection systems\\n\\n**Advanced Detection Methodologies:**\\n\\nUtilize PowerShell-based parsing scripts or specialized forensic tools like KAPE to extract and analyze Amcache artifacts. Focus on:\\n\\n- Unusual file extensions in temporary directories\\n- Executables created immediately before suspicious network activity\\n- Components with execution timestamps correlating with initial access events\\n- Hash collisions or similarities between temporary files and known malware families\\n\\nThis approach enables proactive identification of fileless malware campaigns leveraging temporary components for system compromise, supporting comprehensive threat hunting and incident response activities within enterprise environments.',
-    'To capture and display network traffic',
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities)
-# tensor([[ 1.0000,  0.8653,  0.0078],
-#         [ 0.8653,  1.0000, -0.0407],
-#         [ 0.0078, -0.0407,  1.0000]])
 ```
-#### Doc/Query Embedding
-```python
-from sentence_transformers import CrossEncoder, util
-from transformers import AutoTokenizer, TFAutoModel
-# Load the bi-encoder
-tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder"")
-model = TFAutoModel.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder"")
-# Encode queries and documents
-query_embedding = model(tokenizer("Example query", return_tensors="tf"))[0]
-doc_embedding = model(tokenizer("Candidate document", return_tensors="tf"))[0]
-# Compute similarity
-import tensorflow as tf
-similarity = tf.keras.losses.cosine_similarity(query_embedding, doc_embedding)
-print(similarity)
-```
 ## Training Details
 ### Training Dataset
-#### Unnamed Dataset
-* Size: 35,705 training samples
-* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
-* Approximate statistics based on the first 1000 samples:
-  |         | sentence_0                                                                         | sentence_1                                                                          | label                                                         |
-  |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:--------------------------------------------------------------|
-  | type    | string                                                                             | string                                                                              | float                                                         |
-  | details | <ul><li>min: 9 tokens</li><li>mean: 20.14 tokens</li><li>max: 103 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 293.14 tokens</li><li>max: 934 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
-* Samples:
-  | sentence_0                                                                                                                       | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | label            |
-  |:---------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
-  | <code>Under what circumstances does attribution bias distort intrusion linking, and what countermeasures can mitigate it?</code> | <code>Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents. This distortion manifests through several mechanisms: confirmation bias leads analysts to interpret ambiguous indicators as evidence supporting their initial hypotheses; availability heuristic causes over-reliance on memorable or recent attacks; and anchoring effect results in disproportionate weight given to early intelligence assessments.\n\nThe MITRE ATT&CK framework's Groups matrix exemplifies how attribution bias emerges when linking tactics, techniques, and procedures (TTPs) across campaigns. Analysts may incorrectly associate similar tools or methodologies without sufficient evidence of operational connection, particularly when adversaries employ publicly available malware or common techniques across multiple threat groups. This challenge intensifies during high...</code> | <code>1.0</code> |
-  | <code>How can you identify store buffer bypass speculation artifacts?</code>                                                     | <code>Store buffer bypass speculation artifacts represent sophisticated side-channel vulnerabilities that exploit microarchitectural features in modern processors. These vulnerabilities occur when speculative execution accesses memory locations before stores are committed, creating observable timing differences that leak sensitive information.\n\n**Detection Methodologies:**\n\nStatic analysis tools can identify potential store buffer bypass vulnerabilities by examining assembly code patterns where load instructions follow store operations without proper memory barriers. Tools like Intel's Pin or LLVM-based analyzers can detect missing lfence/stfence instructions that prevent speculative execution from bypassing stores.\n\nDynamic analysis approaches involve precise timing measurements using high-resolution performance counters. Researchers measure memory access latencies to identify anomalous patterns where load operations complete faster than expected, indicating potential speculation thro...</code> | <code>1.0</code> |
-  | <code>How would you develop methods for analyzing attacks targeting industrial control systems?</code>                           | <code>Developing comprehensive methods for analyzing attacks targeting Industrial Control Systems (ICS) requires a multi-layered approach integrating specialized frameworks and industry standards.\n\n**Framework Integration**: Begin with NIST's Cybersecurity Framework, particularly the Identify function, to catalog ICS assets, data flows, and criticality levels. The Detect function provides guidance on implementing continuous monitoring capabilities tailored for operational technology environments. MITRE ATT&CK for ICS offers crucial tactical intelligence, mapping adversary behaviors specific to industrial contexts through techniques like \\\"Inhibit Response Function\\\" (T0803) or \\\"Modify Parameter\\\" (T0832).\n\n**Technical Analysis Methodology**: Establish baseline behavioral profiles for normal ICS operations using network traffic analysis, protocol inspection, and system state monitoring. Deploy specialized tools capable of deep packet inspection for industrial protocols (Modbus, D...</code> | <code>1.0</code> |
-* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
-  ```json
-  {
-      "scale": 20.0,
-      "similarity_fct": "cos_sim"
-  }
-  ```
-### Training Hyperparameters
-#### Non-Default Hyperparameters
-- `eval_strategy`: steps
-- `per_device_train_batch_size`: 32
-- `per_device_eval_batch_size`: 32
-- `num_train_epochs`: 20
-- `multi_dataset_batch_sampler`: round_robin
-### Framework Versions
-- Python: 3.10.10
-- Sentence Transformers: 5.0.0
-- Transformers: 4.52.4
-- PyTorch: 2.7.0+cu128
-- Accelerate: 1.9.0
-- Datasets: 3.6.0
-- Tokenizers: 0.21.1
 ## Reference
 ```
@@ -184,3 +209,11 @@ print(similarity)
   year={2025}
 }
 ```

 ---
+license: apache-2.0
+language:
+- en
+base_model:
+- CiscoAITeam/SecureBERT2.0-base
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
 tags:
 - sentence-transformers
 - sentence-similarity
 - feature-extraction
 - dense
+- securebert
+- IR
+- docembedding
 - generated_from_trainer
 - dataset_size:35705
 - loss:MultipleNegativesRankingLoss
   - >-
     Ensuring and supporting information protection awareness and training
     programs
 ---
+# Model Card for CiscoAITeam/SecureBERT2.0-biencoder
+The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base).
+It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
+---
+## Model Details
+### Model Description
+- **Developed by:** Cisco AI Team
+- **Model type:** Bi-Encoder (Sentence Transformer)
+- **Architecture:** ModernBERT backbone with dual encoders
+- **Max sequence length:** 1024 tokens
+- **Output dimension:** 768
+- **Language:** English
+- **License:** Apache-2.0
+- **Finetuned from:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
+### Model Sources
+- **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder](https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder)
+- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)
 ---
+## Uses
+### Direct Use
+- **Semantic search** and **document similarity** in cybersecurity corpora
+- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
+- **Document embedding** for retrieval-augmented generation (RAG) and clustering
+### Downstream Use
+- Threat intelligence knowledge graph construction
+- Cybersecurity QA and reasoning systems
+- Security operations center (SOC) data mining
+### Out-of-Scope Use
+- Non-technical or general-domain text similarity
+- Generative or conversational tasks
+---
+## Model Architecture
+The Bi-Encoder encodes queries and documents **independently** into a joint vector space.
+This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking.
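A minimal sketch of the independent-encoding idea described above, in pure Python. The `toy_encode` function is a hypothetical stand-in for the model's `encode` call (a hashed bag-of-words, not a learned embedding), and `build_index` and `search` are illustrative names that are not part of the model card. The point is only that document vectors are computed once, and each query then costs one encoding plus a set of dot products. A production system would likely replace the exhaustive scan with an approximate nearest-neighbor index.

```python
import hashlib
import math

DIM = 256  # toy dimension (assumption); the card reports 768 for the real model


def toy_encode(text):
    """Hypothetical stand-in for model.encode: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Inputs are already unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


def build_index(docs):
    # Documents are encoded once, independently of any query.
    return [(doc, toy_encode(doc)) for doc in docs]


def search(query, index, top_k=2):
    # The query is encoded on its own, then compared against stored vectors.
    q = toy_encode(query)
    scored = [(cosine(q, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = [
    "amcache analysis for fileless malware detection",
    "capture and display network traffic",
    "password rotation policy for service accounts",
]
index = build_index(docs)
results = search("detect fileless malware with amcache", index)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

With the real model, `toy_encode` would be replaced by `model.encode(...)` and the rest of the flow stays the same.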
+---
+## Datasets
+### Fine-Tuning Datasets
+| Dataset Category | Number of Records |
+|:-----------------|:-----------------:|
+| Cybersecurity QA corpus | 43,000 |
+| Security governance QA corpus | 60,000 |
+| Cybersecurity instruction–response corpus | 25,000 |
+| Cybersecurity rules corpus (evaluation) | 5,000 |
+#### Dataset Descriptions
+- **Cybersecurity QA corpus:** 43k question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
+- **Security governance QA corpus:** 60k expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
+- **Cybersecurity instruction–response corpus:** 25k instruction–response pairs supporting instruction-following and contextual reasoning.
+- **Cybersecurity rules corpus (evaluation):** 5k structured policy and guideline records used for evaluation.
+---
+## How to Get Started with the Model
+### Using Sentence Transformers
 ```bash
 pip install -U sentence-transformers
 ```
+### Run Model to Encode
 ```python
 from sentence_transformers import SentenceTransformer
 model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
 sentences = [
+    "How would you use Amcache analysis to detect fileless malware?",
+    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
+    "To capture and display network traffic"
 ]
 embeddings = model.encode(sentences)
 print(embeddings.shape)
 ```
+### Compute Similarity
+```python
+from sentence_transformers import util
+similarity = util.cos_sim(embeddings, embeddings)
+print(similarity)
+```
+---
+## Framework Versions
+- Python: 3.10.10
+- Sentence Transformers: 5.0.0
+- Transformers: 4.52.4
+- PyTorch: 2.7.0+cu128
+- Accelerate: 1.9.0
+- Datasets: 3.6.0
+- Tokenizers: 0.21.1
 ## Training Details
 ### Training Dataset
+The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
+- **Dataset Size:** 35,705 samples
+- **Columns:** `sentence_0`, `sentence_1`, `label`
+#### Statistics (first 1000 samples)
+| Field | Type | Mean Tokens | Min | Max |
+|:------|:-----|:-----------:|:---:|:---:|
+| sentence_0 | string | 20.14 | 9 | 103 |
+| sentence_1 | string | 293.14 | 3 | 934 |
+| label | float | 1.0 | 1.0 | 1.0 |
+#### Example Schema
+| Field | Type | Description |
+|:------|:------|:------------|
+| sentence_0 | string | Query or short text input |
+| sentence_1 | string | Candidate or document text |
+| label | float | Similarity score (1.0 = relevant) |
+#### Example Samples
+| sentence_0 | sentence_1 | label |
+|:------------|:-----------|:------:|
+| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
+| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
+---
+### Training Objective and Loss
+The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
+- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
+#### Loss Parameters
+```json
+{
+    "scale": 20.0,
+    "similarity_fct": "cos_sim"
+}
+```
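To make these parameters concrete: MultipleNegativesRankingLoss treats each `(sentence_0, sentence_1)` pair in a batch as the positive and every other `sentence_1` in the same batch as a negative. The sketch below recomputes that objective in pure Python on hypothetical 2-dimensional vectors (not real model outputs). Cosine similarities are multiplied by `scale` and passed to a softmax cross-entropy whose target is the matching index. The function name `mnr_loss` is illustrative; this shows the formula, not the library's implementation.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over in-batch candidates; row i's target is column i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, cand) for cand in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Hypothetical toy embeddings for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is closest to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is closest to the other's positive

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

A larger `scale` sharpens the softmax, so well-separated pairs drive the loss closer to zero while mismatched pairs are penalized more heavily.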
 ## Reference
 ```
   year={2025}
 }
 ```
+---
+## Model Card Authors
+Cisco AI Team
+## Model Card Contact
+For inquiries, please contact [Cisco AI Team]([email protected])