cisco-ehsan committed on
Commit
bf4e99b
·
verified ·
1 Parent(s): b040c8c

Update README.md

Files changed (1)
  1. README.md +141 -108
README.md CHANGED
@@ -1,9 +1,19 @@
1
  ---
 
 
 
 
 
 
 
2
  tags:
3
  - sentence-transformers
4
  - sentence-similarity
5
  - feature-extraction
6
  - dense
 
 
 
7
  - generated_from_trainer
8
  - dataset_size:35705
9
  - loss:MultipleNegativesRankingLoss
@@ -17,163 +27,178 @@ widget:
17
  - >-
18
  Ensuring and supporting information protection awareness and training
19
  programs
20
-
21
- pipeline_tag: sentence-similarity
22
- library_name: sentence-transformers
23
- base_model:
24
- - CiscoAITeam/SecureBERT2.0-base
25
  ---
26
 
27
- # SecureBERT 2.0 for Document Embedding and Similarity Search Model (bi-encoder)
28
 
29
- This is a **Bi-Encoder** model fine-tuned on top of [**SecureBERT 2.0**](CiscoAITeam/SecureBERT2.0-code-vuln-detection), a cybersecurity domain-specific model. It computes similarity scores for pairs of texts, which can be used for text ranking, semantic search, document embedding or other cybersecurity-related natural language tasks.
 
 
 
30
 
 
31
 
 
32
 
33
- Document embeddings are central to modern cybersecurity pipelines, enabling efficient use of large and complex text corpora. They power applications such as **Retrieval-Augmented Generation (RAG)**, semantic search, ranking, and threat intelligence retrieval.
 
 
 
 
 
 
 
34
 
35
- - In **RAG**, embeddings retrieve contextually relevant documents that improve generation accuracy.
36
- - Embedding-based ranking prioritizes advisories, vulnerability reports, and incident descriptions.
37
- - Unlike keyword-based search, embedding-driven retrieval supports **semantic matching** for tasks such as threat hunting, compliance checking, and knowledge management.
 
38
 
39
  ---
40
 
41
- ## Architecture
42
 
43
- - **Bi-encoder** encode queries and documents independently into a shared vector space, enabling scalable approximate nearest-neighbor retrieval for initial candidate selection.
44
 
45
- ---
 
 
46
 
47
- ## Datasets
48
 
49
- We fine-tuned embedding models using multiple cybersecurity-specific datasets:
 
 
50
 
51
- | Dataset Category | Number of Records |
52
- |-----------------|-----------------|
53
- | Cybersecurity QA corpus | 43,000 |
54
- | Security governance QA corpus | 60,000 |
55
- | Cybersecurity instruction–response corpus | 25,000 |
56
- | Cybersecurity rules corpus (evaluation) | 5,000 |
57
 
58
- ### Dataset Descriptions
 
 
 
 
 
59
 
60
- - **Cybersecurity QA corpus:**
61
- ~43k records of question–answer pairs, incident reports, and domain knowledge across subdomains such as network security, malware analysis, cryptography, and cloud security. Provides varied technical text for both precise retrieval and contextual understanding.
62
 
63
- - **Security governance QA corpus:**
64
- ~60k curated QA pairs on governance, compliance, vulnerability management, and exploit analysis. Emphasizes concise, expert-validated answers for robust semantic generalization.
65
 
66
- - **Cybersecurity instruction–response corpus:**
67
- ~25k instruction–response pairs (e.g., “Describe mitigation techniques for cross-site scripting”), designed for instruction-following and contextual reasoning. Supports improved reranking and semantic search.
 
 
 
 
68
 
69
- - **Cybersecurity rules corpus (evaluation):**
70
- ~5k structured cybersecurity policies, guidelines, and best practices. Used as an evaluation benchmark for assessing retrieval quality against standards and compliance rules.
71
 
 
 
 
 
72
 
73
- ## Usage
74
 
75
- ### Direct Usage (Sentence Transformers)
76
 
77
- First install the Sentence Transformers library:
78
 
79
  ```bash
80
  pip install -U sentence-transformers
81
  ```
82
-
83
- Then you can load this model and run inference.
84
-
85
- #### Semantic Similarity:
86
  ```python
87
  from sentence_transformers import SentenceTransformer
88
 
89
- # Download from the 🤗 Hub
90
  model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
91
- # Run inference
92
  sentences = [
93
- 'How would you use Amcache analysis to detect fileless malware that drops temporary components for initial system compromise?',
94
- 'Amcache analysis provides critical forensic artifacts for detecting fileless malware employing temporary component deployment during initial system compromise, aligning with MITRE ATT&CK techniques T1055 (Process Injection) and T1620 (Reflective Code Loading).\\n\\n**Amcache Artifact Analysis Framework:**\\n\\nThe Amcache.hve registry hive maintains comprehensive application execution metadata, including file paths, hashes, and execution timestamps. For fileless malware detection, focus on:\\n\\n1. **Temporary File Creation Patterns**: Analyze entries with suspicious temporal clustering in the \\\\\\"Programs\\\\\\" key, particularly executables stored in system directories (C:\\\\\\\\Windows\\\\\\\\Temp, C:\\\\\\\\Users\\\\\\\\[User]\\\\\\\\AppData\\\\\\\\Local\\\\\\\\Temp). Legitimate applications typically exhibit predictable installation patterns, while malicious components often manifest as isolated, recently-created executables.\\n\\n2. **Hash-Based Indicators**: Cross-reference SHA-1 hashes against threat intelligence feeds and known malware signatures. Fileless malware frequently employs legitimate system binaries for process hollowing (T1055.012) or reflective DLL loading (T1620), making hash analysis crucial for identifying repurposed executables.\\n\\n3. **Execution Chain Analysis**: Examine parent-child relationships within Amcache entries to identify anomalous process spawning patterns. 
Fileless malware often exhibits unusual execution chains, particularly when temporary components spawn from unexpected parent processes or system services.\\n\\n**NIST CSF Implementation Strategy:**\\n\\nUnder the Detect (DE) function, specifically DE.AE-2 (Detected events are analyzed), implement continuous Amcache monitoring through:\\n\\n- **Baseline Establishment**: Create organizational baselines for normal temporary file creation patterns and execution behaviors\\n- **Anomaly Detection**: Deploy automated analysis tools to identify deviations from established baselines\\n- **Correlation Analysis**: Integrate Amcache findings with network traffic analysis and endpoint detection systems\\n\\n**Advanced Detection Methodologies:**\\n\\nUtilize PowerShell-based parsing scripts or specialized forensic tools like KAPE to extract and analyze Amcache artifacts. Focus on:\\n\\n- Unusual file extensions in temporary directories\\n- Executables created immediately before suspicious network activity\\n- Components with execution timestamps correlating with initial access events\\n- Hash collisions or similarities between temporary files and known malware families\\n\\nThis approach enables proactive identification of fileless malware campaigns leveraging temporary components for system compromise, supporting comprehensive threat hunting and incident response activities within enterprise environments.',
95
- 'To capture and display network traffic',
96
  ]
 
97
  embeddings = model.encode(sentences)
98
  print(embeddings.shape)
99
- # [3, 768]
100
-
101
- # Get the similarity scores for the embeddings
102
- similarities = model.similarity(embeddings, embeddings)
103
- print(similarities)
104
- # tensor([[ 1.0000, 0.8653, 0.0078],
105
- # [ 0.8653, 1.0000, -0.0407],
106
- # [ 0.0078, -0.0407, 1.0000]])
107
  ```
 
 
 
 
 
108
 
109
- #### Doc/Query Embedding
 
110
 
111
- ```python
112
- from sentence_transformers import CrossEncoder, util
113
- from transformers import AutoTokenizer, TFAutoModel
114
 
115
- # Load the bi-encoder
116
- tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder")
117
- model = TFAutoModel.from_pretrained("CiscoAITeam/SecureBERT2.0-biencoder")
118
 
119
- # Encode queries and documents
120
- query_embedding = model(tokenizer("Example query", return_tensors="tf"))[0]
121
- doc_embedding = model(tokenizer("Candidate document", return_tensors="tf"))[0]
 
 
 
 
 
 
 
 
122
 
123
- # Compute similarity
124
- import tensorflow as tf
125
- similarity = tf.keras.losses.cosine_similarity(query_embedding, doc_embedding)
126
- print(similarity)
127
- ```
128
 
129
  ## Training Details
130
 
131
  ### Training Dataset
132
 
133
- #### Unnamed Dataset
134
-
135
- * Size: 35,705 training samples
136
- * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
137
- * Approximate statistics based on the first 1000 samples:
138
- | | sentence_0 | sentence_1 | label |
139
- |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:--------------------------------------------------------------|
140
- | type | string | string | float |
141
- | details | <ul><li>min: 9 tokens</li><li>mean: 20.14 tokens</li><li>max: 103 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 293.14 tokens</li><li>max: 934 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
142
- * Samples:
143
- | sentence_0 | sentence_1 | label |
144
- |:---------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
145
- | <code>Under what circumstances does attribution bias distort intrusion linking, and what countermeasures can mitigate it?</code> | <code>Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents. This distortion manifests through several mechanisms: confirmation bias leads analysts to interpret ambiguous indicators as evidence supporting their initial hypotheses; availability heuristic causes over-reliance on memorable or recent attacks; and anchoring effect results in disproportionate weight given to early intelligence assessments.\n\nThe MITRE ATT&CK framework's Groups matrix exemplifies how attribution bias emerges when linking tactics, techniques, and procedures (TTPs) across campaigns. Analysts may incorrectly associate similar tools or methodologies without sufficient evidence of operational connection, particularly when adversaries employ publicly available malware or common techniques across multiple threat groups. This challenge intensifies during high...</code> | <code>1.0</code> |
146
- | <code>How can you identify store buffer bypass speculation artifacts?</code> | <code>Store buffer bypass speculation artifacts represent sophisticated side-channel vulnerabilities that exploit microarchitectural features in modern processors. These vulnerabilities occur when speculative execution accesses memory locations before stores are committed, creating observable timing differences that leak sensitive information.\n\n**Detection Methodologies:**\n\nStatic analysis tools can identify potential store buffer bypass vulnerabilities by examining assembly code patterns where load instructions follow store operations without proper memory barriers. Tools like Intel's Pin or LLVM-based analyzers can detect missing lfence/stfence instructions that prevent speculative execution from bypassing stores.\n\nDynamic analysis approaches involve precise timing measurements using high-resolution performance counters. Researchers measure memory access latencies to identify anomalous patterns where load operations complete faster than expected, indicating potential speculation thro...</code> | <code>1.0</code> |
147
- | <code>How would you develop methods for analyzing attacks targeting industrial control systems?</code> | <code>Developing comprehensive methods for analyzing attacks targeting Industrial Control Systems (ICS) requires a multi-layered approach integrating specialized frameworks and industry standards.\n\n**Framework Integration**: Begin with NIST's Cybersecurity Framework, particularly the Identify function, to catalog ICS assets, data flows, and criticality levels. The Detect function provides guidance on implementing continuous monitoring capabilities tailored for operational technology environments. MITRE ATT&CK for ICS offers crucial tactical intelligence, mapping adversary behaviors specific to industrial contexts through techniques like \\\"Inhibit Response Function\\\" (T0803) or \\\"Modify Parameter\\\" (T0832).\n\n**Technical Analysis Methodology**: Establish baseline behavioral profiles for normal ICS operations using network traffic analysis, protocol inspection, and system state monitoring. Deploy specialized tools capable of deep packet inspection for industrial protocols (Modbus, D...</code> | <code>1.0</code> |
148
- * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
149
- ```json
150
- {
151
- "scale": 20.0,
152
- "similarity_fct": "cos_sim"
153
- }
154
- ```
155
-
156
- ### Training Hyperparameters
157
- #### Non-Default Hyperparameters
158
-
159
- - `eval_strategy`: steps
160
- - `per_device_train_batch_size`: 32
161
- - `per_device_eval_batch_size`: 32
162
- - `num_train_epochs`: 20
163
- - `multi_dataset_batch_sampler`: round_robin
164
-
165
-
166
-
167
- ### Framework Versions
168
- - Python: 3.10.10
169
- - Sentence Transformers: 5.0.0
170
- - Transformers: 4.52.4
171
- - PyTorch: 2.7.0+cu128
172
- - Accelerate: 1.9.0
173
- - Datasets: 3.6.0
174
- - Tokenizers: 0.21.1
175
 
 
176
 
177
  ## Reference
178
 
179
  ```
@@ -184,3 +209,11 @@ print(similarity)
184
  year={2025}
185
  }
186
  ```
 
 
 
 
 
 
 
 
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - CiscoAITeam/SecureBERT2.0-base
7
+ pipeline_tag: sentence-similarity
8
+ library_name: sentence-transformers
9
  tags:
10
  - sentence-transformers
11
  - sentence-similarity
12
  - feature-extraction
13
  - dense
14
+ - securebert
15
+ - IR
16
+ - docembedding
17
  - generated_from_trainer
18
  - dataset_size:35705
19
  - loss:MultipleNegativesRankingLoss
 
27
  - >-
28
  Ensuring and supporting information protection awareness and training
29
  programs
 
 
 
 
 
30
  ---
31
 
32
+ # Model Card for CiscoAITeam/SecureBERT2.0-biencoder
33
 
34
+ The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base).
35
+ It independently encodes queries and documents into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.
36
+
37
+ ---
38
 
39
+ ## Model Details
40
 
41
+ ### Model Description
42
 
43
+ - **Developed by:** Cisco AI Team
44
+ - **Model type:** Bi-Encoder (Sentence Transformer)
45
+ - **Architecture:** ModernBERT backbone with dual encoders
46
+ - **Max sequence length:** 1024 tokens
47
+ - **Output dimension:** 768
48
+ - **Language:** English
49
+ - **License:** Apache-2.0
50
+ - **Finetuned from:** [CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
51
 
52
+ ### Model Sources
53
+
54
+ - **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder](https://huggingface.co/CiscoAITeam/SecureBERT2.0-biencoder)
55
+ - **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)
56
 
57
  ---
58
 
59
+ ## Uses
60
 
61
+ ### Direct Use
62
 
63
+ - **Semantic search** and **document similarity** in cybersecurity corpora
64
+ - **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
65
+ - **Document embedding** for retrieval-augmented generation (RAG) and clustering
66
 
67
+ ### Downstream Use
68
 
69
+ - Threat intelligence knowledge graph construction
70
+ - Cybersecurity QA and reasoning systems
71
+ - Security operations center (SOC) data mining
72
 
73
+ ### Out-of-Scope Use
74
+
75
+ - Non-technical or general-domain text similarity
76
+ - Generative or conversational tasks
77
+
78
+ ---
79
 
80
+ ## Model Architecture
81
+
82
+ The Bi-Encoder encodes queries and documents **independently** into a joint vector space.
83
+ This architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking.
84
+
85
+ ---
86
 
87
+ ## Datasets
 
88
 
89
+ ### Fine-Tuning Datasets
 
90
 
91
+ | Dataset Category | Number of Records |
92
+ |:-----------------|:-----------------:|
93
+ | Cybersecurity QA corpus | 43 000 |
94
+ | Security governance QA corpus | 60 000 |
95
+ | Cybersecurity instruction–response corpus | 25 000 |
96
+ | Cybersecurity rules corpus (evaluation) | 5 000 |
97
 
98
+ #### Dataset Descriptions
 
99
 
100
+ - **Cybersecurity QA corpus:** 43 k question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
101
+ - **Security governance QA corpus:** 60 k expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
102
+ - **Cybersecurity instruction–response corpus:** 25 k instructional pairs enabling reasoning and instruction-following.
103
+ - **Cybersecurity rules corpus:** 5 k structured policy and guideline records used for evaluation.
104
 
105
+ ---
106
 
107
+ ## How to Get Started with the Model
108
 
109
+ ### Using Sentence Transformers
110
 
111
  ```bash
112
  pip install -U sentence-transformers
113
  ```
114
+ ### Run Model to Encode
 
 
 
115
  ```python
116
  from sentence_transformers import SentenceTransformer
117
 
 
118
  model = SentenceTransformer("CiscoAITeam/SecureBERT2.0-biencoder")
119
+
120
  sentences = [
121
+ "How would you use Amcache analysis to detect fileless malware?",
122
+ "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
123
+ "To capture and display network traffic"
124
  ]
125
+
126
  embeddings = model.encode(sentences)
127
  print(embeddings.shape)
 
 
 
 
 
 
 
 
128
  ```
129
+ ### Compute Similarity
130
+ ```python
131
+ from sentence_transformers import util
132
+ similarity = util.cos_sim(embeddings, embeddings)
133
+ print(similarity)
134
 
135
+ ```
136
+ ---
137
 
138
+ ## Framework Versions
 
 
139
 
140
+ Python: 3.10.10
 
 
141
 
142
+ Sentence Transformers: 5.0.0
143
+
144
+ Transformers: 4.52.4
145
+
146
+ PyTorch: 2.7.0+cu128
147
+
148
+ Accelerate: 1.9.0
149
+
150
+ Datasets: 3.6.0
151
+
152
+ Tokenizers: 0.21.1
153
 
 
 
 
 
 
154
 
155
  ## Training Details
156
 
157
  ### Training Dataset
158
 
159
+ The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.
160
+
161
+ - **Dataset Size:** 35,705 samples
162
+ - **Columns:** `sentence_0`, `sentence_1`, `label`
163
+
164
+ #### Statistics (first 1000 samples)
165
+
166
+ | Field | Type | Mean Tokens | Min | Max |
167
+ |:------|:-----|:-----------:|:---:|:---:|
168
+ | sentence_0 | string | 20.14 | 9 | 103 |
169
+ | sentence_1 | string | 293.14 | 3 | 934 |
170
+ | label | float | 1.0 | 1.0 | 1.0 |
171
 
172
+ #### Example Schema
173
 
174
+ | Field | Type | Description |
175
+ |:------|:------|:------------|
176
+ | sentence_0 | string | Query or short text input |
177
+ | sentence_1 | string | Candidate or document text |
178
+ | label | float | Similarity score (1.0 = relevant) |
179
+
180
+ #### Example Samples
181
+
182
+ | sentence_0 | sentence_1 | label |
183
+ |:------------|:-----------|:------:|
184
+ | *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
185
+ | *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |
186
+
187
+ ---
188
+
189
+ ### Training Objective and Loss
190
+
191
+ The model was optimized to maximize semantic similarity between relevant cybersecurity text pairs using contrastive learning.
192
+
193
+ - **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)
194
+
195
+ #### Loss Parameters
196
+ ```json
197
+ {
198
+ "scale": 20.0,
199
+ "similarity_fct": "cos_sim"
200
+ }
201
+ ```
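As a rough illustration of what the loss parameters above control, the following dependency-free sketch computes an in-batch-negatives ranking loss with `scale = 20.0` and cosine similarity. It is not the Sentence Transformers implementation: the function names `cos_sim` and `mnr_loss` and the 2-D toy vectors are assumptions made for this example, standing in for embeddings of `sentence_0` and `sentence_1`.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    positives[i] is treated as the correct match for anchors[i];
    every other positives[j] in the batch acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Hypothetical toy embeddings (not produced by the model)
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor is closest to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor is closest to the wrong positive

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

A larger `scale` sharpens the softmax over the batch, so mismatched pairs are penalized more heavily; the aligned batch here yields a loss near zero while the swapped batch yields a much larger one.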
202
  ## Reference
203
 
204
  ```
 
209
  year={2025}
210
  }
211
  ```
212
+ ---
213
+
214
+ ## Model Card Authors
215
+
216
+ Cisco AI Team
217
+
218
+ ## Model Card Contact
219
+ For inquiries, please contact [Cisco AI Team]([email protected])
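
The "Model Architecture" section added in this commit says queries and documents are encoded independently into a shared vector space for candidate retrieval. A minimal sketch of that retrieval step, assuming the embeddings have already been computed (for example with `model.encode(...)` as in the README), might look like the following. The helper `rank_documents` and the 3-D toy vectors are hypothetical, and the exhaustive scan stands in for an approximate nearest-neighbor index.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs for the top_k most similar documents."""
    scored = [(cos_sim(query_vec, d), idx) for idx, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy vectors standing in for precomputed document embeddings (assumption:
# real embeddings from this model would have 768 dimensions).
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

ranking = rank_documents(query_vec, doc_vecs, top_k=2)
print(ranking)
```

Because documents are embedded without reference to the query, `doc_vecs` can be computed once and reused across queries; only the query needs encoding at search time.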