NAICS + Area of Focus Classifier
Semantic matching model for classifying companies into:
- NAICS codes (1,012 industry classifications)
- Area of Focus (59 technology categories)
Model Description
This model uses semantic similarity matching (not classification heads) to predict both NAICS industry codes and Area of Focus categories for technology companies.
Architecture:
- Base: BAAI/bge-small-en-v1.5 (384-dim embeddings)
- Method: Cosine similarity matching
- NAICS: 1,012 reference codes, each with 3 embeddings (short, keywords, long)
- AoF: 59 reference embeddings
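The matching step can be illustrated without the encoder. The sketch below is dependency-free and uses made-up 3-dimensional vectors in place of the 384-dimensional embeddings; the labels, vectors, and helper names (`cosine`, `fused_score`) are illustrative assumptions. The view weights (0.4, 0.35, 0.25) are the ones that appear in the Usage code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings; every number here is made up.
query = [0.9, 0.1, 0.0]
references = {
    "payments": {"short": [1.0, 0.0, 0.0], "kw": [0.8, 0.2, 0.0], "long": [0.7, 0.2, 0.1]},
    "gaming": {"short": [0.0, 1.0, 0.0], "kw": [0.1, 0.9, 0.0], "long": [0.1, 0.8, 0.1]},
}

# Weights for the three reference views, taken from the Usage code.
WEIGHTS = {"short": 0.4, "kw": 0.35, "long": 0.25}


def fused_score(query_vec, views):
    """Weighted sum of the cosine similarities to each reference view."""
    return sum(w * cosine(query_vec, views[name]) for name, w in WEIGHTS.items())


scores = {label: fused_score(query, views) for label, views in references.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The label with the highest fused score is the prediction, and that score is what the Usage code reports as `confidence`.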
Performance
NAICS Classification:
- Top-1 Accuracy: ~70% (with detailed descriptions)
- Typical confidence (similarity score): 60-90%
Area of Focus Classification:
- Top-1 Accuracy: ~85%
- Typical confidence (similarity score): 80-95%
Speed:
- Both predictions: <40ms
- No GPU required for inference
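Latency depends on hardware, so it may be worth measuring on your own machine. The helper below is a minimal sketch: `median_latency_ms` and `dummy_predict` are hypothetical names, and the dummy workload is only a stand-in so the snippet runs on its own. To time the real model, pass the `predict` function from the Usage section instead.

```python
import statistics
import time


def median_latency_ms(fn, arg, warmup=3, runs=20):
    """Return the median wall-clock time of fn(arg) in milliseconds."""
    for _ in range(warmup):
        fn(arg)  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_predict(description):
    """Stand-in workload; replace with the real predict()."""
    return len(description.split())


latency = median_latency_ms(dummy_predict, "payment processing APIs for online merchants")
print(f"median latency: {latency:.3f} ms")
```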
Usage
import pickle
import urllib.request

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the encoder and tokenizer from the "encoder" subfolder of the repo
model_name = "jmurray10/naics-aof-classifier"
encoder = AutoModel.from_pretrained(model_name, subfolder="encoder")
tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder="encoder")

# Download and load the index of reference embeddings
urllib.request.urlretrieve(
    f"https://huggingface.co/{model_name}/resolve/main/naics_index.pkl",
    "naics_index.pkl",
)
with open("naics_index.pkl", "rb") as f:
    naics_index = pickle.load(f)
# Prediction function
def predict(description):
    # Encode the description with the instruction prefix
    text = f"Represent this business for industry classification: {description}"
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=128, return_tensors='pt')
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool over non-padding tokens, then L2-normalize
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    emb = F.normalize(pooled, p=2, dim=1)

    # NAICS: weighted sum of similarities to the three reference embedding sets
    naics_emb_short = torch.tensor(naics_index['emb_short'])
    naics_emb_kw = torch.tensor(naics_index['emb_kw'])
    naics_emb_long = torch.tensor(naics_index['emb_long'])
    sim_s = F.cosine_similarity(emb, naics_emb_short)
    sim_k = F.cosine_similarity(emb, naics_emb_kw)
    sim_l = F.cosine_similarity(emb, naics_emb_long)
    naics_scores = 0.4 * sim_s + 0.35 * sim_k + 0.25 * sim_l
    top_naics = naics_scores.argmax().item()
    naics_code = naics_index['codes'][top_naics]
    naics_title = naics_index['titles'][top_naics]
    naics_conf = naics_scores[top_naics].item()

    # AoF: similarity to a single set of reference embeddings
    aof_emb = torch.tensor(naics_index['aof_emb'])
    aof_scores = F.cosine_similarity(emb, aof_emb)
    top_aof = aof_scores.argmax().item()
    aof_class = naics_index['aof_codes'][top_aof]
    aof_conf = aof_scores[top_aof].item()

    return {
        'naics': {'code': naics_code, 'title': naics_title, 'confidence': naics_conf},
        'aof': {'class': aof_class, 'confidence': aof_conf}
    }
# Example
result = predict("payment processing APIs for online merchants")
print(result)
# {'naics': {'code': '522320', 'title': 'Financial Transactions Processing...', 'confidence': 0.654},
# 'aof': {'class': 'Fin-Tech: Payments', 'confidence': 0.806}}
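Because NAICS top-1 accuracy is around 70%, it can be useful to inspect several candidates rather than only the best one. The following is a minimal sketch, assuming scores and labels are plain Python lists (for the tensors inside `predict`, `naics_scores.tolist()` would produce such a list). `top_k` is a hypothetical helper and the scores below are made up.

```python
def top_k(scores, labels, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up scores for four NAICS codes, for illustration only.
example_codes = ["513210", "522320", "541519", "518210"]
example_scores = [0.20, 0.65, 0.50, 0.10]

candidates = top_k(example_scores, example_codes, k=2)
print(candidates)
```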
Examples
| Description | NAICS | Area of Focus | AoF Confidence |
|---|---|---|---|
| payment processing APIs | 522320 | Fin-Tech: Payments | 0.806 |
| endpoint security software | 541519 | Cyber Security | 0.928 |
| game development engine | 513210 | Gaming | 0.857 |
| cryptocurrency exchange | 523210 | Cryptocurrency: no ICO | 0.919 |
| cloud infrastructure | 518210 | Cloud Infrastructure | 0.956 |
Area of Focus Categories
The model classifies into 59 technology-focused categories:
Financial Technology:
- Fin-Tech: Payments, Lending, Broker/Dealer, Personal Finance, Consumer Banking, Investment Management
Security:
- Cyber Security, Network Security
Software & SaaS:
- SaaS - All Other, Enterprise Software, Software Development
Infrastructure:
- Cloud Infrastructure and Cloud Storage, Database, Web Service Providers
Emerging Tech:
- Artificial Intelligence, Blockchain Technology, Cryptocurrency, IoT, Robotics, Autonomous Vehicles
And 40+ more categories...
Training Data
- NAICS: Trained on NAICS 2022 manual descriptions, index terms, and examples (~40K examples)
- AoF: Created from semantic definitions of 59 technology categories (no training required - semantic matching)
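One way to picture how a reference index might be assembled from category definitions is sketched below. Everything here is an assumption for illustration: `toy_embed` is a stand-in hashed bag-of-words embedder (not the BGE encoder), `build_aof_index` is a hypothetical helper, and the two definitions are placeholder text rather than the definitions actually used. Only the key names `aof_codes` and `aof_emb` come from the Usage code.

```python
import math


def toy_embed(text, dim=8):
    """Stand-in embedder: hashed bag of words, L2-normalized. Not the real encoder."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_aof_index(definitions, embed):
    """Embed each category definition once and store it next to its label."""
    codes = list(definitions)
    return {
        "aof_codes": codes,
        "aof_emb": [embed(definitions[code]) for code in codes],
    }


# Placeholder definitions, written for this example only.
definitions = {
    "Cyber Security": "software that protects endpoints and networks from attacks",
    "Gaming": "video game studios and game development engines",
}

index = build_aof_index(definitions, toy_embed)
print(index["aof_codes"], len(index["aof_emb"][0]))
```

Because the references are computed once from text, adding or rewording a category only requires re-embedding its definition.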
Limitations
- Best for technology companies: the AoF categories focus on tech industries.
- Requires detailed descriptions: works best with descriptions of 20+ words.
- NAICS baseline: the ~70% top-1 accuracy reflects the difficulty of 1,012-class classification.
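A simple pre-check can flag inputs that fall short of the 20-word guideline before they are classified. This is a sketch rather than part of the model; `check_description` and its return format are assumptions.

```python
MIN_WORDS = 20  # guideline from the Limitations list


def check_description(description, min_words=MIN_WORDS):
    """Count whitespace-separated words and flag descriptions that are too short."""
    word_count = len(description.split())
    return {"word_count": word_count, "is_detailed": word_count >= min_words}


report = check_description("payment processing APIs for online merchants")
print(report)
```

Short descriptions can still be classified; the flag only signals that the result may be less reliable.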
Citation
If you use this model, please cite:
@misc{naics_aof_classifier,
author = {Murray, James},
title = {NAICS and Area of Focus Classifier},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/jmurray10/naics-aof-classifier}
}
License
MIT License
Contact
For questions or feedback: https://huggingface.co/jmurray10