NAICS + Area of Focus Classifier

Semantic matching model for classifying companies into:

NAICS codes (1,012 industry classifications)
Area of Focus (59 technology categories)

Model Description

This model uses semantic similarity matching (not classification heads) to predict both NAICS industry codes and Area of Focus categories for technology companies.

Architecture:

Base: BAAI/bge-small-en-v1.5 (384-dim embeddings)
Method: Cosine similarity matching
NAICS: reference embeddings for 1,012 codes, three per code (short, keywords, long)
AoF: 59 reference embeddings
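
To make the matching step concrete, here is a minimal, dependency-free sketch of weighted cosine matching over three reference types. The 3-dimensional vectors and the `code_A` / `code_B` labels are made up for illustration; the weights mirror those in the Usage code, and the real index holds 384-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy reference vectors (hypothetical values, one row per label and type).
REFERENCES = {
    "short":    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "keywords": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    "long":     [[0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],
}
WEIGHTS = {"short": 0.4, "keywords": 0.35, "long": 0.25}  # as in the Usage code
LABELS = ["code_A", "code_B"]  # placeholder labels, not real NAICS codes

def match(query):
    """Return (label, fused score) for the best-matching reference."""
    fused = [
        sum(WEIGHTS[kind] * cosine(query, REFERENCES[kind][i]) for kind in WEIGHTS)
        for i in range(len(LABELS))
    ]
    best = max(range(len(LABELS)), key=fused.__getitem__)
    return LABELS[best], fused[best]

label, score = match([0.9, 0.2, 0.1])
print(label, round(score, 3))
```

Because the weights sum to 1 and each cosine similarity is at most 1, the fused score stays on the same scale as a single cosine similarity.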

Performance

NAICS Classification:

Top-1 Accuracy: ~70% (with detailed descriptions)
Typical confidence score: 0.60-0.90 (weighted cosine similarity)

Area of Focus Classification:

Top-1 Accuracy: ~85%
Typical confidence score: 0.80-0.95 (cosine similarity)
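
The card does not say how these accuracy figures were measured. As a reference point, the sketch below computes the standard top-k accuracy from ranked predictions. The function name `top_k_accuracy` and the sample data are illustrative assumptions, not part of the released code.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=1):
    """Fraction of examples whose gold label is among the first k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)

# Illustrative ranked outputs for three descriptions (best guess first).
ranked = [
    ["522320", "522390"],
    ["541519", "541512"],
    ["513210", "541511"],
]
gold = ["522320", "541512", "513210"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
# top-1: 0.67, top-2: 1.00
```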

Speed:

Both predictions: <40ms
No GPU required for inference
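
Latency depends on hardware, so it is worth measuring on the target machine. The sketch below is one way to do that with the standard library. `median_latency_ms` and `dummy_predict` are hypothetical names, and the dummy workload only stands in for a real prediction call.

```python
import statistics
import time

def median_latency_ms(fn, *args, repeats=20, warmup=3):
    """Median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload so the sketch runs on its own; replace it with a
# prediction function to measure real inference latency.
def dummy_predict(description):
    return sum(ord(ch) for ch in description)

latency = median_latency_ms(dummy_predict, "payment processing APIs for online merchants")
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow call (for example, a cold cache) does not dominate the result.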

Usage

import pickle

import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

# Load the encoder and tokenizer from the "encoder" subfolder of the repo
model_name = "jmurray10/naics-aof-classifier"
encoder = AutoModel.from_pretrained(model_name, subfolder="encoder")
tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder="encoder")

# Download the index with the reference embeddings.
# Access to the files requires accepting the repo conditions, so log in
# first (e.g. `huggingface-cli login`).
index_path = hf_hub_download(repo_id=model_name, filename="naics_index.pkl")
with open(index_path, 'rb') as f:
    naics_index = pickle.load(f)

# Prediction function
def predict(description):
    # Encode
    text = f"Represent this business for industry classification: {description}"
    inputs = tokenizer(text, padding=True, truncation=True, 
                      max_length=128, return_tensors='pt')
    
    with torch.no_grad():
        outputs = encoder(**inputs)
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        emb = F.normalize(pooled, p=2, dim=1)
    
    # NAICS
    naics_emb_short = torch.tensor(naics_index['emb_short'])
    naics_emb_kw = torch.tensor(naics_index['emb_kw'])
    naics_emb_long = torch.tensor(naics_index['emb_long'])
    
    sim_s = F.cosine_similarity(emb, naics_emb_short)
    sim_k = F.cosine_similarity(emb, naics_emb_kw)
    sim_l = F.cosine_similarity(emb, naics_emb_long)
    naics_scores = 0.4 * sim_s + 0.35 * sim_k + 0.25 * sim_l
    
    top_naics = naics_scores.argmax().item()
    naics_code = naics_index['codes'][top_naics]
    naics_title = naics_index['titles'][top_naics]
    naics_conf = naics_scores[top_naics].item()
    
    # AoF
    aof_emb = torch.tensor(naics_index['aof_emb'])
    aof_scores = F.cosine_similarity(emb, aof_emb)
    
    top_aof = aof_scores.argmax().item()
    aof_class = naics_index['aof_codes'][top_aof]
    aof_conf = aof_scores[top_aof].item()
    
    return {
        'naics': {'code': naics_code, 'title': naics_title, 'confidence': naics_conf},
        'aof': {'class': aof_class, 'confidence': aof_conf}
    }

# Example
result = predict("payment processing APIs for online merchants")
print(result)
# {'naics': {'code': '522320', 'title': 'Financial Transactions Processing...', 'confidence': 0.654},
#  'aof': {'class': 'Fin-Tech: Payments', 'confidence': 0.806}}
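
`predict` above returns only the single best match. Since NAICS top-1 accuracy is around 70%, it can help to look at a few runner-up candidates as well. The following is a dependency-free sketch of top-k selection; `top_k` is a hypothetical helper and the scores are made-up values, not model output. With the tensors inside `predict`, `torch.topk(naics_scores, k)` would return the same information as values and indices.

```python
import heapq

def top_k(scores, labels, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(labels[i], scores[i]) for i in best]

# Illustrative scores and labels (made-up values, not model output).
scores = [0.41, 0.65, 0.58, 0.12]
labels = ["541511", "522320", "522390", "111110"]

candidates = top_k(scores, labels, k=2)
print(candidates)
# [('522320', 0.65), ('522390', 0.58)]
```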

Examples

Description	NAICS	Area of Focus	AoF Confidence
payment processing APIs	522320	Fin-Tech: Payments	0.806
endpoint security software	541519	Cyber Security	0.928
game development engine	513210	Gaming	0.857
cryptocurrency exchange	523210	Cryptocurrency: no ICO	0.919
cloud infrastructure	518210	Cloud Infrastructure	0.956

Area of Focus Categories

The model classifies into 59 technology-focused categories:

Financial Technology:

Fin-Tech: Payments, Lending, Broker/Dealer, Personal Finance, Consumer Banking, Investment Management

Security:

Cyber Security, Network Security

Software & SaaS:

SaaS - All Other, Enterprise Software, Software Development

Infrastructure:

Cloud Infrastructure and Cloud Storage, Database, Web Service Providers

Emerging Tech:

Artificial Intelligence, Blockchain Technology, Cryptocurrency, IoT, Robotics, Autonomous Vehicles

And 40+ more categories...

Training Data

NAICS: Trained on NAICS 2022 manual descriptions, index terms, and examples (~40K examples)
AoF: Created from semantic definitions of 59 technology categories (no training required - semantic matching)

Limitations

Best for technology companies - AoF categories focus on tech industries
Requires detailed descriptions - Works best with 20+ word descriptions
NAICS baseline - ~70% top-1 accuracy reflects the difficulty of 1,012-class classification
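
One way to act on these limitations is to flag predictions for manual review when the description is short or the score is low. The sketch below assumes a 20-word minimum (from the guidance above) and a 0.60 confidence floor (an assumption taken from the lower end of the reported NAICS range). `review_flags` is a hypothetical helper, not part of the model.

```python
MIN_WORDS = 20          # from the "20+ word descriptions" guidance
MIN_CONFIDENCE = 0.60   # assumption: lower end of the reported NAICS range

def review_flags(description, confidence, min_words=MIN_WORDS, min_confidence=MIN_CONFIDENCE):
    """List reasons a prediction may deserve manual review (empty if none)."""
    flags = []
    if len(description.split()) < min_words:
        flags.append("short_description")
    if confidence < min_confidence:
        flags.append("low_confidence")
    return flags

flags = review_flags("payment processing APIs for online merchants", 0.654)
print(flags)
# ['short_description']
```

Both thresholds are starting points and should be tuned against labelled examples from the target domain.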

Citation

If you use this model, please cite:

@misc{naics_aof_classifier,
  author = {Murray, James},
  title = {NAICS and Area of Focus Classifier},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/jmurray10/naics-aof-classifier}
}

License

MIT License

Contact

For questions or feedback: https://huggingface.co/jmurray10
