NAICS + Area of Focus Classifier

Semantic matching model for classifying companies into:

  • NAICS codes (1,012 industry classifications)
  • Area of Focus (59 technology categories)

Model Description

This model uses semantic similarity matching (not classification heads) to predict both NAICS industry codes and Area of Focus categories for technology companies.

Architecture:

  • Base: BAAI/bge-small-en-v1.5 (384-dim embeddings)
  • Method: Cosine similarity matching
  • NAICS: 1,012 reference codes, each with 3 embeddings (short, keywords, long)
  • AoF: 59 reference embeddings
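
As a rough illustration of matching by cosine similarity rather than with a classification head, the sketch below picks the closest reference vector for a query. It uses made-up 3-dim vectors in place of the 384-dim embeddings, and the helper names `cosine` and `match` are hypothetical, not part of this repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match(query, references):
    """Return (label, score) for the reference vector closest to the query."""
    scored = [(label, cosine(query, vec)) for label, vec in references.items()]
    return max(scored, key=lambda pair: pair[1])

# Toy 3-dim vectors standing in for 384-dim embeddings (values are made up).
references = {
    "Fin-Tech: Payments": [0.9, 0.1, 0.0],
    "Cyber Security": [0.1, 0.9, 0.1],
    "Gaming": [0.0, 0.2, 0.9],
}
label, score = match([0.8, 0.2, 0.1], references)
print(label, round(score, 3))
```

Because the labels are just rows in a reference table, adding or renaming a category only requires a new reference embedding, not retraining a classifier layer.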

Performance

NAICS Classification:

  • Top-1 Accuracy: ~70% (with detailed descriptions)
  • Typical confidence (similarity score): 60-90%

Area of Focus Classification:

  • Top-1 Accuracy: ~85%
  • Typical confidence (similarity score): 80-95%

Speed:

  • Both predictions: <40ms
  • No GPU required for inference
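
One way to check latency on your own hardware is a small timing harness like the sketch below. The helper `measure_latency_ms` and the stand-in `dummy_predict` are hypothetical; to time the model itself, pass the `predict` function from the Usage section instead.

```python
import statistics
import time

def measure_latency_ms(fn, arg, warmup=3, runs=20):
    """Return the median wall-clock latency of fn(arg) in milliseconds."""
    for _ in range(warmup):
        fn(arg)  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload so the sketch runs on its own; swap in the real
# prediction function to measure the model.
def dummy_predict(description):
    return len(description.split())

latency = measure_latency_ms(dummy_predict, "payment processing APIs for online merchants")
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow run (for example, the first call after loading) does not dominate the result.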

Usage

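from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
import torch
import torch.nn.functional as F
import pickle

# Load the encoder and tokenizer from the "encoder" subfolder of the repo
model_name = "jmurray10/naics-aof-classifier"
encoder = AutoModel.from_pretrained(model_name, subfolder="encoder")
tokenizer = AutoTokenizer.from_pretrained(model_name, subfolder="encoder")
encoder.eval()

# Download the index of reference embeddings.
# The repo is gated: accept its conditions and log in first
# (e.g. `huggingface-cli login`) so the downloads are authorized.
index_path = hf_hub_download(repo_id=model_name, filename="naics_index.pkl")

# Note: unpickling can execute arbitrary code; only load files you trust.
with open(index_path, 'rb') as f:
    naics_index = pickle.load(f)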

# Prediction function
def predict(description):
    # Encode
    text = f"Represent this business for industry classification: {description}"
    inputs = tokenizer(text, padding=True, truncation=True, 
                      max_length=128, return_tensors='pt')
    
    with torch.no_grad():
        outputs = encoder(**inputs)
        mask = inputs['attention_mask'].unsqueeze(-1).float()
        pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        emb = F.normalize(pooled, p=2, dim=1)
    
    # NAICS
    naics_emb_short = torch.tensor(naics_index['emb_short'])
    naics_emb_kw = torch.tensor(naics_index['emb_kw'])
    naics_emb_long = torch.tensor(naics_index['emb_long'])
    
    sim_s = F.cosine_similarity(emb, naics_emb_short)
    sim_k = F.cosine_similarity(emb, naics_emb_kw)
    sim_l = F.cosine_similarity(emb, naics_emb_long)
    naics_scores = 0.4 * sim_s + 0.35 * sim_k + 0.25 * sim_l
    
    top_naics = naics_scores.argmax().item()
    naics_code = naics_index['codes'][top_naics]
    naics_title = naics_index['titles'][top_naics]
    naics_conf = naics_scores[top_naics].item()
    
    # AoF
    aof_emb = torch.tensor(naics_index['aof_emb'])
    aof_scores = F.cosine_similarity(emb, aof_emb)
    
    top_aof = aof_scores.argmax().item()
    aof_class = naics_index['aof_codes'][top_aof]
    aof_conf = aof_scores[top_aof].item()
    
    return {
        'naics': {'code': naics_code, 'title': naics_title, 'confidence': naics_conf},
        'aof': {'class': aof_class, 'confidence': aof_conf}
    }

# Example
result = predict("payment processing APIs for online merchants")
print(result)
# {'naics': {'code': '522320', 'title': 'Financial Transactions Processing...', 'confidence': 0.654},
#  'aof': {'class': 'Fin-Tech: Payments', 'confidence': 0.806}}
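
Since NAICS top-1 accuracy is around 70%, it may help to look at several candidates instead of only the best one. The sketch below blends the three similarity lists with the same 0.4 / 0.35 / 0.25 weights used in `predict` and ranks the top k. The helpers `blend` and `top_k` are hypothetical, the similarity values are made up, and it works on plain Python lists (for tensors, call `.tolist()` first).

```python
import heapq

def blend(sim_short, sim_kw, sim_long, weights=(0.4, 0.35, 0.25)):
    """Weighted sum of the three per-code similarity lists."""
    w_s, w_k, w_l = weights
    return [w_s * s + w_k * k + w_l * l
            for s, k, l in zip(sim_short, sim_kw, sim_long)]

def top_k(scores, codes, titles, k=5):
    """Return the k highest-scoring (code, title, score) tuples, best first."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(codes[i], titles[i], scores[i]) for i in best]

# Toy data: three codes with made-up similarity values.
codes = ["522320", "541519", "513210"]
titles = ["Financial Transactions Processing...",
          "Other Computer Related Services",
          "Software Publishers"]
scores = blend([0.70, 0.40, 0.30], [0.60, 0.50, 0.20], [0.65, 0.45, 0.35])

ranked = top_k(scores, codes, titles, k=2)
for code, title, score in ranked:
    print(code, title, round(score, 4))
```

Returning a short ranked list lets a reviewer or a downstream rule pick among close candidates when the top score is not clearly ahead.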

Examples

| Description                | NAICS  | Area of Focus          | AoF Confidence |
|----------------------------|--------|------------------------|----------------|
| payment processing APIs    | 522320 | Fin-Tech: Payments     | 0.806          |
| endpoint security software | 541519 | Cyber Security         | 0.928          |
| game development engine    | 513210 | Gaming                 | 0.857          |
| cryptocurrency exchange    | 523210 | Cryptocurrency: no ICO | 0.919          |
| cloud infrastructure       | 518210 | Cloud Infrastructure   | 0.956          |

Area of Focus Categories

The model classifies into 59 technology-focused categories:

Financial Technology:

  • Fin-Tech: Payments, Lending, Broker/Dealer, Personal Finance, Consumer Banking, Investment Management

Security:

  • Cyber Security, Network Security

Software & SaaS:

  • SaaS - All Other, Enterprise Software, Software Development

Infrastructure:

  • Cloud Infrastructure and Cloud Storage, Database, Web Service Providers

Emerging Tech:

  • Artificial Intelligence, Blockchain Technology, Cryptocurrency, IoT, Robotics, Autonomous Vehicles

And 40+ more categories...

Training Data

  • NAICS: Trained on NAICS 2022 manual descriptions, index terms, and examples (~40K examples)
  • AoF: Created from semantic definitions of 59 technology categories (no training required - semantic matching)

Limitations

  • Best for technology companies: AoF categories focus on tech industries
  • Requires detailed descriptions: works best with descriptions of 20 or more words
  • NAICS baseline: ~70% top-1 accuracy reflects the difficulty of 1,012-class classification
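
A caller could guard against these limitations with a small check before trusting a prediction. The sketch below is one possible approach: the function `needs_review` is hypothetical, the 20-word minimum comes from the limitation above, and the 0.6 confidence cutoff is an assumed value you would tune on your own data.

```python
def needs_review(description, confidence, min_words=20, min_confidence=0.6):
    """Return a list of reasons a prediction may be unreliable (empty if none)."""
    reasons = []
    if len(description.split()) < min_words:
        reasons.append("short description")
    if confidence < min_confidence:  # assumed cutoff, not from the model card
        reasons.append("low confidence")
    return reasons

# A three-word description with a high score is still flagged as short.
print(needs_review("payment processing APIs", 0.806))
```

An empty list means neither check fired; it does not guarantee the prediction is correct.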

Citation

If you use this model, please cite:

@misc{naics_aof_classifier,
  author = {Murray, James},
  title = {NAICS and Area of Focus Classifier},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/jmurray10/naics-aof-classifier}
}

License

MIT License

Contact

For questions or feedback: https://huggingface.co/jmurray10
