metadata
language:
  - en
  - bem
  - ny
tags:
  - multi-task
  - sentiment-analysis
  - topic-classification
  - language-identification
  - multilingual
  - transformer
  - zambia
  - lusaka
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
model-index:
  - name: LusakaLang-MultiTask
    results:
      - task:
          type: text-classification
          name: Language Identification
        dataset:
          name: LusakaLang Language Data
          type: lusakalang
          split: test
        metrics:
          - type: accuracy
            value: 0.97
            name: accuracy
          - type: f1
            value: 0.96
            name: f1_macro
          - type: accuracy
            value: 0.9322
            name: accuracy
          - type: f1
            value: 0.9216
            name: f1_macro
          - type: f1
            value: 0.8649
            name: f1_negative
          - type: f1
            value: 0.95
            name: f1_neutral
          - type: f1
            value: 0.95
            name: f1_positive
          - type: accuracy
            value: 0.91
            name: accuracy
          - type: f1
            value: 0.9
            name: f1_macro
base_model:
  - Kelvinmbewe/mbert_Lusaka_Language_Analysis
  - Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis
  - Kelvinmbewe/mbert_LusakaLang_Topic

LusakaLang MultiTask Model

This model is a unified transformer architecture built on top of bert-base-multilingual-cased, designed to perform three tasks simultaneously:

  1. Language Identification
  2. Sentiment Analysis
  3. Topic Classification

The system integrates three fine-tuned LusakaLang checkpoints:

  • mbert_Lusaka_Language_Analysis
  • mbert_LusakaLang_Sentiment_Analysis
  • mbert_LusakaLang_Topic

All tasks share a single mBERT encoder that feeds three independent classifier heads. Because each input is encoded once rather than three times, this design reduces compute and memory overhead compared with running three separate models, and all three predictions are derived from the same representation of the text.
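The shared-encoder layout described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the class name `MultiTaskHeads` is hypothetical, the class counts (4 languages, 3 sentiments, 5 topics) are read off the architecture diagram in this card, and a small embedding table stands in for mBERT so the example runs without downloading weights.

```python
import torch
from torch import nn


class MultiTaskHeads(nn.Module):
    """Hypothetical sketch: one shared encoder, three independent linear heads."""

    def __init__(self, encoder: nn.Module, hidden_size: int,
                 n_languages: int = 4, n_sentiments: int = 3, n_topics: int = 5):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict({
            "language": nn.Linear(hidden_size, n_languages),
            "sentiment": nn.Linear(hidden_size, n_sentiments),
            "topic": nn.Linear(hidden_size, n_topics),
        })

    def forward(self, input_ids: torch.Tensor) -> dict:
        hidden = self.encoder(input_ids)   # (batch, seq_len, hidden_size)
        cls = hidden[:, 0]                 # vector at the first ([CLS]) position
        # The encoder runs once; every head reads the same CLS vector.
        return {task: head(cls) for task, head in self.heads.items()}


# Stand-in encoder for the demo only: an embedding table, not mBERT.
toy_encoder = nn.Embedding(num_embeddings=100, embedding_dim=16)
model = MultiTaskHeads(toy_encoder, hidden_size=16).eval()

with torch.no_grad():
    logits = model(torch.randint(0, 100, (2, 8)))  # 2 texts, 8 token ids each

print({task: tuple(t.shape) for task, t in logits.items()})
```

Each head returns one row of logits per input text, so a single forward pass yields all three predictions.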


Why This Model Matters

Zambian communication is inherently multilingual, fluid, and deeply shaped by context. A single message may blend English, Bemba, Nyanja, local slang, and frequent code-switching, often expressed through culturally grounded idioms and subtle emotional cues. This model is designed specifically for that environment, where meaning depends not only on the words used but on how languages interact within a single utterance.

The model is built to do three things: identify the dominant language or detect when several languages are used together; interpret sentiment, even when it is conveyed indirectly or through culturally specific phrasing; and classify text into practical topics such as driver behaviour, payment issues, app performance, customer support, and ride availability. Capturing these nuances gives a more accurate, context-aware reading of real Zambian communication.


How to Use This Model

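import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

class LusakaLangMultiTask:
    def __init__(self, path="Kelvinmbewe/LusakaLang-MultiTask"):
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # torch.load needs a local file, so fetch model.pt from the Hub first.
        weights_path = hf_hub_download(repo_id=path, filename="model.pt")
        # .eval() below implies model.pt stores the whole pickled model, so the
        # model's class definition must be importable before loading.
        # weights_only=False unpickles arbitrary objects: only load trusted files.
        self.model = torch.load(weights_path, map_location="cpu", weights_only=False).eval()

    # Method bodies are elided in this card. Each takes a list of strings and
    # returns one prediction per text (see Sample Output).
    def predict_language(self, texts): pass
    def predict_sentiment(self, texts): pass
    def predict_topic(self, texts): pass

llm = LusakaLangMultiTask()

# Replace [...] with a list of input strings.
print(llm.predict_language([...]))
print(llm.predict_sentiment([...]))
print(llm.predict_topic([...]))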

Sample Output

# Language Identification 🌍
[
  {"lang": "Bemba",  "conf": 0.96},
  {"lang": "Nyanja", "conf": 0.95},
  {"lang": "English","conf": 0.99}
]
# Sentiment ❤️
[
  {"sent": "Negative", "conf": 0.98},
  {"sent": "Positive", "conf": 0.95},
  {"sent": "Neutral",  "conf": 0.87}
]
# Topic 🗂️
[
  {"topic": "Payment Issue",     "conf": 0.97},
  {"topic": "Customer Support",  "conf": 0.95},
  {"topic": "Driver Behaviour",  "conf": 0.96}
]
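
The card does not say how the `conf` values are computed. A common choice, assumed here, is the softmax probability of the winning class. The sketch below uses only the standard library; the helper names `softmax` and `to_prediction`, the label order, and the example logits are all illustrative assumptions.

```python
import math

SENTIMENT_LABELS = ["Negative", "Neutral", "Positive"]  # label order is an assumption


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, labels, key):
    """Pick the most probable label and report its probability as `conf`."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {key: labels[best], "conf": round(probs[best], 2)}


# Made-up logits for a single text, for illustration only.
example = to_prediction([2.0, 0.0, -1.0], SENTIMENT_LABELS, "sent")
print(example)  # {'sent': 'Negative', 'conf': 0.84}
```

The same helper would apply to the language and topic heads by passing their label lists and the keys `"lang"` or `"topic"`.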
=========================== Training Architecture ===========================

📥 Input                →   🧠 Core Engine                →     📈 Output
------------------------------------------------------------------------------------
Text (Any Language)     →   Tokenizer 🔤                  →     Language 🌍
                        →   Shared mBERT Encoder 🧠       →     Bemba / Nyanja /
                        →   CLS Vector 🎯                 →     English / Mixed
------------------------------------------------------------------------------------
User Feedback 💬        →   Tokenizer 🔤                  →     Sentiment ❤️
                        →   Shared Encoder 🧠             →     Negative / Neutral /
                        →   CLS Vector 🎯                 →     Positive
------------------------------------------------------------------------------------
Ride Context 🚗         →   Tokenizer 🔤                  →     Topic 🗂️
                        →   Shared Encoder 🧠             →     Driver / Payment /
                        →   CLS Vector 🎯                 →     Support / App / Availability
------------------------------------------------------------------------------------