LusakaLang – Multilingual Topic Classification Model

🧠 Model Description

mbert_LusakaLang_Topic is a fine-tuned version of Kelvinmbewe/mbert_LusakaLang designed for topic classification in multilingual Zambian text.
The model focuses on Lusaka-style language, where English is frequently mixed with Bemba and Nyanja, particularly in informal digital communication such as ride-hailing reviews, customer feedback, and social media comments.

LusakaLang captures code-switching patterns, local idioms, and pragmatic expressions unique to Zambia’s urban linguistic environment, enabling accurate classification of real-world, mixed-language text.


🎯 Task

Text Classification (Topic Classification)

Supported topics include:

  • customer_support
  • driver_behaviour
  • payment_issues
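A minimal usage sketch for the task above, assuming the checkpoint loads with the standard Hugging Face `transformers` text-classification pipeline and that its label names match the topics listed (they may instead appear as generic `LABEL_0`-style ids, depending on the saved config). The helper names `load_classifier`, `classify`, and `main` are illustrative and not part of the model's API.

```python
from typing import Callable, Dict, List

MODEL_ID = "Kelvinmbewe/mbert_LusakaLang_Topic"


def load_classifier():
    """Load the model through the text-classification pipeline (downloads weights)."""
    # Imported lazily so that classify() can be used and tested without transformers.
    from transformers import pipeline
    return pipeline("text-classification", model=MODEL_ID)


def classify(texts: List[str], clf: Callable[[List[str]], List[Dict]]) -> List[Dict]:
    """Pair each input text with the top predicted topic and its score."""
    predictions = clf(texts)
    return [
        {"text": text, "topic": pred["label"], "score": float(pred["score"])}
        for text, pred in zip(texts, predictions)
    ]


def main() -> None:
    # Assumption: label strings come from the model config and may differ
    # from the topic names listed in this card.
    clf = load_classifier()
    reviews = ["The driver was rude and kept shouting on the phone."]
    for row in classify(reviews, clf):
        print(row)
```

Calling `main()` downloads the model and prints one dictionary per review; `classify` accepts any callable with the same input and output shape, which keeps it easy to test offline.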

🧪 Training Data Creation & Local Review Process

The training data used for LusakaLang was primarily AI-generated synthetic text, created to simulate ride-hailing user reviews and feedback common in Zambian contexts (e.g. complaints, compliments, and service issues).

To ensure linguistic authenticity and cultural relevance, all synthetic samples were reviewed, corrected, and refined by a native Zambian speaker. This human-in-the-loop review process focused on:

  • Correcting unnatural or non-local phrasing introduced by AI generation
  • Aligning expressions with Lusaka-style English, Bemba, and Nyanja usage
  • Ensuring realistic code-switching patterns (English ↔ Bemba ↔ Nyanja)
  • Improving local idioms, slang, and pragmatic meaning

This hybrid approach combines the scalability of AI-generated data with human linguistic expertise, resulting in training samples that better reflect real-world ride-hailing communication in Lusaka.

Note: While the dataset is synthetic, linguistic patterns were intentionally grounded in local Zambian speech norms through native-speaker validation.


📊 Evaluation Results (Validation Set)

The model was evaluated after 0 training epochs on a held-out validation set.

| Metric | Score |
|---|---|
| Accuracy | 99.1% |
| Precision | 99.0% |
| Recall | 99.0% |
| Macro F1 | 99.0% |
| Micro F1 | 99.1% |
| Val Loss | 0.10 |

These results indicate strong performance on the held-out validation set, with no signs of overfitting.
Macro and Micro F1 scores are closely aligned (99.0% vs. 99.1%), indicating balanced performance across all topic classes.
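To make the macro/micro distinction concrete, here is a small pure-Python sketch that computes both from true and predicted labels. The toy labels below are invented for illustration and are not model outputs. In single-label multiclass classification, micro F1 equals accuracy, which is consistent with both being reported as 99.1% above.

```python
from typing import Dict, List


def f1_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute accuracy, macro F1 and micro F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}

    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t: int, p: int, n: int) -> float:
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro, "micro_f1": micro}


# Toy example (hypothetical labels, one payment_issues review misclassified).
y_true = ["customer_support", "driver_behaviour", "payment_issues",
          "payment_issues", "driver_behaviour", "customer_support"]
y_pred = ["customer_support", "driver_behaviour", "payment_issues",
          "driver_behaviour", "driver_behaviour", "customer_support"]

scores = f1_scores(y_true, y_pred)
print(scores)
```

A large gap between macro and micro F1 would point to one or more classes performing much worse than the rest; a small gap, as reported here, suggests no class is lagging badly.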


💡 Why Kelvinmbewe/mbert_LusakaLang_Topic?

✅ Better Understanding of Zambian English

Examples:

  • “I’m just there”
  • “I’m not fine but I’m okay”
  • “I’m feeling somehow”
  • “Believe you me”
  • “Me I tell you the truth”
  • “It’s just temporal”

✅ Better Handling of Bemba & Nyanja Idioms

Examples:

  • “Nimvela bwino” → positive context
  • “Nimvelako bwino pangono pangono” → neutral context
  • “Nima one naiwe” → negative context
  • “Sima one naiwe” → positive context

✅ Strong Code-Switching Support

Common patterns:

  • English + Bemba
  • English + Nyanja
  • English + slang
  • English + Bemba + Nyanja

🚀 Intended Use

mbert_LusakaLang_Topic is intended for:

  • Ride-hailing customer feedback analysis
  • Topic classification of Zambian social media text
  • Customer support automation
  • Research on African multilingual NLP and code-switching

⚠️ Limitations

  • The training data is primarily synthetic, despite native-speaker review.
  • Performance may degrade on:
    • Slang or expressions not represented in the dataset
    • Text from regions outside Lusaka
    • Domains unrelated to ride-hailing or customer feedback

Future versions aim to incorporate larger volumes of real-world annotated data.
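Given the limitations above, one common mitigation is to send low-confidence predictions to a human reviewer rather than acting on them automatically. The sketch below assumes predictions are dictionaries with `label` and `score` keys, as returned by a `transformers` text-classification pipeline. The function name and the 0.80 threshold are hypothetical and would need tuning on real, in-domain data. Note that a high softmax score does not guarantee correctness on out-of-domain text.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 0.80  # hypothetical cut-off, not a value from the model card


def split_by_confidence(
    predictions: List[Dict], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Dict], List[Dict]]:
    """Separate predictions into auto-accepted and needs-human-review lists."""
    accepted = [p for p in predictions if p["score"] >= threshold]
    needs_review = [p for p in predictions if p["score"] < threshold]
    return accepted, needs_review


# Illustrative predictions (made-up scores, not real model output).
example_predictions = [
    {"label": "payment_issues", "score": 0.97},
    {"label": "driver_behaviour", "score": 0.55},
]

accepted, needs_review = split_by_confidence(example_predictions)
print(len(accepted), "accepted;", len(needs_review), "flagged for review")
```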




🙌 Acknowledgements

Special thanks to native Zambian language contributors who helped ensure local linguistic accuracy and cultural relevance in the training data.

Model size: 0.2B params · Tensor type: F32 · Format: Safetensors