---
license: cc-by-nc-4.0
language:
- en
- km
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
tags:
- legal
- khmer
- translation
- nllb
- refugee
- humanitarian
- denoising
---

# Khmer Legal Bridge - NLLB Fine-tuned for Legal Translation

English-Khmer bidirectional translation model for legal and humanitarian documents

## Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M optimized for legal document translation between English and Khmer. It was developed to support Cambodian refugees, asylum seekers, and legal professionals who need accurate translations of legal materials.

## Intended Use

- Translation of legal documents (court documents, asylum applications, legal handbooks)
- Refugee and immigration documentation
- Juvenile justice materials
- Human rights reports and policy documents

## Languages

- English (eng_Latn)
- Khmer (khm_Khmr)

## Evaluation Results

| Direction | chrF | BLEU |
|-----------|------|------|
| EN to KM | 53.28 | 29.38 |
| KM to EN | 59.68 | 34.78 |
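For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: an F-score over character n-grams, with recall weighted more heavily than precision. It is a simplified, sentence-level illustration written for this card; the function names are hypothetical, and it is not the tool that produced the scores above (the card does not state which implementation was used).

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score: beta = 2 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


score = simple_chrf("The court found the accused guilty.",
                    "The court finds the defendant guilty.")
print(f"{score:.2f}")
```

Because chrF works on characters rather than words, it does not depend on word segmentation, which is convenient for Khmer, where spaces do not mark word boundaries.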

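### Comparison with Base Model

| Direction | Metric | Base NLLB | Fine-tuned | Change |
|-----------|--------|-----------|------------|--------|
| KM to EN | chrF | 55.98 | 59.68 | +3.70 |
| KM to EN | BLEU | 28.35 | 34.78 | +6.43 |
| EN to KM | chrF | 54.48 | 53.28 | -1.20 |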

### Key Results

- Balanced bidirectional performance: both directions score above 53 chrF (53.28 EN to KM, 59.68 KM to EN)
- KM to EN improved over the base model: +3.70 chrF, +6.43 BLEU
- EN to KM chrF slightly lower: -1.20 chrF versus the base model, a small trade-off for better overall balance

## Training Pipeline

### Phase 1.5: Denoising Pre-training

Before translation fine-tuning, we strengthened the model's Khmer understanding using a denoising autoencoder task on 88,000+ monolingual Khmer examples:

| Dataset | Size | Source | Content |
|---------|------|--------|---------|
| khPOS | ~12,000 | Khmer POS Corpus | News, politics, economics; professionally segmented with POS tags |
| Khmer Dictionary 44K | ~44,700 | Royal Academy of Cambodia (2022) | Curated definitions, formal register |

**Denoising Task:**

- Input: Corrupted Khmer text (15% noise: masking, deletion, token shuffling)
- Output: Clean original text
- Both the encoder and the decoder are trained
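The corruption step described above can be sketched roughly as follows. This is a minimal illustration, not the training code: the `<mask>` placeholder, the choice of one operation per corrupted token, and the "swap with the previous token" form of shuffling are all assumptions, since the card only states the three noise types and the 15% rate.

```python
import random

MASK = "<mask>"  # assumed placeholder; the actual mask token is not stated in this card


def add_noise(tokens, noise_ratio=0.15, seed=None):
    """Corrupt a token list by masking, deleting, or locally shuffling tokens."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rng.random() < noise_ratio:
            op = rng.choice(["mask", "delete", "shuffle"])
            if op == "mask":
                noisy.append(MASK)
            elif op == "shuffle":
                noisy.append(tok)
                if len(noisy) >= 2:
                    # swap with the previous token to perturb local order
                    noisy[-1], noisy[-2] = noisy[-2], noisy[-1]
            # op == "delete": drop the token entirely
        else:
            noisy.append(tok)
    return noisy


def make_denoising_example(tokens, seed=None):
    """Pair a corrupted input with its clean target for seq2seq training."""
    return {"input": " ".join(add_noise(tokens, seed=seed)), "target": " ".join(tokens)}


example = make_denoising_example(["w1", "w2", "w3", "w4", "w5", "w6"], seed=0)
print(example)
```

Because the target is the full clean sequence, the decoder is trained to generate fluent Khmer while the encoder learns to read corrupted input, which matches the "both encoder and decoder" point in the list.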

### Phase 1: Bidirectional Translation Fine-tuning

Using the denoising-pretrained model, we fine-tuned on ~389,000 parallel examples (bidirectional):

| Dataset | Pairs | Bidirectional Examples |
|---------|-------|------------------------|
| ALT Corpus | 18,088 | 36,176 |
| OPUS-100 | ~112,000 | ~224,000 |
| ParaCrawl | ~65,000 | ~130,000 |
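Each parallel pair contributes two training examples, one per direction, which is why the last column is twice the pair count. A minimal sketch of that expansion is shown below; the dictionary field names are hypothetical and not taken from the training code.

```python
def make_bidirectional(pairs, en_code="eng_Latn", km_code="khm_Khmr"):
    """Turn (english, khmer) pairs into one example per translation direction."""
    examples = []
    for en, km in pairs:
        examples.append({"src_lang": en_code, "tgt_lang": km_code, "source": en, "target": km})
        examples.append({"src_lang": km_code, "tgt_lang": en_code, "source": km, "target": en})
    return examples


# Pair counts from the table above (OPUS-100 and ParaCrawl are approximate)
pair_counts = {"ALT Corpus": 18_088, "OPUS-100": 112_000, "ParaCrawl": 65_000}
total_examples = 2 * sum(pair_counts.values())
print(total_examples)  # 390176, in line with the ~389,000 quoted above given the rounded counts
```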

**Training Results:**

| Phase | Metric | Start | End | Change |
|-------|--------|-------|-----|--------|
| 1.5 Denoising | Validation loss | ~4.9 | ~2.5 | -49% |
| 1 Translation | Validation loss | 3.804 | 2.586 | -32% |
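The percentages in the last column are relative changes in validation loss, which can be checked with a few lines of arithmetic (the helper name is illustrative):

```python
def relative_change(start: float, end: float) -> float:
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100


print(round(relative_change(4.9, 2.5)))      # -49 (Phase 1.5, denoising)
print(round(relative_change(3.804, 2.586)))  # -32 (Phase 1, translation)
```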

**Training Configuration:**

- Epochs: 3
- Batch size: 32 effective
- Learning rate: 2e-5
- Training time: ~13.5 hours on Google Colab (A100)
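An "effective" batch size is usually the per-device batch size multiplied by the number of gradient accumulation steps. The sketch below illustrates that relationship and the resulting number of optimizer steps. The 8 x 4 split is an assumption made for illustration (the card only gives the effective value of 32), as is the single-device setup.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    epochs: int = 3                        # from the list above
    learning_rate: float = 2e-5            # from the list above
    per_device_batch_size: int = 8         # assumption, not stated in the card
    gradient_accumulation_steps: int = 4   # assumption, not stated in the card

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.gradient_accumulation_steps

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps over all epochs, keeping the last partial batch."""
        return math.ceil(num_examples / self.effective_batch_size) * self.epochs


cfg = TrainConfig()
print(cfg.effective_batch_size)   # 32
print(cfg.total_steps(389_000))   # 36471
```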

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, NllbTokenizerFast
import torch

# Load model and tokenizer
model_id = "ClaudBarbara/Open_Access_Khmer"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = NllbTokenizerFast.from_pretrained(model_id)

def translate(text, src_lang, tgt_lang):
    """Translate text between NLLB language codes, e.g. "eng_Latn" and "khm_Khmr"."""
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            # Force the decoder to start with the target-language token
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=512,
            num_beams=4,
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# English to Khmer
result = translate("The court finds the defendant guilty.", "eng_Latn", "khm_Khmr")
print(result)

# Khmer to English
result = translate("<your_khmer_text>", "khm_Khmr", "eng_Latn")
print(result)
```
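A round-trip (back-translation) check can give a rough sanity signal when no reference translation is available. The helper below is a sketch: `round_trip` is a hypothetical name, and it accepts any callable with the same `(text, src_lang, tgt_lang)` signature as the `translate` function from the usage snippet, so it can be tried without loading the model.

```python
from typing import Callable


def round_trip(
    text: str,
    translate: Callable[[str, str, str], str],
    src_lang: str = "eng_Latn",
    pivot_lang: str = "khm_Khmr",
) -> dict:
    """Translate text to the pivot language and back, returning all three versions."""
    forward = translate(text, src_lang, pivot_lang)
    backward = translate(forward, pivot_lang, src_lang)
    return {
        "source": text,
        "forward": forward,
        "backward": backward,
        "exact_match": backward.strip() == text.strip(),
    }
```

With the real model this would be called as `round_trip("Human rights are protected by law.", translate)`. An exact match is a strict criterion; in practice the back-translation is more useful for spotting meaning drift, such as a changed tense, than as a pass/fail test.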

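## Translation Examples

Each example lists the source sentence, the model output, and a back-translation of that output into the source language ("Backwards").

### English → Khmer Tests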

🇬🇧 EN: The court finds the defendant guilty.
πŸ‡°πŸ‡­ KM: αžαž»αž›αžΆαž€αžΆαžš αž”αžΆαž“ αžšαž€ αžƒαžΎαž‰ ថអ αž‡αž“ αž‡αžΆαž”αŸ‹ αž…αŸ„αž‘ αž˜αžΆαž“ αž€αŸ†αž αž»αžŸ αŸ”
Backwards: The court found the accused guilty.

🇬🇧 EN: In addition, authorities, especially provincial authorities, are not aware of the law and apply outdated provisions to restrict NGO meetings and peaceful community demonstrations.
πŸ‡°πŸ‡­ KM: αž›αžΎαžŸ αž–αžΈ αž“αŸαŸ‡ αž‘αŸ… αž‘αŸ€αž αž’αžΆαž‡αŸ’αž‰αžΆαž’αžš αž‡αžΆ αž–αž·αžŸαŸαžŸ αž’αžΆαž‡αŸ’αž‰αžΆαž’αžš αžαŸαžαŸ’αž αž˜αž·αž“ αž”αžΆαž“ αžŠαžΉαž„ αž–αžΈ αž…αŸ’αž”αžΆαž”αŸ‹ αž“αŸαŸ‡ αž‘αŸ αž“αž·αž„ αž’αž“αž»αžœαžαŸ’αž αžŸαŸαž…αž€αŸ’αžαžΈ αž–αŸ’αžšαžΆαž„ αž…αžΆαžŸαŸ‹ αŸ— αžŠαžΎαž˜αŸ’αž”αžΈ αžŠαžΆαž€αŸ‹ αž€αž˜αŸ’αžšαž·αž αž€αž·αž…αŸ’αž… αž”αŸ’αžšαž‡αž»αŸ† αžšαž”αžŸαŸ‹ αž’αž„αŸ’αž‚ αž€αžΆαžš αž˜αž·αž“αž˜αŸ‚αž“ αžšαžŠαŸ’αž‹αžΆαž—αž·αž”αžΆαž› αž“αž·αž„ αž€αžΆαžš αž’αŸ’αžœαžΎ αž”αžΆαžαž»αž€αž˜αŸ’αž˜ αžŠαŸ„αž™ αžŸαž“αŸ’αžαž·αžœαž·αž’αžΈ αžšαž”αžŸαŸ‹ αžŸαž αž‚αž˜αž“αŸ αŸ”
Backwards: In addition, authorities, especially provincial authorities, are unaware of the law and apply the old regulations to restrict meetings of non-governmental organizations and peaceful demonstrations of communities.

🇬🇧 EN: Human rights are protected by law.
πŸ‡°πŸ‡­ KM: αžŸαž·αž‘αŸ’αž’αž· αž˜αž“αž»αžŸαŸ’αžŸ αžαŸ’αžšαžΌαžœ αž”αžΆαž“ αž€αžΆαžšαž–αžΆαžš αžŠαŸ„αž™ αž…αŸ’αž”αžΆαž”αŸ‹ αŸ”
Backwards: Human rights are protected by law.

🇬🇧 EN: The refugee seeks asylum in Australia.
πŸ‡°πŸ‡­ KM: αž‡αž“αž—αŸ€αžŸαžαŸ’αž›αž½αž“αž“αŸαŸ‡αž”αžΆαž“αžŸαŸ’αžœαŸ‚αž„αžšαž€αžŸαž·αž‘αŸ’αž’αž·αž‡αŸ’αžšαž€αž€αŸ„αž“αž“αŸ…αž”αŸ’αžšαž‘αŸαžŸαž’αžΌαžŸαŸ’αžαŸ’αžšαžΆαž›αžΈαŸ”
Backwards: The refugee sought asylum in Australia.

🇬🇧 EN: Lesson 8: Ensuring that children receive information and guidance, hygiene and sanitation, nutrition, care for children, providing food and drinks, receiving vaccines, maintaining personal hygiene, learning from experiences, fostering social interactions, observing surroundings, participating in community activities, protecting children from danger and harm, and handling various situations.
πŸ‡°πŸ‡­ KM: Lesson 8: αž€αžΆαžšαž’αžΆαž“αžΆαžαžΆαž€αž»αž˜αžΆαžšαž‘αž‘αž½αž›αž”αžΆαž“αž–αŸαžαŸŒαž˜αžΆαž“αž“αž·αž„αž€αžΆαžšαžŽαŸ‚αž“αžΆαŸ†, αž’αž“αžΆαž˜αŸαž™αž“αž·αž„αž’αž“αžΆαž˜αŸαž™, αž’αžΆαž αžΆαžšαžΌαž”αžαŸ’αžαž˜αŸ’αž—, αž€αžΆαžšαžαŸ‚αž‘αžΆαŸ†αž€αž»αž˜αžΆαžš, αž€αžΆαžšαž•αŸ’αžαž›αŸ‹αž’αžΆαž αžΆαžšαž“αž·αž„αž—αŸαžŸαž‡αŸ’αž‡αŸˆ, αž€αžΆαžšαž‘αž‘αž½αž›αžœαŸ‰αžΆαž€αŸ‹αžŸαžΆαŸ†αž„, αž€αžΆαžšαžαŸ‚αž‘αžΆαŸ†αž’αž“αžΆαž˜αŸαž™αž•αŸ’αž‘αžΆαž›αŸ‹αžαŸ’αž›αž½αž“, αž€αžΆαžšαžšαŸ€αž“αž–αžΈαž”αž‘αž–αž·αžŸαŸ„αž’αž“αŸ, αž€αžΆαžšαž›αžΎαž€αž€αž˜αŸ’αž–αžŸαŸ‹αž‘αŸ†αž“αžΆαž€αŸ‹αž‘αŸ†αž“αž„αžŸαž„αŸ’αž‚αž˜, αž€αžΆαžšαžŸαž„αŸ’αž€αŸαžαž˜αžΎαž›αž”αžšαž·αžŸαŸ’αžαžΆαž“, αž€αžΆαžšαž…αžΌαž›αžšαž½αž˜αž“αŸ…αž€αŸ’αž“αž»αž„αžŸαž€αž˜αŸ’αž˜αž—αžΆαž–αžŸαž αž‚αž˜αž“αŸ, αž€αžΆαžšαž€αžΆαžšαž–αžΆαžšαž€αž»αž˜αžΆαžšαž–αžΈαž‚αŸ’αžšαŸ„αŸ‡αžαŸ’αž“αžΆαž€αŸ‹αž“αž·αž„αž€αžΆαžšαž”αŸ‰αŸ‡αž–αžΆαž›αŸ‹, αž“αž·αž„αž€αžΆαžšαžŠαŸ„αŸ‡αžŸαŸ’αžšαžΆαž™αžŸαŸ’αžαžΆαž“αž—αžΆαž–αž•αŸ’αžŸαŸαž„αŸ—
Backwards: Lesson 8: Ensuring children receive information and guidance, sanitation and hygiene, nutrition, childcare, food and beverage provision, vaccination, personal hygiene, learning from experience, social communication promotion, environmental observation, participation in community activities, protecting children from harm and exposure, and addressing various situations

### Khmer → English Tests

πŸ‡°πŸ‡­ KM: αž’αž“αžΈαžαž·αž‡αž“αžŠαŸ‚αž›αž”αŸ’αžšαž–αŸ’αžšαžΉαžαŸ’αžαž”αž‘αž›αŸ’αž˜αžΎαžŸαžαŸ’αžšαžΌαžœαž‘αž‘αž½αž›αž”αžΆαž“αž€αžΆαžšαž€αžΆαžšαž–αžΆαžšαž–αžΈαž˜αŸαž’αžΆαžœαžΈ αž“αž·αž„αžαŸ’αžšαžΌαžœαž”αžΆαž“αž‡αŸ†αž“αž»αŸ†αž‡αž˜αŸ’αžšαŸ‡αž€αŸ’αž“αž»αž„αžαž»αž›αžΆαž€αžΆαžšαž’αž“αžΈαžαž·αž‡αž“αŸ”
🇬🇧 EN: A minor who commits a crime is protected by a lawyer and is tried in a juvenile court.
Backwards: αž™αž»αžœαž‡αž“ αžŠαŸ‚αž› αž”αŸ’αžšαž–αŸ’αžšαžΉαžαŸ’αž αž”αž‘ αž§αž€αŸ’αžšαž·αžŠαŸ’αž‹ αžαŸ’αžšαžΌαžœ αž”αžΆαž“ αž€αžΆαžšαž–αžΆαžš αžŠαŸ„αž™ αž˜αŸαž’αžΆαžœαžΈ αž αžΎαž™ αžαŸ’αžšαžΌαžœ αž”αžΆαž“ αž€αžΆαžαŸ‹ αž‘αŸ„αžŸ αž“αŸ… αžαž»αž›αžΆαž€αžΆαžš αž™αž»αžœαž‡αž“ αŸ”

πŸ‡°πŸ‡­ KM: αž‡αž“αž—αŸ€αžŸαžαŸ’αž›αž½αž“αžŠαŸ‚αž›αž˜αžΆαž“αž€αžΆαžšαž—αŸαž™αžαŸ’αž›αžΆαž…αžŠαŸ‚αž›αž˜αžΆαž“αž˜αžΌαž›αžŠαŸ’αž‹αžΆαž“αžαŸ’αžšαžΉαž˜αžαŸ’αžšαžΌαžœαž’αŸ†αž–αžΈαž€αžΆαžšαž’αŸ’αžœαžΎαž‘αž»αž€αŸ’αžαž”αž»αž€αž˜αŸ’αž“αŸαž‰αž˜αžΆαž“αžŸαž·αž‘αŸ’αž’αž·αžŸαŸ’αž“αžΎαžŸαž»αŸ†αžŸαž·αž‘αŸ’αž’αž·αž‡αŸ’αžšαž€αž€αŸ„αž“αŸ”
🇬🇧 EN: Refugee with a well-founded fear of torture is eligible for asylum.
Backwards: αž‡αž“αž—αŸ€αžŸαžαŸ’αž›αž½αž“αžŠαŸ‚αž›αž˜αžΆαž“αž€αžΆαžšαž—αŸαž™αžαŸ’αž›αžΆαž…αž“αŸƒαž€αžΆαžšαž’αŸ’αžœαžΎαž‘αžΆαžšαž»αžŽαž€αž˜αŸ’αž˜αžŠαŸ‚αž›αž˜αžΆαž“αž˜αžΌαž›αžŠαŸ’αž‹αžΆαž“αž›αŸ’αž’αž˜αžΆαž“αžŸαž·αž‘αŸ’αž’αž·αž‘αž‘αž½αž›αžŸαž·αž‘αŸ’αž’αž·αž‡αŸ’αžšαž€αž€αŸ„αž“αŸ”

πŸ‡°πŸ‡­ KM: αž‡αž“αž‡αžΆαž”αŸ‹αž…αŸ„αž‘αž˜αžΆαž“αžŸαž·αž‘αŸ’αž’αž·αž˜αž·αž“αž‘αž‘αž½αž›αžŸαŸ’αž‚αžΆαž›αŸ‹αž€αŸ†αž αž»αžŸ αž“αž·αž„αž˜αžΆαž“αžŸαž·αž‘αŸ’αž’αž·αž‘αž‘αž½αž›αž”αžΆαž“αž€αžΆαžšαž‡αŸ†αž“αž»αŸ†αž‡αž˜αŸ’αžšαŸ‡αž™αž»αžαŸ’αžαž·αž’αž˜αŸŒαŸ”
🇬🇧 EN: Defendants have the right not to plead guilty and the right to a fair trial.
Backwards: αž‡αž“ αž‡αžΆαž”αŸ‹ αž…αŸ„αž‘ αž˜αžΆαž“ αžŸαž·αž‘αŸ’αž’αž· αž˜αž·αž“ αž‘αž‘αž½αž› ខុស αžαŸ’αžšαžΌαžœ αž“αž·αž„ αžŸαž·αž‘αŸ’αž’αž· αž‘αž‘αž½αž› αž”αžΆαž“ αž€αžΆαžš αž€αžΆαžαŸ‹ αž‘αŸ„αžŸ αžŠαŸ„αž™ αž™αž»αžαŸ’αžαž·αž’αž˜αŸŒ αŸ”

πŸ‡°πŸ‡­ KM: αž‚αŸ„αž›αž€αžΆαžšαžŽαŸαž˜αž·αž“αž”αž‰αŸ’αž‡αžΌαž“αžαŸ’αžšαž‘αž”αŸ‹αž‘αŸ…αžœαž·αž‰αž αžΆαž˜αžƒαžΆαžαŸ‹αžšαžŠαŸ’αž‹αž˜αž·αž“αž±αŸ’αž™αž”αžŽαŸ’αžαŸαž‰αž”αž»αž‚αŸ’αž‚αž›αž‘αŸ…αž”αŸ’αžšαž‘αŸαžŸαžŠαŸ‚αž›αž–αž½αž€αž‚αŸαž’αžΆαž…αž”αŸ’αžšαžˆαž˜αž“αžΉαž„αž€αžΆαžšαž’αŸ’αžœαžΎαž‘αžΆαžšαž»αžŽαž€αž˜αŸ’αž˜αŸ”
🇬🇧 EN: The non-refoulement policy prohibits states from deporting individuals to countries where they may face torture.
Backwards: αž‚αŸ„αž› αž“αž™αŸ„αž”αžΆαž™ αž˜αž·αž“ αžαŸ’αžšαž‘αž”αŸ‹ αž˜αž€ αžœαž·αž‰ αž“αŸαŸ‡ ហអម αžƒαžΆαžαŸ‹ αžšαžŠαŸ’αž‹ αž˜αž·αž“ αž²αŸ’αž™ αž”αžŽαŸ’αžαŸαž‰ αž”αž»αž‚αŸ’αž‚αž› αž‘αŸ… αž€αžΆαž“αŸ‹ αž”αŸ’αžšαž‘αŸαžŸ αžŠαŸ‚αž› αž–αž½αž€ αž‚αŸ αž’αžΆαž… αž”αŸ’αžšαžˆαž˜ មុខ αž“αžΉαž„ αž€αžΆαžš αž’αŸ’αžœαžΎ αž‘αžΆαžšαž»αžŽ αž€αž˜αŸ’αž˜ αŸ”

πŸ‡°πŸ‡­ KM: αž‚αŸ’αžšαž”αŸ‹αžŸαŸαž…αž€αŸ’αžαžΈαžŸαž˜αŸ’αžšαŸαž…αžŠαŸ‚αž›αž”αŸ‰αŸ‡αž–αžΆαž›αŸ‹αžŠαž›αŸ‹αž€αž»αž˜αžΆαžšαžαŸ’αžšαžΌαžœαž‚αž·αžαž‚αžΌαžšαž–αžΈαž’αžαŸ’αžαž”αŸ’αžšαž™αŸ„αž‡αž“αŸαž€αž»αž˜αžΆαžšαž›αŸ’αž’αž”αŸ†αž•αž»αžαž‡αžΆαž…αž˜αŸ’αž”αž„αŸ”
🇬🇧 EN: Any decision that affects children must be based on the best interests of the child.
Backwards: αž€αžΆαžš αžŸαž˜αŸ’αžšαŸαž… αž…αž·αžαŸ’αž ណអ αž˜αž½αž™ αžŠαŸ‚αž› αž”αŸ‰αŸ‡ αž–αžΆαž›αŸ‹ αžŠαž›αŸ‹ αž€αž»αž˜αžΆαžš αžαŸ’αžšαžΌαžœ αžαŸ‚ αž•αŸ’αž’αŸ‚αž€ αž›αžΎ αž•αž› αž”αŸ’αžšαž™αŸ„αž‡αž“αŸ αž›αŸ’αž’ αž”αŸ†αž•αž»αž αžšαž”αžŸαŸ‹ αž€αž»αž˜αžΆαžš αŸ”

πŸ‡°πŸ‡­ KM: αž€αžΆαžšαžƒαž»αŸ†αžαŸ’αž›αž½αž“αž˜αž»αž“αž€αžΆαžšαž‡αŸ†αž“αž»αŸ†αž‡αž˜αŸ’αžšαŸ‡αžαŸ’αžšαžΌαžœαž”αŸ’αžšαžΎαž‡αžΆαžœαž·αž’αžΆαž“αž€αžΆαžšαž…αž»αž„αž€αŸ’αžšαŸ„αž™αžŸαž˜αŸ’αžšαžΆαž”αŸ‹αž’αž“αžΈαžαž·αž‡αž“αŸ”
🇬🇧 EN: Pre-trial detention shall be used as a last resort for minors.
Backwards: αž€αžΆαžš αžƒαž»αŸ† αžαŸ’αž›αž½αž“ αž˜αž»αž“ αž–αŸαž› αž€αžΆαžαŸ‹ αž€αŸ’αžαžΈ αž“αžΉαž„ αžαŸ’αžšαžΌαžœ αž”αžΆαž“ αž”αŸ’αžšαžΎαž”αŸ’αžšαžΆαžŸαŸ‹ αž‡αžΆ αžŠαŸ†αžŽαŸ„αŸ‡ αžŸαŸ’αžšαžΆαž™ αž…αž»αž„ αž€αŸ’αžšαŸ„αž™ αžŸαž˜αŸ’αžšαžΆαž”αŸ‹ αž’αž“αžΈαžαž·αž‡αž“ αŸ”

πŸ‡°πŸ‡­ KM: αžαž»αž›αžΆαž€αžΆαžšαžαŸ’αžšαžΌαžœαž–αž·αž…αžΆαžšαžŽαžΆαž—αžŸαŸ’αžαž»αžαžΆαž„αž‘αžΆαŸ†αž„αž’αžŸαŸ‹αž˜αž»αž“αž–αŸαž›αžŸαž˜αŸ’αžšαŸαž…αž…αž·αžαŸ’αž αž αžΎαž™αžαŸ’αžšαžΌαžœαž•αŸ’αžαž›αŸ‹αž αŸαžαž»αž•αž›αž…αŸ’αž”αžΆαžŸαŸ‹αž›αžΆαžŸαŸ‹αžŸαž˜αŸ’αžšαžΆαž”αŸ‹αžŸαžΆαž›αž€αŸ’αžšαž˜αŸ”
🇬🇧 EN: The court must consider all the evidence before making a decision and must give a clear reason for the verdict.
Backwards: αžαž»αž›αžΆαž€αžΆαžš αžαŸ’αžšαžΌαžœ αžαŸ‚ αž–αž·αž…αžΆαžšαžŽαžΆ αž›αžΎ αž—αžŸαŸ’αžαž»αžαžΆαž„ αž‘αžΆαŸ†αž„ αž’αžŸαŸ‹ αž˜αž»αž“ αž–αŸαž› αž’αŸ’αžœαžΎ αž€αžΆαžš αžŸαž˜αŸ’αžšαŸαž… αž…αž·αžαŸ’αž αž“αž·αž„ αžαŸ’αžšαžΌαžœ αžαŸ‚ αž•αŸ’αžαž›αŸ‹ ហេតុ αž•αž› αž…αŸ’αž”αžΆαžŸαŸ‹αž›αžΆαžŸαŸ‹ αžŸαŸ†αžšαžΆαž”αŸ‹ αž€αžΆαžš αž€αžΆαžαŸ‹ αž€αŸ’αžαžΈ αŸ”

πŸ‡°πŸ‡­ KM: αž€αŸ’αžšαžŸαž½αž„αž™αž»αžαŸ’αžαž·αž’αž˜αŸŒαžαŸ’αžšαžΌαžœαž’αžΆαž“αžΆαžαžΆαž’αŸ’αž“αž€αž‡αžΆαž”αŸ‹αžƒαž»αŸ†αž‘αžΆαŸ†αž„αž’αžŸαŸ‹αž˜αžΆαž“αž›αž‘αŸ’αž’αž—αžΆαž–αž‘αž‘αž½αž›αž”αžΆαž“αž‡αŸ†αž“αž½αž™αž•αŸ’αž›αžΌαžœαž…αŸ’αž”αžΆαž”αŸ‹αŸ”
🇬🇧 EN: The Justice Department must ensure that all detainees have access to legal aid.
Backwards: αž€αŸ’αžšαžŸαž½αž„ αž™αž»αžαŸ’αžαž·αž’αž˜αŸŒ αžαŸ’αžšαžΌαžœ αžαŸ‚ αž’αžΆαž“αžΆ ថអ αž’αŸ’αž“αž€ αž‡αžΆαž”αŸ‹ αžƒαž»αŸ† αž‘αžΆαŸ†αž„ αž’αžŸαŸ‹ αž˜αžΆαž“ αžŸαž·αž‘αŸ’αž’αž· αž‘αž‘αž½αž› αž”αžΆαž“ αž‡αŸ†αž“αž½αž™ αž•αŸ’αž›αžΌαžœ αž…αŸ’αž”αžΆαž”αŸ‹ αŸ”


## Known Limitations

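1. **Temporal bias**: The model tends to add past-tense markers in Khmer when translating present-tense English. In the examples above, "The refugee seeks asylum" back-translates as "The refugee sought asylum". Phase 3 of the roadmap (tense augmentation) is intended to address this.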

2. **Domain specificity**: Best results on legal and formal documents.

3. **Length**: Optimized for sentence-level translation (max 512 tokens).
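Since the model is tuned for sentence-level input, longer documents can be split into sentences and translated one at a time. The sketch below uses a naive regular-expression splitter and hypothetical helper names; it splits after `.`, `!`, `?` when followed by whitespace, and after the Khmer full stop "។" (U+17D4) with or without whitespace. It will mis-split abbreviations such as "e.g.", so a proper sentence segmenter would be preferable for real documents.

```python
import re

# Split after ., !, ? followed by whitespace, or after the Khmer full stop (U+17D4)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=\u17d4)\s*")


def split_sentences(text: str) -> list:
    """Naively split text into sentences, keeping the end punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def translate_document(text: str, translate, src_lang: str, tgt_lang: str) -> str:
    """Translate sentence by sentence with a translate(text, src, tgt) callable."""
    return " ".join(translate(s, src_lang, tgt_lang) for s in split_sentences(text))


print(split_sentences("The court is in session. All rise! Who is the defendant?"))
```

A single sentence longer than 512 tokens would still be truncated by the tokenizer settings in the usage snippet.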

## Ethical Considerations

This model is intended for humanitarian purposes. It should NOT replace certified human translators in official legal proceedings.

**Privacy**: Stateless processing - no input text is stored or logged.

## Roadmap

- [x] Phase 1.5: Denoising pre-training on 88K Khmer examples
- [x] Phase 1: Bidirectional translation fine-tuning on 389K examples
- [ ] Phase 2: LoRA fine-tuning on a legal glossary (~5,000 pairs, plus 450 terms that could serve as hard constraints)
- [ ] Phase 3: Tense augmentation to address temporal bias
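Until glossary terms are enforced during training or decoding, a simple post-hoc check can flag translations that miss a required term. The sketch below is only an illustration of that idea, not the planned Phase 2 method: the function name is hypothetical, the glossary entry uses an ASCII placeholder instead of a real Khmer term, and plain substring matching is an assumption that ignores inflection and segmentation.

```python
def missing_glossary_terms(source: str, translation: str, glossary: dict) -> list:
    """Return (source_term, expected_target) pairs whose target term is absent.

    A term is checked only if its source form occurs in the source text.
    """
    source_lower = source.lower()
    return [
        (src_term, tgt_term)
        for src_term, tgt_term in glossary.items()
        if src_term.lower() in source_lower and tgt_term not in translation
    ]


# Placeholder glossary: "KM_ASYLUM" stands in for the approved Khmer term
glossary = {"asylum": "KM_ASYLUM"}
print(missing_glossary_terms("The refugee seeks asylum.", "some output", glossary))
```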

## Technical References

**Datasets:**
- khPOS: Ye Kyaw Thu et al., "Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus" (ONA 2017)
- Khmer Dictionary 44K: Royal Academy of Cambodia, 2022
- ALT Corpus: Asian Language Treebank

**Model:**
- NLLB-200: Costa-jussà et al., "No Language Left Behind" (2022)

**Methodology:**
- Denoising: Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training" (2019)

## Citation

```bibtex
@misc{khmer-legal-bridge-2024,
  title={Khmer Legal Bridge: Fine-tuned NLLB for Legal Translation},
  author={ClaudBarbara},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/ClaudBarbara/Open_Access_Khmer}
}
```

## Acknowledgments

- Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
- Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
- Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf

## License

CC-BY-NC-4.0


Khmer Legal Bridge - Open Source Legal Translation for Refugee Communities
