---
license: apache-2.0
language:
- en
- pl
tags:
- translation
- marian
- nmt
- encoder-decoder
- from-scratch
pipeline_tag: translation
widget:
- text: "The weather is beautiful today."
  example_title: "Simple sentence"
- text: "Machine learning is transforming the way we build software applications."
  example_title: "Technical text"
- text: "The European Union has proposed new regulations on artificial intelligence."
  example_title: "Formal text"
datasets:
- opus100
- europarl_bilingual
- un_pc
model-index:
- name: pumatic-en-pl
  results: []
---

# Pumatic English-Polish Translation Model

A neural machine translation model for English to Polish translation, **trained entirely from scratch** using the MarianMT architecture.

## Model Description

- **Model type:** Encoder-Decoder (MarianMT architecture)
- **Language pair:** English → Polish
- **Parameters:** ~157M
- **Training approach:** From scratch (randomly initialized weights, custom tokenizer)
- **GPU:** 4x NVIDIA H200
- **Trained by:** [pumad](https://huggingface.co/pumad)

> **Note:** This model was **not fine-tuned** from any existing pre-trained model. Both the model weights and the SentencePiece tokenizer were trained from scratch on the parallel corpus.

## Architecture

| Component | Configuration |
|-----------|---------------|
| d_model | 768 |
| Encoder layers | 8 |
| Decoder layers | 8 |
| Attention heads | 12 |
| FFN dimension | 3072 |
| Vocabulary size | 32,000 |
| Max position embeddings | 512 |
| Activation function | GELU |

## Training Details

### Training Data

The model was trained on high-quality parallel corpora:

- **OPUS-100** - Multilingual parallel corpus
- **Europarl** - European Parliament proceedings
- **UN Parallel Corpus (UNPC)** - United Nations documents

### Training Procedure

- **Hardware:** 4x NVIDIA H200 GPUs (distributed training)
- **Framework:** Hugging Face Transformers + Accelerate
- **Batch size:** 512 per GPU (2,048 effective)
- **Learning rate:** 3e-4 with cosine decay
- **Warmup:** 6% of training steps
- **Epochs:** 10
- **Optimizer:** Fused AdamW
- **Precision:** bf16 mixed precision
- **Max sequence length:** 128 tokens

### Tokenizer

A custom SentencePiece tokenizer (unigram model) was trained on the parallel corpus with:

- 32,000 vocabulary size
- 99.95% character coverage
- Language tag support (`>>pl<<`)

### Data Preprocessing

- Quality filtering: removed pairs with fewer than 5 words or more than 200 words
- Length ratio filtering: excluded pairs with extreme length ratios (< 0.5 or > 2.0)
- Deduplication: removed duplicate source sentences

## Usage

### Using the Transformers library

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "pumad/pumatic-en-pl"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you today?"
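# The tokenizer was trained with language-tag support (see the Tokenizer section).
# Assumption, not part of the original example: if raw input gives poor results,
# try prepending the >>pl<< target tag to the source text, e.g.
# text = ">>pl<< Hello, how are you today?"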
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)
print(output)
```

### Using the Pipeline API

```python
from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")
result = translator("The quick brown fox jumps over the lazy dog.")
print(result[0]['translation_text'])
```

## Demo

Try this model live at [pumatic.eu](https://pumatic.eu)

API documentation is available at [pumatic.eu/docs](https://pumatic.eu/docs)

## Limitations

- Optimized for general-purpose translation; domain-specific terminology may vary in quality
- Maximum input length of ~400 characters per chunk for optimal results (a chunking sketch is included in the appendix at the end of this card)
- Best performance on formal/written text; colloquial expressions may be less accurate

## License

Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@misc{pumatic-en-pl,
  author = {pumad},
  title = {Pumatic English-Polish Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/pumad/pumatic-en-pl}
}
```
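
## Appendix: Illustrative Snippets

The snippets below are not part of the released training or inference code; they are rough sketches derived from the information in this card.

The first mirrors the Architecture table as a `MarianConfig`. Fields not listed in the table (dropout, token IDs, embedding sharing) are left at the library defaults, which is an assumption rather than a confirmed property of the released checkpoint.

```python
from transformers import MarianConfig, MarianMTModel

# Rough reconstruction of the architecture described in this card.
config = MarianConfig(
    vocab_size=32000,
    d_model=768,
    encoder_layers=8,
    decoder_layers=8,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    max_position_embeddings=512,
    activation_function="gelu",
)
model = MarianMTModel(config)  # randomly initialized, as in from-scratch training
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```

With the default embedding sharing, the parameter count should land close to the ~157M figure quoted above.

Because the model works best on inputs of up to roughly 400 characters per chunk (see Limitations), longer documents can be split at sentence boundaries before translation. The helper below is one possible way to do this; `translate_long_text` and the regex-based sentence split are illustrative, not part of the released code.

```python
import re

from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")

def translate_long_text(text: str, max_chars: int = 400) -> str:
    """Split text into ~max_chars chunks at sentence boundaries,
    translate each chunk, and join the results."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    results = translator(chunks)
    return " ".join(r["translation_text"] for r in results)

print(translate_long_text("The committee met on Tuesday to discuss the proposal. " * 20))
```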