---
license: apache-2.0
language:
- en
- pl
tags:
- translation
- marian
- nmt
- encoder-decoder
- from-scratch
pipeline_tag: translation
widget:
- text: "The weather is beautiful today."
  example_title: "Simple sentence"
- text: "Machine learning is transforming the way we build software applications."
  example_title: "Technical text"
- text: "The European Union has proposed new regulations on artificial intelligence."
  example_title: "Formal text"
datasets:
- opus100
- europarl_bilingual
- un_pc
model-index:
- name: pumatic-en-pl
  results: []
---

# Pumatic English-Polish Translation Model

A neural machine translation model for English to Polish translation, **trained entirely from scratch** using the MarianMT architecture.

## Model Description

- **Model type:** Encoder-Decoder (MarianMT architecture)
- **Language pair:** English → Polish
- **Parameters:** ~157M
- **Training approach:** From scratch (randomly initialized weights, custom tokenizer)
- **GPU:** 4x NVIDIA H200
- **Trained by:** [pumad](https://huggingface.co/pumad)

> **Note:** This model was **not fine-tuned** from any existing pre-trained model. Both the model weights and the SentencePiece tokenizer were trained from scratch on the parallel corpus.

## Architecture

| Component | Configuration |
|-----------|---------------|
| d_model | 768 |
| Encoder layers | 8 |
| Decoder layers | 8 |
| Attention heads | 12 |
| FFN dimension | 3072 |
| Vocabulary size | 32,000 |
| Max position embeddings | 512 |
| Activation function | GELU |

## Training Details

### Training Data

The model was trained on high-quality parallel corpora:

- **OPUS-100** - Multilingual parallel corpus
- **Europarl** - European Parliament proceedings
- **UN Parallel Corpus (UNPC)** - United Nations documents

### Training Procedure

- **Hardware:** 4x NVIDIA H200 GPUs (distributed training)
- **Framework:** Hugging Face Transformers + Accelerate
- **Batch size:** 512 per GPU (2,048 effective)
- **Learning rate:** 3e-4 with cosine decay
- **Warmup:** 6% of training steps
- **Epochs:** 10
- **Optimizer:** Fused AdamW
- **Precision:** bf16 mixed precision
- **Max sequence length:** 128 tokens

### Tokenizer

A custom SentencePiece tokenizer (unigram model) was trained on the parallel corpus with:

- 32,000 vocabulary size
- 99.95% character coverage
- Language tag support (`>>pl<<`)

### Data Preprocessing

- Quality filtering: removed pairs with fewer than 5 words or more than 200 words
- Length ratio filtering: excluded pairs with extreme length ratios (< 0.5 or > 2.0)
- Deduplication: removed duplicate source sentences

## Usage

### Using the Transformers library

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "pumad/pumatic-en-pl"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Hello, how are you today?"
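# The tokenizer was trained with language-tag support (see the Tokenizer section).
# Assumption, not part of the original example: if raw input gives poor results,
# try prepending the >>pl<< target tag to the source text, e.g.
# text = ">>pl<< Hello, how are you today?"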
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
output = tokenizer.decode(translated[0], skip_special_tokens=True)
print(output)
```

### Using the Pipeline API

```python
from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")
result = translator("The quick brown fox jumps over the lazy dog.")
print(result[0]['translation_text'])
```

## Demo

Try this model live at [pumatic.eu](https://pumatic.eu)

API documentation is available at [pumatic.eu/docs](https://pumatic.eu/docs)

## Limitations

- Optimized for general-purpose translation; domain-specific terminology may vary in quality
- Maximum input length of ~400 characters per chunk for optimal results (a chunking sketch is included in the appendix at the end of this card)
- Best performance on formal/written text; colloquial expressions may be less accurate

## License

Apache 2.0

## Citation

If you use this model, please cite:

```bibtex
@misc{pumatic-en-pl,
  author = {pumad},
  title = {Pumatic English-Polish Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/pumad/pumatic-en-pl}
}
```
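
## Appendix: Illustrative Snippets

The snippets below are not part of the released training or inference code; they are rough sketches derived from the information in this card.

The first mirrors the Architecture table as a `MarianConfig`. Fields not listed in the table (dropout, token IDs, embedding sharing) are left at the library defaults, which is an assumption rather than a confirmed property of the released checkpoint.

```python
from transformers import MarianConfig, MarianMTModel

# Rough reconstruction of the architecture described in this card.
config = MarianConfig(
    vocab_size=32000,
    d_model=768,
    encoder_layers=8,
    decoder_layers=8,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    max_position_embeddings=512,
    activation_function="gelu",
)
model = MarianMTModel(config)  # randomly initialized, as in from-scratch training
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```

With the default embedding sharing, the parameter count should land close to the ~157M figure quoted above.

Because the model works best on inputs of up to roughly 400 characters per chunk (see Limitations), longer documents can be split at sentence boundaries before translation. The helper below is one possible way to do this; `translate_long_text` and the regex-based sentence split are illustrative, not part of the released code.

```python
import re

from transformers import pipeline

translator = pipeline("translation", model="pumad/pumatic-en-pl")

def translate_long_text(text: str, max_chars: int = 400) -> str:
    """Split text into ~max_chars chunks at sentence boundaries,
    translate each chunk, and join the results."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    results = translator(chunks)
    return " ".join(r["translation_text"] for r in results)

print(translate_long_text("The committee met on Tuesday to discuss the proposal. " * 20))
```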