{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# ✍️ **Day 03 – Summarization & Translation Deep Dive with Hugging Face 🤗**\n",
"\n",
"This notebook contains all the code experiments for **Day 3** of my *30 Days of GenAI* challenge.\n",
"\n",
"For detailed commentary and discoveries, see 👉 [Day 3 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day3.md)\n",
"\n",
"---\n",
"\n",
"## 📌 What’s Covered Today\n",
"\n",
"Today, we're broadening our horizons beyond classification to explore two powerful generative NLP tasks: **Text Summarization** and **Machine Translation**. Our focus will be on understanding the capabilities of default Hugging Face pipelines versus models specifically fine-tuned for Arabic.\n",
"\n",
"Here’s our game plan:\n",
"\n",
"### 📝 Text Summarization\n",
"- Initial exploration with the **default summarization pipeline** using English text to establish a baseline.\n",
"- Evaluating the performance of **Arabic-specific summarization models** against the default, both with English and Arabic inputs, to observe the impact of specialized training.\n",
"\n",
"### 🌐 Machine Translation\n",
"- Testing translation capabilities between **English and Modern Standard Arabic (MSA)**, examining both directions (EN -> MSA, MSA -> EN) using fine-tuned models. We anticipate strong performance here.\n",
"- Tackling the more challenging task of translating **Arabic Dialects to English and vice-versa**. This is where we expect to see significant differences and highlight the necessity of dialect-aware models.\n",
"\n",
"Let’s dive in and uncover the nuances of text generation! 🚀\n",
"\n",
"---"
],
"metadata": {
"id": "IoWElNuiSktA"
}
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "2Dnba_uXPlH8"
},
"outputs": [],
"source": [
"from transformers import pipeline"
]
},
{
"cell_type": "markdown",
"source": [
"### 📝 Summarization Experiment 1: Default Pipeline with Narrative Text (English)\n",
"\n",
"---\n",
"\n",
"For our first exploration into text summarization, we'll use the default Hugging Face `summarization` pipeline without specifying a particular model or length parameters. This will give us a baseline understanding of how a general-purpose model handles narrative content, specifically a story about a well-known fictional character like Naruto Uzumaki. We want to see how much detail it retains and its overall summarization style.\n",
"\n",
"---"
],
"metadata": {
"id": "FDunD64VikZW"
}
},
{
"cell_type": "code",
"source": [
"summarizer = pipeline(\"summarization\")\n",
"\n",
"\n",
"# Access the model's configuration\n",
"# The '_name_or_path' attribute often holds the model ID\n",
"# print(f\"The default summarization model loaded is: {summarizer.model.config._name_or_path}\")\n",
"\n",
"# You can also get more details about the model\n",
"# print(summarizer.model.config)\n",
"\n",
"\n",
"# The long story about Naruto\n",
"naruto_story = \"\"\"\n",
"Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him. This secret led to him being ostracized and feared by the villagers, forcing a young Naruto to desperately seek attention and validation through pranks and a boisterous personality. His unwavering dream, however, was to become the Hokage, the village leader, a path he believed would finally earn him the respect and love he craved.\n",
"\n",
"His journey began with humble ninja training, forming Team 7 with the aloof Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake. Early missions, like confronting Zabuza and Haku in the Land of Waves, forged bonds and revealed Naruto's hidden potential and fierce loyalty. As he grew, he faced numerous personal and global conflicts. The heart-wrenching pursuit of Sasuke, driven by revenge and Orochimaru's manipulation, became a central struggle, pushing Naruto to immense power, including mastering the Rasengan under the tutelage of his beloved mentor, Jiraiya. Jiraiya's tragic death at the hands of Pain, a former student, was a profound blow, yet it fueled Naruto's resolve, leading him to confront Pain and bring peace to the devastated Konoha, finally earning the villagers' acknowledgment and admiration.\n",
"\n",
"The Fourth Great Ninja War tested Naruto's strength and conviction to their limits. During this cataclysmic conflict, he confronted the harsh truths of his heritage, had a heart-touching conversation with his resurrected mother, Kushina, and fought alongside his father, Minato, the Fourth Hokage. His ultimate clash with Sasuke, a final, world-altering battle at the Valley of the End, brought their complex relationship to a poignant resolution. Through relentless effort, unwavering belief in his friends, and an extraordinary capacity for empathy that allowed him to change even the hearts of his enemies, Naruto eventually achieved his childhood dream. He became the Seventh Hokage, the revered protector and hero of Konohagakure, guiding a new generation and finally fulfilling his promise to himself and his village.\n",
"\"\"\"\n",
"\n",
"# Generate the summary\n",
"summary_default = summarizer(naruto_story)\n",
"\n",
"# Print the result\n",
"print(\"--- Original Story ---\")\n",
"print(naruto_story)\n",
"print(\"\\n--- Default Summarizer Output ---\")\n",
"print(summary_default[0]['summary_text'])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "y9Qt8LvhUUIR",
"outputId": "50f7993d-24d1-440c-be66-6c97b79c7d12"
},
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).\n",
"Using a pipeline without specifying a model name and revision in production is not recommended.\n",
"Device set to use cpu\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"The default summarization model loaded is: sshleifer/distilbart-cnn-12-6\n",
"--- Original Story ---\n",
"\n",
"Born an orphan into the Hidden Leaf Village, Naruto Uzumaki's early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him. This secret led to him being ostracized and feared by the villagers, forcing a young Naruto to desperately seek attention and validation through pranks and a boisterous personality. His unwavering dream, however, was to become the Hokage, the village leader, a path he believed would finally earn him the respect and love he craved.\n",
"\n",
"His journey began with humble ninja training, forming Team 7 with the aloof Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake. Early missions, like confronting Zabuza and Haku in the Land of Waves, forged bonds and revealed Naruto's hidden potential and fierce loyalty. As he grew, he faced numerous personal and global conflicts. The heart-wrenching pursuit of Sasuke, driven by revenge and Orochimaru's manipulation, became a central struggle, pushing Naruto to immense power, including mastering the Rasengan under the tutelage of his beloved mentor, Jiraiya. Jiraiya's tragic death at the hands of Pain, a former student, was a profound blow, yet it fueled Naruto's resolve, leading him to confront Pain and bring peace to the devastated Konoha, finally earning the villagers' acknowledgment and admiration.\n",
"\n",
"The Fourth Great Ninja War tested Naruto's strength and conviction to their limits. During this cataclysmic conflict, he confronted the harsh truths of his heritage, had a heart-touching conversation with his resurrected mother, Kushina, and fought alongside his father, Minato, the Fourth Hokage. His ultimate clash with Sasuke, a final, world-altering battle at the Valley of the End, brought their complex relationship to a poignant resolution. Through relentless effort, unwavering belief in his friends, and an extraordinary capacity for empathy that allowed him to change even the hearts of his enemies, Naruto eventually achieved his childhood dream. He became the Seventh Hokage, the revered protector and hero of Konohagakure, guiding a new generation and finally fulfilling his promise to himself and his village.\n",
"\n",
"\n",
"--- Default Summarizer Output ---\n",
" Naruto Uzumaki was born an orphan into the Hidden Leaf Village . His early life was shadowed by the terrifying Nine-Tailed Fox, a monstrous beast sealed within him . His unwavering dream was to become the Hokage, the village leader, a path he believed would earn him the respect and love he craved .\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### 💡 Observation 1: Default Summarization Performance\n",
"\n",
"The default summarization pipeline (which internally uses a model like `sshleifer/distilbart-cnn-12-6`) produced a very concise summary.\n",
"\n",
"**Key observations:**\n",
"\n",
"* **Extreme Conciseness:** The model aggressively condensed the input, focusing on the absolute core narrative: Naruto's origin as an orphan with the Nine-Tails, his dream of becoming Hokage, and his eventual achievement of that goal.\n",
"* **Sensitivity to Initial Text / Abstractive & Title-Oriented:** Interestingly, when the initial descriptive line \"Naruto Uzumaki: From Outcast to Hokage\" was included at the very beginning of the input, the summary referred to the protagonist as \"The Seventh Hokage\" and omitted his name \"Naruto\". However, upon removing this initial line, the model *did* use \"Naruto\" by name. This suggests that the model gives significant weight to prominently placed introductory phrases or titles, using them to synthesize the primary identity of the subject. It prioritizes the *outcome* or *role* (Hokage) as the most salient identifier when provided with such a strong initial clue, aiming for maximum information density in a highly compressed output.\n",
"* **Information Omission:** Crucially, many significant details and character names (like Sasuke, Jiraiya, Sakura, Kakashi, Pain, the Great Ninja War, his parents) were entirely omitted. This is a direct consequence of the model's design for highly compressed summaries and its internal understanding of what constitutes \"essential\" information. While accurate, it lacks the richness of the original narrative.\n",
"\n",
"This initial test provides a valuable baseline, showing the model's ability to grasp the main arc of a complex story even without explicit parameters. However, it also highlights the need to control output length and consider task-specific fine-tuned models for richer, more detailed summaries, and how even subtle input formatting can influence the summary's focus."
],
"metadata": {
"id": "fx89eKESjmef"
}
},
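    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The title-sensitivity finding above is easy to re-check: the sketch below runs the same default pipeline on the story with and without a leading title line. It assumes `summarizer` and `naruto_story` from the earlier cell are still in memory; the title string and the variable names are illustrative."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch: probe how a prominent leading title line shifts the default summary.\n",
        "# Assumes `summarizer` (default pipeline) and `naruto_story` are defined in the cells above.\n",
        "title_line = \"Naruto Uzumaki: From Outcast to Hokage\"\n",
        "\n",
        "summary_without_title = summarizer(naruto_story)\n",
        "summary_with_title = summarizer(title_line + \"\\n\\n\" + naruto_story)\n",
        "\n",
        "print(\"--- Without leading title ---\")\n",
        "print(summary_without_title[0]['summary_text'])\n",
        "print(\"\\n--- With leading title ---\")\n",
        "print(summary_with_title[0]['summary_text'])"
      ]
    },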
{
"cell_type": "markdown",
"source": [
"---\n",
"### 📝 Summarization Experiment 2: Fine-Tuned Model with Parameters (English)\n",
"\n",
"Following our baseline test with the default summarization pipeline, we now shift our focus to a model specifically fine-tuned for text summarization: `Falconsai/text_summarization`. This model has demonstrated a stronger ability to capture and retain more granular details from narrative content compared to the default, making it a promising candidate for our English story. We will also explicitly set `max_length` and `min_length` parameters to gain more control over the summary's output size, aiming for a richer, yet still concise, summary.\n",
"\n",
"---"
],
"metadata": {
"id": "nn3wUnWTrIhq"
}
},
{
"cell_type": "code",
"source": [
"# Load the fine-tuned summarization model\n",
"summarizer = pipeline(\"summarization\", model=\"Falconsai/text_summarization\")\n",
"\n",
"# Experiment with increased max_length to get more detail\n",
"summarizer = summarizer(naruto_story, max_length=562, min_length=100, do_sample=False)\n",
"\n",
"print(\"\\n--- Fine-Tuned model on English Naruto Story ---\")\n",
"print(summarizer[0]['summary_text'])"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JRb-11_krbJD",
"outputId": "a84734ca-1bc1-438e-865b-e5e88795624f"
},
"execution_count": 41,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Device set to use cpu\n",
"Token indices sequence length is longer than the specified maximum sequence length for this model (562 > 512). Running this sequence through the model will result in indexing errors\n",
"Both `max_new_tokens` (=256) and `max_length`(=562) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
"--- Fine-Tuned model on English Naruto Story ---\n",
"Naruto Uzumaki was born an orphan into the Hidden Leaf Village . He became the Hokage, the revered protector and hero of Konohagakure . Through humble ninja training, he formed Team 7 with Sasuke Uchiha, his rival and eventual best friend, and the intelligent Sakura Haruno, under the guidance of the enigmatic Kakashi Hatake . As he grew, his journey forged bonds and revealed his hidden potential and fierce\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### 💡 Observation 2: Performance of Specific Summarization Models (English Narrative)\n",
"\n",
"This section details the comparative performance of various summarization models on our English Naruto story, building upon the baseline established by the default pipeline. We aimed to identify models that offer a better balance of conciseness and detail, and that accurately capture the narrative's essence.\n",
"\n",
"Here's what we observed from the models tested:\n",
"\n",
"* **`facebook/bart-large-cnn`:**\n",
" * This model, a larger version of the default `distilbart`, produced a more verbose and generally coherent summary than the default. It successfully incorporated the protagonist's name, \"Naruto Uzumaki,\" right from the start.\n",
" * **However, a critical issue emerged: the model exhibited a factual inaccuracy by stating Naruto was the \"daughter of Kushina.\"** This is a prime example of \"hallucination,\" where an abstractive summarization model generates plausible-sounding but factually incorrect information. While generally powerful, this specific misattribution highlights the challenge of ensuring complete factual faithfulness in generated text, especially with fictional narratives which might not align perfectly with its general news-based training.\n",
"\n",
"* **`csebuetnlp/mT5_multilingual_XLSum`:**\n",
" * Despite its multilingual capabilities, this model performed poorly on the English Naruto story. **The output was largely \"made up,\" fabricating details not present in the original text** (e.g., \"northern Japanese village of Konoha in July,\" \"BBC's Nicholas Barber\").\n",
" * This severe hallucination and contextual irrelevance likely stem from a **domain mismatch**. The `XLSum` dataset, on which this model is fine-tuned, is predominantly composed of news articles. Consequently, the model attempted to summarize our fictional narrative as if it were a news report, imposing structures and factual elements characteristic of news. This strongly reinforces the importance of selecting models whose training data aligns with the domain of your input text. For this reason, we decided not to proceed further with this model for English narrative summarization.\n",
"\n",
"* **`Falconsai/text_summarization`:**\n",
" * This model, when given sufficient `max_length` and `min_length` parameters (`max_length=562, min_length=100, do_sample=False`), provided a very strong and detailed summary. It effectively included multiple key characters (Sasuke Uchiha, Sakura Haruno, Kakashi Hatake) and plot points (Team 7 formation, the pursuit of Sasuke) that were largely omitted by the more concise default model.\n",
" * While the summary sometimes appeared \"incomplete\" at the very end (\"...and fierce\"), this was a direct result of hitting the `max_length` limit mid-sentence, a common behavior when forcing longer outputs. By adjusting `max_length` further, one could likely mitigate this.\n",
"\n",
"**Conclusion on English Summarization Models:**\n",
"\n",
"Based on these experiments for English narrative summarization:\n",
"\n",
"* The **default `sshleifer/distilbart-cnn-12-6`** proved to be reliable for concise summaries, albeit with less detail.\n",
"* **`Falconsai/text_summarization`** stands out as the best performer for generating more comprehensive and accurate summaries of narrative content, successfully incorporating a richer set of details and character names. Its ability to summarize story elements more effectively makes it our preferred choice for this specific task.\n",
"\n",
"It's important to acknowledge that the landscape of pre-trained models on Hugging Face is constantly evolving. There are always new and potentially better models being released. Our observations are based on the models tested on **July 13, 2025**, and future models or different parameter configurations might yield even superior results. However, for the scope of this deep dive, `Falconsai/text_summarization` provides the most compelling performance for English narrative summarization.\n",
"\n",
"---\n"
],
"metadata": {
"id": "_OEU5Tpzrxcb"
}
},
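    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a reference for the comparison above, the sketch below loops over the three candidate checkpoints and summarizes the same story with identical length settings. The model IDs are the ones discussed in this observation; the loop, its variable names, and the length values are illustrative, and downloading all three checkpoints can take a while."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch: run the candidate English summarizers on the same input for a side-by-side look.\n",
        "# Assumes `naruto_story` is defined above; the length values here are illustrative.\n",
        "from transformers import pipeline\n",
        "\n",
        "candidate_models = [\n",
        "    \"sshleifer/distilbart-cnn-12-6\",   # the default pipeline's model\n",
        "    \"facebook/bart-large-cnn\",\n",
        "    \"Falconsai/text_summarization\",\n",
        "]\n",
        "\n",
        "for model_id in candidate_models:\n",
        "    candidate = pipeline(\"summarization\", model=model_id)\n",
        "    result = candidate(naruto_story, max_length=200, min_length=60, do_sample=False)\n",
        "    print(f\"\\n--- {model_id} ---\")\n",
        "    print(result[0]['summary_text'])"
      ]
    },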
{
"cell_type": "markdown",
"source": [
"### 📝 Summarization Experiment 3: Fine-Tuned Model with Arabic Narrative (Luffy Story)\n",
"\n",
"Having evaluated English summarization, we now pivot to a crucial challenge: summarizing Arabic narrative text. This requires models specifically trained on Arabic data. We will test `csebuetnlp/mT5_multilingual_XLSum`, a widely used multilingual model. Our aim is to assess how well it handles Arabic content, retains key details from a fictional story about Monkey D. Luffy, and produces coherent summaries in Modern Standard Arabic. We will also observe its response to `max_length` and `min_length` parameters, as we suspect some models have an inherent bias towards brevity."
],
"metadata": {
"id": "tVoGjFEDGIAJ"
}
},
{
"cell_type": "code",
"source": [
"import re\n",
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
"\n",
"WHITESPACE_HANDLER = lambda k: re.sub('\\s+', ' ', re.sub('\\n+', ' ', k.strip()))\n",
"\n",
"\n",
"model_name = \"csebuetnlp/mT5_multilingual_XLSum\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model = AutoModelForSeq2SeqLM.from_pretrained(model_name)\n",
"\n",
"input_ids = tokenizer(\n",
" [WHITESPACE_HANDLER(luffy_story_arabic)],\n",
" return_tensors=\"pt\",\n",
" padding=\"max_length\",\n",
" truncation=True,\n",
" max_length=512\n",
")[\"input_ids\"]\n",
"\n",
"output_ids = model.generate(\n",
" input_ids=input_ids,\n",
" max_length=512,\n",
" min_length=100,\n",
" no_repeat_ngram_size=2,\n",
" num_beams=4\n",
")[0]\n",
"\n",
"summary = tokenizer.decode(\n",
" output_ids,\n",
" skip_special_tokens=True,\n",
" clean_up_tokenization_spaces=False\n",
")\n",
"\n",
"print(summary)\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "TyuQ5o6d12jk",
"outputId": "9f1245e9-af6f-4693-fa48-a299b3254618"
},
"execution_count": 20,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.11/dist-packages/transformers/convert_slow_tokenizer.py:564: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.\n",
" warnings.warn(\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"يمثل مونكي دي لوفي رمزًا للحرية والإصرار الذي أثر في قلوب الملايين حول العالم، لكنه لم يكن مجرد قرصان عادي بل \"رمزٌ لحرية وإصدار\". الصحفية نيكولاس باربر تلقي الضوء على تأثير هذا الرجل في عالم \"ون بيس\" الخيالي.. . BBC عربي يلتقي فيه الشاب المثير للجدل.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"---\n",
"\n",
"### 💡 Observation 3: Arabic Summarization Challenges\n",
"\n",
"Our journey into Arabic summarization has revealed significant challenges and underscored the importance of model selection, especially when dealing with specific language nuances and content domains. We further investigated the behavior of `csebuetnlp/mT5_multilingual_XLSum` by directly using its `generate` method and explicitly setting `min_length` to encourage a more detailed summary, aiming to overcome its previous brevity.\n",
"\n",
"Here are the key findings from our Arabic summarization tests:\n",
"\n",
"* **`csebuetnlp/mT5_multilingual_XLSum` (Re-tested with `min_length`):**\n",
" * When forced to generate a longer summary by setting `min_length=100` (within `model.generate()`), this model unfortunately also exhibited **hallucination issues**, similar to the other discarded models. It introduced fabricated details such as \"الصحفية نيكولاس باربر تلقي الضوء على تأثير هذا الرجل في عالم 'ون بيس' الخيالي.. . BBC عربي يلتقي فيه الشاب المثير للجدل.\" (Journalist Nicholas Barber sheds light on this man's impact in the fictional world of 'One Piece'... BBC Arabic meets the controversial young man).\n",
" * This clearly demonstrates that its propensity for hallucination is not simply due to brevity, but rather an inherent characteristic when applying it to narrative text, likely stemming from its training on news-focused datasets (XLSum). When pushed for more content, it defaults to generating information aligned with its primary domain.\n",
"\n",
"* **`eslamxm/mt5-base-finetuned-persian-finetuned-persian-arabic` & `ahmeddbahaa/mT5_multilingual_XLSum-finetuned-fa-finetuned-ar` (Models Previously Discarded):**\n",
" * As noted earlier, both of these models also consistently exhibited severe **hallucination issues** by generating factually incorrect or fabricated details (e.g., mentioning specific dates or external figures not present in the original story). This behavior reinforces their unsuitability for tasks demanding factual accuracy, especially outside their likely news-centric training domains.\n",
"\n",
"* **Default and English Models on Arabic Text:**\n",
" * As expected, attempting to use the default `sshleifer/distilbart-cnn-12-6` or `Falconsai/text_summarization` (which are fine-tuned primarily for English) on Arabic text resulted in uninterpretable or garbled outputs, confirming their lack of multilingual capability for Arabic.\n",
"\n",
"**Conclusion on Arabic Summarization Models (Current Date: July 13, 2025):**\n",
"\n",
"Our comprehensive testing of several \"Arabic-supporting\" summarization models reveals a significant challenge: finding a robust, off-the-shelf model capable of performing accurate and detailed **abstractive summarization of Arabic narrative text** without hallucination. All models tested that produced more than a single-sentence summary eventually resorted to generating fabricated information.\n",
"\n",
"This strongly suggests that for nuanced Arabic narrative summarization, relying solely on publicly available pre-trained models from the Hub *at this time* may lead to unreliable results, particularly when seeking detailed and factually faithful summaries. This could be a critical area where custom fine-tuning on a relevant Arabic narrative dataset might be necessary, or where larger, more general Arabic LLMs (used with careful prompting) might offer a solution if they become more accessible for fine-grained control. For the purpose of this deep dive, it highlights a current limitation in the readily available tooling for this specific task.\n",
"\n",
"---\n"
],
"metadata": {
"id": "7W-omkXY7WAA"
}
},
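    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since every Arabic-capable model above eventually fabricated names or outlets, a crude automatic sanity check can help flag suspect summaries. The heuristic below is our own illustration, not part of any model tested: it lists summary tokens that never occur in the source text. Abstractive models paraphrase legitimately, so it over-flags, but invented proper nouns (\"BBC\", \"نيكولاس\") stand out immediately."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Rough hallucination heuristic (illustrative only): list summary tokens absent from the source.\n",
        "# Treat the output as a hint for manual inspection, not as a verdict.\n",
        "import re\n",
        "\n",
        "def unseen_tokens(source, summary):\n",
        "    source_tokens = set(re.findall(r\"\\w+\", source.lower()))\n",
        "    return [tok for tok in re.findall(r\"\\w+\", summary)\n",
        "            if tok.lower() not in source_tokens and len(tok) > 2]\n",
        "\n",
        "# Example usage, assuming `luffy_story_arabic` and `summary` come from the cells above:\n",
        "# print(unseen_tokens(luffy_story_arabic, summary))"
      ]
    },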
{
"cell_type": "markdown",
"source": [
"## 🌐 Machine Translation Deep Dive\n",
"\n",
"Having explored summarization, we now pivot to **Machine Translation (MT)**, a cornerstone of multilingual NLP. Our goal is to assess the capabilities of various pre-trained models on the Hugging Face Hub, focusing on English-to-Arabic and Arabic-to-English translation. A particular emphasis will be placed on understanding how these models handle both **Modern Standard Arabic (MSA - العربية الفصحى)** and the more challenging **Arabic dialects**, along with the common pitfall of \"Franco\" Arabic (romanized Arabic).\n",
"\n",
"We will directly test four prominent models, showcasing their output for both formal and dialectal sentences to highlight their respective strengths and limitations. This direct comparison will provide valuable insights into the current state of Arabic machine translation.\n",
"\n",
"---\n",
"\n",
"### Translation Experiment 1: `facebook/nllb-200-distilled-600M`\n",
"\n",
"This model is part of Meta AI's No Language Left Behind (NLLB) project, designed to provide high-quality translation for 200 languages. It's known for its broad coverage, including support for various Arabic dialects. We'll test its ability to translate both formal and dialectal Arabic to English, paying close attention to its handling of colloquialisms and informal text."
],
"metadata": {
"id": "7_byn_r1BEnc"
}
},
{
"cell_type": "code",
"source": [
"# Code for facebook/nllb-200-distilled-600M will go here\n",
"from transformers import pipeline\n",
"\n",
"# Example of how to use NLLB with specific language codes\n",
"# For Arabic (MSA) to English\n",
"translator_nllb_ara_en = pipeline(\"translation\", model=\"facebook/nllb-200-distilled-600M\", src_lang=\"ara_Arab\", tgt_lang=\"eng_Latn\")\n",
"print(\"--- NLLB (MSA Arabic to English) ---\")\n",
"print(translator_nllb_ara_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
"\n",
"# For Egyptian Arabic to English\n",
"print(\"\\n--- NLLB (Egyptian Arabic to English) ---\")\n",
"print(translator_nllb_ara_en(\"ياسطا انا تعبان\"))\n",
"print(translator_nllb_ara_en(\"هو انت عبيط ياسطا؟\"))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "VAxCarDNB2Bn",
"outputId": "f16c563a-877a-4f5b-b487-c414656df31d"
},
"execution_count": 90,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Device set to use cpu\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"--- NLLB (MSA Arabic to English) ---\n",
"[{'translation_text': 'How are you, my friend?'}]\n",
"\n",
"--- NLLB (Egyptian Arabic to English) ---\n",
"[{'translation_text': \"Yasta, I'm tired of it.\"}]\n",
"[{'translation_text': 'Are you an abject Yasta?'}]\n"
]
}
]
},
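    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The observation section later reports NLLB results in both directions, but the cell above only runs Arabic -> English with the MSA source code. The sketch below covers the English -> MSA direction and an explicitly declared dialect, using the same checkpoint with the FLORES-200 codes `arb_Arab` (Modern Standard Arabic) and `arz_Arab` (Egyptian Arabic); its outputs are not recorded here, so treat it as a starting point rather than a result."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch: the directions discussed in this section that are not shown in the previous cell.\n",
        "# FLORES-200 codes: eng_Latn = English, arb_Arab = Modern Standard Arabic, arz_Arab = Egyptian Arabic.\n",
        "from transformers import pipeline\n",
        "\n",
        "# English -> Modern Standard Arabic\n",
        "translator_nllb_en_ara = pipeline(\n",
        "    \"translation\", model=\"facebook/nllb-200-distilled-600M\",\n",
        "    src_lang=\"eng_Latn\", tgt_lang=\"arb_Arab\",\n",
        ")\n",
        "print(\"--- NLLB (English to MSA Arabic) ---\")\n",
        "print(translator_nllb_en_ara(\"How are you, my friend? I hope you're okay.\"))\n",
        "\n",
        "# Egyptian Arabic -> English, declaring the dialect explicitly via src_lang\n",
        "translator_nllb_arz_en = pipeline(\n",
        "    \"translation\", model=\"facebook/nllb-200-distilled-600M\",\n",
        "    src_lang=\"arz_Arab\", tgt_lang=\"eng_Latn\",\n",
        ")\n",
        "print(\"\\n--- NLLB (Egyptian Arabic to English, src_lang='arz_Arab') ---\")\n",
        "print(translator_nllb_arz_en(\"ياسطا انا تعبان\"))"
      ]
    },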
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### Translation Experiment 2: `Helsinki-NLP/opus-mt-ar-en`\n",
"\n",
"This model is part of the OPUS-MT project, renowned for providing pre-trained models for a vast array of language pairs, often trained on parallel corpora from the OPUS project. This specific model is fine-tuned for Arabic-to-English translation. We will examine its performance on both formal and dialectal Arabic inputs, observing its fluency and accuracy, especially in contrast to NLLB's dialect handling.\n",
"\n",
"---"
],
"metadata": {
"id": "g_dXPawgetut"
}
},
{
"cell_type": "code",
"source": [
"# Code for Helsinki-NLP/opus-mt-ar-en will go here\n",
"\n",
"translator_opus_ar_en = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-ar-en\")\n",
"\n",
"print(\"--- OPUS-MT (Arabic to English) ---\")\n",
"print(translator_opus_ar_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
"print(translator_opus_ar_en(\"ياسطا انا تعبان\"))\n",
"print(translator_opus_ar_en(\"هو انت عبيط ياسطا؟\"))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hdUfD7eDLc1W",
"outputId": "d745ac68-f5de-455e-e83e-6519667f3439"
},
"execution_count": 89,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Device set to use cpu\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"--- OPUS-MT (Arabic to English) ---\n",
"[{'translation_text': 'How you doing, buddy?'}]\n",
"[{'translation_text': \"I'm tired.\"}]\n",
"[{'translation_text': \"You're a jackass, aren't you?\"}]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### Translation Experiment 3: `Helsinki-NLP/opus-mt-en-ar`\n",
"\n",
"Complementing the previous OPUS-MT model, this one specializes in English-to-Arabic translation. We will test its capabilities for translating English sentences into Modern Standard Arabic, with a focus on its accuracy and completeness, noting any instances where it might struggle with specific sentence structures or nuances.\n",
"\n",
"----"
],
"metadata": {
"id": "jJ9EsH-NfwA3"
}
},
{
"cell_type": "code",
"source": [
"# Code for Helsinki-NLP/opus-mt-en-ar will go here\n",
"\n",
"translator_opus_en_ar = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-en-ar\")\n",
"\n",
"print(\"--- OPUS-MT (English to Arabic) ---\")\n",
"print(translator_opus_en_ar(\"How are you, my friend? I hope you're okay.\"))\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qbEN3AyFYQB_",
"outputId": "af234f67-201b-4ed4-9d2a-e11802a5d876"
},
"execution_count": 91,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Device set to use cpu\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"--- OPUS-MT (English to Arabic) ---\n",
"[{'translation_text': '-آمل أنّك بخير .'}]\n"
]
}
]
},
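    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Observation 4 below notes that this model dropped the first sentence of the input entirely. One workaround worth trying (our own suggestion, not something from the model card) is to split the English input into sentences and translate them one at a time, so an omission can affect at most a single sentence."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Sketch of a possible mitigation: translate sentence by sentence instead of all at once.\n",
        "# The naive regex split below is illustrative; a proper sentence splitter (e.g. nltk) is more robust.\n",
        "import re\n",
        "\n",
        "text_en = \"How are you, my friend? I hope you're okay.\"\n",
        "sentences = [s.strip() for s in re.split(r\"(?<=[.!?])\\s+\", text_en) if s.strip()]\n",
        "\n",
        "print(\"--- OPUS-MT (English to Arabic, sentence by sentence) ---\")\n",
        "for sentence in sentences:\n",
        "    print(translator_opus_en_ar(sentence)[0]['translation_text'])"
      ]
    },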
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### Translation Experiment 4: `Helsinki-NLP/opus-mt-mul-en`\n",
"\n",
"This multilingual OPUS-MT model is designed to translate from various source languages (including Arabic) to English. We'll examine its general robustness and compare its performance, particularly on dialectal Arabic, against the dedicated `opus-mt-ar-en` and the NLLB model, to see if its broader multilingual training offers any advantages or different failure modes.\n",
"\n",
"---"
],
"metadata": {
"id": "ZBh43YehgBsA"
}
},
{
"cell_type": "code",
"source": [
"# Code for Helsinki-NLP/opus-mt-mul-en will go here\n",
"\n",
"# For multilingual to English, source language can sometimes be auto-detected or specified\n",
"# Here, we assume it can handle Arabic input.\n",
"translator_opus_mul_en = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-mul-en\")\n",
"\n",
"print(\"--- OPUS-MT (Multilingual to English) ---\")\n",
"print(translator_opus_mul_en(\"كيف حالك يا صديقي؟ أتمنى أن تكون بخير.\"))\n",
"print(translator_opus_mul_en(\"ياسطا انا تعبان\"))\n"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "44YxtgVFMp2W",
"outputId": "91127ea1-da6f-4f78-b913-6e4421199e52"
},
"execution_count": 92,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Device set to use cpu\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"--- OPUS-MT (Multilingual to English) ---\n",
"[{'translation_text': \"How are you, buddy? I hope you're okay.\"}]\n",
"[{'translation_text': \"Oh, my God. I'm an asshole.\"}]\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"---\n",
"\n",
"### 💡 Observation 4: Machine Translation Performance Across Formal and Dialectal Arabic\n",
"\n",
"Our exploration into Machine Translation revealed varying degrees of success across different models, particularly highlighting the persistent challenge of handling Arabic dialects compared to Modern Standard Arabic (MSA).\n",
"\n",
"Here's a summary of our findings for each model:\n",
"\n",
"* **`facebook/nllb-200-distilled-600M`**:\n",
" * **Formal Arabic (AR to EN & EN to AR)**: Performed well, providing accurate and fluent translations for Modern Standard Arabic sentences in both directions, especially when `src_lang` and `tgt_lang` were explicitly set with the correct NLLB language codes (e.g., `ara_Arab`, `eng_Latn`).\n",
" * **Dialectal Arabic (AR to EN)**: Showed a unique and interesting behavior. While it struggled with direct, fluent translations of complex dialectal sentences, it demonstrated an awareness of colloquial terms. For example, \"ياسطا انا تعبان\" (Yasta ana ta'ban - Hey man, I'm tired) was often transliterated as \"Yasta I am tired\" rather than a full English translation. This 'Franco' Arabic (Arabic words written with Latin characters) output, while not a perfect translation, indicates the model's exposure to and recognition of informal, real-world Arabic usage, which is a notable capability. When presented with more complex or highly dialectal phrases, it sometimes struggled to produce coherent translations.\n",
" * **Initial Quirk:** It initially showed a tendency to translate to English by default, even when parameters were set, suggesting that explicit language code usage is crucial for consistent behavior.\n",
"\n",
"* **`Helsinki-NLP/opus-mt-ar-en`**:\n",
" * **Formal Arabic (AR to EN)**: Generally good, producing intelligible translations. However, it sometimes exhibited conciseness, occasionally omitting parts of longer, grammatically correct sentences (e.g., shortening \"How are you, my friend? I hope you're okay.\" to \"How you doing, buddy?\"). This suggests a tendency towards brevity or a potential limitation in capturing full semantic content consistently.\n",
" * **Dialectal Arabic (AR to EN)**: Similar to many models, it struggled significantly with dialectal phrases. While it attempted translations that made some sense (e.g., \"You're a jackass, aren't you?\" for \"هو انت عبيط ياسطا؟\"), it often failed to accurately capture or fully translate highly colloquial words or slang, often opting for more generalized or formal equivalents, if any.\n",
"\n",
"* **`Helsinki-NLP/opus-mt-en-ar`**:\n",
" * **Formal Arabic (EN to AR)**: This model showed a surprising and significant weakness in the English-to-Arabic direction. It notably failed to translate entire parts of formal English sentences (e.g., \"How are you, my friend? I hope you're okay.\" translated to only \"-آمل أنّك بخير .\"), rendering the output incomplete and grammatically incorrect. This makes it unreliable for robust EN-to-AR translation.\n",
"\n",
"* **`Helsinki-NLP/opus-mt-mul-en`**:\n",
" * **Formal Arabic (AR to EN)**: Handled formal Arabic to English correctly, indicating its general multilingual capability for standard languages.\n",
" * **Dialectal Arabic (AR to EN)**: Similar to other non-NLLB models, it largely failed on dialectal Arabic, producing translations that were often unrelated to the original input. Its broader multilingual training did not seem to equip it with a nuanced understanding of Arabic dialects.\n",
"\n",
"**Overall Conclusion on Machine Translation:**\n",
"\n",
"Our tests confirm that while **Modern Standard Arabic (العربية الفصحى) translation is reasonably well-supported by several models** (with `facebook/nllb-200-distilled-600M` and `Helsinki-NLP/opus-mt-ar-en` performing commendably in AR-to-EN, and NLLB being strong in EN-to-AR), **translating Arabic dialects remains a significant challenge for publicly available, general-purpose models.**\n",
"\n",
"The `facebook/nllb-200-distilled-600M` model, despite requiring precise language code specification, emerged as the most promising for its unique (though imperfect) ability to recognize and transliterate certain dialectal terms. This suggests NLLB's broader dataset encompasses more real-world, informal Arabic, setting it apart from the OPUS-MT models that tend to lean heavily on formal language.\n",
"\n",
"For highly accurate and nuanced dialectal Arabic translation, specialized fine-tuning on relevant dialectal datasets or the use of larger, more comprehensively trained multimodal LLMs might be necessary. However, within the confines of readily accessible pre-trained models on the Hugging Face Hub, NLLB stands out for its potential in this complex domain.\n",
"\n",
"---"
],
"metadata": {
"id": "Rr6OT8p9g1n6"
}
}
]
} |