{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# 🧪 **Day 02 – Sentiment Analysis & Zero-Shot Beyond Default with Hugging Face 🤗**\n", "\n", "This notebook contains all the code experiments for **Day 2** of my *30 Days of GenAI* challenge.\n", "\n", "For detailed commentary and discoveries, see 👉 [Day 2 Log](https://huggingface.co/Musno/30-days-of-genai/blob/main/logs/day2.md)\n", "\n", "\n", "---\n", "\n", "## 📌 What’s Covered Today\n", "\n", "- 🔍 Comparing the **default Hugging Face sentiment pipeline** with fine-tuned Arabic models\n", "- 🧪 Testing **multiple Arabic sentiment models**, including dialect support\n", "- 🏆 Identifying the most accurate model for Arabic sentiment tasks\n", "- 🌍 Exploring **zero-shot classification** in multilingual and cross-lingual settings\n", "- 🧠 Evaluating how different models handle **Arabic inputs**, **mixed label languages**, and **right-to-left (RTL)** alignment issues\n", "- ✅ Highlighting top-performing models for real-world, multi-language use cases\n", "\n", "Let’s dive in and benchmark some models! 🚀\n", "\n", "---\n" ], "metadata": { "id": "TtAUIDqpypSD" } }, { "cell_type": "code", "execution_count": 60, "metadata": { "id": "MscSJWgamgTn" }, "outputs": [], "source": [ "from transformers import pipeline" ] }, { "cell_type": "markdown", "source": [ "### 🥇 Best Arabic Sentiment Model – `CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment`\n", "\n", "After testing multiple Arabic sentiment models, this one stood out with excellent accuracy on both **Modern Standard Arabic** and **dialects** (like Egyptian).\n", "\n", "For clarity, only the top-performing model is included here. Others showed noticeably lower accuracy or poor dialect support.\n", "\n", "Let's load it and run a quick test. 
🧪👇\n" ], "metadata": { "id": "3Vmwaa2ipDDz" } }, { "cell_type": "code", "source": [ "classifier = pipeline(\"sentiment-analysis\", model=\"CAMeL-Lab/bert-base-arabic-camelbert-mix-sentiment\")\n", "\n", "english = classifier(\"I love you\")  # English (outside the model's target language)\n", "arabic = classifier(\"أنا بحبك\")  # dialectal Arabic: \"I love you\"\n", "arabic_dialect = classifier(\"الواد سواق التوك توك جارنا عسل\")  # Egyptian dialect: \"The kid who drives the tuk-tuk, our neighbor, is a sweetheart\"\n", "arabic_formal = classifier(\"أنا أحبك\")  # Modern Standard Arabic: \"I love you\"\n", "french = classifier(\"je t'aime\")  # French: \"I love you\" (also out of scope for this Arabic model)\n", "\n", "print(english)\n", "print(arabic)\n", "print(arabic_dialect)\n", "print(arabic_formal)\n", "print(french)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "t1Ar2yvFy_rc", "outputId": "fe10b205-271c-4d2f-c07d-157497b5bc1e" }, "execution_count": 65, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Device set to use cpu\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "[{'label': 'positive', 'score': 0.616008996963501}]\n", "[{'label': 'positive', 'score': 0.9781275987625122}]\n", "[{'label': 'positive', 'score': 0.973617434501648}]\n", "[{'label': 'positive', 'score': 0.9768486022949219}]\n", "[{'label': 'positive', 'score': 0.5023519396781921}]\n" ] } ] }, { "cell_type": "markdown", "source": [ "## 🔎 Zero‑Shot Classification with `MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7`\n", "\n", "This multilingual NLI model, fine‑tuned on XNLI and ~2.7M multilingual NLI pairs, performed best across our Day 2 tests. Below we’ll run through five key scenarios to see how well it handles different input/label language pairings, plus mixed‑language labels and RTL alignment.\n", "\n", "---\n", "\n", "### 1️⃣ Arabic Input → Arabic Labels \n", "**Goal:** Check pure Arabic performance. 
\n", "**What to expect:** High confidence (95–99%) on clear MSA and dialectal sentences.\n", "\n", "---" ], "metadata": { "id": "39h20597q8nI" } }, { "cell_type": "code", "source": [ "classifier = pipeline(\"zero-shot-classification\", model=\"MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7\")\n", "classifier(\n", " \"أنا أحب تعلم الذكاء الاصطناعي\",\n", " candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hLh2atgiOlyt", "outputId": "0c1a5be3-e8c9-416d-84fe-f4746c392499" }, "execution_count": 49, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Device set to use cpu\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n", " 'labels': ['تعليم', 'رياضة', 'طعام'],\n", " 'scores': [0.9606460332870483, 0.029150988906621933, 0.010203025303781033]}" ] }, "metadata": {}, "execution_count": 49 } ] }, { "cell_type": "code", "source": [ "output = classifier(\n", " \"أنا أحب تعلم الذكاء الاصطناعي\",\n", " candidate_labels=[\"تعليم\", \"رياضة\", \"طعام\"]\n", ")\n", "for label, score in zip(output['labels'], output['scores']):\n", " print(f\"{label}: {score:.3f}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xc8J2TyFT1XB", "outputId": "4d8ded1e-92b4-454c-d1c9-fc1f583e6b7a" }, "execution_count": 38, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "تعليم: 0.961\n", "رياضة: 0.029\n", "طعام: 0.010\n" ] } ] }, { "cell_type": "markdown", "source": [ "### 2️⃣ Arabic Input → English Labels \n", "**Goal:** See if the model can map Arabic text into English categories. 
\n", "**What to expect:** Strong accuracy (~85–90%), showing true cross‑lingual zero‑shot ability.\n", "\n", "---" ], "metadata": { "id": "BYX0DUYDrLEE" } }, { "cell_type": "code", "source": [ "classifier(\n", " \"أنا أحب تعلم الذكاء الاصطناعي\",\n", " candidate_labels=[\"education\", \"sports\", \"politics\"]\n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZtZ7iKqaT_iS", "outputId": "21d84061-7478-4b01-c68e-5d221f862d7a" }, "execution_count": 39, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n", " 'labels': ['education', 'politics', 'sports'],\n", " 'scores': [0.8637387156486511, 0.07514170557260513, 0.06111961230635643]}" ] }, "metadata": {}, "execution_count": 39 } ] }, { "cell_type": "markdown", "source": [ "### 3️⃣ English Input → English Labels \n", "**Goal:** Benchmark against defaults on an all‑English task. \n", "**What to expect:** Solid English performance (80–85%), exceeding the default pipeline.\n", "\n", "---" ], "metadata": { "id": "CerlGulurSU-" } }, { "cell_type": "code", "source": [ "classifier(\n", " \"I love learning AI\",\n", " candidate_labels=[\"education\", \"sports\", \"food\"]\n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ph3EOpfHUQwk", "outputId": "ab06417b-dead-413e-a35e-c6aca455bd06" }, "execution_count": 50, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'sequence': 'I love learning AI',\n", " 'labels': ['education', 'sports', 'food'],\n", " 'scores': [0.8431005477905273, 0.11592638492584229, 0.04097312316298485]}" ] }, "metadata": {}, "execution_count": 50 } ] }, { "cell_type": "markdown", "source": [ "### 4️⃣ English Input → Arabic Labels \n", "**Goal:** Reverse the second test: English text, Arabic label set. 
\n", "**What to expect:** Reliable mapping (~80–85%), far above the default’s ~30%.\n", "\n", "---" ], "metadata": { "id": "VVY4pwQHrcCY" } }, { "cell_type": "code", "source": [ "classifier(\n", " \"I love learning AI\",\n", " candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"]\n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VAU-udqCUeyD", "outputId": "aebeb7d6-fdaf-46e8-adfe-89cb6abb7409" }, "execution_count": 41, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'sequence': 'I love learning AI',\n", " 'labels': ['تعليم', 'رياضة', 'طعام'],\n", " 'scores': [0.8412964940071106, 0.09517763555049896, 0.06352593004703522]}" ] }, "metadata": {}, "execution_count": 41 } ] }, { "cell_type": "code", "source": [ "output = classifier(\"I love learning AI\", candidate_labels=[\"طعام\", \"تعليم\", \"رياضة\"])\n", "for label, score in zip(output['labels'], output['scores']):\n", " print(f\"{label}: {score:.3f}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9r0XRnE8UuFH", "outputId": "52960d73-0db9-4a22-cd94-4f362bf47292" }, "execution_count": 42, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "تعليم: 0.841\n", "رياضة: 0.095\n", "طعام: 0.064\n" ] } ] }, { "cell_type": "markdown", "source": [ "### 5️⃣ Mixed Labels (Arabic + English) \n", "**Goal:** Stress‑test the model with a combined RTL/LTR label set. \n", "**What to expect:** \n", "- Correct top‑label selection \n", "- Perfect label ordering (English left, Arabic right) \n", "- No RTL scoring glitches\n", "\n", "---\n", "\n", "Let’s fire off each test! 
🚀" ], "metadata": { "id": "XLRfyrsLr3Vz" } }, { "cell_type": "code", "source": [ "classifier(\n", "    \"أنا أحب تعلم الذكاء الاصطناعي\",  # \"I love learning artificial intelligence\"\n", "    candidate_labels=[\"education\", \"رياضة\", \"طعام\"]  # mixed LTR/RTL: [\"education\", \"sports\", \"food\"]\n", ")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3HoCFTrtVTQc", "outputId": "311fee9a-91de-435f-8047-9cbe14077722" }, "execution_count": 43, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'sequence': 'أنا أحب تعلم الذكاء الاصطناعي',\n", " 'labels': ['education', 'رياضة', 'طعام'],\n", " 'scores': [0.9260158538818359, 0.05480289086699486, 0.019181348383426666]}" ] }, "metadata": {}, "execution_count": 43 } ] } ] }