# Darija2SQL-3B

## Model Description
Darija2SQL is a specialized Code LLM designed for translating Moroccan Arabic (Darija) natural language questions into SQL queries.
It is based on the Qwen2.5-Coder-3B model and fine-tuned on Dialect2SQL, a dataset of Darija–SQL pairs spanning multiple database schemas and domains.
The goal of this model is to bridge the gap between dialectal Arabic users and database systems, enabling users to query structured data in their native dialect.
## Fine-tuning Procedure
Darija2SQL was fine-tuned using PEFT (Parameter-Efficient Fine-Tuning) methods, specifically LoRA (Low-Rank Adaptation) adapters, to preserve the strong code reasoning capabilities of Qwen2.5-Coder-3B while specializing it for Darija text-to-SQL generation.
Training included both schema comprehension and dialectal normalization steps, ensuring the model understands common Darija variants and their SQL intent.
## Intended Use and Limitations
This model is intended for research and educational purposes, specifically in the areas of:
- Natural Language to SQL generation for dialectal Arabic
- Code model adaptation to low-resource dialects
- Database interaction interfaces in Arabic contexts
While Darija2SQL performs well on the Dialect2SQL dataset, it may not generalize perfectly to unseen schemas or domains without adaptation.
## How to Use

### Example 1: Ecommerce_DB
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("salmane11/Darija2SQL-3B")
model = AutoModelForCausalLM.from_pretrained("salmane11/Darija2SQL-3B").to(device)

# The Darija question below asks: "How many orders were placed by the
# customer who lives in Casablanca?"
input_text = """
CREATE TABLE Products (
product_id number,
product_name varchar,
category varchar,
price real,
stock number
)
CREATE TABLE Orders (
order_id number,
product_id number,
customer_id number,
quantity number,
order_date date
)
CREATE TABLE Customers (
customer_id number,
name varchar,
city varchar
)
-- Using valid SQLite, answer the following question in Darija:
-- شحال من order دار الزبون لي ساكن ف كازا؟
SELECT
"""

# Tokenize once and move the whole BatchEncoding to the device in one call.
encoding = tokenizer(input_text, return_tensors="pt").to(device)
outputs = model.generate(
    **encoding,
    max_new_tokens=256,
    do_sample=True,
    top_k=120,
    top_p=0.95,
)
line = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Keep only the generated SQL query, starting at the seeded "SELECT".
query_beginning = line.find("SELECT")
print(line[query_beginning:])
```
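The prompt format used above (schema DDL, the question as an SQL comment, then a `SELECT` seed) can be assembled with a small helper. `build_prompt` is a hypothetical convenience function for illustration, not part of the released model:

```python
# Hypothetical helper that assembles the prompt format shown above:
# schema DDL, the Darija question as an SQL comment, and a "SELECT" seed
# for the model to complete.
def build_prompt(schema_ddl: str, darija_question: str) -> str:
    return (
        f"{schema_ddl.strip()}\n"
        "-- Using valid SQLite, answer the following question in Darija:\n"
        f"-- {darija_question}\n"
        "SELECT"
    )

prompt = build_prompt(
    "CREATE TABLE Customers (customer_id number, name varchar, city varchar)",
    "شحال من زبون ساكن ف كازا؟",  # "How many customers live in Casablanca?"
)
print(prompt)
```

The resulting string can be passed to the tokenizer exactly as `input_text` is in the example above.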
## Cite our work
```bibtex
@inproceedings{chafik2025dialect2sql,
  title={Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija},
  author={Chafik, Salmane and Ezzini, Saad and Berrada, Ismail},
  booktitle={Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)},
  pages={86--92},
  year={2025}
}
```