Can Large Language Models Understand Tunisian Arabic?
Why Can't AI Understand Tunisian Arabic?
Have you ever tried talking to ChatGPT or any AI assistant in Tunisian dialect? If you have, you've probably noticed it struggles: it switches between languages, misunderstands context, or gives responses that feel completely off. This isn't just a small bug. It's a fundamental problem affecting millions of Tunisian speakers.
The Problem
Modern AI models power everything from virtual assistants to translation tools, but they have a blind spot: Tunisian Arabic (Darija). While these systems work well with English, French, or Modern Standard Arabic, they fail when it comes to our everyday language.
Why does this matter?
- 11 million Tunisians can't use AI tools in their native dialect
- Customer service bots in Tunisia often frustrate users instead of helping them
- Social media analysis misses the sentiment and meaning in Tunisian posts
- Educational tools don't support how Tunisians actually communicate
- Cultural expressions and local knowledge get lost in translation
What Makes Tunisian Arabic Special?
Tunisian Arabic isn't just "broken Arabic" or "informal language"—it's a rich dialect with its own rules:
- Code-switching: We naturally mix Arabic, French, and Berber words in one sentence
- Unique expressions: "Aalesh?" (why?), "Barsha" (a lot), and "Famma" (there is) don't translate directly
- Flexible writing: We write in Latin script, Arabic script, or both
- Cultural context: Sarcasm, humor, and local references that AI completely misses
When AI can't understand these patterns, it can't serve Tunisian users properly.
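To make this concrete, here are a few simplified examples. The spellings are illustrative only; real usage varies a lot from writer to writer, which is exactly what trips models up:

```python
# Illustrative examples only -- Tunisian spelling is not standardized, so the
# same sentence can be written many different ways.

# The same greeting in Latin script (Arabizi) and in Arabic script.
# In Arabizi, digits stand in for Arabic letters: 3 for ع and 7 for ح.
greeting_arabizi = "3aslema, chniya a7walek?"   # "Hi, how are you?"
greeting_arabic = "عسلامة، شنية أحوالك؟"

# Code-switching: Tunisian Arabic and French in a single sentence.
mixed = "Nemchi nakhou rendez-vous 3and el tbib"  # "I'll go get an appointment with the doctor"
```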
Our Research
I've been working on this problem by evaluating how well current LLMs handle Tunisian Arabic. The results? Not great. Even the most advanced models struggle with:
- Basic transliteration between Latin and Arabic script
- Understanding code-switched sentences
- Detecting sentiment in Tunisian social media posts
- Translating expressions that carry cultural meaning
The full research paper and findings are available on GitHub.
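To give a flavor of what such an evaluation looks like, here is a minimal sketch of probing a chat model on Tunisian sentiment. It is not the harness used in the paper; the client library, model name, and prompt wording are all placeholders.

```python
# Minimal sketch, not the actual evaluation harness from the paper.
# Assumes the official `openai` Python client and an OPENAI_API_KEY in the
# environment; the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    """Ask the model for a one-word sentiment label for a Tunisian Arabizi sentence."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a sentiment classifier for Tunisian Arabic written in "
                    "Latin script (Arabizi). Answer with one word: positive, negative, or neutral."
                ),
            },
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# "el jaw behi barsha lyoum" roughly means "the weather is really nice today".
print(classify_sentiment("el jaw behi barsha lyoum"))
```

Scoring a model then comes down to comparing answers like this against human labels across the whole dataset.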
The TUNIZI Dataset
To tackle this problem, I created the TUNIZI dataset, a benchmark for evaluating AI models on Tunisian Arabic tasks including transliteration, translation, and sentiment analysis.
But here's the truth: one person's dataset isn't enough.
AI models need massive amounts of diverse data to learn properly. The dataset I built is just a starting point. To make AI truly understand Tunisian Arabic, we need contributions from across Tunisia—different cities, age groups, contexts, and ways of speaking.
How You Can Help
This is a call to action for the Tunisian tech community and anyone who cares about linguistic diversity in AI.
We need your help to grow the TUNIZI dataset:
What We Need
- Tunisian conversations: Natural dialogue in Darija (text messages, social media posts, etc.)
- Translations: Tunisian ↔ English or French pairs
- Sentiment labels: Positive, negative, or neutral Tunisian text
- Transliterations: Same sentences in both Latin and Arabic script
- Regional diversity: Input from different Tunisian cities and regions
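If you want a picture of what a single contribution could look like, here is a hypothetical entry format. The field names are made up for illustration; the actual format is described in the repository's contribution guidelines.

```python
# Hypothetical contribution entry (JSON Lines), for illustration only --
# field names are made up; follow the repository's contribution guidelines
# for the actual format.
import json

entry = {
    "text_arabizi": "famma barsha 7ajet behia fi tounes",
    "text_arabic": "فما برشا حاجات باهية في تونس",
    "translation_en": "There are a lot of nice things in Tunisia",
    "sentiment": "positive",
    "region": "Sousse",
}

with open("contribution.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```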
Why Contribute?
- Representation: Make AI work for Tunisian speakers
- Impact: Enable better AI tools for millions of users
- Preservation: Document and preserve our dialect digitally
- Innovation: Power new applications in Tunisian markets
- Open source: All contributions benefit the research community
How to Contribute
1. Visit the GitHub repository
2. Check the contribution guidelines
3. Submit your data (anonymized and shared with consent)
4. Join the discussion on improving the dataset
Even small contributions matter. A few sentences, some translations, or labeled posts—every bit helps build better AI for Tunisia.
The Bigger Picture
This isn't just about Tunisian Arabic. It's about linguistic justice in AI. If we don't actively work to include low-resource languages and dialects, AI will only serve a privileged few who speak dominant languages.
By building better datasets and models for Tunisian Arabic, we're:
- Making technology more inclusive
- Preserving our linguistic heritage
- Creating economic opportunities in Tunisian markets
- Showing that our dialect matters in the global AI conversation
Join the Movement
AI doesn't have to ignore Tunisian speakers. Together, we can change this.
Check out the research paper to understand the technical details: GitHub Repository
Contribute to the dataset and help AI understand how Tunisians really speak.
Let's build AI that speaks our language. 🇹🇳
---
Have questions or want to collaborate? Feel free to reach out or open an issue on GitHub.