French resources (datasets & models) I developed to empower use cases in French
Loïck BOURDOIS PRO
lbourdois
AI & ML interests
👀
FAT5
Flash Attention T5 (FAT5) models developed when I worked at CATIE (https://hf.co/CATIE-AQ).
French NER
NER models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 170,000 downloads.
- CATIE-AQ/Moderncamembert_3entities: Token Classification • 0.1B • 9 • 1
- CATIE-AQ/NERmemberta-3entities: Token Classification • 0.1B • 33 • 1
- CATIE-AQ/NERmembert-base-3entities: Token Classification • 0.1B • 75 • 2
- CATIE-AQ/NERmembert-large-3entities: Token Classification • 0.3B • 93 • 2
French prompts datasets
French prompts dataset developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 30,000 downloads.
French VQA datasets
VQA datasets I cleaned; each sample has an image, a question and an answer.
Can be used to train VLMs.
French OCR datasets
Datasets I cleaned; each sample has an image, a prompt question (like "transcribe the text in this image") and an answer.
Can be used to train VLMs.
French table-to-text datasets
In 2021, before the release of LoRA, I was interested in prefix-tuning and wanted to apply it to French, so I had to translate table-to-text data.
French Translations
Things I've translated: courses, blog posts, guides. More on my personal blog (https://lbourdois.github.io/blog/).
- Free online AI courses in French: French translations of four AI courses
- lbourdois/en-fr-nyu-dl-course-corpus: 3.13k • 90 • 1
- SSM Blog Posts: blog posts about State Space Models (SSM)
- Guide sur l'évaluation des LLM (guide to LLM evaluation): translation of Clémentine Fourrier's guide
Breton packs
Breton resources (datasets & models) I developed to empower use cases in Breton
French QA
QA models & datasets developed when I worked at CATIE (https://hf.co/CATIE-AQ). Over 150,000 downloads.
French embedding datasets
French datasets for training or evaluating embedding models.
French caption datasets
Datasets I cleaned; each sample has an image, a prompt question (like "describe this image") and an answer.
Can be used to train VLMs.
- lbourdois/caption-maya-multimodal-pretrain-clean: 551k • 455
- CATIE-AQ/caption-vidore-vdsid_french-clean: 5k • 65
- CATIE-AQ/caption-vidore-tabfquad_test_subsampled-clean: 280 • 45
- CATIE-AQ/caption-floschne-xm3600-clean: 8.56k • 30
French retriever datasets
Datasets I cleaned; each sample has an image and a question.
Can be used to train visual retrievers (ColPali and co.).
- CATIE-AQ/retriever-vidore-vdsid_french-clean: 5k • 87
- CATIE-AQ/retriever-vidore-tabfquad_test_subsampled-clean: 280 • 39
- CATIE-AQ/retriever-manu-tabfquad_retrieving-clean: 1.83k • 63
- CATIE-AQ/retriever-princeton-nlp-CharXiv-clean: 1.32k • 34
French audio datasets (pretraining)
Around 117K hours of French audio for research purposes.
French packs
French resources (datasets & models) I developed to empower use cases in French