Cedille, a new artificial intelligence created by the digital agency Coteries, based at the EPFL Innovation Park in Lausanne, provides a game-changing solution for French-speaking users.
Generating French content is now easier. Companies that produce French text, and that until now have mostly had access to models trained on English, can now use the largest French-language model to date, publicly accessible in beta at app.cedille.ai.
The model achieves a perplexity score of 4.5, compared with 12.9 for the best publicly available system (GPT-fr). Perplexity is a key measure of next-word prediction quality, where lower is better, so Cedille's score is almost 3 times lower.
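For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is conventionally computed: the exponential of the average negative log-probability a model assigns to each actual next token. The announcement does not say which evaluation text or tokenization produced the scores above, so the function name and the toy probabilities below are illustrative assumptions only.

```python
import math

def perplexity(token_probs):
    """Standard definition: exp of the mean negative log-probability
    the model assigned to each actual next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Toy data (assumption): a model that always gives the correct next
# token a probability of 0.25 is as uncertain as a uniform choice
# among 4 options, so its perplexity is 4.
toy_probs = [0.25] * 10
toy_score = perplexity(toy_probs)

# Ratio of the two scores quoted in the announcement.
ratio = 12.9 / 4.5
print(f"toy perplexity: {toy_score:.2f}, quoted ratio: {ratio:.2f}")
```

A perplexity of 4.5 can therefore be read, loosely, as the model hesitating between about four or five equally likely next words on average.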
The project was launched with the support of the Google TRC programme, and the model was trained for several months on Tensor Processing Units (TPUs), custom chips designed by Google to accelerate artificial intelligence computations. By relying on this infrastructure, the team was able to ensure a neutral ecological footprint for the model training process. This is a major achievement given that such processes require huge amounts of energy and therefore typically produce high carbon emissions.
Cedille builds on the work of the EleutherAI community, a grassroots movement of open source AI researchers. Since Cedille is publicly available, researchers can verify and reproduce the results and experiment with the model as they wish.
"With Cedille we are leveling the playing field for French compared to English language models - with other non-English languages soon to follow! We are able to achieve this feat also thanks to the efforts of the open source community EleutherAI. By releasing our model publicly we're excited to contribute back to the community!"
Martin Müller, Senior Machine Learning Engineer at Coteries
To learn about the world, today's leading AI text generation models, such as GPT-3, are trained on large collections of publicly available internet content. Because this content also contains a good deal of misinformation, sexism and racism, existing models have been shown to pick up the same discriminatory tendencies when generating text.
Coteries has endeavoured to publish a model that is as free of inappropriate content as possible by filtering the data used for Cedille's training. All toxic content as well as low-quality content was removed. This process was made possible by a combination of natural language processing and careful manual examination of data samples.
As a result, Cedille now generates quality texts with a significant 14.7% reduction in toxic content compared to the best existing model so far (GPT-fr).
From enhanced journalism to autocompletion to chatbots, Cedille offers a wide range of potential uses. Coteries offers its model and the skills of its team to create custom applications, representing an excellent opportunity for any company wishing to make the most of artificial intelligence to generate content in French.
“With Cedille, I’m thrilled that we can bring the power of very large language models to French. Now, there’s no more need to train a new model for each specific task: just give Cedille a few examples and the model will follow your lead!”
Florian Laurent, Senior Machine Learning Engineer at Coteries
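The "few examples" approach described in the quote is commonly called few-shot prompting. The sketch below shows one way such a prompt could be assembled as plain text. It does not call Cedille or any real API, and the function name, the "Texte"/"Sentiment" labels and the sample sentences are all hypothetical choices made for illustration.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Texte", output_label="Sentiment"):
    """Assemble a few-shot prompt: worked examples first, then the new
    input with its answer left blank for the model to complete.
    The labels and layout are illustrative assumptions."""
    blocks = [
        f"{input_label}: {text}\n{output_label}: {label}"
        for text, label in examples
    ]
    # The final block ends right after the output label, so a text
    # generation model would continue with its own answer.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

# Hypothetical sentiment-classification examples in French.
examples = [
    ("Ce film était magnifique.", "positif"),
    ("Le service était très lent.", "négatif"),
]
prompt = build_few_shot_prompt(examples, "J'ai adoré ce restaurant.")
print(prompt)
```

The resulting string would be sent to a text generation model as-is; changing the examples changes the task, with no retraining involved.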