Pandata Tech scientist on the importance of Arabic data in a future built around artificial intelligence
Nowadays, the text we write online is processed by natural language processing (NLP) models everywhere. Whether on a social media platform like Twitter or Instagram, in a search engine, or through a customer service chatbot, text is collected to train language models so that they can understand users more accurately and improve their experience.
Some common examples of how these models work:
When you interact with a search engine, the model behind it interprets words and phrases to understand the query, then returns results relevant to it. Online retailers use NLP algorithms to determine which products are most likely to be of interest, based on the conversations people are having on social media platforms like Twitter or Instagram. Recommendation systems suggest books, movies, articles and other items based on what we read and what we write in comments and reviews.
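The query-matching idea described above can be sketched in a few lines. The following is a minimal illustration, not any production search engine's method: it ranks a tiny set of hypothetical documents against a query using TF-IDF term weighting, where rarer terms count for more.

```python
import math
from collections import Counter

# Toy document collection (hypothetical product descriptions).
docs = {
    "d1": "arabic language learning book for beginners",
    "d2": "history of the arab world economies",
    "d3": "machine learning with python for data science",
}

def rank(query, docs):
    """Rank documents against a query with simple TF-IDF scoring."""
    tokenized = {d: text.split() for d, text in docs.items()}
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    # Inverse document frequency: rarer terms weigh more.
    idf = {t: math.log(n / df[t]) for t in df}
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        scores[d] = sum(tf[t] * idf.get(t, 0.0) for t in query.split())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank("arabic learning", docs)  # d1 scores highest
```

Real systems layer much more on top (query understanding, synonyms, learned ranking models), but the core intuition of matching weighted terms is the same.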
The Arab world is a growing market. It is home to some of the fastest-growing economies in the world. And as the economies grow, so too does demand for services and products that cater to them – including those reliant on accurate Arabic NLP capabilities.
Hassan Ghalib, Lead Data Scientist at Pandata Tech, a company focused on solving challenging problems and developing high-value-added solutions based on Big Data, Natural Language Processing (NLP), and Machine Learning, shared his thoughts on the challenges in Arabic NLP.
“In the world of AI and machine learning, data is the oil. Good-performing models are trained on datasets that are huge in size and diverse in nature, so that they cover all the aspects and richness of a language. Many novel architectures for language models, such as the Transformer, are only able to produce good metrics if they are trained on the right dataset, because data quality, along with quantity, is the main driver of model performance,” he said.
An accurate language model is one that is trained on unbiased datasets and is aware of the diversity and complexity of multiple dialects, vocabularies and grammar rules. If a language model is instead trained on a dataset that lacks representation of certain Arab regions, its performance could be biased and could offend the cultural values and sentiments of people. For example, a model that predicts whether someone is likely to default on a loan could inadvertently discriminate against people from certain regions or religions if it is trained on data that reflects only one perspective.
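The representation problem can be made concrete with a deliberately simplified sketch (the words and setup are illustrative, not taken from any real system): a sentiment lexicon built only from one dialect's vocabulary simply has no signal for a synonym used in another dialect, so speakers of the unrepresented dialect get degraded results.

```python
# Toy sentiment lexicon built only from Gulf-dialect vocabulary
# (transliterated). "zain" roughly means "good" in Gulf dialects.
positive_lexicon = {"zain"}

def sentiment(text):
    """Return 'positive' if any known positive word appears, else 'unknown'."""
    tokens = set(text.lower().split())
    return "positive" if tokens & positive_lexicon else "unknown"

sentiment("zain")   # covered dialect: detected as positive
sentiment("mnih")   # Levantine word for "good": no coverage, so "unknown"
```

Modern models learn from data rather than hand-built word lists, but the failure mode is the same: vocabulary and usage absent from the training set cannot be handled well at inference time.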
“If we talk about the Arabic language, there are some challenges in Arabic NLP due to the large number of dialects spoken throughout the Arab world, where each dialect has its own unique vocabulary and grammar rules, and due to insufficient datasets. Arabic NLP models trained on such insufficient datasets end up being biased. If we look at the state-of-the-art language models available for other languages, at the top of the list is GPT-3, trained on hundreds of billions of tokens/words with a training dataset of around 45 terabytes. If we have datasets of that scale for Arabic which are truly representative of all dialects spoken in the different Arab regions, then producing a GPT-3 for the Arab world is not too far away,” Ghalib added.
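The token counts Ghalib cites can be approximated at a small scale. The sketch below, an assumption-laden simplification, counts whitespace-delimited tokens across a corpus; real models like GPT-3 use subword tokenizers such as byte-pair encoding, so production counts differ from naive word counts.

```python
# Minimal corpus-size estimate via whitespace tokenization.
# Real pipelines use subword tokenizers (e.g. byte-pair encoding),
# which split words further, so actual token counts are higher.

def count_tokens(texts):
    """Return the total number of whitespace-delimited tokens."""
    return sum(len(t.split()) for t in texts)

corpus = [
    "مرحبا بالعالم",  # "hello world" in Arabic: 2 tokens
    "natural language processing needs large corpora",  # 6 tokens
]
total = count_tokens(corpus)  # 8 tokens for this toy corpus
```

Scaling the same bookkeeping to hundreds of billions of tokens is what turns a corpus into training material for a GPT-3-class model.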
In this technological world, machines are learning just like humans: the more data we give them, the more aware and accurate they become. Qatar can seize this opportunity to produce massive datasets which can be harnessed to build top-notch Arabic NLP models. Doing so will not only preserve the language and values of Qatar in the future tech world but will also make it a pioneer in the region in reaching such a milestone.