Machine Learning and Arabic Language Processing Research Group

About

Background

Arabic is a Semitic language with complex phonological, morphological, syntactic, and topological characteristics, all of which make it difficult to process computationally. Arabic also has a very rich vocabulary and far more morphological inflections than English. For example, while English has nearly 50 part-of-speech (POS) tags, the Modern Standard Arabic (MSA) POS tag set exceeds 300,000 tags. In addition, Arabic employs diacritics and has several exceptions to its rules. These challenges offer many opportunities for study and investigation.

Arabic exists in two main forms: Modern Standard Arabic (MSA) and colloquial Arabic dialects (CAD). CAD have become important because of the proliferation of social networks, which has resulted in vast amounts of unstructured dialectal text on the web. Despite the strong rise of MSA and CAD content on the web, little research has been done to extract such content. Additional efforts are needed to enrich Arabic datasets and corpora in both written and spoken forms. This is a prerequisite for solid research in document and information retrieval and analysis, language processing, sentiment analysis, speech recognition, and machine learning. A critical area of research for the proposed group is the creation of such MSA and CAD content and making it accessible to other researchers in the field, so that the group makes a concrete contribution and meets the expectations set for it.

Arabic Language Processing (ALP) is a subfield that combines research from computational linguistics and artificial intelligence. It aims to facilitate interaction between computers and humans in the Arabic language or its colloquial varieties. Challenges in ALP include syntactic parsing and tagging, morphological analysis and disambiguation, the construction of MSA and CAD corpora, sentiment analysis, speech recognition, unsupervised and supervised machine learning from corpora, chatbots and their applications, and search and information retrieval. The projects that we anticipate starting include the construction of MSA and CAD repositories, sentiment analysis of MSA and CAD texts, and Arabic speech recognition and understanding in MSA and Emirati CAD.

In the field of speech processing and recognition, we are interested in studying and analyzing problems such as speech recognition, speaker recognition (identification and verification), talking condition recognition, gender recognition, accent recognition, language recognition, and recognition in abnormal talking environments. Abnormal talking environments comprise stressful talking environments and emotional talking environments. Specifically, we need to study and analyze an Arabic Emirati-accented speech database in both stressful and emotional talking environments for different applications.

In the field of computational linguistics, the overall language system consists of subsystems such as phonology, morphology, syntax, semantics, and pragmatics. Rules of language use are best studied from a large corpus that can be analyzed with computer software. Such software can also be used to compile dictionaries of Arabic collocations, which are of paramount importance to language learning, translation, teaching Arabic to non-Arabic speakers, and writing textbooks for both Arab and foreign learners of Arabic. Building a large Arabic corpus facilitates research on the frequency of particular lexical items and language structures. Such a project can also be helpful in developing machine interpretation and translation.
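As a rough illustration of the corpus analysis described above, the following Python sketch counts word frequencies and adjacent word pairs (simple collocation candidates) in a handful of Arabic sentences. It is only a minimal sketch under stated assumptions, not the group's tooling: the function names, the regular-expression tokenizer, and the `min_count` threshold are all hypothetical choices made for this example, and a real project would rely on a proper morphological analyzer and a statistical association measure rather than raw counts.

```python
import re
from collections import Counter

# Assumption: diacritics (U+064B..U+0652) and tatweel (U+0640) are removed
# so that vocalized and unvocalized spellings of a word are counted together.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Assumption: a token is any run of basic Arabic letters (U+0621..U+064A).
TOKEN = re.compile(r"[\u0621-\u064A]+")


def tokenize(text):
    """Strip diacritics, then split the text into Arabic-letter tokens."""
    return TOKEN.findall(DIACRITICS.sub("", text))


def word_frequencies(sentences):
    """Count how often each token occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def bigram_counts(sentences):
    """Count adjacent token pairs within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def collocation_candidates(sentences, min_count=2):
    """Return (pair, count) for pairs seen at least min_count times, most frequent first."""
    ranked = bigram_counts(sentences).most_common()
    return [(pair, n) for pair, n in ranked if n >= min_count]


if __name__ == "__main__":
    # Toy corpus invented for this example.
    corpus = [
        "اللغة العربية جميلة",
        "تعلم اللغة العربية مفيد",
        "اللغة العربية لغة سامية",
    ]
    print(word_frequencies(corpus).most_common(3))
    print(collocation_candidates(corpus))
```

Raw pair counts favor very frequent words, so a fuller treatment would typically rank candidates with an association score such as pointwise mutual information and would operate on lemmas rather than surface forms.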

Machine learning (ML) is currently a very active area of research with wide-ranging applications. In our group, we shall attempt to initiate state-of-the-art research that combines deep linguistic modeling and data analysis with machine learning and deep learning approaches to process the Arabic language. This is a promising research area with great potential.

We aim to develop local expertise and promote awareness of the importance of ALP and ML in the community at large. The members of the group shall embark on interdisciplinary, collaborative research projects with external entities that share an interest in Arabic computing.