Cross-lingual Word Embeddings for Low-resource and Morphologically-rich Languages

Author: Ali Hakimi Parizi
Release: 2021

Despite recent advances in natural language processing, there is still a gap in state-of-the-art methods to address problems related to low-resource and morphologically-rich languages. These methods are data-hungry, and due to the scarcity of training data for low-resource and morphologically-rich languages, developing NLP tools for them is a challenging task. Approaches for forming cross-lingual embeddings and transferring knowledge from a rich- to a low-resource language have emerged to overcome the lack of training data. Although in recent years we have seen major improvements in cross-lingual methods, these methods still have some limitations that have not been addressed properly. An important problem is the out-of-vocabulary word (OOV) problem, i.e., words that occur in a document being processed, but that the model did not observe during training. The OOV problem is more significant in the case of low-resource languages, since there is relatively little training data available for them, and also in the case of morphologically-rich languages, since it is very likely that we do not observe a considerable number of their word forms in the training data. Approaches to learning sub-word embeddings have been proposed to address the OOV problem in monolingual models, but most prior work has not considered sub-word embeddings in cross-lingual models. The hypothesis of this thesis is that it is possible to leverage sub-word information to overcome the OOV problem in low-resource and morphologically-rich languages. This thesis presents a novel bilingual lexicon induction task to demonstrate the effectiveness of sub-word information in the cross-lingual space and how it can be employed to overcome the OOV problem. Moreover, this thesis presents a novel cross-lingual word representation method that incorporates sub-word information during the training process to learn a better cross-lingual shared space and also better represent OOVs in the shared space. 
This method is particularly suitable for low-resource scenarios, a claim supported by a series of experiments on bilingual lexicon induction, monolingual word similarity, and a downstream task, document classification. More specifically, the method's suitability for low-resource languages is shown by conducting bilingual lexicon induction on twelve low-resource and morphologically-rich languages.
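The sub-word idea the abstract builds on can be illustrated with a fastText-style sketch: an OOV word's vector is composed from the vectors of its character n-grams, so even an unseen word form gets a representation. This is a minimal illustration under assumed names and dimensions, not the thesis's actual method.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers, fastText-style."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim=100):
    """Average the vectors of the word's known n-grams; zero vector if none are known.

    `ngram_vectors` is a hypothetical dict mapping n-gram strings to trained vectors.
    """
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)
    return np.mean([ngram_vectors[g] for g in grams], axis=0)
```

In a cross-lingual setting, the same composition can be applied on either side of a shared space, which is what makes sub-word information attractive for representing OOVs of a low-resource language.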

Cross-Lingual Word Embeddings

Author: Anders Søgaard
Publisher: Springer Nature
Total Pages: 120
Release: 2022-05-31
Genre: Computers
ISBN: 3031021711

The majority of natural language processing (NLP) is English language processing, and while there is good language technology support for (standard varieties of) English, support for Albanian, Burmese, or Cebuano--and most other languages--remains limited. Being able to bridge this digital divide is important for scientific and democratic reasons but also represents an enormous growth potential. A key challenge for this to happen is learning to align basic meaning-bearing units of different languages. In this book, the authors survey and discuss recent and historical work on supervised and unsupervised learning of such alignments. Specifically, the book focuses on so-called cross-lingual word embeddings. The survey is intended to be systematic, using consistent notation and putting the available methods in a comparable form, making it easy to compare wildly different approaches. In so doing, the authors establish previously unreported relations between these methods and are able to present a fast-growing literature in a very compact way. Furthermore, the authors discuss how best to evaluate cross-lingual word embedding methods and survey the resources available for students and researchers interested in this topic.
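One widely used supervised alignment family in this literature can be sketched as orthogonal Procrustes: given seed-dictionary embedding matrices X (source) and Y (target), find the orthogonal W minimizing ||XW - Y||_F via an SVD. The data below is synthetic and the sketch is illustrative, not code from the book.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes solution: W = U V^T from the SVD of X^T Y.

    X and Y are (n_pairs, dim) matrices of seed-dictionary embeddings.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))            # source-language seed embeddings
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))  # a "true" rotation, for demo
Y = X @ Q                                    # target-language seed embeddings
W = procrustes_align(X, Y)                   # recovers Q up to float error
```

Restricting W to be orthogonal preserves dot products and distances within the source space, one reason this constraint is popular in supervised mapping methods.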

Cross-Lingual Word Embeddings with Universal Concepts and Their Applications

Author: Pezhman Sheinidashtegol
Release: 2020
Genre: Electronic dissertations

Enormous amounts of data are generated in many languages every day due to our increasing global connectivity. This increases the demand for the ability to read and classify data regardless of language. Word embedding is a popular Natural Language Processing (NLP) strategy that uses language modeling and feature learning to map words to vectors of real numbers. However, these models need a significant amount of annotated data for training. While the availability of labeled data is gradually increasing, most of it is available only in high-resource languages, such as English. Researchers proficient in different sets of languages seek to address new problems with multilingual NLP applications. In this dissertation, I present multiple approaches to generating cross-lingual word embeddings (CWE) using universal concepts (UCs) shared amongst languages, addressing the limitations of existing methods. My work consists of three approaches to building multilingual/bilingual word embeddings. The first approach includes two steps: pre-processing and processing. In the pre-processing step, we build a bilingual corpus containing both languages' knowledge in the form of sentences for the most frequent words in English and their translated pairs in the target language. In this step, knowledge of the source language is shared with the target language, and vice versa, by swapping one word per sentence with its corresponding translation. In the processing step, we use a monolingual embedding estimator to generate the CWE. The second approach generates multilingual word embeddings using UCs and consists of three parts. In part I, we introduce and build UCs using bilingual dictionaries and graph theory, defining words as nodes and translation pairs as edges. In part II, we explain the configuration used for word2vec to generate encoded-word embeddings. Finally, part III decodes the generated embeddings using the UCs.
The final approach utilizes the supervised method of the MUSE project, but with the model trained on our UCs. Finally, we applied our last two proposed methods to several practical NLP applications: document classification, cross-lingual sentiment analysis, and code-switching sentiment analysis. Our proposed methods outperform the state-of-the-art MUSE method on the majority of these applications.
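The graph construction described in part I can be sketched as connected components over a translation graph: words are nodes, dictionary translation pairs are edges, and each connected component becomes one universal concept. This is an illustrative reconstruction with hypothetical names and toy data, not the dissertation's code.

```python
from collections import defaultdict

def universal_concepts(translation_pairs):
    """Group words into universal concepts = connected components over translation edges."""
    adj = defaultdict(set)
    for a, b in translation_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, concepts = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:                      # iterative DFS over one component
            w = stack.pop()
            if w in component:
                continue
            component.add(w)
            stack.extend(adj[w] - component)
        seen |= component
        concepts.append(component)
    return concepts

# Toy bilingual-dictionary edges; language prefixes are illustrative.
pairs = [("en:dog", "de:hund"), ("en:dog", "fa:sag"), ("en:cat", "de:katze")]
concepts = universal_concepts(pairs)      # two concepts: {dog, hund, sag}, {cat, katze}
```

Because translation is treated as a symmetric relation, words linked through a chain of dictionary entries, even across three or more languages, collapse into a single shared concept.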

Embeddings in Natural Language Processing

Author: Mohammad Taher Pilehvar
Publisher: Morgan & Claypool Publishers
Total Pages: 177
Release: 2020-11-13
Genre: Computers
ISBN: 1636390226

Embeddings have undoubtedly been one of the most influential research areas in Natural Language Processing (NLP). Encoding information into a low-dimensional vector representation, which is easily integrable in modern machine learning models, has played a central role in the development of NLP. Embedding techniques initially focused on words, but the attention soon started to shift to other forms: from graph structures, such as knowledge bases, to other types of textual content, such as sentences and documents. This book provides a high-level synthesis of the main embedding techniques in NLP, in the broad sense. The book starts by explaining conventional word vector space models and word embeddings (e.g., Word2Vec and GloVe) and then moves to other types of embeddings, such as word sense, sentence and document, and graph embeddings. The book also provides an overview of recent developments in contextualized representations (e.g., ELMo and BERT) and explains their potential in NLP. Throughout the book, the reader can find both essential information for understanding a certain topic from scratch and a broad overview of the most successful techniques developed in the literature.

Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications

Author: Vinit Kumar Gunjan
Publisher: Springer Nature
Total Pages: 821
Release: 2022-01-10
Genre: Technology & Engineering
ISBN: 9811664072

This book contains original, peer-reviewed research articles from the Second International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications, held on March 28-29, 2021, at CMR Institute of Technology, Hyderabad, Telangana, India. It covers the latest research trends and developments in the areas of machine learning, artificial intelligence, neural networks, cyber-physical systems, and cybernetics, with emphasis on applications in smart cities, the Internet of Things, practical data science, and cognition. The book focuses on the comprehensive tenets of artificial intelligence, machine learning, and deep learning, emphasizing their use in modelling, identification, optimization, prediction, forecasting, and control of future intelligent systems. Unpublished material was solicited, and the accepted contributions present in-depth fundamental research from a methodological and application perspective on artificial intelligence and machine learning approaches and their capabilities in solving a diverse range of problems in industry and real-world applications.

Persian Computational Linguistics and NLP

Author: Katarzyna Marszałek-Kowalewska
Publisher: Walter de Gruyter GmbH & Co KG
Total Pages: 258
Release: 2023-05-22
Genre: Language Arts & Disciplines
ISBN: 3110616718

In this series, Iranian languages and linguistics take centre stage. Each volume is dedicated to a key topic and brings together leading experts from around the globe.

Locative Alternation

Author: Seizi Iwata
Publisher: John Benjamins Publishing
Total Pages: 258
Release: 2008-06-09
Genre: Language Arts & Disciplines
ISBN: 9027291047

The aim of the present volume is two-fold: to give a coherent account of the locative alternation in English, and to develop a constructional theory that overcomes a number of problems in earlier constructional accounts. The lexical-constructional account proposed here is characterized by two main features. On the one hand, it emphasizes the need for a detailed examination of verb meanings. On the other, it introduces lower-level constructions such as verb-class-specific constructions and verb-specific constructions, and makes full use of these lower-level constructions in accounting for alternation phenomena. Rather than being a completely new version of construction grammar, the proposed lexical-constructional account is an automatic consequence of the basic tenet of constructional approaches as being usage-based.

Similar Languages, Varieties, and Dialects

Author: Marcos Zampieri
Publisher: Cambridge University Press
Total Pages: 345
Release: 2021-09-02
Genre: Computers
ISBN: 1108429351

Studying language variation requires comprehensive interdisciplinary knowledge and new computational tools. This essential reference introduces researchers and graduate students in computer science, linguistics, and NLP to the core topics in language variation and the computational methods applied to similar languages, varieties, and dialects.