Pre-trained Language Models on NUS HPC

Kuang Hao, Research Computing, NUS IT

Introduction

Research into deep learning has been growing at a rapid rate over the years. With more and more researchers stepping into the realm of AI, the need to use popular pre-trained models and packages increases as well.

To reduce repetitive downloads and save storage space, we now host popular pre-trained models in a shared space on NUS HPC. For NLP (Natural Language Processing), we currently have three commonly used pre-trained models in place.

This article introduces the newly installed language models on NUS HPC and demonstrates how to make use of them.

Please note that prior knowledge of Python, NLP and deep learning is required to use the pre-trained models introduced in this article. The demo Jupyter notebook can be found HERE.

Text Vectorisation

Text vectorisation is the first step of an NLP pipeline, where text is transformed into a vector of numbers. It is required since most machine learning algorithms and deep learning architectures cannot process plain text in its raw form. [1] A pre-trained semantic text vectorisation model is a high-dimensional matrix that maps a word, character or punctuation mark to a numeric vector that represents its meaning. On NUS HPC, we currently support two popular sets of semantic word embedding pre-trained weights: GloVe and Word2Vec.

GloVe

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. [2]

To use the pre-trained GloVe weights on NUS HPC, we can follow the steps below:

First, we need to load the GloVe matrix as a dictionary. For demonstration I will use “glove.6B.100d”, which was trained on a text corpus of 6 billion tokens, has a vocabulary of 400 thousand words, and uses 100-dimensional vectors.
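A minimal sketch of this step is shown below; the file path is an assumption and should be replaced with the actual location of the shared GloVe weights on NUS HPC.

import numpy as np

glove_path = "/path/to/shared/glove/glove.6B.100d.txt"  # hypothetical path to the shared weights

embeddings_index = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_index[word] = vector

print("Loaded %d word vectors." % len(embeddings_index))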

The next step is to load the dictionary into an embedding layer.

For TensorFlow, we use the built-in Keras embedding layer. A legacy approach using “tensorflow.nn.embedding_lookup” is also mentioned in the notebook, for users who need to use an older version of TensorFlow:
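The sketch below builds on the GloVe dictionary loaded above; the toy corpus and tokenizer are assumptions for illustration.

import numpy as np
import tensorflow as tf

# Hypothetical toy corpus; in practice word_index comes from your own tokenizer.
texts = ["the quick brown fox", "jumps over the lazy dog"]
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index

embedding_dim = 100
vocab_size = len(word_index) + 1  # index 0 is reserved for padding

# Fill the embedding matrix row by row from the GloVe dictionary loaded earlier.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # out-of-vocabulary words stay all-zero

embedding_layer = tf.keras.layers.Embedding(
    vocab_size,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pre-trained vectors frozen
)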

For PyTorch users, you will need to add two more tokens, <pad> and <unk>, which indicate padding and unknown tokens. In Keras, this step is already built in, so there is no need to code it explicitly. After this, we load the dictionary into an embedding layer; the PyTorch counterpart is “torch.nn.Embedding”:
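A sketch of this step is shown below; the vocabulary construction and the choice to initialise <unk> with the mean GloVe vector are assumptions for illustration.

import numpy as np
import torch
import torch.nn as nn

embedding_dim = 100
words = ["<pad>", "<unk>"] + sorted(embeddings_index.keys())
word_to_idx = {word: i for i, word in enumerate(words)}

# Build the weight matrix: zeros for <pad>, the mean GloVe vector for <unk>.
weights = np.zeros((len(words), embedding_dim), dtype="float32")
weights[1] = np.mean(list(embeddings_index.values()), axis=0)
for word, vector in embeddings_index.items():
    weights[word_to_idx[word]] = vector

embedding_layer = nn.Embedding.from_pretrained(
    torch.from_numpy(weights),
    freeze=True,    # keep the pre-trained vectors fixed
    padding_idx=0,  # index of <pad>
)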

Word2Vec

Word2Vec is a family of model architectures and optimisations that can be used to learn word embeddings from large datasets. Embeddings learned through Word2Vec have proven to be successful on a variety of downstream natural language processing tasks. [3]

For this illustration, we will use Gensim to load the pre-trained Word2Vec weights. Gensim is a free, open-source Python library for text vectorisation. It supports Word2Vec natively, and since we are using its own API instead of an embedding layer, the code is the same for TensorFlow and PyTorch:
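A minimal sketch is shown below; the file path is an assumption and should point to the shared Word2Vec weights on NUS HPC.

from gensim.models import KeyedVectors

# Hypothetical path to the shared Word2Vec binary (e.g. the GoogleNews vectors).
w2v_path = "/path/to/shared/word2vec/GoogleNews-vectors-negative300.bin"
word_vectors = KeyedVectors.load_word2vec_format(w2v_path, binary=True)

# Look up a vector and query similar words.
vector = word_vectors["computer"]  # a 300-dimensional numpy array
print(word_vectors.most_similar("computer", topn=5))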

Transfer Learning in NLP

Since Google released its pre-trained transformer models in 2018, transfer learning has become a major topic in NLP, improving state-of-the-art performance on many tasks. The most widely used transformer model is BERT (Bidirectional Encoder Representations from Transformers). On NUS HPC, we now support BERT locally with its most popular version: BERT-base-uncased.

BERT

BERT is a transformer-based machine learning technique for NLP pre-training, created and published in 2018 by Jacob Devlin and his colleagues at Google. In 2019, Google announced that it had begun leveraging BERT in its search engine, and by late 2020 it was using BERT in almost every English-language query. A 2020 literature survey concluded that “in a little over a year, BERT has become a ubiquitous baseline in NLP experiments”, counting over 150 research publications analyzing and improving the model. [4]

To load the pre-trained BERT, we use the transformers package developed by HuggingFace. We will load BERT for a text classification task; please check their documentation here for other tasks.

The following code loads BERT for a text classification model in TensorFlow:
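A sketch of this step, assuming a binary classification task (num_labels=2), might look like this:

from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # assumption: binary classification
)

# Tokenise a toy sentence and run a forward pass.
inputs = tokenizer("NUS HPC makes deep learning easier.", return_tensors="tf")
outputs = model(inputs)
print(outputs.logits)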

We can print the model structure using Keras summary:
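With the model loaded in the previous sketch, this is simply:

model.summary()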

For PyTorch, the following code does the same job:
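A sketch, again assuming a binary classification task:

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # assumption: binary classification
)

inputs = tokenizer("NUS HPC makes deep learning easier.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)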

To check the model structure, we can just call the printing function:
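With the PyTorch model loaded above:

print(model)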

HPC Support

We have now introduced three popular language models that are pre-installed on NUS HPC and demonstrated how to use them.

The files of these models are shared with all HPC users and are read-only. If you want to edit them, please create a local copy in your home or hpctmp directory and customise it from there. If you want other popular open-source models to be installed on NUS HPC, or if you have any issues using our HPC resources, please contact us via nTouch.

References

[1]. Analytics Vidhya. (2021). “Part 5: Step by Step Guide to Master NLP – Word Embedding and Text Vectorization”. [online] Available at: https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/

[2]. Jeffrey Pennington, Richard Socher, Christopher D. Manning. (2015). “GloVe: Global Vectors for Word Representation”. [online] Available at: https://nlp.stanford.edu/projects/glove/

[3]. Google. (2021). “TensorFlow Documentation”. [online] Available at: https://www.tensorflow.org/tutorials/text/word2vec

[4]. Wikipedia. (2022). “BERT (language model)”. [online] Available at: https://en.wikipedia.org/wiki/BERT_(language_model)
