Dimension reduction by whitening BERT/RoBERTa

deepankar
4 min read · Aug 12, 2021



Since the introduction of BERT as a PLM (Pre-trained Language Model) in late 2018, we have all been trying out different methodologies to use it for NLP downstream tasks like Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), Semantic Textual Similarity (STS), etc.

We often use text embeddings for semantic search, clustering, or paraphrase mining, but BERT out of the box is not well suited for such tasks. Life became easier after the introduction of Sentence-BERT (Nils Reimers and Iryna Gurevych), which enabled us to compute sentence/text embeddings for more than 100 languages.

Source: https://en.wikipedia.org/wiki/Sentence_embedding

My intent in this blog is not to discuss how to use sentence embeddings or how Sentence-BERT works. Rather, I am going to talk about how we can reduce their dimension and show how well the result performs with the SentEval toolkit, which evaluates the quality of sentence embeddings. Based on this we can draw a logical conclusion that even after dimension reduction, the embeddings don't suffer much semantic loss.

Why do we need dimension reduction?

The BERT/RoBERTa base and fine-tuned base models offered by SBERT represent a sentence or paragraph with 768 dimensions. The point to note here is that 768 dimensions is quite large, which not only increases storage cost but also slows down computation and retrieval.

How can we reduce the dimension?

There are multiple ways to reduce the dimension of embeddings; some of the general approaches are PCA, t-SNE, LDA, autoencoders, etc.

Recently, I came across a new methodology that is not only very straightforward but also reduces the dimension of sentence embeddings without losing semantic meaning. The authors of the paper “Whitening Sentence Representations for Better Semantics and Faster Retrieval” used a whitening process, which is a simple and effective post-processing technique.

The idea behind whitening is to remove the underlying correlation in the data and normalize its scale. It makes the features less correlated with each other, less redundant, and more dense. This is usually done after the data has been projected so that it aligns with the concept of interest.

The authors of the paper experimented on BERT base and BERT large and compared the results with other methods:

Table 1: Results without supervision of NLI. We report the Spearman's rank correlation score as ρ×100 between the cosine similarity of sentence embeddings and the gold labels on multiple datasets. ↑ denotes outperformance over the BERT-flow baseline and ↓ denotes underperformance. Ref: arXiv report

How can we do it?

I will show you how to reduce the dimension using the whitening process.

In most scenarios it did not suffer from semantic loss even after reducing the dimension from 768 to 256 or 128; in a few cases it did, but the loss is quite small.

Like the author, I chose “first-last-avg” as the default configuration, which averages the first and the last layers of the model. This configuration achieves better performance than averaging only the last layer.

Let’s see the code now:

The code first computes the average of the first and last layers of the model to get each sentence embedding. It then calls two helper methods which not only enhance the semantic meaning but also reduce the dimension.
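The whole pipeline can be sketched as follows. This is a numpy-only illustration, assuming `hidden_states` per-layer tensors as a transformer would return with `output_hidden_states=True`; random arrays stand in for real model outputs, and the function names are mine:

```python
import numpy as np

def first_last_avg(hidden_states):
    # hidden_states: per-layer outputs for one sentence, each of shape
    # (seq_len, hidden_dim). Average the first and last layers, then
    # mean-pool over tokens to get one sentence vector.
    return ((hidden_states[0] + hidden_states[-1]) / 2.0).mean(axis=0)

def whitening_reduce(embeddings, k=256):
    # Fit the whitening transform on the whole corpus, keep k dimensions.
    mu = embeddings.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(np.cov((embeddings - mu).T))
    # Slice the top-k directions before scaling, so near-zero singular
    # values of a rank-deficient covariance are never inverted.
    W = U[:, :k] @ np.diag(1.0 / np.sqrt(S[:k]))
    return (embeddings - mu) @ W

# Stand-in for model outputs: 13 hidden-state tensors per sentence
# (embedding layer + 12 transformer layers), 10 tokens, 768 dims.
rng = np.random.default_rng(42)
sentence_vecs = np.stack([
    first_last_avg([rng.normal(size=(10, 768)) for _ in range(13)])
    for _ in range(300)
])
reduced = whitening_reduce(sentence_vecs, k=256)
print(reduced.shape)  # (300, 256)
```

With a real model you would replace the random arrays with the hidden states returned for each tokenized sentence; everything downstream is the same.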

Let me show you a few executions using the STS-B dataset; anyone who wants to know more about STS can read this wiki link.

Data format: the STS-B similarity score followed by a pair of sentences.

Output 1 by our method:

STS-B data: 0.750, someone is slicing an onion, someone is carrying a fish.

Similarity score of two sentences 0.4724435126693901
New Dimension is (2, 256)

Output 2 by our method:

STS-B data: 1.714, a man is playing a guitar, a man is playing a trumpet.

Similarity score of two sentences 0.23541956653401658 
New Dimension is (2, 256)

Output 3 by our method:

STS-B data: 4.500, a woman is climbing a cliff, a woman is climbing a rock face.

Similarity score of two sentences 0.9047186432382186 
New Dimension is (2, 256)
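The scores above are plain cosine similarities computed on the whitened 256-dimensional vectors. A minimal sketch of that step (the vectors here are illustrative stand-ins, not real sentence embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative stand-ins for two whitened 256-d sentence vectors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=256), rng.normal(size=256)
score = cosine_similarity(a, b)
```

As the outputs show, similar sentence pairs (high STS-B gold scores) land close to 1.0 and dissimilar pairs fall toward 0.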

SentEval toolkit evaluation on RoBERTa whitening:

I do understand that the authors of the paper wanted to demonstrate how the whitening operation can yield better semantic meaning on pre-trained BERT base and large models. But my intent is to show that not only can we compress the dimension using whitening, it can also boost model performance.

I ran the SentEval toolkit on two different sets of RoBERTa models:

  1. The pre-trained RoBERTa base model.
  2. Models fine-tuned on the NLI dataset.

My SentEval evaluation report on STS dataset:

Table 2: Results with & without supervision of NLI. We report the Spearman’s rank correlation score as ρ×100 between the cosine similarity of sentence embeddings and the gold labels on multiple datasets.

The green box denotes outperformance and the orange box shows a small change. The evaluation scores look very promising to me.

Conclusion:

We can apply the whitening operation not only to BERT but to other transformer models as well; the experimental results indicate that the proposed method is simple but effective on 7 semantic similarity benchmark datasets.

Besides, we also found that even after the dimension reduction operation it can further boost model performance, while naturally optimizing memory storage and accelerating retrieval speed.

References:

  1. Whitening Sentence Representations for Better Semantics and Faster Retrieval
  2. Sentence-BERT
  3. Code referred from. A big thanks for such an informative repo.
  4. SentEval toolkit for sentence embeddings
  5. Why do BERT models perform poorly on sentence embeddings?
  6. Another approach: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach

Please feel free to leave your comments. Thank you!
