Pretrain your custom BERT model

BERT has become the go-to language model for many natural language processing (NLP) use cases. You can easily load BERT from the HuggingFace transformers library and fine-tune it for a downstream task. But the standard model has limitations in niche domains: because BERT isn't familiar with the problem domain, it often yields suboptimal results.
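As a point of reference, here is a minimal sketch of the usual workflow of loading a pretrained BERT checkpoint from the transformers library and preparing it for a downstream classification task; the model name, number of labels, and sample sentence are illustrative.

```python
# Load a pretrained BERT checkpoint and set it up for sequence
# classification (the downstream-task setup the post refers to).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # e.g. a binary classification task
)

# Tokenize a sample input and run a forward pass.
inputs = tokenizer("How do I configure the appliance?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # -> torch.Size([1, 2])
```

From here, fine-tuning typically means training this model on labeled task data, for example with the library's Trainer API or a standard PyTorch training loop.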

At VMware, we deal with many technical terms (e.g., virtualization), product names (e.g., vSphere, vRealize), and abbreviations (e.g., VCSA, which stands for vCenter Server Appliance) that are not part of BERT's vocabulary. BERT uses WordPiece tokenization to deal with Out-Of-Vocabulary (OOV) terms, but WordPiece tokenization alone cannot solve the issue, since the sub-tokens of these words still lack context and meaning: see Weaknesses of WordPiece Tokenization.
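You can see this effect directly by tokenizing a few domain terms with a standard BERT tokenizer; the exact sub-tokens depend on the pretrained vocabulary, so the snippet below is only a quick way to inspect the behavior.

```python
# Inspect how BERT's WordPiece tokenizer splits terms that are
# missing from its pretrained vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for term in ["virtualization", "vSphere", "vRealize", "VCSA", "Tanzu"]:
    print(term, "->", tokenizer.tokenize(term))

# Terms absent from the vocabulary get broken into several sub-word
# pieces, and those pieces carry little of the original domain meaning.
```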

While a word such as Tanzu has no meaning to the general public, a person familiar with VMware products would know that Tanzu is a suite of products related to Kubernetes. Similarly, BERT trained on standard English datasets (the general public in our analogy) struggles to produce meaningful embeddings for these domain-specific terms.