Text Augmentation using nlpaug
nlpaug is a Python library for text augmentation, providing character-, word- and sentence-level augmenters.
- Introduction
- The nlpaug Package
- First Attempt
- KeyboardAug
- SpellingAug
- SynonymAug
- WordEmbsAug
- ContextualWordEmbsAug
Introduction
More data can help train models to be more general, with less overfitting. One convenient way to generate additional data is simply to transform the given data into something slightly different, such that it still represents the assigned labels. As well as helping during training, augmentation can also be used when running inference (Test Time Augmentation or TTA).
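As a rough illustration, TTA for text could look like the sketch below. This is a minimal sketch, not a prescribed recipe: model_predict is a hypothetical placeholder for your own inference function, and the augmenter follows the nlpaug interface used later in this notebook.
# Sketch of Test Time Augmentation (TTA): average the model's predictions
# over the original text plus several augmented copies of it.
# model_predict is a hypothetical placeholder, not part of nlpaug.
def tta_predict(text, augmenter, model_predict, n=4):
    predictions = [model_predict(text)]  # always include the original
    for augmented in augmenter.augment(text, n=n):
        predictions.append(model_predict(augmented))
    return sum(predictions) / len(predictions)  # simple mean aggregation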
For an image-based challenge, flips, rotations, etc. can be used to generate a new image that still presents the same class of object as the original. But how do we do this for an NLP challenge?
There is a wide variety of approaches, from character and word substitutions all the way to back-translation: translating the text from the source language to another and then back, to get a sentence with (hopefully) the same meaning but a different structure.
For more info, vad13irt has already posted a great survey here: https://www.kaggle.com/c/feedback-prize-2021/discussion/295277
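As a taste of the heavier end of that spectrum, here is a minimal back-translation sketch using nlpaug's BackTranslationAug (the package is installed in the next section, and the translation models need internet access, so treat this cell as illustrative only). Note that back-translation rewrites whole sentences, so it won't preserve the annotations discussed below.
# Back-translation sketch: English -> German -> English via pretrained
# translation models. Downloading the models requires internet access,
# so this is shown for illustration rather than run in this notebook.
import nlpaug.augmenter.word as naw
back_translation = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en')
print(back_translation.augment("The quick brown fox jumps over the lazy dog."))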
For this notebook, I want to focus on what the application of some of these methods actually looks like for the feedback-prize-2021 dataset. In particular, I'll be looking at methods that can be used while preserving the discourse_start and discourse_end annotations given in train.csv.
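To make that concrete, here is a quick look at those annotations for the essay used throughout this notebook (just a display of the relevant train.csv columns):
# Peek at the annotations we need to preserve for one training essay.
import pandas as pd
train_df = pd.read_csv("../input/feedback-prize-2021/train.csv")
print(train_df.loc[train_df["id"] == "3FF2F530D590",
                   ["discourse_type", "discourse_start", "discourse_end"]])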
The nlpaug Package
I have chosen the nlpaug package, as it seems to have all I could want to experiment with. I could just have used 'pip install nlpaug', but I've installed it from a dataset to allow this notebook to run with internet turned off.
Documentation is available at: https://nlpaug.readthedocs.io/en/latest/
!cp -r ../input/nlpaug-from-github/nlpaug-master ./
!pip install nlpaug-master/
!rm -r nlpaug-master
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws
First Attempt
Let's start by taking an arbitrarily selected training text and augmenting it. For this I've picked a very simple augmentation, KeyboardAug, which uses the adjacency of keys on the keyboard to simulate typos. Alongside the original and augmented text, I'm printing the length of the text run through split() - if that changes, the discourse_start and discourse_end annotations given in train.csv will no longer be valid.
from colorama import Fore
from pathlib import Path
# Set some seeds. We want randomness for real usage, but for this tutorial, determinism helps explain some examples.
import numpy as np
np.random.seed(1000)
import random
random.seed(1000)
base_path = Path("../input/feedback-prize-2021/train")
with open(base_path / "3FF2F530D590.txt") as f:
    sample = f.read()
print(f"Original: {len(sample)}\n{sample}")
aug = nac.KeyboardAug()
augmented_text = aug.augment(sample)
print(f"\nKeyboard augmentation: {len(augmented_text)}\n{augmented_text}")
Not great; several issues are obvious:
- The length of the split text has changed as the original had additional white space for formatting.
- It is hard to see all the differences without highlighting.
- The augmentation adds digits and special characters, which are unlikely to have been present.
- The augmentation can change many characters in one word, making the new word too far from the original.
- The augmented text adds spaces around apostrophes, increasing the split length.
Let's solve these by:
- Stripping the original text.
- Adding a diff viewer to highlight the differences.
- Setting include_numeric=False and include_special_char=False so only letters are substituted.
- Setting aug_char_max=1 so at most one character per word is changed.
- Post-processing the augmented text with: replace(" ' ", "'")
sample = " ".join([x.strip() for x in sample.split()])
def print_and_highlight_diff(orig_text, new_texts):
    """A simple diff viewer for augmented texts."""
    orig_split = orig_text.split()
    print(f"Original: {len(orig_split)}\n{orig_text}\n")
    for new_text in new_texts:
        print(f"Augmented: {len(new_text.split())}")
        for i, word in enumerate(new_text.split()):
            if i < len(orig_split) and word == orig_split[i]:
                print(word, end=" ")
            else:
                print(Fore.RED + word + Fore.RESET, end=" ")
        print()
KeyboardAug
With those fixes in place, here is KeyboardAug again, restricted to letters and to at most one changed character per word:
aug = nac.KeyboardAug(include_numeric=False, include_special_char=False, aug_char_max=1, aug_word_p=0.05)
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)
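Since the whole point is to keep the split length unchanged, a small helper like this hypothetical one can filter out augmented candidates that would invalidate the annotations:
# Hypothetical helper: keep only augmented texts whose whitespace-split
# length matches the original, so the annotations stay valid.
def keep_length_preserving(orig_text, candidates):
    target_len = len(orig_text.split())
    return [c for c in candidates if len(c.split()) == target_len]

safe_texts = keep_length_preserving(sample, augmented_texts)
print(f"{len(safe_texts)} of {len(augmented_texts)} candidates preserve the split length")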
SpellingAug
SpellingAug substitutes words using a dictionary of common spelling mistakes:
aug = naw.SpellingAug()
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)
SynonymAug
SynonymAug swaps words for synonyms, by default drawn from WordNet:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)
WordEmbsAug
WordEmbsAug replaces words with nearby words in a word-embedding space, here GloVe vectors loaded from a dataset:
aug = nawwe.WordEmbsAug(model_type='glove', model_path='../input/glove-embeddings/glove.6B.300d.txt')
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)
ContextualWordEmbsAug
ContextualWordEmbsAug uses a masked language model (here bert-base-cased) to propose replacement words that fit the surrounding context:
aug = nawcwe.ContextualWordEmbsAug(model_path='../input/huggingface-bert-variants/bert-base-cased/bert-base-cased')
augmented_texts = aug.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)
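Finally, the augmenters above can be chained; nlpaug's flow module provides Sequential and Sometimes for this. A minimal sketch combining two of the gentler augmenters:
# Combine augmenters with nlpaug's flow API. Sometimes applies a random
# subset of the listed augmenters on each call.
import nlpaug.flow as naf

combined = naf.Sometimes([
    nac.KeyboardAug(include_numeric=False, include_special_char=False,
                    aug_char_max=1, aug_word_p=0.05),
    naw.SpellingAug(),
])
augmented_texts = combined.augment(sample, n=3)
augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)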