nlpaug is a very useful NLP data augmentation library that supports dozens of augmentation methods. This article introduces its basic usage.
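Advantages of nlpaug
It supports a large number of augmentation methods, works with nltk and multilingual word embeddings, and supports BERT-based augmentation.
nlpaug augmentation methods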
KeyboardAug
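Supports English. Perturbs the text by replacing characters with keys that sit nearby on the keyboard:
import nlpaug.augmenter.char as nac
text = 'The quick brown fox jumps over the lazy dog.'
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output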
Original:
The quick brown fox jumps over the lazy dog.
Augmented Text:
The quick beIwn fox jKmpQ ive3 the lazy dog.
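The idea behind keyboard-distance perturbation can be sketched in plain Python. This is only a minimal illustration, not nlpaug's implementation: the `NEIGHBORS` map covers just a few keys, and it, the function name `keyboard_perturb`, and the parameters `p` and `seed` are all hypothetical.

```python
import random

# Hypothetical, partial map of QWERTY keys to adjacent keys (assumption:
# nlpaug ships its own, much fuller keyboard model).
NEIGHBORS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "o": "ipl", "u": "yij", "i": "uok", "a": "qsz", "s": "awd",
    "n": "bmh", "m": "nj",
}

def keyboard_perturb(text, p=0.3, seed=0):
    """Replace each mapped character with a neighbouring key with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)

print(keyboard_perturb("The quick brown fox jumps over the lazy dog."))
```

Because every replacement is drawn from the neighbours of the original key, the noise resembles real typing slips rather than arbitrary character noise.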
OcrAug
Supports English. Simulates OCR recognition errors:
text = 'The quick brown fox jumps over the lazy dog.'
aug = nac.OcrAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog.
Augmented Text:
The 9uicr 6kown fox jumps over the 1a2y dog.
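A rough sketch of how OCR-style noise can be produced, assuming a small hand-written confusion table. The names `OCR_CONFUSIONS`, `ocr_noise`, `p`, and `seed` are illustrative and not part of nlpaug, whose own table is likely larger.

```python
import random

# Hypothetical confusion table (assumption): characters an OCR engine
# commonly mixes up. The q, b, l and z entries mirror the sample output above.
OCR_CONFUSIONS = {"q": "9", "b": "6", "l": "1", "z": "2", "o": "0", "s": "5"}

def ocr_noise(text, p=0.5, seed=0):
    """Swap confusable characters for their look-alikes with probability p."""
    rng = random.Random(seed)
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < p else ch
        for ch in text
    )

print(ocr_noise("The quick brown fox jumps over the lazy dog."))
```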
RandomCharAug
Supports Chinese and English. Randomly inserts, deletes, swaps, or substitutes characters:
text = 'The quick brown fox jumps over the lazy dog.'
aug = nac.RandomCharAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog.
Augmented Text:
The quick IrowF fox jZmSs Yv@r the lazy dog.
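The four character-level operations can be illustrated without the library. The sketch below is an assumption-laden toy (the helper `random_char_edit` and its arguments are hypothetical), meant only to show what each action does to a single word.

```python
import random
import string

def random_char_edit(word, action, rng):
    """Apply one character-level edit to a word of length >= 2."""
    i = rng.randrange(len(word))
    if action == "insert":
        return word[:i] + rng.choice(string.ascii_letters) + word[i:]
    if action == "delete":
        return word[:i] + word[i + 1:]
    if action == "substitute":
        return word[:i] + rng.choice(string.ascii_letters) + word[i + 1:]
    if action == "swap":
        j = min(i, len(word) - 2)  # keep j + 1 inside the word
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    raise ValueError(f"unknown action: {action}")

rng = random.Random(0)
for action in ("insert", "delete", "substitute", "swap"):
    print(action, "->", random_char_edit("brown", action, rng))
```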
AntonymAug
Supports English. Uses WordNet to replace words with their antonyms:
import nlpaug.augmenter.word as naw
text = 'very beautiful'
# Requires nltk with the WordNet corpus downloaded
# aug_p=1 asks the augmenter to consider every eligible word
aug = naw.AntonymAug(aug_p=1, lang='eng')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
very beautiful
Augmented Text:
very ugly
ContextualWordEmbsAug
Supports Chinese and English. Uses a BERT model to predict replacement words from context:
text = 'The quick brown fox jumps over the lazy dog.'
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', aug_p=1)
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog.
Augmented Text:
our quick brown fox jumps over the raging waterfall.
RandomWordAug
Supports Chinese and English. Randomly swaps or deletes words:
text = 'The quick brown fox jumps over the lazy dog .'
aug = naw.RandomWordAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The brown fox jumps the lazy.
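Word-level deletion and swapping are simple enough to sketch directly. The helpers below (`random_word_delete`, `random_word_swap`) are hypothetical stand-ins written for illustration, and whitespace splitting is an assumed tokenization. They are not how nlpaug implements these actions.

```python
import random

def random_word_delete(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def random_word_swap(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    swapped = list(tokens)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped

rng = random.Random(0)
tokens = "The quick brown fox jumps over the lazy dog".split()
print(" ".join(random_word_delete(tokens, 0.3, rng)))
print(" ".join(random_word_swap(tokens, rng)))
```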
SpellingAug
Supports Chinese and English. Replaces words with common misspellings:
text = 'The quick brown fox jumps over the lazy dog .'
aug = naw.SpellingAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
Tho quick Brow fox jumps over the last dog.
SplitAug
Supports English. Randomly splits words into two pieces:
text = 'The quick brown fox jumps over the lazy dog .'
aug = naw.SplitAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick bro wn fox jumps ov er the la zy dog.
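Word splitting amounts to inserting a space at a random position inside a word. Here is a minimal sketch under stated assumptions: `split_word` is a hypothetical helper, and the `min_len=4` threshold for leaving short words alone is an assumed default rather than a value taken from the library.

```python
import random

def split_word(word, rng, min_len=4):
    """Insert one space inside words of at least min_len characters."""
    if len(word) < min_len:
        return word
    cut = rng.randrange(1, len(word))  # never cut before the first character
    return word[:cut] + " " + word[cut:]

rng = random.Random(0)
text = "The quick brown fox jumps over the lazy dog"
print(" ".join(split_word(w, rng) for w in text.split()))
```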
SynonymAug
Supports English. Replaces words with synonyms from WordNet:
text = 'The quick brown fox jumps over the lazy dog .'
aug = naw.SynonymAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output
Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The speedy brown dodger jump over the lazy dog.
TfIdfAug
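Supports Chinese and English. Computes TF-IDF statistics and then replaces words with similar ones:
import re
import nlpaug.augmenter.word as naw
import nlpaug.model.word_stats as nmw
# Assumption: common_texts is the small pre-tokenized sample corpus bundled with gensim
from gensim.test.utils import common_texts

def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall(text)

# Train a TF-IDF model on the tokenized corpus and save it to the current directory
tfidf_model = nmw.TfIdf()
tfidf_model.train(common_texts)
tfidf_model.save('.')
# Load TF-IDF augmenter
aug = naw.TfIdfAug(model_path='.', tokenizer=_tokenizer)
text = 'The quick brown fox jumps interface the lazy dog .'
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output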
Original:
The quick brown fox jumps interface the lazy dog .
Augmented Text:
The quick brown fox jumps response the lazy dog
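To make the TF-IDF step less of a black box, the sketch below scores the tokens of one sentence against a toy corpus. It uses the smoothed IDF variant log((1 + N) / (1 + df)) + 1, which is an assumption on my part, so nlpaug's own weighting may differ. The corpus and the helper names `idf_table` and `tfidf_scores` are made up for illustration.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Smoothed inverse document frequency for every token in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def tfidf_scores(tokens, idf):
    """TF-IDF per distinct token of one document; unseen tokens are skipped."""
    tf = Counter(tokens)
    return {tok: (count / len(tokens)) * idf[tok] for tok, count in tf.items() if tok in idf}

# Toy tokenized corpus (made up for illustration).
corpus = [
    ["the", "fox", "jumps"],
    ["the", "dog", "sleeps"],
    ["the", "fox", "sleeps"],
]
idf = idf_table(corpus)
scores = tfidf_scores(["the", "fox", "jumps"], idf)
# Tokens sorted from least to most informative.
print(sorted(scores, key=scores.get))  # → ['the', 'fox', 'jumps']
```

Low-scoring tokens such as "the" carry little information, which is why TF-IDF-based augmenters can treat them differently from rare, informative words.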
WordEmbsAug
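Supports Chinese and English. Replaces words with their neighbors in word-embedding space:
text = 'The quick brown fox jumps over the lazy dog .'
# model_path must point to a locally downloaded GloVe vector file
aug = naw.WordEmbsAug(model_type='glove', model_path='glove.840B.300d.txt')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
# Output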
Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps the their lazy bulldog.
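Finding a "similar word in embedding space" usually means ranking the vocabulary by cosine similarity to the original word. The toy example below assumes made-up 3-dimensional vectors and a hypothetical `nearest` helper. Real GloVe vectors have hundreds of dimensions, and the library may select candidates differently.

```python
import math

# Made-up 3-dimensional vectors; real GloVe vectors have 300 dimensions.
VECTORS = {
    "dog":     [0.9, 0.1, 0.0],
    "bulldog": [0.8, 0.2, 0.1],
    "fox":     [0.6, 0.6, 0.1],
    "lazy":    [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word, top_n=1):
    """Return the top_n vocabulary words closest to `word`, excluding itself."""
    others = [w for w in VECTORS if w != word]
    others.sort(key=lambda w: cosine(VECTORS[word], VECTORS[w]), reverse=True)
    return others[:top_n]

print(nearest("dog"))  # → ['bulldog']
```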
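nlpaug multilingual support
nlpaug lets you plug in a custom tokenizer, and it can also be combined with WordNet, word embeddings, and BERT models for many languages: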
WordNet, Spanish
text = 'Un rápido zorro marrón salta sobre el perro perezoso'
aug = naw.SynonymAug(aug_src='wordnet', lang='spa')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
BERT, Japanese
# Augment Japanese by BERT
aug = naw.ContextualWordEmbsAug(model_path='bert-base-multilingual-uncased', aug_p=0.1)
text = '速い茶色の狐が怠惰な犬を飛び越えます'
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
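The custom tokenization mentioned at the start of this section matters most for languages written without spaces. As a rough sketch, a character-level tokenizer and its inverse could look like the following. Both function names are hypothetical, and per-character tokens are an assumption for illustration, not a recommended segmentation for Japanese. A function like this could be supplied through the `tokenizer` argument used in the TfIdfAug example.

```python
def char_tokenizer(text):
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]

def char_reverse_tokenizer(tokens):
    """Join character tokens back together without separators."""
    return "".join(tokens)

tokens = char_tokenizer("速い茶色の狐")
print(tokens)
print(char_reverse_tokenizer(tokens))
```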
Installing nlpaug
https://github.com/makcedward/nlpaug
Basic installation:
pip install numpy requests nlpaug
Reposted from Coggle数据科学.