# Basic Concepts
text normalization: Regular expressions can be used to specify strings we might want to
extract from a document, e.g. defining strings like $199 or $24.99 for extracting tables
of prices, and more generally for normalizing text.
tokenization: English words are often separated from each other by whitespace, but
whitespace is not always sufficient. New York and rock ’n’ roll are sometimes treated as
single large words despite the fact that they contain spaces, while sometimes we’ll need
to separate I’m into the two words I and am.
lemmatization: the task of determining that two words have the same root, despite their
surface differences. For example, the words sang, sung, and sings are forms of the verb
sing.
lemmatizer: The word sing is the common lemma of these words, and a lemmatizer maps from
all of these inflected forms to sing.
stemming: refers to a simpler version of lemmatization in which we mainly just strip
suffixes from the end of the word.
sentence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
edit distance: measures how similar two strings are based on the number of edits
(insertions, deletions, substitutions) it takes to change one string into the other.
corpus: The corpus can be a single document or a collection, e.g. the collected works of
Shakespeare.
# Regular Expressions
- Plain sequence: /hello/ ---> hello world
- Case matters; disjunction of characters: /[hH]ello/ ---> Hello world.
- Range: /[a-z]/
- Caret (^): immediately after the open square bracket [ it negates the set; in any other position it matches a literal ^
    - [^a] matches any character except 'a'
    - [a^] matches 'a' or a literal '^'
- Optional element: /?/ (the preceding character or nothing), /colou?r/ ---> color / colour
- Kleene * (pronounced "cleeny star"): /a*/ matches zero or more a's; /aa*/ matches one or more a's
- Kleene + abbreviates the one-or-more pattern above: /a+/ matches one or more a's
- The period /./ matches any single character: /beg.n/ ---> begin, begun, beg'n
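These basic patterns can be sanity-checked with Python's re module (a minimal sketch of the examples above):

```python
import re

# Character class with case variants: matches "Hello" or "hello"
assert re.search(r"[hH]ello", "Hello world")

# Optional element: colou?r matches both spellings
assert re.search(r"colou?r", "color") and re.search(r"colou?r", "colour")

# Kleene star vs. plus: a* matches zero or more, aa*/a+ one or more
assert re.fullmatch(r"a*", "")        # zero a's is fine for *
assert re.fullmatch(r"aa*", "aaa")    # aa* requires at least one 'a'
assert not re.fullmatch(r"a+", "")    # + requires at least one

# The period matches any single character
assert re.search(r"beg.n", "begun")
```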
- Anchors:
    - Line boundaries: ^ matches at the start of a line, $ at the end; /^The dog$/ ---> The dog
    - Word boundaries: \b word boundary, \B non-boundary; /\bm/ ---> moon, /\Bon/ ---> noon
        - A word here is a sequence of digits, underscores, or letters
- Pipe symbol |, disjunction: /cat|dog/ matches cat or dog
    - Unlike a character class, which chooses among single characters,
    - | chooses between two whole sequences
- Parentheses adjust precedence: /guppy|ies/ cannot match guppies, hence /gupp(y|ies)/
    - | has the lowest precedence
    - character sequences bind more tightly, so guppy outranks |, and parentheses are needed to regroup
    - parentheses can also scope counters (* + ? etc.)
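The anchor and precedence rules can likewise be checked in Python (a quick sketch):

```python
import re

# ^ and $ anchor the match to line start and end
assert re.search(r"^The dog$", "The dog")

# \b word boundary vs. \B non-boundary
assert re.search(r"\bm", "moon")      # 'm' starts at a word boundary
assert re.search(r"\Bon", "noon")     # 'on' here is not at a boundary

# | has the lowest precedence: guppy|ies is (guppy)|(ies)
assert not re.fullmatch(r"guppy|ies", "guppies")
# Parentheses restore the intended grouping
assert re.fullmatch(r"gupp(y|ies)", "guppies")
assert re.fullmatch(r"gupp(y|ies)", "guppy")
```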
- Operator precedence:
    - Highest: escape \
    - High: parentheses: (), (?:), (?=), []
    - Middle: counters: *, +, ?, {n}, {n,m}, {n,}
    - Low: sequences and anchors (character sequences, boundary symbols)
    - Lowest: disjunction |
- Counters are greedy by default: they match as many characters as possible
- Non-greedy counters match as few characters as possible; append a ? to the counter, e.g. *?, +?
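Greedy vs. non-greedy matching in one small example (a sketch using Python's re):

```python
import re

text = "<p>one</p>"

# Greedy: .* grabs as many characters as possible
assert re.search(r"<.*>", text).group() == "<p>one</p>"

# Non-greedy: .*? stops at the first closing bracket
assert re.search(r"<.*?>", text).group() == "<p>"
```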
- Capture groups and registers:
    - () creates a capture group, e.g. /(\d+)/
    - \<number> refers back to the captured group, e.g. /(\d) \1/ ---> 1 1
    - Non-capturing group (?:), e.g. /(?:hello)/
        - Parentheses have two uses, adjusting precedence and capturing; use (?:) when you want grouping without capturing
- Lookahead assertions look ahead from the current position and are zero-width (the match pointer doesn't advance)
    - (?=) succeeds if its pattern matches ahead, e.g. /(?=Hello)/ ----> Hello World ---> true
    - (?!) succeeds if its pattern does not match ahead, e.g. /(?!Hello)/ ----> Hi, World ----> true
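Groups, backreferences, and lookahead can be sketched in Python as:

```python
import re

# \1 refers back to whatever the first (...) group captured
assert re.search(r"(\d) \1", "1 1")
assert not re.fullmatch(r"(\d) \1", "1 2")

# (?:...) groups without capturing: no register is allocated
m = re.search(r"(?:hello) (\w+)", "hello world")
assert m.group(1) == "world"   # group 1 is (\w+), not (?:hello)

# Lookahead is zero-width: it tests but does not consume characters
assert re.match(r"(?=Hello)Hello", "Hello World")   # positive (?=)
assert re.match(r"(?!Hello)Hi", "Hi, World")        # negative (?!)
```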
# Words
corpus/corpora: a computer-readable collection of text or speech.
utterance: the spoken correlate of a sentence.
disfluency: the hiccups of spoken language; the following sentence shows two kinds.
    I do uh main- mainly business data processing
- fragment: The broken-off word main- is called a fragment.
- filled pause: Words like uh and um are called fillers or filled pauses.
wordform: the full inflected or derived form of the word.
word type: the number of distinct words in a corpus (each word counted once)
word token: the total number of running words (duplicates counted each time)
Herdan's Law/Heaps' Law: |V| = k * N^b, where
- |V| ---- number of word types
- N ---- number of word tokens
- k, b ---- positive constants (with 0 < b < 1)
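To make the type/token distinction concrete (a small illustrative sentence, not a real corpus count):

```python
# Word types vs. tokens: Heaps' law relates them as |V| = k * N**b
text = "They picnicked by the pool, then lay back on the grass"
tokens = text.lower().replace(",", "").split()

N = len(tokens)          # tokens: running words, duplicates counted
V = len(set(tokens))     # types: distinct vocabulary entries

assert N == 11
assert V == 10           # "the" appears twice
```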
# Corpora
Difficulties for speech recognition:
- reduced pronunciation in speech,
- switching between languages,
- disfluency
AAVE: African American Vernacular English
SAE: Standard American English
- talmbout ---- talking about
- iont ---- I don't
code switching: switching between multiple languages within speech
genre: texts also differ by genre
# Text Normalization
# Tokenization
- Penn Treebank tokenization: a standard rule set for tokenizing English
- Case folding: everything is mapped to lower case.
- MaxMatch algorithm: a greedy algorithm for Chinese word segmentation:
    - requires a dictionary
    - split the string into two parts: the first part is a candidate first word, the second is the remainder;
    - greedily start with the longest possible first part;
    - shrink the first part one character at a time, checking whether it is in the dictionary:
        - if it is in the dictionary, take it as the first word, then recurse on the remainder
        - if not, keep shrinking until a dictionary hit or length 0
    - if no prefix is in the dictionary at all, take the first single character as the first word, and run MaxMatch on the rest
# Pseudocode
function MAXMATCH(sentence, dictionary) returns word sequence W
    if sentence is empty
        return empty list
    for i ← length(sentence) downto 1
        firstword = first i chars of sentence
        remainder = rest of sentence
        if InDictionary(firstword, dictionary)
            return list(firstword, MaxMatch(remainder, dictionary))
    # no word was found, so make a one-character word
    firstword = first char of sentence
    remainder = rest of sentence
    return list(firstword, MaxMatch(remainder, dictionary))
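A runnable Python version of the pseudocode (a sketch; `max_match` and the toy dictionary are illustrative). The test string is the textbook's English failure case, where the greedy match picks "canon" instead of "can":

```python
def max_match(sentence, dictionary):
    """Greedy longest-prefix segmentation (MaxMatch)."""
    if not sentence:
        return []
    # Try the longest prefix first, shrinking one character at a time
    for i in range(len(sentence), 0, -1):
        first, rest = sentence[:i], sentence[i:]
        if first in dictionary:
            return [first] + max_match(rest, dictionary)
    # No dictionary word found: emit a single character and continue
    return [sentence[0]] + max_match(sentence[1:], dictionary)

d = {"we", "can", "canon", "only", "see", "a", "short", "distance", "ahead"}
segmented = max_match("wecanonlyseeashortdistanceahead", d)
# "canon" greedily swallows the start of "only", leaving stray "l", "y"
assert segmented == ["we", "canon", "l", "y", "see", "a",
                     "short", "distance", "ahead"]
```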
word error rate: MaxMatch works well for Chinese but poorly for English, so the error rate of a segmentation must be evaluated.
The best current Chinese word segmenters are statistical sequence models trained via supervised machine learning on hand-segmented training sets.
morpheme: the smallest meaning-bearing unit: stems and affixes; e.g. cats contains the two morphemes cat and -s.
Porter stemmer: a stemming algorithm consisting of a series of rewrite rules, e.g.:
- -ational --> -ate (e.g., relational --> relate)
- -sses --> -ss (e.g., grasses --> grass)
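A minimal sketch of how such suffix-rewrite rules could be applied (not the real Porter stemmer, which also conditions each rule on properties of the stem):

```python
# Illustrative rules only: each strips or replaces one suffix,
# tried in order, first match wins.
RULES = [
    ("ational", "ate"),   # relational -> relate
    ("sses", "ss"),       # grasses   -> grass
]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

assert stem("relational") == "relate"
assert stem("grasses") == "grass"
assert stem("cat") == "cat"     # no rule applies
```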
unknown words: words that a system has not seen before.
byte-pair encoding/BPE: an unseen word can be represented by combining subword parts.
- Reference implementation: https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/bpe_toy.py
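Following the linked bpe_toy.py, the merge-learning loop can be sketched as below (the vocabulary and merge count are illustrative; the real implementation also anchors pairs on symbol boundaries when substituting):

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.escape(" ".join(pair))
    merged = "".join(pair)
    return {re.sub(pattern, merged, word): freq for word, freq in vocab.items()}

# Words are space-separated characters plus an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):                        # learn 3 merge operations
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)      # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

# e+s, then es+t, then est+</w> are merged first
assert merges == [("e", "s"), ("es", "t"), ("est", "</w>")]
```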
# Sentence Segmentation
- Punctuation cues, e.g. period (.), question marks (?), exclamation points (!)
    - Drawback: ambiguity; a period (.) can end a sentence but can also mark an abbreviation, e.g. Mr., Inc.
- Binary classifier: built from hand-written rules or machine learning to decide the role of each period (.), typically with the help of an abbreviation dictionary.
- State-of-the-art sentence segmentation is done with machine learning.
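A naive rule-based splitter of the kind described could look like this (a sketch; `ABBREV` is a hypothetical abbreviation dictionary, and the rules are far from complete):

```python
import re

ABBREV = {"Mr.", "Mrs.", "Dr.", "Inc."}   # illustrative abbreviation set

def split_sentences(text):
    """Split after ., ?, ! + whitespace, unless it ends an abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        last_word = text[start:m.end()].split()[-1]
        if last_word in ABBREV:
            continue                       # abbreviation, not a sentence end
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

assert split_sentences("Mr. Smith left. He went home.") == \
    ["Mr. Smith left.", "He went home."]
```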
# Minimum Edit Distance
Measuring how similar two strings are.
minimum edit distance: between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another.
alignment: Given two sequences, an alignment is a correspondence between substrings of the two sequences.
    I N T E * N T I O N
    | | | | | | | | | |
    * E X E C U T I O N
    d s s   i s
Each operation is assigned a cost/weight.
# The Levenshtein Distance
Two versions:
- d, s, i each cost 1: the distance above = 5
- d and i cost 1; s is disallowed, but a substitution can be simulated as one d plus one i, so it costs 2, and the distance above = 8
# The Minimum Edit Distance Algorithm
- dynamic programming: a general idea used by many algorithms: break a large problem into small subproblems, solve the subproblems, and combine their solutions into a solution of the large problem.
    - Viterbi algorithm
    - CKY algorithm
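The DP table can be sketched in Python (`min_edit_distance` is an assumed helper name; unit insert/delete costs, with a configurable substitution cost to cover both Levenshtein variants):

```python
def min_edit_distance(source, target, sub_cost=1):
    """Dynamic-programming edit distance with unit insert/delete costs."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                           # deletions only
    for j in range(1, m + 1):
        D[0][j] = j                           # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # delete
                          D[i][j - 1] + 1,        # insert
                          D[i - 1][j - 1] + sub)  # substitute or copy
    return D[n][m]

# The classic example: both Levenshtein variants
assert min_edit_distance("intention", "execution") == 5
assert min_edit_distance("intention", "execution", sub_cost=2) == 8
```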