Learning Word Vector Techniques

Below we cover the two main word-vector models, CBOW and Skip-Gram. For CBOW we use the hierarchical softmax optimization, and for Skip-Gram we use the negative sampling optimization.

CBOW

CBOW stands for Continuous Bag-of-Words Model; it predicts the probability of the current word from its context words.

The objective function is:

\(\iota = \sum_{w\in{C}}^{}{logP(w|Context(w))}\)

CBOW -> Hierarchical Softmax

Huffman Tree

A Huffman tree is the binary tree with the smallest weighted path length, also called the optimal binary tree; it is illustrated in the figure below.

Figure (a) shows an ordinary binary tree; figure (b) shows a Huffman tree.

Weighted path length of figure (a): \(WPL = (5+7+2+13) \times 2 = 54\)

Weighted path length of figure (b): \(WPL = 13 \times 1 + 7 \times 2 + (5+2) \times 3 = 48\)

Figure (b) has the smaller weighted path length, and one can show that it is the Huffman tree (the optimal binary tree).

Construction Idea

Start from an initial forest of words: A occurs 5 times, B 7 times, C 2 times, and D 13 times. In the first merge, combine the two least frequent nodes, A and C; then merge the next smallest, B, with that subtree; finally merge the largest, D, with the result. This yields a Huffman tree.

A Huffman tree can also be used to assign codes:

D:(0),B:(1,0), C:(1,1,0), A:(1,1,1)
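The merge order and the resulting codes can be reproduced with a small Python sketch; the heapq-based helper below is my own illustration (not code from the original post), but with the example counts it yields exactly the codes listed above.

    import heapq
    from itertools import count

    def huffman_codes(freqs):
        """Build a Huffman tree from {symbol: frequency} and return {symbol: code}."""
        tiebreak = count()
        # Heap entries are (weight, tiebreaker, tree); a tree is a symbol or a (left, right) pair.
        heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)            # the two smallest weights...
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))   # ...are merged first
        _, _, root = heap[0]

        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):                # internal node: recurse left (0) and right (1)
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                      # leaf: record the accumulated code
                codes[node] = prefix or "0"
        walk(root, "")
        return codes

    print(huffman_codes({"A": 5, "B": 7, "C": 2, "D": 13}))
    # {'D': '0', 'B': '10', 'C': '110', 'A': '111'}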

Moving to the left or right subtree at each node of the Huffman tree can be modeled as a binary decision with logistic regression.

The input layer consists of the word vectors of the context words; during training these vectors are updated together with the parameters \(\theta\) attached to the inner nodes of the Huffman tree.

The projection layer is simply the sum of the context word vectors.

The output layer is where the interesting work happens, so let us look at it more closely:

Suppose we want to predict the center word 足球 ("football") from its context. At each inner node, the probability of the positive class is:

\(\sigma(x_{w}^{T}\theta) = \frac{1}{1+e^{-x_{w}^{T}\theta}}\)

and the probability of the negative class is:

\(1 - \sigma(x_{w}^{T}\theta)\)

Assume that going to the left subtree is the positive class and going to the right subtree is the negative class. The path to 足球 is then negative, positive, positive, negative:

Step 1: \(P(d_{2}^{w}|x_{w},\theta_{1}^{w}) = 1 - \sigma(x_{w}^{T}\theta_{1}^{w})\)

Step 2: \(P(d_{3}^{w}|x_{w},\theta_{2}^{w}) = \sigma(x_{w}^{T}\theta_{2}^{w})\)

Step 3: \(P(d_{4}^{w}|x_{w},\theta_{3}^{w}) = \sigma(x_{w}^{T}\theta_{3}^{w})\)

Step 4: \(P(d_{5}^{w}|x_{w},\theta_{4}^{w}) = 1 - \sigma(x_{w}^{T}\theta_{4}^{w})\)

\(P(足球|Context(足球)) = \prod_{j=2}^{5}P(d_{j}^{w}|x_{w},\theta_{j-1}^{w})\)
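As a quick check, this product is easy to evaluate numerically. The NumPy sketch below uses made-up vectors and the path codes negative/positive/positive/negative (1/0/0/1); the array names are purely illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    dim = 8
    x_w = rng.normal(size=dim)            # projection-layer vector (sum of the context vectors)
    thetas = rng.normal(size=(4, dim))    # theta_1^w ... theta_4^w on the path to the center word
    d = [1, 0, 0, 1]                      # d_2^w ... d_5^w: 1 = negative branch, 0 = positive branch

    prob = 1.0
    for theta_j, d_j in zip(thetas, d):
        p_pos = sigmoid(x_w @ theta_j)    # probability of the positive branch at this node
        prob *= p_pos if d_j == 0 else (1.0 - p_pos)

    print(prob)                           # P(w | Context(w)) along this Huffman path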

Solving CBOW

Objective function:

\(\begin{align*}&\iota = \sum_{w\in{C}}^{}{log\prod_{j=2}^{l^{w}}}\left\{ \left[ \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]^{1 - d_{j}^{w}} \cdot {\left[ 1 - \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]^{d_{j}^{w}}} \right\}\\
&\Rightarrow \quad \iota = \sum_{w\in{C}}^{}{\sum_{j=2}^{l^{w}}}\left\{ ({1 - d_{j}^{w}} )\cdot{log\left[ \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]}+ d_{j}^{w}\cdot{log{\left[ 1 - \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]}} \right\}\end{align*}\)

We now maximize with gradient ascent. Differentiating with respect to \(\theta\):

\(\Rightarrow\quad \frac{\partial \iota(w, j)}{\partial \theta_{j-1}^{w}} = \frac{\partial }{\partial \theta_{j-1}^{w}} \left\{ ({1 - d_{j}^{w}})\cdot{log\left[ \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]} + d_{j}^{w}\cdot{log{\left[ 1 - \sigma(x_{w}^{T}\theta_{j-1}^{w}) \right]}} \right\}\)

where we use the derivative of the sigmoid: \(\sigma'(x) = \sigma(x)[1 - \sigma(x)] \)
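This identity follows directly from the definition of the sigmoid:

\(\sigma'(x) = \frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\left[ 1 - \sigma(x) \right]\)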

\(\begin{align*}&\Rightarrow\quad \frac{\partial \iota(w, j)}{\partial \theta_{j-1}^{w}} = ({1 - d_{j}^{w}})( 1 - \sigma(x_{w}^{T}\theta_{j-1}^{w}))x_{w} - d_{j}^{w} \sigma(x_{w}^{T}\theta_{j-1}^{w})x_{w} \\
&\Rightarrow\quad \frac{\partial \iota(w, j)}{\partial \theta_{j-1}^{w}} = [{1 - d_{j}^{w}} - \sigma(x_{w}^{T}\theta_{j-1}^{w})]x_{w} \end{align*}\)

The update rule for \(\theta_{j-1}^{w}\) is:

\(\theta_{j-1}^{w}:=\theta_{j-1}^{w} + \eta [{1 - d_{j}^{w}} - \sigma(x_{w}^{T}\theta_{j-1}^{w})]x_{w} \)

Differentiating with respect to the projection-layer vector \(x_{w}\):

\(\frac{\partial \iota(w, j)}{\partial \ x_{w}} = [{1 - d_{j}^{w}} - \sigma(x_{w}^{T}\theta_{j-1}^{w})]\theta_{j-1}^{w}\)

Note that \(x_{w}\) is the sum of the context word vectors, not the vector of a single word. How do we apply this update to the individual words? word2vec simply applies the update for \(x_{w}\) to the vector of every context word:

\(v(\tilde{w}) :=v(\tilde{w}) + \eta\sum_{j=2}^{l^{w}}{\frac{\partial \iota(w, j)}{\partial \ x_{w}}} \quad \tilde{w}\in{Context(w)}\)
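Putting the two updates together, here is a minimal NumPy sketch of one CBOW training step with hierarchical softmax; the function signature, array names, and the way the Huffman path is passed in are my own assumptions for illustration, not the original word2vec implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cbow_hs_step(V, Theta, context_ids, path_nodes, path_codes, eta=0.025):
        """One CBOW + hierarchical-softmax update.

        V           -- (vocab_size, dim) word vectors v(w)
        Theta       -- (num_inner_nodes, dim) parameters theta of the Huffman inner nodes
        context_ids -- indices of the context words of the center word w
        path_nodes  -- inner nodes on the Huffman path to w (theta_1^w ... theta_{l-1}^w)
        path_codes  -- codes d_2^w ... d_l^w (0 = positive branch, 1 = negative branch)
        """
        x_w = V[context_ids].sum(axis=0)      # projection layer: sum of the context vectors
        e = np.zeros_like(x_w)                # accumulated update for x_w
        for node, d in zip(path_nodes, path_codes):
            q = sigmoid(x_w @ Theta[node])
            g = eta * (1 - d - q)             # eta * [1 - d_j^w - sigma(x_w^T theta_{j-1}^w)]
            e += g * Theta[node]              # gradient of the j-th term w.r.t. x_w
            Theta[node] += g * x_w            # update theta_{j-1}^w
        V[context_ids] += e                   # apply the x_w update to every context word vector
        return V, Theta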

CBOW -> Negative Sampling

When the corpus is very large, the computation becomes very expensive, so we turn to a different optimization. Define a label that is 1 when the predicted center word is the true word and 0 otherwise:

\( L^{w}(\tilde{w})=\left\{ \begin{aligned} 1, \quad \quad\tilde{w}=w \\ 0, \quad \quad\tilde{w} \ne w \end{aligned} \right.\\ \)

\(p(u|Context(w)) = \left\{ \begin{aligned} \sigma(x_{w}^{T}\theta^{u}), \quad \quad L^{w}(u)=1 \\ 1 - \sigma(x_{w}^{T}\theta^{u}), \quad \quad L^{w}(u)=0 \end{aligned} \right.\\ \)

\(g(w) = \prod_{u\in{\left\{ w \right\}\cup{NEG(w)}}}^{}p(u|Context(w))\)

Any sampling scheme should make higher-frequency words more likely to be drawn. The basic idea is to take a segment of length 1 and divide it among the words in proportion to their frequencies:

\(len(w) = \frac{counter(w)}{\sum_{u\in{D}}^{}{counter(u)}}\\ \)

where \(counter(w)\) is the frequency of \(w\) in the corpus.

This divides the unit segment among the words in proportion to their frequencies.

To sample a word, draw a uniform random number in \([0,1]\), find the sub-segment it falls into, and return the word that owns that sub-segment.
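A minimal Python sketch of this sampler; note that the real word2vec raises the counts to the 0.75 power and uses a precomputed table, while the simple cumulative version below (with made-up counts) only illustrates the unit-segment idea.

    import bisect
    import random

    def build_sampler(counter):
        """counter: {word: frequency}. Returns a function that samples words with probability len(w)."""
        words = list(counter)
        total = sum(counter.values())
        cum, acc = [], 0.0
        for w in words:
            acc += counter[w] / total              # len(w) = counter(w) / sum_u counter(u)
            cum.append(acc)                        # right edge of w's sub-segment of [0, 1]

        def sample():
            r = random.random()                    # uniform point on the unit segment
            i = bisect.bisect_left(cum, r)         # which sub-segment does it fall into?
            return words[min(i, len(words) - 1)]   # guard against floating-point rounding
        return sample

    sample = build_sampler({"今天": 4, "天气": 3, "NLP": 2, "上课": 1})
    print([sample() for _ in range(5)])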

\(g(w) = \sigma(x_{w}^{T}\theta^{w})\prod_{u\in{NEG(w)}}^{}\left[ 1 - \sigma(x_{w}^{T}\theta^{u})\right]\)

\(\sigma(x_{w}^{T}\theta^{w})\) is the probability of predicting the center word \(w\) given the context \(Context(w)\).

\(\sigma(x_{w}^{T}\theta^{u}),\; u\in{NEG(w)}\) is the probability of predicting the center word \(u\) given the context \(Context(w)\).

\(\begin{align*}&\iota = logG = log\prod_{w\in{C}}g(w)= \sum_{w\in{C}}^{}log\;\;g(w)\\
&=\sum_{w\in{C}}^{}\;\;{\sum_{u\in{\left\{ w \right\}\cup{NEG(w)}}}^{}{\left\{ L^{w}(u)\cdot{log\left[ \sigma(x_{w}^{T}\theta^{u}) \right]} + \left[ 1- L^{w}(u) \right]\cdot{log\left[ 1- \sigma(x_{w}^{T}\theta^{u})\right]} \right\}}} \end{align*}\)

Differentiating with respect to \(\theta^{u}\):

\(\begin{align*}&\Rightarrow\quad \frac{\partial \iota(w, u)}{\partial \theta^{u}} = \frac{\partial }{\partial \theta^{u}} \left\{ L^{w}(u)\cdot{log\left[ \sigma(x_{w}^{T}\theta^{u}) \right]} + \left[ 1-L^{w}(u) \right]\cdot{log{\left[ 1 - \sigma(x_{w}^{T}\theta^{u}) \right]}} \right\}\\
&\quad\quad\quad \quad\quad \quad \; \; = L^{w}(u)( 1 - \sigma(x_{w}^{T}\theta^{u}))x_{w} - \left[ 1- L^{w}(u) \right] \sigma(x_{w}^{T}\theta^{u})x_{w} \\
&\quad\quad\quad \quad\quad \quad \; \; = [L^{w}(u) -\sigma(x_{w}^{T}\theta^{u})]x_{w} \end{align*}\)

The update rule for \(\theta^{u}\) is:

\(\theta^{u}:=\theta^{u} + \eta [L^{w}(u) - \sigma(x_{w}^{T}\theta^{u})]x_{w}\)

Differentiating with respect to the projection-layer vector \(x_{w}\):

\(\begin{align*}&\frac{\partial \iota(w, u)}{\partial \ x_{w}} =[L^{w}(u) -\sigma(x_{w}^{T}\theta^{u})]\theta^{u}\\
&v(\tilde{w}) :=v(\tilde{w}) + \eta\sum_{u\in{\left\{ w \right\}\cup{NEG(w)}}}^{}{\frac{\partial \iota(w, u)}{\partial \ x_{w}}} \quad \tilde{w}\in{Context(w)}\end{align*}\)
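Mirroring the hierarchical-softmax sketch above, one CBOW + negative-sampling step could look like this (again, the names and signature are illustrative assumptions, not the original implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cbow_ns_step(V, Theta, context_ids, center_id, neg_ids, eta=0.025):
        """V: word vectors v(w); Theta: one output parameter vector theta^u per word."""
        x_w = V[context_ids].sum(axis=0)                         # projection layer
        e = np.zeros_like(x_w)
        samples = [(center_id, 1)] + [(u, 0) for u in neg_ids]   # L^w(u) = 1 for w, 0 for negatives
        for u, label in samples:
            q = sigmoid(x_w @ Theta[u])
            g = eta * (label - q)                                # eta * [L^w(u) - sigma(x_w^T theta^u)]
            e += g * Theta[u]                                    # gradient w.r.t. x_w
            Theta[u] += g * x_w                                  # update theta^u
        V[context_ids] += e                                      # apply to every context word vector
        return V, Theta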

SkipGram -> Negative Sampling

Methods that predate Word2Vec are often called global methods (they rely on global co-occurrence statistics), while Word2Vec-style methods are called local methods (they rely on a local context window).

Given a sentence, when there is enough data, words that appear close to each other tend to be more semantically related than words that do not. The objective function is built on this idea.

Skip-Gram

Based on this relatedness, Skip-Gram predicts the surrounding words from the center word.

CBOW

CBOW predicts the center word from the surrounding words.

When data is limited, with window_size = 2 a single center word gives Skip-Gram four training samples while CBOW produces only one, so Skip-Gram is usually the better choice.

SkipGram: Core Idea

Let the corpus be "We are working on NLP Project, it is interesting". With a window size of 1, the objective function is:

\(\begin{equation} \mathop{maximize}_{\theta} \ \ P(are|we)\cdot{P(we|are)}\cdot{P(working|are)}\cdot{P(are|working)}\cdot{P(on|working)}\cdots \end{equation}\)

Each factor is a function of the parameters \(\theta\), for example \(P(are|we) = f(\theta)\).

SkipGram Formulation

\(\begin{equation} \mathop{Maximize}_{\theta} \ \ \prod_{w\in{Text}}^{}\prod_{c\in{context(w)}}^{}P(c|w;\theta) \end{equation}\)

\(\Rightarrow\begin{equation} \mathop{Maximize}_{\theta} \ \ \sum_{w\in{Text}}^{}\sum_{c\in{context(w)}}^{}log P(c|w;\theta) \end{equation}\)

\(w\) is the center word and \(c\) is a context word of \(w\).

Corpus:

doc1: 今天 天气 很好 ("the weather is nice today")

doc2: 今天 上 NLP 课程 ("I have an NLP class today")

\(\begin{equation} \mathop{\arg\max}_{\theta} \ \ {P(天气|今天)}\cdot{P(今天|天气)}\cdot{P(很好|天气)}\cdot{P(天气|很好)} \\\cdot{P(上|今天)} \cdot{P(今天|上)} \cdot{P(NLP|上)} \cdot{P(上|NLP)} \cdot{P(课程|NLP)} \cdot{P(NLP|课程)}\end{equation}\)

Embedding Matrix

\(\nu\) is the center-word embedding matrix and \(\mu\) is the context-word embedding matrix.

\(\theta=\left\{ \nu,\mu \right\}\)

\(P(c|w;\theta) = \frac{e^{\mu_{c}\cdot{\nu_{w}}}}{\sum_{c'\in{V}}^{}{e^{\mu_{c'}\cdot{\nu_{w}}}}}\leftarrow{Softmax}\)

This is a softmax. \(c\) is a context word and \(w\) is the center word; the numerator measures the similarity between the context word and the center word, which we want to be as large as possible.

Here is an example:

Vocabulary: {今天, 天气, 好, NLP, 上课}

\(P(今天|天气) = \frac{e^{\mu_{今天}\cdot{\nu_{天气}}}}{e^{\mu_{今天}\cdot{\nu_{天气}}} + e^{\mu_{天气}\cdot{\nu_{天气}}} + e^{\mu_{好}\cdot{\nu_{天气}}} + e^{\mu_{NLP}\cdot{\nu_{天气}}} + e^{\mu_{上课}\cdot{\nu_{天气}}}}\)

\(P(今天|天气) + P(天气|天气) + P(好|天气)+ P(NLP|天气)+P(上课|天气) = 1\)
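A tiny NumPy sketch of this normalization, with random vectors that stand in for the rows of \(\mu\) and \(\nu\) (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["今天", "天气", "好", "NLP", "上课"]
    dim = 8
    mu = {w: rng.normal(size=dim) for w in vocab}   # context-word ("output") vectors
    nu = {w: rng.normal(size=dim) for w in vocab}   # center-word ("input") vectors

    def p_context_given_center(c, w):
        """P(c | w) via the softmax over all candidate context words c'."""
        scores = np.array([mu[cp] @ nu[w] for cp in vocab])
        exp = np.exp(scores - scores.max())         # subtract max for numerical stability
        return exp[vocab.index(c)] / exp.sum()

    print(p_context_given_center("今天", "天气"))
    print(sum(p_context_given_center(c, "天气") for c in vocab))   # = 1.0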

The objective function is now:

\(\theta=\left\{ \nu,\mu \right\}\)

\(\begin{equation} \mathop{\arg\max}_{\theta} \ \ \sum_{w\in{Text}}^{}\;\sum_{c\in{context(w)}}^{} log \; \frac{e^{\mu_{c}\cdot{\nu_{w}}}}{\sum_{c'\in{Vocab}}^{}{e^{\mu_{c'}\cdot{\nu_{w}}}}} \end{equation}\)

\(\Rightarrow \; \begin{equation} \mathop{\arg\max}_{\theta} \ \ \sum_{w\in{Text}}^{}\;\sum_{c\in{context(w)}}^{} \left(\mu_{c}\cdot{\nu_{w}} - log\sum_{c'\in{Vocab}}^{}{e^{\mu_{c'}\cdot{\nu_{w}}}} \right) \end{equation}\)

We are now left with a \(log\sum\) term, and the \(\sum\) is linear in the vocabulary size, so evaluating it is very expensive; we cannot proceed efficiently this way.

There are two standard ways around this:

  • hierarchical softmax
  • negative sampling

Hierarchical softmax was already covered in the CBOW discussion above, so here we focus on negative sampling.

SkipGram Negative Sampling

\(s =\left\{ w_{1},w_{2},w_{3},w_{4},w_{5},w_{6} \right\}, window\_size=2\)

For the center word \(w_{3}\), we want the in-window probabilities \(p(w_{1}|w_{3}), p(w_{2}|w_{3}), p(w_{4}|w_{3}), p(w_{5}|w_{3})\) to be as large as possible,

and the out-of-window probability \(p(w_{6}|w_{3})\) to be as small as possible.

\(p(D=1|w_{1}, w_{3})\) denotes the probability that \(w_{1}\) and \(w_{3}\) appear in the same window,

and \(p(D=0|w_{1}, w_{3})\) denotes the probability that they do not appear in the same window.

Objective Function

\(s =\left\{ w_{1},w_{2},w_{3},w_{4},w_{5} \right\}, window\_size=1\)

\(\iota = \begin{equation} \mathop{\arg\max}_{\theta} \ \ P(D=1|w_{1},w_{2})\cdot{P(D=1|w_{2},w_{1})} \\ \cdot{P(D=1|w_{2},w_{3})}\cdot{P(D=1|w_{3},w_{2})}\\ \cdot{P(D=1|w_{3},w_{4})}\cdot{P(D=1|w_{4},w_{3})}\\ \cdot{P(D=1|w_{4},w_{5})}\cdot{P(D=1|w_{5},w_{4})}\\ \cdot{(1-P(D=1|w_{1},w_{3}))}\cdot{(1-P(D=1|w_{1},w_{4}))}\\ \cdot{(1-P(D=1|w_{1},w_{5}))}\cdot{(1-P(D=1|w_{2},w_{4}))}\\ \cdot{(1-P(D=1|w_{2},w_{5}))}\cdot{(1-P(D=1|w_{3},w_{1}))}\\ \cdot{(1-P(D=1|w_{3},w_{5}))}\cdot{(1-P(D=1|w_{4},w_{1}))}\\ \cdot{(1-P(D=1|w_{4},w_{2}))}\cdot{(1-P(D=1|w_{5},w_{1}))}\\ \cdot{(1-P(D=1|w_{5},w_{2}))}\cdot{(1-P(D=1|w_{5},w_{3}))} \end{equation}\)

The factors \(P(D=1|w_{i},w_{j})\) are the positive pairs,

and the factors \(1-P(D=1|w_{i},w_{j}) = P(D=0|w_{i},w_{j})\) are the negative pairs.

\(P(D=1|w_{i},w_{j}) = \sigma(u_{w_{j}}\cdot{v_{w_{i}}}) = \frac{1}{1 + e^{-u_{w_{j}}\cdot{v_{w_{i}}}}}\)

\(\Rightarrow \begin{equation} \mathop{\arg\max}_{\theta} \ \ log\prod_{(w,c)\in{D}}^{}P(D=1|w,c;\theta) \prod_{(w,c)\in{\tilde{D}}}^{}(1-P(D=1|w,c;\theta) ) \end{equation}\)

\(\Rightarrow \begin{equation} \mathop{\arg\max}_{\theta} \ \ \sum_{(w,c)\in{D}}^{}log\;\sigma(u_{c}\cdot{v_{w}}) + \sum_{(w,c)\in{\tilde{D}}}^{}log\;\sigma(-u_{c}\cdot{v_{w}}) \end{equation}\)

Suppose the corpus contains \(10^{9}\) tokens and \(window=5\); then the number of positive pairs is \(|D| = 10^{9}\times5\times2\).

\(|\tilde{D}| = |V|^{2} - |D| \gg |D|\): the number of negative pairs is far larger than the number of positive pairs.

To reduce this, for each positive pair we keep only a small random sample of its negative pairs:

\( L(\theta) \approx \begin{equation} \mathop{\arg\max}_{\theta} \ \sum_{(w,c)\in{D}}^{} \left[ log\;\sigma(u_{c}\cdot{v_{w}}) + \sum_{c’\in{N(w)}}^{}log\;\sigma(-u_{c’}\cdot{v_{w}}) \right] \end{equation}\)

We solve for the parameters with SGD.

Derivative with respect to \(u_{c}\): \(\frac{\partial L(\theta)}{\partial \ u_{c}} = \left[ 1- \sigma(u_{c}\cdot{v_{w}})\right]\cdot{v_{w}}\)

Derivative with respect to \(u_{c'}\): \(\frac{\partial L(\theta)}{\partial \ u_{c'}} = \left[ \sigma(-u_{c'}\cdot{v_{w}}) - 1\right]\cdot{v_{w}}\)

Derivative with respect to \(v_{w}\): \(\frac{\partial L(\theta)}{\partial \ v_{w}} = \left[ 1-\sigma(u_{c}\cdot{v_{w}}) \right]\cdot{u_{c}} + \sum_{c'\in{N(w)}}^{}\left[ \sigma(-u_{c'}\cdot{v_{w}}) - 1\right]\cdot{u_{c'}}\)

Putting these gradients into an SGD loop gives the update procedure sketched below.
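This is a minimal Python version of one SGD step, written from the three gradients above; the names `U`, `V`, and `neg_ids` are my own assumptions, not the original code.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sgns_step(U, V, w, c, neg_ids, eta=0.025):
        """One SGD step of skip-gram with negative sampling.

        U       -- (vocab, dim) context-word ("output") vectors u
        V       -- (vocab, dim) center-word ("input") vectors v
        w, c    -- indices of the center word and of one true context word
        neg_ids -- indices of the sampled negative words c' in N(w)
        """
        v_w = V[w].copy()                                   # use the old v_w for all gradients
        grad_v = (1 - sigmoid(U[c] @ v_w)) * U[c]           # dL/dv_w, positive term
        U[c] += eta * (1 - sigmoid(U[c] @ v_w)) * v_w       # dL/du_c
        for cp in neg_ids:
            s = sigmoid(-U[cp] @ v_w)
            grad_v += (s - 1) * U[cp]                       # dL/dv_w, negative terms
            U[cp] += eta * (s - 1) * v_w                    # dL/du_c'
        V[w] += eta * grad_v                                # gradient-ascent update of v_w
        return U, V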

Some shortcomings of Skip-Gram:

  • Word order is ignored (solution: language-model-based embeddings such as NNLM, RNN/LSTM)
  • Limited by the window size, i.e. purely local (solution: LM-based embeddings with RNN/LSTM; Transformer)
  • Shallow network (solution: ELMo)
  • Cannot capture coreference (solution: Transformer)
  • One vector per word: polysemy and context are ignored (solution: ELMo, BERT, XLNet)
  • Handles low-frequency and out-of-vocabulary words poorly (solution: Gaussian Embedding for rare words, subword models)
  • Weak interpretability, e.g. for hierarchical (is-a) relations (solution: Hyperbolic Embedding)
