
HMM Math Derivation

Markov chain

Markov chain definition

A Markov chain is a model that tells us something about the probabilities of sequences of random variables, or states, each of which can take on values from some set.

\text { Markov Assumption: } \quad P\left(q_{i}=a | q_{1} \ldots q_{i-1}\right)=P\left(q_{i}=a | q_{i-1}\right)

Markov chain application

A Markov chain is useful when we need to compute a probability for a sequence
of observable events.

The Hidden Markov Model

A hidden Markov model (HMM) allows us to talk about both observed events
(like words that we see in the input) and hidden events (like part-of-speech tags) that
we think of as causal factors in our probabilistic model.

A first-order hidden Markov model makes two simplifying assumptions:

\small
\text { Markov Assumption: } \quad P\left(q_{i} | q_{1} \ldots q_{i-1}\right)=P\left(q_{i} | q_{i-1}\right)
\small
\text { Output Independence: } \quad P\left(o_{i} | q_{1} \ldots q_{i}, \ldots, q_{T}, o_{1}, \ldots, o_{i}, \ldots, o_{T}\right)=P\left(o_{i} | q_{i}\right)

1. Likelihood Computation: The Forward Algorithm

Computing Likelihood: Given an HMM \lambda=(A, B) and an observation sequence O, determine the likelihood P(O | \lambda)

The joint probability of being in a particular weather sequence Q and generating a particular sequence O of ice-cream events is:

\small
P(O, Q)=P(O | Q) \times P(Q)=\prod_{i=1}^{T} P\left(o_{i} | q_{i}\right) \times \prod_{i=1}^{T} P\left(q_{i} | q_{i-1}\right)

We can compute the total probability of the observations just by summing over all possible hidden state sequences:

\small
P(O)=\sum_{Q} P(O, Q)=\sum_{Q} P(O | Q) P(Q)

At this point, if we enumerate every state sequence that could generate O, we can compute P(O). But that means summing over all N^{T} possible hidden sequences, which is computationally prohibitive!
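As a point of comparison, here is a minimal brute-force sketch in Python/NumPy that enumerates every hidden sequence; the function name and interface are illustrative assumptions, not from the original text, and pi, A, B stand for any HMM's initial, transition, and emission matrices:

```python
import itertools
import numpy as np

def brute_force_likelihood(obs, pi, A, B):
    """P(O) = sum over every hidden sequence Q of P(O | Q) * P(Q).

    obs: list of observation indices; pi: (N,); A: (N, N); B: (N, M).
    Enumerates all N^T sequences, so it is only feasible for tiny T.
    """
    N, T = A.shape[0], len(obs)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):   # all N^T hidden sequences
        p = pi[Q[0]] * B[Q[0], obs[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], obs[t]]
        total += p
    return total
```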

Instead we use dynamic programming, which in this setting is called the forward algorithm. The recursion is the same pattern as in the earlier CRF derivation; drawing the trellis or expanding the formula for a few tags makes it easy to follow (a NumPy sketch follows the three steps below).
The time complexity is O(N^{2}T).

  1. Initialization:
    \small
    \alpha_{1}(j)=\pi_{j} b_{j}\left(o_{1}\right) ; 1 \leq j \leq N
  2. Recursion:
    \small
    \alpha_{t}(j)=\sum_{i=1}^{N} \alpha_{t-1}(i) a_{i j} b_{j}\left(o_{t}\right) ;  1 \leq j \leq N, 1 < t \leq T
  3. Termination:
    \small
    P(O | \lambda)=\sum_{i=1}^{N} \alpha_{T}(i)
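
A minimal NumPy sketch of the three steps above; the two-state ice-cream numbers at the bottom are illustrative assumptions only:

```python
import numpy as np

def forward(obs, pi, A, B):
    """Return P(O | lambda) and the alpha trellis (shape T x N)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # 1. Initialization
    for t in range(1, T):                              # 2. Recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   #    sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return alpha[-1].sum(), alpha                      # 3. Termination

# Hypothetical 2-state (HOT, COLD) ice-cream HMM; the numbers are made up.
pi = np.array([0.8, 0.2])
A  = np.array([[0.6, 0.4],
               [0.5, 0.5]])
B  = np.array([[0.2, 0.4, 0.4],    # P(1, 2, or 3 ice creams | HOT)
               [0.5, 0.4, 0.1]])   # P(1, 2, or 3 ice creams | COLD)
print(forward([2, 0, 2], pi, A, B)[0])   # observation sequence "3 1 3"
```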

2. Decoding: The Viterbi Algorithm

Decoding: Given as input an HMM \lambda=(A, B) and a sequence of observations O, find the most probable sequence of states
Q.

  1. Initialization:

    \small
    v_{1}(j)=\pi_{j} b_{j}\left(o_{1}\right);  1 \leq j \leq N \\
    bt_{1}(j) = 0; 1 \leq j \leq N  \text{ (records the best previous state for the backtrace)} \\
  2. Recursion:

    \small
    v_{t}(j) = \mathop {max} \limits_{i = 1}^{N} v_{t-1}(i)a_{ij}b_{j}\left(o_{t}\right); 1 \leq j \leq N, 1 < t \leq T \\
    bt_{t}(j) = \mathop {argmax} \limits_{i = 1}^{N} v_{t-1}(i)a_{ij}b_{j}\left(o_{t}\right); 1 \leq j \leq N, 1 < t \leq T
  3. Termination:

    \small
    \text { The best score: } p^{*} = \mathop {max} \limits_{i = 1}^{N}v_{T}(i) \\
    \text { The start of backtrace: } q_{T}^{*} = \mathop {argmax} \limits_{i = 1}^{N}v_{T}(i)
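
A corresponding NumPy sketch of Viterbi, reusing the same (pi, A, B) convention as the forward sketch above; the vectorized max/argmax is one possible way to write the recursion:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the best score and the most probable state sequence."""
    N, T = A.shape[0], len(obs)
    v  = np.zeros((T, N))                 # v_t(j): best path probability ending in j at t
    bt = np.zeros((T, N), dtype=int)      # bt_t(j): best previous state (backtrace)
    v[0] = pi * B[:, obs[0]]              # 1. Initialization (bt_1(j) = 0 by default)
    for t in range(1, T):                 # 2. Recursion
        scores = v[t - 1][:, None] * A * B[:, obs[t]][None, :]   # scores[i, j]
        v[t]  = scores.max(axis=0)
        bt[t] = scores.argmax(axis=0)
    best_score = v[-1].max()              # 3. Termination
    path = [int(v[-1].argmax())]          # start of the backtrace: q_T*
    for t in range(T - 1, 0, -1):
        path.append(int(bt[t, path[-1]]))
    return best_score, path[::-1]
```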

3. HMM Training: The Forward-Backward Algorithm

Learning: Given an observation sequence O and the set of possible
states in the HMM, learn the HMM parameters A and B.

To understand the algorithm, we need to define a useful probability related to the forward probability and called the backward probability. The backward probability \beta is the probability of seeing the observations from time t +1 to the end, given that we are in state i at time t (and given the automaton \lambda):

\small
\beta_{t}(i) = P\left(o_{t+1},o_{t+2} \ldots o_{T}|q_{t}=i,\lambda\right)

It is computed inductively in a similar manner to the forward algorithm.

  1. Initialization:
    \small
    \beta_{T}\left(i\right)=1, 1 \leq i \leq N\\
  2. Recursion:
    \small
    \beta_{t}(i)=\sum_{j=1}^{N}a_{ij}b_{j}\left(o_{t+1}\right)\beta_{t+1}(j), \quad 1 \leq i \leq N, 1 \leq t < T \\
  3. Termination:
    \small
    P\left(O| \lambda \right) = \sum_{j=1}^{N}\pi_{j}b_{j} \left(o_{1}\right)\beta_{1}\left(j\right)
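
The backward pass mirrors the forward sketch; here is a minimal NumPy version under the same conventions as above:

```python
import numpy as np

def backward(obs, pi, A, B):
    """Return P(O | lambda) and the beta trellis (shape T x N)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                        # 1. Initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                        # 2. Recursion, t = T-1 ... 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    #    sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    prob = (pi * B[:, obs[0]] * beta[0]).sum()            # 3. Termination
    return prob, beta
```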

Computing the transition probabilities

Define a few important probabilities.

Define \xi_{t}(i,j) as the probability of being in state i at time t and state j at time t+1, given the observation sequence and, of course, the model:

\small
\xi_{t}(i,j)=P(q_{t}=i,q_{t+1}=j|O,\lambda)
\small
\text{not-quite-}\xi_{t}(i,j) = P\left(q_{t}=i, q_{t+1}=j, O|\lambda\right)
=\alpha_{t}(i)a_{ij}b_{j}(o_{t+1})\beta_{t+1}(j)
\small
P(O|\lambda)=\sum_{j=1}^{N}\alpha_{t}(j)\beta_{t}(j)

Since P(X,Y|Z) = P(X|Y,Z)P(Y|Z), we have P(X|Y,Z)=\frac{P(X,Y|Z)}{P(Y|Z)}, and therefore:

\small
\xi_{t}(i,j)=\frac{\alpha_{t}(i)a_{ij}b_{j}(o_{t+1}) \beta_{t+1}(j)}{\sum_{j=1}^{N}\alpha_{t}(j)\beta_{t}(j)}

The expected number of transitions from state i to state j is then the sum over all
t of \xi.

The total expected number of transitions out of state i can be obtained by summing over all transitions out of state i.

This gives the re-estimation formula for the transition probabilities:

\small
\hat{a}_{ij}=\frac{\sum_{t=1}^{T-1}\xi_{t}(i,j)}
{\sum_{t=1}^{T-1}\sum_{k=1}^{N}\xi_{t}(i,k)}

Computing the emission probabilities

Define \gamma_{t}(j) as the probability of being in state j at time t:

 \small
 \gamma_{t}(j)=P(q_{t}=j|O,\lambda)

Following the same manipulation as above:

 \small
 \gamma_{t}(j)=\frac{P(q_{t}=j,O|\lambda)}{P(O|\lambda)} \\
 =\frac{\alpha_{t}(j)\beta_{t}(j)}{P(O|\lambda)}

The re-estimated emission probability:

 \small
 \hat{b}_{j}(v_{k})=\frac{\sum_{t=1 \text{ s.t. } o_{t}=v_{k}}^{T}\gamma_{t}(j)}
 {\sum_{t=1}^{T}\gamma_{t}(j)}

em_step: compute the new A and B from \xi and \gamma.
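
A minimal sketch of one em_step, assuming the forward and backward functions sketched above are in scope; it builds \xi and \gamma as in the formulas of this section and re-estimates A and B (re-estimation of \pi and iterating to convergence are omitted):

```python
import numpy as np

def em_step(obs, pi, A, B):
    """One Baum-Welch step: re-estimate A and B from xi and gamma."""
    obs = np.asarray(obs)
    N, M, T = A.shape[0], B.shape[1], len(obs)
    prob_o, alpha = forward(obs, pi, A, B)     # forward sketch from above
    _, beta = backward(obs, pi, A, B)          # backward sketch from above

    # E step: xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob_o
    # gamma_t(j) = alpha_t(j) beta_t(j) / P(O | lambda)
    gamma = alpha * beta / prob_o

    # M step: a_hat_ij and b_hat_j(v_k)
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_new = np.vstack([gamma[obs == k].sum(axis=0) for k in range(M)]).T \
            / gamma.sum(axis=0)[:, None]
    return A_new, B_new
```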

Summary

Unsupervised HMM training mainly involves the three problems above. In the supervised case, A and B can be computed directly from an annotated corpus; for a concrete supervised implementation see FastHMM. However, many production problems cannot be annotated, or annotation is very costly (speech recognition, for example), so understanding the unsupervised HMM implementation is very valuable. If, after obtaining the model, we need to sample from it rather than only find the best state sequence, the forward-backward algorithm also gives us P(q_{t}|O,\lambda), from which we can sample.
After editing formulas for a whole day, I am now much more familiar with LaTeX syntax, haha!!