# Hidden Markov Models (HMMs)
> See also:
> - https://stats.stackexchange.com/questions/47846/how-to-define-initial-probabilities-for-hmm
Hidden Markov Models (HMMs) are a type of [[Probabilistic Graphical Models (PGMs)|probabilistic graphical model]] that expand on the concept of [[Markov chains]] by introducing a layer of *hidden states*, each of which emits observable symbols.
The whole purpose of these algorithms is to take a look at the “big picture” across an entire sequence, despite the fact that the Markov assumption should limit us to only considering the immediately preceding state.
If we chain these independent events together, we can calculate the probability of a chain of events occurring.
In a sense, HMMs can be considered the inverse of a [[Machine Learning|classification model]]: instead of trying to predict an outcome based on a set of known properties, we are *given the outcome* (the observations) and are trying to *predict the most likely properties* that led to it (the hidden states).
> [!question]- The Three Main Questions
>
> **1) The Evaluation Problem (Scoring)**
>
> What is the likelihood that a given sequence of observations occurs given our HMM, $P(O|\lambda)$?
> - *The Forward Algorithm & The Backward Algorithm*
>
>
> **2) The Decoding Problem (Parsing)**
>
> What is the most likely sequence of hidden states that would explain a given sequence of emissions? (infer hidden states)
> - *The Viterbi Algorithm*
>
>
> **3) The Training Problem (Learning)**
>
> What are the parameters that will most likely model a given sequence of hidden states/emissions?
> - *One of the three following algorithms:*
> - Maximum Likelihood Estimation (MLE) (For labeled data, aka observations & hidden states)
> - Viterbi Training (Not the same as Viterbi decoding)
> - Baum-Welch (Estimation-Maximization) (Forward-Backward)
> - learning = training = parameter estimation
>
> *→ Viterbi Training*
> *→ Baum-Welch Training*
> *→ Alternative Training Methods (i.e. Sampling/MCMC)*
![[Pasted image 20240912222754.png|500]]
### Bayesian Inference
The posterior distribution of a hidden Markov model is essentially the probability of being in a particular hidden state at a specific time, given the sequence of observed data (the Decoding Problem).
## Formal Definition
> [!abstract] **Definition:** Hidden Markov Models
>
> The *parameters* of a hidden Markov model can be compactly written as $\lambda$:
> $\lambda = \{A, B, \pi\}$
>
> where:
> - $\pi$ is the vector of **initial state probabilities**, where $\pi_i$ is the probability that state $i$ is the initial state.
> - $A$ is the **transition probability matrix**, where $a_{i \rightarrow j}$ is the probability of transitioning from state $i$ to $j$.
> - $B$ is the **emission probability matrix**, where $b_j(k)$ is the probability of observing symbol $k$ in state $j$.
>
> ---
> The sequences are defined as:
>
> **Observations:** $O=\{o_1, o_2, \ldots, o_T\}$ is the sequence of observed symbols (emissions)
> - $|O| = T$
>
> **State Path:** $Q=\{q_1, q_2, \ldots, q_T\}$ is the sequence of hidden states traversed
> - $|Q| = T$
>
> Compactly, an HMM is represented as $\lambda = \{A, B, \pi\}$.
>
> In Markov chains, we assumed that the current state only depends on the previous one. HMMs keep this *Markov assumption* for the hidden state sequence, and additionally assume that each observation $o_t$ depends only on the current hidden state $q_t$.
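To make the notation concrete, here is what $\lambda$ might look like as NumPy arrays for a hypothetical two-state model (the states, symbols, and all probability values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical two-state HMM: hidden states {0: Rainy, 1: Sunny},
# observed symbols {0: walk, 1: shop, 2: clean}. All numbers are made up.
pi = np.array([0.6, 0.4])        # pi[i]   = P(first state is i)
A = np.array([[0.7, 0.3],        # A[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],   # B[i, k] = P(emit symbol k | state i)
              [0.6, 0.3, 0.1]])

# Each row of A and B is a probability distribution, so rows sum to 1.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```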
#### Parameter Inference Through Sampling
> See also:
> - [[Markov Chain Monte Carlo (MCMC)]]
## HMM Algorithms
![[References/Excalidraw/Hidden Markov Model Algorithms|Hidden Markov Model Algorithms]]
> [!danger] The Brute-Force Approach
> One possible approach is to generate every single possible walk through the model and select the one with the highest probability.
>
> Of course, this runtime grows exponentially as the sequence length and number of states increase: there are $N^T$ possible state paths for $N$ states and $T$ observations.
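As a sanity check of what “brute force” means here, a minimal sketch that enumerates all $N^T$ state paths (using the hypothetical `pi`, `A`, `B` arrays from earlier):

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, obs):
    """Compute P(O | lambda) by summing over every possible state path.

    Runtime is O(T * N^T): hopeless for anything but toy inputs.
    """
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```

The forward algorithm below computes exactly the same quantity in $O(N^2 T)$ by reusing the shared prefixes of these paths.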
### The Forward & Backward Algorithms
The overall purpose of these algorithms is to calculate the **joint probability** for a sequence of consecutive events.
The problem boils down to identifying the joint probability across the sequence of observations. Because there are multiple potential ways these emissions can be produced, we can utilize the SUM, aka EITHER/OR, rule of [[Probability Theory|probability]] (hence the summation seen in the algorithm).
It’s pretty important to *distinguish the algorithms themselves*, from *the probabilities they are calculating*:
- Both algorithms produce an $N \times T$ matrix, where $N$ is the number of states and $T$ is the length of the observation sequence.
While each element within these matrices is the joint probability of a state at a given time step together with the past/future observations, you can *normalize these probabilities* across all possible states at that time step to find a *likelihood distribution over the states at a specific time step*.
### The Forward Algorithm
The **forward algorithm** computes the *forward probability* for all hidden states. The forward probability is the probability of seeing the observations $x_1, x_2, … x_t$ and being in state $i$ at time $t$, given a model $\lambda = \{\pi, A, B\}$.
Instead of needing to recalculate the preceding forward probabilities, it is able to use the computed probability at the current time step $(t)$ to derive the probability of being in a particular hidden state at the next time step $(t+1)$.
$\alpha_t(i) =$ probability of being in state $i$ at time step $t$, jointly with all observations up to that point
![[Pasted image 20241022154418.png|400]]
> [!summary]+ **Definition:** The Forward Algorithm
>
> The function $\alpha_t(i)$ calculates the probability of being in state $i$ at time step $t$, *jointly with all past observations.*
>
> **1. Initialization:**
> $\alpha_1(i)=\pi_i B_i(o_1)$
>
> **2. Iteration:**
> $\alpha_{t+1}(i) = B_i(o_{t+1}) \sum^{N}_{j=1}\alpha_t(j) \cdot A_{j \rightarrow i}$
> - The function $B_i$ can be brought out of the summation as it does not involve the iterator $j$.
>
> **3. Termination:**
> $P(O|\lambda) = \sum^{N}_{i=1}\alpha_T(i)$
>
> where:
> - $\alpha_t(i)$ is the probability of being in state $i$ at time $t$ (*The Forward Probability*).
> - $\pi_i$ is the probability that the initial state is $i$.
> - $B_i(o_t)$ is the probability of seeing symbol $o_t$ in state $i$ (*Emission Matrix*).
> - $A_{j \rightarrow i}$ is the probability of being in state $i$ given the previous state was $j$ (*Transition Matrix*).
> ---
>
> **Output:** A 2D matrix of forward probabilities with size $N \times T$ (one row per state, one column per time step).
An optional termination step can be performed to output a final forward probability for the entire sequence of observations.
As we get deeper into the nested function call, we get closer to the beginning of the sequence. The “forward” aspect comes once we reach the base call and begin surfacing.
While the approach can be represented recursively, the actual implementation of the forward algorithm is *often performed iteratively* from the beginning of the sequence, with the forward matrix being populated from the ground up over time.
We can either “propel” the algorithm forward through an iterative loop, or first recurse down to the base case and then return the values until we reach the end of the sequence.
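A minimal iterative sketch of the forward algorithm, using the array conventions from the earlier example (a real implementation would typically work in log space or rescale each column to avoid numerical underflow):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Fill the N x T forward matrix and return it with P(O | lambda)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    # Initialization: alpha_1(i) = pi_i * B_i(o_1)
    alpha[:, 0] = pi * B[:, obs[0]]
    # Iteration: alpha_{t+1}(i) = B_i(o_{t+1}) * sum_j alpha_t(j) * A[j, i]
    for t in range(1, T):
        alpha[:, t] = B[:, obs[t]] * (alpha[:, t - 1] @ A)
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return alpha, alpha[:, -1].sum()
```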
### The Backward Algorithm
The backward probability ($\beta$) is the probability of seeing the rest of the observations from time $t+1$ to the end of the sequence, given that we are in state $i$ at time $t$.
![[Pasted image 20241022154453.png|350]]
> [!summary]- **Definition:** The Backward Algorithm
>
> The function $\beta_t(i)$ calculates the probability of seeing the remaining observations $o_{t+1}, \ldots, o_T$, given that we are in state $i$ at time $t$.
>
> **1. Initialization:**
> $\beta_T(i) = 1$
>
> **2. Iteration:**
> $\beta_t(i) = \sum^{N}_{j=1} A_{i \rightarrow j} B_j(o_{t+1}) \beta_{t+1}(j)$
>
> **3. Termination:**
> $P(O|\lambda) = \sum^{N}_{i=1} \pi_i B_i(o_1) \beta_1(i)$
>
> **Output:** A 2D matrix of backward probabilities with size $N \times T$.
It’s important to notice that $\beta_t(i)$ does not include the emission probability of the current state at time $t$ (that term already lives in the forward probability $\alpha_t(i)$). *This allows us to combine the forward and backward algorithms* later on without any overlap of the probabilities.
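Mirroring the forward sketch, a minimal backward implementation under the same assumed conventions:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Fill the N x T backward matrix and return it with P(O | lambda)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((N, T))
    # Initialization: beta_T(i) = 1 for every state.
    beta[:, -1] = 1.0
    # Iteration (right to left):
    # beta_t(i) = sum_j A[i, j] * B[j, o_{t+1}] * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[:, t] = A @ (B[:, obs[t + 1]] * beta[:, t + 1])
    # Termination: P(O | lambda) = sum_i pi_i * B_i(o_1) * beta_1(i)
    return beta, (pi * B[:, obs[0]] * beta[:, 0]).sum()
```

Note how the recursion never multiplies in the emission probability of the current state $i$ at time $t$, exactly as described above.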
### The Baum-Welch Algorithm
> See also:
> - [[Expectation Maximization (EM)]]
The **Baum-Welch Algorithm** is an unsupervised (or semi-supervised) learning algorithm used to *estimate the transition and emission parameters* of [[Hidden Markov Models (HMMs)]].
- It is a form of the [[Expectation Maximization (EM)]] Algorithm but tailored for HMMs.
- It is most commonly used in unsupervised scenarios, where only sequences of observations are known.
We first have to identify the *expected hidden states (E-Step)* before identifying the *most likely parameters* that would have produced them (M-Step).
The method used to find the maximum-likelihood parameters is to count the expected number of transitions that occur and use those counts to generate ratios of how frequently the model moves between one state and another.
This counting relies on the forward and backward probability distributions.
#### Combining The Forward & Backward Algorithms
> See also:
> - https://crazyeights225.github.io/hmm2/
The **expectation step (E-Step)** consists of combining the forward and backward algorithms to tell us *the expected hidden state* that produced the observation *at a given time step*.
These probabilities are often called the *posterior probabilities*.
It’s extremely important to distinguish the forward-backward algorithm from the Viterbi algorithm:
- The Viterbi algorithm aims to identify the most likely *sequence* of hidden states
- The forward-backward algorithm only aims to identify the most likely hidden state *at a given position*
> [!question]- But Why Do We Need Both?
>
> **Neither the forward algorithm nor the backward algorithm alone is sufficient.**
>
> Consider this: we identify a potential most likely state at time step $t$ using the forward algorithm and all observations leading up to that position.
>
> But what if being in that candidate state makes the remaining observations in the sequence next to impossible to occur?
>
> This situation is why we need to combine both the forward and backward probabilities.
>
> ---
> **Why did we get away with only using one before?**
>
> Well, when we used the two algorithms independently (for the evaluation problem), we iterated all the way through the sequence, so every observation was already accounted for by the time we terminated.
Both types of probabilities only take into account the observations either before (forward algorithm) or after (backward algorithm) the current observation/time step.
We use two new functions to help us combine the forward and backward algorithms:
- The Gamma Function: $\gamma_t(i)$
- The Xi Function: $\xi_t(i,j)$
#### Computing The Marginal Probabilities (Gamma & Xi)
The **Gamma Function**, $\gamma_t(i)$, is the probability of being in state $i$ at time $t$ given the observation sequence $O$, and the model $\lambda$.
$\gamma_t(i)=P(q_t=i \mid O, \lambda)$
$\gamma_t(i)=\frac{\alpha_t(i)\beta_t(i)}{P(O|\lambda)}$
where:
- $\alpha_t(i)$ is the forward probability of being in state $i$ at time $t$ (The Forward Algorithm)
- $\beta_t(i)$ is the backward probability of seeing the remaining observations from state $i$ at time $t$ (The Backward Algorithm)
- $P(O|\lambda)$ is the total likelihood of the observation sequence, which normalizes the product into a proper probability
---
We are also able to identify the **Xi Function**, $\xi_t(i,j)$, which is the *probability of a particular transition at a specific point in time*.
$\xi_t(i,j) = \frac{\alpha_t(i) A_{i\rightarrow j} B_j(o_{t+1})\beta_{t+1}(j)}{P(O|\lambda)}$
Together, $\gamma$ and $\xi$ complete the E-Step; the M-Step then re-estimates the parameters by calculating ratios of expected state transitions and emissions.
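A sketch of how $\gamma$ and $\xi$ might be computed from the forward and backward matrices (the function name and array layout are assumptions carried over from the earlier sketches):

```python
import numpy as np

def e_step(A, B, obs, alpha, beta):
    """Compute gamma (N x T) and xi (N x N x T-1) from alpha and beta."""
    N, T = alpha.shape
    likelihood = alpha[:, -1].sum()  # P(O | lambda), from the forward pass
    # gamma[i, t] = P(q_t = i | O, lambda)
    gamma = alpha * beta / likelihood
    # xi[i, j, t] = P(q_t = i, q_{t+1} = j | O, lambda)
    xi = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        xi[:, :, t] = (alpha[:, t, None] * A
                       * B[None, :, obs[t + 1]] * beta[None, :, t + 1]) / likelihood
    return gamma, xi
```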
#### Maximizing The Parameters
The overall goal of the **maximization step (M-Step)** is to re-estimate new versions of the transition ($A$) and emission ($B$) matrices that best fit the probabilities calculated during the E-Step earlier.
Both of these new estimations are calculated by *counting how frequently transitions and emissions occur*, using the results from the forward-backward algorithm.
These frequencies are often called “*soft counts*”, as we are still working with the likelihood of each one occurring rather than discrete counts.
**Update Transitions:**
$\hat{A}_{i \rightarrow j} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$
$\hat{A}_{i \rightarrow j}$ is a ratio of how many times our transition of interest $(i \rightarrow j)$ occurs versus all other possible transitions.
$\hat{A}_{i \rightarrow j} = \frac{\sum^{T-1}_{t=1} \xi_t(i,j)}{\sum^{T-1}_{t=1} \sum^{N}_{k=1} \xi_t(i,k)}$
$\hat{A}_{i \rightarrow j}$ can be considered the expected fraction of times the model moves from state $i$ to state $j$, based on the current model parameters.
---
**Update Emissions:**
$\hat{B}_j(k) = \frac{\sum^{T}_{t=1,\; o_t = k} \gamma_t(j)}{\sum^{T}_{t=1} \gamma_t(j)}$
$\hat{B}_j(k)$ is the expected fraction of the time spent in state $j$ during which symbol $k$ was emitted.
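A sketch of the M-Step under the same assumed conventions, turning the soft counts into new parameter estimates:

```python
import numpy as np

def m_step(gamma, xi, obs, n_symbols):
    """Re-estimate A and B from the soft counts produced by the E-Step."""
    obs = np.asarray(obs)
    # A_hat[i, j] = expected i->j transitions / expected transitions out of i
    A_hat = xi.sum(axis=2) / xi.sum(axis=(1, 2))[:, None]
    # B_hat[j, k] = expected time in state j while emitting k / expected time in j
    B_hat = np.zeros((gamma.shape[0], n_symbols))
    for k in range(n_symbols):
        B_hat[:, k] = gamma[:, obs == k].sum(axis=1)
    B_hat /= gamma.sum(axis=1)[:, None]
    return A_hat, B_hat
```

Baum-Welch then simply alternates: run the forward and backward passes, compute $\gamma$ and $\xi$, re-estimate $A$ and $B$, and repeat until the likelihood stops improving.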
### The Viterbi Algorithm
The **Viterbi Algorithm** is extremely similar to the forward algorithm, except that *it prioritizes remembering the most probable path* up to the given state and time step.
It is a [[dynamic programming]] algorithm which computes the optimal (aka most probable) sequence of underlying states given the observations.
> [!important] The Forward Algorithm vs The Viterbi Algorithm
>
> The Viterbi algorithm is identical to the forward algorithm except that it takes the *max* over the previous path probabilities.
>
> On the other hand, the forward algorithm cares about whether we're in state $X$ OR state $Y$ and performs a *sum* to account for both possibilities (as per the rules of [[Probability Theory|probability]]).
>
> ---
> Another key difference is the addition of a *backpointer* matrix, which records the best predecessor of each state so the optimal path can be traced backwards at the end.
$\delta_t(i) = \max^{N}_{j=1}\left[\delta_{t-1}(j) \cdot A_{j \rightarrow i}\right] \cdot B_i(o_t)$
One of the biggest drawbacks of the Viterbi algorithm is its inability to distinguish between high- and low-confidence guesses. This is because it tosses out any non-maximal probabilities along the way, preventing us from seeing by how much the optimal path actually beats the alternatives.
> [!summary] **Definition:** The Viterbi Algorithm
>
> The function $v_t(j)$ represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1, \ldots, q_{t-1}$.
>
> **1. Initialization:**
> $\begin{align}
v_1(j)=\pi_j B_j(o_1) && 1 \le j \le N\\
bt_1(j)=0 && 1 \le j \le N
\end{align}$
>
>
> **2. Iteration:**
> $\begin{align}
v_{t}(i) = \max^{N}_{j=1} v_{t-1}(j) \, A_{j \rightarrow i} \, B_i(o_{t}) && 1 \le i \le N, 1 \lt t \le T \\
bt_{t}(i) = \operatorname*{argmax}^{N}_{j=1} v_{t-1}(j) \, A_{j \rightarrow i} \, B_i(o_{t}) && 1 \le i \le N, 1 \lt t \le T
\end{align}$
>
> **3. Termination:**
> $\begin{align}
\text{The best score: } && P^* = \max^{N}_{i=1}v_T(i) \\
\text{The start of backtrace: } && q^*_T = \operatorname*{argmax}^{N}_{i=1} v_T(i)
\end{align}$
>
> where:
> - $v_t(i)$ is the probability that the HMM is in state $i$ after seeing the first $t$ observations and passing through the most probable state sequence $q_1, \ldots, q_{t-1}$.
> - $\pi_i$ is the probability that the initial state is $i$.
> - $B_i(o_t)$ is the probability of seeing symbol $o_t$ in state $i$ (*Emission Matrix*).
> - $A_{j \rightarrow i}$ is the probability of being in state $i$ given the previous state was $j$ (*Transition Matrix*).
> ---
>
> **Output:** The most probable hidden state path $Q^* = \{q^*_1, \ldots, q^*_T\}$, recovered by following the backpointers $bt$ backwards from $q^*_T$, along with its probability $P^*$.
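A minimal sketch of Viterbi with backpointers, again under the array conventions assumed in the earlier sketches:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path and its probability."""
    N, T = len(pi), len(obs)
    v = np.zeros((N, T))               # v[i, t]: best path probability ending in state i at t
    bt = np.zeros((N, T), dtype=int)   # bt[i, t]: best predecessor of state i at t
    # Initialization: v_1(i) = pi_i * B_i(o_1)
    v[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[j, i] = v_{t-1}(j) * A[j, i]: a max where the forward algorithm sums
        scores = v[:, t - 1, None] * A
        bt[:, t] = scores.argmax(axis=0)
        v[:, t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination: start from the best final state and follow the backpointers.
    path = [int(v[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(bt[path[-1], t]))
    return path[::-1], v[:, -1].max()
```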