Notes on Natural Language Processing

N-gram

An n-gram is a contiguous sequence of n tokens (words) in a sentence: a unigram is a single word, a bigram a pair of adjacent words, and so on.
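A minimal sketch of extracting n-grams from a tokenized sentence (the helper name `ngrams` and the example sentence are just for illustration):

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```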

Bleu

Bilingual evaluation understudy

\[ clipped\ precision\ p_n = \frac {matched\ ngrams\ (clipped\ by\ max\ occurrences\ in\ reference)} {total\ ngrams\ in\ candidate} \]

Computed on unigrams up to n-grams (typically n = 1 to 4); the per-n precisions are combined with a brevity penalty.
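A rough sketch of the clipped (modified) n-gram precision that BLEU builds on; the helper name and example below are illustrative, and the full metric also applies the brevity penalty mentioned above:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-gram counts are clipped by
    the maximum number of times each n-gram occurs in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = {ng: min(count, ref[ng]) for ng, count in cand.items()}
    return sum(clipped.values()) / max(sum(cand.values()), 1)

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(candidate, reference, 1))  # 2/7, not 7/7
```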

RNN

\[ h_t = \tanh(W_{h} h_{t-1} + W_{x} x_t) \]
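A minimal NumPy sketch of this recurrence (the hidden/input sizes and weight names are arbitrary choices for illustration):

```python
import numpy as np

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(size=(hidden_size, input_size))   # input weights

def rnn_step(h_prev, x_t):
    """One step of the vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    h = rnn_step(h, x_t)
```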

The hidden state \(h_t\) is rewritten at every time step, which leads to vanishing gradients over time. This prevents the network from learning dependencies that span long time ranges.

To overcome this issue, two gated variants of the RNN were developed.

LSTM (Long Short Term Memory)

Gate

Gates remove information from, or add information to, the cell state.

  1. Cell state \(C_t\)
  2. Forget gate layer (what information to remove from the cell state)

\(f_t = \sigma(W_f . [h_{t-1}, x_t] + b_f)\)

  • Outputs a number between 0 and 1 for each entry in the cell state \(C_{t-1}\)

  3. Input gate (what information to store in the cell state)

3.1) \(i_t = \sigma(W_i . [h_{t-1}, x_t]+ b_i)\) decides which inputs to let through.

3.2) \(\tilde{C}_t = \tanh(W_c . [h_{t-1}, x_t]+ b_c)\) creates a vector of candidate values to be added to the cell state.

  4. Update the cell state

\[ C_{t} = f_t * C_{t-1} + i_t * \tilde{C}_t \]

  5. Output

\[ o_t = \sigma(W_o . [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \tanh(C_t) \]
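Putting the gate equations together, a sketch of a single LSTM step in NumPy (the parameter layout, shapes, and toy sequence are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step following the gate equations above.
    W and b hold the parameters of the four gates, keyed 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate values \tilde{C}_t
    c_t = f_t * c_prev + i_t * c_tilde        # C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate
    h_t = o_t * np.tanh(c_t)                  # h_t = o_t * tanh(C_t)
    return h_t, c_t

# Illustrative shapes: hidden size 4, input size 3.
rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = {k: rng.normal(size=(hidden, hidden + inp)) for k in 'fico'}
b = {k: np.zeros(hidden) for k in 'fico'}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):         # a toy sequence of 5 inputs
    h, c = lstm_step(h, c, x_t, W, b)
```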

Variants of LSTM

  1. Adds peepholes (the gates also see the cell state: \(C_{t-1}\) for the forget and input gates, \(C_t\) for the output gate).

\[ f_t = \sigma(W_f . [\boxed{C_{t-1}}, h_{t-1}, x_t] + b_f)\\ i_t = \sigma(W_i . [\boxed{C_{t-1}}, h_{t-1}, x_t]+ b_i) \\ o_t = \sigma(W_o . [\boxed{C_t}, h_{t-1}, x_t] + b_o) \]

  2. Coupled input and forget gates

\[ C_t = f_t * C_{t-1} + \boxed{(1-f_t)} * \tilde{C}_t \]

GRU (Gated Recurrent Unit)

  1. Merges the forget and input gates into a single update gate.

  2. Merges the cell state and the hidden state into one state \(h_t\).

\[ z_t = \sigma(W_z . [h_{t-1}, x_t]) \\ r_t = \sigma(W_r . [h_{t-1}, x_t])\\ \tilde{h}_t = \tanh(W . [r_t * h_{t-1}, x_t])\\ h_t = (1-z_t) * h_{t-1} + z_t * \tilde{h}_t \]
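A matching sketch of one GRU step (illustrative shapes, and no bias terms, following the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU step following the equations above."""
    zx = np.concatenate([h_prev, x_t])                           # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ zx)                                      # update gate (merged forget/input)
    r_t = sigmoid(W_r @ zx)                                      # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # new hidden state

# Illustrative shapes: hidden size 4, input size 3.
rng = np.random.default_rng(0)
hidden, inp = 4, 3
W_z, W_r, W = (rng.normal(size=(hidden, hidden + inp)) for _ in range(3))
h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):
    h = gru_step(h, x_t, W_z, W_r, W)
```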

  • BiLSTM: two LSTMs run over the input sequence, one forward and one backward, and their hidden states are concatenated.

TODO Embeddings from Language Models (ELMo)


  • Word representations are functions of the entire input sentence.
  • The biLM predicts the next token from the inputs seen so far (the inputs themselves are built from character-level convolutions).
  • Cost function: maximizes the joint log likelihood of the forward and backward predictions (see below).
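For reference, the biLM objective from the ELMo paper (linked below) jointly maximizes the forward and backward log likelihoods over the token sequence \(t_1, \dots, t_N\):

\[ \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big) \]

where the token-representation parameters \(\Theta_x\) and the softmax parameters \(\Theta_s\) are shared between the two directions.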


  • Reference

https://paperswithcode.com/method/elmo
https://arxiv.org/pdf/1802.05365v2.pdf
https://iq.opengenus.org/elmo/
https://indicodata.ai/blog/how-does-the-elmo-machine-learning-model-work/

Attention

The attention weights \(\alpha\) determine how much attention each activation output gets;

where \(\sum_{\grave{t}} \alpha^{<1,\grave{t}>} = 1\) (of course with softmax :P) and the context \(C = \sum_{\grave{t}} \alpha^{<1,\grave{t}>} a^{<\grave{t}>}\)

Here \(a^{<\grave{t}>}\) is the activation from the bidirectional RNN (the forward and backward activations concatenated).

the attention is computed with a softmax over scores \(e^{<t,\grave{t}>}\):

\[ \alpha^{<t,\grave{t}>} = \frac{\exp(e^{<t,\grave{t}>})}{\sum_{\grave{t}'=1}^{t_x} \exp(e^{<t,\grave{t}'>})} \]

where each score \(e^{<t,\grave{t}>}\) comes from a small feed-forward network whose inputs are the previous decoder state \(s^{<t-1>}\) and the activation \(a^{<\grave{t}>}\).

The time complexity is \(t_x \cdot t_y\), where \(t_x\) is the input length and \(t_y\) the output length, since for every output prediction we compute the attention weights \(\alpha\) over every input to generate the context \(C\).

https://arxiv.org/pdf/1502.03044v2.pdf
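A NumPy sketch of turning scores into attention weights and a context vector for one output step, assuming the scores \(e^{<t,\grave{t}>}\) have already been produced by the small scoring network (shapes are illustrative):

```python
import numpy as np

def attention_context(scores, activations):
    """scores: (t_x,) raw e^{<t, t'>} for one output step t.
    activations: (t_x, d) bidirectional encoder activations a^{<t'>}.
    Returns (alpha, context), where alpha sums to 1 (softmax)."""
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # softmax over input positions
    context = alpha @ activations     # weighted sum of activations
    return alpha, context

rng = np.random.default_rng(0)
e = rng.normal(size=7)                # scores for t_x = 7 input positions
a = rng.normal(size=(7, 16))          # activations of size 16
alpha, c = attention_context(e, a)
```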

Transformers

https://www.tensorflow.org/text/tutorials/transformer
http://jalammar.github.io/illustrated-transformer/

BERT model

Similar to ELMo, BERT also provides a contextual representation for each word, but it looks at the whole sentence at once (without having to concatenate forward and backward contextual hidden state vectors).
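A minimal sketch of pulling contextual embeddings out of BERT with the Hugging Face transformers library (assuming it, PyTorch, and the `bert-base-uncased` checkpoint are available):

```python
# pip install transformers torch   (assumed environment)
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
outputs = model(**inputs)
# One contextual vector per (sub)word token, computed from the whole sentence at once.
embeddings = outputs.last_hidden_state   # shape: (1, num_tokens, 768)
```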

https://medium.com/analytics-vidhya/understanding-the-bert-model-a04e1c7933a9
https://www.youtube.com/watch?v=xI0HHN5XKDo

Audio data

The audio is preprocessed to get the spectrogram (a visual representation of the audio's frequency content over time).

In the raw waveform, the x axis is time and the y axis is the small change in air pressure; in the spectrogram, the x axis is time and the y axis is frequency, with intensity showing the energy at each frequency.

  • Used when the input length (audio time steps) is much greater than the output length.
  • A simple trigger word detection model outputs 1 for the time steps right after the trigger word is heard and 0 otherwise; the spectrogram features it consumes are sketched below.
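A sketch of the spectrogram preprocessing step using `scipy.signal.spectrogram`; the sample rate and the synthetic sine wave are stand-ins for real recorded audio:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                  # sample rate in Hz (illustrative)
t = np.arange(0, 1.0, 1.0 / fs)
audio = np.sin(2 * np.pi * 440 * t)         # stand-in for a recorded waveform

# freqs: frequency bins (Hz), times: time bins (s), Sxx: energy per (freq, time)
freqs, times, Sxx = spectrogram(audio, fs=fs)

# The (often log-scaled) spectrogram is what the trigger word model consumes:
# many more input time steps than output labels.
features = np.log(Sxx + 1e-10).T            # shape: (time_steps, freq_bins)
```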