Notes on Natural Language Processing


An n-gram is a sequence of n consecutive words in a sentence.


Bilingual evaluation understudy

\[ \text{BLEU score (n-gram precision)} = \frac{\sum_{\text{n-gram}} \min(\text{count in candidate},\ \text{max number of occurrences in reference})}{\text{total n-grams in candidate}} \]

Computed on unigrams and on higher-order n-grams.


\[ h_t = tanh(W_{hs} h_{t-1} + W_{x} x_t) \]

The hidden state \(h_t\) is rewritten at every time step, which leads to vanishing gradients over long sequences. As a result, the network fails to learn dependencies that span many time steps.

To overcome this issue, two types of RNN were developed: LSTM and GRU.
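
A small NumPy sketch of the recurrence, tracking how the gradient of \(h_t\) with respect to the initial hidden state shrinks over time. The sizes, the random inputs, and the small weight scale are assumptions chosen only to make the effect visible.

```python
import numpy as np


def rnn_step(h_prev, x_t, W_hs, W_x):
    """One step of h_t = tanh(W_hs h_{t-1} + W_x x_t)."""
    return np.tanh(W_hs @ h_prev + W_x @ x_t)


rng = np.random.default_rng(0)
hidden, features, steps = 4, 3, 50  # illustrative sizes
W_hs = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, features))

h = np.zeros(hidden)
grad = np.eye(hidden)  # d h_t / d h_0, accumulated with the chain rule
norms = []
for _ in range(steps):
    h = rnn_step(h, rng.normal(size=features), W_hs, W_x)
    # Jacobian of one step with respect to the previous state: diag(1 - h_t^2) W_hs
    grad = (np.diag(1.0 - h ** 2) @ W_hs) @ grad
    norms.append(np.linalg.norm(grad))
```

With these small weights the entries of `norms` decay quickly, so early inputs have almost no influence on the gradient at later steps.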

LSTM (Long Short Term Memory)


Gates remove information from, or add information to, the cell state.

  1. Cell state \(C_t\)
  2. Forget gate layer (what information to remove from the cell state)

\(f_t = \sigma(W_f . [h_{t-1}, x_t] + b_f)\)

  • Outputs a value between 0 and 1 for each number in the cell state \(C_{t-1}\); 0 discards the value and 1 keeps it.

  3. Input gate layer (what information to store in the cell state)

3.1) \(i_t = \sigma(W_i . [h_{t-1}, x_t]+ b_i)\) decides which values to update.

3.2) \(\tilde{C} = tanh(W_c . [h_{t-1}, x_t]+ b_c)\) creates a vector of candidate values that could be added to the cell state.

  4. Update the cell state

\[ C_{t} = f_t * C_{t-1} + i_t * \tilde{C} \]

  5. Output

\[ o_t = \sigma(W_o . [h_{t-1}, x_t] + b_o)\\ h_t = o_t * tanh(C_t) \]
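
The steps above can be put together into a single LSTM step. This is a minimal NumPy sketch: the `params` dictionary layout, the sizes, and the random initialisation are assumptions for illustration, not part of any library API.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(h_prev, C_prev, x_t, params):
    """One LSTM step: forget gate, input gate, cell update, output."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t


rng = np.random.default_rng(0)
hidden, features = 4, 3  # illustrative sizes
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + features))
    params[f"b_{gate}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):
    h, C = lstm_step(h, C, x_t, params)
```

Note that the cell state is only scaled by \(f_t\) and added to, never passed through a weight matrix, which is what lets information survive across many steps.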

Variants of LSTM

  1. Adds peepholes (the gates also look at the cell state: \(C_{t-1}\) for the forget and input gates, \(C_t\) for the output gate).

\[ f_t = \sigma(W_f . [\boxed{C_{t-1}}, h_{t-1}, x_t] + b_f)\\ i_t = \sigma(W_i . [\boxed{C_{t-1}}, h_{t-1}, x_t]+ b_i) \\ o_t = \sigma(W_o . [\boxed{C_t}, h_{t-1}, x_t] + b_o) \]

  2. Coupled input and forget gates

\[ C_t = f_t * C_{t-1} + \boxed{(1-f_t)} * \tilde{C} \]

GRU (Gated Recurrent Unit)

  1. Merges the forget and input gates into a single update gate \(z_t\).

  2. Merges the cell state and the hidden state.

\[ z_t = \sigma(W_z . [h_{t-1}, x_t]) \\ r_t = \sigma(W_r . [h_{t-1}, x_t])\\ \tilde{h}_t = tanh(W . [r_t * h_{t-1}, x_t])\\ h_t = (1-z_t) * h_{t-1} + z_t * \tilde{h}_t \]
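
The GRU equations translate almost line for line into NumPy. This is a minimal sketch with no bias terms, matching the equations as written; the sizes and random weights are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU step: update gate, reset gate, candidate state, blend."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                                 # update gate
    r_t = sigmoid(W_r @ concat)                                 # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde


rng = np.random.default_rng(0)
hidden, features = 4, 3  # illustrative sizes
W_z, W_r, W = (rng.normal(scale=0.1, size=(hidden, hidden + features)) for _ in range(3))

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, features)):
    h = gru_step(h, x_t, W_z, W_r, W)
```

The single gate \(z_t\) plays both roles: \(1-z_t\) decides how much of the old state to keep and \(z_t\) how much of the candidate to write.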

  1. BiLSTM

TODO Embeddings from Language Models (ELMo)


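  • Word representations are a function of the entire input sentence.
  • The forward model predicts the next token from the inputs seen so far; the backward model does the same in the reverse direction.
  • Cost function: maximizes the log likelihood of the forward and backward predictions.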
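
Written out for a sequence \(t_1, \dots, t_N\) (symbols introduced here for illustration), the objective in the last bullet is the sum of the forward and backward log likelihoods:

\[ \sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_N) \right) \]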


use this


  • Reference


The attention weights \(\alpha\) define a weighted sum of the activation outputs,

where \(\sum_{\grave{t}} \alpha^{<1,\grave{t}>} = 1\) (of course, with a softmax :P) and the context is \(C = \sum_{\grave{t}} \alpha^{<1,\grave{t}>} a^{<\grave{t}>}\).

Here \(a^{<\grave{t}>}\) is the activation from the bidirectional RNN (from both the forward and backward directions).

The attention is computed with

\[ e^{<t,\grave{t}>} = W \]

Time complexity is \(O(t_x t_y)\), where \(t_x\) is the input length and \(t_y\) the output length, since for every output prediction we calculate the attention \(\alpha\) over every input to generate the context \(C\).
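
A sketch of one attention step in NumPy. The softmax and the weighted sum follow the description above; the scoring function is an assumption, since the notes do not spell out how \(e\) is computed. Here it is a small additive network with hypothetical parameters `W` and `v` that scores the previous decoder state `s_prev` against each activation.

```python
import numpy as np


def softmax(e):
    e = e - np.max(e)  # subtract the max for numerical stability
    w = np.exp(e)
    return w / w.sum()


def attention_context(s_prev, a, W, v):
    """Attention weights and context for one output step.

    s_prev: previous decoder state, shape (n_s,)
    a:      encoder activations, shape (t_x, n_a)
    W, v:   hypothetical scoring parameters, shapes (n_e, n_s + n_a) and (n_e,)
    """
    t_x = a.shape[0]
    # Pair the decoder state with every input position (assumed additive scoring).
    inputs = np.concatenate([np.tile(s_prev, (t_x, 1)), a], axis=1)
    e = np.tanh(inputs @ W.T) @ v   # one score per input position, shape (t_x,)
    alpha = softmax(e)              # weights sum to 1 over the input positions
    context = alpha @ a             # weighted sum of activations, shape (n_a,)
    return alpha, context


rng = np.random.default_rng(0)
t_x, t_y, n_a, n_s, n_e = 6, 3, 8, 5, 10  # illustrative sizes
a = rng.normal(size=(t_x, n_a))
W = rng.normal(size=(n_e, n_s + n_a))
v = rng.normal(size=n_e)
s = np.zeros(n_s)  # stand-in decoder state; a real decoder would update it each step

alphas = np.stack([attention_context(s, a, W, v)[0] for _ in range(t_y)])
```

`alphas` has shape `(t_y, t_x)`: one weight per (output step, input position) pair, which is where the \(t_x t_y\) cost comes from.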


BERT model

Similar to ELMo, BERT also provides a contextual representation of each word, but it looks at the whole sentence at once (without having to concatenate forward and backward contextual hidden state vectors).

Audio data

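Raw audio is a waveform: the x axis is time and the y axis is the small change in air pressure.

It is preprocessed to get the spectrogram (a visual representation of audio, with time on the x axis and frequency on the y axis).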
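
A spectrogram can be computed from the raw waveform with a short-time Fourier transform. The sketch below uses only NumPy; the frame length, hop size, Hann window, and the synthetic 1 kHz test tone are assumptions for illustration.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One FFT per frame, then transpose so that time runs along the x axis.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 8000                      # Hz, illustrative
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t)   # one second of a 1 kHz tone
spec = spectrogram(signal)
peak_hz = spec.mean(axis=1).argmax() * sample_rate / 256
```

For the pure tone, the energy concentrates in the frequency row closest to 1 kHz across all time frames.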

  • When the input length is greater than the output length.
  • A simple trigger word detection model.