CTC Loss for un-segmented Data
Updated: Jan 3, 2021
Its importance & where it is used:
It is used for dealing with un-segmented sequence data. Such data is ubiquitous and it may not be practically feasible to create a segmented labelled dataset out of it.
Sequential Data: Data is like a stream of inputs. E.g. Handwritten text, audio data, video clip, time-series data etc.
Un-segmented Data: Where we do not have a corresponding label for each of the chunks of sequential input data i.e. if we divide an input stream into chunks, then we do not have a corresponding labels for each of those chunks.
RNNs have limited applicability for such type of data because:
RNNs require pre-segmented training data.
Network outputs must be post-processed to give the final label sequence.
CTC Loss to the rescue: Solves the above 2 problems i.e. removes the need to segment sequential training data and post-process network's output.
CTC can be divided into 3 parts:
Let's say we have an audio input for the word "hello" and we divide this audio into 10 mini chunks. Our objective is to have a character associated at every timestamp (mini chunk) of the input audio.
A straightforward way could be to allow character repetition
and then remove duplicate characters to get the desired string
Remove duplicates: H,E,L,O
As you can see, the above way will not work when there are double-letters in the desired string i.e. for words like - too, apple, hello, etc.
Introduce a blank character (say '_') to separate repeat characters.
So, we can have repeated characters and repeated blanks in our encoding with the only condition being that there should be atleast 1 blank between double letters of the desired string.
Steps for Decoding an encoded string: 1. Merge repeated characters
2. Remove blank characters
_,H,H,_E,L,L,_,L,O -> _,H,_,E,L,_,L,O -> H,E,L,L,O
H,_,_,E,E,L,_,L,O,O -> H,_,E,L,_,L,O -> H,E,L,L,O
Image on the right shows 1 of the several ways to calculate probability of "ab"
A*A*A*_*b ->(decodes to) "ab"
Other possibilities which decodes to "ab":
AAAAB, AAABB, AABBB, ABBBB, _AAAAB, _A_B__ , A_ _BB + …. etc
So, total probability of getting "ab" is simply the sum of all the possibilities:
(total probability theorem)
p("ab" | input): AAAAB + AAABB + AABBB + ABBBB + _AAAAB + _A_B__ + A_ _BB + ….
p("a" | input): AAAAA + _AAAA + A_ _ _ _ + _ _ A_A + ….
p("b" | input): BBBBB + _B_ _ _ + _BB_ _ + _BBB_
p("_" | input) : _ _ _ _ _
CTC-Loss is then basically the negative logarithm of the above conditional probability i.e.
CTC loss = -Log p(Y | X)
But how do we actually calculate this?
If we notice above, there are a lot of ways in which we can obtain "ab" string. And to calculate the total probability, we need to calculate the probabilities of all those individual paths and sum them up.
A simple brute force approach is to calculate probabilities of all paths and then sum up the ones which decode to the desired string. However, this approach is computationally expensive and may not work for bigger test cases.
The solution is to use the Forward-Backward algorithm which is nothing but a dynamic programming based solution.
The next post will talk about how to compute the CTC Loss using the Forward-Backward algorithm (which creates a 2D DP matrix) and illustrate how the alpha matrix is created.