By Jae Duk Seo
I became interested in multi-dimensional recurrent neural networks as soon as I heard the name. So today I will attempt to tackle the network structure of the spatial LSTM introduced in the paper “Generative Image Modeling Using Spatial LSTMs” by Lucas Theis and Matthias Bethge. In today’s post we will also work through the forward feed pass of a 2D LSTM.
Transform from 1D LSTM to 2D LSTM
The image above shows how we can take the idea of a 1D LSTM and extend it to a 2D LSTM so that it can be applied to images. The most important things to note in the image are the cell states and hidden states.
Yellow Box → 1D LSTM
Green Box → Transposed 1D LSTM
(Think about it as being one column in a matrix)
Pink Box → 2D LSTM
As seen above, for a 1D LSTM we initialize C(0) and h(0) before we start to train the network. There are multiple methods for initializing these values. For example, in the paper “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention” the authors initialize the first values via something called an MLP, which I can only assume stands for multilayer perceptron.
But in a 2D LSTM, we have to initialize a whole lot more cell and hidden state values.
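To make the count concrete, here is a minimal sketch of which boundary states have to exist before the first step of a 2D LSTM. The helper names `boundary_indices` and `init_states`, the padded-grid layout, and the zero initialization are all my own assumptions for illustration, not something taken from the paper.

```python
import numpy as np

def boundary_indices(rows, cols):
    """Grid positions whose states must exist before the first real step."""
    top = [(0, j) for j in range(1, cols + 1)]    # C(0,1) ... C(0,cols)
    left = [(i, 0) for i in range(1, rows + 1)]   # C(1,0) ... C(rows,0)
    return top + left

def init_states(rows, cols, hidden_size):
    """Allocate padded grids; row 0 and column 0 hold the boundary states.

    Zero initialization is an assumption made here for simplicity.
    """
    c = np.zeros((rows + 1, cols + 1, hidden_size))
    h = np.zeros((rows + 1, cols + 1, hidden_size))
    return c, h

c, h = init_states(rows=2, cols=2, hidden_size=3)
print(boundary_indices(2, 2))  # four boundary positions for a 2*2 image
print(c.shape, h.shape)
```

A 1D LSTM needs a single initial cell state and hidden state, while a grid with `rows` rows and `cols` columns needs `rows + cols` of each.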
As seen above, we need to initialize not only C(0,1) to C(0,j) but also C(1,0) to C(i,0). The same goes for all of the hidden states. Now we can do something interesting: since we know the structure of the 1D LSTM and the 2D LSTM, let’s imagine a 3D LSTM.
Quite a beauty isn’t she? 😀
Again, the orange boxes mark the locations of the first cell and hidden states. The applications of this network are not limited to video data; there are many more. Now that we know the general structure, let’s go back to the paper “Generative Image Modeling Using Spatial LSTMs”.
Spatial long short-term memory
As the authors note, the original SLSTM was proposed by Graves & Schmidhuber in the paper “Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks”. That paper has a very good visual of what a 2D LSTM is, and it is shown below. However, the paper I am working with has a clearer and cleaner mathematical equation describing the SLSTM (shown above).
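Written out, the SLSTM update looks roughly like the block below. This is my own transcription of the formulation in Theis & Bethge (2015), so please check it against the paper before relying on it. Here $x_{<ij}$ is the causal neighborhood of pixels that feeds the unit at position $(i, j)$, $T_{A,b}$ is an affine transformation with parameters $A$ and $b$, $\sigma$ is the logistic sigmoid, and $\odot$ is the element-wise product.

```latex
\begin{align}
\begin{pmatrix} g_{ij} \\ o_{ij} \\ i_{ij} \\ f^{r}_{ij} \\ f^{c}_{ij} \end{pmatrix}
&= \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{pmatrix}
T_{A,b} \begin{pmatrix} x_{<ij} \\ h_{i,j-1} \\ h_{i-1,j} \end{pmatrix} \\
c_{ij} &= g_{ij} \odot i_{ij} + c_{i,j-1} \odot f^{c}_{ij} + c_{i-1,j} \odot f^{r}_{ij} \\
h_{ij} &= \tanh\left(c_{ij} \odot o_{ij}\right)
\end{align}
```

The main difference from a 1D LSTM is that there are two forget gates, $f^{r}_{ij}$ and $f^{c}_{ij}$, one for each neighboring cell state that flows into the current unit.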
Sample Training Data
So we will do a forward feed pass on VERY simple training data: an image with dimensions 2*2 (4 pixels in total), shown in the black box above.
Now I know that it looks bad, but I had to use the whole whiteboard to make that diagram LOL, so work with me here. Let’s start from the beginning.
First, each box represents one LSTM unit; the architecture is derived from the famous colah blog post.
Second, here is the time stamp information.
Red Box → Forward Feed when time stamp is (1,1)
Green Box → Forward Feed when time stamp is (2,1)
Orange Box → Forward Feed when time stamp is (1,2)
Purple Box → Forward Feed when time stamp is (2,2)
Third, each blue star represents the cost function we can calculate at each time stamp.
Again, I know that it looks bad, but with LSTMs the equations get messy all of the time.
One thing to note is that all of the variables written with BLUE marker are already initialized values. So don’t worry about them popping up out of nowhere; they were initialized beforehand.
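The forward feed over the 2*2 image can be sketched in NumPy as below. This is only a sketch under several assumptions of mine: I read a time stamp as (i, j) with i as the row and j as the column, each unit sees only its own pixel as input (the paper uses a neighborhood of pixels), the boundary states are zeros, the weights are random, and the gate layout follows my reading of the SLSTM equations in Theis & Bethge (2015). The names `slstm_step` and `forward` are hypothetical, and the cost at each time stamp is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x, h_left, h_up, c_left, c_up, W, b):
    """One 2D LSTM step; W maps [x, h_left, h_up] to five gate pre-activations."""
    n = c_left.shape[0]
    z = W @ np.concatenate([x, h_left, h_up]) + b
    g = np.tanh(z[0:n])               # candidate values
    o = sigmoid(z[n:2 * n])           # output gate
    i = sigmoid(z[2 * n:3 * n])       # input gate
    f_r = sigmoid(z[3 * n:4 * n])     # forget gate for the cell state at (i-1, j)
    f_c = sigmoid(z[4 * n:5 * n])     # forget gate for the cell state at (i, j-1)
    c = g * i + c_left * f_c + c_up * f_r
    h = np.tanh(c * o)
    return c, h

def forward(image, hidden_size=3, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = image.shape
    W = rng.normal(scale=0.1, size=(5 * hidden_size, 1 + 2 * hidden_size))
    b = np.zeros(5 * hidden_size)
    # Padded grids: row 0 and column 0 are the pre-initialized boundary states.
    c = np.zeros((rows + 1, cols + 1, hidden_size))
    h = np.zeros((rows + 1, cols + 1, hidden_size))
    order = []
    for j in range(1, cols + 1):
        for i in range(1, rows + 1):
            x = np.array([image[i - 1, j - 1]])
            c[i, j], h[i, j] = slstm_step(
                x, h[i, j - 1], h[i - 1, j], c[i, j - 1], c[i - 1, j], W, b
            )
            order.append((i, j))
    return h, order

image = np.array([[0.1, 0.2], [0.3, 0.4]])
h, order = forward(image)
print(order)          # time stamps in the order they were processed
print(h[2, 2])        # hidden state at time stamp (2,2)
```

The loop visits the time stamps in the same order as the colored boxes: (1,1), (2,1), (1,2), (2,2). Any order works as long as the left and upper neighbors of a unit are computed before the unit itself.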
I can’t imagine the back propagation process for this network; it will be SO fun to derive by hand. I’ll hopefully do that one day.
If any errors are found, please email me at firstname.lastname@example.org.
- Theis, L., & Bethge, M. (2015). Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems (pp. 1927–1935).
- Xu, K., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. CoRR, abs/1502.03044.
- Graves, A., et al. (2007). Multi-Dimensional Recurrent Neural Networks. CoRR, abs/0705.2011.
- Understanding LSTM Networks. (n.d.). Retrieved January 19, 2018, from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Graves, A., & Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in neural information processing systems (pp. 545–552).