Only Numpy: Deriving Forward Feed on Multi-Dimensional Recurrent Neural Networks (Spatial LSTM) by “Generative Image Modeling Using Spatial LSTMs”

By Jae Duk Seo
I became interested in Multi-Dimensional Recurrent Neural Networks as soon as I heard the name. So today, I will attempt to tackle the network structure of the Spatial LSTM introduced in the paper "Generative Image Modeling Using Spatial LSTMs" by Lucas Theis. Also, for today's blog we will perform a forward feed on a 2D LSTM.
Transform from 1D LSTM to 2D LSTM
The image above shows how we can extend the idea of a 1D LSTM to a 2D LSTM so that we can apply it to images. One very important thing to note in the photo above is the placement of the cell states and hidden states.
Yellow Box → 1D LSTM
Green Box → Transposed 1D LSTM (think of it as one column in a matrix)
Pink Box → 2D LSTM
1D LSTM that depends on Time
As seen above, for a 1D LSTM we initialize C(0) and h(0) before we start to train the network. There are multiple methods to initialize these values; for example, in the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" the authors initialize the first values via something called an MLP, which I can only assume stands for Multi-Layer Perceptron.
Image from original Paper Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
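As a minimal sketch, here is how those two initialization options might look in NumPy. The sizes, the weight names, and the mean feature vector are all my own assumptions, not the paper's exact setup:

```python
import numpy as np

hidden_size = 4

# Simplest option: initialize the first cell and hidden states to zeros.
c0 = np.zeros(hidden_size)
h0 = np.zeros(hidden_size)

# Show, Attend and Tell instead predicts them with small MLPs applied to
# the mean of the image annotation vectors (names here are my own guesses).
a_mean = np.random.randn(8)                   # mean image feature vector
W_init_c = np.random.randn(hidden_size, 8) * 0.1
W_init_h = np.random.randn(hidden_size, 8) * 0.1
c0 = np.tanh(W_init_c @ a_mean)               # MLP-predicted first cell state
h0 = np.tanh(W_init_h @ a_mean)               # MLP-predicted first hidden state
```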
But in a 2D LSTM, we have to initialize a whole lot more cell and hidden state values.
2D LSTM with respect to time
As seen above, not only do we need to initialize C(0,1) to C(0,j) but also C(1,0) to C(i,0). The same goes for all of the hidden states. Now we can do something interesting: since we know the structure of a 1D LSTM and a 2D LSTM, let's imagine a 3D LSTM.
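In code, one simple way to handle all of these boundary states (assuming zero initialization, which is my own choice here) is to allocate state grids one row and one column larger than the image:

```python
import numpy as np

rows, cols, hidden = 2, 2, 3   # toy sizes, my own assumption

# Row 0 and column 0 hold the initialized boundary values
# C(0,j), C(i,0), h(0,j), h(i,0); the sweep fills in the rest.
C = np.zeros((rows + 1, cols + 1, hidden))
h = np.zeros((rows + 1, cols + 1, hidden))
```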
Quite a beauty, isn't it? 😀 Again, the orange boxes are the locations of the first cell and hidden states. The applications of this network are not bound to video data alone but reach much further. Now that we know the general structure, let's go back to the paper "Generative Image Modeling Using Spatial LSTMs".
Spatial long short-term memory
Image from original paper
So, as the authors said, the original SLSTM was proposed by Graves & Schmidhuber. To see the paper by those two authors, please click "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". That paper has a very good visual of what a 2D LSTM is, shown below. However, the paper I am working with has a clearer and cleaner set of mathematical equations describing the SLSTM (shown above).
Image from paper Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
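Here is a minimal NumPy sketch of one SLSTM update following the equations shown above; all variable names, sizes, and the function signature are my own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x, h_left, h_up, c_left, c_up, A, b):
    """One SLSTM update in the style of Theis & Bethge's equations.

    A single affine map of the input and the two neighboring hidden
    states produces all five gate pre-activations at once.
    """
    z = A @ np.concatenate([x, h_left, h_up]) + b
    g_t, o_t, i_t, fr_t, fc_t = np.split(z, 5)
    g   = np.tanh(g_t)      # candidate values
    o   = sigmoid(o_t)      # output gate
    i_g = sigmoid(i_t)      # input gate
    f_r = sigmoid(fr_t)     # forget gate for the row neighbor (above)
    f_c = sigmoid(fc_t)     # forget gate for the column neighbor (left)
    c = g * i_g + c_left * f_c + c_up * f_r
    h = np.tanh(c * o)      # output gate applied inside the tanh
    return c, h
```

Note that the hidden state here is tanh(c ⊙ o), with the output gate applied inside the nonlinearity, which is how the Theis & Bethge paper writes it; this differs slightly from the standard LSTM's o ⊙ tanh(c).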

Sample Training Data
So we will do a forward feed pass on a VERY simple piece of training data: an image with dimensions of 2×2 (a total of 4 pixels), shown in the black box above.
Network Architecture
Now I know that it looks bad, but I had to use the whole white board to make that diagram LOL, so work with me here. Let's start from the beginning. First, each box represents one LSTM cell; the architecture is a derivation of the one in the famous Colah blog.
Image from Colah Blog
Second, here is the time stamp information:
Red Box → Forward feed at time stamp (1,1)
Green Box → Forward feed at time stamp (2,1)
Orange Box → Forward feed at time stamp (1,2)
Purple Box → Forward feed at time stamp (2,2)
Third, each blue star represents the cost function we can calculate at each time stamp.
Forward Feed
Again, I know that it looks bad, but with LSTMs the equations get messy all of the time. One thing to note: all of the variables written with BLUE markers are already-initialized values, so don't worry about where they popped up from; they were initialized beforehand.
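To make the sweep concrete, below is a minimal NumPy sketch of the full forward feed over the 2×2 example, visiting the time stamps in the order (1,1), (2,1), (1,2), (2,2). The pixel values, the weight initialization, the hidden size, and all names are my own choices, not values from the whiteboard:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n = 3                                    # hidden/cell size (my choice)
image = np.array([[0.1, 0.2],            # toy 2x2 "image", 4 pixels
                  [0.3, 0.4]])
rows, cols = image.shape

# One shared affine map produces all five gate pre-activations;
# its input is [pixel, h_left, h_up], so its width is 1 + 2n.
A = rng.standard_normal((5 * n, 1 + 2 * n)) * 0.1
b = np.zeros(5 * n)

# Padded grids so that row 0 and column 0 hold the zero-initialized
# boundary states C(0,j), C(i,0), h(0,j), h(i,0).
C = np.zeros((rows + 1, cols + 1, n))
H = np.zeros((rows + 1, cols + 1, n))

for j in range(1, cols + 1):             # sweep columns ...
    for i in range(1, rows + 1):         # ... then rows: (1,1),(2,1),(1,2),(2,2)
        x = image[i - 1, j - 1:j]        # current pixel as a length-1 vector
        z = A @ np.concatenate([x, H[i, j - 1], H[i - 1, j]]) + b
        g_t, o_t, i_t, fr_t, fc_t = np.split(z, 5)
        g, o = np.tanh(g_t), sigmoid(o_t)
        i_g, f_r, f_c = sigmoid(i_t), sigmoid(fr_t), sigmoid(fc_t)
        C[i, j] = g * i_g + C[i, j - 1] * f_c + C[i - 1, j] * f_r
        H[i, j] = np.tanh(C[i, j] * o)

print(H[rows, cols])                     # hidden state at time stamp (2,2)
```

Each cell only ever looks at its left and upper neighbors, which is exactly why initializing the whole first row and first column of states was necessary.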
Detailed Look at Forward Feed at Time Stamp (1,1) and (1,2)
Detailed Look at Forward Feed at Time Stamp (2,1) and (2,2)

Final Words

I can't imagine the back propagation process for this network; it will be SO fun to derive it by hand. I'll hopefully do that one day. If any errors are found, please email me. Meanwhile, follow me on my twitter here, and visit my website or my Youtube channel for more content. I also derived back propagation on a simple RNN here, if you are interested.
  1. Theis, L., & Bethge, M. (2015). Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems (pp. 1927–1935).
  2. Xu, K., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.
  3. Graves, A., et al. (2007). Multi-dimensional recurrent neural networks. CoRR, abs/0705.2011.
  4. Understanding LSTM Networks. (n.d.). Retrieved January 19, 2018.
  5. Graves, A., & Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 545–552).
