Prashanth Babu

Ramblings on Hadoop Ecosystem, Java, etc.

Udacity course on Apache Storm

I am a huge fan of Apache Storm for its simplicity and ease of use, and even more for the uncomplicated way it solves Big Data problems. I gave a Storm session at Fifth Elephant 2013 in Bangalore.

For Big Data projects, I try to use Storm whenever we deal with real-time streaming use cases. Storm is a well-designed tool for solving real-time streaming problems, which is why it's dubbed the Hadoop of real-time. I have open sourced many projects on my GitHub which use Storm as the processing engine.

Udacity is one of the 3 wonderful MOOC providers we have right now, the other two being Coursera and edX. Udacity already has a course on Apache Hadoop titled “Intro to Hadoop and MapReduce”, which it created in collaboration with Cloudera. I took this course last year, though I did not opt for the [paid] verified certificate, since I am already a Cloudera Certified Developer for Apache Hadoop [CCDH]. You can find the solutions to all the assignments of this course on my GitHub account.

Udacity started a new course on Storm, as part of their Data Science catalog, in the first week of December 2013. This particular course is in partnership with Twitter. In case you are not aware, Storm was open sourced by Twitter, and they are one of the power users of Storm for their many use cases. The best-known example of Twitter's usage of Storm is the real-time hashtags you see on Twitter. Hence it makes great sense to have them teach and talk about Storm. The syllabus also looks really interesting.

Here is a brief rundown of the syllabus for the Storm course, from the Udacity website:

Lesson 1
Join instructor Karthik Ramasamy and the first Udacity-Twitter Storm Hackathon to cover the motivation and practice of real-time, distributed, fault-tolerant data processing. Dive into basic Storm Topologies by linking to a real-time d3 Word Cloud Visualization using Redis, Flask, and d3.
Lesson 2
Explore Storm basics by programming Bolts, linking Spouts, and finally connecting to the live Twitter API to process real-time tweets. Explore open source components by connecting a Rolling Count Bolt to your topology to visualize Rolling Top Tweeted Words.
Lesson 3
Go beyond Storm basics by exploring multi-language capabilities to download and parse real-time Tweeted URLs in Python using Beautiful Soup. Integrate complex open source bolts to calculate Top-N words to visualize real-time Top-N Hashtags. Finally, use stream grouping concepts to easily create streaming join to connect and dynamically process multiple streams.
Lesson 4
Work on your final project and we cover additional questions and topics brought up by Hackathon participants. Explore Vagrant, VirtualBox, Redis, Flask, and d3 further if you are interested!
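
To give a feel for the Rolling Count and Top-N ideas mentioned in Lessons 2 and 3, here is a minimal sketch in plain Python. It does not use Storm's API or the bolts from the course; the class name RollingWordCount and its methods are hypothetical, and I am assuming that words arrive in batches and that only the most recent few batches should count.

```python
from collections import Counter, deque


class RollingWordCount:
    """Hypothetical helper: counts words over the most recent batches only."""

    def __init__(self, window_size):
        self.window_size = window_size  # number of batches kept in the window
        self.window = deque()           # one Counter per batch, oldest first
        self.totals = Counter()         # running totals across the window

    def add_batch(self, words):
        """Add a new batch of words and expire the oldest batch if needed."""
        batch = Counter(words)
        self.window.append(batch)
        self.totals.update(batch)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            for word, count in expired.items():
                self.totals[word] -= count
                if self.totals[word] <= 0:
                    del self.totals[word]  # forget words that left the window

    def top_n(self, n):
        """Return the n most frequent words in the current window."""
        return self.totals.most_common(n)


if __name__ == "__main__":
    counts = RollingWordCount(window_size=2)
    counts.add_batch(["storm", "storm", "hadoop"])
    counts.add_batch(["storm", "spark"])
    counts.add_batch(["spark", "spark"])  # the first batch now falls out
    print(counts.top_n(2))
```

In a real topology this bookkeeping would presumably live inside a bolt and be driven by a timer rather than by explicit batches, but the sliding-window idea is the same.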

I am really excited about this course. I am already halfway through the first lesson, and it has been a pretty pleasant experience so far.

Unfortunately, this course does not cover a few important Storm topics like Ack, Cluster and Trident. That’s a dampener on otherwise pretty great content. Needless to say, had they included these topics, this would have been a really wonderful course and the go-to Storm course on the internet. But alas!

Anyway, I am planning to complete the course by the end of this month and, if the course guidelines permit, I will put the solutions to the assignments up on my GitHub account.

If you are interested in, or your work involves, solving real-time Big Data use cases, you might want to check out this course: https://www.udacity.com/course/ud381.

Happy learning! And all the very best too.

Updated on 20th January, 2015: I have uploaded to GitHub the code and solutions for the exercises that I wrote while working on this course. Please check it out; you may find it useful.