Weather data analysis and visualization – Big data tutorial Part 3/9 – Environment

Tutorial big data analysis: Weather changes in the Carpathian-Basin from 1900 to 2014 – Part 3/9

Preparation – Analysis Environment

As the analyzed data is small enough to be processed on a single machine, I spared myself the time of setting up a new Hadoop cluster. I do have administrative access to a smaller cluster of regular PCs chained into a Hadoop cluster, but it was reserved at the time I ran the experiment.

Anyway, analyzing a small dataset with big data tools takes the same development effort as analyzing petabytes of data on a cluster of thousands of machines. It only takes less CPU time, so one can still learn the basics on small datasets.

Setting up the environment, choosing the tools

OS: Ubuntu 13.10

I chose Linux because the power of the shell is great for data manipulation. The “magic” toolset needed is the following:

  • Bash shell and AWK for easy text file processing
  • Python for data gathering and manipulation
  • A web server to play with for the blog post
  • GIS (geographical information system) tools for map-based data visualization

Linux is more suitable and easier to handle for development tasks like the above.
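To give a feel for the kind of text file manipulation the toolset above is meant for, here is a minimal Python sketch that averages temperatures per weather station. The column names (station, date, temp_c), the station codes and the sample values are made up for illustration; they are not the format of the real dataset used in this series.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: station code, date, daily mean temperature in °C.
# The layout and the values are invented for this example.
SAMPLE = """station,date,temp_c
BUD,1900-01-01,-2.0
BUD,1900-01-02,0.0
SZE,1900-01-01,-1.0
SZE,1900-01-02,1.0
"""

def mean_temp_by_station(lines):
    """Return a dict mapping station code to its mean temperature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row["station"]] += float(row["temp_c"])
        counts[row["station"]] += 1
    return {station: totals[station] / counts[station] for station in totals}

# io.StringIO stands in for an open file so the example is self-contained.
result = mean_temp_by_station(io.StringIO(SAMPLE))
print(result)
```

The same function works on a real file by passing an open file object instead of the in-memory sample.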

I installed Python and Kartograph: since the weather stations are geographically distributed, I wanted to plot some of the results on a map.

The open-source Kartograph GIS framework gave me an easy-to-use alternative for map creation and web-based visualization. It is more or less well documented and has some nice tutorials.

Download it and install using this guide.

I used a virtual machine to run Hortonworks' Hadoop: it is a pre-configured environment with a handy web-based UI. Hadoop, Pig and Hive come pre-installed: all the goodies needed for an analysis, open-source and free, sparing you a lot of hassle by eliminating the need for a lengthy sysadmin session on Hadoop.

I installed VirtualBox on Ubuntu and downloaded the Hortonworks Hadoop 2 Sandbox bundle for VirtualBox. The analysis was done using a single virtual Hadoop machine.

Hortonworks Hadoop

Hortonworks Hadoop is a bundle of open-source tools: Hadoop itself plus multiple extras, such as query languages that are translated to map-reduce jobs, monitoring and job scheduling tools, and Hue, a great web-based UI that integrates it all.
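For readers new to the term, the map-reduce idea itself can be sketched in a few lines of plain Python. This is only a conceptual illustration running in a single process, not how Hadoop, Pig or Hive actually execute a job; the function names and the sample records are invented for the example.

```python
from collections import defaultdict

# Hypothetical (year, temperature in °C) records, invented for this example.
RECORDS = [
    ("1900", 31.5),
    ("1900", 28.0),
    ("1901", 33.1),
    ("1901", 35.2),
]

def map_phase(records):
    """Map step: emit one (key, value) pair per input record."""
    for year, temp in records:
        yield year, temp

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: collapse each group to one result, here the maximum."""
    return {key: max(values) for key, values in groups.items()}

yearly_max = reduce_phase(shuffle(map_phase(RECORDS)))
print(yearly_max)
```

On a real cluster the map and reduce steps run in parallel on many machines and the framework handles the grouping in between; a query language roughly lets you describe the result you want and leaves generating those steps to the tool.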