Using the 50TB Hadoop Cluster

To use the Hadoop Cluster you must have a basic working knowledge of Java. The Hadoop Distributed File System (HDFS) and related framework are provided by Apache Hadoop (v2.4.1) from the Apache Software Foundation.

The cluster consists of a name node (discovery3.neu.edu) that is used for monitoring the cluster and three compute nodes (compute-2-004, compute-2-005, compute-2-006), each with over 18TB of storage, 128GB of RAM, a dual 10Gb/s bonded backplane, and dual 2.8 GHz CPUs giving 40 logical cores. Together these three servers provide a usable HDFS file system of 50 terabytes (TB). To run Hadoop jobs, first log in to a login node and from there request an interactive Hadoop node from the interactive “hadoop-10g” queue. After logging in to the interactive Hadoop node assigned to you by SLURM (see https://www.northeastern.edu/rc/?page_id=18#intjobs on how to get an interactive node), you can run your Hadoop jobs. You will need to add the proper modules to your “.bashrc” to run Hadoop jobs; “module whatis hadoop-2.4.1” will give you usage information and the order in which the modules must be loaded.
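
A typical session might look like the following. This is a minimal sketch: the exact module names and load order, and the srun options shown, are assumptions; confirm them with “module whatis hadoop-2.4.1” and the SLURM interactive-jobs page linked above.

    # In your ~/.bashrc, load the Hadoop module (plus any prerequisite modules
    # that "module whatis hadoop-2.4.1" says must be loaded first):
    module load hadoop-2.4.1

    # From a login node, request an interactive node in the "hadoop-10g" queue.
    # The srun options here are an assumption; see the interactive-jobs page
    # for the supported invocation on Discovery.
    srun --partition=hadoop-10g --pty /bin/bash

    # Once on the assigned node, verify the Hadoop tools are on your PATH:
    hadoop version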

To monitor the Hadoop Cluster, copy and paste the following into your browser:

For overall status: http://discovery3.neu.edu:50070

The PDF document “USING HDFS ON DISCOVERY CLUSTER” (which can be downloaded here: “https://www.northeastern.edu/rc/wp-content/uploads/2014/09/USING_HDFS_ON_DISCOVERY_CLUSTER-.pdf”) gives full details, with two worked examples (including source code), on how to compile code, move data into the HDFS file system, run jobs, remove data from the HDFS file system, follow the protocol for creating your top-level working directory in HDFS, and finally exit the interactive session. Please get an interactive node or submit batch jobs using the SLURM scheduler; do not run jobs on the login nodes. Please replace the LSF scripts for interactive and batch jobs in the document above with the equivalent SLURM ones here: https://www.northeastern.edu/rc/?page_id=18#intjobs.
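
As a rough illustration of that workflow, the sketch below stages data into HDFS, runs the stock WordCount example that ships with Hadoop 2.4.1, retrieves the results, and cleans up. The directory names, input file, and jar path are hypothetical placeholders; follow the directory-naming protocol in the PDF document above when creating your actual top-level directory.

    # Create a top-level working directory in HDFS (placeholder path; follow
    # the naming protocol in the PDF document) and stage input data:
    hdfs dfs -mkdir -p /user/$USER/wordcount/input
    hdfs dfs -put mydata.txt /user/$USER/wordcount/input

    # Run the WordCount example bundled with Hadoop. The jar location is an
    # assumption based on a standard Hadoop 2.4.1 layout; adjust as needed.
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar \
        wordcount /user/$USER/wordcount/input /user/$USER/wordcount/output

    # Inspect and retrieve the results:
    hdfs dfs -ls /user/$USER/wordcount/output
    hdfs dfs -get /user/$USER/wordcount/output/part-r-00000 ./wordcount_results.txt

    # Remove your data from HDFS when done, then exit the interactive session:
    hdfs dfs -rm -r /user/$USER/wordcount
    exit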

Please contact researchcomputing@neu.edu if you have questions or issues, or need further help, clarification, or guidance with respect to Hadoop Cluster usage, Java, Python, or SLURM.