Hadoop

Hadoop is a framework for writing code that can be executed in parallel. These days it is more of an ecosystem than a single piece of software, as a typical deployment contains:

  • HBase, a distributed, column-oriented table store
  • Hive, a data-warehousing layer with an SQL-like query language
  • MapReduce, a programming model that favours largely non-interacting tasks, so that slow interconnects between workers matter less
  • HDFS, a distributed filesystem spanning the workers

And many other components.

This installation of Hadoop has been optimised for running a standalone HDFS filesystem, primarily for teaching purposes.

WARNING

HDFS is incredibly insecure by default. Anyone who learns the address of your namenode can connect to it and access your data while impersonating any user. This can be prevented with Kerberos tickets, but that is currently beyond the level of complexity of this setup. Do not store anything you care about here.

Usage

Load the Hadoop and Java modules:

<source lang="bash">
module load hadoop
module load java
</source>

This will set your environment variables correctly.
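To check that this worked, you can ask the client for its version (this assumes the module puts the hadoop binary on your PATH, which is the usual behaviour):

<source lang="bash">
# Check that the hadoop client is available and working
which hadoop
hadoop version
</source>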

There is an example submission script at:

/shared/apps/hadoop/current/wur/example_sbatch.sh

Copy this to your preferred location, edit it as described below, and submit it.
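For example (the destination filename is only illustrative):

<source lang="bash">
# Copy the example job script, edit it, then submit it to Slurm
cp /shared/apps/hadoop/current/wur/example_sbatch.sh ~/hdfs_job.sh
sbatch ~/hdfs_job.sh
</source>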

When editing the script, adjust the settings to your needs. You can override the location where HDFS stores its blocks by setting the environment variable HADOOP_TMP_DIR, which defaults to /lustre/scratch.
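For instance, to keep the blocks under a per-user scratch directory (the path shown is just an example):

<source lang="bash">
# In the job script: store HDFS blocks somewhere of your own choosing
export HADOOP_TMP_DIR=/lustre/scratch/$USER/hdfs-blocks
</source>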

Once the job is running, the output log will tell you the location. I'd recommend setting a static location in your job script so that it can be found reproducibly, but HADOOP_CONF_DIR will point to the correct fs.defaultFS once the instance has started.
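Once HADOOP_CONF_DIR points at a running instance, the standard HDFS client commands should work against it. A small sketch (somefile is a placeholder for any local file):

<source lang="bash">
# Basic sanity checks against the running HDFS instance
hdfs dfs -mkdir -p /user/$USER       # create a home directory in HDFS
hdfs dfs -put somefile /user/$USER/  # upload a local file
hdfs dfs -ls /user/$USER             # list it back
</source>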

Mechanism details

The example script runs an executable called start-hdfs.sh. This starts by creating ~/.hadoop/$SLURM_JOB_ID and assigning that as HADOOP_LOG_DIR. Next, it creates a conf subfolder, assigns that to HADOOP_CONF_DIR, and writes the core-site.xml and hdfs-site.xml files into it. Finally, it creates a symlink at ~/.hadoop/current pointing to your currently running HDFS instance. On the client side, every HADOOP_CONF_DIR points to ~/.hadoop/current/conf, so whichever job you submit, the symlink resolves to the currently running configuration for you.
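A simplified sketch of those steps, for illustration only; the real start-hdfs.sh will differ, and the XML contents, hostname, and port below are assumptions rather than the actual template:

<source lang="bash">
# Per-job setup roughly as described above (illustrative sketch)
JOB_DIR=~/.hadoop/$SLURM_JOB_ID
mkdir -p "$JOB_DIR/conf"
export HADOOP_LOG_DIR="$JOB_DIR"
export HADOOP_CONF_DIR="$JOB_DIR/conf"

# Minimal core-site.xml pointing clients at this job's namenode;
# hdfs-site.xml is written similarly
cat > "$HADOOP_CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$(hostname -f):8020</value>
  </property>
</configuration>
EOF

# Repoint ~/.hadoop/current at this, the most recently started instance
ln -sfn "$JOB_DIR" ~/.hadoop/current
</source>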

If you're trying to share your HDFS instance with other users, you should:

  • Copy ~/.hadoop/current/conf/* to somewhere shared (see the example below)
  • Tell them to set the environment variable HADOOP_CONF_DIR to that location
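For example (the shared path is purely illustrative):

<source lang="bash">
# You: publish the configuration of your running instance
mkdir -p /shared/projects/myproject/hadoop-conf
cp ~/.hadoop/current/conf/* /shared/projects/myproject/hadoop-conf/

# They: point their client at your instance
export HADOOP_CONF_DIR=/shared/projects/myproject/hadoop-conf
hdfs dfs -ls /
</source>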