Spark

Apache Spark is a framework for distributing computation across multiple worker machines. It is the successor to Hadoop MapReduce and allows a wider range of code to be executed on the clustered resources. The only requirement for Spark to operate is that the workers can reach one another via TCP, so computation can run on very simple resources, provided the code itself can be expressed in the MapReduce paradigm.
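
As a brief illustration of that paradigm, the following PySpark sketch counts words by mapping each word to a pair and then reducing by key. It assumes a SparkContext named sc, which the sections below show how to obtain on the HPC; the data is made up for the example.

# A minimal map/reduce sketch; assumes an existing SparkContext `sc`
words = sc.parallelize(["spark", "hadoop", "spark", "hpc"])

# map each word to a (word, 1) pair, then sum the counts per word
counts = (words
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())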

SPARK on HPC

To create a personal Spark cluster, you must first request resources on the HPC. Use this example submission script to initialise your cluster:

#!/bin/bash
#SBATCH --time=<length>
#SBATCH --mem-per-cpu=4000
#SBATCH --nodes=<number of nodes>
#SBATCH --tasks-per-node=<number of workers per node>
#SBATCH --job-name="my spark cluster"
#SBATCH --qos=QOS

# Load the Spark and Python modules
module load spark/3.0.1-2.7
module load python/3.8.5

# Start the Spark master and its workers on the allocated resources
source $SPARK_HOME/wur/start-spark

# Keep the job (and therefore the cluster) alive until it is cancelled
tail -f /dev/null

This will spawn a new cluster of your desired dimensions once the resources become available. The Spark module has been written to output its logs to your home directory, at:

/home/WUR/yourid/.spark/<jobid>/

In this folder you will find the raw logs of the master and of all workers. By default the master consumes 1 GB of memory from the first task, so a single 4 GB 'cluster' will be provided with one 3 GB worker. You can change the CPU and memory use by adjusting the parameters in your batch script.

Within the log folder you will find two special files: master and master-console. master always contains the URI of the current Spark master access point, and master-console the URL of its web console.

To access the web console, the easiest solution is to open the URL from the master-console file in the text-mode browser links:

links http://myspark:8081

This will render the page for you in the terminal. Ctrl-R reloads the page, and q quits.

There are a couple of caveats to bear in mind:

  • The cluster exists (and consumes resources) until you cancel it with scancel <jobid>
  • There is no security at all: any user of the HPC can access both the master and its console at any time if they know the host and port.

Instant SPARK

You can also spin up a cluster solely to execute a script. Simply replace the last line of the example above:

tail -f /dev/null

with

spark-submit myscript.py

After the script has finished executing, the cluster will terminate automatically.
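
As a sketch of what such a script could look like (the file name myscript.py comes from the example above; the computation itself is just an illustration), the following sums some squares on the cluster. The master URI is assumed to be provided by the site's start-spark setup; otherwise it can be passed explicitly with spark-submit --master <URI>.

# myscript.py - a minimal example job for spark-submit
from pyspark.sql import SparkSession

# getOrCreate() picks up whatever master spark-submit was given
spark = SparkSession.builder.appName("InstantSparkExample").getOrCreate()
sc = spark.sparkContext

# Distribute 0..999 over the workers and sum the squares
result = sc.parallelize(range(1000)).map(lambda x: x * x).sum()
print("sum of squares:", result)

spark.stop()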

SPARK in Jupyter

There is a kernel available for using Spark from Jupyter. All it does is set up the correct paths to the Python version and the Spark binaries for you. To set up your SparkContext, the first cell of each notebook should be:

import os
import pyspark

# Read the URI of the current Spark master and build the configuration
with open(os.environ['HOME'] + '/.spark/current/master') as f:
    conf = (pyspark.SparkConf()
            .setMaster(f.read().strip())
            .setAppName("MyName"))

sc = pyspark.SparkContext(conf=conf)

This reads the cluster master URI from the master file in your job output, as described above. Subsequent cells will then have sc defined. Run this cell only once; attempting to reconnect will throw an error. The application keeps running until the kernel is terminated and prevents other applications from being executed, so you may wish to manually shut down your kernel from the top bar in Jupyter to free the resources.
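
For example, a later cell could then use sc directly (the computation here is only an illustration); calling sc.stop() is an alternative way to release the application without shutting down the kernel:

# Any later cell can use the SparkContext created in the first cell
rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * 2).sum())

# Optional: release the application's resources when you are done
sc.stop()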

As a teacher, you might want to put that master file somewhere else, such as /lustre/shared, so that students can connect to your cluster.
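
For example (the shared destination path below is only an illustration), you could copy the file to a shared location and have students open that path, instead of their own ~/.spark/current/master, in the first-cell example above:

import os
import shutil

# Publish the master URI of your running cluster in a shared location
# (the destination path is an example; pick one your students can read)
shutil.copy(os.environ['HOME'] + '/.spark/current/master',
            '/lustre/shared/my-course/spark-master')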