Training Neural Networks on Computing Clusters

Recently I’ve been trying to offload my neural network training onto a computing cluster (the Hoffman2 cluster at UCLA). Specifically, I’m trying to use TensorFlow’s implementation of the CNN model on the CIFAR-10 dataset. Training the model interactively was not difficult: I just had to load the Anaconda module, install the TensorFlow package and its associated modules, then train the model. However, doing it non-interactively, i.e. submitting it as a job to a remote node, requires writing a simple C shell script for submission. I’m posting the command file for job submission here for those who may be interested in running machine learning jobs on a computing cluster.
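For orientation, here is a minimal sketch of how a command file like this is typically submitted and monitored under Sun Grid Engine. The file name train_cifar10.cmd and the job ID 1234567 are placeholders made up for illustration, not names from my setup.

```csh
# Submit the command file to the scheduler (file name is a placeholder)
qsub train_cifar10.cmd

# List your own pending and running jobs
qstat -u $USER

# Cancel a job by its ID if something went wrong (ID is a placeholder)
qdel 1234567
```

qsub prints the job ID on submission. That is the same ID the script later uses through $JOB_ID to name the training output file.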

The first part of the command file gives the Sun Grid Engine job scheduler the parameters for the remote job: how much memory to request, how long the job may run, where to write the output, etc.

#!/bin/csh -f
#
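# run the job from the current working directory
#$ -cwd
# write the job's output to this path
#$ -o output/directory
# merge stderr into the stdout file
#$ -j y
# resources: exclusive node, 8 GB memory, 6-hour run-time limit
#$ -l exclusive,h_data=8G,h_rt=6:00:00
# export QQAPP=job into the job's environment
#$ -v QQAPP=job
# address for job-related email notifications
#$ -M email_for_sending_job-related_notifications@email.com
# send mail when the job begins (b), ends (e), or aborts (a)
#$ -m bea
# job is not rerunnable
#$ -r n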

I’m not sure exactly what the lines below do, but they appear to set the paths to the files related to the training.

 unalias * # Remove all previous aliases
 
 set qqversion =
 set qqapp = "job serial"
 set qqmtasks = 8
 set qqidir = directory/to/training/files
 set qqjob = file_name
 set qqodir = directory/to/training/files

Change to the project directory and initiate the job scheduler.

 cd /path/to/project/directory
 source /u/local/bin/qq.sge/qr.runtime
 if ($status != 0) exit (1)

Load the Anaconda module, which provides the Python environment.

 echo "Load modules..."
 source /u/local/Modules/default/init/modules.csh
 module load anaconda/python3-4.2

Activate the TensorFlow conda environment. One caveat here: normally you would activate the environment with source activate tensorflow, but conda’s activate script appears to be written for bash-style shells, so you cannot directly source it inside your csh script. Fortunately, someone has written a csh version of the activate command that is available on GitHub, so you can activate the tensorflow environment by sourcing that version in the script.

 echo "Anaconda loaded"
 setenv CONDA_ENVS_PATH /path/to/conda/environments # Set path to search for Conda environments
 source /path/to/activate.csh tensorflow # Use CSH version of conda activate to change to tensorflow environment
 setenv PATH /path/to/tensorflow/conda/environment/bin:$PATH # Add Tensorflow env to beginning of search paths
 echo "Tensorflow loaded"
 setenv MCR_CACHE_ROOT $TMPDIR
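Since the activation is the step most likely to go wrong silently, it may be worth adding a quick check before training starts. The lines below are an optional sketch that is not part of my original script. They assume the environment is named tensorflow and that TensorFlow is importable as tensorflow from it.

```csh
# Optional sanity check (an addition, not part of the original script)
# Show which Python interpreter is first on the search path
which python

# Try to import TensorFlow and print its version
python -c "import tensorflow as tf; print(tf.__version__)"
if ($status != 0) then
    echo "TensorFlow could not be imported; check the conda environment paths"
    exit (1)
endif
```

If the interpreter printed by which python is not the one inside the tensorflow environment's bin directory, the PATH line above is the first thing to check.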

Begin training the neural network! This also prints the output to a text file so you can check the progress as it goes along. Note that the output does not update instantaneously, so don’t freak out if you don’t see anything in the file right away.

 echo "Begin model training..."
 python cifar10_train.py | tee -a train_cifar10.output.$JOB_ID

 source /u/local/bin/qq.sge/qr.runtime
 exit(0)
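If the delay in the output file bothers you, the most likely cause is Python buffering its output because it is writing to a pipe rather than a terminal. Assuming that is the cause, a variant of the training line like the following sketch should make the progress show up sooner.

```csh
# -u asks Python not to buffer stdout/stderr, so lines reach tee sooner
# |& (csh syntax) sends stderr through the pipe as well as stdout
python -u cifar10_train.py |& tee -a train_cifar10.output.$JOB_ID
```

Setting the environment variable PYTHONUNBUFFERED to a non-empty value before the python call (setenv PYTHONUNBUFFERED 1) should have the same effect as the -u flag.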

Hope this helps!