ClusterRun is a feature in emergent that lets you launch and manage emergent jobs on a remote compute cluster from your own personal computer. From within the normal emergent GUI, you can launch or kill jobs on the remote cluster and easily retrieve and process the resulting data. Overall, ClusterRun provides:
- The ability to launch jobs asynchronously on a remote compute server
- The ability to run large models in a distributed fashion through MPI
- The ability to perform large-scale parallel parameter searches
- A log, notes, and revision history of the models run, and the ability to retrieve the exact project that was run
- The ability to easily create graphs that compare performance between different versions of a model, to test how various changes affect performance
- The ability to manage multiple compute servers at once
The main user interface for ClusterRun is the ClusterRun panel, which will automatically appear in the .edits section of your Project -- it provides the master interface for submitting and managing compute jobs on a remote cluster or other compute server (e.g., cloud computing). See Stampede for a high-performance cluster with allocated cycles for emergent users. You select variables that you are manipulating (using the ControlPanel system -- ClusterRun is a specialized version of ControlPanel), then hit Run to submit a job on the cluster, either with the current parameters or using a parameter search algorithm to sweep over parameters. Jobs can be monitored in the jobs_running and jobs_done tabs -- you can Kill jobs, GetData from jobs to see the results and get other files -- the file_list tab shows files to manipulate. The cluster_info tab provides overall info on the cluster.
The system uses subversion (svn) as a shared file system between your computer and the cluster -- the Run button checks a jobs_submit.dat file into subversion, and a script running on the cluster monitors for changes to this file and submits the job on the cluster. The Update button in ClusterRun does an "svn update" command and updates all the local data tables (jobs_running, jobs_done, cluster_info) according to the latest info. This will also update any data tables with job results that have been previously retrieved using the GetData button. Most actions will auto-update to the next svn revision (watch the css console for progress -- it will say "updated to target revision" when it is done auto-updating). To track further progress for running jobs, just hit the Update button. After a job actually starts running, it is tracked for a few minutes, but not thereafter -- to see if there are new files or other job output from a job that has been running a while, hit Updt Running. The script is generally configured to poll and update roughly every 10-20 sec, so that is the rough turn-around time for actions that depend on the script.
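The change-detection idea behind the monitor script can be sketched roughly as follows. This is only an illustration, not the actual cluster_run_mon_lib.py code: the function names file_digest and poll_once and the on_change callback are hypothetical, and the real script works through svn updates rather than by hashing a local file.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return a SHA-256 hex digest of the file, or None if it does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    return hashlib.sha256(p.read_bytes()).hexdigest()


def poll_once(path, last_digest, on_change):
    """Check the file once and call on_change(path) if its contents changed.

    Returns the digest to remember for the next poll.
    """
    digest = file_digest(path)
    if digest is not None and digest != last_digest:
        on_change(path)  # e.g. submit the jobs listed in the file (hypothetical)
    return digest
```

A monitor loop built on this would call poll_once, remember the returned digest, and sleep for the polling interval (roughly 10-20 sec in the description above) before checking again.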
You must configure clusters and the subversion repository info in the overall Preferences. These settings will be specific to a given lab or user -- consult your local lab info for proper settings for clusters available to you. It is strongly recommended that you create a dedicated svn repository for cluster run usage -- you only need one repository for any number of clusters -- all the files that you exchange between the cluster and your desktop need to go through this repository, so it will get relatively large over time. Periodically, you can rebuild the repository to remove old history and encourage users to clean up files. Everything is highly compartmentalized for any given user, so no individual will suffer unduly from the large repository (you only check out the specific parts for your own user name), but the central svn server will need to be able to manage the large repository file.
- 1 Running the Cluster Monitor Script and Initial Configuration
- 2 Selecting Parameters
- 3 Submitting a New Compute Job
- 4 Monitoring and Killing Jobs
- 5 Getting Data Tables
- 6 Getting Other Files
- 7 Keeping things Clean
- 8 Sharing a common svn repository among multiple projects
- 9 Parallel batch jobs
- 10 Configuring the Server
- 11 Problem Resolution
Running the Cluster Monitor Script and Initial Configuration
IMPORTANT: before you do anything, you must set the Preferences settings for at least one cluster, the cluster_svn_path, and at least one svn_repo_url -- the cluster and svn_repo_url settings are site-specific and must be set up by your administrator or yourself. The cluster_svn_path default is usually fine for everyone.
Also, it is useful for first-timers to click on the Properties tab of the ClusterRun object, and look at the tool-tips for each of the items there -- these are the key configuration parameters for a given job. The popup dialog that comes up will refer to these same fields.
First, you need to create the proper directories in the svn repository for your username and the cluster you want to use. The simplest way to do this is to just click the Probe button and select the cluster you want to use -- this will trigger all the proper initialization. You only need to do this once for every new cluster you use.
Then, ssh into the cluster, and, per instructions for your particular lab or site, grab the appropriate monitor script for that cluster and the standard cluster_run_mon_lib.py library file, and install them in the recommended location on the cluster. Typically this will be ~/cluster_run_<clust_name> in your home directory, but it can be anywhere. You will need a separate subdirectory for each cluster. Typically we configure different queues on the same cluster as a different cluster name in emergent, so you can have multiple such directories on the same computer, one for each such "cluster/queue" combination. Be sure to grab the script specific to that particular cluster/queue name, as it will have the queue name baked into it.
Then, just run the script in that directory, e.g.:
cd ~/cluster_run_janus_short
./cluster_run_mon_moab.py
You will then be prompted for your username, the name of the cluster (e.g., "janus_short" in this case), and the svn repository info (which must match exactly what is set in the emergent preferences), polling interval, and a few other questions -- defaults are recommended. The first time you run the script, you should answer "n" to the "run in background using nohup" question, so you can see the output of the script as it runs, and make sure everything is working well.
After the first time you run the script, you can just run it in the background, and it should keep chugging along even after you log out. You will need to restart it if the machine is rebooted. If you notice that jobs are not being submitted, then you'll need to restart the script.
Selecting Parameters
The ability to set parameters as selected in the ClusterRun automatically via the cluster run system depends on the presence of a ControlPanelFmArgs program element in the Startup program of your project -- this will translate command-line arguments into parameter values using the items selected in the ClusterRun. The best thing to do is to make a new startup program by doing New From Lib in your overall programs group, and selecting LeabraStartup (this will generally work with other algorithms too). You can just drag the highlighted green lines into your existing startup, or vice-versa for any custom things you previously added to your startup. Then delete the one you're not using -- you should only have one startup program in general.
Startup is also where you specify logging to save the results of your run. The results are retrieved using GetData and ImportData.
The cluster run works just like ControlPanel, so see the documentation there for basics on that. One important difference for ClusterRun is that you must only select individual values, and not collections of values. For example, in LeabraUnitSpec, the act line has a number of individual values -- if you want to manipulate gain, you must select it individually -- just do the context menu on the gain label and you can add it to the cluster run. You can have a large number of parameters added to your cluster run control panel, but only those with the search flag enabled will be recorded and used for the parameter search algorithms. Thus, you should carefully consider which of the parameters you are manipulating and select those for searching, while deselecting those you are not searching on -- this is useful even when doing manual parameter searches, and it is absolutely essential when using a search algorithm, as the search time can increase exponentially as a function of the number of parameters.
Submitting a New Compute Job
When you first open up a project, it is a good idea to do Probe followed by Update -- this gets the cluster sync'd up for this project and ensures everything is updated. You can then look at the cluster_info tab to see how loaded the cluster is, and check out different clusters in this way, before deciding where to submit a job.
Then, just press the Run button and fill in all the parameters for the job -- these are saved in the ClusterRun and will reflect previous runs if jobs had been submitted before. Use the mouse-over to get more info on the parameters.
Specifying Emergent Version To Run
The "Run" dialog's text field "Executable Cmd:" defaults to "emergent" and whatever is the most recent version of emergent on the cluster will execute. This version is the "bleeding edge" version and may be from the previous evening or from a few days ago. You can get the revision number by logging into the cluster and typing
If you want to run a different version you can checkout the emergent sources to your home directory on the cluster and build any revision.
svn checkout -r <rev #> --username anonymous --password emergent https://grey.colorado.edu/svn/emergent/emergent/trunk ~/emergent
You might want to build both a single-node version and an mpi version. On the dream and blanca_ccn clusters we are still using qt4, so:
cd emergent
./configure qt4 clean sse8
and/or
./configure qt4 mpi clean sse8
You specify the version by specifying the path in the "Executable Cmd:" field. For example, ~/emergent/build/bin/emergent or ~/emergent/build_mpi/bin/emergent
Parameter Search: Automating Parameter Testing
You can automate the comparison of multiple parameter settings by adding a parameter search algorithm and setting parameters to search mode on the cluster run control panel.
To add a search algorithm, select Jobs/New Search Algo, and select the type of algorithm, e.g., GridSearch or any other available algorithm (but not "ParamSearchAlgo" as that is a virtual base class of the actual algorithms).
See ParamSearchAlgo for more information on the available search algorithms and their behavior.
If a parameter search algo is not set, then the current parameters as set in the ControlPanel tab are used.
The range of values to be searched for each item can be specified by listing values separated by commas, and ranges can be specified using start:stop:increment notation (increment is optional and defaults to 1), as used in Matrix code -- e.g., 1,2,3:10:1,10:20:2 -- however, here the stop value is INCLUSIVE: 1:3:1 = 1,2,3, as that is more often useful.
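To make the notation concrete, here is a minimal Python sketch of how such a specification could be expanded into a list of values. It is not emergent's own parser: the function name expand_search_values is hypothetical, and the handling of duplicates (kept as-is) and of non-positive increments (rejected) are assumptions made for the illustration.

```python
def expand_search_values(spec):
    """Expand a spec such as "1,2,3:10:1,10:20:2" into a list of floats.

    Each comma-separated item is either a single value or a
    start:stop[:increment] range whose stop value is INCLUSIVE.
    """
    values = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [float(p) for p in item.split(":")]
        if len(parts) == 1:
            values.append(parts[0])
            continue
        if len(parts) > 3:
            raise ValueError(f"bad range: {item!r}")
        start, stop = parts[0], parts[1]
        incr = parts[2] if len(parts) == 3 else 1.0  # increment defaults to 1
        if incr <= 0 or stop < start:
            raise ValueError(f"unsupported range in this sketch: {item!r}")
        # Count steps instead of accumulating, to limit floating-point drift.
        n_steps = int((stop - start) / incr + 1e-9)
        values.extend(start + i * incr for i in range(n_steps + 1))
    return values
```

With this sketch, "1:3:1" and "1:3" both expand to the three values 1, 2, 3, matching the inclusive-stop rule described above.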
You can also specify %paramname to yoke this parameter to the paramname parameter -- it will not be searched independently, but rather will have the same values as paramname.
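The effect of yoking on a full grid search can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the actual GridSearch implementation: the function name grid_combinations and the parameter names used in the usage note are hypothetical, and yoking one yoked parameter to another is not handled.

```python
from itertools import product


def grid_combinations(params):
    """Return one dict of parameter values per job in a full grid search.

    `params` maps a parameter name to either a list of values (searched
    independently) or a string "%other" (yoked to parameter `other`).
    """
    searched = {}
    yoked = {}
    for name, spec in params.items():
        if isinstance(spec, str):
            if not spec.startswith("%"):
                raise ValueError(f"expected a list or '%name', got {spec!r}")
            yoked[name] = spec[1:]
        else:
            searched[name] = list(spec)
    names = list(searched)
    combos = []
    for values in product(*(searched[n] for n in names)):
        job = dict(zip(names, values))
        for name, source in yoked.items():
            job[name] = job[source]  # yoked parameter copies its source value
        combos.append(job)
    return combos
```

For example, grid_combinations({"lrate": [0.01, 0.02], "gain": [1, 2, 3], "gain_copy": "%gain"}) produces 2 x 3 = 6 jobs, not 18, because the yoked parameter adds no dimension to the search. The job count is the product of the number of values of each independently searched parameter, which is why it grows so quickly as parameters are added.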
Loading Weights Files
When you run your model and generate weights files, they are saved in <your project>/results along with the .args file and the .dat files for the run. To load these weights, just add CRR: before the name of the weights file. For example, CRR:Reconnaissance_recon_basic_objrec_train_69033_0.00_0250.wts.gz. CRR: will expand to be the path to <your project>/results. CRM: is similar; it expands to <your project>/models. Also, if you want to run locally with weights generated from a training run on a cluster, you can get the weights file from the cluster run file list, and emergent will know to look in <your project>/results on your local machine. Getting a weights file also makes it possible to use weights generated by another user's run. When you get the file, it will be added to your repository, so it will be available to you locally and on the cluster.
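The prefix expansion described above amounts to a simple path substitution, which can be sketched in Python as follows. This is only an illustration of the behavior, not emergent's code: the function name expand_cluster_path and the example project directory are hypothetical.

```python
import os

# Prefix -> subdirectory of the project directory, as described above.
PREFIX_DIRS = {"CRR:": "results", "CRM:": "models"}


def expand_cluster_path(name, project_dir):
    """Expand a CRR: or CRM: prefix to a path under the project directory."""
    for prefix, subdir in PREFIX_DIRS.items():
        if name.startswith(prefix):
            return os.path.join(project_dir, subdir, name[len(prefix):])
    return name  # no recognized prefix: leave the name unchanged
```

For instance, with a hypothetical project directory /home/me/myproj, the name CRR:net.wts.gz would resolve to a file named net.wts.gz inside the results subdirectory of that project.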
Monitoring and Killing Jobs
Once you submit the job, you can click on the jobs_running tab and keep hitting Update until the status column transitions to RUNNING -- you can look at the cluster_info tab to see your job listed as queued or running there as well, as you wait. The monitor script will definitely check in once the status changes, but it can take a while depending on the load and the nature of the cluster.
Once you have a running job, or even if it transitions immediately to KILLED or DONE, you'll want to look at the job_out column to see what happened when the job was started or why it died if it did. The easiest way to look at this column is to right-click (context menu) on the cell and select View -- this is a new option that will pull up a dialog with the contents of that cell in a much more readable form.
Getting Data Tables
Once a job has started, you can highlight that job in jobs_running and then click GetData to get all the data files associated with that job (.dat files) -- this will just cause the cluster to check in the files, and it will then auto-update to actually svn update and receive the checked-in files. To view the files after it updates to the target svn revision, click ImportData with a job highlighted, and it will import the data files for that job into data/ClusterRun -- you can then go there and look at the results, and you can also build analysis programs that automate further analyses on these data tables.
A very powerful feature of ImportData is that it will add extra columns to your data files for each of the parameters that were selected in the Control Panel tab and set in the startup when the job started. This is automatic and avoids the need to augment your data tables manually with this information (which otherwise can be a real pain to do). This is one of the most important advantages of using the ClusterRun system -- you can analyze your results as a function of parameter values. See the taDataAnal and other related functionality for things you can do to operate on this data, and see the Data Proc Tutorial for more such info.
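The idea of tagging each job's data with its parameters can be sketched in plain Python. This is a conceptual illustration only, not how ImportData is implemented: the function names, the list-of-dicts table representation, and the column names in the usage note are all hypothetical.

```python
def add_param_columns(rows, params):
    """Return new rows with one extra column per job parameter.

    `rows` is a list of dicts (one per data row); `params` maps parameter
    names to the values used by the job that produced those rows.
    """
    clash = set(params) & {key for row in rows for key in row}
    if clash:
        raise ValueError(f"parameter names clash with data columns: {sorted(clash)}")
    return [{**row, **params} for row in rows]


def merge_jobs(jobs):
    """Stack rows from several jobs into one table, tagged with parameters."""
    table = []
    for rows, params in jobs:
        table.extend(add_param_columns(rows, params))
    return table
```

Given two jobs run with gain values of 100 and 80, merge_jobs returns a single table in which every row carries a gain column, so results can be grouped or plotted by parameter value without any manual bookkeeping.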
Getting Other Files
The Files menu button has functions for getting a list of other files associated with a given job or set of jobs, and then you can use the file_list tab to select files to grab from the cluster. You can also do ListAllFiles to see all the results files that have been checked in -- you can then operate directly on those (e.g., with ImportData) or remove them using RemoveFiles.
Keeping things Clean
In the Jobs menu button, you can find options to remove jobs or all killed jobs (anything with a status of KILLED) -- always a good idea to delete things you're no longer using. For important runs that you want to keep around, but aren't currently relevant, you can archive them -- this will automatically remove the job control data -- the JOB.xxx.out file and the project file tagged with the tag for that job. You can also remove this stuff in the Files menu. In general, you can usually just save the best performing parameter case in the archive, and remove the rest of the jobs, while noting on a wiki or elsewhere about the outcomes of these other parameters -- that will be more accessible than the raw data itself.
Sharing a common svn repository among multiple projects
You can specify a given project name to use for the svn repository (in the Properties tab, click set_proj_name, which will reveal the proj_name field), instead of using the actual project name -- this allows multiple different project files to share a common svn repository, which is useful for trying out different architectural variants in different projects.
The local proj_name.proj file is never written to in this process, so it can be one of the projects sharing its native svn repository -- the process works by copying the local project file into the svn repository directory -- it is in this process that the local project name gets copied over to the common proj_name.proj file. However, the original project file name is retained in naming the output data files -- therefore it is useful to make sure that the different projects all have the same starting name as proj_name, with typically a unique suffix.
Parallel batch jobs
You can run multiple runs of the same model in parallel across nodes or procs (not using mpi -- just embarrassingly parallel separate runs), each on a different set of batch iterations (e.g., different initial random weights) -- this will submit a different job for each set of batches on the server (so they can all be tracked directly) -- see pb_batches and pb_n_batches_per for relevant parameters.
Important: as of version 8.0 batch_start and n_batches are used as the parallel-batch args, instead of b_start and b_end that were used previously -- the new standard programs have been updated to use these parameters.
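The way a set of batches might be divided among parallel jobs can be sketched as follows. This is an illustration under stated assumptions, not the actual submission code: it assumes pb_batches is the total number of batches and pb_n_batches_per is the number of batches per job, and the function name split_batches is hypothetical.

```python
def split_batches(total_batches, batches_per_job):
    """Return a list of (batch_start, n_batches) pairs, one per job.

    Assumes `total_batches` plays the role of pb_batches and
    `batches_per_job` the role of pb_n_batches_per.
    """
    if total_batches < 0 or batches_per_job <= 0:
        raise ValueError("need total_batches >= 0 and batches_per_job > 0")
    jobs = []
    for start in range(0, total_batches, batches_per_job):
        # The last job may get fewer batches if the split is uneven.
        jobs.append((start, min(batches_per_job, total_batches - start)))
    return jobs
```

Under these assumptions, 10 batches at 4 per job would yield three jobs, each of which would receive its own batch_start and n_batches values.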
Configuring the Server
See ClusterRun Server for information about configuring a cluster run server.
Problem Resolution
You clicked Run and entered the run parameters, but you don't see your job in jobs_running or in the full job list in the cluster_info table.
- You probably need to restart your background script running on the cluster (see Running the Cluster Monitor Script and Initial Configuration).
You are getting an error on update because a directory named anonymous can't be found, something like svn_clusterun/grey_run/blanca_ccn/anonymous.
- The code has picked up your svn username from your last svn action (perhaps an anonymous checkout).
- Correct this by doing an action, such as that same checkout, with your actual svn username.