utils module

A collection of utilities used by the various WE2E scripts

calculate_core_hours(expts_dict: dict) → dict

Takes an experiment dictionary, reads the var_defns.sh file for the necessary information, calculates the core hours used by each task, and updates expts_dict with the results.

Parameters: expts_dict (dict) – The information needed to run one or more experiments. See example file WE2E_tests.yaml
Returns: expts_dict – Experiment dictionary updated with core hours
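
The core-hour arithmetic can be illustrated with a minimal sketch. It assumes each task record carries a core count and a walltime in seconds; the helper name add_core_hours and the keys cores, walltime, and core_hours are hypothetical and are not taken from the actual implementation, which obtains its inputs from var_defns.sh.

```python
def add_core_hours(tasks: dict) -> dict:
    """Annotate each task with core hours = cores * walltime (seconds) / 3600."""
    for info in tasks.values():
        # ASSUMPTION: "cores" and "walltime" are illustrative key names
        info["core_hours"] = round(info["cores"] * info["walltime"] / 3600, 2)
    return tasks


tasks = {
    "make_grid": {"cores": 24, "walltime": 300.0},
    "run_fcst": {"cores": 120, "walltime": 1800.0},
}
for name, info in add_core_hours(tasks).items():
    print(name, info["core_hours"])
```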

compare_rocotostat(expt_dict, name)

Reads the dictionary showing the location of a given experiment, runs a rocotostat command to get the full set of tasks for the experiment, and compares the two to see if there are any unsubmitted tasks remaining.

Parameters:

expt_dict (dict) – A dictionary containing the information for an individual experiment
name (str) – Name of the experiment

Returns:

expt_dict – A dictionary containing the information for an individual experiment
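
The comparison step amounts to a set difference between the full task list and the tasks already recorded for the experiment. The sketch below shows only that step and does not call rocotostat; the function name find_unsubmitted and both argument names are hypothetical.

```python
def find_unsubmitted(all_tasks: list, recorded_tasks: list) -> list:
    """Return the tasks that appear in the full list but not in the recorded ones."""
    # ASSUMPTION: both inputs are plain lists of task names
    return sorted(set(all_tasks) - set(recorded_tasks))


full_list = ["make_grid", "make_orog", "run_fcst", "run_post"]
recorded = ["make_grid", "make_orog"]
print(find_unsubmitted(full_list, recorded))
```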

create_expts_dict(expt_dir: str)

Takes in a directory, searches that directory for subdirectories containing experiments, and creates a skeleton dictionary that can be filled out by update_expt_status()

Parameters: expt_dir (str) – Experiment directory name
Returns: (summary_file, expts_dict) – A tuple including the name of the summary file (WE2E_tests_YYYYMMDDHHmmSS.yaml) and the experiment dictionary
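
A rough sketch of how such a skeleton might be built is shown below. It rests on two assumptions that the description above does not confirm: that a subdirectory counts as an experiment when it contains a var_defns.sh file, and that each entry stores a path and an initial CREATED status. The function name sketch_expts_dict and the keys expt_dir and status are hypothetical.

```python
import os
from datetime import datetime


def sketch_expts_dict(expt_dir: str):
    """Return (summary_file, expts_dict) for the subdirectories of expt_dir."""
    # Timestamped name following the WE2E_tests_YYYYMMDDHHmmSS.yaml pattern
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    summary_file = f"WE2E_tests_{stamp}.yaml"

    expts_dict = {}
    with os.scandir(expt_dir) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # ASSUMPTION: an experiment directory is one that holds var_defns.sh
            marker = os.path.join(entry.path, "var_defns.sh")
            if entry.is_dir() and os.path.isfile(marker):
                # ASSUMPTION: illustrative skeleton keys, to be filled in later
                expts_dict[entry.name] = {"expt_dir": entry.path, "status": "CREATED"}
    return summary_file, expts_dict
```

The returned dictionary is only a skeleton; in the documented workflow, update_expt_status() fills in the per-task details.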

print_WE2E_summary(expts_dict: dict, debug: bool = False)

Creates a summary of the specified experiments.

Parameters:

expts_dict (dict) – A dictionary containing the information needed to run one or more experiments. See example file WE2E_tests.yaml.
debug (bool) – [optional] Enable extra output for debugging

Returns:

None

print_test_info(txtfile: str = 'WE2E_test_info.txt') → None

Prints a pipe-delimited ( | ) text file containing summaries of each test with a configuration file in test_configs/*

Parameters: txtfile (str) – File name for test details file (default: WE2E_test_info.txt)
Returns: None
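
For readers unfamiliar with the pipe-delimited layout, the sketch below produces text of that shape with the standard csv module. The column names and the sample row are hypothetical and do not reflect the actual columns of WE2E_test_info.txt.

```python
import csv
import io


def format_test_info(rows: list) -> str:
    """Render rows as pipe-delimited text, one test per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    # ASSUMPTION: illustrative header; the real file's columns may differ
    writer.writerow(["test_name", "description"])
    writer.writerows(rows)
    return buffer.getvalue()


print(format_test_info([["grid_A", "Basic grid test"]]))
```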

update_expt_status(expt: dict, name: str, refresh: bool = False, debug: bool = False, submit: bool = True) → dict

This function reads the dictionary for a given experiment, runs the rocotorun command to update the experiment (by running new jobs and updating the status of previously submitted ones), and reads the Rocoto database (.db) file to update the status of each job in the experiment dictionary. The function then uses a simple set of rules to combine the statuses of every task into a useful summary status for the whole experiment and returns the updated experiment dictionary.

Experiment status levels explained:

CREATED: The experiments have been created, but the monitor script has not yet processed them. This is immediately overwritten at the beginning of the monitor_jobs() function.

SUBMITTING: All jobs are in status SUBMITTING or SUCCEEDED. This is a normal state; experiment monitoring will continue.

DYING: One or more tasks have died (status DEAD), so this experiment has an error. Experiment monitoring will continue until all previously submitted tasks are in either status DEAD or status SUCCEEDED (see next entry).

DEAD: One or more tasks are in status DEAD, and other previously submitted jobs are either DEAD or SUCCEEDED. This experiment will no longer be monitored.

ERROR: Could not read the Rocoto database (.db) file. This will require manual intervention to solve, so the experiment will no longer be monitored.

STALLED: All submitted jobs are SUCCEEDED but one or more jobs have not been submitted; if this state persists, it will become “STUCK”.

STUCK: All submitted jobs are SUCCEEDED but one or more jobs have not been submitted for multiple iterations; this can indicate system-level throttling or a problem with Rocoto dependencies.

RUNNING: One or more jobs are in status RUNNING, and other previously submitted jobs are in status QUEUED, SUBMITTED, or SUCCEEDED. This is a normal state; experiment monitoring will continue.

QUEUED: One or more jobs are in status QUEUED, and some others may be in status SUBMITTED or SUCCEEDED. This is a normal state; experiment monitoring will continue.

SUCCEEDED: All jobs are in status SUCCEEDED; experiment monitoring will continue for one more cycle in case there are unsubmitted jobs remaining.

COMPLETE: All jobs are in status SUCCEEDED, and the experiment has been monitored for an additional cycle to ensure that there are no unsubmitted jobs. This experiment will no longer be monitored.

Parameters:

expt (dict) – A dictionary containing the information for an individual experiment, as described in the main monitor_jobs() function.
name (str) – Name of the experiment; used for logging only
refresh (bool) – If True, check the experiment's status even if it is listed as DEAD, ERROR, or COMPLETE. Used for initial checks of experiments that may have been restarted.
debug (bool) – If True, capture all output from rocotorun. This allows information such as job cards and job submit messages to appear in the log files, but it can drastically slow down the testing process.
submit (bool) – If True, advance the workflow by calling rocotorun in addition to reading the Rocoto database (.db) file. Set this to False when simply generating a report.

Returns:

expt – The updated experiment dictionary
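
The way task statuses roll up into an experiment status can be sketched as follows. This covers only the levels that can be derived from a single snapshot of task statuses (DEAD, DYING, RUNNING, QUEUED, SUCCEEDED, SUBMITTING). ERROR, STALLED, STUCK, and COMPLETE depend on database access or on history across monitoring cycles and are left out. The function name, the precedence order of the checks, and the UNKNOWN fallback are assumptions made for illustration, not the actual implementation.

```python
def summarize_status(task_statuses: list) -> str:
    """Combine per-task statuses into one experiment-level status."""
    statuses = set(task_statuses)
    if "DEAD" in statuses:
        # Fully DEAD only when every other submitted task is DEAD or SUCCEEDED;
        # otherwise tasks are still in flight and the experiment is DYING
        return "DEAD" if statuses <= {"DEAD", "SUCCEEDED"} else "DYING"
    if "RUNNING" in statuses:
        return "RUNNING"
    if "QUEUED" in statuses:
        return "QUEUED"
    if statuses == {"SUCCEEDED"}:
        return "SUCCEEDED"
    if statuses and statuses <= {"SUBMITTING", "SUCCEEDED"}:
        return "SUBMITTING"
    # ASSUMPTION: placeholder for combinations this sketch does not cover
    return "UNKNOWN"


print(summarize_status(["SUCCEEDED", "RUNNING", "QUEUED"]))
print(summarize_status(["SUCCEEDED", "DEAD", "RUNNING"]))
```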

update_expt_status_parallel(expts_dict: dict, procs: int, refresh: bool = False, debug: bool = False) → dict

This function updates an entire set of experiments in parallel, which can drastically speed up testing if given enough parallel processes. Given a dictionary of experiments, it passes each individual experiment dictionary to update_expt_status(), using starmap() from the Python multiprocessing module to update the experiments in parallel.

Parameters:

expts_dict (dict) – A dictionary containing information for all experiments
procs (int) – The number of parallel processes
refresh (bool) – “Refresh” flag to pass to update_expt_status(). If True, check each experiment's status even if it is listed as DEAD, ERROR, or COMPLETE. Used for initial checks of experiments that may have been restarted.
debug (bool) – If True, capture all output from rocotorun. This allows information such as job cards and job submit messages to appear in the log files, but it can drastically slow down the testing process.

Returns:

expts_dict – The updated dictionary of experiment dictionaries
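
The starmap() pattern described above can be sketched as follows. The worker fake_update is a hypothetical stand-in for update_expt_status() so that the example runs without Rocoto; build_args and update_parallel are likewise illustrative names, and the argument order simply mirrors the documented signature.

```python
from multiprocessing import Pool


def fake_update(expt: dict, name: str, refresh: bool, debug: bool) -> dict:
    """Hypothetical stand-in for update_expt_status(); only tags the experiment."""
    return {**expt, "checked": True, "name": name}


def build_args(expts_dict: dict, refresh: bool = False, debug: bool = False) -> list:
    """Build one argument tuple per experiment for starmap()."""
    return [(expt, name, refresh, debug) for name, expt in expts_dict.items()]


def update_parallel(expts_dict: dict, procs: int, refresh: bool = False,
                    debug: bool = False) -> dict:
    """Update every experiment using a pool of worker processes."""
    args = build_args(expts_dict, refresh, debug)
    with Pool(processes=procs) as pool:
        results = pool.starmap(fake_update, args)
    # starmap() returns results in input order, so they line up with the keys
    return dict(zip(expts_dict, results))


if __name__ == "__main__":
    expts = {"expt_a": {"status": "CREATED"}, "expt_b": {"status": "CREATED"}}
    print(update_parallel(expts, procs=2))
```

The worker is defined at module level so that it can be pickled and sent to the worker processes, and the __main__ guard keeps the pool from being created again when a child process imports the module.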

write_monitor_file(monitor_file: str, expts_dict: dict)

Writes the status of the tests to a file.

Parameters:

monitor_file (str) – File name
expts_dict (dict) – Experiments being monitored

Returns:

None

Raises:

KeyboardInterrupt – If a user attempts to interrupt program execution (e.g., with Ctrl+C) while the program is writing information to monitor_file.
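
One possible way to surface an interrupt during the write is sketched below. It is an assumption about the general pattern, not the actual implementation: the function name write_status_file is hypothetical, and the standard-library json module is used as a stand-in serializer so that the example stays self-contained, whereas the WE2E files referenced in this module are YAML.

```python
import json


def write_status_file(monitor_file: str, expts_dict: dict) -> None:
    """Write the experiment statuses to monitor_file, re-raising on Ctrl+C."""
    try:
        with open(monitor_file, "w", encoding="utf-8") as f:
            # ASSUMPTION: json stands in for the YAML writer used in practice
            json.dump(expts_dict, f, indent=2)
    except KeyboardInterrupt:
        # Warn that the file may be incomplete, then let the interrupt propagate
        print(f"Interrupted while writing {monitor_file}; the file may be incomplete")
        raise
```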