.. _QuickstartC:

====================================
Container-Based Quick Start Guide
====================================

This Container-Based Quick Start Guide will help users build and run the "out-of-the-box" case for the Unified Forecast System (:term:`UFS`) Short-Range Weather (SRW) Application using a `Singularity/Apptainer `__ container. The :term:`container` approach provides a uniform environment in which to build and run the SRW App. Normally, the details of building and running the SRW App vary from system to system due to the many possible combinations of operating systems, compilers, :term:`MPIs `, and package versions available. Installation via container reduces this variability and allows for a smoother SRW App build experience.

The basic "out-of-the-box" case described in this User's Guide builds a weather forecast for June 15-16, 2019. Multiple convective weather events during these two days produced over 200 filtered storm reports. Severe weather was clustered in two areas: the Upper Midwest through the Ohio Valley and the Southern Great Plains. This forecast uses a predefined 25-km Continental United States (:term:`CONUS`) grid (RRFS_CONUS_25km), the Global Forecast System (:term:`GFS`) version 16 physics suite (FV3_GFS_v16 :term:`CCPP`), and :term:`FV3`-based GFS raw external model data for initialization.

.. attention::

   * The SRW Application has :srw-wiki:`four levels of support `. The steps described in this chapter will work most smoothly on preconfigured (Level 1) systems. However, this guide can serve as a starting point for running the SRW App on other systems, too.
   * This chapter of the User's Guide should **only** be used for container builds. For non-container builds, see :numref:`Section %s ` for a Quick Start Guide or :numref:`Section %s ` for a detailed guide to building the SRW App **without** a container.

.. _DownloadCodeC:

Download the Container
==========================

Prerequisites
-------------------

**Intel Compiler and MPI**

Users must have an **Intel** compiler and :term:`MPI` (`available for free here `__) in order to run the SRW App in the container provided using the method described in this chapter. Additionally, it is recommended that users install the `Rocoto workflow manager `__ on their system in order to take advantage of automated workflow options. Although it is possible to run an experiment without Rocoto, and some tips are provided, the only fully-supported and tested container option assumes that Rocoto is preinstalled.

**Install Singularity/Apptainer**

To build and run the SRW App using a Singularity/Apptainer container, first install the software according to the `Apptainer Installation Guide `__. This will include the installation of all dependencies.

.. note::

   As of November 2021, the Linux-supported version of Singularity has been `renamed `__ to *Apptainer*. Apptainer has maintained compatibility with Singularity, so ``singularity`` commands should work with either Singularity or Apptainer (see compatibility details `here `__).

.. attention::

   Docker containers can only be run with root privileges, and users cannot have root privileges on :term:`HPCs `. Therefore, it is not possible to build the SRW App, which uses the spack-stack, inside a Docker container on an HPC system. However, a Singularity/Apptainer image may be built directly from a Docker image for use on the system.

.. _work-on-hpc:

Working in the Cloud or on HPC Systems
-----------------------------------------

Users working on systems with limited disk space in their ``/home`` directory may need to set the ``SINGULARITY_CACHEDIR`` and ``SINGULARITY_TMPDIR`` environment variables to point to a location with adequate disk space. For example:

.. code-block::

   export SINGULARITY_CACHEDIR=/absolute/path/to/writable/directory/cache
   export SINGULARITY_TMPDIR=/absolute/path/to/writable/directory/tmp

where ``/absolute/path/to/writable/directory/`` refers to the absolute path to a writable directory with sufficient disk space. If the ``cache`` and ``tmp`` directories do not exist already, they must be created with a ``mkdir`` command. See :numref:`Section %s ` to view an example of how this can be done.

.. _BuildC:

Build the Container
------------------------

* :ref:`On Level 1 Systems ` (see :srw-wiki:`list `)
* :ref:`On Level 2-4 Systems `

.. hint::

   If a ``singularity: command not found`` error message appears when working on Level 1 platforms, try running: ``module load singularity`` or (on Derecho) ``module load apptainer``.

.. _container-L1:

Level 1 Systems
^^^^^^^^^^^^^^^^^^

On most Level 1 systems, a container named ``ubuntu22.04-intel-ue-1.6.0-srw-dev.img`` has already been built at the following locations:

.. list-table:: Locations of pre-built containers
   :widths: 20 50
   :header-rows: 1

   * - Machine
     - File Location
   * - Derecho [#fn]_
     - /glade/work/epicufsrt/contrib/containers
   * - Gaea-C5 [#fn]_
     - /gpfs/f5/epic/world-shared/containers
   * - Gaea-C6 [#fn]_
     - /gpfs/f6/bil-fire8/world-shared/containers
   * - Hera
     - /scratch1/NCEPDEV/nems/role.epic/containers
   * - Jet
     - /mnt/lfs5/HFIP/hfv3gfs/role.epic/containers
   * - NOAA Cloud [#fn]_
     - /contrib/EPIC/containers
   * - Orion/Hercules
     - /work/noaa/epic/role-epic/contrib/containers

.. [#fn] On these systems, container testing shows inconsistent results.

.. note::

   * The NOAA Cloud containers are accessible only to those with EPIC resources.

Users can simply set an environment variable to point to the container:

.. code-block:: console

   export img=/path/to/ubuntu22.04-intel-ue-1.6.0-srw-dev.img

Users may convert the container ``.img`` file to a writable sandbox:

.. code-block:: console

   singularity build --sandbox ubuntu22.04-intel-ue-1.6.0-srw-dev $img

When making a writable sandbox on Level 1 systems, the following warnings commonly appear and can be ignored:

.. code-block:: console

   INFO:    Starting build...
   INFO:    Verifying bootstrap image ubuntu22.04-intel-ue-1.6.0-srw-dev.img
   WARNING: integrity: signature not found for object group 1
   WARNING: Bootstrap image could not be verified, but build will continue.

.. _container-L2-4:

Level 2-4 Systems
^^^^^^^^^^^^^^^^^^^^^

On non-Level 1 systems, users should build the container in a writable sandbox:

.. code-block:: console

   sudo singularity build --sandbox ubuntu22.04-intel-ue-1.6.0-srw-dev docker://noaaepic/ubuntu22.04-intel21.10-srw:ue160-fms202401-dev

Some users may prefer to issue the command without the ``sudo`` prefix. Whether ``sudo`` is required is system-dependent.

.. note::

   Users can choose to build a release version of the container using a similar command:

   .. code-block:: console

      sudo singularity build --sandbox ubuntu22.04-intel-srw-release-public-v3.0.0 docker://noaaepic/ubuntu22.04-intel21.10-srw:ue160-fms202401-release3

For easier reference, users can set an environment variable to point to the container:

.. code-block:: console

   export img=/path/to/ubuntu22.04-intel-ue-1.6.0-srw-dev

.. _RunContainer:

Start Up the Container
----------------------

Copy ``stage-srw.sh`` from the container to the local working directory:

.. code-block:: console

   singularity exec -B /:/ $img cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .

If the command worked properly, ``stage-srw.sh`` should appear in the local directory. The command above also binds the local directory to the container so that data can be shared between them. On :srw-wiki:`Level 1 ` systems, the local base directory is usually the topmost directory (e.g., ``/lustre``, ``/contrib``, ``/work``, or ``/home``). Additional directories can be bound by adding another ``-B /:/`` argument before the name of the container.
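
The repeated ``-B`` pattern described above can be sketched as a small shell helper. This is a minimal illustration and not part of the SRW App: ``bind_args`` is a hypothetical function name, and ``/lustre`` and ``/contrib`` are example directories only. The sketch assumes that each host directory is bound under the same name inside the container and that directory names contain no spaces.

.. code-block:: bash

   # Hypothetical helper: print one "-B <dir>:<dir>" pair for each host directory given.
   bind_args() {
       args=""
       for dir in "$@"; do
           # Append a space-separated "-B dir:dir" pair; no leading space on the first pair.
           args="${args:+$args }-B ${dir}:${dir}"
       done
       printf '%s\n' "$args"
   }

   # Print (rather than run) the resulting command so that it can be inspected first.
   # /lustre and /contrib are placeholders; substitute directories on the local system.
   echo "singularity exec $(bind_args /lustre /contrib) \$img ls /"

Printing the command before running it makes it easy to confirm that every directory holding experiment data appears in a ``-B`` argument.
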
In general, it is recommended that the local base directory and container directory have the same name. For example, if the host system's top-level directory is ``/user1234``, the user can create a ``user1234`` directory in the writable container sandbox and then bind it:

.. code-block:: console

   mkdir /path/to/container/user1234
   singularity exec -B /user1234:/user1234 $img cp /opt/ufs-srweather-app/container-scripts/stage-srw.sh .

.. attention::

   Be sure to bind the directory that contains the experiment data!

To explore the container and view available directories, users can either ``cd`` into the container and run ``ls`` (if it was built as a sandbox) or run the following commands:

.. code-block:: console

   singularity shell $img
   cd /
   ls

The list of directories printed will be similar to this:

.. code-block:: console

   autofs   dev          gpfs        lfs2   lib64   ncrc  sbin         srv                       u
   bin      discover     home        lfs3   libx32  opt   scratch      sw                        usr
   boot     environment  host_lib64  lfs4   lustre  proc  scratch1     sys                       usw
   contrib  etc          lfs         lib    media   root  scratch2     third-party-programs.txt  var
   data     glade        lfs1        lib32  mnt     run   singularity  tmp                       work

Users can run ``exit`` to exit the shell.

Download and Stage the Data
============================

The SRW App requires input files to run. These include static datasets, initial and boundary condition files, and model configuration files. On Level 1 systems, the data required to run SRW App tests are already available as long as the bind argument (starting with ``-B``) in :numref:`Step %s ` included the directory with the input model data. See :numref:`Table %s ` for Level 1 data locations. For Level 2-4 systems, the data must be added manually by the user. In general, users can download fix file data and experiment data (:term:`ICs/LBCs`) from the `SRW App Data Bucket `__ and then untar it:

.. code-block:: console

   wget https://noaa-ufs-srw-pds.s3.amazonaws.com/experiment-user-cases/release-public-v3.0.0/out-of-the-box/fix_data.tgz
   wget https://noaa-ufs-srw-pds.s3.amazonaws.com/experiment-user-cases/release-public-v3.0.0/out-of-the-box/gst_data.tgz
   tar -xzf fix_data.tgz
   tar -xzf gst_data.tgz

More detailed information can be found in :numref:`Section %s `. Sections :numref:`%s ` and :numref:`%s ` contain useful background information on the input and output files used in the SRW App.

.. _GenerateForecastC:

Generate the Forecast Experiment
=================================

To generate the forecast experiment, users must:

#. :ref:`Stage the container `
#. :ref:`Set experiment parameters to configure the workflow `
#. :ref:`Run a script to generate the experiment workflow `

The first two steps depend on the platform being used and are described here for Level 1 platforms. Users will need to adjust the instructions to match their machine configuration if their local machine is a Level 2-4 platform.

.. _SetUpCont:

Stage the Container
------------------------

To set up the container with your host system, run the ``stage-srw.sh`` script:

.. code-block:: console

   ./stage-srw.sh -c= -m= -p= -i=$img

where:

* ``-c`` indicates the compiler on the user's local machine (e.g., ``intel/2022.1.2``, ``intel-oneapi-compilers/2022.2.1``, ``intel/2023.2.0``)
* ``-m`` indicates the :term:`MPI` on the user's local machine (e.g., ``impi/2022.1.2``, ``intel-oneapi-mpi/2021.7.1``, ``cray-mpich/8.1.28``)
* ``-p`` indicates the local machine (e.g., ``hera``, ``jet``, ``noaacloud``). See ``MACHINE`` in :numref:`Section %s ` for a full list of options.
* ``-i`` indicates the full path to the container image that was built in :numref:`Step %s ` (``ubuntu22.04-intel-ue-1.6.0-srw-dev`` or ``ubuntu22.04-intel-ue-1.6.0-srw-dev.img`` by default).

For example, on Hera, the command would be:

.. code-block:: console

   ./stage-srw.sh -c=intel/2022.1.2 -m=impi/2022.1.2 -p=hera -i=$img

.. attention::

   The user must have an Intel compiler and MPI on their system because the container uses an Intel compiler and MPI. Intel compilers are now available for free as part of the `Intel oneAPI Toolkit `__.

After this command runs, the working directory should contain the ``srw.sh`` script and a ``ufs-srweather-app`` directory.

.. _SetUpConfigFileC:

Configure the Workflow
---------------------------

Configuring the workflow for the container is similar to configuring the workflow without a container. The only difference is that there is no need to activate the ``srw_app`` conda environment because the container's conda conflicts with the host's conda. To get around this conflict, the ``stage-srw.sh`` script appends the container's conda environment bin directory to the system's ``PATH`` variable in the ``python_srw.lua`` and ``build__intel.lua`` modulefiles.

Activate the workflow by running the following commands:

.. code-block:: console

   module use ufs-srweather-app/modulefiles
   module load wflow_

where the suffix after ``wflow_`` is a valid, lowercased machine/platform name (see the ``MACHINE`` variable in :numref:`Section %s `).

From here, users can follow the steps below to configure the out-of-the-box SRW App case with an automated Rocoto workflow. For more detailed instructions on experiment configuration, users can refer to :numref:`Section %s `.

#. Copy the out-of-the-box case from ``config.community.yaml`` to ``config.yaml``. This file contains basic information (e.g., forecast date, grid, physics suite) required for the experiment.

   .. code-block:: console

      cd ufs-srweather-app/ush
      cp config.community.yaml config.yaml

   The default settings include a predefined 25-km :term:`CONUS` grid (RRFS_CONUS_25km), the :term:`GFS` v16 physics suite (FV3_GFS_v16 :term:`CCPP`), and :term:`FV3`-based GFS raw external model data for initialization.

#. Edit the ``MACHINE`` and ``ACCOUNT`` variables in the ``user:`` section of ``config.yaml``. See :numref:`Section %s ` for details on valid values.

   .. note::

      On ``JET``, users must also add ``PARTITION_DEFAULT: xjet`` and ``PARTITION_FCST: xjet`` to the ``platform:`` section of the ``config.yaml`` file.

#. To automate the workflow, add these two lines to the ``workflow:`` section of ``config.yaml``:

   .. code-block:: console

      USE_CRON_TO_RELAUNCH: TRUE
      CRON_RELAUNCH_INTVL_MNTS: 3

   Instructions for running the experiment via other methods appear in :numref:`Section %s `. However, this technique (automation via :term:`crontab`) is the simplest option.

   .. note::

      On Orion, *cron* is only available on the orion-login-1 node, so users will need to work on that node when running *cron* jobs on Orion.

#. Edit the ``task_get_extrn_ics:`` section of the ``config.yaml`` to include the correct data paths to the initial conditions files. For example, on Hera, add:

   .. code-block:: console

      USE_USER_STAGED_EXTRN_FILES: true
      EXTRN_MDL_SOURCE_BASEDIR_ICS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/v3p0/input_model_data/FV3GFS/grib2/${yyyymmddhh}

   On other systems, users will need to change the path for ``EXTRN_MDL_SOURCE_BASEDIR_ICS`` and ``EXTRN_MDL_SOURCE_BASEDIR_LBCS`` (below) to reflect the location of the system's data. The location of the machine's global data can be viewed :ref:`here ` for Level 1 systems. Alternatively, the user can add the path to their local data if they downloaded it as described in :numref:`Section %s `.

#. Edit the ``task_get_extrn_lbcs:`` section of the ``config.yaml`` to include the correct data paths to the lateral boundary conditions files. For example, on Hera, add:

   .. code-block:: console

      USE_USER_STAGED_EXTRN_FILES: true
      EXTRN_MDL_SOURCE_BASEDIR_LBCS: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/v3p0/input_model_data/FV3GFS/grib2/${yyyymmddhh}

.. _GenerateWorkflowC:

Generate the Workflow
-----------------------------

.. attention::

   This section assumes that Rocoto is installed on the user's machine. If it is not, the user will need to allocate a compute node (described in the :ref:`Appendix `) and run the workflow using standalone scripts as described in :numref:`Section %s `.

Run the following command to generate the workflow:

.. code-block:: console

   ./generate_FV3LAM_wflow.py

This workflow generation script creates an experiment directory and populates it with all the data needed to run through the workflow. The last line of output from this script should start with ``*/3 * * * *`` (or similar).

The generated workflow will be in the experiment directory specified in the ``config.yaml`` file in :numref:`Step %s `. The default location is ``expt_dirs/test_community``. To view experiment progress, users can ``cd`` to the experiment directory from ``ufs-srweather-app/ush`` and run the ``rocotostat`` command to check the experiment's status:

.. code-block:: console

   cd ../../expt_dirs/test_community
   rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10

Users can track the experiment's progress by reissuing the ``rocotostat`` command above every so often until the experiment runs to completion. The following message usually means that the experiment is still getting set up:

.. code-block:: console

   08/04/23 17:34:32 UTC :: FV3LAM_wflow.xml :: ERROR: Can not open FV3LAM_wflow.db read-only because it does not exist

After a few (3-5) minutes, ``rocotostat`` should show a status-monitoring table:

.. code-block:: console

   CYCLE          TASK             JOBID      STATE    EXIT STATUS   TRIES   DURATION
   ==================================================================================
   201906151800   make_grid        53583094   QUEUED   -             0       0.0
   201906151800   make_orog        -          -        -             -       -
   201906151800   make_sfc_climo   -          -        -             -       -
   201906151800   get_extrn_ics    53583095   QUEUED   -             0       0.0
   201906151800   get_extrn_lbcs   53583096   QUEUED   -             0       0.0
   201906151800   make_ics         -          -        -             -       -
   201906151800   make_lbcs        -          -        -             -       -
   201906151800   run_fcst         -          -        -             -       -
   201906151800   run_post_f000    -          -        -             -       -
   ...
   201906151800   run_post_f012    -          -        -             -       -

When all tasks show ``SUCCEEDED``, the experiment has completed successfully.

For users who do not have Rocoto installed, see :numref:`Section %s ` for guidance on how to run the workflow without Rocoto.

Troubleshooting
------------------

If a task goes DEAD, it will be necessary to restart it according to the instructions in :numref:`Section %s `. To determine what caused the task to go DEAD, users should view the task's log file in ``$EXPTDIR/log/``. After fixing the problem and clearing the DEAD task, it is sometimes necessary to reinitialize the crontab. Run ``crontab -e`` to open your configured editor. Inside the editor, copy-paste the crontab command from the bottom of the ``$EXPTDIR/log.generate_FV3LAM_wflow`` file into the crontab:

.. code-block:: console

   crontab -e
   */3 * * * * cd /path/to/expt_dirs/test_community && ./launch_FV3LAM_wflow.sh called_from_cron="TRUE"

where ``/path/to`` is replaced by the actual path to the user's experiment directory.

New Experiment
===============

To run a new experiment in the container at a later time, users will need to rerun the commands in :numref:`Section %s ` to reactivate the workflow. Then, users can configure a new experiment by updating the experiment variables in ``config.yaml`` to reflect the desired experiment configuration. Basic instructions appear in :numref:`Section %s ` above, and detailed instructions can be viewed in :numref:`Section %s `. After adjusting the configuration file, regenerate the experiment by running ``./generate_FV3LAM_wflow.py``.

.. _appendix:

Appendix
==========

.. _work-on-hpc-details:

Sample Commands for Working in the Cloud or on HPC Systems
-----------------------------------------------------------

Users working on systems with limited disk space in their ``/home`` directory may set the ``SINGULARITY_CACHEDIR`` and ``SINGULARITY_TMPDIR`` environment variables to point to a location with adequate disk space. On NOAA Cloud systems, the ``sudo su``/``exit`` commands may also be required; users on other systems may be able to omit these. For example:

.. code-block::

   mkdir /lustre/cache
   mkdir /lustre/tmp
   sudo su
   export SINGULARITY_CACHEDIR=/lustre/cache
   export SINGULARITY_TMPDIR=/lustre/tmp
   exit

.. note::

   ``/lustre`` is a fast but non-persistent file system used on NOAA Cloud systems. To retain work completed in this directory, `tar the files `__ and move them to the ``/contrib`` directory, which is much slower but persistent.

.. _allocate-compute-node:

Allocate a Compute Node
--------------------------

Users working on HPC systems that do **not** have Rocoto installed must `install Rocoto `__ or allocate a compute node. All other users may :ref:`continue to start up the container `.

.. note::

   All NOAA Level 1 systems have Rocoto pre-installed.

The appropriate commands for allocating a compute node will vary based on the user's system and resource manager (e.g., Slurm, PBS). If the user's system has the Slurm resource manager, the allocation command will follow this pattern:

.. code-block:: console

   salloc -N 1 -n -A -t