retrieve_data module

This script helps users pull data from known data streams, including URLs and HPSS (only on supported NOAA platforms), or from user-supplied data locations on disk.

Several supported data streams are included in parm/data_locations.yml, which provides locations and naming conventions for files commonly used with the SRW App. Provide the file to this tool via the --config flag. Users are welcome to provide their own file with alternative locations and naming conventions.

When using this script to pull from disk, the user is required to provide the path to the data location, which can include Python templates. The file names follow those included in the --config file by default or can be user-supplied via the --file_name flag. That flag takes a YAML-formatted string that follows the same conventions outlined in the parm/data_locations.yml file for naming files.

To see usage for this script:

python retrieve_data.py -h
arg_list_to_range(args)

Returns the sequence to process, given an argparse list.

The length of the list will determine what sequence items are returned:

  • Length = 1: A single item is to be processed

  • Length = 2: A sequence of start, stop with increment 1

  • Length = 3: A sequence of start, stop, increment

  • Length > 3: List as is

argparse should provide a list of at least one item (nargs='+').

The caller must ensure that the list contains integers.

Parameters:

args (list) – An argparse list argument

Returns:

args – The list of sequence items to process
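
The length-based rules above can be illustrated with a minimal sketch. It is not the module's actual source. In particular, whether the stop value is inclusive is not stated here, so the sketch assumes that it is.

```python
def arg_list_to_range(args):
    """Expand an argparse integer list into the sequence to process."""
    args = list(args)
    if len(args) in (2, 3):
        # Assumption: the stop value is inclusive, so add one to it
        start, stop, *step = args
        return list(range(start, stop + 1, *step))
    # Length 1 or length > 3: the list is returned as is
    return args


print(arg_list_to_range([6]))           # single item
print(arg_list_to_range([0, 3]))        # start, stop
print(arg_list_to_range([0, 12, 6]))    # start, stop, increment
print(arg_list_to_range([1, 2, 5, 9]))  # explicit list
```

Under the inclusive-stop assumption, a two-item list such as 0 3 expands to 0, 1, 2, 3, and 0 12 6 expands to 0, 6, 12.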

awscli_get_file(bucket, fname)

Download a file from an AWS S3 bucket, and place it in a target location on disk.

Parameters:
  • bucket – The bucket and directory where the file(s) are located

  • fname – The name of the file(s) to be retrieved. Can include glob wildcards

Returns:

Boolean value reflecting whether the copy was successful (True) or unsuccessful (False)

check_file(url)

Checks that a file exists at the expected URL.

Parameters:

url – URL for file to be checked

Returns:

Boolean value (True if status_code == 200 or False otherwise)

clean_up_output_dir(expected_subdir, local_archive, output_path, source_paths)

Removes expected subdirectories and existing_archive files on disk once all files have been extracted and put into the specified output location.

Parameters:
  • expected_subdir – Expected subdirectories

  • local_archive (str) – File name

  • output_path (str) – Path to a location on disk.

  • source_paths (str)

Returns:

unavailable (dict) – A dictionary of unavailable files

config_exists(arg)

Checks to ensure that the provided config file exists. If it does, load it with YAML’s safe loader and return the resulting dictionary.

Parameters:

arg (str) – Path to a configuration file

Returns:

cfg – A dictionary representation of the configuration file contents

copy_file(source, destination, copy_cmd)

Copies a file from a source and places it in the destination location. Assumes destination exists.

Parameters:
  • source (str) – Directory where file currently resides

  • destination (str) – Directory where the file should be placed

  • copy_cmd (str) – Copy command (e.g., cp, ln -sf)

Returns:

A boolean value reflecting whether the copy was successful (True) or unsuccessful (False)
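
One way such a function could be written is sketched below, assuming a POSIX system where the copy command is available on the PATH. The use of shlex and subprocess.run is an illustrative choice, not necessarily what the module does.

```python
import shlex
import subprocess


def copy_file(source, destination, copy_cmd):
    """Run copy_cmd on source and destination; return True on success."""
    # Split the command so multi-word commands such as "ln -sf" work
    cmd = shlex.split(copy_cmd) + [source, destination]
    try:
        subprocess.run(cmd, check=True, capture_output=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Non-zero exit status, or the copy command itself was not found
        return False
    return True
```

A call such as copy_file("/path/to/file", "/path/to/existing_dir", "cp") would then return True only if the command exits with status 0.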

create_target_path(target_path)

Appends target path and creates directory for ensemble members.

Parameters:

target_path (str) – Target path

Returns:

target_path

fill_template(template_str, cycle_date, templates_only=False, **kwargs)

Fills in the provided template string with date time information, and returns the resulting string.

Parameters:
  • template_str – A string containing Python templates

  • cycle_date – A datetime object that will be used to fill in date and time information

  • templates_only (bool) – When True, this function will only return the templates available.

Keyword Arguments:
  • ens_group (int) – A number associated with a bin where ensemble members are stored in archive files.

  • fcst_hr (int) – An integer forecast hour. String formatting should be included in the template_str.

  • mem (int) – A single ensemble member. Should be a positive integer value.

Returns:

Filled template string
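
The idea can be sketched with Python's str.format. The placeholder names below (yyyymmddhh, yyyymmdd, hh) and the example file name are hypothetical and chosen only for illustration; the names the script actually supports can be listed with templates_only=True.

```python
from datetime import datetime


def fill_template(template_str, cycle_date, templates_only=False, **kwargs):
    """Fill a format-style template with date, time, and ensemble values."""
    # Hypothetical placeholder names; the real script's names may differ
    values = {
        "yyyymmddhh": cycle_date.strftime("%Y%m%d%H"),
        "yyyymmdd": cycle_date.strftime("%Y%m%d"),
        "hh": cycle_date.strftime("%H"),
        "fcst_hr": kwargs.get("fcst_hr", 0),
        "mem": kwargs.get("mem", ""),
        "ens_group": kwargs.get("ens_group", ""),
    }
    if templates_only:
        # Only report which placeholders are available
        return list(values)
    return template_str.format(**values)


cycle = datetime(2023, 6, 1, 12)
# The number formatting (:03d) lives in the template, not in the function
print(fill_template("gfs.{yyyymmdd}/{hh}/gfs.t{hh}z.f{fcst_hr:03d}", cycle, fcst_hr=6))
```

With these assumed names, the call prints gfs.20230601/12/gfs.t12z.f006.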

find_archive_files(paths, file_names, cycle_date, ens_group)

Given an equal-length set of archive paths and archive file names, and a cycle date, checks HPSS via hsi to make sure at least one set exists. Returns a dict of the paths of the existing archives, along with the item in the set of paths that was found.

Parameters:
  • paths (list) – Archive paths

  • file_names (list) – Archive file names

  • cycle_date (int) – Cycle date (YYYYMMDDHH or YYYYMMDDHHmm format)

  • ens_group (int) – A number associated with a bin where ensemble members are stored in archive files

Returns:

A tuple containing (existing_archives, list_item) or ("", 0)

get_ens_groups(members)

Gets ensemble groups with the corresponding list of ensemble members in that group.

Parameters:

members (list) – List of ensemble members.

Returns:

ens_groups – A dictionary where keys are the ensemble group and values are lists of ensemble members requested in that group
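
A minimal sketch of the grouping follows. The bin size is not stated in this section, so the sketch assumes 10 members per group and exposes it as a hypothetical group_size parameter that the real function does not have.

```python
def get_ens_groups(members, group_size=10):
    """Map each ensemble group number to the requested members in it."""
    ens_groups = {}
    for mem in members:
        # Assumption: members 1-10 fall in group 1, 11-20 in group 2, ...
        ens_group = (mem - 1) // group_size + 1
        ens_groups.setdefault(ens_group, []).append(mem)
    return ens_groups


print(get_ens_groups([1, 2, 10, 11, 25]))
```

Under that assumption, members 1, 2, and 10 land in group 1, member 11 in group 2, and member 25 in group 3.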

get_file_templates(cla, known_data_info, data_store, use_cla_tmpl=False)

Returns the file templates requested by user input, either from the command line, or from the known data information dictionary.

Parameters:
  • cla (argparse.Namespace) – Command line arguments (Namespace object)

  • known_data_info (dict) – Dictionary from data_locations.yml file

  • data_store (str) – String corresponding to a key in the known_data_info dictionary

  • use_cla_tmpl (bool) – Whether to check command line arguments for templates

Returns:

file_templates – A list of file templates

get_requested_files(cla, file_templates, input_locs, method='disk', **kwargs)

Copies files from disk locations or downloads files from a URL or S3 bucket, depending on the option specified by the user.

This function expects that the output directory exists and is writeable.

Parameters:
  • cla (argparse.Namespace) – Command line arguments (Namespace object)

  • file_templates (list) – A list of file templates

  • input_locs (str) – A string containing a single data location (a URL, a disk path, or an AWS bucket/directory), or a list of these paths/URLs.

  • method (str) – Choice of "disk", "wget", or "awscli" to indicate protocol for retrieval

Keyword Arguments:
  • members (list) – A list of integers corresponding to the ensemble members

  • check_all (bool) – Flag that indicates whether all URLs should be checked for all files

Returns:

unavailable (list) – A list of locations/files that were unretrievable

hpss_requested_files(cla, file_names, store_specs, members=-1, ens_group=-1)

This function interacts with the “hpss” protocol in a provided data store specs file to download a set of files requested by the user. Depending on the type of archive file (zip or tar), it will either pull the entire file and unzip it or attempt to pull individual files from a tar file.

It cleans up the local disk after files are deemed available in order to remove any empty subdirectories that may still be present.

This function expects that the output directory is writable.

Parameters:
  • cla (argparse.Namespace) – Command line arguments (Namespace object)

  • file_names (list) – List of file names

  • store_specs (dict) – Data-store specifications (specs) file

  • members (list) – A list of integers corresponding to the ensemble members

  • ens_group (int) – A number associated with a bin where ensemble members are stored in archive files

Returns:

A Python set of unavailable files

Raises:

Exception – If there is an error running the archive extraction command

hsi_single_file(file_path, mode='ls')

Calls hsi as a Python subprocess and returns information about whether the file_path was found.

Parameters:
  • file_path (str) – File path on HPSS

  • mode (str) – The hsi command to run. ls is default. May also pass get to retrieve the file path.

load_str(arg)

Loads a dictionary string safely using YAML.

Parameters:

arg (str) – A string representation of a dictionary

Returns:

A resulting dictionary

main(argv)

Tries the known data locations and file paths in priority order.

Parameters:

argv (list) – List of command line arguments

pair_locs_with_files(input_locs, file_templates, check_all)

Given a list of input locations and files, return an iterable that contains the multiple locations and file templates for files that should be searched in those locations.

The different possibilities:

  1. Get one or more files from a single path/URL

  2. Get multiple files from multiple corresponding paths/URLs

  3. Check all paths for all file templates until files are found

The default will be to handle #1 and #2. #3 will be indicated by a flag in the YAML: check_all: True

Parameters:
  • input_locs (list) – Input locations

  • file_templates (list) – File templates

  • check_all (bool) – Flag that indicates whether all input locations should be checked for all available file templates

Returns:

locs_files (list) – Iterable containing multiple locations and file templates for files that should be searched in those locations
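
The three cases could be handled along the following lines. This is a sketch under assumptions: the shape of each returned pair (a location together with a list of templates) and the ValueError for mismatched lengths are illustrative choices, not confirmed behavior.

```python
def pair_locs_with_files(input_locs, file_templates, check_all):
    """Pair each input location with the file templates to look for there."""
    if check_all:
        # Case 3: every location is tried with every file template
        return [(loc, file_templates) for loc in input_locs]
    if len(input_locs) == 1:
        # Case 1: all templates come from the single location
        return [(input_locs[0], file_templates)]
    if len(input_locs) == len(file_templates):
        # Case 2: the i-th template belongs to the i-th location
        return [(loc, [tmpl]) for loc, tmpl in zip(input_locs, file_templates)]
    # Assumption: any other combination is treated as a usage error
    raise ValueError("Locations and file templates cannot be paired")


print(pair_locs_with_files(["/a", "/b"], ["x.grib2", "y.grib2"], check_all=False))
print(pair_locs_with_files(["/a", "/b"], ["x.grib2", "y.grib2"], check_all=True))
```

The first call pairs /a with x.grib2 and /b with y.grib2. The second pairs each location with both templates.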

parse_args(argv)

Maintains the arguments accepted by this script. Please see Python’s argparse documentation for more information about settings of each argument.

Parameters:

argv (list) – Command line arguments to parse

Returns:

args – An argparse.Namespace object (parser.parse_args(argv))

path_exists(arg)

Checks whether the supplied path exists and is writable.

Parameters:

arg (str) – File path

Returns:

File path or error message

Raises:

argparse.ArgumentTypeError – If the path does not exist or is not writable
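
Functions like this are typically passed to argparse through the type argument, so that an invalid path is reported as a usage error. A minimal sketch follows; the error messages and the --output_path flag name are assumptions made for illustration.

```python
import argparse
import os


def path_exists(arg):
    """Return arg unchanged if it is an existing, writable path."""
    if not os.path.exists(arg):
        raise argparse.ArgumentTypeError(f"{arg} does not exist")
    if not os.access(arg, os.W_OK):
        raise argparse.ArgumentTypeError(f"{arg} is not writable")
    return arg


parser = argparse.ArgumentParser()
# Assumed flag name; argparse calls path_exists on the supplied string
parser.add_argument("--output_path", type=path_exists)
```

When the type function raises argparse.ArgumentTypeError, argparse prints the message and exits instead of showing a traceback.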

to_datetime(arg)

Converts a string to a datetime object.

Parameters:

arg (str) – String like YYYYMMDDHH or YYYYMMDDHHmm

Returns:

A datetime object
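
A minimal sketch of this conversion selects the strptime format from the string length. The ValueError for other lengths is an assumption; the actual function may handle that case differently.

```python
from datetime import datetime


def to_datetime(arg):
    """Parse a YYYYMMDDHH or YYYYMMDDHHmm string into a datetime."""
    if len(arg) == 10:
        fmt = "%Y%m%d%H"
    elif len(arg) == 12:
        fmt = "%Y%m%d%H%M"
    else:
        # Assumption: other lengths are rejected outright
        raise ValueError(f"Expected YYYYMMDDHH or YYYYMMDDHHmm, got {arg}")
    return datetime.strptime(arg, fmt)


print(to_datetime("2023060112"))
print(to_datetime("202306011230"))
```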

to_lower(arg)

Converts a string to lowercase.

Parameters:

arg (str) – Any string

Returns:

An all-lowercase string

wget_file(url)

Download a file from a URL source, and place it in a target location on disk.

Parameters:

url – URL for file to be retrieved

Returns:

Boolean value reflecting whether the download was successful (True) or unsuccessful (False)