retrieve_data module
This script helps users pull data from known data streams, including URLs and HPSS (only on supported NOAA platforms), or from user-supplied data locations on disk.
Several supported data streams are included in parm/data_locations.yml, which provides locations and naming conventions for files commonly used with the SRW App. Provide the file to this tool via the --config flag. Users are welcome to provide their own file with alternative locations and naming conventions.
When using this script to pull from disk, the user is required to provide the path to the data location, which can include Python templates. The file names follow those included in the --config file by default or can be user-supplied via the --file_name flag. That flag takes a YAML-formatted string that follows the same conventions outlined in the parm/data_locations.yml file for naming files.
To see usage for this script:
python retrieve_data.py -h
- arg_list_to_range(args)
Returns the sequence to process, given an argparse list. The length of the list determines which sequence items are returned:
Length = 1: A single item is to be processed
Length = 2: A sequence of start, stop with increment 1
Length = 3: A sequence of start, stop, increment
Length > 3: List as is
argparse should provide a list of at least one item (nargs='+'). The list must contain integers.
- Parameters:
args (list) – An argparse list argument
- Returns:
args – An argparse list
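The length rules above can be illustrated with a minimal sketch. The function name list_to_sequence is hypothetical, and the sketch assumes that the stop value is included in the returned sequence and that increments are positive; the description above does not state either point.

```python
def list_to_sequence(args):
    """Expand an argparse-style list of integers into the sequence to process."""
    args = list(args)
    if len(args) in (2, 3):
        # Assumption: stop is inclusive, so the range is extended by one.
        # This only makes sense for positive increments.
        start, stop, *step = args
        return list(range(start, stop + 1, *step))
    # Length 1 or length > 3: return the list as is.
    return args


single = list_to_sequence([6])           # one item
pair = list_to_sequence([0, 3])          # start, stop with increment 1
triple = list_to_sequence([0, 6, 3])     # start, stop, increment
explicit = list_to_sequence([1, 2, 5, 9])  # longer lists pass through
```

Under these assumptions, [0, 6, 3] expands to [0, 3, 6], while a four-item list is returned unchanged.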
- awscli_get_file(bucket, fname)
Download a file from an AWS S3 bucket, and place it in a target location on disk.
- Parameters:
bucket – The bucket and directory where the file(s) are located
fname – The name of the file(s) to be retrieved. Can include glob wildcards
- Returns:
Boolean value reflecting whether the copy was successful (True) or unsuccessful (False)
- check_file(url)
Checks that a file exists at the expected URL.
- Parameters:
url – URL for file to be checked
- Returns:
Boolean value (True if status_code == 200, or False otherwise)
- clean_up_output_dir(expected_subdir, local_archive, output_path, source_paths)
Removes expected subdirectories and existing_archive files on disk once all files have been extracted and put into the specified output location.
- config_exists(arg)
Checks to ensure that the provided config file exists. If it does, load it with YAML’s safe loader and return the resulting dictionary.
- Parameters:
arg (str) – Path to a configuration file
- Returns:
cfg – A dictionary representation of the configuration file contents
- copy_file(source, destination, copy_cmd)
Copies a file from a source and places it in the destination location. Assumes destination exists.
- create_target_path(target_path)
Appends target path and creates directory for ensemble members
- Parameters:
target_path (str) – Target path
- Returns:
target_path
- fill_template(template_str, cycle_date, templates_only=False, **kwargs)
Fills in the provided template string with date time information, and returns the resulting string.
- Parameters:
template_str – A string containing Python templates
cycle_date – A datetime object that will be used to fill in date and time information
templates_only (bool) – When True, this function will only return the templates available.
- Keyword Arguments:
- Returns:
Filled template string
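As a rough illustration of how a date-based template can be filled, the sketch below uses Python's str.format. The function name fill_template_sketch and the template keys (yyyy, mm, dd, hh, yyyymmdd, yyyymmddhh, fcst_hr) are assumptions chosen for this example; the keys that the script actually supports are not listed here.

```python
from datetime import datetime


def fill_template_sketch(template_str, cycle_date, **kwargs):
    """Fill a format-style template with date/time fields from cycle_date."""
    # Hypothetical template keys derived from the cycle date.
    values = {
        "yyyy": cycle_date.strftime("%Y"),
        "mm": cycle_date.strftime("%m"),
        "dd": cycle_date.strftime("%d"),
        "hh": cycle_date.strftime("%H"),
        "yyyymmdd": cycle_date.strftime("%Y%m%d"),
        "yyyymmddhh": cycle_date.strftime("%Y%m%d%H"),
    }
    # Extra keyword arguments (for example a forecast hour) are passed through.
    values.update(kwargs)
    return template_str.format(**values)


filled = fill_template_sketch(
    "gfs.{yyyymmdd}/{hh}/gfs.t{hh}z.f{fcst_hr:03d}",
    datetime(2023, 6, 1, 12),
    fcst_hr=6,
)
```

Here filled evaluates to gfs.20230601/12/gfs.t12z.f006.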
- find_archive_files(paths, file_names, cycle_date, ens_group)
Given equal-length sets of archive paths and archive file names, and a cycle date, checks HPSS via hsi to make sure at least one set exists. Returns a dictionary of the paths of the existing archive, along with the item in the set of paths that was found.
- Parameters:
- Returns:
A tuple containing (existing_archives, list_item) or (“”, 0)
- get_ens_groups(members)
Gets ensemble groups with the corresponding list of ensemble members in that group.
- Parameters:
members (list) – List of ensemble members.
- Returns:
ens_groups – A dictionary where keys are the ensemble group and values are lists of ensemble members requested in that group
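The shape of the returned dictionary can be sketched as follows. The function name group_members and the bin size of 10 members per group are assumptions made for illustration; the grouping rule is not stated in this description.

```python
def group_members(members, group_size=10):
    """Map each ensemble group number to the requested members in that group."""
    groups = {}
    for mem in members:
        # Assumption: members 1-10 fall in group 1, 11-20 in group 2, and so on.
        group = (mem - 1) // group_size + 1
        groups.setdefault(group, []).append(mem)
    return groups


ens_groups = group_members([1, 2, 10, 11, 25])
```

With this binning, ens_groups is {1: [1, 2, 10], 2: [11], 3: [25]}.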
- get_file_templates(cla, known_data_info, data_store, use_cla_tmpl=False)
Returns the file templates requested by user input, either from the command line, or from the known data information dictionary.
- Parameters:
- Returns:
file_templates – A list of file templates
- get_requested_files(cla, file_templates, input_locs, method='disk', **kwargs)
Copies files from disk locations or downloads files from a URL or S3 bucket, depending on the option specified by the user.
This function expects that the output directory exists and is writeable.
- Parameters:
cla (argparse.Namespace) – Command line arguments (Namespace object)
file_templates (list) – A list of file templates
input_locs (str) – A string containing a single data location, either a URL, a disk path, or an AWS bucket/directory, or a list of these paths/URLs.
method (str) – Choice of "disk", "wget", or "awscli" to indicate protocol for retrieval
- Keyword Arguments:
- Returns:
unavailable (list) – A list of locations/files that were unretrievable
- hpss_requested_files(cla, file_names, store_specs, members=-1, ens_group=-1)
This function interacts with the “hpss” protocol in a provided data store specs file to download a set of files requested by the user. Depending on the type of archive file (zip or tar), it will either pull the entire file and unzip it or attempt to pull individual files from a tar file.
It cleans up the local disk after files are deemed available in order to remove any empty subdirectories that may still be present.
This function expects that the output directory is writable.
- Parameters:
cla (argparse.Namespace) – Command line arguments (Namespace object)
file_names (list) – List of file names
store_specs (dict) – Data-store specifications (specs) file
members (list) – A list of integers corresponding to the ensemble members
ens_group (int) – A number associated with a bin where ensemble members are stored in archive files
- Returns:
A Python set of unavailable files
- Raises:
Exception – If there is an error running the archive extraction command
- hsi_single_file(file_path, mode='ls')
Calls hsi as a subprocess for Python and returns information about whether the file_path was found.
- load_str(arg)
Loads a dictionary string safely using YAML.
- Parameters:
arg (str) – A string representation of a dictionary
- Returns:
A resulting dictionary
- main(argv)
Uses known location information to try the known locations and file paths in priority order.
- Parameters:
argv (list) – List of command line arguments
- pair_locs_with_files(input_locs, file_templates, check_all)
Given a list of input locations and files, return an iterable that contains the multiple locations and file templates for files that should be searched in those locations.
The different possibilities:
1. Get one or more files from a single path/URL
2. Get multiple files from multiple corresponding paths/URLs
3. Check all paths for all file templates until files are found
The default will be to handle #1 and #2. #3 will be indicated by a flag in the YAML: check_all: True
- Parameters:
- Returns:
locs_files (list) – Iterable containing multiple locations and file templates for files that should be searched in those locations
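The three cases described above can be sketched as follows. The function name pair_sketch, the error raised on mismatched lengths, and the exact shape of the returned pairs (a location paired with a list of templates) are assumptions for illustration, not a description of the actual return structure.

```python
def pair_sketch(input_locs, file_templates, check_all=False):
    """Pair each location with the file templates to search for there."""
    if check_all:
        # Case 3: every location is searched for every template.
        return [(loc, file_templates) for loc in input_locs]
    if len(input_locs) == 1:
        # Case 1: one or more files from a single path/URL.
        return [(input_locs[0], file_templates)]
    if len(input_locs) == len(file_templates):
        # Case 2: each location corresponds to exactly one template.
        return [(loc, [tmpl]) for loc, tmpl in zip(input_locs, file_templates)]
    # Assumption: mismatched lengths are treated as a user error.
    raise ValueError("Locations and file templates cannot be paired")


one_loc = pair_sketch(["/data"], ["a.grib2", "b.grib2"])
matched = pair_sketch(["/data/a", "/data/b"], ["a.grib2", "b.grib2"])
all_locs = pair_sketch(["/p1", "/p2"], ["a.grib2"], check_all=True)
```

In this sketch, matched pairs /data/a with a.grib2 and /data/b with b.grib2, while all_locs searches both paths for the same template.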
- parse_args(argv)
Maintains the arguments accepted by this script. Please see Python’s argparse documentation for more information about settings of each argument.
- Parameters:
argv (list) – Command line arguments to parse
- Returns:
args – An argparse.Namespace object (parser.parse_args(argv))
- path_exists(arg)
Check whether the supplied path exists and is writeable
- Parameters:
arg (str) – File path
- Returns:
File path or error message
- Raises:
argparse.ArgumentTypeError – If the path does not exist or is not writable
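A validator of this kind is typically passed to argparse through the type argument. The sketch below shows one way that could look; the function name writable_path, the error messages, and the --output_path flag are hypothetical and used only for this example.

```python
import argparse
import os
import tempfile


def writable_path(arg):
    """Return arg unchanged if it exists and is writable; otherwise raise."""
    if not os.path.exists(arg):
        raise argparse.ArgumentTypeError(f"{arg} does not exist")
    if not os.access(arg, os.W_OK):
        raise argparse.ArgumentTypeError(f"{arg} is not writable")
    return arg


# argparse calls the validator on the raw string and reports any
# ArgumentTypeError as a usage error.
parser = argparse.ArgumentParser()
parser.add_argument("--output_path", type=writable_path)
args = parser.parse_args(["--output_path", tempfile.gettempdir()])
```

Raising argparse.ArgumentTypeError lets argparse print the message alongside the usage text instead of a traceback.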
- to_datetime(arg)
Converts a string to a datetime object
- Parameters:
arg (str) – String like YYYYMMDDHH or YYYYMMDDHHmm
- Returns:
A datetime object
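One way to handle both accepted string forms is to choose the strptime format by string length, as in the sketch below. The function name to_datetime_sketch and the choice to raise ValueError on other lengths are assumptions for illustration.

```python
from datetime import datetime


def to_datetime_sketch(arg):
    """Parse YYYYMMDDHH or YYYYMMDDHHmm into a datetime object."""
    # Select the format by length so the two forms are never confused.
    formats = {10: "%Y%m%d%H", 12: "%Y%m%d%H%M"}
    if len(arg) not in formats:
        raise ValueError(f"Expected YYYYMMDDHH or YYYYMMDDHHmm, got {arg!r}")
    return datetime.strptime(arg, formats[len(arg)])


hourly = to_datetime_sketch("2023060112")
with_minutes = to_datetime_sketch("202306011230")
```

Here hourly is 12:00 on 1 June 2023 and with_minutes is 12:30 on the same day.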
- to_lower(arg)
Converts a string to lowercase
- Parameters:
arg (str) – Any string
- Returns:
An all-lowercase string
- wget_file(url)
Download a file from a URL source, and place it in a target location on disk.
- Parameters:
url – URL for file to be retrieved
- Returns:
Boolean value reflecting whether the copy was successful (True) or unsuccessful (False)