# National Hydrologic Geospatial Fabric (hydrofabric) Workflow
## Provisional Citation
Blodgett, D.L., Bock, A.R., Johnson, J.M., Santiago, M., Wieczorek, M.E., 2022, National Hydrologic Geospatial Fabric (hydrofabric) Workflow: U.S. Geological Survey software release, https://doi.org/10.5066/P977IAA2 (pending)
## Overview
This repository houses workflow software for development of three data releases associated with the National Hydrologic Geospatial Fabric (hydrofabric).
- Blodgett, D.L., 2022, Updated CONUS river network attributes based on the E2NHDPlusV2 and NWMv2.1 networks: U.S. Geological Survey data release, https://doi.org/10.5066/P9W79I7Q.
- Johnson, J.M., Blodgett, D.L., Bock, A.R., 2022, Updated CONUS network geometry based on NWMv2.1: U.S. Geological Survey data release, https://doi.org/10.5066/P9ZP6NNO. (pending here https://www.sciencebase.gov/catalog/item/6317a72cd34e36012efa4e8a)
- Bock, A.R., Blodgett, D.L., Johnson, J.M., Santiago, M., Wieczorek, M.E., 2022, National Hydrologic Geospatial Fabric Reference and Derived Hydrofabrics: U.S. Geological Survey data release, https://doi.org/10.5066/P9NFPB5S (pending here https://www.sciencebase.gov/catalog/item/60be0e53d34e86b93891012b)
These efforts are in a development phase and are working toward the priorities of the USGS Water Availability and Use Science Program and the NOAA National Water Model. Details of these efforts will be documented in forthcoming reports and web communication products.
## Repository Layout
The `hyfabric` directory houses an R package of reusable utilities specific to the workflow in the `workspace` directory. The `workspace` directory houses a collection of R-based workflow artifacts that collectively generate the data releases referenced above.
This work is being developed in an open and iterative process. As such, the disclaimer that follows applies. Periodic software releases (tagged and referenceable via a DOI) are planned. Please review the open issues for current development priorities.
## Disclaimer
This software is in the public domain because it contains materials that originally came from the U.S. Geological Survey, an agency of the United States Department of Interior. For more information, see the official USGS copyright policy.
Although this software program has been used by the U.S. Geological Survey (USGS), no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith.
This software is provided "AS IS."
## Environment Types
This repository is meant to support local, Docker/Jupyter-based, and HPC use with Shifter.
### Local
Local environment setup is largely up to the developer, and not all dependencies are strictly required for everyone. Review the `Dockerfile` to see what is installed in Docker for an indication of what might be required locally. If you add dependencies, please either add them to the Docker build or notify someone who knows how, and they will add them. Development of local environment setup / verification scripts should be considered for use with individual process steps.
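Since local setups vary, a small verification script can make missing dependencies obvious before a run. A minimal sketch, with an illustrative tool list (derive the real list from the `Dockerfile`):

```shell
#!/bin/sh
# Sketch of a local-environment verification script.
# The tools listed here are illustrative placeholders, not the definitive set.
check_dep() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "MISSING: $1"
  fi
}

for tool in R gdalinfo parallel docker; do
  check_dep "$tool"
done
```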
A major exception for local builds is ArcPy. Because it requires an ArcGIS license, it cannot be installed in a shared Docker/Jupyter environment. We will have to work through those dependencies on a case-by-case basis -- do not hesitate to check in files with ArcPy dependencies, but aim to have the workflow steps that use ArcPy generate data that can be accessed by other workflow steps rather than requiring those steps to rerun the ArcPy code.
An example end-to-end local run:

```shell
parallel --jobs 10 < workspace/navigate_calls.txt && \
  docker run --mount type=bind,source="$(pwd)"/workspace,target=/jupyter \
    dblodgett/gfv2:v0.12 \
    R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')" && \
  parallel --jobs 10 < workspace/hyRefactor_calls.txt
```
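The `*_calls.txt` files consumed by GNU `parallel` contain one shell command per line. As a sketch of how such a file can be generated (the region codes, file names, and Rmd parameters below are hypothetical placeholders, not the project's real ones):

```shell
#!/bin/sh
# Build a hypothetical calls file: one rmarkdown::render command per region.
# Region codes and the Rmd name are placeholders for illustration only.
rm -f calls_example.txt
for region in 01 02 03; do
  echo "R -e \"rmarkdown::render('/jupyter/navigate.Rmd', params = list(region = '${region}'))\"" \
    >> calls_example.txt
done
cat calls_example.txt
```

GNU `parallel` then runs those lines concurrently, e.g. `parallel --jobs 3 < calls_example.txt`.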
### Docker/Jupyter
The Docker / Jupyter environment will be treated as a continuous integration system that can be run on a local Docker or a remote Docker. Binder may be a suitable technology here as well. Some project developers may work directly in this environment, others may use it as a test environment to build the final version of their artifact.
### Denali / Shifter
Interactive: first request an allocation with `salloc`:

```shell
salloc -N 1 -A impd -t 01:00:00
```

then use `shifter` to enter a running container:

```shell
shifter --image=docker:dblodgett/hydrogeoenv-custom:latest --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" /bin/bash
```

Or execute a single `shifter` command as a test.
Batch: use a batch script like:

```shell
#!/bin/bash
#SBATCH -A impd
#SBATCH -N 1
#SBATCH --job-name=gfv2
#SBATCH --time=24:00:00
parallel --jobs 20 < workspace/shifter_navigate_calls.txt
shifter --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" --image=dblodgett/hydrogeoenv-custom:latest R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')"
parallel --jobs 10 < workspace/shifter_hyRefactor_calls.txt
```
## Artifact Management
To keep the workflow system from getting too complex, we will not pursue automated artifact cache management initially. Instead, local builds should (optionally) check for remote content from upstream processing steps and download the output if it is available. Each processing step is responsible for fetching its own cache. Process output should be versioned, and may be versioned per processing step, such that newly checked-in code has the updated version of its output checked in already and knows how to fetch it for others to use.
Using ScienceBase should be considered a best practice.
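The check-then-fetch pattern for a processing step can be sketched as follows. The function name, paths, and URL here are hypothetical placeholders; a real step would point at its upstream release on ScienceBase:

```shell
#!/bin/sh
# Sketch of per-step cache fetching: use a local copy if present,
# otherwise download the upstream step's published output.
# fetch_cache, the path, and the URL are hypothetical placeholders.
fetch_cache() {
  local_path="$1"
  remote_url="$2"
  if [ -f "$local_path" ]; then
    echo "using cached $local_path"
  else
    echo "downloading $remote_url"
    curl -fsSL -o "$local_path" "$remote_url"
  fi
}

# Demonstrate the cached branch with a placeholder file:
mkdir -p cache_demo
touch cache_demo/example_output.gpkg
fetch_cache cache_demo/example_output.gpkg \
  "https://example.com/upstream/example_output.gpkg"
# -> using cached cache_demo/example_output.gpkg
```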
This approach will require our dependency graph to remain fairly simple and processing steps to remain fairly large. We should create a process-step diagram and make sure that we are all clear on its layout and the strategy for the components we will be implementing.