NHM Geospatial Fabric V2

This repository holds a Dockerfile that builds a Jupyter installation with the dependencies used in development of the NHM Geospatial Fabric V2.

Assuming you've installed Docker Desktop and Git (Git Bash on Windows), to get started, open a terminal or Git Bash and do:

git clone https://code.usgs.gov/abock/gfv2.git
cd gfv2
docker-compose up

To run the Docker image, from the root of this repository, do:

docker-compose up

Changing the Docker build

Newly implemented capabilities sometimes work locally but not on the shared Docker build, and updates to the build may be required to support them. In that case, changes to the Docker build should be submitted at the same time as, or prior to, submittal of the workflow changes, so that the Docker build can be built and deployed to Docker Hub before others in the project access the workflow.

The most common case will be additional system, Python, or R dependencies. These can be added as described below.

  1. Open docker-compose.yml.
  2. Comment (add # before) the image: dblodgett/gfv2:v* line.
  3. Uncomment the build: . line (see the sketch after this list).
  4. Modify the Dockerfile by adding additional RUN blocks at the end of the file.
  5. Test the new build by running docker-compose up --build.
  6. Commit the changes to the Dockerfile but not to docker-compose.yml.
  7. Submit a pull request and notify dblodgett@usgs.gov that the Docker build needs to be updated.
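
For reference, a minimal sketch of what the relevant portion of docker-compose.yml might look like while testing a local build is shown below; the service name and exact lines are illustrative, so check the file in this repository for the actual contents.

services:
  gfv2:
    # image: dblodgett/gfv2:v*   # commented out while testing a local build
    build: .                     # active so docker-compose up --build uses the local Dockerfile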

Install system dependencies with apt-get: RUN apt-get install -y dep1 dep2
Install python dependencies with pip or shell: RUN pip install dep1 dep2
Or:

RUN command one && \
    command two && \
    command three

Install R dependencies with Rscript: RUN Rscript -e 'install.packages(c("dep1", "dep2"))'
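
For example, a hypothetical RUN block adding one system, one Python, and one R dependency (the package names are placeholders for illustration, not requirements of this project) might look like:

# Illustrative only: add a system library, a Python package, and an R package in one layer
RUN apt-get update && apt-get install -y gdal-bin && \
    pip install geopandas && \
    Rscript -e 'install.packages("sf")'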

Once the new build is working, tag and push it to Docker Hub:

Tag: docker tag gfv2:latest dblodgett/gfv2:v*

Push: docker push dblodgett/gfv2:v*

Environment Types

This repository is meant to support local, Docker/Jupyter-based, and HPC (Shifter) use.

Local

Local environment setup is largely up to the developer, and not all dependencies are strictly required for everyone. As such, review the Dockerfile to see what is installed in Docker and what might be required for your work. If you add dependencies, please either add them to the Docker build or notify someone who knows how so they can add them. Development of local environment setup / verification scripts should be considered for use with individual process steps.

A major exception for local builds is ArcPy. Because it requires an ArcGIS license, it cannot be installed in a shared Docker/Jupyter environment. We will have to work through those dependencies on a case-by-case basis. Do not hesitate to check in files with ArcPy dependencies, but aim to have the workflow steps that use ArcPy generate data that can be accessed by other workflow steps rather than needing to rerun the ArcPy code.

For example, the navigate, merge, and hyRefactor steps can be chained in a local Docker run like:

parallel --jobs 10 < workspace/navigate_calls.txt && \
docker run --mount type=bind,source="$(pwd)"/workspace,target=/jupyter \
    dblodgett/gfv2:v0.12 \
    R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')" && \
parallel --jobs 10 < workspace/hyRefactor_calls.txt

Docker/Jupyter

The Docker / Jupyter environment will be treated as a continuous integration system that can be run on a local Docker or a remote Docker. Binder may be a suitable technology here as well. Some project developers may work directly in this environment, others may use it as a test environment to build the final version of their artifact.

Denali / Shifter

For interactive use, first request an allocation with salloc:

salloc -N 1 -A impd -t 01:00:00

then use shifter to enter a running container:

shifter --image=docker:dblodgett/hydrogeoenv-custom:latest --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" /bin/bash

Or execute a test shifter command.
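
A minimal sketch of such a test command, assuming the same image and volume mapping as above (the R call is just an arbitrary check that the container starts):

shifter --image=docker:dblodgett/hydrogeoenv-custom:latest \
    --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" \
    R -e "sessionInfo()"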

For batch use, submit a batch script like:

#!/bin/bash
#SBATCH -A impd
#SBATCH -N 1
#SBATCH --job-name=gfv2
#SBATCH --time=24:00:00

parallel --jobs 20 < workspace/shifter_navigate_calls.txt

shifter --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" --image=dblodgett/hydrogeoenv-custom:latest R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')"

parallel --jobs 10 < workspace/shifter_hyRefactor_calls.txt
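
Assuming the script above is saved as, for example, gfv2_batch.sh (the file name is arbitrary), it can be submitted to the scheduler with:

sbatch gfv2_batch.sh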

Artifact Management

In order to keep the workflow system from getting too complex, we will not pursue automated artifact cache management initially. Rather, local builds should (optionally) check for remote content from upstream processing steps and download the output if it is available. Each processing step will be responsible for fetching its own cache. Process output should be versioned, but can be versioned per processing step, such that newly checked-in code has the updated version of its output checked in already and knows how to fetch it for others to use.

Using ScienceBase should be considered a best practice.
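
A minimal sketch of a per-step cache fetch, assuming the upstream output is attached to a ScienceBase item (the file name, item ID, and download URL pattern are illustrative, not part of this repository):

# Hypothetical example: download an upstream artifact only if it is not already present locally.
ARTIFACT="workspace/navigate_output.gpkg"
SB_URL="https://www.sciencebase.gov/catalog/file/get/<item-id>?name=navigate_output.gpkg"

if [ ! -f "$ARTIFACT" ]; then
    curl -L -o "$ARTIFACT" "$SB_URL"
fi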

This approach will require our dependency graph to remain fairly simple and processing steps to remain fairly large. We should create a process-step diagram and make sure that we are all clear on its layout and the strategy for the components we will be implementing.

Disclaimer

This software is in the public domain because it contains materials that originally came from the U.S. Geological Survey, an agency of the United States Department of Interior. For more information, see the official USGS copyright policy

Although this software program has been used by the U.S. Geological Survey (USGS), no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith.

This software is provided "AS IS."