# NHM Geospatial Fabric V2
This repository holds a Dockerfile for a Jupyter installation and dependencies for use in development of the NHM Geospatial Fabric V2.
Assuming you've installed Docker Desktop and git/Git Bash (Windows), to get started, open a terminal or Git Bash and do:

```
git clone https://code.usgs.gov/abock/gfv2.git
cd gfv2
docker-compose up
```
To run the Docker image, from the root of this repository, do:

```
docker-compose up
```
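Once the container is running, the Jupyter server prints a tokenized URL to the console. If it scrolls past, the URL can usually be recovered from the container logs (a minimal sketch, assuming the default Jupyter log output):

```bash
# Retrieve the tokenized Jupyter URL from the compose logs
docker-compose logs | grep token
```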
## Changing the Docker build
Updates to the Docker build may be required when newly implemented capabilities work locally but not in the shared Docker build. In that case, changes to the Docker build should be submitted at the same time as, or prior to, the workflow changes, so the Docker image can be built and deployed to Docker Hub before the workflow is accessed by others in the project.

The most common case will be additional system, Python, or R dependencies. These can be added as described below.
- Open the `docker-compose.yml`.
- Comment out (add `#` before) the `image: dblodgett/gfv2:v*` line.
- Uncomment the `build: .` line so the image is built locally (see the sketch after this list).
- Modify the `Dockerfile` by adding additional `RUN` blocks at the end of the file.
- Test the new build by running `docker-compose up --build`.
- Commit the changes to the `Dockerfile` but not to `docker-compose.yml`.
- Submit a pull request and notify dblodgett@usgs.gov that the Docker build needs to be updated.
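For reference, the relevant portion of `docker-compose.yml` might look something like this while testing a local build (a minimal sketch; the service name, mount, and other settings in the repository's actual file may differ):

```yaml
services:
  gfv2:                           # hypothetical service name
    # image: dblodgett/gfv2:v*    # commented out while testing a local build
    build: .
    volumes:
      - ./workspace:/jupyter      # illustrative bind mount matching the run examples below
```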
Install system dependencies with `apt-get`:

```
RUN apt-get install -y dep1 dep2
```

Install Python dependencies with `pip` or shell commands:

```
RUN pip install dep1 dep2
```

Or:

```
RUN command one && \
    command two && \
    command three
```

Install R dependencies with `Rscript`:

```
RUN Rscript -e 'install.packages(c("dep1", "dep2"))'
```
Tag (where `v*` is the new version tag):

```
docker tag gfv2:latest dblodgett/gfv2:v*
```

Push:

```
docker push dblodgett/gfv2:v*
```
## Environment Types
This repository is meant to support local, Docker/Jupyter-based, and HPC (Shifter) use.
### Local
Local environment setup is largely up to the developer, and not all dependencies are strictly required for everyone. Review the `Dockerfile` to see what is installed in Docker as a guide to what might be required. If you add dependencies, please either add them to the Docker build or notify someone who knows how, and they will add them. Development of local environment setup / verification scripts should be considered for use with individual process steps.
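As a starting point, a verification script could be as simple as checking that the command-line tools a process step relies on are available (a minimal sketch; the tool list below is illustrative, not definitive):

```bash
#!/bin/bash
# Illustrative local environment check -- adjust the tool list per process step.
for cmd in R python3 parallel docker; do
  command -v "$cmd" > /dev/null || echo "missing: $cmd"
done
```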
A major exception for local builds is ArcPy. Because it requires an ArcGIS license, it cannot be installed in a shared Docker/Jupyter environment. We will have to work through those dependencies on a case-by-case basis -- do not hesitate to check in files with ArcPy dependencies, but aim to have the workflow steps that use ArcPy generate data that can be accessed by other workflow steps rather than requiring others to rerun the ArcPy code.
For example, a local run chaining the navigate, merge, and refactor steps might look like:

```
parallel --jobs 10 < workspace/navigate_calls.txt && \
docker run --mount type=bind,source="$(pwd)"/workspace,target=/jupyter dblodgett/gfv2:v0.12 \
  R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')" && \
parallel --jobs 10 < workspace/hyRefactor_calls.txt
```
### Docker/Jupyter
The Docker / Jupyter environment will be treated as a continuous integration system that can be run on a local or remote Docker. Binder may be a suitable technology here as well. Some project developers may work directly in this environment; others may use it as a test environment to build the final version of their artifact.
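For example, a developer can open a shell in the shared image to test an individual step before committing it (a sketch assuming the image tag and bind mount used in the local example above):

```bash
docker run -it --mount type=bind,source="$(pwd)"/workspace,target=/jupyter \
  dblodgett/gfv2:v0.12 /bin/bash
```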
### Denali / Shifter
For interactive use, first `salloc`:

```
salloc -N 1 -A impd -t 01:00:00
```

then use `shifter` to enter a running container:

```
shifter --image=docker:dblodgett/hydrogeoenv-custom:latest --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" /bin/bash
```
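If the image is not yet available on the system, it typically needs to be pulled first (an assumption based on standard Shifter usage; check Denali's documentation for the exact procedure):

```bash
shifterimg pull docker:dblodgett/hydrogeoenv-custom:latest
```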
Or execute a test `shifter` command.
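For example, a quick check that the image and volume mount are working (an illustrative command, not part of the repository's scripts):

```bash
shifter --image=docker:dblodgett/hydrogeoenv-custom:latest \
  --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" R -e "sessionInfo()"
```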
For batch use, submit a batch script like:

```
#!/bin/bash
#SBATCH -A impd
#SBATCH -N 1
#SBATCH --job-name=gfv2
#SBATCH --time=24:00:00

parallel --jobs 20 < workspace/shifter_navigate_calls.txt
shifter --volume="/absolute/path/gfv2/gfv2.0/workspace:/jupyter" --image=dblodgett/hydrogeoenv-custom:latest R -e "rmarkdown::render('/jupyter/merge.Rmd', output_file='/jupyter/merge.html')"
parallel --jobs 10 < workspace/shifter_hyRefactor_calls.txt
```
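The script is then submitted with `sbatch` (the file name here is hypothetical):

```bash
sbatch gfv2_batch.sh
```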
## Artifact Management
In order to keep the workflow system from getting too complex, we will not pursue automated artifact cache management initially. Rather, local builds should (optionally) check for remote content from upstream processing steps and download the output if it is available. Each processing step will be responsible for fetching its own cache. Process output should be versioned, but it can be versioned per processing step, such that newly checked-in code has the updated version of its output checked in already and knows how to fetch it for others to use.
Using ScienceBase should be considered a best practice.
This approach will require our dependency graph to remain fairly simple and processing steps to remain fairly large. We should create a process-step diagram and make sure that we are all clear on its layout and the strategy for the components we will be implementing.
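As an illustration of the per-step cache-fetch pattern described above, a processing step might begin with something like the following (a minimal sketch using the `sbtools` R package; the file name and ScienceBase item ID are hypothetical placeholders):

```r
library(sbtools)

step_output <- "workspace/navigate_output.gpkg"   # hypothetical upstream output
sb_item_id  <- "0123456789abcdef01234567"         # hypothetical ScienceBase item ID

# If the upstream output is not present locally, try to download it from ScienceBase
# rather than rerunning the upstream processing step.
if (!file.exists(step_output)) {
  item_file_download(sb_item_id,
                     names = basename(step_output),
                     destinations = step_output)
}
```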
## Disclaimer
This software is in the public domain because it contains materials that originally came from the U.S. Geological Survey, an agency of the United States Department of Interior. For more information, see the official USGS copyright policy.
Although this software program has been used by the U.S. Geological Survey (USGS), no warranty, expressed or implied, is made by the USGS or the U.S. Government as to the accuracy and functioning of the program and related program material nor shall the fact of distribution constitute any such warranty, and no responsibility is assumed by the USGS in connection therewith.
This software is provided "AS IS."