00_get_data.Rmd

---
title: "Get Hydro Location and Network Data"
output: html_document
---

This notebook pulls data from a number of sources and populates the data 
directory. Any new data requirements should be added as code chunks here. 

Each code chunk should create a path to the file you want to use in a process 
step, check if that path exists, and put the data there if it does not. All 
paths are stored in a list that is saved to the `cache` directory. If changes 
are made to the output of this notebook, they should be checked in.

**If resources from ScienceBase need to be downloaded, this R Markdown document should be run from RStudio so that username and password authentication will work.**
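
The convention described above can be sketched as follows. This is a minimal, non-evaluated illustration using only base R. `fetch_example()`, `demo_dir`, and `demo_list` are hypothetical stand-ins; the real chunks use helpers such as `get_sb_file()` and `download_file()` and append to `out_list`.

```r
# Hypothetical stand-in for get_sb_file() / download_file():
# it only writes a small CSV so the sketch runs offline.
fetch_example <- function(dest) {
  write.csv(data.frame(id = 1:3), dest, row.names = FALSE)
  dest
}

# Assumed demo directory; the notebook itself uses "data".
demo_dir <- file.path(tempdir(), "data_demo")
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

demo_list <- list(data_dir = demo_dir)

# 1. Build the path that a later process step will read.
example_path <- file.path(demo_dir, "example.csv")

# 2. Fetch only when the file is not already in place.
if (!file.exists(example_path)) {
  fetch_example(example_path)
}

# 3. Register the path so it ends up in the cached path list.
demo_list <- c(demo_list, list(example_path = example_path))
```

Running the sketch a second time skips the fetch step because the file already exists, which is the behavior each chunk in this notebook aims for.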

```{r}
source("R/utils.R")
source("R/config.R")
source("R/user_vars.R")
source("R/1_get_data.R")

if(!dir.exists("data")) {dir.create("data")}
if(!dir.exists("bin")) {dir.create("bin")}

data_dir <- "data"
out_list <- list("data_dir" = data_dir)
out_file <- file.path("cache", "data_paths.json")

sevenz <- "7z"
check_7z <- try(nhdplusTools:::check7z(), silent = TRUE)
if(is(check_7z, "try-error")) {
  message("trying to download 7z -- it's not on your path")
  # Download command-line 7-Zip
  if(!file.exists("bin/7za.exe")){
    download.file("https://www.7-zip.org/a/7za920.zip", 
                  destfile = "bin/7za920.zip")
    unzip("bin/7za920.zip", exdir = "bin")
  }
  sevenz <- "bin/7za.exe"
}

initialize_sciencebase_session(username = Sys.getenv("sb_user"))
# Enable mapview rendering if desired
mapview <- FALSE
```
HUC12 (Hydrologic Unit Code, Level 12) outlets derived from the Watershed 
Boundary Dataset indexed to the reference fabric form the baseline and extent of 
national modeling fabrics.
```{r HUC12 outlets}

#  Blodgett, D.L., 2022, Mainstem Rivers of the Conterminous United States: 
#  U.S. Geological Survey data release, https://doi.org/10.5066/P9BTKP3T. 

out_list <- c(
  out_list, 
  list(hu12_points_path = 
         get_sb_file(item = "63cb38b2d34e06fef14f40ad",
                     item_files = "102020wbd_outlets.gpkg",
                     out_destination = data_dir)))

if(mapview)(mapview(read_sf(out_list$hu12_points_path)))

```

SWIM (Streamgage Watershed InforMation) includes locations for 12,422 USGS 
streamgages as indexed along the network of streams (flowlines) in NHDPlus 
Version 2.1 (NHDPlus v2, Moore and Dewald, 2016). The dataset is one of two 
datasets developed for the Streamgage Watershed InforMation (SWIM) project. This 
dataset, which is referred to as “SWIM streamgage locations,” was created in 
support of the second dataset of basin characteristics and disturbance indexes. 

```{r SWIM}
#  Hayes, L., Chase, K.J., Wieczorek, M.E., and Jackson, S.E., 2021, 
#  USGS streamgages in the conterminous United States indexed to NHDPlus v2.1 
#  flowlines to support Streamgage Watershed InforMation (SWIM), 2021: U.S. 
#  Geological Survey data release, https://doi.org/10.5066/P9J5CK2Y.

out_list <- c(
  out_list, 
  list(SWIM_points_path = 
         get_sb_file(item = "5ebe92af82ce476925e44b8f",
                     item_files = "all",
                     out_destination = file.path(data_dir, "SWIM_gage_loc"))))

if(mapview)(mapview(read_sf(out_list$SWIM_points_path)))
```

Sites associated with work by the U.S. Geological Survey (USGS) to estimate 
the amount of water that is withdrawn and consumed by thermoelectric power 
plants (Diehl and others, 2013; Diehl and Harris, 2014; Harris and Diehl, 2019; 
Galanter and others, 2023). 

```{r Thermoelectric Facilities}
#   Harris, Melissa A. and Diehl, Timothy H., 2017. A Comparison of Three 
#   Federal Datasets for Thermoelectric Water Withdrawals in the United States 
#   for 2010. Journal of the American Water Resources Association (JAWRA) 
#   53(5): 1062–1080. https://doi.org/10.1111/1752-1688.12551
#
#   Galanter, A.E., Gorman Sanisaca, L.E., Skinner, K.D., Harris, M.A., 
#   Diehl, T.H., Chamberlin, C.A., McCarthy, B.A., Halper, A.S., 
#   Niswonger, R.G., Stewart, J.S., Markstrom, S.L., Embry, I., and 
#   Worland, S., 2023, Thermoelectric-power water use reanalysis for the 
#   2008-2020 period by power plant, month, and year for the conterminous 
#   United States: U.S. Geological Survey data release, 
#   https://doi.org/10.5066/P9ZE2FVM.

TE_points_path <- file.path(data_dir, "TE_points")

dir.create(TE_points_path, recursive = TRUE, showWarnings = FALSE)

get_sb_file("5dbc53d4e4b06957974eddae", 
            "2015_TE_Model_Estimates_lat.long_COMIDs.7z",
            TE_points_path)

get_sb_file("63adc826d34e92aad3ca5af4", 
            "galanter_and_others_2023.zip",
            TE_points_path)

out_list <- c(out_list, list(TE_points_path = TE_points_path))

if(mapview)(mapview(read_sf(out_list$TE_points_path)))
```

Network locations made to improve the routing capabilities 
and ancillary hydrologic attributes of NHDPlusV2 to support modeling and other 
hydrologic analyses. The resulting enhanced network is named E2NHDPlusV2_us. 
This includes the network locations associated with some diversions and 
water use withdrawals.

```{r e2nhd supplemental data - USGS}
#   Schwarz, G.E., 2019, E2NHDPlusV2_us: Database of Ancillary Hydrologic 
#   Attributes and Modified Routing for NHDPlus Version 2.1 Flowlines: U.S. 
#   Geological Survey data release, https://doi.org/10.5066/P986KZEM.

out_list <- c(
  out_list, 
  list(USGS_IT_path = 
         get_sb_file("5d16509ee4b0941bde5d8ffe", 
                     "supplemental_files.zip",
                     file.path(data_dir, "USGS_IT"))))
```

Two datasets relate hydro location information from the National Inventory of
Dams to the NHDPlus network. One effort is related to the SPARROW work 
(Wieczorek and others, 2018); the other is related to work quantifying impacts on
natural flow (Wieczorek and others, 2021).

```{r National Inventory of Dams}

#   Wieczorek, M.E., Jackson, S.E., and Schwarz, G.E., 2018, Select Attributes 
#   for NHDPlus Version 2.1 Reach Catchments and Modified Network Routed 
#   Upstream Watersheds for the Conterminous United States (ver. 2.0, 
#   November 2019): U.S. Geological Survey data release, 
#   https://doi.org/10.5066/F7765D7V.

#  Wieczorek, M.E., Wolock, D.M., and McCarthy, P.M., 2021, Dam 
#  impact/disturbance metrics for the conterminous United States, 1800 to 2018: 
#  U.S. Geological Survey data release, https://doi.org/10.5066/P92S9ZX6.

NID_points_path <- file.path(data_dir, "NID_points")

get_sb_file("5dbc53d4e4b06957974eddae",
            "NID_attributes_20170612.txt",
            NID_points_path)

get_sb_file("5fb7e483d34eb413d5e14873",
            "Final_NID_2018.zip",
            NID_points_path)

out_list <- c(out_list, list(NID_points_path = NID_points_path))

if(mapview)(mapview(read_sf(out_list$NID_points_path)))
```

This next section retrieves NHDPlus datasets related to national modeling 
efforts. These include:

1. National Geodatabase
2. Hawaii, Puerto Rico, and Islands Geodatabase
3. GageLoc file of streamgages indexed to NHDPlusv2 flowlines
4. NHDPlusv2 catchment - HUC12 crosswalk

```{r NHDPlusV2}
# NHDPlus Seamless National Data - pulled from NHDPlus national data server; 
# post-processed to RDS files by NHDPlusTools
# GageLoc - Gages snapped to NHDPlusv2 flowlines (QAQC not verified)

# NHDPlus HUC12 crosswalk 
#    Moore, R.B., Johnston, C.M., and Hayes, L., 2019, Crosswalk Table Between 
#    NHDPlus V2.1 and its Accompanying WBD Snapshot of 12-Digit Hydrologic 
#    Units: U.S. Geological Survey data release, 
#    https://doi.org/10.5066/P9CFXHGT.

epa_data_root <- "https://dmap-data-commons-ow.s3.amazonaws.com/NHDPlusV21/Data/"

nhdplus_dir <- file.path(data_dir, "NHDPlusNationalData")
nhdplus_gdb <- file.path(nhdplus_dir, "NHDPlusV21_National_Seamless_Flattened_Lower48.gdb")

islands_dir <- file.path(data_dir, "islands")
islands_gdb <- file.path(islands_dir, "NHDPlusNationalData/NHDPlusV21_National_Seamless_Flattened_HI_PR_VI_PI.gdb/")

rpu <- file.path(nhdplus_dir, "NHDPlusGlobalData", "BoundaryUnit.shp")

get_sb_file("5dbc53d4e4b06957974eddae", "NHDPlusV21_NationalData_GageLoc_05.7z", nhdplus_dir)

get_sb_file("5c86a747e4b09388244b3da1", "CrosswalkTable_NHDplus_HU12_CSV.7z", nhdplus_dir)

# will download the 7z and unzip into the folder structure in nhdplus_gdb path
download_file(paste0(epa_data_root, "NationalData/NHDPlusV21_NationalData_Seamless_Geodatabase_Lower48_07.7z"),
              out_path = data_dir, check_path = nhdplus_gdb)

download_file(paste0(epa_data_root, "NationalData/NHDPlusV21_NationalData_Seamless_Geodatabase_HI_PR_VI_PI_03.7z"),
              out_path = islands_dir, check_path = islands_gdb)

# cache the huc12 layer in rds format
hu12_rds <- file.path(nhdplus_dir, "HUC12.rds")

if(!file.exists(hu12_rds)) {
  read_sf(nhdplus_gdb, layer = "HUC12") |>
    st_make_valid() |>
    st_transform(crs = proj_crs) |> 
    # TODO: convert this to gpkg
    saveRDS(file = hu12_rds)
}

get_sb_file("5dcd5f96e4b069579760aedb", "GageLocGFinfo.dbf", data_dir)

download_file(paste0(epa_data_root, "GlobalData/NHDPlusV21_NHDPlusGlobalData_03.7z"),
              out_path = nhdplus_dir, check_path = rpu)

out_list <- c(out_list, list(nhdplus_dir = nhdplus_dir, 
                             nhdplus_gdb = nhdplus_gdb, 
                             islands_dir = islands_dir, 
                             islands_gdb = islands_gdb,
                             nhdplus_rpu = rpu))
```

Reference catchments and flowlines are hydrographic products that are derived
from the USGS National Hydrologic Geospatial Fabric and the National Oceanic 
and Atmospheric Administration. The Reference Flowlines include modifications
from the National Water Model and e2nhd networks integrated into NHDPlusv2.1. 
The reference catchments are geometrically simplified to POLYGON geometry to
improve raster re-gridding efficiency and have a large number of DEM artifacts 
removed.


```{r Reference Fabric}
# Reference Fabric flowpaths and catchments derived by Mike Johnson (NOAA)

ref_fab_path <- file.path(data_dir, "reference_fabric")
ref_cat <- file.path(ref_fab_path, "reference_catchments.gpkg")
ref_fl <- file.path(ref_fab_path, "reference_flowline.gpkg")
nwm_fl <- file.path(ref_fab_path, "nwm_network.gpkg")

for (vpu in c("01", "08", "10L", "15", "02", "04", "05", "06", "07", "09", 
              "03S", "03W", "03N", "10U", "11", "12", "13", "14",  "16", 
              "17", "18")) {
  
  get_sb_file("6317a72cd34e36012efa4e8a", 
              paste0(vpu, "_reference_features.gpkg"), 
              ref_fab_path)
}

get_sb_file("61295190d34e40dd9c06bcd7",
            c("reference_catchments.gpkg", "reference_flowline.gpkg", "nwm_network.gpkg"), 
            out_destination = ref_fab_path)


out_list <- c(out_list, list(ref_fab_path = ref_fab_path, 
                             ref_cat = ref_cat, ref_fl = ref_fl, nwm_fl = nwm_fl))
```

NHDPlus Waterbody and Area Polygons converted to an RDS file for easier 
loading within R.

```{r NHDPlusV2 Waterbodies}
#  Waterbodies - derived after downloading and post-processing 
#  NHDPlus Seamless National Geodatabase
#  Compacted here into an RDS file
waterbodies_path <- file.path(nhdplus_dir, "nhdplus_waterbodies.rds")

if(!file.exists(waterbodies_path)) {
  message("formatting NHDPlus waterbodies...")
  
  data.table::rbindlist(
    list(
    
    read_sf(out_list$nhdplus_gdb, "NHDWaterbody") |>
      st_transform(proj_crs) |>
      mutate(layer = "NHDWaterbody"), 
    
    read_sf(out_list$nhdplus_gdb, "NHDArea") |>
      st_transform(proj_crs) |>
      mutate(layer = "NHDArea")
    
    ), fill = TRUE) |>
    st_as_sf() |>
    saveRDS(waterbodies_path)
  
}

out_list <- c(out_list, list(waterbodies_path = waterbodies_path))
```

Formatting a full list of network and non-network catchments for the NHDPlus
domains. This makes it easier to track which catchments are on and off the 
network when aggregating at points of interest.

```{r full cats}
# Modification to NHDPlus catchments
fullcat_path <- file.path(nhdplus_dir, "nhdcat_full.rds")
islandcat_path <- file.path(islands_dir, "nhdcat_full.rds")

if(!file.exists(fullcat_path))
  saveRDS(cat_rpu(out_list$ref_cat, nhdplus_gdb), 
          fullcat_path)

if(!file.exists(islandcat_path))
  saveRDS(cat_rpu(out_list$islands_gdb, islands_gdb), 
          islandcat_path)
  
out_list <- c(out_list, list(fullcats_table = fullcat_path, islandcats_table = islandcat_path)) 

```

Download NHDPlusV2 FDR and FAC grids for refactoring and catchment splitting.

```{r NHDPlusV2 FDR_FAC}
# NHDPlus FDR/FAC grids available by raster processing unit
# TODO: set this up for a per-region download for #134
out_list <- c(out_list, make_fdr_fac_list(file.path(data_dir, "fdrfac")))

```

Download NHDPlusV2 elevation grids for headwater extensions and splitting 
catchments into left and right banks.

```{r NHDPlusV2 elev}
# NHDPlus elev grids available by raster processing unit
# TODO: set this up for a per-region download for #134
out_list <- c(out_list, make_nhdplus_elev_list(file.path(data_dir, "nhdplusv2_elev")))

```

MERIT Topographic and Hydrographic data for deriving GIS Features of the 
National Hydrologic Modeling, Alaska Domain

```{r MERIT HydroDEM}
#  MERIT HydroDEM - used for AK Geospatial Fabric, and potentially 
#  Mexico portion of R13
#-----------------------------------------------------------------------------
#  Yamazaki, D., Ikeshima, D., Sosa, J., Bates, P. D., Allen, G. H., & 
#  Pavelsky, T. M. (2019). MERIT Hydro: a high‐resolution global hydrography 
#  map based on latest topography dataset. Water Resources Research, 55, 
#  5053–5073. https://doi.org/10.1029/2019WR024873

merit_dir <- file.path(data_dir, "merged_AK_MERIT_Hydro")

get_sb_file("5dbc53d4e4b06957974eddae", "merged_AK_MERIT_Hydro.zip", merit_dir)

# TODO: update to use "6644f85ed34e1955f5a42dc4" when released (roughly Dec 10,)
get_sb_file("5fbbc6b6d34eb413d5e21378", "dem.zip", merit_dir)

get_sb_file("64ff628ed34ed30c2057b430", 
            c("ak_merit_dem.zip", "ak_merit_fdr.zip", "ak_merit_fac.zip"),
            merit_dir)

out_list <- c(
  out_list, 
  list(merit_catchments = file.path(merit_dir, 
                                    "merged_AK_MERIT_Hydro", 
                                    "cat_pfaf_78_81_82_MERIT_Hydro_v07_Basins_v01.shp"),
       merit_rivers = file.path(merit_dir, 
                                "merged_AK_MERIT_Hydro", 
                                "riv_pfaf_78_81_82_MERIT_Hydro_v07_Basins_v01.shp"),
       aster_dem = file.path(merit_dir, "dem.tif"),
       merit_dem = file.path(merit_dir, "ak_merit_dem.tif"),
       merit_fdr = file.path(merit_dir, "ak_merit_fdr.tif"),
       merit_fac = file.path(merit_dir, "ak_merit_fac.tif")))

```
  
Source data for deriving GIS Features of the National Hydrologic Modeling, 
Alaska Domain

```{r AK GF Source data}
# TODO: fix this citation
#  Bock, A.R., Rosa, S.N., McDonald, R.R., Wieczorek, M.E., Santiago, M., 
#  Blodgett, D.L., and Norton, P.A., 2024,   Geospatial Fabric for National 
#  Hydrologic Modeling, Hawaii Domain: U.S. Geological Survey data release,  
#  https://doi.org/10.5066/P9HMKOP8

AK_GF_source <- "ak.7z"
AK_dir <- file.path(data_dir, "AK")

get_sb_file("5dbc53d4e4b06957974eddae", AK_GF_source, AK_dir)

out_list <- c(out_list, list(ak_source = file.path(AK_dir, "ak.gpkg")))

```

Source data for deriving GIS Features of the National Hydrologic Modeling, 
Hawaii Domain

```{r HI GF Source data}
#  Bock, A.R., Rosa, S.N., McDonald, R.R., Wieczorek, M.E., Santiago, M., 
#  Blodgett, D.L., and Norton, P.A., 2024,   Geospatial Fabric for National 
#  Hydrologic Modeling, Hawaii Domain: U.S. Geological Survey data release,  
#  https://doi.org/10.5066/P9HMKOP8

get_sb_file("5dbc53d4e4b06957974eddae", "hi.7z", islands_dir)

out_list <- c(out_list, list(hi_source = file.path(islands_dir, "hi.gpkg")))

```

GIS Features of the Geospatial Fabric for National Hydrologic Modeling, 
version 1.1, Transboundary Geospatial Fabric

```{r GFv1.1}
#  Bock, A.E., Santiago, M., Wieczorek, M.E., Foks, S.S., Norton, P.A., and 
#  Lombard, M.A., 2020, Geospatial Fabric for National Hydrologic Modeling, 
#  version 1.1 (ver. 3.0, November 2021): U.S. Geological Survey data release, 
#  https://doi.org/10.5066/P971JAGF.

GFv11_dir <- file.path(data_dir, "GFv11")

out <- list(GFv11_gages_lyr = file.path(data_dir, "GFv11/GFv11_gages.rds"),
            GFv11_gdb = file.path(GFv11_dir, "GFv1.1.gdb"),
            GFv11_tgf = file.path(GFv11_dir, "TGF.gdb"))

get_sb_file("5e29d1a0e4b0a79317cf7f63", "GFv1.1.gdb.zip", GFv11_dir)

get_sb_file("5d967365e4b0c4f70d113923", "TGF.gdb.zip", GFv11_dir)

# Overwrite the downloaded zip archives with empty placeholder files
cat("", file = file.path(GFv11_dir,  "GFv1.1.gdb.zip"))
cat("", file = file.path(GFv11_dir,  "TGF.gdb.zip"))

# Extract gages
read_sf(out$GFv11_gdb, "POIs_v1_1") |>
  filter(Type_Gage != 0) |>
  saveRDS(out$GFv11_gages_lyr)

out_list <- c(out_list, out)

if(mapview)(mapview(readRDS(out_list$GFv11_gages_lyr)))
```

GAGES-II dataset

```{r Gages_II}
# Falcone, J., 2011, GAGES-II: Geospatial Attributes of Gages for Evaluating 
# Streamflow: U.S. Geological Survey data release, https://doi.org/10.5066/P96CPHOT. 

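# SWIM_points_path is not defined earlier in this notebook; reuse the
# SWIM download directory so the GAGES-II shapefile sits beside it
SWIM_points_path <- file.path(data_dir, "SWIM_gage_loc")

get_sb_file("631405bbd34e36012efa304a", "gagesII_9322_point_shapefile.zip", SWIM_points_path)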

out_list <- c(out_list, list(
  gagesii_lyr = file.path(SWIM_points_path, "gagesII_9322_point_shapefile")))

if(mapview)(mapview(read_sf(out_list$gagesii_lyr)))
```

HILARRI dataset of Network-indexed Hydropower structures, reservoirs, and 
locations

```{r HILARRI}
#  Carly H. Hansen and Paul G. Matson. 2023. Hydropower Infrastructure - LAkes, 
#  Reservoirs, and RIvers (HILARRI), Version 2. HydroSource. Oak Ridge 
#  National Laboratory, Oak Ridge, Tennessee, USA. 
#  DOI: https://doi.org/10.21951/HILARRI/1960141

hilarri_dir <- file.path(data_dir, "HILARRI")
hilarri_out <- list(hilari_sites = file.path(hilarri_dir, "HILARRI_v2.csv"))

download_file("https://hydrosource.ornl.gov/sites/default/files/2023-03/HILARRI_v2.zip", 
              out_path = hilarri_dir, check_path = hilarri_out$hilari_sites)

out_list <- c(out_list, hilarri_out)

if(mapview) {
  mapview(st_as_sf(read.csv(out_list$hilari_sites),
                   coords = c("longitude", "latitude"), 
                   crs = 4326))
}

```

ResOpsUS dataset and indexed locations

```{r Reservoir datasets}
# ResOpsUS
# Steyaert, J.C., Condon, L.E., W.D. Turner, S. et al. ResOpsUS, a dataset of 
# historical reservoir operations in the contiguous United States. Sci Data 9, 
# 34 (2022). https://doi.org/10.1038/s41597-022-01134-7

# GRanD
#  Lehner, B., C. Reidy Liermann, C. Revenga, C. Vörösmarty, B. Fekete, 
#  P. Crouzet, P. Döll, M. Endejan, K. Frenken, J. Magome, C. Nilsson, 
#  J.C. Robertson, R. Rodel, N. Sindorf, and D. Wisser. 2011. High-resolution 
#  mapping of the world’s reservoirs and dams for sustainable river-flow 
#  management. Frontiers in Ecology and the Environment 9 (9): 494-502.
#  https://ln.sync.com/dl/bd47eb6b0/anhxaikr-62pmrgtq-k44xf84f-pyz4atkm/view/default/447819520013

res_path <- file.path(data_dir,"reservoir_data")

# Set Data download links
res_att_url <- "https://zenodo.org/record/5367383/files/ResOpsUS.zip?download=1"
# ISTARF  - Inferred Storage Targets and Release Functions for CONUS large reservoirs
istarf_url <- "https://zenodo.org/record/4602277/files/ISTARF-CONUS.csv?download=1"
# Download GRanD zip from above
GRanD_zip <- file.path(res_path, "GRanD_Version_1_3.zip")

download_file(res_att_url, res_path, file_name = "ResOpsUS.zip")

# Append to out_list (not a separate object) so the path is written to the cache
out_list <- c(out_list, list(res_attributes = file.path(res_path, "ResOpsUS", "attributes", 
                                                        "reservoir_attributes.csv")))

istarf_csv <- file.path(res_path, "ISTARF-CONUS.csv")

download_file(istarf_url, res_path, istarf_csv, file_name = "ISTARF-CONUS.csv")

out_list <- c(out_list, list(istarf = istarf_csv))

grand_dir <- file.path(res_path, "GRanD_Version_1_3")

# Extract GRanD data
if(!dir.exists(grand_dir)) {
  if(!file.exists(GRanD_zip)) 
    stop("Download GRanD data from https://ln.sync.com/dl/bd47eb6b0/anhxaikr-62pmrgtq-k44xf84f-pyz4atkm/view/default/447819520013 to ",
         res_path)
  
  unzip(GRanD_zip, exdir = res_path)
}

out_list <- c(out_list, list(GRanD = grand_dir))

resops_to_nid_path <- file.path(res_path, "cw_ResOpsUS_NID.csv")

# Pass the destination directory, as the other get_sb_file() calls do
get_sb_file("5dbc53d4e4b06957974eddae", "cw_ResOpsUS_NID.csv", res_path)

out_list <- c(out_list, list(resops_NID_CW = resops_to_nid_path))
```

All Hydro Network-Linked Data Index (NLDI) datasets

```{r nldi}
# NLDI feature data sources
#  https://www.sciencebase.gov/catalog/item/60c7b895d34e86b9389b2a6c

nldi_dir <- file.path(data_dir, "nldi")

get_sb_file("60c7b895d34e86b9389b2a6c", "all", nldi_dir)

```

```{r}
write_json(out_list, path = out_file, pretty = TRUE, auto_unbox = TRUE)
rm(out_list)
```
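
A later process step could read the cached paths back along these lines. This is a sketch that assumes `write_json()` comes from the jsonlite package. It uses a temporary file and made-up entries (`demo_paths`) rather than the real `cache/data_paths.json`.

```r
library(jsonlite)

# Hypothetical path list standing in for out_list
demo_paths <- list(data_dir = "data",
                   nhdplus_dir = "data/NHDPlusNationalData")

# Write it the same way the final chunk of this notebook does
demo_file <- tempfile(fileext = ".json")
write_json(demo_paths, path = demo_file, pretty = TRUE, auto_unbox = TRUE)

# Read the paths back as a named list and look one up by name
paths <- read_json(demo_file)
paths$nhdplus_dir
```

Because `auto_unbox = TRUE` writes each path as a plain JSON string, `read_json()` returns each entry as a length-one character value that can be passed straight to `file.path()` or `read_sf()`.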