Bug: Handling of incomplete dates
name: Bug Report
about: Lack of functionality to handle incomplete dates
I believe HASP has trouble handling incomplete dates (i.e., cases where only the year, or only the year and month, of a sampling time was recorded). I'll provide two examples below; I am happy to implement a fix once we decide on an appropriate course of action.
Example 1: Site 180049066381200
library(dataRetrieval)
library(HASP)
# define site number and get discrete and daily value groundwater data
siteno <- "180049066381200"
discrete_data <- readNWISgwl(siteno, parameterCd = "72019")
continuous_data <- readNWISdv(siteno, parameterCd = "72019", statCd = "00003")
# call the internal HASP function set_up_data() that underpins other functionality
# (it is not exported, so it is accessed with HASP:::, as in the traceback below)
HASP:::set_up_data(
gw_level_dv = continuous_data,
gwl_data = discrete_data,
parameter_cd = "72019",
date_col = NA,
value_col = NA,
approved_col = NA,
stat_cd = "00003"
)
This code snippet returns the following traceback:
Error in format.default(gwl_data[[date_col_per]], "%Y") :
invalid 'trim' argument
3. format.default(gwl_data[[date_col_per]], "%Y")
2. format(gwl_data[[date_col_per]], "%Y") at gwl_single_sites.R#424
1. HASP:::set_up_data(continuous, discrete, "72019", NA, NA, NA, "00003")
In this case, there are no daily groundwater data, so the "continuous_data" variable is empty. The discrete measurements do exist, and the error is caused by the oldest measurement. The output of head(discrete_data, 1) is the following:
agency_cd site_no site_tp_cd lev_dt lev_tm lev_tz_cd_reported lev_va sl_lev_va sl_datum_cd lev_status_cd
USGS 180049066381200 GW 1936 <NA> UTC 60 NA <NA> 1
lev_agency_cd lev_dt_acy_cd lev_acy_cd lev_src_cd lev_meth_cd lev_age_cd parameter_cd lev_dateTime lev_tz_cd
USGS Y <NA> S L A 72019 <NA> UTC
Critically, the date value in column lev_dt is "1936", while HASP expects that entry to be a complete date. The second value, discrete_data$lev_dt[2], is "2012-11-20", which is a full date.
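The error message is consistent with this reading. Because lev_dt holds "1936", the column cannot be a Date vector and appears to be character, so format() dispatches to format.default() (as the traceback shows), where the second positional argument is trim rather than a date format. A minimal offline sketch in base R, with the values copied from above and no HASP or dataRetrieval calls involved:

```r
# Values copied from the lev_dt column above. They are held as character
# here on the assumption that the column cannot be Date when it contains "1936".
lev_dt <- c("1936", "2012-11-20")

# On a Date vector, format() dispatches to format.Date() and "%Y" is
# treated as a date format string.
year_ok <- format(as.Date("2012-11-20"), "%Y")
print(year_ok)

# On a character vector, format() dispatches to format.default(), whose
# second positional argument is `trim`, so "%Y" is rejected with an error.
res <- tryCatch(
  format(lev_dt, "%Y"),
  error = function(e) conditionMessage(e)
)
print(res)
```

If this is indeed the mechanism, any site whose discrete record contains at least one incomplete date would hit the same error, regardless of how many complete dates follow.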
Example 2: Site 290000095192602
library(dataRetrieval)
library(HASP)
# define site number and get discrete and daily value groundwater data
siteno <- "290000095192602"
discrete_data <- readNWISgwl(siteno, parameterCd = "72019")
continuous_data <- readNWISdv(siteno, parameterCd = "72019", statCd = "00003")
# call the internal HASP function set_up_data() that underpins other functionality
# (it is not exported, so it is accessed with HASP:::, as in the traceback below)
HASP:::set_up_data(
gw_level_dv = continuous_data,
gwl_data = discrete_data,
parameter_cd = "72019",
date_col = NA,
value_col = NA,
approved_col = NA,
stat_cd = "00003"
)
This code snippet returns the following traceback:
Error in format.default(gwl_data[[date_col_per]], "%Y") :
invalid 'trim' argument
3. format.default(gwl_data[[date_col_per]], "%Y")
2. format(gwl_data[[date_col_per]], "%Y") at gwl_single_sites.R#424
1. HASP:::set_up_data(continuous, discrete, "72019", NA, NA, NA, "00003")
Similarly, in this case there is no daily data. The first row of the discrete sampling data is the following:
agency_cd site_no site_tp_cd lev_dt lev_tm lev_tz_cd_reported lev_va sl_lev_va sl_datum_cd lev_status_cd
USGS 290000095192602 GW 1980-01 <NA> UTC 80 NA <NA> 1
lev_agency_cd lev_dt_acy_cd lev_acy_cd lev_src_cd lev_meth_cd lev_age_cd parameter_cd lev_dateTime lev_tz_cd
<NA> M <NA> <NA> L A 72019 <NA> UTC
Again, the first entry in the lev_dt column is an incomplete date, "1980-01". The second value in the lev_dt column is a complete date, "1992-02-12".
Potential Resolution
The simplest resolution in my mind is to have HASP discard rows with incomplete date information. I'm guessing this would be some version of a vectorized try-catch around as.Date(), which fails on the incomplete dates and succeeds on the complete ones. Let me know if that seems reasonable or if you have any suggestions for a fix.
My worry is that this approach discards valuable (and often older) data. When the subsequent calculation is a monthly or yearly statistic, knowing the specific day is not necessary, so some (or all) of the incomplete dates might still contain useful information. Should we devise a more clever scheme to "fill in" the incomplete dates based on the type of computation being performed on the data?
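To make the two options concrete, here is a rough sketch in base R. The helper flag_partial_dates() is hypothetical and not part of HASP; the choice to pad incomplete dates to the first day of the period is only an assumed convention for illustration. It relies on the fact that as.Date() with an explicit format returns NA for strings that do not match, instead of raising an error, so no try-catch is needed.

```r
# Hypothetical helper (not part of HASP): parse NWIS-style date strings and
# record how precise each one is, based only on the shape of the string.
flag_partial_dates <- function(x, fill = FALSE) {
  x <- as.character(x)

  precision <- rep(NA_character_, length(x))
  precision[grepl("^[0-9]{4}$", x)] <- "year"
  precision[grepl("^[0-9]{4}-[0-9]{2}$", x)] <- "month"
  precision[grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)] <- "day"

  to_parse <- x
  if (fill) {
    # Assumed convention: pad to the first day of the year or month.
    to_parse <- sub("^([0-9]{4})$", "\\1-01-01", to_parse)
    to_parse <- sub("^([0-9]{4}-[0-9]{2})$", "\\1-01", to_parse)
  }

  data.frame(
    raw = x,
    # With an explicit format, non-matching strings become NA (no error).
    date = as.Date(to_parse, format = "%Y-%m-%d"),
    precision = precision,
    stringsAsFactors = FALSE
  )
}

# Date strings taken from the two examples above.
lev_dt <- c("1936", "1980-01", "2012-11-20")

# Option 1: discard rows whose date is incomplete.
strict <- flag_partial_dates(lev_dt)
complete_only <- strict[!is.na(strict$date), ]

# Option 2: keep every row, padding incomplete dates and keeping the precision.
filled <- flag_partial_dates(lev_dt, fill = TRUE)
print(filled)
```

Keeping the precision column alongside the padded date would let a yearly statistic use all three rows while a daily plot uses only the rows marked "day". The lev_dt_acy_cd column shown in the examples ("Y" and "M") may carry the same information and could be used instead of inspecting the string.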
Alternatively, is this something that the user is expected to check before using HASP to manipulate their data?