Bug: Handling of incomplete dates

name: Bug Report
about: Lack of functionality to handle incomplete dates

I believe HASP has trouble handling incomplete dates (i.e., cases where only the year, or only the year and month, of a measurement was recorded). I'll provide two examples below; I am happy to implement a fix once we agree on an appropriate course of action.

Example 1: Site 180049066381200

library(dataRetrieval)
library(HASP)
# define site number and get discrete and daily value groundwater data
siteno <- "180049066381200"
discrete_data <- readNWISgwl(siteno, parameterCd = "72019")
continuous_data <- readNWISdv(siteno, parameterCd = "72019", statCd = "00003")
# call the internal (non-exported) HASP function set_up_data(), which underpins
# other functionality; the ":::" operator is needed because it is not exported
HASP:::set_up_data(
  gw_level_dv = continuous_data,
  gwl_data = discrete_data,
  parameter_cd = "72019",
  date_col = NA,
  value_col = NA,
  approved_col = NA,
  stat_cd = "00003"
)

This code snippet returns the following traceback:

Error in format.default(gwl_data[[date_col_per]], "%Y") : 
 invalid 'trim' argument
  3. format.default(gwl_data[[date_col_per]], "%Y")
  2. format(gwl_data[[date_col_per]], "%Y") at gwl_single_sites.R#424
  1. HASP:::set_up_data(continuous, discrete, "72019", NA, NA, NA, "00003")

In this case there is no daily groundwater data, so the "continuous_data" variable comes back empty. The discrete measurements do exist, and the error is triggered by the oldest one. The output of head(discrete_data, 1) is the following:

 agency_cd         site_no site_tp_cd lev_dt lev_tm lev_tz_cd_reported lev_va sl_lev_va sl_datum_cd lev_status_cd
      USGS 180049066381200         GW   1936   <NA>                UTC     60        NA        <NA>             1
 lev_agency_cd lev_dt_acy_cd lev_acy_cd lev_src_cd lev_meth_cd lev_age_cd parameter_cd lev_dateTime lev_tz_cd
          USGS             Y       <NA>          S           L          A        72019         <NA>       UTC

Critically, the value in the lev_dt column for this row is "1936", whereas HASP expects a complete date. The second value, discrete_data$lev_dt[2], is "2012-11-20", which is a complete date.
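The error message is consistent with this reading: if lev_dt is left as a character vector rather than a Date, format() dispatches to format.default(), whose second positional argument is trim, so "%Y" is rejected. A minimal sketch that reproduces the message outside HASP, assuming a character column like the one above (the lev_dt vector here is hand-typed, not the result of a dataRetrieval call):

```r
# Hand-typed stand-in for discrete_data$lev_dt
# (assumption: the column is character because one entry is not a full date)
lev_dt <- c("1936", "2012-11-20")

# On a character vector, format() dispatches to format.default(), where the
# second positional argument is `trim`; "%Y" is not a valid value for it.
msg <- tryCatch(
  format(lev_dt, "%Y"),
  error = function(e) conditionMessage(e)
)
print(msg)

# On a Date vector the same call dispatches to format.Date() and works.
yr <- format(as.Date("2012-11-20"), "%Y")
print(yr)
```

So the failure probably happens before any date arithmetic: the column never becomes a Date in the first place.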

Example 2: Site 290000095192602

library(dataRetrieval)
library(HASP)
# define site number and get discrete and daily value groundwater data
siteno <- "290000095192602"
discrete_data <- readNWISgwl(siteno, parameterCd = "72019")
continuous_data <- readNWISdv(siteno, parameterCd = "72019", statCd = "00003")
# call the internal (non-exported) HASP function set_up_data(), which underpins
# other functionality; the ":::" operator is needed because it is not exported
HASP:::set_up_data(
  gw_level_dv = continuous_data,
  gwl_data = discrete_data,
  parameter_cd = "72019",
  date_col = NA,
  value_col = NA,
  approved_col = NA,
  stat_cd = "00003"
)

This code snippet returns the following traceback:

Error in format.default(gwl_data[[date_col_per]], "%Y") : 
 invalid 'trim' argument
  3. format.default(gwl_data[[date_col_per]], "%Y")
  2. format(gwl_data[[date_col_per]], "%Y") at gwl_single_sites.R#424
  1. HASP:::set_up_data(continuous, discrete, "72019", NA, NA, NA, "00003")

Similarly, in this case there is no daily data. The first row of the discrete sampling data is the following:

 agency_cd         site_no site_tp_cd  lev_dt lev_tm lev_tz_cd_reported lev_va sl_lev_va sl_datum_cd lev_status_cd
      USGS 290000095192602         GW 1980-01   <NA>                UTC     80        NA        <NA>             1
 lev_agency_cd lev_dt_acy_cd lev_acy_cd lev_src_cd lev_meth_cd lev_age_cd parameter_cd lev_dateTime lev_tz_cd
          <NA>             M       <NA>       <NA>           L          A        72019         <NA>       UTC

Again, the first entry in the lev_dt column, "1980-01", is an incomplete date. The second entry, "1992-02-12", is a complete date.
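Both examples follow the same pattern, and the lev_dt_acy_cd column in the rows shown ("Y" in Example 1, "M" in Example 2) appears to track it. A small sketch of how the precision of each entry might be inferred from the string alone; date_precision() is a hypothetical helper, not a HASP or dataRetrieval function, and the input vector is hand-typed from the values quoted above:

```r
# Hand-typed stand-in combining the lev_dt values quoted in both examples
lev_dt <- c("1936", "1980-01", "1992-02-12", NA)

# Hypothetical helper: infer date precision from the shape of the string.
# Entries that match none of the patterns (including NA) stay NA.
date_precision <- function(x) {
  out <- rep(NA_character_, length(x))
  out[grepl("^[0-9]{4}$", x)] <- "year"
  out[grepl("^[0-9]{4}-[0-9]{2}$", x)] <- "month"
  out[grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)] <- "day"
  out
}

precision <- date_precision(lev_dt)
print(table(precision, useNA = "ifany"))
```

A summary like this would show how many measurements at a site are affected before deciding how to treat them.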

Potential Resolution

The simplest resolution, in my mind, is to have HASP discard rows with incomplete date information. I'm guessing this would be some version of a vectorized try-catch around as.Date(), which fails on the incomplete dates and succeeds on the complete ones. Let me know if that seems reasonable or if you have other suggestions for a fix.

My worry is that this discards valuable (and often older) data. When the subsequent calculation is a monthly or yearly statistic, knowing the specific day is not necessary, so some (or all) of the incomplete dates might still carry useful information. Should we devise a smarter scheme to "fill in" the incomplete dates based on the type of computation being performed on the data?
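To make the two options concrete, here is a rough sketch of each, not a proposed final implementation. Both function names are hypothetical, the data frame is a hand-typed stand-in for a readNWISgwl() result, and padding to the first day of the period is an arbitrary assumption. Note that as.Date() with an explicit format returns NA for strings that do not match instead of raising an error, so a try-catch may not be needed.

```r
# Hand-typed stand-in for a readNWISgwl() result (only the columns needed here).
# lev_va is NA where the value was not shown in the report.
gwl <- data.frame(
  lev_dt = c("1936", "1980-01", "1992-02-12", "2012-11-20"),
  lev_va = c(60, 80, NA, NA),
  stringsAsFactors = FALSE
)

# Option A (hypothetical helper): drop rows whose date is not a full YYYY-MM-DD.
drop_incomplete_dates <- function(df, date_col = "lev_dt") {
  parsed <- as.Date(as.character(df[[date_col]]), format = "%Y-%m-%d")
  n_bad <- sum(is.na(parsed))
  if (n_bad > 0) {
    warning(n_bad, " row(s) with missing or incomplete dates were dropped")
  }
  df[[date_col]] <- parsed
  df[!is.na(parsed), , drop = FALSE]
}

# Option B (hypothetical helper): pad incomplete dates to the first day of the
# period and flag the padded rows so downstream code can decide what to do.
fill_incomplete_dates <- function(df, date_col = "lev_dt") {
  x <- as.character(df[[date_col]])
  year_only <- grepl("^[0-9]{4}$", x)
  year_month <- grepl("^[0-9]{4}-[0-9]{2}$", x)
  x <- sub("^([0-9]{4})$", "\\1-01-01", x)
  x <- sub("^([0-9]{4}-[0-9]{2})$", "\\1-01", x)
  df[[date_col]] <- as.Date(x, format = "%Y-%m-%d")
  df$date_filled <- year_only | year_month
  df
}

dropped <- suppressWarnings(drop_incomplete_dates(gwl))
filled <- fill_incomplete_dates(gwl)
print(dropped)
print(filled)
```

With a flag column like date_filled, daily analyses could exclude the padded rows while monthly or yearly statistics keep the ones whose precision is sufficient.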

Alternatively, is this something the user is expected to check before using HASP to manipulate their data?

Edited Nov 25, 2022 by Hariharan, Jayaram Athreya