Switching from day-of-year to month-day indexing - ripple effects
Day-of-year indexing is problematic for percentile calculations due to inconsistent year lengths (365 vs 366 days in leap years). We decided to switch from day-of-year indexing in percentile calculations to month-day indexing. This means that a percentile for March 1 is composed of all flows measured on March 1, rather than all flows measured on day 60 (which is sometimes Feb 29 and sometimes Mar 1). Because Feb 29 generally occurs only once every 4 years, it will have less data associated with it than other month-days. Furthermore, this switch to month-day interacts with many functions in hyswap that take an input called `year_type`. The `year_type` argument uses the day-of-year index to determine the order of the output dataset, based on whether the user wants to plot by water or climate year.
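To make the day-60 problem concrete, here is a minimal pandas sketch. It uses synthetic flows and does not call hyswap; the column names `doy` and `month_day` are illustrative assumptions, not the package's own column names.

```python
import numpy as np
import pandas as pd

# Synthetic daily "flows" covering one leap year (2020) and one non-leap year (2021).
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"flow": np.arange(len(dates), dtype=float)}, index=dates)

# Day-of-year index: day 60 lands on different calendar dates in the two years.
df["doy"] = df.index.dayofyear
day_60_dates = df.loc[df["doy"] == 60].index.strftime("%Y-%m-%d").tolist()

# Month-day index: "03-01" always refers to March 1.
df["month_day"] = df.index.strftime("%m-%d")
mar_1_dates = df.loc[df["month_day"] == "03-01"].index.strftime("%Y-%m-%d").tolist()

# Grouping by month-day gives one statistic per calendar date,
# and shows that Feb 29 has fewer observations behind it.
median_by_month_day = df.groupby("month_day")["flow"].quantile(0.5)
counts = df.groupby("month_day")["flow"].count()

print(day_60_dates)   # day 60 mixes Feb 29 and Mar 1
print(mar_1_dates)    # month-day keeps Mar 1 with Mar 1
print(counts["02-29"], counts["03-01"])
```

With only two years of data the effect is small, but over a long record the Feb 29 group would hold roughly a quarter as many values as any other month-day.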
After discussing with the team, we have decided to explore the options associated with switching to a month-day index and removing the package's reliance on `year_type` inputs. This issue summarizes the functions affected by this change and suggests action(s) to address the resulting breaks.
Functions that rely on day-of-year and/or year_type:
- `utils.define_year_doy_columns` - uses (or creates) the datetime index from a dataframe to create columns called `index_year` and `index_doy` that are dependent upon the `year_type` input. It also adds a column called `index_month_day` that holds the month-day format for each day.
  Suggested adjustment: edit `utils.define_year_doy_columns` to use month-day to create `index_year` based on `year_type`. `index_doy` could be removed, since it would no longer be needed.
- `utils.leap_year_adjustment` - removes Feb 29 from the dataset and adjusts `index_doy` and `index_year` based on `year_type` to account for the removal.
  Suggested adjustment: simplify to a function that only removes rows with a datetime month-day of "02-29". Remove the `index_doy` work.
- `cumulative._tidy_cumulative_dataframe` - uses `index_year` and `index_doy` to create a `date` column of the format `year-doy`. `year-doy` will change significantly when the user specifies a water or climate `year_type`.
  Suggested adjustment: the function loops through by `index_year` but uses the datetime index to go from the earliest to the latest date in the `index_year` to make the cumulative calculation. Actually removed this as a requirement for `calculate_daily_cumulative_values`.
- `cumulative.calculate_daily_cumulative_values` - needs `year_type` via `define_year_doy_columns` and `_tidy_cumulative_dataframe`. I'm actually wondering if there is a bug in this function due to the `year_type` input for water and climate years. For example, the input flow data are structured via `define_year_doy_columns` for a water year and then the cdf is created using that `index_year` structure. After the cdf is created, the `_tidy_cumulative_dataframe` function uses the water year input to re-adjust the already adjusted dataframe. It seems to be recording water years incorrectly (e.g. 1918-10-01 in the example is assigned to WY 1918 rather than WY 1919).
  I don't think any adjustment is needed in this function at this time, except see #71 (closed). Removed reliance on `_tidy_cumulative_dataframe`, which I think fixed the problem in #71 (closed).
- `percentiles.calculate_variable_percentile_thresholds_by_day_of_year` - uses `define_year_doy_columns` to create indices for the input data, but they are not used beyond that. The function creates a separate `doy_index` to loop through each day of the calendar year. At the end of the function, it uses `doy_index` to align with the user's `year_type` input.
  I don't think any adjustment is needed in this function at this time.
- `plots.plot_cumulative_hydrograph` - first sets the index using `define_year_doy_columns` for a cdf created using `calculate_daily_cumulative_values`. It then uses `calculate_variable_percentile_thresholds_by_day` to create the percentile envelopes for plotting.
  Suggested adjustment: the x-axis is defined by month-day. Within each `index_year`, defined by the `year_type`, month-days need to be re-arranged in the data and the plot to match `year_type`. Remove reliance on `index_doy`.
- `rasterhydrograph._calculate_date_range` - uses `year_type` to determine the min and max years to include in a raster hydrograph.
  I don't think any adjustment is needed in this function at this time.
- `rasterhydrograph._check_inputs` - checks that `year_type` is included in the function call.
  I don't think any adjustment is needed in this function at this time.
- `rasterhydrograph.format_data` - uses `_calculate_date_range` and `_check_inputs` first, then sets the index using `define_year_doy_columns`. It uses `index_doy` to order columns and "future columns". Removes leap days.
  Suggested adjustment: remove reliance on `index_doy` and instead just use month-day. Steps could be: get `index_year`, order data by date within the index year, then pivot by month-day, so that there is a column for each month-day. This might need some manual re-arranging for non-calendar `year_type` values.
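The suggested adjustments for `leap_year_adjustment`, `define_year_doy_columns`, and `format_data` can be sketched together as follows. This is only an illustration under stated assumptions, not the hyswap implementation: the helper names `drop_leap_days`, `add_index_columns`, and `pivot_by_month_day` are hypothetical, and it assumes water years start Oct 1 and climate years start Apr 1, each labeled by the calendar year in which it ends.

```python
import numpy as np
import pandas as pd

# Assumed first month of each year type (hypothetical lookup, not from hyswap).
START_MONTH = {"calendar": 1, "water": 10, "climate": 4}


def drop_leap_days(df):
    """Remove rows whose datetime index falls on Feb 29 (month-day "02-29")."""
    is_leap_day = (df.index.month == 2) & (df.index.day == 29)
    return df.loc[~is_leap_day]


def add_index_columns(df, year_type="calendar"):
    """Add index_month_day and index_year without using day of year."""
    out = df.copy()
    start = START_MONTH[year_type]
    months = np.asarray(out.index.month)
    years = np.asarray(out.index.year)
    out["index_month_day"] = out.index.strftime("%m-%d")
    # For non-calendar years, dates on or after the start month belong to
    # the year that ends in the following calendar year.
    shift = (months >= start).astype(int) if start > 1 else 0
    out["index_year"] = years + shift
    return out


def pivot_by_month_day(df, value_col, year_type="calendar"):
    """One row per index_year, one column per month-day, ordered by year type."""
    data = add_index_columns(drop_leap_days(df), year_type)
    table = data.pivot(index="index_year", columns="index_month_day",
                       values=value_col)
    # Build the 365 month-days in year-type order from a non-leap reference span.
    start = START_MONTH[year_type]
    ordered = pd.date_range(f"2001-{start:02d}-01", periods=365,
                            freq="D").strftime("%m-%d")
    return table.reindex(columns=ordered)


# Usage with two synthetic water years (2020 and 2021).
dates = pd.date_range("2019-10-01", "2021-09-30", freq="D")
flows = pd.DataFrame({"flow": np.arange(len(dates), dtype=float)}, index=dates)
table = pivot_by_month_day(flows, "flow", year_type="water")
print(table.iloc[:, :3])
```

Reordering the columns from a fixed non-leap reference span is one way to handle the "manual re-arranging" for non-calendar year types; under the same assumptions, 1918-10-01 is assigned to water year 1919.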