Cleaning datetime series — mc_prep

By default, mc_prep_clean runs automatically when mc_read_files() or mc_read_data() is called. mc_prep_clean checks the time series of a Raw-format myClim object for missing, duplicated, and disordered records. With resolve_conflicts=TRUE, the function regularizes the microclimatic time series to a constant time-step, removes duplicated records, and fills missing values with NA. With resolve_conflicts=FALSE, it does not perform the cleaning itself; instead it inserts new states (tags, see mc_states_insert) that highlight records with conflicts, i.e. records with a duplicated datetime but different measurement values. When there are no conflicts, cleaning is performed in both cases (resolve_conflicts=TRUE or FALSE). See details.

mc_prep_clean(data, silent = FALSE, resolve_conflicts = TRUE, tolerance = NULL)

Arguments

data: myClim object in Raw-format, see myClim-package
silent: if TRUE, the cleaning log table and progress bar are not printed to the console (default FALSE), see mc_info_clean()
resolve_conflicts: if TRUE (default), the object is cleaned automatically and, among conflicting measurements, the one whose original datetime is closest to the rounded datetime is selected, see details. If FALSE and conflict records exist, the function returns the original, uncleaned object with tags (states) "clean_conflict" highlighting records with a duplicated datetime but different measurement values. When no conflict records exist, the object is cleaned in both the TRUE and FALSE cases.
tolerance: list of tolerance values for each physical unit, see mc_data_physical. The format is list(unit_name=tolerance_value). If the maximal difference between conflicting values is lower than the tolerance, the conflict is resolved without a warning. If NULL (default), tolerance is not applied, see details.

Value

cleaned myClim object in Raw-format (default), returned when resolve_conflicts=TRUE, or when resolve_conflicts=FALSE but no conflicts exist
the cleaning log is printed to the console by default, and can also be retrieved later with mc_info_clean()
uncleaned myClim object in Raw-format with "clean_conflict" tags, returned when resolve_conflicts=FALSE and conflicts exist

Details

The function mc_prep_clean can be used in two different ways, depending on the parameter resolve_conflicts. When resolve_conflicts=TRUE, the function performs automatic cleaning and returns a cleaned myClim object. When resolve_conflicts=FALSE and the myClim object contains conflicts (rows with an identical time but different measured values), the function returns the original, uncleaned object with tags (states, see mc_states_insert) highlighting the records with a duplicated datetime but different measured values. When there are no conflicts, cleaning is performed in both cases (resolve_conflicts=TRUE or FALSE).

Processing the data with mc_prep_clean and resolving the conflicts is a mandatory step required for further data handling in the myClim library.

This function guarantees that all time series are in chronological order, have a regular time-step, and contain no duplicated records. mc_prep_clean uses either the time-step provided by the user during data import with the mc_read functions (the time-step used is permanently stored in the logger metadata, mc_LoggerMetadata) or, if the user does not provide a time-step (NA), a time-step that myClim detects automatically from the last 100 records of the input time series. For irregular time series, the function returns a warning and skips (does not read) the file.
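The three properties named above (chronological order, regular step, no duplicates) can be illustrated with a small base R sketch. It is not myClim code: count_problems is a hypothetical helper, and the way it counts each problem is an assumption that may differ from how the cleaning log computes its columns.

```r
# Hypothetical helper, not part of myClim: count duplicated, missing, and
# disordered records in one datetime series for a given step.
count_problems <- function(datetime, step_seconds) {
  secs <- as.numeric(datetime)
  # a record counts as disordered when it is earlier than the one before it
  disordered <- sum(diff(secs) < 0)
  # a record counts as duplicated when its datetime already occurred
  duplicities <- sum(duplicated(secs))
  # regular grid spanning the series; grid points without a record are missing
  grid <- seq(min(secs), max(secs), by = step_seconds)
  missing <- sum(!(grid %in% secs))
  c(duplicities = duplicities, missing = missing, disordered = disordered)
}

# Made-up series with a 15 min step: 09:15 appears twice, 09:30 is absent,
# and 09:45 is stored after 10:00.
times <- as.POSIXct(c("2020-10-06 09:00:00", "2020-10-06 09:15:00",
                      "2020-10-06 09:15:00", "2020-10-06 10:00:00",
                      "2020-10-06 09:45:00"), tz = "UTC")
problems <- count_problems(times, step_seconds = 900)
print(problems)
```

A series that passes all three checks returns zero in every position.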

When the user provides a time-step during data import in the mc_read functions instead of relying on automatic step detection, and the provided step does not correspond to the actual records (e.g., the logger records data every 900 seconds but the user provides a step of 3600 seconds), the myClim rounding routine consolidates multiple records into an identical datetime. The resulting value is taken from the original record closest to the rounded datetime (e.g., in an original series like ...9:50, 10:05, 10:20, 10:35, 10:50, 11:05..., the new record is at 10:00 and its value is taken from the original record at 10:05). This process generates numerous warnings with resolve_conflicts=TRUE and a multitude of tags with resolve_conflicts=FALSE.
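The consolidation described above can be reproduced in base R as a rough sketch. It assumes that rounding means "nearest multiple of the provided step"; the measurement values are made up, and the code is an illustration, not the myClim routine itself.

```r
# Original series: a record every 900 s, starting at 09:50 (values are made up).
step <- 3600  # step provided by the user, in seconds
orig <- as.POSIXct("2020-10-06 09:50:00", tz = "UTC") + (0:5) * 900
values <- c(10.0, 10.5, 11.0, 11.5, 12.0, 12.5)

secs <- as.numeric(orig)
# assumption: each record is rounded to the nearest multiple of the step
rounded <- round(secs / step) * step
distance <- abs(secs - rounded)

# for every rounded datetime keep the index of the closest original record
slots <- sort(unique(rounded))
keep <- vapply(slots, function(s) {
  i <- which(rounded == s)
  i[which.min(distance[i])]
}, integer(1))

result <- data.frame(
  datetime = as.POSIXct(slots, origin = "1970-01-01", tz = "UTC"),
  taken_from = format(orig[keep], "%H:%M"),
  value = values[keep]
)
print(result)
```

In this sketch three original records compete for each hourly slot, which is why a mismatched step produces so many conflicts.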

The tolerance parameter is designed for situations where the logger does not perform optimally, but the user still needs to extract and analyze the data. In some cases, loggers may record multiple rows with identical timestamps but with slightly different microclimate values, due to the limitations of sensor resolution and precision. By using the tolerance parameter, myClim will automatically select one of these values and resolve the conflict without generating additional warnings. It is strongly recommended to set the tolerance value based on the sensor's resolution and precision.
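The tolerance rule stated in the Arguments section (a conflict is resolved without a warning when the maximal difference of the conflicting values is lower than the tolerance) can be sketched as follows. within_tolerance is a hypothetical helper, and the unit name T_C and the value 0.5 are assumptions chosen only for illustration.

```r
# Hypothetical helper, not a myClim function: decide whether a group of
# values sharing one datetime may be resolved without a warning.
within_tolerance <- function(values, tolerance = NULL) {
  # tolerance = NULL means the tolerance is not applied
  if (is.null(tolerance)) {
    return(FALSE)
  }
  (max(values) - min(values)) < tolerance
}

# Tolerance list in the documented format list(unit_name = tolerance_value);
# the unit name "T_C" and the value 0.5 are assumptions for this example.
tolerance <- list(T_C = 0.5)

small_gap <- within_tolerance(c(10.0625, 10.125), tolerance$T_C)
large_gap <- within_tolerance(c(10, 12), tolerance$T_C)
no_tolerance <- within_tolerance(c(10.0625, 10.125), NULL)
print(c(small_gap = small_gap, large_gap = large_gap, no_tolerance = no_tolerance))
```

Only the first pair differs by less than the tolerance, so only that conflict would be resolved silently under this rule.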

When the time-step is regular but not nicely rounded, the function rounds the time series to the closest nice time and shifts the original data. For example, original records with a regular 10 min step, c(11:58, 12:08, 12:18, 12:28), are shifted to the newly generated nice sequence c(12:00, 12:10, 12:20, 12:30). Note that the microclimatic records are not modified, only shifted in time. The maximum allowed shift of a time series is 30 minutes. For example, when the time-step is 2 h (e.g. 13:33, 15:33, 17:33), the measurement times are shifted to (13:30, 15:30, 17:30). If you have a 2 h time-step and wish to move to the whole hour (13:33 -> 14:00, 15:33 -> 16:00), the only way is aggregation: use the mc_agg(period="2 hours") command after data cleaning.
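Both shifting examples above can be reproduced with one simple rule: round each datetime to the nearest multiple of the step, capped at 30 minutes. That rule is an assumption chosen because it matches the two examples; the actual myClim routine may work differently. shift_to_nice is a hypothetical helper, and the dates are made up.

```r
# Hypothetical helper, not a myClim function. Assumption: the rounding grid
# is the time-step itself, but never coarser than 30 minutes (1800 s).
shift_to_nice <- function(datetime, step_seconds, max_grid = 1800) {
  grid <- min(step_seconds, max_grid)
  secs <- as.numeric(datetime)
  # the measured values would stay untouched; only the datetimes move
  as.POSIXct(round(secs / grid) * grid, origin = "1970-01-01", tz = "UTC")
}

ten_min <- as.POSIXct(c("2021-01-01 11:58:00", "2021-01-01 12:08:00",
                        "2021-01-01 12:18:00", "2021-01-01 12:28:00"),
                      tz = "UTC")
two_hour <- as.POSIXct(c("2021-01-01 13:33:00", "2021-01-01 15:33:00",
                         "2021-01-01 17:33:00"), tz = "UTC")

shifted_10min <- shift_to_nice(ten_min, step_seconds = 600)
shifted_2h <- shift_to_nice(two_hour, step_seconds = 7200)
print(format(shifted_10min, "%H:%M"))
print(format(shifted_2h, "%H:%M"))
```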

Examples

cleaned_data <- mc_prep_clean(mc_data_example_raw)
#> 5 loggers
#> datetime range: 2020-10-06 09:00:00 - 2021-02-01
#> detected steps: (900s = 15min)
#>   locality_id serial_number     logger_name          start_date   end_date
#> 1       A1E05      91184101        Thermo_1 2020-10-28 08:45:00 2021-02-01
#> 2       A1E05      92201058        Dendro_1 2020-10-31 12:00:00 2021-02-01
#> 3       A2E32      94184103           TMS_1 2020-10-16 06:15:00 2021-02-01
#> 4       A2E32      20024338 HOBO_U23-001A_1 2020-10-09 08:00:00 2021-02-01
#> 5       A6W79      94184102           TMS_1 2020-10-06 09:00:00 2021-02-01
#>   step_seconds count_duplicities count_missing count_disordered rounded
#> 1          900                 0             0                0   FALSE
#> 2          900                 0             0                0   FALSE
#> 3          900                 0             0                0   FALSE
#> 4          900                 0             0                0   FALSE
#> 5          900                 0             0                0   FALSE