CARD_extraction

Extracts variables from time series (for example, the yearly mean of a time series) using CARD parameterization files.

Usage

CARD_extraction(
  data,
  CARD_name = c("QA", "QJXA"),
  CARD_path = NULL,
  period_default = NULL,
  suffix = NULL,
  suffix_delimiter = "_",
  cancel_lim = FALSE,
  simplify = FALSE,
  expand_overwrite = NULL,
  sampling_period_overwrite = NULL,
  rmNApct = TRUE,
  rm_duplicates = FALSE,
  extract_only_metadata = FALSE,
  dev = FALSE,
  verbose = FALSE
)

Arguments

data

The input data must be a tibble from the tibble package. It needs to have:

Exactly one column of Date values that are regularly spaced and unique within each time series.
If there is more than one time series, at least one character column holding the names of the time series in order to identify them. If several identifier columns are given, they are all used together to identify a unique time series.
At least one numeric (or logical) column on which the variable extraction will be performed. Additional numeric columns can be left in, but they are dropped if they are not needed.

e.g.

> data
A tibble: 201 × 4
   time         Q_obs  Q_sim  ID
   <date>       <dbl>  <dbl>  <chr>
1   2000-02-10   10     97.8  serie 1
2   2000-02-11   19    -20.5  serie 1
3   2000-02-12   13    -76.9  serie 1
4   2000-02-13   15    -86.0  serie 1
    ...
103 2001-01-01  1.3     1988  serie 2
104 2001-01-02  1.2      109  serie 2
105 2001-01-03  1.0       90  serie 2
106 2001-01-04  1.1       91  serie 2
    ...
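
The requirements above can be checked before calling the function. The following is a minimal sketch in base R on a small hypothetical tibble (the column names time, Q_obs and ID are only illustrative); it is not the validation that CARD_extraction itself performs.

```r
library(tibble)

# Two daily series stacked in one tibble; column names are illustrative
dates <- seq.Date(as.Date("2000-02-10"), by = "day", length.out = 5)
data <- tibble(
  time  = rep(dates, times = 2),
  Q_obs = c(10, 19, 13, 15, 12, 1.3, 1.2, 1.0, 1.1, 1.4),
  ID    = rep(c("serie 1", "serie 2"), each = 5)
)

# Exactly one Date column
n_date_cols <- sum(vapply(data, inherits, logical(1), what = "Date"))

# Dates are unique within each series
has_duplicates <- any(duplicated(data[c("ID", "time")]))

# Regular spacing: one distinct step between consecutive dates per series
n_steps <- tapply(as.numeric(data$time), data$ID,
                  function(x) length(unique(diff(x))))
is_regular <- all(n_steps == 1)

stopifnot(n_date_cols == 1, !has_duplicates, is_regular)
```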

CARD_name

A vector of character strings specifying which variables to extract. See CARD_list_all() to get the variable names. By default, c("QA", "QJXA"). If NULL, all the variables will be extracted, so avoid this value unless it is combined with extract_only_metadata = TRUE or with a custom CARD_path directory.

CARD_path

An optional character string for the path where to search for custom CARDs that have been created by the CARD_management function. By default, NULL in order to get the default CARD variable parameters.

period_default

A vector of two dates (or two unambiguous character strings that can be coerced to dates) to restrict the period of analysis. For example, c("1950-01-01", "2020-12-31") selects data from the 1st of January 1950 to the end of December 2020. Some CARDs can have a specific period parameter that overrides this period_default argument. The default is period_default=NULL, which considers all available data for each time series.
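
The effect of this argument can be pictured as a date filter. The sketch below uses base R on a few hypothetical dates and assumes that both bounds are included; it illustrates the idea and is not the package's internal code.

```r
# Hypothetical daily dates that start before the requested period
time <- seq.Date(as.Date("1949-12-30"), as.Date("1950-01-03"), by = "day")

# period_default may be given as character strings coercible to Date
period_default <- c("1950-01-01", "2020-12-31")
period <- as.Date(period_default)

# Keep only the dates inside the closed interval [start, end] (assumption)
kept <- time[time >= period[1] & time <= period[2]]
kept
```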

suffix

A vector of character strings giving the suffixes to append to the column names of the extracted variables. This parameter makes it easier to handle multiple extraction scenarios. For example, a cumbersome case is applying the same function to several columns. It is possible to give funct=list(QA_obs=mean, QA_sim=mean) and funct_args=list(list("Q_obs", na.rm=TRUE), list("Q_sim", na.rm=TRUE)), or simply funct=list(QA=mean) and funct_args=list("Q", na.rm=TRUE) with suffix=c("obs", "sim"). The two approaches give the same result. Default NULL.

suffix_delimiter

A character string specifying the delimiter to use between the variable name and the suffix when suffix is not NULL. The default is "_".
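
How suffix and suffix_delimiter combine into column names can be sketched in base R. The toy data frame and the use of mean() below are assumptions chosen to mirror the QA example in the suffix description; the actual expansion is done inside the package.

```r
suffix <- c("obs", "sim")
suffix_delimiter <- "_"

# Input columns and output variable names built from the base names
input_cols  <- paste("Q",  suffix, sep = suffix_delimiter)
output_cols <- paste("QA", suffix, sep = suffix_delimiter)

# Toy data with one column per suffix (hypothetical values)
data <- data.frame(Q_obs = c(10, 19, 13, NA), Q_sim = c(9, 21, 12, 15))

# Apply the same function to each suffixed column
res <- vapply(input_cols,
              function(col) mean(data[[col]], na.rm = TRUE),
              numeric(1))
names(res) <- output_cols
res
```

Here the names of res are "QA_obs" and "QA_sim", which is the naming that the explicit funct=list(QA_obs=mean, QA_sim=mean) form spells out by hand.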

cancel_lim

A logical to specify whether to cancel the NA percentage limits in the CARDs. Default is FALSE.
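
As an illustration of what such a limit does, the sketch below computes the percentage of missing values in one hypothetical yearly group and discards the extracted value when the percentage is strictly above a limit. The 3 % value is taken from the verbose messages in the Examples section; each CARD may define its own limit, and the variable names are illustrative.

```r
# Hypothetical yearly group of 365 daily values, 20 of them missing
Q <- c(rep(NA_real_, 20), seq_len(345))

NApct <- 100 * sum(is.na(Q)) / length(Q)  # percentage of missing values
lim <- 3                                  # assumed limit, in percent

cancel_lim <- FALSE
value <- mean(Q, na.rm = TRUE)
if (!cancel_lim && NApct > lim) {
  # Too much missing data: the extracted value is discarded
  value <- NA_real_
}
value
```

With cancel_lim set to TRUE in this sketch, the mean of the 345 available values would be kept instead.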

simplify

A logical to specify whether to simplify the extracted data by joining the tibbles extracted from each CARD. Useful when the extracted variables have no temporal extension. Default FALSE.

expand_overwrite

logical or NULL. If TRUE, expand the output tibble into a list of tibbles, one for each extracted variable by suffix. Default NULL to keep the value specified in the CARDs used.

sampling_period_overwrite

A character string or a vector of two character strings that indicates how to sample the data for each time step defined by time_step. Hence, the choice of this argument needs to be linked with the choice of the time step. For example, for a yearly extraction, that is, if time_step is set to "year", sampling_period needs to be formatted as %m-%d (a month and a day of the year) in order to indicate the start of the sampling of data for the current year. More precisely, if time_step="year" and sampling_period="03-19", funct will be applied to all data from the 19th of March of each year to the 18th of March of the following one. In the same way, it is possible to create a sub-year sampling with a vector of two character strings, such as sampling_period=c("02-01", "07-31"), in order to process data only if the date is between the 1st of February and the 31st of July of each year. (Not available for now) For a monthly (or seasonal) extraction, sampling_period needs to give only the day in each month, for example sampling_period="10" to extract data from the 10th of each month to the 9th of the following month. Default NULL to keep the value specified in the CARDs used.
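
The yearly sampling logic can be sketched in base R. The example assumes a window starting on "03-19" (the 19th of March) and a few hypothetical dates; it only illustrates how dates map to sampling years and to a sub-year window, and is not the package's implementation.

```r
time <- as.Date(c("2001-03-18", "2001-03-19", "2001-12-31",
                  "2002-03-18", "2002-03-19"))
md <- format(time, "%m-%d")  # month-day part of each date

# Yearly sampling starting on the 19th of March:
# dates before the start belong to the window opened the previous year
sampling_period <- "03-19"
year <- as.integer(format(time, "%Y"))
sampling_year <- ifelse(md < sampling_period, year - 1L, year)
sampling_year

# Sub-year sampling: keep only dates between the 1st of February
# and the 31st of July
window <- c("02-01", "07-31")
in_window <- md >= window[1] & md <= window[2]
in_window
```

Comparing the zero-padded "%m-%d" strings works here because they sort in calendar order.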

rmNApct

logical. Should the NApct column, which shows the percentage of missing values in the output, be removed? Default TRUE.

rm_duplicates

logical. Should duplicate time series values be automatically removed? Default FALSE.
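
One way to picture duplicate removal is to drop rows that repeat the same series identifier and date. The sketch below does this in base R on hypothetical data and keeps the first occurrence; which occurrence the package keeps is not stated here, so this choice is an assumption.

```r
data <- data.frame(
  time = as.Date(c("2000-02-10", "2000-02-11", "2000-02-11", "2000-02-12")),
  Q    = c(10, 19, 19, 13),
  ID   = "serie 1"
)

# Flag rows whose (ID, time) pair has already been seen
dup <- duplicated(data[c("ID", "time")])

# Keep the first occurrence of each pair (assumption)
data_unique <- data[!dup, ]
nrow(data_unique)
```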

extract_only_metadata

logical. If TRUE, only the metadata of the CARDs will be extracted. In that case, use data=NULL. Default FALSE.

dev

logical. If TRUE, development mode is enabled. Default is FALSE.

verbose

logical. Should intermediate messages be printed during the execution of the function? Default FALSE.

Value

A list of two tibbles.

The dataEX tibble, which contains the extracted variable, or a named list of tibbles for each extracted variable if expand_overwrite is TRUE.
The metaEX tibble, which contains the metadata of the extraction from CARDs.

Examples

library(CARD)

# Get all the available variables
metaEX_all = CARD_list_all()
metaEX_all
#> # A tibble: 563 × 23
#>    CARD_name          variable_en       unit_en name_en description_en method_en
#>    <chr>              <chr>             <chr>   <chr>   <chr>          <chr>    
#>  1 ETPA               ETPA              mm      Cumula… ""             ""       
#>  2 BFI_Wal            BFI_Wal           withou… Basefl… "Ratio betwee… "1. no t…
#>  3 BFM                BFM               withou… Basefl… ""             "1. no t…
#>  4 delta{BFI}_LH_H1   delta{BFI}_LH_H1  withou… Change… "Ratio betwee… "1. no t…
#>  5 delta{BFI}_LH_H2   delta{BFI}_LH_H2  withou… Change… "Ratio betwee… "1. no t…
#>  6 delta{BFI}_LH_H3   delta{BFI}_LH_H3  withou… Change… "Ratio betwee… "1. no t…
#>  7 delta{BFI}_Wal_H1  delta{BFI}_Wal_H1 withou… Change… "Ratio betwee… "1. no t…
#>  8 delta{BFI}_Wal_H2  delta{BFI}_Wal_H2 withou… Change… "Ratio betwee… "1. no t…
#>  9 delta{BFI}_Wal_H3  delta{BFI}_Wal_H3 withou… Change… "Ratio betwee… "1. no t…
#> 10 delta{centerBF}_H1 delta{centerBF}_… day     Averag… "Date when 50… "1. annu…
#> # ℹ 553 more rows
#> # ℹ 17 more variables: sampling_period_en <chr>, topic_en <chr>,
#> #   variable_fr <chr>, unit_fr <chr>, name_fr <chr>, description_fr <chr>,
#> #   method_fr <chr>, sampling_period_fr <chr>, topic_fr <chr>,
#> #   is_experimental <lgl>, input_vars <chr>, source <chr>,
#> #   preferred_sampling_period <chr>, is_date <lgl>, to_normalise <lgl>,
#> #   palette <chr>, script_path <chr>

# Create mock data
Start = as.Date("2001-03-02")
End = as.Date("2024-11-30")
Date = seq.Date(Start, End, by="day")
data = dplyr::tibble(time=Date,
                     Q=as.numeric(Date),
                     id="serie 1")

# Do a direct extraction
res = CARD_extraction(data, CARD_name=c("QA", "QMNA"), verbose=TRUE)
#> [1] "Computes QMNA"
#> [1] "Process 1/2"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Missing year                                  "
#> [1] "│   └── Checking missing continuous periods       "
#> [1] "│       longer than 10 years                      "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   ├── Default sample period used                "
#> [1] "│   └── Fixing sample period                      "
#> [1] "│       ├── Only start of the sample period was   "
#> [1] "│       │   given                                 "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 01 / 30                         "
#> [1] "├── Monthly extraction along years                "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "│       ├── Computing of time indicators for      "
#> [1] "│       │   each time serie                       "
#> [1] "│       ├── Get number of missing data for start  "
#> [1] "│       │   and end                               "
#> [1] "│       └── Create each group                     "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   ├── Removing data if NA percentage is         "
#> [1] "│   │   strictly above 3 %                        "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    ├── Keeping only the needed data : all        "
#> [1] "    └── Return data                               "
#> # A tibble: 8,675 × 4
#>    id      time           Q   QMA
#>    <chr>   <date>     <dbl> <dbl>
#>  1 serie 1 2001-03-02 11383    NA
#>  2 serie 1 2001-03-03 11384    NA
#>  3 serie 1 2001-03-04 11385    NA
#>  4 serie 1 2001-03-05 11386    NA
#>  5 serie 1 2001-03-06 11387    NA
#>  6 serie 1 2001-03-07 11388    NA
#>  7 serie 1 2001-03-08 11389    NA
#>  8 serie 1 2001-03-09 11390    NA
#>  9 serie 1 2001-03-10 11391    NA
#> 10 serie 1 2001-03-11 11392    NA
#> # ℹ 8,665 more rows
#> [1] "Process 2/2"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   ├── Fixing sample period for each time series "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 11-01 / 10-31                   "
#> [1] "├── Yearly extraction                             "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "│       ├── Computing of time indicators for      "
#> [1] "│       │   each time serie                       "
#> [1] "│       ├── Get number of missing data for start  "
#> [1] "│       │   and end                               "
#> [1] "│       └── Create each group                     "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   ├── Removing data if NA percentage is         "
#> [1] "│   │   strictly above 3 %                        "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    └── Return data                               "
#> # A tibble: 25 × 3
#>    id      time         QMNA
#>    <chr>   <date>      <dbl>
#>  1 serie 1 2000-11-01    NA 
#>  2 serie 1 2001-11-01 11642.
#>  3 serie 1 2002-11-01 12006.
#>  4 serie 1 2003-11-01 12372.
#>  5 serie 1 2004-11-01 12738.
#>  6 serie 1 2005-11-01 13102.
#>  7 serie 1 2006-11-01 13468.
#>  8 serie 1 2007-11-01 13832.
#>  9 serie 1 2008-11-01 14198.
#> 10 serie 1 2009-11-01 14564.
#> # ℹ 15 more rows
#> [1] "Computes QA"
#> [1] "Process 1/1"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Missing year                                  "
#> [1] "│   └── Checking missing continuous periods       "
#> [1] "│       longer than 10 years                      "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   └── Fixing sample period                      "
#> [1] "│       ├── Only start of the sample period was   "
#> [1] "│       │   given                                 "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 09-01 / 08-31                   "
#> [1] "├── Yearly extraction                             "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "│       ├── Computing of time indicators for      "
#> [1] "│       │   each time serie                       "
#> [1] "│       ├── Get number of missing data for start  "
#> [1] "│       │   and end                               "
#> [1] "│       └── Create each group                     "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   ├── Removing data if NA percentage is         "
#> [1] "│   │   strictly above 3 %                        "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    └── Return data                               "
#> # A tibble: 25 × 3
#>    id      time           QA
#>    <chr>   <date>      <dbl>
#>  1 serie 1 2000-09-01    NA 
#>  2 serie 1 2001-09-01 11748 
#>  3 serie 1 2002-09-01 12113 
#>  4 serie 1 2003-09-01 12478.
#>  5 serie 1 2004-09-01 12844 
#>  6 serie 1 2005-09-01 13209 
#>  7 serie 1 2006-09-01 13574 
#>  8 serie 1 2007-09-01 13940.
#>  9 serie 1 2008-09-01 14305 
#> 10 serie 1 2009-09-01 14670 
#> # ℹ 15 more rows
res
#> $metaEX
#> # A tibble: 2 × 21
#>   variable_en unit_en      name_en   description_en method_en sampling_period_en
#>   <chr>       <chr>        <chr>     <chr>          <chr>     <chr>             
#> 1 QMNA        m^{3}.s^{-1} Annual m… ""             "1. mont… Month of maximum …
#> 2 QA          m^{3}.s^{-1} Annual m… ""             "1. annu… 09-01, 08-31      
#> # ℹ 15 more variables: topic_en <chr>, variable_fr <chr>, unit_fr <chr>,
#> #   name_fr <chr>, description_fr <chr>, method_fr <chr>,
#> #   sampling_period_fr <chr>, topic_fr <chr>, is_experimental <lgl>,
#> #   input_vars <chr>, preferred_sampling_period <chr>, is_date <lgl>,
#> #   to_normalise <lgl>, palette <chr>, script_path <chr>
#> 
#> $dataEX
#> $dataEX$QMNA
#> # A tibble: 25 × 3
#>    id      time         QMNA
#>    <chr>   <date>      <dbl>
#>  1 serie 1 2000-11-01    NA 
#>  2 serie 1 2001-11-01 11642.
#>  3 serie 1 2002-11-01 12006.
#>  4 serie 1 2003-11-01 12372.
#>  5 serie 1 2004-11-01 12738.
#>  6 serie 1 2005-11-01 13102.
#>  7 serie 1 2006-11-01 13468.
#>  8 serie 1 2007-11-01 13832.
#>  9 serie 1 2008-11-01 14198.
#> 10 serie 1 2009-11-01 14564.
#> # ℹ 15 more rows
#> 
#> $dataEX$QA
#> # A tibble: 25 × 3
#>    id      time           QA
#>    <chr>   <date>      <dbl>
#>  1 serie 1 2000-09-01    NA 
#>  2 serie 1 2001-09-01 11748 
#>  3 serie 1 2002-09-01 12113 
#>  4 serie 1 2003-09-01 12478.
#>  5 serie 1 2004-09-01 12844 
#>  6 serie 1 2005-09-01 13209 
#>  7 serie 1 2006-09-01 13574 
#>  8 serie 1 2007-09-01 13940.
#>  9 serie 1 2008-09-01 14305 
#> 10 serie 1 2009-09-01 14670 
#> # ℹ 15 more rows
#> 
#> 

# Or find the closest CARD variable that interests you
CARD_management(CARD_name=c("VCN10-5"),
                CARD_path="CARD-WIP",
                overwrite=TRUE)
# Personalise it in the created `"CARD-WIP"` directory (for example, change the return period)
# And perform a custom extraction
res = CARD_extraction(data, CARD_name=NULL,
                      CARD_path="CARD-WIP",
                      verbose=TRUE)
#> [1] "Computes VCN10-5"
#> [1] "Process 1/3"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Missing year                                  "
#> [1] "│   └── Checking missing continuous periods       "
#> [1] "│       longer than 10 years                      "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   ├── Default sample period used                "
#> [1] "│   └── Fixing sample period                      "
#> [1] "│       ├── Only start of the sample period was   "
#> [1] "│       │   given                                 "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 01-01 / 12-31                   "
#> [1] "├── None extraction                               "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    ├── Keeping only the needed data : all        "
#> [1] "    └── Return data                               "
#> # A tibble: 8,675 × 4
#>    id      time           Q   VC10
#>    <chr>   <date>     <dbl>  <dbl>
#>  1 serie 1 2001-03-02 11383    NA 
#>  2 serie 1 2001-03-03 11384    NA 
#>  3 serie 1 2001-03-04 11385    NA 
#>  4 serie 1 2001-03-05 11386    NA 
#>  5 serie 1 2001-03-06 11387 11388.
#>  6 serie 1 2001-03-07 11388 11388.
#>  7 serie 1 2001-03-08 11389 11390.
#>  8 serie 1 2001-03-09 11390 11390.
#>  9 serie 1 2001-03-10 11391 11392.
#> 10 serie 1 2001-03-11 11392 11392.
#> # ℹ 8,665 more rows
#> [1] "Process 2/3"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   ├── Fixing sample period for each time series "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 11-01 / 10-31                   "
#> [1] "├── Yearly extraction                             "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "│       ├── Computing of time indicators for      "
#> [1] "│       │   each time serie                       "
#> [1] "│       ├── Get number of missing data for start  "
#> [1] "│       │   and end                               "
#> [1] "│       └── Create each group                     "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   ├── Removing data if NA percentage is         "
#> [1] "│   │   strictly above 3 %                        "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    └── Return data                               "
#> # A tibble: 25 × 3
#>    id      time        VCN10
#>    <chr>   <date>      <dbl>
#>  1 serie 1 2000-11-01    NA 
#>  2 serie 1 2001-11-01 11628.
#>  3 serie 1 2002-11-01 11992.
#>  4 serie 1 2003-11-01 12358.
#>  5 serie 1 2004-11-01 12724.
#>  6 serie 1 2005-11-01 13088.
#>  7 serie 1 2006-11-01 13454.
#>  8 serie 1 2007-11-01 13818.
#>  9 serie 1 2008-11-01 14184.
#> 10 serie 1 2009-11-01 14550.
#> # ℹ 15 more rows
#> [1] "Process 3/3"
#> [1] "EXTRACTION PROCESS                                "
#> [1] "├── Period                                        "
#> [1] "│   └── Selecting all the data                    "
#> [1] "├── Sample period                                 "
#> [1] "│   ├── Default sample period used                "
#> [1] "│   └── Fixing sample period                      "
#> [1] "│       ├── Only start of the sample period was   "
#> [1] "│       │   given                                 "
#> [1] "│       ├── Every time series have the same       "
#> [1] "│       │   sample period                         "
#> [1] "│       └── All : 01-01 / 12-31                   "
#> [1] "├── None extraction                               "
#> [1] "│   └── Preparing date data for the extraction    "
#> [1] "│       ├── Get general sample info               "
#> [1] "├── Grouping data                                 "
#> [1] "├── Application of the function                   "
#> [1] "├── Cleaning extracted tibble                     "
#> [1] "│   ├── Manage possible infinite values           "
#> [1] "│   └── Recreate a date vector and add value for  "
#> [1] "│       NApct computing                           "
#> [1] "├── NA management                                 "
#> [1] "│   ├── Compute NA percentage                     "
#> [1] "│   └── Cleaning NA percentage info               "
#> [1] "└── Last cleaning and formating for output        "
#> [1] "    ├── Rename column                             "
#> [1] "    └── Return data                               "
#> # A tibble: 1 × 2
#>   id      `VCN10-5`
#>   <chr>       <dbl>
#> 1 serie 1    13495.
res$dataEX
#> $`VCN10-5`
#> # A tibble: 1 × 2
#>   id      `VCN10-5`
#>   <chr>       <dbl>
#> 1 serie 1    13495.
#>
