CARD_extraction
CARD_extraction.Rd
Extracts variables from time series (for example, the yearly mean of a time series) using CARD parameterization files.
Usage
CARD_extraction(
data,
CARD_name = c("QA", "QJXA"),
CARD_path = NULL,
period_default = NULL,
suffix = NULL,
suffix_delimiter = "_",
cancel_lim = FALSE,
simplify = FALSE,
expand_overwrite = NULL,
sampling_period_overwrite = NULL,
rmNApct = TRUE,
rm_duplicates = FALSE,
extract_only_metadata = FALSE,
dev = FALSE,
verbose = FALSE
)
Arguments
- data
The input data must be a tibble (from the tibble package). It needs to have:
Exactly one column of Date values that are regularly spaced and unique within each time series.
If there is more than one time series, at least one character column giving the names of the time series so that they can be identified. If several identifier columns are given, they are all used together to identify a unique time series.
At least one numeric (or logical) column on which the variable extraction is performed. Additional numeric columns can be left in, but those that are not needed are dropped.
e.g.
> data
# A tibble: 201 × 4
    time       Q_obs   Q_sim ID
    <date>     <dbl>   <dbl> <chr>
  1 2000-02-10  10      97.8 serie 1
  2 2000-02-11  19     -20.5 serie 1
  3 2000-02-12  13     -76.9 serie 1
  4 2000-02-13  15     -86.0 serie 1
...
103 2001-01-01   1.3  1988   serie 2
104 2001-01-02   1.2   109   serie 2
105 2001-01-03   1.0    90   serie 2
106 2001-01-04   1.1    91   serie 2
...
- CARD_name
A vector of character strings specifying which variables to extract. See CARD_list_all() for the variable names. By default, c("QA", "QJXA"). If NULL, all the variables are extracted, so avoid this value except with extract_only_metadata = TRUE or with your own custom CARD_path directory.
- CARD_path
An optional character string giving the path in which to search for custom CARDs created by the CARD_management() function. Default is NULL, which uses the default CARD variable parameters.
- period_default
A vector of two dates (or two unambiguous character strings that can be coerced to dates) restricting the period of analysis. For example, c("1950-01-01", "2020-12-31") selects data from 1 January 1950 to 31 December 2020. Some CARDs have a specific period parameter that overrides this period_default argument. The default is period_default=NULL, which considers all available data for each time series.
- suffix
A character vector of suffixes to append to the column names of the extracted variables. This parameter handles multiple extraction scenarios. For example, a cumbersome case is applying the same function to several columns. It is possible to give funct=list(QA_obs=mean, QA_sim=mean) and funct_args=list(list("Q_obs", na.rm=TRUE), list("Q_sim", na.rm=TRUE)), or simply funct=list(QA=mean) and funct_args=list("Q", na.rm=TRUE) with suffix=c("obs", "sim"). The two approaches give the same result. Default NULL.
- suffix_delimiter
A character string specifying the delimiter to use between the variable name and the suffix when suffix is not NULL. The default is "_".
- cancel_lim
A logical specifying whether to cancel the NA percentage limits in the CARDs. Default is FALSE.
- simplify
A logical specifying whether to simplify the extracted data by joining the tibbles extracted from each CARD. Useful when the extracted variable has no temporal extension. Default FALSE.
- expand_overwrite
logical or NULL. If TRUE, expand the output tibble into a list of tibbles for each extracted variable by suffix. Default NULL to keep the value specified in the CARDs used.
- sampling_period_overwrite
A character string or a vector of two character strings indicating how to sample the data within each time step defined by time_step. The choice of this argument is therefore linked to the choice of the time step. For example, for a yearly extraction, that is if time_step is set to "year", sampling_period needs to be formatted as %m-%d (month and day of the year) to indicate where the sampling of data starts for the current year. More precisely, if time_step="year" and sampling_period="03-19", funct will be applied to all data from 19 March of each year to 18 March of the following one. In the same way, a sub-year sampling can be created with a vector of two character strings such as sampling_period=c("02-01", "07-31"), in order to process data only if the date is between 1 February and 31 July of each year. [Not available for now] For a monthly (or seasonal) extraction, sampling_period needs to give only the day of each month, for example sampling_period="10" to extract data from the 10th of each month to the 9th of the following month. Default NULL to keep the value specified in the CARDs used.
- rmNApct
logical. Should the NApct column, which shows the percentage of missing values in the output, be removed? Default TRUE.
- rm_duplicates
logical. Should duplicate time series values be automatically removed? Default FALSE.
- extract_only_metadata
logical. If TRUE, only the metadata of the CARDs are extracted. In that case, use data=NULL. Default FALSE.
- dev
logical. If TRUE, development mode is enabled. Default is FALSE.
- verbose
logical. Should intermediate messages be printed during the execution of the function? Default FALSE.
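As a rough illustration of the yearly sampling_period convention described under sampling_period_overwrite, the base R sketch below labels each date with the start of the sampling year it belongs to, and builds the filter for a sub-year window. It does not use CARD internals: sampling_year_start and in_window are hypothetical helper names introduced here, and the way CARD actually groups the data may differ.

```r
# Hypothetical helper (not part of CARD): for a "%m-%d" start such as
# "03-19", return the start date of the sampling year containing each date.
sampling_year_start = function(dates, start="03-19") {
    year = as.integer(format(dates, "%Y"))
    # Start of the sampling period within the calendar year of each date
    start_this_year = as.Date(paste0(year, "-", start))
    # Dates before that start belong to the period opened the previous year
    before = dates < start_this_year
    as.Date(paste0(year - before, "-", start))
}

# Hypothetical helper: TRUE when a date falls inside a sub-year window
# given as c(start, end) in "%m-%d", with start <= end in the same year.
in_window = function(dates, window=c("02-01", "07-31")) {
    md = format(dates, "%m-%d")
    md >= window[1] & md <= window[2]
}

dates = as.Date(c("2001-03-18", "2001-03-19", "2002-03-18", "2002-03-19"))
sampling_year_start(dates)
#> [1] "2000-03-19" "2001-03-19" "2001-03-19" "2002-03-19"

in_window(as.Date(c("2001-01-31", "2001-02-01", "2001-07-31", "2001-08-01")))
#> [1] FALSE  TRUE  TRUE FALSE

# Yearly mean per sampling year on mock daily data
Date = seq.Date(as.Date("2001-03-19"), as.Date("2003-03-18"), by="day")
Q = as.numeric(Date)
tapply(Q, format(sampling_year_start(Date)), mean)
```

With start="03-19", 18 March 2001 still belongs to the sampling year opened on 19 March 2000, which is the boundary behaviour the argument description refers to.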
See also
CARD_list_all() to list all available CARDs.
CARD_management() for managing CARD parameterization files.
CARD_extraction() for extracting variables using CARDs.
Examples
library(CARD)
# Get all the available variables
metaEX_all = CARD_list_all()
metaEX_all
#> # A tibble: 563 × 23
#> CARD_name variable_en unit_en name_en description_en method_en
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ETPA ETPA mm Cumula… "" ""
#> 2 BFI_Wal BFI_Wal withou… Basefl… "Ratio betwee… "1. no t…
#> 3 BFM BFM withou… Basefl… "" "1. no t…
#> 4 delta{BFI}_LH_H1 delta{BFI}_LH_H1 withou… Change… "Ratio betwee… "1. no t…
#> 5 delta{BFI}_LH_H2 delta{BFI}_LH_H2 withou… Change… "Ratio betwee… "1. no t…
#> 6 delta{BFI}_LH_H3 delta{BFI}_LH_H3 withou… Change… "Ratio betwee… "1. no t…
#> 7 delta{BFI}_Wal_H1 delta{BFI}_Wal_H1 withou… Change… "Ratio betwee… "1. no t…
#> 8 delta{BFI}_Wal_H2 delta{BFI}_Wal_H2 withou… Change… "Ratio betwee… "1. no t…
#> 9 delta{BFI}_Wal_H3 delta{BFI}_Wal_H3 withou… Change… "Ratio betwee… "1. no t…
#> 10 delta{centerBF}_H1 delta{centerBF}_… day Averag… "Date when 50… "1. annu…
#> # ℹ 553 more rows
#> # ℹ 17 more variables: sampling_period_en <chr>, topic_en <chr>,
#> # variable_fr <chr>, unit_fr <chr>, name_fr <chr>, description_fr <chr>,
#> # method_fr <chr>, sampling_period_fr <chr>, topic_fr <chr>,
#> # is_experimental <lgl>, input_vars <chr>, source <chr>,
#> # preferred_sampling_period <chr>, is_date <lgl>, to_normalise <lgl>,
#> # palette <chr>, script_path <chr>
# Create mock data
Start = as.Date("2001-03-02")
End = as.Date("2024-11-30")
Date = seq.Date(Start, End, by="day")
data = dplyr::tibble(time=Date,
Q=as.numeric(Date),
id="serie 1")
# Do a direct extraction
res = CARD_extraction(data, CARD_name=c("QA", "QMNA"), verbose=TRUE)
#> [1] "Computes QMNA"
#> [1] "Process 1/2"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Missing year "
#> [1] "│ └── Checking missing continuous periods "
#> [1] "│ longer than 10 years "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ ├── Default sample period used "
#> [1] "│ └── Fixing sample period "
#> [1] "│ ├── Only start of the sample period was "
#> [1] "│ │ given "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 01 / 30 "
#> [1] "├── Monthly extraction along years "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "│ ├── Computing of time indicators for "
#> [1] "│ │ each time serie "
#> [1] "│ ├── Get number of missing data for start "
#> [1] "│ │ and end "
#> [1] "│ └── Create each group "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ ├── Removing data if NA percentage is "
#> [1] "│ │ strictly above 3 % "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " ├── Keeping only the needed data : all "
#> [1] " └── Return data "
#> # A tibble: 8,675 × 4
#> id time Q QMA
#> <chr> <date> <dbl> <dbl>
#> 1 serie 1 2001-03-02 11383 NA
#> 2 serie 1 2001-03-03 11384 NA
#> 3 serie 1 2001-03-04 11385 NA
#> 4 serie 1 2001-03-05 11386 NA
#> 5 serie 1 2001-03-06 11387 NA
#> 6 serie 1 2001-03-07 11388 NA
#> 7 serie 1 2001-03-08 11389 NA
#> 8 serie 1 2001-03-09 11390 NA
#> 9 serie 1 2001-03-10 11391 NA
#> 10 serie 1 2001-03-11 11392 NA
#> # ℹ 8,665 more rows
#> [1] "Process 2/2"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ ├── Fixing sample period for each time series "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 11-01 / 10-31 "
#> [1] "├── Yearly extraction "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "│ ├── Computing of time indicators for "
#> [1] "│ │ each time serie "
#> [1] "│ ├── Get number of missing data for start "
#> [1] "│ │ and end "
#> [1] "│ └── Create each group "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ ├── Removing data if NA percentage is "
#> [1] "│ │ strictly above 3 % "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " └── Return data "
#> # A tibble: 25 × 3
#> id time QMNA
#> <chr> <date> <dbl>
#> 1 serie 1 2000-11-01 NA
#> 2 serie 1 2001-11-01 11642.
#> 3 serie 1 2002-11-01 12006.
#> 4 serie 1 2003-11-01 12372.
#> 5 serie 1 2004-11-01 12738.
#> 6 serie 1 2005-11-01 13102.
#> 7 serie 1 2006-11-01 13468.
#> 8 serie 1 2007-11-01 13832.
#> 9 serie 1 2008-11-01 14198.
#> 10 serie 1 2009-11-01 14564.
#> # ℹ 15 more rows
#> [1] "Computes QA"
#> [1] "Process 1/1"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Missing year "
#> [1] "│ └── Checking missing continuous periods "
#> [1] "│ longer than 10 years "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ └── Fixing sample period "
#> [1] "│ ├── Only start of the sample period was "
#> [1] "│ │ given "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 09-01 / 08-31 "
#> [1] "├── Yearly extraction "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "│ ├── Computing of time indicators for "
#> [1] "│ │ each time serie "
#> [1] "│ ├── Get number of missing data for start "
#> [1] "│ │ and end "
#> [1] "│ └── Create each group "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ ├── Removing data if NA percentage is "
#> [1] "│ │ strictly above 3 % "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " └── Return data "
#> # A tibble: 25 × 3
#> id time QA
#> <chr> <date> <dbl>
#> 1 serie 1 2000-09-01 NA
#> 2 serie 1 2001-09-01 11748
#> 3 serie 1 2002-09-01 12113
#> 4 serie 1 2003-09-01 12478.
#> 5 serie 1 2004-09-01 12844
#> 6 serie 1 2005-09-01 13209
#> 7 serie 1 2006-09-01 13574
#> 8 serie 1 2007-09-01 13940.
#> 9 serie 1 2008-09-01 14305
#> 10 serie 1 2009-09-01 14670
#> # ℹ 15 more rows
res
#> $metaEX
#> # A tibble: 2 × 21
#> variable_en unit_en name_en description_en method_en sampling_period_en
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 QMNA m^{3}.s^{-1} Annual m… "" "1. mont… Month of maximum …
#> 2 QA m^{3}.s^{-1} Annual m… "" "1. annu… 09-01, 08-31
#> # ℹ 15 more variables: topic_en <chr>, variable_fr <chr>, unit_fr <chr>,
#> # name_fr <chr>, description_fr <chr>, method_fr <chr>,
#> # sampling_period_fr <chr>, topic_fr <chr>, is_experimental <lgl>,
#> # input_vars <chr>, preferred_sampling_period <chr>, is_date <lgl>,
#> # to_normalise <lgl>, palette <chr>, script_path <chr>
#>
#> $dataEX
#> $dataEX$QMNA
#> # A tibble: 25 × 3
#> id time QMNA
#> <chr> <date> <dbl>
#> 1 serie 1 2000-11-01 NA
#> 2 serie 1 2001-11-01 11642.
#> 3 serie 1 2002-11-01 12006.
#> 4 serie 1 2003-11-01 12372.
#> 5 serie 1 2004-11-01 12738.
#> 6 serie 1 2005-11-01 13102.
#> 7 serie 1 2006-11-01 13468.
#> 8 serie 1 2007-11-01 13832.
#> 9 serie 1 2008-11-01 14198.
#> 10 serie 1 2009-11-01 14564.
#> # ℹ 15 more rows
#>
#> $dataEX$QA
#> # A tibble: 25 × 3
#> id time QA
#> <chr> <date> <dbl>
#> 1 serie 1 2000-09-01 NA
#> 2 serie 1 2001-09-01 11748
#> 3 serie 1 2002-09-01 12113
#> 4 serie 1 2003-09-01 12478.
#> 5 serie 1 2004-09-01 12844
#> 6 serie 1 2005-09-01 13209
#> 7 serie 1 2006-09-01 13574
#> 8 serie 1 2007-09-01 13940.
#> 9 serie 1 2008-09-01 14305
#> 10 serie 1 2009-09-01 14670
#> # ℹ 15 more rows
#>
#>
# Or find the closest CARD variable that interests you
CARD_management(CARD_name=c("VCN10-5"),
CARD_path="CARD-WIP",
overwrite=TRUE)
# Personalise it in the created `"CARD-WIP"` directory (for example change the return period)
# And perform a custom extraction
res = CARD_extraction(data, CARD_name=NULL,
CARD_path="CARD-WIP",
verbose=TRUE)
#> [1] "Computes VCN10-5"
#> [1] "Process 1/3"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Missing year "
#> [1] "│ └── Checking missing continuous periods "
#> [1] "│ longer than 10 years "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ ├── Default sample period used "
#> [1] "│ └── Fixing sample period "
#> [1] "│ ├── Only start of the sample period was "
#> [1] "│ │ given "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 01-01 / 12-31 "
#> [1] "├── None extraction "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " ├── Keeping only the needed data : all "
#> [1] " └── Return data "
#> # A tibble: 8,675 × 4
#> id time Q VC10
#> <chr> <date> <dbl> <dbl>
#> 1 serie 1 2001-03-02 11383 NA
#> 2 serie 1 2001-03-03 11384 NA
#> 3 serie 1 2001-03-04 11385 NA
#> 4 serie 1 2001-03-05 11386 NA
#> 5 serie 1 2001-03-06 11387 11388.
#> 6 serie 1 2001-03-07 11388 11388.
#> 7 serie 1 2001-03-08 11389 11390.
#> 8 serie 1 2001-03-09 11390 11390.
#> 9 serie 1 2001-03-10 11391 11392.
#> 10 serie 1 2001-03-11 11392 11392.
#> # ℹ 8,665 more rows
#> [1] "Process 2/3"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ ├── Fixing sample period for each time series "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 11-01 / 10-31 "
#> [1] "├── Yearly extraction "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "│ ├── Computing of time indicators for "
#> [1] "│ │ each time serie "
#> [1] "│ ├── Get number of missing data for start "
#> [1] "│ │ and end "
#> [1] "│ └── Create each group "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ ├── Removing data if NA percentage is "
#> [1] "│ │ strictly above 3 % "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " └── Return data "
#> # A tibble: 25 × 3
#> id time VCN10
#> <chr> <date> <dbl>
#> 1 serie 1 2000-11-01 NA
#> 2 serie 1 2001-11-01 11628.
#> 3 serie 1 2002-11-01 11992.
#> 4 serie 1 2003-11-01 12358.
#> 5 serie 1 2004-11-01 12724.
#> 6 serie 1 2005-11-01 13088.
#> 7 serie 1 2006-11-01 13454.
#> 8 serie 1 2007-11-01 13818.
#> 9 serie 1 2008-11-01 14184.
#> 10 serie 1 2009-11-01 14550.
#> # ℹ 15 more rows
#> [1] "Process 3/3"
#> [1] "EXTRACTION PROCESS "
#> [1] "├── Period "
#> [1] "│ └── Selecting all the data "
#> [1] "├── Sample period "
#> [1] "│ ├── Default sample period used "
#> [1] "│ └── Fixing sample period "
#> [1] "│ ├── Only start of the sample period was "
#> [1] "│ │ given "
#> [1] "│ ├── Every time series have the same "
#> [1] "│ │ sample period "
#> [1] "│ └── All : 01-01 / 12-31 "
#> [1] "├── None extraction "
#> [1] "│ └── Preparing date data for the extraction "
#> [1] "│ ├── Get general sample info "
#> [1] "├── Grouping data "
#> [1] "├── Application of the function "
#> [1] "├── Cleaning extracted tibble "
#> [1] "│ ├── Manage possible infinite values "
#> [1] "│ └── Recreate a date vector and add value for "
#> [1] "│ NApct computing "
#> [1] "├── NA management "
#> [1] "│ ├── Compute NA percentage "
#> [1] "│ └── Cleaning NA percentage info "
#> [1] "└── Last cleaning and formating for output "
#> [1] " ├── Rename column "
#> [1] " └── Return data "
#> # A tibble: 1 × 2
#> id `VCN10-5`
#> <chr> <dbl>
#> 1 serie 1 13495.
res$dataEX
#> $`VCN10-5`
#> # A tibble: 1 × 2
#> id `VCN10-5`
#> <chr> <dbl>
#> 1 serie 1 13495.
#>
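Before replacing the mock data with real series, the requirements listed for the data argument can be checked up front. The sketch below is one possible base R pre-flight check, not a CARD function: check_CARD_input is a hypothetical name, and the regular-spacing test assumes a fixed step in days, so it suits daily data rather than calendar months.

```r
# Hypothetical helper (not part of CARD): check a table against the
# requirements listed for the `data` argument.
check_CARD_input = function(data) {
    is_date = vapply(data, inherits, logical(1), what="Date")
    is_id = vapply(data, is.character, logical(1))
    is_value = vapply(data, function(x) is.numeric(x) || is.logical(x),
                      logical(1))
    regular = NA
    if (sum(is_date) == 1) {
        dates = data[[which(is_date)]]
        # All character columns together identify one time series
        key = if (any(is_id)) {
            do.call(paste, c(data[is_id], sep="\r"))
        } else {
            rep("", nrow(data))
        }
        # Within each series: no duplicated date and a single step in days
        regular = all(tapply(seq_along(dates), key, function(i) {
            d = sort(dates[i])
            anyDuplicated(d) == 0 && length(unique(diff(as.numeric(d)))) <= 1
        }))
    }
    c(one_date_column=sum(is_date) == 1,
      has_value_column=any(is_value),
      regular_unique_dates=regular)
}

# Mock data shaped like the example above (a plain data.frame is used
# here only to keep the sketch free of dependencies)
Date = seq.Date(as.Date("2001-03-02"), as.Date("2001-03-11"), by="day")
data = data.frame(time=Date, Q=as.numeric(Date), id="serie 1")
check_CARD_input(data)

# A duplicated date and a gap both fail the last check
check_CARD_input(data[c(1, 1, 2, 4), ])
```

The function returns a named logical vector rather than stopping, so the failing requirement can be read off directly before calling CARD_extraction().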