Consider the following vector which contains genome position strings,
pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222") # two possible matches.
To capture the first genome position in each string, we use the following syntax. The first argument is the subject character vector, and the other arguments are pasted together to make a capturing regular expression. Each named argument generates a capture group; the R argument name is used for the column name of the result.
(chr.dt <- nc::capture_first_vec(
  pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+"))
#> chrom chromStart
#> <char> <char>
#> 1: chr10 213,054,000
#> 2: chrM 111,000
#> 3: chr1 110
str(chr.dt)
#> Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
#> $ chrom : chr "chr10" "chrM" "chr1"
#> $ chromStart: chr "213,054,000" "111,000" "110"
#> - attr(*, ".internal.selfref")=<externalptr>
We can add type conversion functions on the same line as each named argument:
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(int.dt <- nc::capture_first_vec(
  pos.vec,
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits))
#> chrom chromStart
#> <char> <int>
#> 1: chr10 213054000
#> 2: chrM 111000
#> 3: chr1 110
str(int.dt)
#> Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
#> $ chrom : chr "chr10" "chrM" "chr1"
#> $ chromStart: int 213054000 111000 110
#> - attr(*, ".internal.selfref")=<externalptr>
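To see what the conversion function contributes on its own, here is a minimal sketch in base R (no nc call; the input strings are copied from the chromStart column of the first result above, and the variable names chromStart.chr, direct and converted are only for illustration). It compares a plain as.integer call with keep.digits, which strips every non-digit character first.

```r
# Same definition as in the text above.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))

# Captured text, as returned when no conversion function is given.
chromStart.chr <- c("213,054,000", "111,000", "110")

# as.integer alone cannot parse the commas, so those entries become NA
# (R also emits a coercion warning, silenced here).
direct <- suppressWarnings(as.integer(chromStart.chr))

# keep.digits removes the commas first, so every value converts.
converted <- keep.digits(chromStart.chr)

print(direct)
print(converted)
```

Any function that maps a character vector to a vector of the same length could be sketched the same way before passing it to nc.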
Below we use list variables to create patterns which are re-usable, and we use an un-named list to generate a non-capturing optional group:
pos.pattern <- list("[0-9,]+", keep.digits)
range.pattern <- list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")
nc::capture_first_vec(pos.vec, range.pattern)
#> chrom chromStart chromEnd
#> <char> <int> <int>
#> 1: chr10 213054000 213055000
#> 2: chrM 111000 NA
#> 3: chr1 110 111
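In summary, nc::capture_first_vec takes a variable number of arguments: the first is the subject character vector, and the others (character strings, functions, and lists) specify the pattern. All patterns are pasted together to form the final regex; each named pattern becomes a capture group whose name is the R argument name; each function converts the text captured by the preceding named pattern; and an un-named list becomes a non-capturing group.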
To see the generated regular expression pattern string, call nc::var_args_list with the variable number of arguments that specify the pattern:
nc::var_args_list(range.pattern)
#> $fun.list
#> $fun.list$chrom
#> function (x)
#> x
#> <bytecode: 0x000001fae1e3aed8>
#> <environment: namespace:base>
#>
#> $fun.list$chromStart
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x000001fae8838ed8>
#>
#> $fun.list$chromEnd
#> function(x)as.integer(gsub("[^0-9]", "", x))
#> <bytecode: 0x000001fae8838ed8>
#>
#>
#> $pattern
#> [1] "(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)"
The generated regex is the pattern element of the resulting list above. The other element, fun.list, indicates the names and type conversion functions to use with the capture groups.
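As a rough illustration of what the generated pattern string does outside of nc, the sketch below feeds it to base R's regexec with perl=TRUE (the pattern and subjects are copied from the text above; match.list is a name chosen for this example). This is not how nc is implemented, only a way to inspect the raw groups.

```r
# Subjects and pattern string copied from the text above.
pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")
pattern <- "(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)"

# regexec(perl=TRUE) reports the whole match plus one entry per capture group.
match.list <- regmatches(pos.vec, regexec(pattern, pos.vec, perl=TRUE))

# Element 1 of each result is the whole match; elements 2 to 4 are the groups
# that nc names chrom, chromStart and chromEnd.
for(m in match.list) print(m)
```

Unlike nc, this returns un-named character vectors and applies no type conversion, so the names and functions listed in fun.list would have to be applied by hand.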
The default is to stop with an error if any subject does not match:
bad.vec <- c(bad="does not match", pos.vec)
nc::capture_first_vec(bad.vec, range.pattern)
#> Error in stop_for_na(make.na): subject(s) 1 (1 total) did not match regex below; to output missing rows use nomatch.error=FALSE
#> (?:(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?))
Sometimes you want to instead report a row of NA when a subject does not match. In that case, use nomatch.error=FALSE:
nc::capture_first_vec(bad.vec, range.pattern, nomatch.error=FALSE)
#> chrom chromStart chromEnd
#> <char> <int> <int>
#> 1: <NA> NA NA
#> 2: chr10 213054000 213055000
#> 3: chrM 111000 NA
#> 4: chr1 110 111
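If you would rather find the offending subjects before deciding between an error and NA rows, one option is to test the generated pattern string with base R's grepl. The sketch below assumes the pattern string printed by nc::var_args_list earlier; is.match is a name introduced here.

```r
# Subjects and pattern string copied from the text above.
pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")
bad.vec <- c(bad="does not match", pos.vec)
pattern <- "(?:(chr.*?):([0-9,]+)(?:-([0-9,]+))?)"

# TRUE for each subject in which the pattern matches somewhere.
is.match <- grepl(pattern, bad.vec, perl=TRUE)

# Indices and values of the subjects that do not match.
print(which(!is.match))
print(bad.vec[!is.match])
```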
By default nc uses the PCRE regex engine. Other choices include ICU and RE2. Each engine has different features, which are discussed in my R journal paper.
The engine is configurable via the engine argument or the nc.engine option:
u.subject <- "a\U0001F60E#"
u.pattern <- list(emoji="\\p{EMOJI_Presentation}")
old.opt <- options(nc.engine="ICU")
nc::capture_first_vec(u.subject, u.pattern)
#> emoji
#> <char>
#> 1: 😎
nc::capture_first_vec(u.subject, u.pattern, engine="PCRE")
#> emoji
#> <char>
#> 1: 😎
nc::capture_first_vec(u.subject, u.pattern, engine="RE2")
#> re2google/re2/re2.cc:205: Error parsing '(?:(?:(\p{EMOJI_Presentation})))': invalid character class range: \p{EMOJI_Presentation}
#> Error in value[[3L]](cond): (?:(?:(\p{EMOJI_Presentation})))
#> when matching pattern above with RE2 engine, an error occured: invalid character class range: \p{EMOJI_Presentation}
options(old.opt)
For more details see the “engines” vignette,
vignette("v6-engines", package="nc")
We also provide nc::capture_first_df which extracts text from several columns of a data.frame, using a different regular expression for each column.
It requires a data.frame (or data.table) as the first argument. All other arguments must be named, and for each of them nc::capture_first_vec is called on one column of the input table. Each argument name specifies the column to use as the subject in nc::capture_first_vec. Each argument value specifies the pattern to use with nc::capture_first_vec, in list/character/function format as explained in the previous section. The return value is a data.table with the same number of rows as the input, but with an additional column for each named capture group. New columns have the same names as the capture groups, so make sure they are unique. The function takes a data.frame and outputs a data.table, so it can be used in a pipe. If the first argument is a data.table, it is modified to avoid copying the entire table.
This function can greatly simplify the code required to create numeric data columns from character data columns. For example, consider the following data which was output from the sacct program.
(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))
#> Elapsed JobID
#> 1 07:04:42 13937810_25
#> 2 07:04:42 13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00 14022192_[1-3]
#> 5 00:00:00 14022204_[4]
Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:
int.pattern <- list("[0-9]+", as.integer)
range.pattern <- list(
  "\\[",
  task1=int.pattern,
  list(
    "-",#begin optional end of range.
    taskN=int.pattern
  ), "?", #end is optional.
  "\\]")
nc::capture_first_df(sacct.df, JobID=range.pattern, nomatch.error=FALSE)
#> Elapsed JobID task1 taskN
#> <char> <char> <int> <int>
#> 1: 07:04:42 13937810_25 NA NA
#> 2: 07:04:42 13937810_25.batch NA NA
#> 3: 07:04:49 13937810_25.extern NA NA
#> 4: 00:00:00 14022192_[1-3] 1 3
#> 5: 00:00:00 14022204_[4] 4 NA
The result shown above is another data frame with an additional column for each capture group. Next, we define another pattern that matches either one task ID or the previously defined range pattern:
task.pattern <- list(
  "_",
  list(
    task=int.pattern,
    "|",#either one task(above) or range(below)
    range.pattern))
nc::capture_first_df(sacct.df, JobID=task.pattern)
#> Elapsed JobID task task1 taskN
#> <char> <char> <int> <int> <int>
#> 1: 07:04:42 13937810_25 25 NA NA
#> 2: 07:04:42 13937810_25.batch 25 NA NA
#> 3: 07:04:49 13937810_25.extern 25 NA NA
#> 4: 00:00:00 14022192_[1-3] NA 1 3
#> 5: 00:00:00 14022204_[4] NA 4 NA
Below we match the complete JobID column:
job.pattern <- list(
  job=int.pattern,
  task.pattern,
  list(
    "[.]",
    type=".*"
  ), "?")
nc::capture_first_df(sacct.df, JobID=job.pattern)
#> Elapsed JobID job task task1 taskN type
#> <char> <char> <int> <int> <int> <int> <char>
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
Below we match the Elapsed column with a different regex:
elapsed.pattern <- list(
  hours=int.pattern,
  ":",
  minutes=int.pattern,
  ":",
  seconds=int.pattern)
nc::capture_first_df(sacct.df, JobID=job.pattern, Elapsed=elapsed.pattern)
#> Elapsed JobID job task task1 taskN type hours minutes
#> <char> <char> <int> <int> <int> <int> <char> <int> <int>
#> 1: 07:04:42 13937810_25 13937810 25 NA NA 7 4
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch 7 4
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern 7 4
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3 0 0
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA 0 0
#> seconds
#> <int>
#> 1: 42
#> 2: 42
#> 3: 49
#> 4: 0
#> 5: 0
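With integer columns available, the filtering mentioned earlier (by total elapsed time and base job id) becomes plain arithmetic. The sketch below uses base R only; the values are copied from the job, hours, minutes and seconds columns printed above, while the data frame name parsed, the column total.seconds, and the 7 hour threshold are assumptions made for this example.

```r
# Values copied from the job, hours, minutes, seconds columns printed above.
parsed <- data.frame(
  job=c(13937810L, 13937810L, 13937810L, 14022192L, 14022204L),
  hours=c(7L, 7L, 7L, 0L, 0L),
  minutes=c(4L, 4L, 4L, 0L, 0L),
  seconds=c(42L, 42L, 49L, 0L, 0L))

# Total elapsed time in seconds, computed with integer arithmetic.
parsed$total.seconds <- with(parsed, hours*3600L + minutes*60L + seconds)

# Keep rows for one base job id that ran for more than 7 hours.
# The job id and threshold are arbitrary values chosen for illustration.
long.rows <- subset(parsed, job == 13937810L & total.seconds > 7*3600)
print(long.rows)
```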
Overall the result is another data table with an additional column for each capture group. Note in the code below that the input sacct.df was not modified, but if the input is a data.table then it is modified in place:
nc::capture_first_df(sacct.df, JobID=job.pattern)
#> Elapsed JobID job task task1 taskN type
#> <char> <char> <int> <int> <int> <int> <char>
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
sacct.df
#> Elapsed JobID
#> 1 07:04:42 13937810_25
#> 2 07:04:42 13937810_25.batch
#> 3 07:04:49 13937810_25.extern
#> 4 00:00:00 14022192_[1-3]
#> 5 00:00:00 14022204_[4]
(sacct.DT <- data.table::as.data.table(sacct.df))
#> Elapsed JobID
#> <char> <char>
#> 1: 07:04:42 13937810_25
#> 2: 07:04:42 13937810_25.batch
#> 3: 07:04:49 13937810_25.extern
#> 4: 00:00:00 14022192_[1-3]
#> 5: 00:00:00 14022204_[4]
nc::capture_first_df(sacct.DT, JobID=job.pattern)
#> Elapsed JobID job task task1 taskN type
#> <char> <char> <int> <int> <int> <int> <char>
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
sacct.DT
#> Elapsed JobID job task task1 taskN type
#> <char> <char> <int> <int> <int> <int> <char>
#> 1: 07:04:42 13937810_25 13937810 25 NA NA
#> 2: 07:04:42 13937810_25.batch 13937810 25 NA NA batch
#> 3: 07:04:49 13937810_25.extern 13937810 25 NA NA extern
#> 4: 00:00:00 14022192_[1-3] 14022192 NA 1 3
#> 5: 00:00:00 14022204_[4] 14022204 NA 4 NA
In this section we explain how to use various features to parse the fsa file names in PROVEDIt. Here are a few representative examples:
fsa.vec <- c(#control samples:
  "A01-Ladder-PP16-001.10sec.fsa",
  "D07_Ladder-IP_004.5sec.fsa",
  "A12_RB121514ADG_001.10sec.fsa",
  "A10_RB102191515LEA-IP_001.10sec.fsa",
  ##single-source samples:
  "A02-RD12-0002-35-0.5PP16-001.10sec.fsa",
  "G01_RD14-0003-35d3S30-0.01563P-Q10.0_003.10sec.fsa",
  "A06_RD14-0003-24d3a-0.0625IP-Q0.8_001.10sec.fsa",
  "A08-RD12-0002-01d-0.125PP16-001.10sec.fsa",
  "A10-RD12-0002-04d1-0.0625PP16-001.10sec.fsa",
  "C02_RD14-0003-15d2b-0.25IP-Q0.5_003.5sec.fsa",
  ##mixture samples:
  "A02-RD12-0002-1_2-1;9-0.125PP16-001.10sec.fsa",
  "H07_RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35-0.75IP-QLAND_004.5sec.fsa")
The goal is to build a regex that can convert this character vector to a data table with different columns for the different variables. The structure of the file names is explained in the supplementary materials PDF. We will build a complex regex in terms of simpler sub-patterns. First let’s just match the start of each file name, which has the row/column of the 96-well plate in which the sample was tested:
well.pattern <- list(
  "^",
  well.letter="[A-H]",
  well.number="[0-9]+", as.integer)
nc::capture_first_vec(fsa.vec, well.pattern)
#> well.letter well.number
#> <char> <int>
#> 1: A 1
#> 2: D 7
#> 3: A 12
#> 4: A 10
#> 5: A 2
#> 6: G 1
#> 7: A 6
#> 8: A 8
#> 9: A 10
#> 10: C 2
#> 11: A 2
#> 12: H 7
Now, let’s match the end of each file name:
end.pattern <- list(
  "[-_]",
  capillary="[0-9]+", as.integer,
  "[.]",
  seconds="[0-9]+", as.integer,
  "sec[.]fsa$")
nc::capture_first_vec(fsa.vec, end.pattern)
#> capillary seconds
#> <int> <int>
#> 1: 1 10
#> 2: 4 5
#> 3: 1 10
#> 4: 1 10
#> 5: 1 10
#> 6: 3 10
#> 7: 1 10
#> 8: 1 10
#> 9: 1 10
#> 10: 3 5
#> 11: 1 10
#> 12: 4 5
Now, let’s take a look at what’s in between:
between.dt <- nc::capture_first_vec(
  fsa.vec, well.pattern, between=".*?", end.pattern)
between.dt$between
#> [1] "-Ladder-PP16"
#> [2] "_Ladder-IP"
#> [3] "_RB121514ADG"
#> [4] "_RB102191515LEA-IP"
#> [5] "-RD12-0002-35-0.5PP16"
#> [6] "_RD14-0003-35d3S30-0.01563P-Q10.0"
#> [7] "_RD14-0003-24d3a-0.0625IP-Q0.8"
#> [8] "-RD12-0002-01d-0.125PP16"
#> [9] "-RD12-0002-04d1-0.0625PP16"
#> [10] "_RD14-0003-15d2b-0.25IP-Q0.5"
#> [11] "-RD12-0002-1_2-1;9-0.125PP16"
#> [12] "_RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35-0.75IP-QLAND"
Notice that PP16/P/IP are the kit types. There may optionally be a DNA template mass before, and optionally a Q score after. Let’s match those:
mass.pattern <- list(
  template.nanograms="[0-9.]*", as.numeric)
kit.pattern <- nc::quantifier(
  kit=nc::alternatives("PP16", "IP", "P"),
  "?")
q.pattern <- nc::quantifier(
  "-Q",
  Q.chr=nc::alternatives("LAND", "[0-9.]+"),
  "?")
old.opt <- options(width=100)
(before.dt <- nc::capture_first_vec(
  between.dt$between,
  "^",
  before=".*?",
  mass.pattern, kit.pattern, q.pattern, "$"))
#> before template.nanograms kit Q.chr
#> <char> <num> <char> <char>
#> 1: -Ladder- NA PP16
#> 2: _Ladder- NA IP
#> 3: _RB121514ADG NA
#> 4: _RB102191515LEA- NA IP
#> 5: -RD12-0002-35- 0.50000 PP16
#> 6: _RD14-0003-35d3S30- 0.01563 P 10.0
#> 7: _RD14-0003-24d3a- 0.06250 IP 0.8
#> 8: -RD12-0002-01d- 0.12500 PP16
#> 9: -RD12-0002-04d1- 0.06250 PP16
#> 10: _RD14-0003-15d2b- 0.25000 IP 0.5
#> 11: -RD12-0002-1_2-1;9- 0.12500 PP16
#> 12: _RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35- 0.75000 IP LAND
options(old.opt)
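The kit alternatives above are listed with the longest string first. The base R sketch below (not nc code; first.match and the example subjects are hypothetical) shows why that ordering is a reasonable habit: a Perl-style alternation takes the first alternative that matches at a given position, so putting P first would capture only the leading P of PP16 when the alternation is used on its own.

```r
# Hypothetical subjects ending in each of the three kit names.
kit.subjects <- c("-0.5PP16", "-0.0625IP", "-0.01563P")

# Helper (name chosen for this sketch): text of the first match per subject.
first.match <- function(pattern, subject){
  regmatches(subject, regexpr(pattern, subject, perl=TRUE))
}

# Longest alternative first, as in the text above.
longest.first <- first.match("PP16|IP|P", kit.subjects)

# Shortest alternative first: "PP16" is cut down to "P".
shortest.first <- first.match("P|IP|PP16", kit.subjects)

print(longest.first)
print(shortest.first)
```

Inside a larger anchored pattern, the sub-patterns that follow can force the engine to backtrack to a longer alternative, so the ordering is a defensive choice rather than a strict requirement.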
Now to match the before column, we need some alternatives. The first four rows are controls, and the other rows are samples which are indicated by a project ID (prefix RD) and then a sample ID. The mixture samples at the bottom contain semicolon-delimited mixture proportions.
project.pattern <- list(project="RD[0-9]+-[0-9]+", "-")
single.pattern <- list(id="[0-9]+", as.integer)
mixture.pattern <- list(
  ids.chr="[0-9_]+",
  "-",
  parts.chr="[0-9;]+",
  "-")
single.or.mixture <- list(
  project.pattern,
  nc::alternatives(mixture.pattern, single.pattern))
control.pattern <- list(control="[^-_]+")
single.mixture.control <- list("[-_]", nc::alternatives(
  single.or.mixture, control.pattern))
(rest.dt <- nc::capture_first_vec(
  before.dt$before, single.mixture.control, rest=".*"))
#> project ids.chr parts.chr id control rest
#> <char> <char> <char> <int> <char> <char>
#> 1: NA Ladder -
#> 2: NA Ladder -
#> 3: NA RB121514ADG
#> 4: NA RB102191515LEA -
#> 5: RD12-0002 35 -
#> 6: RD14-0003 35 d3S30-
#> 7: RD14-0003 24 d3a-
#> 8: RD12-0002 1 d-
#> 9: RD12-0002 4 d1-
#> 10: RD14-0003 15 d2b-
#> 11: RD12-0002 1_2 1;9 NA
#> 12: RD14-0003 35_36_37_38_39 1;4;4;4;1 NA M2I35-
The rest may contain some information related to the dilution and treatment, which we can parse as follows:
dilution.pattern <- nc::quantifier(
  "[dM]",
  dilution.number="[0-9]?", as.integer,
  "?")
treatment.pattern <- nc::quantifier(
  treatment.letter="[a-zA-Z]",
  nc::quantifier(treatment.number="[0-9]+", as.integer, "?"),
  "?")
dilution.treatment <- list(
  dilution.pattern,
  treatment.pattern,
  "[_-]?")
nc::capture_first_vec(rest.dt$rest, "^", dilution.treatment, "$")
#> dilution.number treatment.letter treatment.number
#> <int> <char> <int>
#> 1: NA NA
#> 2: NA NA
#> 3: NA NA
#> 4: NA NA
#> 5: NA NA
#> 6: 3 S 30
#> 7: 3 a NA
#> 8: NA NA
#> 9: 1 NA
#> 10: 2 b NA
#> 11: NA NA
#> 12: 2 I 35
Now we just have to put all of those patterns together to get a full match:
fsa.pattern <- list(
  well.pattern,
  single.mixture.control,
  dilution.treatment,
  mass.pattern, kit.pattern, q.pattern,
  end.pattern)
(match.dt <- nc::capture_first_vec(fsa.vec, fsa.pattern))
#> well.letter well.number project ids.chr parts.chr id
#> <char> <int> <char> <char> <char> <int>
#> 1: A 1 NA
#> 2: D 7 NA
#> 3: A 12 NA
#> 4: A 10 NA
#> 5: A 2 RD12-0002 35
#> 6: G 1 RD14-0003 35
#> 7: A 6 RD14-0003 24
#> 8: A 8 RD12-0002 1
#> 9: A 10 RD12-0002 4
#> 10: C 2 RD14-0003 15
#> 11: A 2 RD12-0002 1_2 1;9 NA
#> 12: H 7 RD14-0003 35_36_37_38_39 1;4;4;4;1 NA
#> control dilution.number treatment.letter treatment.number
#> <char> <int> <char> <int>
#> 1: Ladder NA NA
#> 2: Ladder NA NA
#> 3: RB121514ADG NA NA
#> 4: RB102191515LEA NA NA
#> 5: NA NA
#> 6: 3 S 30
#> 7: 3 a NA
#> 8: NA NA
#> 9: 1 NA
#> 10: 2 b NA
#> 11: NA NA
#> 12: 2 I 35
#> template.nanograms kit Q.chr capillary seconds
#> <num> <char> <char> <int> <int>
#> 1: NA PP16 1 10
#> 2: NA IP 4 5
#> 3: NA 1 10
#> 4: NA IP 1 10
#> 5: 0.50000 PP16 1 10
#> 6: 0.01563 P 10.0 3 10
#> 7: 0.06250 IP 0.8 1 10
#> 8: 0.12500 PP16 1 10
#> 9: 0.06250 PP16 1 10
#> 10: 0.25000 IP 0.5 3 5
#> 11: 0.12500 PP16 1 10
#> 12: 0.75000 IP LAND 4 5
And to verify we parsed everything correctly we print the subjects next to each row:
disp.dt <- data.table::data.table(match.dt)
names(disp.dt) <- paste(1:ncol(disp.dt))
old.opt <- options("datatable.print.colnames"="none")
split(disp.dt, fsa.vec)
#> $`A01-Ladder-PP16-001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 1 NA Ladder NA NA NA PP16 1 10
#>
#> $`A02-RD12-0002-1_2-1;9-0.125PP16-001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 2 RD12-0002 1_2 1;9 NA NA NA 0.125 PP16 1 10
#>
#> $`A02-RD12-0002-35-0.5PP16-001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 2 RD12-0002 35 NA NA 0.5 PP16 1 10
#>
#> $`A06_RD14-0003-24d3a-0.0625IP-Q0.8_001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 6 RD14-0003 24 3 a NA 0.0625 IP 0.8 1 10
#>
#> $`A08-RD12-0002-01d-0.125PP16-001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 8 RD12-0002 1 NA NA 0.125 PP16 1 10
#>
#> $`A10-RD12-0002-04d1-0.0625PP16-001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 10 RD12-0002 4 1 NA 0.0625 PP16 1 10
#>
#> $`A10_RB102191515LEA-IP_001.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 10 NA RB102191515LEA NA NA NA IP 1 10
#>
#> $A12_RB121514ADG_001.10sec.fsa
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: A 12 NA RB121514ADG NA NA NA 1 10
#>
#> $`C02_RD14-0003-15d2b-0.25IP-Q0.5_003.5sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: C 2 RD14-0003 15 2 b NA 0.25 IP 0.5 3 5
#>
#> $`D07_Ladder-IP_004.5sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: D 7 NA Ladder NA NA NA IP 4 5
#>
#> $`G01_RD14-0003-35d3S30-0.01563P-Q10.0_003.10sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: G 1 RD14-0003 35 3 S 30 0.01563 P 10.0 3 10
#>
#> $`H07_RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35-0.75IP-QLAND_004.5sec.fsa`
#> Warning: Column classes will be suppressed when col.names is 'none'
#> 1: H 7 RD14-0003 35_36_37_38_39 1;4;4;4;1 NA 2 I 35 0.75 IP LAND 4 5
options(old.opt)
Incidentally, if you print the regex as a string it looks like something that would be essentially impossible to write/maintain by hand (without nc helper functions):
nc::var_args_list(fsa.pattern)$pattern
#> [1] "(?:(?:^([A-H])([0-9]+))(?:[-_](?:(?:(?:(RD[0-9]+-[0-9]+)-)(?:(?:([0-9_]+)-([0-9;]+)-)|(?:([0-9]+))))|(?:([^-_]+))))(?:(?:(?:[dM]([0-9]?))?)(?:(?:([a-zA-Z])(?:(?:([0-9]+))?))?)[_-]?)(?:([0-9.]*))(?:(?:(PP16|IP|P))?)(?:(?:-Q(LAND|[0-9.]+))?)(?:[-_]([0-9]+)[.]([0-9]+)sec[.]fsa$))"
In conclusion, we have shown how nc can help build complex regex patterns from simple, understandable sub-patterns in R code. The general workflow we have followed in this section can be generalized to other projects. First, identify a small set of subjects to use for testing (fsa.vec in the code above). Second, create some sub-patterns for the start/end of the subjects, and create a sub-pattern that will capture everything else in between. Third, create some more sub-patterns to match the “everything else” in the previous step (e.g. before.dt$before, between.dt$between, rest.dt$rest above). Repeat the process until there is nothing left to match, and then concatenate the sub-patterns to match the whole subject.
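The first step of this workflow, keeping a small set of test subjects, can also serve as a quick regression check once the full pattern exists. The sketch below is one way to do that in base R, assuming the pattern string printed above (split here at the sub-pattern boundaries) and three of the fsa.vec subjects; the names fsa.regex, test.subjects, is.match and match.list are introduced only for this example.

```r
# Pattern string copied from the output above, split at sub-pattern boundaries.
fsa.regex <- paste0(
  "(?:(?:^([A-H])([0-9]+))",                                  # well
  "(?:[-_](?:(?:(?:(RD[0-9]+-[0-9]+)-)(?:(?:([0-9_]+)-([0-9;]+)-)|(?:([0-9]+))))|(?:([^-_]+))))",
  "(?:(?:(?:[dM]([0-9]?))?)(?:(?:([a-zA-Z])(?:(?:([0-9]+))?))?)[_-]?)", # dilution/treatment
  "(?:([0-9.]*))",                                            # template mass
  "(?:(?:(PP16|IP|P))?)",                                     # kit
  "(?:(?:-Q(LAND|[0-9.]+))?)",                                # Q score
  "(?:[-_]([0-9]+)[.]([0-9]+)sec[.]fsa$))")                   # end

# One control, one single-source, and one mixture sample from fsa.vec.
test.subjects <- c(
  "A01-Ladder-PP16-001.10sec.fsa",
  "G01_RD14-0003-35d3S30-0.01563P-Q10.0_003.10sec.fsa",
  "H07_RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35-0.75IP-QLAND_004.5sec.fsa")

# Every test subject should match the full pattern.
is.match <- grepl(fsa.regex, test.subjects, perl=TRUE)
stopifnot(all(is.match))

# Whole match plus 15 capture groups per subject.
match.list <- regmatches(
  test.subjects, regexec(fsa.regex, test.subjects, perl=TRUE))
print(lengths(match.list))
```

In practice it is probably simpler to rerun nc::capture_first_vec on the test subjects, since the default nomatch.error behaviour already stops on any subject that fails to match.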