Using clean_strings


clean_strings prepares strings for name matching, either on its own or within tier_match (see the Using-tier-match vignette). It has several options that control how the strings are cleaned.

Here are the example strings we’ll be using:

name_vec <- corp_data1[, Company]
name_vec
#>  [1] "Walmart"            "Bershire Hataway"   "Apple"             
#>  [4] "Exxon Mobile"       "McKesson "          "UnitedHealth Group"
#>  [7] "CVS Health"         "General Motors"     "AT&T"              
#> [10] "Ford Motor Company"

First, we can use the basic string cleaning defaults:

clean_strings(name_vec)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "general motors"     "atandt"            
#> [10] "ford motor company"

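Without any additional arguments, clean_strings does at least the following, as the output above shows: it lowercases each string, replaces special characters with words (‘&’ becomes ‘and’), and trims surrounding whitespace (‘McKesson ’ becomes ‘mckesson’).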
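To make those defaults concrete, here is a rough base-R approximation. It is a sketch inferred from the example output, not fedmatch's implementation: `basic_clean` is a hypothetical helper, and the handling of other punctuation is an assumption.

```r
# Hypothetical helper: approximates the default cleaning seen in the output.
# This is NOT fedmatch's code; it only mimics the visible behavior.
basic_clean <- function(x) {
  x <- tolower(x)                          # lower-case everything
  x <- gsub("&", "and", x, fixed = TRUE)   # spell out one special character
  x <- gsub("[[:punct:]]", " ", x)         # assumption: other punctuation becomes a space
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)                                # drop leading/trailing whitespace
}

basic_clean(c("McKesson ", "AT&T", "Ford Motor Company"))
#> [1] "mckesson"           "atandt"             "ford motor company"
```

The order matters in this sketch: the special character is spelled out before punctuation is stripped, otherwise "&" would already be gone.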

Then, we have a few different options we can use.


sp_char_words is a data.frame with 2 columns: the first column holds the symbols to replace, and the second holds their replacements. fedmatch has a built-in set of symbols:

#>    character replacement
#> 1:       \\&         and
#> 2:       \\$      dollar
#> 3:       \\%     percent
#> 4:       \\@          at

You can also supply any data.frame with the same two-column layout to make whatever replacements you’d like:

new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#>  [1] "walmart"                            "bershire hataway"                  
#>  [3] "apple"                              "exxapplen mapplebile"              
#>  [5] "mckessapplen"                       "unitedhealth grappleup"            
#>  [7] "cvs health"                         "general mappletapplers"            
#>  [9] "at t"                               "fapplerd mappletappler capplempany"
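The output above shows that these replacements apply everywhere a symbol occurs, with no regard for word boundaries. A minimal base-R sketch of that table-driven idea follows; `replace_chars` is a hypothetical helper, and matching the symbols as literal text is an assumption of the sketch rather than a description of fedmatch's internals.

```r
# Hypothetical helper: apply each row of a two-column replacement table in turn.
# Symbols are matched as literal text (fixed = TRUE); this is an assumption
# of the sketch, not how fedmatch necessarily does it.
replace_chars <- function(x, table) {
  for (i in seq_len(nrow(table))) {
    x <- gsub(table$character[i], table$replacement[i], x, fixed = TRUE)
  }
  x
}

sp_chars <- data.frame(character = c("&", "%"),
                       replacement = c("and", "percent"),
                       stringsAsFactors = FALSE)
replace_chars(c("at&t", "100% cotton"), sp_chars)
#> [1] "atandt"            "100percent cotton"

# Every occurrence is replaced, even inside words:
replace_chars("ford motor",
              data.frame(character = "o", replacement = "apple",
                         stringsAsFactors = FALSE))
#> [1] "fapplerd mappletappler"
```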


common_words is similar, but it respects word boundaries, so a replacement such as ‘Corp’ to ‘Corporation’ applies only to the standalone word and not to every string that merely contains ‘Corp’. fedmatch has a built-in set of 54 words and their replacements:

#>     abbr     long.names
#> 1: accep     acceptance
#> 2:  amer        america
#> 3: assoc     associates
#> 4:    cl company listed
#> 5: cmnty      community

But, you can use whatever words you’d like:

clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
                                                              replacement = c("bananas", "oranges")))
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "bananas motors"     "atandt"            
#> [10] "ford motor company"

(Bananas motors sounds like a lovely place to work.) Note that the ‘almart’ in ‘walmart’ didn’t get replaced, because common_words respects word boundaries.
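Word-boundary matching of this kind can be illustrated in base R with the `\b` regex anchor. This is only a sketch of the general technique: `replace_words` is a hypothetical helper, and it assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: replace whole words only, using \b word-boundary anchors.
# Assumes the entries in `words` contain no regex metacharacters.
replace_words <- function(x, words, replacements) {
  for (i in seq_along(words)) {
    pattern <- paste0("\\b", words[i], "\\b")
    x <- gsub(pattern, replacements[i], x, perl = TRUE)
  }
  x
}

replace_words(c("walmart", "general motors"),
              words = c("general", "almart"),
              replacements = c("bananas", "oranges"))
#> [1] "walmart"        "bananas motors"
```

There is no boundary between the "w" and the "a" in "walmart", so the pattern `\balmart\b` never matches it.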

You can also use a related function, word_frequency, to find the most common words in your data, which can help you decide what to put in common_words:

word_frequency(sample(c("hi", "Hello", "bye    "), 1e4, replace = TRUE))
#>     Word Count
#> 1: hello  3376
#> 2:   bye  3323
#> 3:    hi  3301
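The output above suggests the words are lowercased and stripped of extra whitespace before counting. A comparable count can be sketched in base R; `count_words` is a hypothetical helper and not the package's implementation. A fixed input is used here so the counts are reproducible.

```r
# Hypothetical helper: count words after lower-casing and trimming whitespace.
count_words <- function(x) {
  words <- unlist(strsplit(trimws(tolower(x)), "[[:space:]]+"))
  words <- words[nzchar(words)]                    # drop empty tokens
  counts <- sort(table(words), decreasing = TRUE)  # most frequent first
  data.frame(Word = names(counts), Count = as.integer(counts),
             stringsAsFactors = FALSE)
}

count_words(c("hi", "Hello", "bye    ", "hello", "Hello", "bye"))
#>    Word Count
#> 1 hello     3
#> 2   bye     2
#> 3    hi     1
```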

Remove characters and words

remove_words is a boolean that removes the words in ‘common_words’ rather than replacing them. remove_char takes a vector of characters to remove; in the first example below, each removed character leaves a space behind.

clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#>  [1] "w lm rt"                           "bershire h t w y"                 
#>  [3] "pple"                              "exxapplen mapplebile"             
#>  [5] "m kessapplen"                      "unitedhe lth grappleup"           
#>  [7] "vs he lth"                         "gener l mappletapplers"           
#>  [9] "t t"                               "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
                                                              replacement = c("bananas", "oranges")),
              remove_words = TRUE)
#>  [1] "walmart"            "bershire hataway"   "apple"             
#>  [4] "exxon mobile"       "mckesson"           "unitedhealth group"
#>  [7] "cvs health"         "motors"             "atandt"            
#> [10] "ford motor"
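Removing whole words leaves stray spaces behind, so a removal step usually needs a whitespace cleanup afterwards. The sketch below shows one way to do both in base R; `drop_words` is a hypothetical helper that assumes the words contain no regex metacharacters, and it is not fedmatch's code.

```r
# Hypothetical helper: delete whole words, then tidy the leftover whitespace.
# Assumes the entries in `words` contain no regex metacharacters.
drop_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)   # remove standalone words only
  trimws(gsub("[[:space:]]+", " ", x))     # collapse and trim whitespace
}

drop_words(c("general motors", "ford motor company", "generally"),
           words = c("general", "company"))
#> [1] "motors"     "ford motor" "generally"
```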


stem is a boolean that lets you stem words using SnowballC::wordStem. ‘Stemming’ a word means reducing it to its stem by removing common suffixes:

clean_strings(c("call", "calling", "called"), stem = TRUE)
#> [1] "call" "call" "call"

See the documentation in SnowballC::wordStem for details.