R string removes punctuation on split


You can switch on PCRE with perl=TRUE and use a lookbehind assertion, so that the split happens only on whitespace that follows sentence-ending punctuation.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
  [^!?.]      any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
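For a self-contained run, here is a minimal sketch. The value of x is an assumption: it is not given in this answer, so it is reconstructed from the sentences that appear in the outputs elsewhere in this thread.

```r
# Assumed input: rebuilt from the sentences shown in the outputs in this thread.
x <- "The world is at end. What do you think? I am going crazy! These people are too calm."

# Split on runs of whitespace, but only where the preceding character
# is '!', '?' or '.' (the negative lookbehind rejects any other character).
parts <- strsplit(x, "(?<![^!?.])\\s+", perl = TRUE)[[1]]
print(parts)
```

Because the lookbehind itself consumes no characters, only the whitespace is removed and the punctuation stays attached to each sentence.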


The sentSplit function in the qdap package was created just for this task:

library(qdap)
sentSplit(data.frame(text = x), "text")
##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.


Take a look at this question. Character classes like [:space:] are only defined inside bracket expressions, so you need to enclose them in a second set of brackets, as in [[:space:]]. Try:

vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"
# [3] "I am going crazy"          "These people are too calm"

This gets rid of the leading spaces. To keep punctuation, use a positive lookbehind assertion with perl = TRUE:

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"
# [3] "I am going crazy!"          "These people are too calm."