R string removes punctuation on split
You can switch on PCRE by using perl=TRUE and use a lookbehind assertion.
strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)
Regular expression:
(?<!        look behind to see if there is not:
  [^!?.]    any character except: '!', '?', '.'
)           end of look-behind
\s+         whitespace (\n, \r, \t, \f, and " ") (1 or more times)
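As a minimal sketch of the negative lookbehind split: the input string x below is an assumption, reconstructed from the sentences shown in the answers, because the post does not show how x was defined. The extra spaces between sentences are also assumed, to show that \s+ consumes a whole run of whitespace.

```r
# Assumed sample input; the original post does not show the definition of x.
x <- "The world is at end. What do you think?   I am going crazy!    These people are too calm."

# Split only on whitespace that directly follows '!', '?' or '.'.
# The negative lookbehind rejects whitespace preceded by any other character,
# so the spaces between ordinary words are left alone.
sentences <- strsplit(x, "(?<![^!?.])\\s+", perl = TRUE)[[1]]

print(sentences)
print(length(sentences))
```

Because the lookbehind consumes no characters, the punctuation stays attached to each sentence and only the whitespace is removed.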
The sentSplit function in the qdap package was created just for this task:
library(qdap)
sentSplit(data.frame(text = x), "text")
##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.
Take a look at this question. Character classes like [:space:] are only valid inside a bracket expression, so you need to enclose them in another set of brackets: [[:space:]]. Try:
vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"
# [3] "I am going crazy"          "These people are too calm"
This gets rid of the leading spaces. To keep punctuation, use a positive lookbehind assertion with perl = TRUE:
vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"
# [3] "I am going crazy!"          "These people are too calm."
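The two patterns can be wrapped in one small function so the choice between dropping and keeping punctuation is explicit. This is only a sketch: split_sentences and its keep_punct argument are hypothetical names, not part of base R or qdap, and the sample x is assumed because the post does not show its definition.

```r
# Hypothetical helper (not from the original answers): split text into
# sentences, optionally keeping the terminating '!', '?' or '.'.
split_sentences <- function(text, keep_punct = TRUE) {
  if (keep_punct) {
    # The positive lookbehind matches just after the punctuation without
    # consuming it; lookbehind requires perl = TRUE.
    strsplit(text, "(?<=[!?.])[[:space:]]*", perl = TRUE)[[1]]
  } else {
    # The punctuation is part of the delimiter here, so it is removed.
    strsplit(text, "[!?.][[:space:]]*")[[1]]
  }
}

# Assumed sample input.
x <- "The world is at end. What do you think?   I am going crazy!    These people are too calm."

kept    <- split_sentences(x)
dropped <- split_sentences(x, keep_punct = FALSE)

print(kept)
print(dropped)

# Each kept sentence is one character (its terminator) longer.
print(nchar(kept) - nchar(dropped))
```

Note that both patterns split at every '!', '?' or '.', so abbreviations such as "e.g." would also be treated as sentence ends.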