bash - Remove all Unicode Spaces and replace with Normal Space
Easy using perl:
perl -CSDA -plE 's/\s/ /g' file
but as @mklement0 correctly said in a comment, it will match the \t
(TAB) too. If this is a problem, you could use
perl -CSDA -plE 's/[^\S\t]/ /g'
Demo:
X X
the above containing:
U+00058 X LATIN CAPITAL LETTER X
U+01680 OGHAM SPACE MARK
U+02002 EN SPACE
U+02003 EM SPACE
U+02004 THREE-PER-EM SPACE
U+02005 FOUR-PER-EM SPACE
U+02006 SIX-PER-EM SPACE
U+02007 FIGURE SPACE
U+02008 PUNCTUATION SPACE
U+02009 THIN SPACE
U+0200A HAIR SPACE
U+0202F NARROW NO-BREAK SPACE
U+0205F MEDIUM MATHEMATICAL SPACE
U+03000 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X
using:
perl -CSDA -plE 's/\s/_/g' <<<"X X"
This prints the following (note that the demo replaces with an underscore instead of a space):
X_____________X
This is also doable using pure bash:
LC_ALL=en_US.UTF-8
spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
while read -r line; do
    echo "${line//[$spaces]/ }"
done
The LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be if you are working with UTF-8 texts) :)
Demo:
str="X X"
echo "${str//[$spaces]/_}"
prints again:
X_____________X
The same using sed: prepare the variable $spaces as above and use:
sed "s/[$spaces]/ /g" file
Edit, because of some strange copy/paste (or locale) problems:
xxd -ps <<<"$spaces"
shows
c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e28087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a
Computing the md5 digest with two different programs
md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"
prints the same md5:
35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
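If the \U escapes themselves are the suspect (they depend on the shell's printf and on the active locale), one way to rule them out is to build the variable from raw UTF-8 bytes and compare the hex dump with the one shown above. This is a sketch, not part of the original answer; it uses od instead of xxd only because od is specified by POSIX, and the variable name hex is made up. Using the resulting $spaces inside a bracket expression still requires a UTF-8 locale.

```shell
# The same 19 characters as raw UTF-8 bytes (octal escapes), in the same order
# as the \U list; neither the locale nor \U support is involved here.
spaces=$(printf '\302\240\341\232\200\341\240\216\342\200\200\342\200\201\342\200\202\342\200\203\342\200\204\342\200\205\342\200\206\342\200\207\342\200\210\342\200\211\342\200\212\342\200\213\342\200\257\342\201\237\343\200\200\357\273\277')

# Hex dump including the trailing newline, to match what a here-string feeds to xxd.
hex=$(printf '%s\n' "$spaces" | od -An -tx1 | tr -d ' \n')
printf '%s\n' "$hex"
```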
It is possible to identify the characters by their Unicode code points; sed 's/[[:space:]]\+/\ /g' won't do the trick, unfortunately.
By reworking another SO answer, we list all the code points, save them in a variable, and then use sed for the replacement (note that with -i.bak we also keep a copy of the original file):
CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt
If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:
nws --ascii file     # convert non-ASCII whitespace and punctuation to ASCII
nws --ascii -i file  # update file in place
The --ascii mode of nws:
transliterates (non-ASCII) Unicode whitespace (such as a no-break space) and punctuation (such as typographic quotes (“”), en dash (–), ...) to their closest ASCII equivalent, while leaving any other Unicode characters alone.
This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.
Installation of nws from the npm registry (Linux and macOS)
Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash
With Node.js installed, install as follows:
[sudo] npm install nws-cli -g
Note:
- Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
- The -g ensures global installation and is needed to put nws-cli in your system's $PATH.
Manual installation (any Unix platform with bash)
- Download this bash script as nws.
- Make it executable with chmod +x nws.
- Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).
Optional reading: POSIX character classes [:space:] and [:blank:] and non-ASCII Unicode whitespace
In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.
This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.
There are two pitfalls:
- Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current.
  - For instance, on Ubuntu 16.04 the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]: no-break space, figure space, narrow no-break space, next line
- The utilities should use the active locale's charmap, but there are regrettable exceptions; the following utilities are NOT Unicode-aware (there may be more):
  - Among GNU utilities (as of coreutils v8.27): cut, tr
  - Mawk, the awk implementation that is the default on Ubuntu, for instance.
  - Among BSD/macOS utilities (as of macOS 10.12): awk
Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too:
sed 's/[[:space:]]/ /g' file
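Whether a given platform is affected by the first pitfall can be checked empirically. The following is a rough sketch: probe is a hypothetical helper name, and the characters tested are the ones named above as misclassified on Ubuntu 16.04, plus an ASCII space as a control. The result depends on the sed implementation and on the active locale, so no particular output is implied for the non-ASCII characters.

```shell
# probe LABEL BYTES: report whether the local sed matches the character given
# as printf octal escapes with [[:space:]] in the current locale.
# (probe is a hypothetical helper name, not part of any tool mentioned here.)
probe() {
  ch=$(printf "$2")
  if [ "$(printf 'a%sb\n' "$ch" | sed 's/[[:space:]]/_/g')" = "a_b" ]; then
    printf '%s: matched\n' "$1"
  else
    printf '%s: NOT matched\n' "$1"
  fi
}

probe 'U+0020 space'                 ' '
probe 'U+00A0 no-break space'        '\302\240'
probe 'U+2007 figure space'          '\342\200\207'
probe 'U+202F narrow no-break space' '\342\200\257'
probe 'U+0085 next line'             '\302\205'
```

If a line reports NOT matched in a UTF-8 locale, an explicit character list as in the other answers is the safer route on that platform.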