bash - Remove all Unicode Spaces and replace with Normal Space

Easy using perl:

perl -CSDA -plE 's/\s/ /g' file

But as @mklement0 correctly pointed out in a comment, this also matches \t (TAB). If that is a problem, you could use

perl -CSDA -plE 's/[^\S\t]/ /g'

Demo:

X            　X

the above containing:

U+00058 X LATIN CAPITAL LETTER X
U+01680   OGHAM SPACE MARK
U+02002   EN SPACE
U+02003   EM SPACE
U+02004   THREE-PER-EM SPACE
U+02005   FOUR-PER-EM SPACE
U+02006   SIX-PER-EM SPACE
U+02007   FIGURE SPACE
U+02008   PUNCTUATION SPACE
U+02009   THIN SPACE
U+0200A   HAIR SPACE
U+0202F   NARROW NO-BREAK SPACE
U+0205F   MEDIUM MATHEMATICAL SPACE
U+03000 　 IDEOGRAPHIC SPACE
U+00058 X LATIN CAPITAL LETTER X

using:

perl -CSDA -plE 's/\s/_/g'  <<<"X            　X"

Note that the demo replaces with an underscore so that the result is visible. It prints

X_____________X

This is also doable in pure bash:

LC_ALL=en_US.UTF-8 spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
while read -r line; do
    echo "${line//[$spaces]/ }"
done

Setting LC_ALL=en_US.UTF-8 is necessary only if your default locale isn't UTF-8 (which it should be if you are working with UTF-8 text). :)

Demo:

str="X            　X"
echo "${str//[$spaces]/_}"

prints again:

X_____________X
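The loop above reads standard input, and a plain `read -r line` strips leading and trailing ASCII whitespace from each line and skips a final line that has no trailing newline. A minimal sketch of a wrapper that avoids both, assuming bash 4.2 or later and a UTF-8 locale; the function name `replace_spaces` and its optional replacement argument are illustrative, not part of the answer:

```bash
#!/usr/bin/env bash
# Same character list as above (needs a UTF-8 locale for \U to expand).
spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")

# replace_spaces [REPLACEMENT] -- hypothetical helper; filters stdin to stdout.
# REPLACEMENT defaults to a single normal space.
replace_spaces() {
    local replacement="${1- }" line
    # IFS= keeps leading/trailing blanks; the || test also processes a
    # last line that is not terminated by a newline.
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "${line//[$spaces]/$replacement}"
    done
}

# Usage: an indented line containing U+3000 IDEOGRAPHIC SPACE.
printf '  X%bX\n' '\U3000' | replace_spaces _   # "  X_X" in a UTF-8 locale
```

To process a file, redirect it into the function, for example `replace_spaces < file`.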

The same using sed: prepare the variable $spaces as above and use:

sed "s/[$spaces]/ /g" file

Edit - because of some strange copy/paste (or locale) problems, you can check the content of $spaces:

xxd -ps <<<"$spaces"

shows

c2a0e19a80e1a08ee28080e28081e28082e28083e28084e28085e28086e28087e28088e28089e2808ae2808be280afe2819fe38080efbbbf0a

the md5 digest (two different programs)

md5sum <<<"$spaces"
LC_ALL=C md5 <<<"$spaces"

prints the same md5

35cf5e1d7a5f512031d18f3d2ec6612f  -
35cf5e1d7a5f512031d18f3d2ec6612f
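If the locale or a copy/paste step keeps mangling `$spaces`, one way to sidestep the problem is to spell out the UTF-8 byte sequences from the hex dump above and replace each one as a literal string. This does not rely on bracket expressions or on how `printf` expands `\U` in the current locale. A sketch under that assumption; `space_bytes` and `normalize_spaces` are illustrative names:

```bash
#!/usr/bin/env bash
# UTF-8 byte sequences of the 19 characters, copied from the xxd output.
space_bytes=(
    $'\xc2\xa0'     $'\xe1\x9a\x80' $'\xe1\xa0\x8e'   # U+00A0 U+1680 U+180E
    $'\xe2\x80\x80' $'\xe2\x80\x81' $'\xe2\x80\x82'   # U+2000 U+2001 U+2002
    $'\xe2\x80\x83' $'\xe2\x80\x84' $'\xe2\x80\x85'   # U+2003 U+2004 U+2005
    $'\xe2\x80\x86' $'\xe2\x80\x87' $'\xe2\x80\x88'   # U+2006 U+2007 U+2008
    $'\xe2\x80\x89' $'\xe2\x80\x8a' $'\xe2\x80\x8b'   # U+2009 U+200A U+200B
    $'\xe2\x80\xaf' $'\xe2\x81\x9f' $'\xe3\x80\x80'   # U+202F U+205F U+3000
    $'\xef\xbb\xbf'                                   # U+FEFF
)

# normalize_spaces STRING -- hypothetical helper.
# Replaces every listed byte sequence in STRING with one normal space.
normalize_spaces() {
    local text=$1 seq
    for seq in "${space_bytes[@]}"; do
        # Quoting "$seq" makes it a literal string, not a glob pattern.
        text=${text//"$seq"/ }
    done
    printf '%s\n' "$text"
}

# X, NO-BREAK SPACE, IDEOGRAPHIC SPACE, X
normalize_spaces $'X\xc2\xa0\xe3\x80\x80X'   # prints "X  X" (two normal spaces)
```

The trade-off is one substitution pass per character instead of a single bracket expression, which is usually acceptable for a list this short.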


It is possible to identify the characters by their Unicode code points; unfortunately, sed 's/[[:space:]]\+/\ /g' won't do the trick.

By reworking another SO answer, we list all the code points, save them in a variable, and then use sed for the replacement (note that with -i.bak we also keep a copy of the original file):

CHARS=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
sed -i.bak 's/['"$CHARS"']/ /g' /tmp/file_to_edit.txt
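To see the in-place edit and the backup side by side without touching a real file, the command can be tried on a scratch copy first. A small sketch, assuming a UTF-8 locale and a sed that accepts -i.bak (GNU and BSD sed both do); the temporary directory and the shortened three-character CHARS list, written as byte escapes, are for illustration only:

```bash
#!/usr/bin/env bash
# Shortened list for the demo: U+00A0, U+2003, U+3000 as UTF-8 byte escapes.
CHARS=$'\xc2\xa0\xe2\x80\x83\xe3\x80\x80'

tmpdir=$(mktemp -d)                      # scratch directory (illustrative)
file=$tmpdir/file_to_edit.txt
printf 'x\xc2\xa0y\xe3\x80\x80z\n' > "$file"

# CHARS must be assigned in a separate command: written as
# "CHARS=... sed ...", "$CHARS" would be expanded before it is set.
sed -i.bak 's/['"$CHARS"']/ /g' "$file"

cat "$file"                              # "x y z" in a UTF-8 locale
ls "$tmpdir"                             # file_to_edit.txt and file_to_edit.txt.bak
# Remove the scratch directory with: rm -rf "$tmpdir"
```

The .bak file keeps the original bytes, so a bad run can be undone by moving it back over the edited file.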


If you're faced with this task repeatedly, consider installing nws (normalize whitespace), a utility (of mine) that simplifies the task:

nws --ascii file     # convert non-ASCII whitespace and punctuation to ASCII
nws --ascii -i file  # update file in place

The --ascii mode of nws:

transliterates (non-ASCII) Unicode whitespace (such as a no-break space ( )) and punctuation (such as curly quotes (“”), en dash (–), ... ) to their closest ASCII equivalent
while leaving any other Unicode characters alone.

This mode is helpful for source-code samples that have been formatted for display with typographic quotes, em dashes, and the like, which usually makes the code indigestible to compilers/interpreters.

Installation of `nws` from the npm registry (Linux and macOS)

Note: Even if you don't use Node.js, npm, its package manager, works across platforms and is easy to install; try
curl -L https://git.io/n-install | bash

With Node.js installed, install as follows:

[sudo] npm install nws-cli -g

Note:

Whether you need sudo depends on how you installed Node.js and whether you've changed permissions later; if you get an EACCES error, try again with sudo.
The -g ensures global installation and is needed to put nws-cli in your system's $PATH.

Manual installation (any Unix platform with `bash`)

Download this bash script as nws.
Make it executable with chmod +x nws.
Move it or symlink it to a folder in your $PATH, such as /usr/local/bin (macOS) or /usr/bin (Linux).

Optional reading: POSIX character classes `[:space:]` and `[:blank:]` and non-ASCII Unicode whitespace

In UTF-8-based locales, POSIX-compatible utilities should make the POSIX character classes [:space:] and [:blank:] match (non-ASCII) Unicode whitespace.

This relies on the locale charmap's correct classification of Unicode characters based on the POSIX-mandated character classifications, which directly correspond to character classes such as [:space:], available in patterns and regular expressions.

There are two pitfalls:

Unicode is an evolving standard (version 9 as of this writing); your platform's UTF-8 charmap may not be current.
- For instance, on Ubuntu 16.04 the following characters are not properly classified and therefore not matched by [:space:] / [:blank:]:
  no-break space, figure space, narrow no-break space, next line
The utilities should use the active locale's charmap - but there are regrettable exceptions - the following utilities are NOT Unicode-aware (there may be more):
- Among GNU utilities (as of coreutils v8.27):
  - cut, tr
- Mawk, the awk implementation that is the default on Ubuntu, for instance.
- Among BSD/macOS utilities (as of macOS 10.12):
  - awk

Therefore, on a platform that has a current UTF-8 charmap, the following sed command should work, but note that [:space:] also matches tab characters and therefore replaces them with a single space too:

sed 's/[[:space:]]/ /g' file
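Whether this works on a given machine can be checked before relying on it. The following sketch pipes a single character through the platform's sed and reports whether [[:space:]] matched it. The function name `sed_matches_space` is made up for this example, and the result depends on the active locale and its charmap:

```bash
#!/usr/bin/env bash
# sed_matches_space CHAR -- hypothetical helper.
# Succeeds if this platform's sed treats CHAR as [[:space:]]
# in the current locale.
sed_matches_space() {
    local out
    out=$(printf 'a%sb\n' "$1" | sed 's/[[:space:]]/_/g')
    [[ $out == a_b ]]
}

# A few of the characters mentioned above, given as UTF-8 bytes.
names=('U+0020 SPACE' 'U+00A0 NO-BREAK SPACE' 'U+2007 FIGURE SPACE' 'U+202F NARROW NO-BREAK SPACE')
chars=(' ' $'\xc2\xa0' $'\xe2\x80\x87' $'\xe2\x80\xaf')

for i in "${!chars[@]}"; do
    if sed_matches_space "${chars[i]}"; then
        echo "${names[i]}: matched"
    else
        echo "${names[i]}: NOT matched"
    fi
done
```

Any character reported as not matched needs to be listed explicitly in the bracket expression, as in the answers above.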
