Difference between \w and \b regular expression meta characters
The metacharacter \b
is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character isa word character.
- After the last character in the string, if thelast character is a word character.
- Between two characters in thestring, where one is a word character and the other is not a word character.
Simply put: \b
allows you to perform a "whole words only" search using a regular expression in the form of \bword\b
. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
In all flavors, the characters [a-zA-Z0-9_]
are word characters. These are also matched by the short-hand character class \w
. Flavors showing "ascii" for word boundaries in the flavor comparison recognize only these as word characters.
\w
stands for "word character", usually [A-Za-z0-9_]
. Notice the inclusion of the underscore and digits.
\B
is the negated version of \b
. \B
matches at every position where \b
does not. Effectively, \B
matches at any position between two word characters as well as at any position between two non-word characters.
\W
is short for [^\w]
, the negated version of \w
.
\w
matches a word character. \b
is a zero-width match that matches a position character that has a word character on one side, and something that's not a word character on the other. (Examples of things that aren't word characters include whitespace, beginning and end of the string, etc.)
\w
matches a
, b
, c
, d
, e
, and f
in "abc def"
\b
matches the (zero-width) position before a
, after c
, before d
, and after f
in "abc def"
@Mahender, you probably meant the difference between \W
(instead of \w
) and \b
. If not, then I would agree with @BoltClock and @jwismar above. Otherwise continue reading.
\W
would match any non-word character and so its easy to try to use it to match word boundaries. The problem is that it will not match the start or end of a line. \b
is more suited for matching word boundaries as it will also match the start or end of a line. Roughly speaking (more experienced users can correct me here) \b
can be thought of as (\W|^|$)
. [Edit: as @Ωmega mentions below, \b
is a zero-length match so (\W|^|$)
is not strictly correct, but hopefully helps explain the diff]
Quick example: For the string Hello World
, .+\W
would match Hello_
(with the space) but will not match World
. .+\b
would match both Hello
and World
.