Why does findstr not handle case properly (in some circumstances)?

I believe this is mostly a horrible design flaw.

We all expect the ranges to collate based on the ASCII code value. But they don't. Instead, the ranges are based on a collation sequence that nearly matches the default sequence used by SORT. EDIT - The exact collation sequence used by FINDSTR is now available at https://stackoverflow.com/a/20159191/1012053 under the section titled Regex character class ranges [x-y].

I prepared a text file containing one line for each extended ASCII character from 1 - 255, excluding 10 (LF), 13 (CR), and 26 (EOF on Windows). Each line holds the character, followed by a space, followed by the decimal code for the character. I then ran the file through SORT and captured the output in a sortedChars.txt file.
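For anyone who wants to reproduce the test file, here is a minimal Python sketch of the unsorted input. It is not the method used in this answer, which does not say how the file was built. The function names and the `chars.txt` file name are hypothetical, and the three-digit zero-padded code is inferred from the output shown in this answer.

```python
# Sketch: build the unsorted input file described above.
EXCLUDED = {10, 13, 26}  # LF, CR, and Ctrl-Z (EOF on Windows)


def build_char_lines():
    """Return one CRLF-terminated line per byte value 1-255, minus EXCLUDED."""
    lines = []
    for code in range(1, 256):
        if code in EXCLUDED:
            continue
        # Raw byte, a space, then the zero-padded decimal code, e.g. b"0 048".
        lines.append(bytes([code]) + b" " + f"{code:03d}".encode("ascii") + b"\r\n")
    return lines


def write_char_file(path="chars.txt"):
    """Write the lines as raw bytes so no text encoding alters them.

    "chars.txt" is a hypothetical name for the unsorted file.
    """
    with open(path, "wb") as f:
        f.writelines(build_char_lines())
```

After calling `write_char_file()`, the file would still need to go through SORT to produce sortedChars.txt. How the high bytes display depends on the console code page.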

I now can easily test any regex range against this sorted file and demonstrate how the range is determined by a collation sequence that is nearly the same as SORT.

>findstr /nrc:"^[0-9]" sortedChars.txt
137:0 048
138:½ 171
139:¼ 172
140:1 049
141:2 050
142:² 253
143:3 051
144:4 052
145:5 053
146:6 054
147:7 055
148:8 056
149:9 057

The results are not quite what we expected in that chars 171, 172 and 253 are thrown in the mix. But the results make perfect sense. The line number prefix corresponds to the SORT collation sequence, and you can see that the range exactly matches according to the SORT sequence.
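To make the gap between expectation and result concrete, the sketch below compares the codes from the [0-9] output above with what a range based purely on code values would match. Python's `re` module is used only as a stand-in for "range by code value"; it says nothing about how FINDSTR is implemented.

```python
import re

# Codes taken from the findstr /nrc:"^[0-9]" output above (last column).
findstr_matches = [48, 171, 172, 49, 50, 253, 51, 52, 53, 54, 55, 56, 57]

# What a code-value range would match. chr(code) is only a stand-in for
# "a single character with this code"; Python ranges compare code values.
ascii_expected = [
    code for code in range(1, 256)
    if re.match(r"^[0-9]", chr(code))
]

# Codes FINDSTR matched that a code-value range would not.
extras = sorted(set(findstr_matches) - set(ascii_expected))
print(extras)  # [171, 172, 253]
```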

Here is another range test that exactly follows the SORT sequence:

>findstr /nrc:"^[!-=]" sortedChars.txt
34:! 033
35:" 034
36:# 035
37:$ 036
38:% 037
39:& 038
40:( 040
41:) 041
42:* 042
43:, 044
44:. 046
45:/ 047
46:: 058
47:; 059
48:? 063
49:@ 064
50:[ 091
51:\ 092
52:] 093
53:^ 094
54:_ 095
55:` 096
56:{ 123
57:| 124
58:} 125
59:~ 126
60:¡ 173
61:¿ 168
62:¢ 155
63:£ 156
64:¥ 157
65:₧ 158
66:+ 043
67:∙ 249
68:< 060
69:= 061

There is one small anomaly with alpha characters. Character "a" sorts between "A" and "Z" yet it does not match [A-Z]. "z" sorts after "Z", yet it matches [A-Z]. There is a corresponding problem with [a-z]. "A" sorts before "a", yet it matches [a-z]. "Z" sorts between "a" and "z", yet it does not match [a-z].

Here are the [A-Z] results:

>findstr /nrc:"^[A-Z]" sortedChars.txt
151:A 065
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
239:Z 090
240:z 122

And the [a-z] results

>findstr /nrc:"^[a-z]" sortedChars.txt
151:A 065
152:a 097
153:â 131
154:ä 132
155:à 133
156:å 134
157:Ä 142
158:Å 143
159:á 160
160:ª 166
161:æ 145
162:Æ 146
163:B 066
164:b 098
165:C 067
166:c 099
167:Ç 128
168:ç 135
169:D 068
170:d 100
171:E 069
172:e 101
173:é 130
174:ê 136
175:ë 137
176:è 138
177:É 144
178:F 070
179:f 102
180:ƒ 159
181:G 071
182:g 103
183:H 072
184:h 104
185:I 073
186:i 105
187:ï 139
188:î 140
189:ì 141
190:í 161
191:J 074
192:j 106
193:K 075
194:k 107
195:L 076
196:l 108
197:M 077
198:m 109
199:N 078
200:n 110
201:ñ 164
202:Ñ 165
203:ⁿ 252
204:O 079
205:o 111
206:ô 147
207:ö 148
208:ò 149
209:Ö 153
210:ó 162
211:º 167
212:P 080
213:p 112
214:Q 081
215:q 113
216:R 082
217:r 114
218:S 083
219:s 115
220:ß 225
221:T 084
222:t 116
223:U 085
224:u 117
225:û 150
226:ù 151
227:ú 163
228:ü 129
229:Ü 154
230:V 086
231:v 118
232:W 087
233:w 119
234:X 088
235:x 120
236:Y 089
237:y 121
238:ÿ 152
240:z 122

SORT sorts upper case before lower case. (EDIT - I just read the help for SORT and learned that it does not differentiate between upper and lower case. The fact that my SORT output consistently put upper before lower is probably a result of the order of the input.) But regex apparently sorts lower case before upper case. All of the following ranges fail to match any characters.

>findstr /nrc:"^[A-a]" sortedChars.txt
>findstr /nrc:"^[B-b]" sortedChars.txt
>findstr /nrc:"^[C-c]" sortedChars.txt
>findstr /nrc:"^[D-d]" sortedChars.txt

Reversing the order finds the characters.

>findstr /nrc:"^[a-A]" sortedChars.txt
151:A 065
152:a 097
>findstr /nrc:"^[b-B]" sortedChars.txt
163:B 066
164:b 098
>findstr /nrc:"^[c-C]" sortedChars.txt
165:C 067
166:c 099
>findstr /nrc:"^[d-D]" sortedChars.txt
169:D 068
170:d 100

There are additional characters that regex sorts differently than SORT, but I haven't got a precise list.
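The alpha anomaly is easier to see with a small model. The sketch below treats a range as "everything between the two endpoints in a collation sequence" and uses a toy sequence in which each lower-case letter sits just before its upper-case twin. That sequence is an assumption made for illustration: it leaves out the accented letters and every other character, and the function names are hypothetical. It is not FINDSTR's actual implementation.

```python
import string

# ASSUMPTION: a toy collation sequence with each lower-case letter immediately
# before its upper-case twin (a A b B ... z Z). The real FINDSTR sequence also
# interleaves accented letters and other characters, which are left out here.
TOY_SEQUENCE = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range(ch, start, end, sequence=TOY_SEQUENCE):
    """True if ch lies between start and end by position in sequence."""
    return sequence.index(start) <= sequence.index(ch) <= sequence.index(end)


def expand(start, end, sequence=TOY_SEQUENCE):
    """List the characters of sequence that the range [start-end] would cover."""
    return [c for c in sequence if in_range(c, start, end, sequence)]


print(expand("A", "a"))  # []  (A sorts after a, so the range is empty)
print(expand("a", "A"))  # ['a', 'A']
print(expand("A", "C"))  # ['A', 'b', 'B', 'c', 'C']
```

Under this toy ordering, `in_range("a", "A", "Z")` is false and `in_range("z", "A", "Z")` is true, which lines up with the [A-Z] results above. Likewise "Z" falls outside [a-z] while "A" falls inside it.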


So, if you want to match:

only numbers : FindStr /R "^[0123-9]*$"
octal : FindStr /R "^[0123-7]*$"
hexadecimal : FindStr /R "^[0123-9aAb-Cd-EfF]*$"
alpha with no accent : FindStr /R "^[aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ]*$"
alphanumeric : FindStr /R "^[0123-9aAb-Cd-EfFg-Ij-NoOp-St-Uv-YzZ]*$"


This appears to be caused by the use of ranges within regular expression searches.

It doesn't occur for the first character in the range. It doesn't occur at all for non-ranges.

> echo a | findstr /r "[A-C]"
> echo b | findstr /r "[A-C]"
    b
> echo c | findstr /r "[A-C]"
    c
> echo d | findstr /r "[A-C]"
> echo b | findstr /r "[B-C]"
> echo c | findstr /r "[B-C]"
    c
> echo a | findstr /r "[ABC]"
> echo b | findstr /r "[ABC]"
> echo c | findstr /r "[ABC]"
> echo d | findstr /r "[ABC]"
> echo b | findstr /r "[BC]"
> echo c | findstr /r "[BC]"
> echo A | findstr /r "[A-C]"
    A
> echo B | findstr /r "[A-C]"
    B
> echo C | findstr /r "[A-C]"
    C
> echo D | findstr /r "[A-C]"

According to the SS64 CMD FINDSTR page (which, in a stunning display of circularity, references this question), the range [A-Z]:

... includes the complete English alphabet, both upper and lower case (except for "a"), as well as non-English alpha characters with diacriticals.

To get around the problem in my environment, I simply used explicit character lists instead of ranges (such as [ABCD] rather than [A-D]). A more sensible approach, for those who are allowed, would be to download Cygwin or GnuWin32 and use grep from one of those packages.
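If there are many ranges to spell out, the explicit lists can be generated rather than typed. The helper below is a hypothetical sketch, not part of any FINDSTR tooling. It enumerates characters by ASCII code and deliberately accepts only letters and digits, so nothing inside the brackets needs escaping.

```python
def explicit_class(start, end):
    """Spell out an ASCII range as an explicit findstr character class.

    Hypothetical helper: only ASCII letters and digits are accepted, so no
    character needs escaping inside the brackets.
    """
    chars = [chr(code) for code in range(ord(start), ord(end) + 1)]
    if not chars or not all(c.isascii() and c.isalnum() for c in chars):
        raise ValueError("expected an ascending range of ASCII letters or digits")
    return "[" + "".join(chars) + "]"


print(explicit_class("A", "D"))  # [ABCD]
print(explicit_class("0", "9"))  # [0123456789]
```

The resulting string can be pasted into the pattern in place of the range, for example `findstr /r "[ABCD]"`.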
