How can I convert a complex binary Perl regular expression to C# or PowerShell?
This post at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 describes several workarounds.
The odds are pretty good that if a byte sequence contains no invalid UTF-8 sequences, it can be treated as UTF-8. Since regular expressions in .NET operate on text, not byte arrays, here's a non-regex solution that should work. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.
    string result;
    // Strict UTF-8: throw on invalid byte sequences instead of silently substituting.
    // WebName ("utf-8") is used because GetEncoding does not accept the display name in EncodingName.
    Encoding ae = Encoding.GetEncoding(
        Encoding.UTF8.WebName,
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    try
    {
        result = ae.GetString(mybytes);
    }
    catch (DecoderFallbackException)
    {
        // Revert to some sensible default. Maybe the ANSI code page for this environment?
        // This will use the substitution fallback mechanism, which usually replaces
        // unknown characters with question marks.
        result = Encoding.Default.GetString(mybytes);
    }
If you can interact with unmanaged code, look into MLang (mlang.dll), which ships with Internet Explorer. It has alternate encoding autodetection methods that may be more useful.
Try this: (I haven't checked that it matches correctly; you can easily try it in LINQPad).
    new Regex(@"
        ^(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$", RegexOptions.IgnorePatternWhitespace)
EDIT:
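Try reading your file using a StreamReader with a single-byte encoding that maps every byte to the character with the same code, such as ISO-8859-1 (Latin-1), so that the \xNN escapes in the pattern line up with the raw byte values. A plain ASCII StreamReader won't do, because Encoding.ASCII replaces every byte above 0x7F with '?'. That should do what you're looking for. (Note that I didn't actually try it.)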
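Here is a minimal sketch of how the pattern above might be applied to a byte array, assuming the bytes are first mapped one-to-one onto characters. The class name Utf8Sniffer and the method name LooksLikeUtf8 are made up for illustration. The group is written as non-capturing, which is my own tweak so that .NET does not record a capture for every matched character.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from any library.
static class Utf8Sniffer
{
    // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same code,
    // so \xNN in the pattern corresponds to the raw byte value NN.
    // (Encoding.ASCII would not work: it turns bytes above 0x7F into '?'.)
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    static readonly Regex ValidUtf8 = new Regex(@"
        ^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*$", RegexOptions.IgnorePatternWhitespace);

    // Returns true when the bytes form well-formed UTF-8 according to the pattern.
    // Note that the pattern also rejects control characters other than tab, LF and CR.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        return ValidUtf8.IsMatch(Latin1.GetString(bytes));
    }
}
```

It could be called as Utf8Sniffer.LooksLikeUtf8(File.ReadAllBytes(path)), where path is whatever file you want to test. For large files the strict-decoder approach from the other answer is probably cheaper, since this one builds a string as long as the input before matching.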