How can I convert a complex binary Perl regular expression to C# or PowerShell?


The odds are pretty good that if a byte sequence contains no invalid UTF-8 sequences, it can be treated as UTF-8. Since regular expressions in .NET operate on text, not byte arrays, here's a non-regex solution that should work: decode with a strict UTF-8 decoder that throws on invalid input, and fall back to another encoding if it does. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.

    string result = String.Empty;
    // Strict UTF-8 decoder: throws on invalid bytes instead of substituting.
    // GetEncoding expects the WebName ("utf-8"), not the display name in EncodingName.
    Encoding ae = Encoding.GetEncoding(
        Encoding.UTF8.WebName,
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    try
    {
        result = ae.GetString(mybytes);
    }
    catch (DecoderFallbackException)
    {
        // Revert to some sensible default. Maybe the ANSI code page for this environment?
        // Encoding.Default uses the substitution fallback mechanism, which usually
        // replaces unknown characters with question marks.
        result = Encoding.Default.GetString(mybytes);
    }

If you can call unmanaged code, look into MLang (mlang.dll), which ships with Internet Explorer. It has alternative encoding-autodetection methods that may be more useful.


Try this (I haven't checked that it matches correctly; you can easily test it in LINQPad):

    new Regex(@"
        ^(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*$", RegexOptions.IgnorePatternWhitespace)

EDIT:

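Try reading your file with a single-byte StreamReader, so that each byte becomes exactly one character for the regex to match. Use ISO-8859-1 (Latin-1) rather than ASCII: .NET's ASCII decoder replaces every byte above 0x7F with a question mark, so the multi-byte branches of the pattern could never match. (Note that I didn't actually try it.)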
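
A minimal sketch of that idea, assuming the input is already in a byte array. It decodes with ISO-8859-1, which maps byte N to U+00NN (unlike ASCII), and then applies the pattern above. The names `Utf8Check` and `LooksLikeUtf8` are hypothetical and used only for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class Utf8Check
{
    // Same pattern as above; here each char stands for one input byte (0x00-0xFF).
    static readonly Regex Utf8Pattern = new Regex(@"
        ^(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*$", RegexOptions.IgnorePatternWhitespace);

    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Assumption: ISO-8859-1 gives a lossless one-char-per-byte view of the data.
        string oneCharPerByte = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        return Utf8Pattern.IsMatch(oneCharPerByte);
    }

    static void Main()
    {
        byte[] valid = Encoding.UTF8.GetBytes("h\u00E9llo");   // 68 C3 A9 6C 6C 6F
        byte[] invalid = { 0x68, 0xE9, 0x6C, 0x6C, 0x6F };     // Latin-1 bytes, not valid UTF-8

        Console.WriteLine(LooksLikeUtf8(valid));   // True
        Console.WriteLine(LooksLikeUtf8(invalid)); // False
    }
}
```

Note that the ASCII branch of the pattern only admits tab, LF, CR and printable characters, so input containing other control bytes (such as NUL) is rejected even though it is technically valid UTF-8.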