How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings? How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings? oracle oracle

How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?


Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:

CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2AS    encoded VARCHAR2(32767);    type  array_t IS varray(3) OF VARCHAR2(15);    array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');BEGIN    FOR I IN 1..array.count LOOP        encoded := CASE array(i)            WHEN 'AL32UTF8' THEN string            ELSE CONVERT(string, 'AL32UTF8', array(i))        END;        IF instr(            rawtohex(                utl_raw.cast_to_raw(                    utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8')                )            ),            'EFBFBD'        ) = 0 THEN            RETURN encoded;        END IF;    END LOOP;    RAISE VALUE_ERROR;END;

Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.


to check whether your database column contains invalid utf-8 use the following query:

 select CASE            INSTR (                  RAWTOHEX (                      utl_raw.cast_to_raw (                          utl_i18n.raw_to_char (                                utl_raw.cast_to_raw ( <your_column> )                              , 'utf8'                          )                      )                  )                , 'EFBFBD'            )        WHEN 0 THEN 'OK'        ELSE 'FAIL'         END   from <your_table>      ;

given that your db charset is al32utf8.

note that EF BF BD represents an illegal encoding in utf-8.

as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.

best regards, carsten

ps:oracle names for charsets:CP1252 -> WE8MSWIN1252LATIN-1 -> WE8ISO8859P1