How can I convert Oracle VARCHAR2 values to UTF-8 from a list of possible encodings?
Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I've come up with this:
CREATE OR REPLACE FUNCTION reencode(string IN VARCHAR2) RETURN VARCHAR2AS encoded VARCHAR2(32767); type array_t IS varray(3) OF VARCHAR2(15); array array_t := array_t('AL32UTF8', 'WE8MSWIN1252', 'WE8ISO8859P1');BEGIN FOR I IN 1..array.count LOOP encoded := CASE array(i) WHEN 'AL32UTF8' THEN string ELSE CONVERT(string, 'AL32UTF8', array(i)) END; IF instr( rawtohex( utl_raw.cast_to_raw( utl_i18n.raw_to_char(utl_raw.cast_to_raw(encoded), 'utf8') ) ), 'EFBFBD' ) = 0 THEN RETURN encoded; END IF; END LOOP; RAISE VALUE_ERROR;END;
Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every single one of the list of 800 or so bad values I have without complaint. The same is not true for my Perl or PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate, and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can't complain. This will be good enough to sanitize my data, I think.
to check whether your database column contains invalid utf-8 use the following query:
select CASE INSTR ( RAWTOHEX ( utl_raw.cast_to_raw ( utl_i18n.raw_to_char ( utl_raw.cast_to_raw ( <your_column> ) , 'utf8' ) ) ) , 'EFBFBD' ) WHEN 0 THEN 'OK' ELSE 'FAIL' END from <your_table> ;
given that your db charset is al32utf8.
note that EF BF BD
represents an illegal encoding in utf-8.
as all the other charsets you indicate are byte-oriented, transformation to unicode will never fail but possibly produce different code points. without contextual information automated determination of the actual source charset won't be possible.
best regards, carsten
ps:oracle names for charsets:CP1252
-> WE8MSWIN1252
LATIN-1
-> WE8ISO8859P1