
Using Objective-C/Cocoa to unescape Unicode characters, i.e. \u1234


It's correct that Cocoa does not offer a solution, yet Core Foundation does: CFStringTransform.

CFStringTransform lives in a dusty, remote corner of Mac OS (and iOS), so it's a little-known gem. It is the front end to Apple's ICU-compatible string transformation engine. It can perform real magic like transliteration between Greek and Latin (or just about any known script), but it can also be used for mundane tasks like unescaping strings from a crappy server:

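    NSString *input = @"\\u5404\\u500b\\u90fd";
    // CFStringTransform edits the string in place, so it needs a mutable copy
    NSMutableString *convertedString = [input mutableCopy];

    CFStringRef transform = CFSTR("Any-Hex/Java");
    // reverse = YES runs the transform backwards: hex escapes -> characters
    CFStringTransform((__bridge CFMutableStringRef)convertedString, NULL, transform, YES);

    NSLog(@"convertedString: %@", convertedString);
    // prints: 各個都, tada!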

As I said, CFStringTransform is really powerful. It supports a number of predefined transforms, like case mappings, normalizations or Unicode character name conversion. You can even design your own transformations.

I have no idea why Apple does not make it available from Cocoa.

Edit 2015:

OS X 10.11 and iOS 9 add the following method to Foundation:

- (nullable NSString *)stringByApplyingTransform:(NSString *)transform reverse:(BOOL)reverse;

So the example from above becomes...

    NSString *input = @"\\u5404\\u500b\\u90fd";
    NSString *convertedString = [input stringByApplyingTransform:@"Any-Hex/Java"
                                                         reverse:YES];
    NSLog(@"convertedString: %@", convertedString);

Thanks @nschmidt for the heads up.


There is no built-in function to do C unescaping.

You can cheat a little with NSPropertyListSerialization since an "old text style" plist supports C escaping via \Uxxxx:

    NSString* input = @"ab\"cA\"BC\\u2345\\u0123";

    // will cause trouble if you have "abc\\\\uvw"
    NSString* esc1 = [input stringByReplacingOccurrencesOfString:@"\\u" withString:@"\\U"];
    NSString* esc2 = [esc1 stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];
    NSString* quoted = [[@"\"" stringByAppendingString:esc2] stringByAppendingString:@"\""];
    NSData* data = [quoted dataUsingEncoding:NSUTF8StringEncoding];
    NSString* unesc = [NSPropertyListSerialization propertyListFromData:data
                       mutabilityOption:NSPropertyListImmutable format:NULL
                       errorDescription:NULL];
    assert([unesc isKindOfClass:[NSString class]]);
    NSLog(@"Output = %@", unesc);

but mind that this isn't very efficient. It's far better if you write up your own parser. (BTW are you decoding JSON strings? If yes you could use the existing JSON parsers.)
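For anyone who does want to go the "write your own parser" route, here is a minimal sketch in plain C (which compiles as-is inside an Objective-C file). It is only an illustration under a few assumptions: the input is a NUL-terminated byte string, the output is wanted as UTF-8, and only `\uXXXX` escapes are decoded (other escapes, including `\\`, are copied through untouched). The names `unescape_unicode`, `parse_hex4` and `put_utf8` are made up for this example and are not part of any Apple API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parse exactly four hex digits at s. Returns the value, or -1 if any of
   the four characters is not a hex digit (this includes the NUL terminator,
   so the function never reads past the end of the string). */
static int32_t parse_hex4(const char *s)
{
    int32_t value = 0;
    for (int i = 0; i < 4; i++) {
        char c = s[i];
        int digit;
        if (c >= '0' && c <= '9')      digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return -1;
        value = value * 16 + digit;
    }
    return value;
}

/* Write code point cp to out as UTF-8. Returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode \uXXXX escapes from `in` into UTF-8 in `out`.
   `out` must have room for strlen(in) + 1 bytes; the decoded text is never
   longer than the escaped text. Returns the output length without the NUL. */
size_t unescape_unicode(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    while (*in != '\0') {
        int32_t hi;
        if (in[0] == '\\' && in[1] == '\\') {
            /* escaped backslash: copy both so a following 'u' is not misread */
            out[n++] = *in++;
            out[n++] = *in++;
        } else if (in[0] == '\\' && in[1] == 'u' && (hi = parse_hex4(in + 2)) >= 0) {
            uint32_t cp = (uint32_t)hi;
            in += 6;
            /* a high surrogate followed by an escaped low surrogate forms one code point */
            if (cp >= 0xD800 && cp <= 0xDBFF && in[0] == '\\' && in[1] == 'u') {
                int32_t lo = parse_hex4(in + 2);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + ((uint32_t)lo - 0xDC00);
                    in += 6;
                }
            }
            /* unpaired surrogates are not valid in UTF-8: substitute U+FFFD */
            if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
            n += put_utf8(cp, out + n);
        } else {
            /* ordinary byte, or a malformed escape: copy unchanged */
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Calling `unescape_unicode("\\u5404\\u500b\\u90fd", buf)` with a large enough `buf` should leave the nine UTF-8 bytes of 各個都 in `buf`, which you can hand to `+[NSString stringWithUTF8String:]`. A truncated escape such as `\u12` is copied through as-is rather than rejected; whether that is the right policy depends on how much you trust the server.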


Here's what I ended up writing. Hopefully this will help some people along.

    + (NSString*) unescapeUnicodeString:(NSString*)string
    {
        // unescape quotes and backslashes
        NSString* unescapedString = [string stringByReplacingOccurrencesOfString:@"\\\"" withString:@"\""];
        unescapedString = [unescapedString stringByReplacingOccurrencesOfString:@"\\\\" withString:@"\\"];

        // tokenize based on the unicode escape marker
        NSMutableString* tokenizedString = [NSMutableString string];
        NSScanner* scanner = [NSScanner scannerWithString:unescapedString];
        // by default NSScanner skips whitespace before each scan,
        // which would silently drop spaces that follow an escape
        scanner.charactersToBeSkipped = nil;

        while ([scanner isAtEnd] == NO)
        {
            // read up to the next unicode marker
            // if a string has been scanned, it's a token
            // and should be appended to the tokenized string
            NSString* token = @"";
            [scanner scanUpToString:@"\\u" intoString:&token];
            if (token != nil && token.length > 0)
            {
                [tokenizedString appendString:token];
                continue;
            }

            // the scanner now sits on a "\u" marker
            // if fewer than 2 + 4 characters remain the escape is
            // malformed, so append the remainder untouched
            // and move the scanner to the end
            NSUInteger location = [scanner scanLocation];
            NSUInteger remaining = scanner.string.length - location;
            if (remaining < 2 + 4)
            {
                [tokenizedString appendString:[scanner.string substringFromIndex:location]];
                [scanner setScanLocation:scanner.string.length];
                continue;
            }

            // move the location past the unicode marker
            // then read in the next 4 characters
            location += 2;
            NSRange range = {location, 4};
            token = [scanner.string substringWithRange:range];
            unichar codeValue = (unichar) strtol([token UTF8String], NULL, 16);
            [tokenizedString appendString:[NSString stringWithFormat:@"%C", codeValue]];

            // move the scanner past the 4 characters
            // then keep scanning
            location += 4;
            [scanner setScanLocation:location];
        }

        // done
        return tokenizedString;
    }

    + (NSString*) escapeUnicodeString:(NSString*)string
    {
        // first escape backslashes and quotes
        // note that the backslash has to be escaped before the quote
        // otherwise it will end up with an extra backslash
        NSString* escapedString = [string stringByReplacingOccurrencesOfString:@"\\" withString:@"\\\\"];
        escapedString = [escapedString stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];

        // convert to encoded unicode
        // do this by getting the data for the string in UTF-16 little endian
        // (this matches the byte order of the little-endian CPUs Apple ships,
        // so each pair of bytes can be read directly as a uint16_t)
        NSData* data = [escapedString dataUsingEncoding:NSUTF16LittleEndianStringEncoding allowLossyConversion:YES];
        size_t bytesRead = 0;
        const char* bytes = data.bytes;
        NSMutableString* encodedString = [NSMutableString string];

        // loop through the byte array
        // read two bytes at a time, if the value is
        // above 0x7E it is written as a \uXXXX escape
        // otherwise it is an ASCII character
        // and the %C format writes it as-is
        while (bytesRead < data.length)
        {
            uint16_t code = *((const uint16_t*) &bytes[bytesRead]);
            if (code > 0x007E)
            {
                [encodedString appendFormat:@"\\u%04X", code];
            }
            else
            {
                [encodedString appendFormat:@"%C", code];
            }
            bytesRead += sizeof(uint16_t);
        }

        // done
        return encodedString;
    }
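If you'd rather not go through NSData and pointer casts, the escaping direction can also be sketched in plain C over an array of UTF-16 code units (for instance one filled by `-[NSString getCharacters:range:]`). This is just an illustration: the function name `escape_utf16` and the buffer-size contract are assumptions of this example, not an existing API. It escapes the backslash and the quote and turns every code unit above 0x7E into `\uXXXX`, so characters outside the BMP come out as two escapes (a surrogate pair).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Escape `count` UTF-16 code units into a pure ASCII, NUL-terminated string.
   `out` must have room for 6 * count + 1 bytes (worst case: every unit
   becomes a six-character \uXXXX escape).
   Returns the number of bytes written, not counting the NUL. */
size_t escape_utf16(const uint16_t *units, size_t count, char *out)
{
    assert(units != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        uint16_t u = units[i];
        if (u == '\\' || u == '"') {
            /* backslash and quote get a leading backslash */
            out[n++] = '\\';
            out[n++] = (char)u;
        } else if (u > 0x7E) {
            /* anything above the printable ASCII range becomes \uXXXX */
            n += (size_t)snprintf(out + n, 7, "\\u%04X", (unsigned)u);
        } else {
            out[n++] = (char)u;
        }
    }
    out[n] = '\0';
    return n;
}
```

Because each code unit is handled exactly once, there is no ordering problem between escaping backslashes and escaping quotes.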