delimiter [0001] in a text file, reading using np.loadtxt in python


I know that \u0001 is not the right delimiter. It was just a hypothetical example. I am unable to paste the delimiter here; it looks like a closed square box with 0001 arranged in 2 rows by 2 columns.

Most likely, \u0001 is the right delimiter in a sense; you're just specifying it wrong.

There are fonts that use symbols like that for displaying non-printing control characters, so that 0001-in-a-box is the representation of U+0001, aka Start of Heading, aka control-A.*

The first problem is that the Python 2.x literal '\u0001' doesn't specify that character. You can't use \u escapes in str literals, only in unicode literals. The docs explain this, but it makes sense if you think about it: a str is a sequence of bytes, and a Unicode code point doesn't name a byte until you pick an encoding. So, the literal '\u0001' isn't the character U+0001 in your source file's encoding, it's six separate characters (a backslash, the letter u, and four digits).
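A minimal sketch of that difference. The variable names are made up for illustration, and the escaped bytes literal `b'\\u0001'` stands in for what Python 2 produces from the plain str literal '\u0001' (this form runs unchanged on Python 2.7 and Python 3):

```python
# What Python 2 stores for the str literal '\u0001': six separate bytes.
six_bytes = b'\\u0001'

# A unicode literal does process the \u escape: one character, U+0001.
one_char = u'\u0001'

print(len(six_bytes))                         # 6
print(len(one_char))                          # 1
print(one_char.encode('latin-1') == b'\x01')  # True
```

So a delimiter written as '\u0001' in Python 2 would be looking for a literal backslash followed by "u0001", which never appears in the file.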

So, could you just use u'\u0001'? Well, yes, but then you'd need to decode the text file to Unicode, which is probably not appropriate here. It isn't really a text file at all, it's a binary file. And the key is to look at it that way.

Your text editor can't do that, because it's… well, a text editor, so it decodes your binary file as if it were ASCII (or maybe UTF-8, Latin-1, cp1252, whatever) text, then displays the resulting Unicode, which is why you're seeing your font's representation of U+0001. But Python lets you deal with binary data directly; that's what a str does.

So, what are the actual bytes in the file? If you do this:

b = f.readline()
print repr(b)

You'll probably see something like this:

'357812\x0110\x0113\x017\x018\n'
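For anyone who wants to try this without the original file, here is a self-contained sketch. It assumes a made-up one-line sample written to a temporary file, and it uses Python 3 syntax, where the `bytes` type plays the role that str plays in Python 2, so the repr carries a `b` prefix:

```python
import os
import tempfile

# Hypothetical sample line, written to a temporary file for the demo.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"357812\x0110\x0113\x017\x018\n")

# Binary mode: no decoding happens, so we see the raw bytes.
with open(path, "rb") as f:
    first_line = f.readline()
os.remove(path)

print(repr(first_line))

# Splitting on the raw byte recovers the individual fields.
fields = first_line.rstrip(b"\n").split(b"\x01")
print(fields)
```

Each `\x01` in the repr is one delimiter byte; the digits right after it belong to the next field.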

And that's the key: the actual delimiter you want is '\x01'.**
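Passing that delimiter to np.loadtxt might look like the sketch below. The two-row sample is invented, and it is fed in through an in-memory `io.StringIO` object rather than a real file, purely to keep the example self-contained; with a real file you would pass the filename instead:

```python
import io

import numpy as np

# Hypothetical sample: two rows, fields separated by control-A (0x01).
sample = u"357812\x0110\x0113\x017\x018\n357813\x0111\x0114\x019\x012\n"

# delimiter is the single control character itself, not the text "\u0001".
data = np.loadtxt(io.StringIO(sample), delimiter="\x01")

print(data.shape)
print(data[0])
```

If the columns are not all numeric, loadtxt's `dtype` or `converters` arguments would need to be set accordingly; that is outside what this sketch covers.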


Of course you could use u'\u0001'.encode('Latin-1'), or whatever encoding your source file is in… but that's just silly. You know what byte you want to match, why try to come up with an expression that represents that byte instead of just specifying it?


If you wanted to, you could also just convert the control-A delimiters into something more traditional like a comma:

lines = (line.replace('\x01', ',') for line in file)

But there's no reason to go through the extra effort to deal with that. Especially if some of the columns may contain text, which may contain commas… then you'd have to do something like prepend a backslash to every original comma that's not inside quotes, or quote every string column, or whatever, before you can replace the delimiters with commas.
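To illustrate the comma problem, here is a small sketch with an invented record whose middle column is text that happens to contain a comma:

```python
# Hypothetical record: id, a name containing a comma, and a count.
line = "42\x01Smith, John\x017\n"

# Splitting on the real delimiter keeps the text column intact.
direct = line.rstrip("\n").split("\x01")
print(direct)       # ['42', 'Smith, John', '7']

# Replacing the delimiter with commas first splits the name in two.
via_commas = line.replace("\x01", ",").rstrip("\n").split(",")
print(via_commas)   # ['42', 'Smith', ' John', '7']
```

The control-A byte is a good delimiter precisely because it almost never shows up inside real data, so keeping it avoids the quoting and escaping work described above.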


* Technically, it should be shown as a non-composing non-spacing mark… but there are many contexts where you want to see invisible characters, especially control characters, so many fonts have symbols for them, and many text editors display those symbols as if they were normal spacing glyphs. Besides 0001 in a box, common representations include SOH (for "Start of Heading") or A (for "control-A") or 001 (the octal code for the ASCII control character) in different kinds of boxes. This page and this show how a few fonts display it.

** If you knew enough, you could have easily deduced that, because '\x01' in almost any charset will decode to u'\u0001'. But it's more important to know how to look at the bytes directly than to learn other people's guesses…