Decompress and read Dukascopy .bi5 tick files
The code below should do the trick. First it opens the file and decompresses it with `lzma`, then it uses `struct` to unpack the binary data.
```python
import lzma
import struct

import pandas as pd


def bi5_to_df(filename, fmt):
    chunk_size = struct.calcsize(fmt)
    data = []
    with lzma.open(filename) as f:
        while True:
            chunk = f.read(chunk_size)
            if chunk:
                data.append(struct.unpack(fmt, chunk))
            else:
                break
    df = pd.DataFrame(data)
    return df
```
The most important thing is to know the right format. I googled around and tried to guess, and `'>3i2f'` (or `'>3I2f'`) works quite well. (That's big-endian: 3 ints followed by 2 floats. The format you suggest, `'i4f'`, doesn't produce sensible floats, regardless of whether you read it big- or little-endian.) For `struct` and the format syntax, see the docs.
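As a quick sanity check on the format string (a standalone sketch, not part of the original answer), `'>3i2f'` describes a 20-byte record that `struct` can round-trip; the record values below are fabricated to mirror the output further down:

```python
import struct

fmt = '>3i2f'
# 3 big-endian 32-bit ints + 2 big-endian 32-bit floats = 20 bytes per tick
assert struct.calcsize(fmt) == 20

# round-trip a fabricated record: (ms offset, ask, bid, ask_vol, bid_vol)
record = (3581, 131966, 131945, 1.5, 1.5)
packed = struct.pack(fmt, *record)
unpacked = struct.unpack(fmt, packed)
print(unpacked)  # (3581, 131966, 131945, 1.5, 1.5)
```

The ints survive exactly, and the volumes survive here because 1.5 is exactly representable as a 32-bit float; arbitrary volumes will come back with float32 rounding.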
```python
df = bi5_to_df('13h_ticks.bi5', '>3i2f')
df.head()
Out[177]:
      0       1       2     3     4
0   210  110218  110216  1.87  1.12
1   362  110219  110216  1.00  5.85
2   875  110220  110217  1.00  1.12
3  1408  110220  110218  1.50  1.00
4  1884  110221  110219  3.94  1.00
```
Update
To compare the output of `bi5_to_df` with https://github.com/ninety47/dukascopy, I compiled and ran `test_read_bi5` from there. The first lines of the output are:
```
time, bid, bid_vol, ask, ask_vol
2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5
```
And `bi5_to_df` on the same input file gives:
```python
bi5_to_df('01h_ticks.bi5', '>3I2f').head()
Out[295]:
      0       1       2     3    4
0  3581  131966  131945  1.50  1.5
1  5142  131964  131943  1.50  1.5
2  5202  131964  131943  2.25  1.5
3  5321  131964  131944  1.50  1.5
4  5441  131964  131944  1.50  1.5
```
So everything seems to be fine (ninety47's code reorders columns).
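If you want real prices and timestamps rather than raw integers, something like the sketch below works. The `decode_tick` helper is hypothetical, and the point divisor is an assumption: Dukascopy quotes most pairs in units of 1/100,000, but JPY pairs (like the file above) in units of 1/1,000.

```python
import datetime

# hypothetical helper: turn one raw tick tuple into human-readable values;
# divisor is an assumption (100_000 for EURUSD-like pairs, 1_000 for JPY pairs)
def decode_tick(raw, hour_start, divisor=100_000):
    ms, ask_raw, bid_raw, ask_vol, bid_vol = raw
    return {
        'time': hour_start + datetime.timedelta(milliseconds=ms),
        'ask': ask_raw / divisor,
        'bid': bid_raw / divisor,
        'ask_vol': ask_vol,
        'bid_vol': bid_vol,
    }

# the filename encodes the hour, so the caller supplies the hour start
hour_start = datetime.datetime(2012, 12, 3, 1, 0)
tick = decode_tick((3581, 131966, 131945, 1.5, 1.5), hour_start, divisor=1_000)
print(tick['time'], tick['bid'], tick['ask'])
# 2012-12-03 01:00:03.581000 131.945 131.966
```

Decoded this way, the first row matches the first line of ninety47's output above.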
Also, it's probably more accurate to use `'>3I2f'` instead of `'>3i2f'` (i.e. `unsigned int` instead of `int`).
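For these tick fields both formats decode identically, since the raw values stay well below 2**31; the signed/unsigned choice only matters once a field's top bit is set, as this small check shows:

```python
import struct

# typical tick values: signed and unsigned readings agree
payload = struct.pack('>3I2f', 3581, 131964, 131943, 1.5, 1.5)
assert struct.unpack('>3i2f', payload) == struct.unpack('>3I2f', payload)

# with the top bit set, the two readings diverge
big = struct.pack('>I', 0x80000000)
print(struct.unpack('>i', big)[0], struct.unpack('>I', big)[0])
# -2147483648 2147483648
```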
```python
import requests
import struct
from lzma import LZMADecompressor, FORMAT_AUTO

# download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
print(res.headers.get('content-type'))
rawdata = res.content

decomp = LZMADecompressor(FORMAT_AUTO, None, None)
decompresseddata = decomp.decompress(rawdata)

firstrow = struct.unpack('!IIIff', decompresseddata[0:20])
print("firstrow:", firstrow)
# firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
# time = 2020/06/15/10h + (1 month) + 436 milliseconds

secondrow = struct.unpack('!IIIff', decompresseddata[20:40])
print("secondrow:", secondrow)
# secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)
# time = 2020/06/15/10h + (1 month) + 537 milliseconds
# ask = 114271 / 100000 = 1.14271
# bid = 114267 / 100000 = 1.14267
# askvolume = 4.31
# bidvolume = 2.25

# note that the month in the URL is zero-indexed: 00 is January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for January
# "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for February

# iterating over all 20-byte records
print(len(decompresseddata), int(len(decompresseddata) / 20))
for i in range(0, int(len(decompresseddata) / 20)):
    print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
```
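Putting the pieces together, here is a self-contained sketch of the same decompress-unpack-DataFrame pipeline. To stay runnable offline it fabricates a tiny LZMA-compressed payload instead of downloading, so the tick values and the EURUSD divisor of 100,000 are assumptions:

```python
import lzma
import struct

import pandas as pd

FMT = '!IIIff'                      # ms offset, ask, bid, ask_vol, bid_vol
REC_SIZE = struct.calcsize(FMT)     # 20 bytes per tick

# fabricate two ticks and compress them, standing in for a downloaded .bi5 body
ticks = [(436, 114271, 114268, 0.94, 0.75), (537, 114271, 114267, 4.31, 2.25)]
raw = b''.join(struct.pack(FMT, *t) for t in ticks)
payload = lzma.compress(raw)

# decompress and unpack every 20-byte record
data = lzma.LZMADecompressor(lzma.FORMAT_AUTO).decompress(payload)
rows = [struct.unpack(FMT, data[i:i + REC_SIZE])
        for i in range(0, len(data), REC_SIZE)]

df = pd.DataFrame(rows, columns=['ms', 'ask', 'bid', 'ask_vol', 'bid_vol'])
df[['ask', 'bid']] = df[['ask', 'bid']] / 100_000   # EURUSD point divisor
print(df)
```

Swapping the fabricated `payload` for `requests.get(...).content` reproduces the download-based version above.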