
Decompress and read Dukascopy .bi5 tick files


The code below should do the trick. It opens the file with lzma to decompress it, then uses struct to unpack the binary data one record at a time.

    import lzma
    import struct
    import pandas as pd

    def bi5_to_df(filename, fmt):
        chunk_size = struct.calcsize(fmt)
        data = []
        with lzma.open(filename) as f:
            while True:
                chunk = f.read(chunk_size)
                if chunk:
                    data.append(struct.unpack(fmt, chunk))
                else:
                    break
        df = pd.DataFrame(data)
        return df

The most important thing is to know the right format. I googled around and tried a few guesses, and '>3i2f' (or '>3I2f') works quite well. That is big-endian, 3 ints followed by 2 floats. The format you suggest, 'i4f', doesn't produce sensible floats, regardless of whether it is read as big or little endian. For struct and format syntax see the docs.
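
To make the format string concrete, here is a minimal sketch that only uses the standard struct module. The sample values are made up for illustration (they are not read from a real .bi5 file); the point is that '>3I2f' describes a fixed 20-byte record with no padding.

```python
import struct

FMT = ">3I2f"  # big-endian: 3 unsigned 32-bit ints, then 2 32-bit floats

# With an explicit byte order ('>'), struct adds no padding: 3*4 + 2*4 bytes.
record_size = struct.calcsize(FMT)  # 20

# Round-trip one made-up record to show the field layout.
sample = struct.pack(FMT, 210, 110218, 110216, 1.5, 2.25)
fields = struct.unpack(FMT, sample)
print(record_size, fields)
```

Because every record has the same size, the decompressed payload can be read in chunks of `struct.calcsize(FMT)` bytes, which is what bi5_to_df does.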

    df = bi5_to_df('13h_ticks.bi5', '>3i2f')
    df.head()
    Out[177]:
          0       1       2     3     4
    0   210  110218  110216  1.87  1.12
    1   362  110219  110216  1.00  5.85
    2   875  110220  110217  1.00  1.12
    3  1408  110220  110218  1.50  1.00
    4  1884  110221  110219  3.94  1.00

Update

To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of the output are:

    time, bid, bid_vol, ask, ask_vol
    2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
    2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
    2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5

And bi5_to_df on the same input file gives:

    bi5_to_df('01h_ticks.bi5', '>3I2f').head()
    Out[295]:
          0       1       2     3    4
    0  3581  131966  131945  1.50  1.5
    1  5142  131964  131943  1.50  1.5
    2  5202  131964  131943  2.25  1.5
    3  5321  131964  131944  1.50  1.5
    4  5441  131964  131944  1.50  1.5

So everything seems to be fine (ninety47's code reorders columns).

Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
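
Going by the comparison above, the raw columns appear to be: milliseconds since the start of the hour, ask, bid, ask volume, bid volume, with prices stored as integers. Here is a rough sketch of labelling and scaling the frame accordingly. The names `label_ticks`, `hour_start` and `point` are my own and not part of any library, and the price scale is an assumption that depends on the instrument (1e-3 reproduces the 131.945 shown above).

```python
import pandas as pd

def label_ticks(df, hour_start, point=1e-5):
    """Name the raw columns and convert them to timestamps and prices.

    hour_start: start of the hour the file covers (assumed known from the file path).
    point: price scale, an assumption that varies per instrument.
    """
    out = df.copy()
    out.columns = ["ms", "ask", "bid", "ask_vol", "bid_vol"]
    # Column 0 is an offset in milliseconds from the start of the hour.
    out["time"] = pd.Timestamp(hour_start) + pd.to_timedelta(out["ms"], unit="ms")
    out["ask"] = out["ask"] * point
    out["bid"] = out["bid"] * point
    # Same column order as the ninety47 output.
    return out[["time", "bid", "bid_vol", "ask", "ask_vol"]]

# First raw row from the comparison above.
raw = pd.DataFrame([(3581, 131966, 131945, 1.5, 1.5)])
ticks = label_ticks(raw, "2012-12-03 01:00:00", point=1e-3)
print(ticks)
```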


    import requests
    import struct
    from lzma import LZMADecompressor, FORMAT_AUTO

    # download the compressed EURUSD 2020/06/15/10h_ticks.bi5 file
    res = requests.get("https://www.dukascopy.com/datafeed/EURUSD/2020/06/15/10h_ticks.bi5", stream=True)
    print(res.headers.get('content-type'))
    rawdata = res.content

    decomp = LZMADecompressor(FORMAT_AUTO, None, None)
    decompresseddata = decomp.decompress(rawdata)

    # each record is 20 bytes: 3 unsigned ints + 2 floats, network (big-endian) order
    firstrow = struct.unpack('!IIIff', decompresseddata[0: 20])
    print("firstrow:", firstrow)
    # firstrow: (436, 114271, 114268, 0.9399999976158142, 0.75)
    # time = 2020/06/15/10h + (1 month) + 436 milliseconds

    secondrow = struct.unpack('!IIIff', decompresseddata[20: 40])
    print("secondrow:", secondrow)
    # secondrow: (537, 114271, 114267, 4.309999942779541, 2.25)
    # time = 2020/06/15/10h + (1 month) + 537 milliseconds
    # ask = 114271 / 100000 = 1.14271
    # bid = 114267 / 100000 = 1.14267
    # askvolume = 4.31
    # bidvolume = 2.25

    # note that months are zero-based in the URL: 00 is January
    # "https://www.dukascopy.com/datafeed/EURUSD/2020/00/15/10h_ticks.bi5" for January
    # "https://www.dukascopy.com/datafeed/EURUSD/2020/01/15/10h_ticks.bi5" for February

    # iterate over all records
    print(len(decompresseddata), len(decompresseddata) // 20)
    for i in range(0, len(decompresseddata) // 20):
        print(struct.unpack('!IIIff', decompresseddata[i * 20: (i + 1) * 20]))
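
Since the zero-based month in the URL is easy to get wrong, a small helper can build the path from a normal datetime. This is only a sketch: `bi5_url` is a hypothetical name of mine, and the URL pattern is inferred from the example URLs above rather than from any official documentation.

```python
from datetime import datetime

BASE_URL = "https://www.dukascopy.com/datafeed"

def bi5_url(symbol, hour):
    """Build the tick-file URL for the hour containing `hour`.

    The month in the path is zero-based (00 is January), so subtract 1.
    """
    return (
        f"{BASE_URL}/{symbol}/{hour.year}/{hour.month - 1:02d}/"
        f"{hour.day:02d}/{hour.hour:02d}h_ticks.bi5"
    )

january_url = bi5_url("EURUSD", datetime(2020, 1, 15, 10))
print(january_url)
```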


Did you try using numpy to parse the data before transferring it to pandas? It may be the long way around, but it will allow you to manipulate and clean the data before you do the analysis in pandas, and the integration between the two is pretty straightforward.
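
One way the numpy route could look is sketched below, assuming the 20-byte big-endian record layout (3 unsigned ints, 2 floats) discussed in the other answers. The field names and the function name `ticks_from_bytes` are my own labels, and the sample bytes are made up rather than taken from a real file.

```python
import struct

import numpy as np
import pandas as pd

# Structured dtype mirroring the '>3I2f' struct format (field names are assumptions).
TICK_DTYPE = np.dtype([
    ("ms", ">u4"), ("ask", ">u4"), ("bid", ">u4"),
    ("ask_vol", ">f4"), ("bid_vol", ">f4"),
])

def ticks_from_bytes(raw):
    """Parse already-decompressed .bi5 bytes into a DataFrame in one pass."""
    records = np.frombuffer(raw, dtype=TICK_DTYPE)
    # Convert each field to native byte order before handing it to pandas.
    return pd.DataFrame({
        "ms": records["ms"].astype(np.uint32),
        "ask": records["ask"].astype(np.uint32),
        "bid": records["bid"].astype(np.uint32),
        "ask_vol": records["ask_vol"].astype(np.float32),
        "bid_vol": records["bid_vol"].astype(np.float32),
    })

# Two made-up records packed in the same layout, standing in for decompressed data.
sample = struct.pack(">3I2f", 436, 114271, 114268, 1.5, 0.75)
sample += struct.pack(">3I2f", 537, 114271, 114267, 4.5, 2.25)
ticks = ticks_from_bytes(sample)
print(ticks)
```

For a real file, the bytes would come from something like `lzma.open(filename).read()`. Cleaning steps such as dropping rows or rescaling prices can then be done on the numpy array or on the resulting frame, whichever is more convenient.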