Best way to find position in the Stream where given byte sequence starts Best way to find position in the Stream where given byte sequence starts arrays arrays

Best way to find position in the Stream where given byte sequence starts


I've reached this solution.

I did some benchmarks with an ASCII file that was 3.050 KB and 38803 lines.With a search byte array of 22 bytes in the last line of the file I've got the result in about 2.28 seconds (in a slow/old machine).

public static long FindPosition(Stream stream, byte[] byteSequence){    if (byteSequence.Length > stream.Length)        return -1;    byte[] buffer = new byte[byteSequence.Length];    using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))    {        int i;        while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)        {            if (byteSequence.SequenceEqual(buffer))                return bufStream.Position - byteSequence.Length;            else                bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);        }    }    return -1;}private static int PadLeftSequence(byte[] bytes, byte[] seqBytes){    int i = 1;    while (i < bytes.Length)    {        int n = bytes.Length - i;        byte[] aux1 = new byte[n];        byte[] aux2 = new byte[n];        Array.Copy(bytes, i, aux1, 0, n);        Array.Copy(seqBytes, aux2, n);        if (aux1.SequenceEqual(aux2))            return i;        i++;    }    return i;}


If you treat the stream like another sequence of bytes, you can just search it like you were doing a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.

Here's a quick hack I put together in Java. It works and it's pretty close if not Boyer-Moore. Hope it helps ;)

public static final int BUFFER_SIZE = 32;public static int [] buildShiftArray(byte [] byteSequence){    int [] shifts = new int[byteSequence.length];    int [] ret;    int shiftCount = 0;    byte end = byteSequence[byteSequence.length-1];    int index = byteSequence.length-1;    int shift = 1;    while(--index >= 0){        if(byteSequence[index] == end){            shifts[shiftCount++] = shift;            shift = 1;        } else {            shift++;        }    }    ret = new int[shiftCount];    for(int i = 0;i < shiftCount;i++){        ret[i] = shifts[i];    }    return ret;}public static byte [] flushBuffer(byte [] buffer, int keepSize){    byte [] newBuffer = new byte[buffer.length];    for(int i = 0;i < keepSize;i++){        newBuffer[i] = buffer[buffer.length - keepSize + i];    }    return newBuffer;}public static int findBytes(byte [] haystack, int haystackSize, byte [] needle, int [] shiftArray){    int index = needle.length;    int searchIndex, needleIndex, currentShiftIndex = 0, shift;    boolean shiftFlag = false;    index = needle.length;    while(true){        needleIndex = needle.length-1;        while(true){            if(index >= haystackSize)                return -1;            if(haystack[index] == needle[needleIndex])                break;            index++;        }        searchIndex = index;        needleIndex = needle.length-1;        while(needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]){            searchIndex--;            needleIndex--;        }        if(needleIndex < 0)            return index-needle.length+1;        if(shiftFlag){            shiftFlag = false;            index += shiftArray[0];            currentShiftIndex = 1;        } else if(currentShiftIndex >= shiftArray.length){            shiftFlag = true;            index++;        } else{            index += shiftArray[currentShiftIndex++];        }               }}public static int findBytes(InputStream stream, byte [] needle){    byte [] buffer = new byte[BUFFER_SIZE];    int [] shiftArray = buildShiftArray(needle);    int bufferSize, initBufferSize;    int offset = 0, init = needle.length;    int val;    try{        while(true){            bufferSize = stream.read(buffer, needle.length-init, buffer.length-needle.length+init);            if(bufferSize == -1)                return -1;            if((val = findBytes(buffer, bufferSize+needle.length-init, needle, shiftArray)) != -1)                return val+offset;            buffer = flushBuffer(buffer, needle.length);            offset += bufferSize-init;            init = 0;        }    } catch (IOException e){        e.printStackTrace();    }    return -1;}


You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.

It's likely to be a bit fiddly whatever you do, to be honest :(