How to extract file extension from byte array


It turns out that the JDK's URLConnection class has a decent method for this, guessContentTypeFromStream. See the following answer: Getting A File's Mime Type In Java

To detect the type from a byte array instead of a file, simply use java.io.ByteArrayInputStream (which reads from a byte array) in place of java.io.FileInputStream (which reads from a file), as in the following example. Note that the result is a MIME type rather than a file extension, and that it is null when the content is not recognized:

byte[] content = ...; // the bytes to inspect
InputStream is = new ByteArrayInputStream(content);
String mimeType = URLConnection.guessContentTypeFromStream(is); // null if the type is not recognized
//...close stream
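For completeness, here is a minimal self-contained sketch of the same idea that also maps the detected MIME type to an extension. The class name, the helper names and the tiny extension table are my own assumptions for illustration (they are not part of the JDK), and Map.of needs Java 9 or later:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.util.Map;

class ByteArrayMimeGuess {
    // Hypothetical, deliberately tiny MIME-type-to-extension table; extend as needed.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/png", "png",
            "image/gif", "gif",
            "image/jpeg", "jpg");

    // Returns the MIME type guessed from the leading bytes, or null if unrecognized.
    static String guessMimeType(byte[] content) {
        // ByteArrayInputStream supports mark/reset, which the JDK method relies on.
        try (InputStream is = new ByteArrayInputStream(content)) {
            return URLConnection.guessContentTypeFromStream(is);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for convenience.
            throw new UncheckedIOException(e);
        }
    }

    // Returns an extension from the table above, or null if the type is unknown.
    static String guessExtension(byte[] content) {
        String mimeType = guessMimeType(content);
        return mimeType == null ? null : EXTENSIONS.get(mimeType);
    }

    public static void main(String[] args) {
        byte[] gifHeader = {'G', 'I', 'F', '8', '9', 'a'};
        System.out.println(guessMimeType(gifHeader));   // prints image/gif
        System.out.println(guessExtension(gifHeader));  // prints gif
    }
}
```

The JDK method only knows a limited set of formats, so callers should be prepared for a null result.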

Hope this helps...


If this is for storing a file that is uploaded:

  • create a column for the filename extension
  • create a column for the mime type as sent by the browser
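As a rough sketch of how the value for the extension column could be derived at upload time, the helper below takes the filename as sent by the browser and returns its lower-cased extension. The class and method names are hypothetical, and the rules (strip any client-side path, treat a leading dot as "no extension") are my assumptions rather than anything the question specifies:

```java
import java.util.Locale;

class UploadNames {
    // Returns the lower-cased extension of an uploaded filename, or "" if there is none.
    static String extensionOf(String filename) {
        // Some browsers send a full client-side path; keep only the last path segment.
        int slash = Math.max(filename.lastIndexOf('/'), filename.lastIndexOf('\\'));
        String name = filename.substring(slash + 1);

        int dot = name.lastIndexOf('.');
        // No dot, a leading dot (e.g. ".bashrc") or a trailing dot: no usable extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("report.pdf"));     // prints pdf
        System.out.println(extensionOf("archive.tar.gz")); // prints gz
    }
}
```

Both the extension and the browser-supplied mime type are only what the client claims, so they are worth storing but not worth trusting blindly.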

If you don't have the original file, and you only have bytes, you have a couple of good solutions.

If you're able to use a library, look at using mime-util to inspect the bytes:

http://technopaper.blogspot.com/2009/03/identifying-mime-using-mime-util.html

If you have to build your own byte detector, here are the starting ("magic") bytes of many common formats, written as a Ruby-style mapping from leading bytes to type:

"BC" => bitcode,
"BM" => bitmap,
"BZ" => bzip,
"MZ" => exe,
"SIMPLE" => fits,
"GIF8" => gif,
"GKSM" => gks,
[0x01,0xDA].pack('c*') => iris_rgb,
[0xF1,0x00,0x40,0xBB].pack('c*') => itc,
[0xFF,0xD8].pack('c*') => jpeg,
"IIN1" => niff,
"MThd" => midi,
"%PDF" => pdf,
"VIEW" => pm,
[0x89].pack('c*') + "PNG" => png,
"%!" => postscript,
"Y" + [0xA6].pack('c*') + "j" + [0x95].pack('c*') => sun_rasterfile,
"MM*" + [0x00].pack('c*') => tiff,
"II*" + [0x00].pack('c*') => tiff,
"gimp xcf" => gimp_xcf,
"#FIG" => xfig,
"/* XPM */" => xpm,
[0x23,0x21].pack('c*') => shebang,
[0x1F,0x9D].pack('c*') => compress,
[0x1F,0x8B].pack('c*') => gzip,
"PK" + [0x03,0x04].pack('c*') => pkzip,
"MZ" => dos_os2_windows_executable,
[0x7F].pack('c*') + "ELF" => unix_elf,
[0x99,0x00].pack('c*') => pgp_public_ring,
[0x95,0x01].pack('c*') => pgp_security_ring,
[0x95,0x00].pack('c*') => pgp_security_ring,
[0xA6,0x00].pack('c*') => pgp_encrypted_data,
[0xD0,0xCF,0x11,0xE0].pack('c*') => docfile
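Since the question is about Java, here is a minimal sketch of such a detector using a handful of the signatures above. The class name, the register helper and the choice of which signatures to include are my own assumptions; the returned type names are simply the ones from the list:

```java
import java.util.ArrayList;
import java.util.List;

class MagicByteDetector {
    // One known prefix and the type name it identifies.
    private static final class Signature {
        final byte[] prefix;
        final String type;

        Signature(byte[] prefix, String type) {
            this.prefix = prefix;
            this.type = type;
        }
    }

    private static final List<Signature> SIGNATURES = new ArrayList<>();

    // Accepts ints so that values above 0x7F and char literals can be mixed freely.
    private static void register(String type, int... prefix) {
        byte[] bytes = new byte[prefix.length];
        for (int i = 0; i < prefix.length; i++) {
            bytes[i] = (byte) prefix[i];
        }
        SIGNATURES.add(new Signature(bytes, type));
    }

    static {
        // A subset of the list above; signatures are checked in registration order.
        register("png", 0x89, 'P', 'N', 'G');
        register("gif", 'G', 'I', 'F', '8');
        register("jpeg", 0xFF, 0xD8);
        register("pdf", '%', 'P', 'D', 'F');
        register("pkzip", 'P', 'K', 0x03, 0x04);
        register("gzip", 0x1F, 0x8B);
        register("tiff", 'M', 'M', '*', 0x00);
        register("tiff", 'I', 'I', '*', 0x00);
        register("bitmap", 'B', 'M');
    }

    // Returns the type of the first signature that content starts with, or null if none matches.
    static String detect(byte[] content) {
        for (Signature s : SIGNATURES) {
            if (startsWith(content, s.prefix)) {
                return s.type;
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] content, byte[] prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (content[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Short prefixes such as "BM" or "MZ" can match unrelated data by accident, so it is probably wise to register longer signatures first and to treat any result as a guess.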


Maybe I need to save additional column in my DB for file extension.

That is a better solution than attempting to deduce a mime type from the database content, for (at least) the following reasons:

  • If you have a mime type from the document source, you can store and use that.
  • You could (potentially) ask the user to specify a mime type when they lodge the document.
  • If you have to use some heuristic-based scheme for figuring out a mime type:
    • you can do the work once before creating the table row, rather than N times after extracting it, and
    • you can report cases where the heuristic gives no good answer, and maybe ask the user to say what the file type really is.

(I'm making some assumptions that may not be warranted, but the question doesn't give any clues on how the larger system is intended to work.)