How to detect if a file is PDF or TIFF? (asp.net)


OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

    // Members of a class; requires `using System.IO;`.
    private const int kTiffTagLength = 12;      // not used by IsTiff; left over from the more general code
    private const int kHeaderSize = 2;
    private const int kMinimumTiffSize = 8;
    private const byte kIntelMark = 0x49;       // 'I': little-endian
    private const byte kMotorolaMark = 0x4d;    // 'M': big-endian
    private const ushort kTiffMagicNumber = 42;

    private bool IsTiff(Stream stm)
    {
        stm.Seek(0, SeekOrigin.Begin);
        if (stm.Length < kMinimumTiffSize)
            return false;

        // Bytes 0-1 must be "II" or "MM".
        byte[] header = new byte[kHeaderSize];
        stm.Read(header, 0, header.Length);
        if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
            return false;

        // Bytes 2-3 hold the magic number in the byte order just detected.
        bool isIntel = header[0] == kIntelMark;
        ushort magicNumber = ReadShort(stm, isIntel);
        if (magicNumber != kTiffMagicNumber)
            return false;
        return true;
    }

    private ushort ReadShort(Stream stm, bool isIntel)
    {
        byte[] b = new byte[2];
        stm.Read(b, 0, b.Length);
        return ToShort(isIntel, b[0], b[1]);
    }

    private static ushort ToShort(bool isIntel, byte b0, byte b1)
    {
        if (isIntel)
        {
            return (ushort)(((int)b1 << 8) | (int)b0);
        }
        else
        {
            return (ushort)(((int)b0 << 8) | (int)b1);
        }
    }

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

    public bool IsPdf(Stream stm)
    {
        stm.Seek(0, SeekOrigin.Begin);
        PdfToken token;
        while ((token = GetToken(stm)) != null)
        {
            if (token.TokenType == MLPdfTokenType.Comment)
            {
                if (token.Text.StartsWith("%PDF-1."))
                    return true;
            }
            if (stm.Position > 1024)
                break;
        }
        return false;
    }

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of searching for a substring to avoid a problem like this:

    % the following is a PostScript file, NOT a PDF file
    % you'll note that in our previous version, it started with %PDF-1.3,
    % incorrectly marking it as a PDF
    %
    clippath stroke showpage

This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.
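If pulling in a full tokenizer is not an option, a rough middle ground is to look for a line that begins with %PDF- within the first 1024 bytes. The sketch below is a simplification and not the tokenizer-based code above: the names PdfSniffer and LooksLikePdf are made up for illustration, only CR and LF are treated as line breaks, and a header that follows junk on the same line will be missed.

```csharp
using System;
using System.IO;

public static class PdfSniffer
{
    // Assumption: scan the same 1024-byte window the Acrobat note mentions.
    private const int ScanLimit = 1024;

    // ASCII bytes for "%PDF-".
    private static readonly byte[] Marker = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    public static bool LooksLikePdf(Stream stm)
    {
        stm.Seek(0, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop.
        byte[] buf = new byte[ScanLimit];
        int total = 0;
        while (total < buf.Length)
        {
            int n = stm.Read(buf, total, buf.Length - total);
            if (n == 0)
                break;
            total += n;
        }

        // Accept the marker only at the start of the data or right after CR/LF.
        for (int i = 0; i + Marker.Length <= total; i++)
        {
            bool atLineStart = i == 0 || buf[i - 1] == (byte)'\n' || buf[i - 1] == (byte)'\r';
            if (atLineStart && MatchesAt(buf, i))
                return true;
        }
        return false;
    }

    private static bool MatchesAt(byte[] buf, int offset)
    {
        for (int j = 0; j < Marker.Length; j++)
        {
            if (buf[offset + j] != Marker[j])
                return false;
        }
        return true;
    }
}
```

Because the marker has to sit at the start of a line, the PostScript sample above is rejected: its %PDF-1.3 appears in the middle of a comment line.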


TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/

The first 8 bytes form the header. The first two of these bytes are either "II" for little-endian byte ordering or "MM" for big-endian byte ordering.

About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf

The header contains just one line that identifies the version of PDF. Example: %PDF-1.6


Reading the specification for each file format will tell you how to identify files of that format.

TIFF files - Check bytes 0-1 for 0x4D4D or 0x4949 and bytes 2-3 for the value 42, read in the byte order that bytes 0-1 indicate.

Page 13 of the spec reads:

A TIFF file begins with an 8-byte image file header, containing the following information:

Bytes 0-1: The byte order used within the file. Legal values are:
    “II” (4949.H)
    “MM” (4D4D.H)
In the “II” format, byte order is always from the least significant byte to the most significant byte, for both 16-bit and 32-bit integers This is called little-endian byte order. In the “MM” format, byte order is always from most significant to least significant, for both 16-bit and 32-bit integers. This is called big-endian byte order.

Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. The byte order depends on the value of Bytes 0-1.
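The quoted header layout translates fairly directly into a check on a byte array. This is only a sketch of that reading of the spec: TiffHeader and IsTiffHeader are hypothetical names, and the caller is assumed to have already read at least the first 8 bytes of the file into data.

```csharp
using System;

public static class TiffHeader
{
    // Returns true when bytes 0-3 form a TIFF signature: "II" or "MM"
    // followed by the number 42 stored in the matching byte order.
    public static bool IsTiffHeader(byte[] data)
    {
        // The header is 8 bytes long, so anything shorter cannot be a TIFF.
        if (data == null || data.Length < 8)
            return false;

        bool littleEndian;
        if (data[0] == 0x49 && data[1] == 0x49)        // "II"
            littleEndian = true;
        else if (data[0] == 0x4D && data[1] == 0x4D)   // "MM"
            littleEndian = false;
        else
            return false;

        // Assemble the 16-bit value at bytes 2-3 in the detected byte order.
        int magic = littleEndian
            ? data[2] | (data[3] << 8)
            : (data[2] << 8) | data[3];
        return magic == 42;
    }
}
```

Note that 42 is 0x2A, so a little-endian file starts with 49 49 2A 00 and a big-endian file with 4D 4D 00 2A.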

PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)

Section 7.5.2 (File Header) reads:

The first line of a PDF file shall be a header consisting of the 5 characters %PDF– followed by a version number of the form 1.N, where N is a digit between 0 and 7. A conforming reader shall accept files with any of the following headers: %PDF–1.0, %PDF–1.1, %PDF–1.2, %PDF–1.3, %PDF–1.4, %PDF–1.5, %PDF–1.6, %PDF–1.7 Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, "File Trailer"), if present, shall be used instead of the version specified in the Header.

If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.
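Taken together, the two quoted paragraphs suggest a header check that also extracts the version digit and notes whether the binary comment line is present. The following is a sketch under assumptions: PdfHeader and TryParse are hypothetical names, the header is expected at offset 0 (the strict reading, without Acrobat's 1024-byte leniency), and the dash is matched as the plain ASCII hyphen (0x2D) that real files contain.

```csharp
using System;

public static class PdfHeader
{
    // Tries to read "%PDF-1.N" at offset 0. On success, minorVersion is N and
    // hasBinaryMarker tells whether the next non-empty line is a comment
    // holding at least four bytes with values of 128 or greater.
    public static bool TryParse(byte[] data, out int minorVersion, out bool hasBinaryMarker)
    {
        minorVersion = -1;
        hasBinaryMarker = false;

        byte[] prefix = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E }; // "%PDF-1."
        if (data == null || data.Length < prefix.Length + 1)
            return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
                return false;
        }

        // N must be a digit between 0 and 7 according to the quoted section.
        int digit = data[prefix.Length] - (byte)'0';
        if (digit < 0 || digit > 7)
            return false;
        minorVersion = digit;

        // Skip to the end of the header line, then past the CR/LF bytes.
        int pos = prefix.Length + 1;
        while (pos < data.Length && data[pos] != (byte)'\r' && data[pos] != (byte)'\n')
            pos++;
        while (pos < data.Length && (data[pos] == (byte)'\r' || data[pos] == (byte)'\n'))
            pos++;

        // If the next line is a comment, count its high-bit bytes.
        if (pos < data.Length && data[pos] == (byte)'%')
        {
            int highBytes = 0;
            for (int i = pos + 1; i < data.Length && data[i] != (byte)'\r' && data[i] != (byte)'\n'; i++)
            {
                if (data[i] >= 128)
                    highBytes++;
            }
            hasBinaryMarker = highBytes >= 4;
        }
        return true;
    }
}
```

A missing binary marker does not make the file invalid, since the spec only requires it when the file contains binary data; the flag is informational.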

Of course, you could do a "deeper" check on each file by checking more file-specific items.