PDF doc text shows differently in IE / Firefox / Chrome
A very short and simplified introduction
Fonts in PDF are PDF objects: Font dictionaries, containing numerous parameters and sub-dictionaries necessary to select glyphs, show them, and translate character codes to a logical (Unicode) representation for content extraction. Fonts in layman's terms -- as we see them as *.ttf or *.pfb files -- are called font programs, either embedded or external, and are referred to by one of the sub-dictionaries of Font objects.
Fonts are divided into two groups:
- Simple fonts (Type1, Type3, or TrueType), in which glyphs are selected by single-byte character codes obtained from a string shown by the text-showing operators. The mapping from codes to glyphs is called the font's encoding; it can be built into the font program, defined by the Font object (by a predefined name or explicitly), or, under special circumstances, constructed according to defined rules by the viewer application. The file in question doesn't contain simple fonts, and we won't discuss them any further -- but note that this over-simplistic description doesn't even begin to reflect the real-life complexity.
- Composite fonts (Type0), used to show text in which character codes can have variable length (up to 4 bytes), and which, therefore, isn't restricted to 256 code points. A Type0 font always has one descendant, a font-like object called CIDFont, and, similar to the encoding of simple fonts, a CMap object that maps character codes to character selectors, which in PDF are always CIDs -- integers up to 65536.
Now, a character selector (CID) is not, in general, used directly to select glyphs from the font program. For a CIDFont of type CIDFontType2, its dictionary contains a CIDToGIDMap entry that, obviously, maps CIDs to glyph identifiers. Those GIDs are, at last, used to select glyphs from the embedded font program (which, for a CIDFontType2 font, is a TrueType font program -- not to be confused with a Font object of TrueType Subtype).
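The CID → GID step can be sketched in a few lines. For a non-Identity CIDToGIDMap, the entry is a stream of 2-byte big-endian glyph indices, indexed by CID. The stream below is made-up toy data, not taken from the file in question:

```python
# Sketch of a CIDToGIDMap lookup: the map is a byte stream where the GID
# for CID n occupies bytes 2n and 2n+1, big-endian. Toy data, not from
# a real file.

def gid_for_cid(cid_to_gid_stream: bytes, cid: int) -> int:
    """Return the glyph identifier for a CID, or 0 (.notdef) if out of range."""
    offset = 2 * cid
    if offset + 2 > len(cid_to_gid_stream):
        return 0
    return int.from_bytes(cid_to_gid_stream[offset:offset + 2], "big")

# Toy map: CID 0 -> GID 0, CID 1 -> GID 180, CID 2 -> GID 159
toy_map = bytes([0x00, 0x00, 0x00, 0xB4, 0x00, 0x9F])
print(gid_for_cid(toy_map, 2))  # 159
```

This indexing is also why, later in this answer, "bytes 08 and 09 in CIDToGIDMap" correspond to CID 4: 2 × 4 = 8.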
A Font object can have a ToUnicode resource that maps CIDs to Unicode values for indexing, searching, and extraction. It's called a ToUnicode CMap (as it follows similar syntax), but it should not be confused with the CMap object mentioned above.
In what I call a simple case (and, I think, a sensible decision), the CMap is the predefined Identity-H name, the CIDToGIDMap is the predefined Identity name, and, therefore, character codes extracted from a string (the argument to a text-showing operator) are always 2-byte numbers that, effectively, directly select glyphs from the embedded TrueType program. In my experience, it's the most common scenario, and, as it appears, that's the case against which common software is tested.
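In that simple case, decoding a string collapses to reading 2-byte big-endian values; a minimal sketch:

```python
# Identity-H CMap + Identity CIDToGIDMap: every code is 2 bytes, and
# code == CID == GID, so extracting the codes from a string is just this:

def identity_codes(s: bytes) -> list[int]:
    return [int.from_bytes(s[i:i + 2], "big") for i in range(0, len(s), 2)]

print(identity_codes(bytes.fromhex("000a00200025")))  # [10, 32, 37]
```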
But it's not the case with the file in question.
(The end of a short and simplified introduction)
In our file, the text-showing operator, effectively, gets this string:
0x000a 0x000a 0x000a 0x20 0x0020 0x0020 0x0020 0x20 0x0025 0x0025 0x0025
Of course, there are no 'groups' in the actual string; I added them based on the CMap, which contains 2 codespace ranges:

<20> <20>
<0000> <19FF>
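The splitting into variable-length codes can be sketched as follows. This is a hypothetical helper, not a real CMap parser, and the real algorithm in the PDF specification matches byte by byte against range boundaries; the greedy per-length check below is equivalent here because the first bytes of the two ranges don't overlap:

```python
# Split a string into character codes using this file's two codespace
# ranges: <20> <20> (one byte) and <0000> <19FF> (two bytes).
RANGES = [(1, 0x20, 0x20), (2, 0x0000, 0x19FF)]  # (byte length, low, high)

def split_codes(s: bytes) -> list[tuple[int, int]]:
    """Return (byte length, code value) pairs. The length matters:
    the 1-byte code <20> and the 2-byte code <0020> both have value 32."""
    codes, i = [], 0
    while i < len(s):
        for length, low, high in RANGES:
            chunk = s[i:i + length]
            if len(chunk) == length and low <= int.from_bytes(chunk, "big") <= high:
                codes.append((length, int.from_bytes(chunk, "big")))
                i += length
                break
        else:
            raise ValueError(f"no codespace range matches at offset {i}")
    return codes

# The string from the content stream; splitting it reproduces the
# grouping shown above (three 2-byte codes, one 1-byte code, and so on).
string = bytes.fromhex("000a000a000a2000200020002020002500250025")
print(split_codes(string))
```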
To make a long story short: if we look up the character codes in the CMap to get CIDs, then look up the CIDs in the CIDToGIDMap to get GIDs, then look up the GIDs in the embedded David-Bold font to get Unicode values, here's the resulting table:
Code    CID  GID  Unicode  Name
0x000a   10  180  05EA     tav
0x0020   32  159  05D5     vav
0x0025   37  154  05D0     alef
0x20    228    3  0020     space
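The whole chain, with the table above transcribed into dictionaries (the mappings come from the table, not read out of the file), can be sketched as:

```python
# Code -> CID -> GID -> Unicode, using the values from the table above.
# Codes are (byte length, value) pairs, because <20> and <0020> are
# distinct codes even though both have numeric value 32.
cmap = {(2, 0x000A): 10, (2, 0x0020): 32, (2, 0x0025): 37, (1, 0x20): 228}
cid_to_gid = {10: 180, 32: 159, 37: 154, 228: 3}
gid_to_unicode = {180: "\u05EA", 159: "\u05D5", 154: "\u05D0", 3: " "}

def char_for_code(code: tuple[int, int]) -> str:
    return gid_to_unicode[cid_to_gid[cmap[code]]]

print(char_for_code((2, 0x0020)))  # vav (U+05D5)
print(char_for_code((1, 0x20)))    # a plain space
```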
Now we have enough information to speculate about what confuses viewer applications.
In my first attempt, I suggested it was the 32 code (and CID) used for a non-space character (see comment above). This assumption was based on a case, several years ago, when an older version of Acrobat didn't show a character with code 0x20 when it was at the end of a string -- assuming it to be a space, when in fact, according to the encoding vector (of a simple font), it was another character.
I changed:

- 0x0020 to 0x0004 in the content stream;
- bytes 08 and 09 in the CIDToGIDMap to GID=159;
- the value in the Widths array for CID=4 to the 'vav' width;
- the ToUnicode CMap was adjusted accordingly;
- (+ later I tried to remove the <0020> 32 string from the CMap -- not reflected in the file linked in the comment)
Well, it did help, but unfortunately, some viewers still refused to comply with the specification.
Then I thought that maybe the variable character code width was the issue.
I returned to the original file and changed:

- 0x20 to 0x00e4 in the content stream;
- <20> 228 to <00e4> 228 in the CMap;
- deleted the codespacerange <20> <20> in the CMap;
- deleted the codespacerange <20> <20> in the ToUnicode CMap.
This file appears to open perfectly in all the viewers mentioned in the original question and the comments below. Miraculously, the 0x0020 code and CID 32 no longer interfere.
The conclusion, I think, can be this: given the current state of affairs, PDF creators are NOT advised to mix single-byte and double-byte codes in a font's encoding (CMap).