
Xml Escaping/Encoding terminology


Encoding describes how a file's characters are physically written as bytes (for example UTF-8, UTF-16, or a legacy "ANSI" code page).
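As a rough illustration (a Python sketch; the sample string is made up for this example), the same three characters turn into different bytes depending on the encoding chosen:

```python
# The same text, written out as bytes under three different encodings.
text = "<é>"

utf8_bytes = text.encode("utf-8")       # é takes two bytes
utf16_bytes = text.encode("utf-16-le")  # every character here takes two bytes
latin1_bytes = text.encode("latin-1")   # every character takes one byte

for label, data in [("utf-8", utf8_bytes), ("utf-16-le", utf16_bytes), ("latin-1", latin1_bytes)]:
    print(label, data.hex(" "))
```

Note that none of these encodings changes what the text means; only the byte representation differs.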

Escaping refers to the process of replacing special characters (such as < and >) with their XML entity equivalents (such as &lt; and &gt;). For URLs, escaping refers to replacing characters with sequences starting with %, such as %20 for a single space.
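A small sketch of both kinds of escaping, assuming Python's standard library as a stand-in for whatever your platform provides (the input strings are invented for the example):

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# XML escaping: &, < and > are replaced by entity references.
xml_text = escape("a < b && c > d")

# URL escaping: the space is replaced by a percent sequence.
url_text = quote("hello world")

print(xml_text)
print(url_text)
```

In both cases the characters stay the same conceptually; they are just written in a form the parser will not mistake for syntax.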

Escaping rules differ by language, but encodings are usually widely accepted standards. Sometimes the terms are used ambiguously (particularly with encoding used to mean escaping), but they are well defined and distinct.


In every web application, data passes through various layers such as the view layer, model layer, database layer, etc. Each layer is "supposed" to be developed independently to satisfy various scalability and maintainability requirements.

Every layer needs to "talk" to the others, so they have to agree on a common representation for the text they exchange. This is called encoding. Various encodings exist, such as ASCII, UTF-8, and UTF-16. If the user writes in Chinese or Japanese, for instance, ASCII won't work, so the application would use UTF-8, UTF-16, or another encoding that can represent those characters. Chinese characters then pass from the web layer through the business layer to the data layer, and the same encoding scheme has to be used (or converted correctly) at every step.

Why?

Suppose your web layer sends data in UTF-16 to support Chinese, but the database layer accepts only ASCII. The database layer cannot make sense of what it receives: it understands only basic English characters and cannot represent the rest. That covers encoding.
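To make the mismatch concrete, here is a minimal Python sketch (the sample text and variable names are made up; a real database driver would report the problem in its own way):

```python
text = "你好"

# An ASCII-only layer cannot represent these characters at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# When both sides agree on the encoding, the text survives the round trip.
stored = text.encode("utf-8")
round_trip = stored.decode("utf-8")

# When the reader assumes a different encoding, the bytes are misread.
misread = stored.decode("latin-1")

print(ascii_ok, round_trip == text, misread == text)
```

The bytes themselves are never "wrong"; the problem only appears when the two layers disagree about how to interpret them.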

Escaping:

There is a certain set of characters, called "metadata" here, that have a special meaning from the browser's perspective. For example, < and > are metadata from the browser's perspective: the browser's parser treats whatever appears between them as markup to be interpreted rather than as plain text. Attackers exploit this to confuse the browser. For example:

<input type="text" value="${name}" />

If I replace name with

"/><script>alert(document.cookie)</script>

then the resulting markup, as the browser sees it, will be

<input type="text" value=""/><script>alert(document.cookie)</script>" />

This means you need to tell the browser that whatever is substituted for name should be "escaped", that is, treated as data only. Various functions escape (or "encode") characters such as < and > as their HTML entity equivalents &lt; and &gt; (the URL percent-encoded forms would be %3C and %3E), so the browser knows not to interpret them as markup. Roughly speaking, escaping means letting these characters escape their special meaning.

<input type="text" value="${fn:escapeXml(name)}" />

using JSTL.
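The same idea can be sketched outside JSP. This example assumes Python's html.escape as a stand-in for an escaping function such as the JSTL one; the template string and variable names are hypothetical:

```python
import html

# A template with a properly quoted value attribute.
template = '<input type="text" value="{}" />'
name = '"/><script>alert(document.cookie)</script>'

# Without escaping, the payload closes the attribute and injects a script tag.
unsafe = template.format(name)

# With escaping, the quote and angle brackets become entity references.
safe = template.format(html.escape(name, quote=True))

print(unsafe)
print(safe)
```

In the escaped version the browser sees one long attribute value and no script element.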


TL;DR Both terms are interchangeable (if what you mean is to convert some characters so they will be interpreted as plain string data). This debate is old. From CWE-116: Improper Encoding or Escaping of Output:

The usage of the "encoding" and "escaping" terms varies widely. For example, in some programming languages, the terms are used interchangeably, while other languages provide APIs that use both terms for different tasks. This overlapping usage extends to the Web, such as the "escape" JavaScript function whose purpose is stated to be encoding. Of course, the concepts of encoding and escaping predate the Web by decades. Given such a context, it is difficult for CWE to adopt a consistent vocabulary that will not be misinterpreted by some constituency.

Comically enough, JavaScript also has encodeURIComponent(), and its specification avoids the debate entirely:

The encodeURIComponent function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.
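To see the "one, two, three, or four escape sequences" in practice, here is a rough Python approximation. The helper encode_uri_component is hypothetical and only mimics the JavaScript function for well-formed strings, using urllib.parse.quote:

```python
from urllib.parse import quote

# Characters encodeURIComponent leaves unescaped besides ASCII letters and digits.
# quote() already leaves "_", ".", "-" and "~" alone, so only these are added.
SAFE = "!*'()"

def encode_uri_component(s: str) -> str:
    """Percent-escape each UTF-8 byte of every character outside the safe set."""
    return quote(s, safe=SAFE, encoding="utf-8")

# One character each needing 1, 2, 3 and 4 UTF-8 bytes.
samples = {ch: encode_uri_component(ch) for ch in ["<", "é", "中", "😀"]}

for ch, escaped in samples.items():
    print(ch, escaped)
```

Each percent sequence is an "escape", and the bytes behind it are the UTF-8 "encoding", which is why the specification ends up using both words in one sentence.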

Personally I believe it's more appropriate to refer to the general process as "encoding", as you're creating a code to be transmitted through a communications channel (a piece of markup/programming code) and interpreted by a receiver (the parser). I think it's silly to replace < with something completely different like &#60; and call that "escaping".