Monday, January 18, 2010

Character set vs. character encoding

(copied from http://www.grauw.nl/blog/entry/254)
Recently I was asked to explain the difference between character encoding and character set, and I thought it would be interesting to write about this over here as well.

In these two terms, ‘set’ refers to the set of characters and their numbers (code points), and ‘encoding’ refers to the representation of these code points. For example, Unicode is a character set, and UTF-8 and UTF-16 are different character encodings of Unicode.

To illustrate the difference: in the Unicode character set, the € character has code point 8364 (usually written as U+20AC, in hexadecimal notation). Using the UTF-16LE character encoding, this is stored as the bytes AC 20, while UTF-16BE stores it as 20 AC, and the UTF-8 representation is E2 82 AC.
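
To make this concrete, here is a minimal sketch in Python (my choice of illustration language; the original post contains no code) printing the code point and each of these byte representations:

    euro = '\u20ac'  # the € character

    # The code point is a property of the character set (Unicode):
    print(hex(ord(euro)))  # 0x20ac (8364), written U+20AC

    # The byte representation depends on the character encoding:
    print(euro.encode('utf-16-le').hex(' '))  # ac 20
    print(euro.encode('utf-16-be').hex(' '))  # 20 ac
    print(euro.encode('utf-8').hex(' '))      # e2 82 ac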

In practice, however, the two terms are used interchangeably. The distinction described above does not apply to most non-Unicode character sets (such as Latin-1 and SJIS), because their code points are identical to their encoded byte values; historically, then, there was never a real distinction to make.
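
For example, in Latin-1 the é character has code point 233 (0xE9), and its encoded form is that same single byte E9. A quick check in Python (again my choice, purely illustrative):

    # In Latin-1, the code point and the encoded byte are the same number.
    # (Latin-1's code points also coincide with the first 256 of Unicode.)
    print(ord('é'))                     # 233, i.e. 0xE9
    print('é'.encode('latin-1').hex())  # e9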

The most important difference in English is that the term character set is a little old-fashioned; character encoding is the more common term nowadays. The likely reason is that, with UTF-8 and UTF-16 being different possible encodings of the same character set, character encoding is simply the more precise term.

Some examples:

* The HTTP protocol uses
    Content-Type: text/html; charset=UTF-8
* The more recent XML uses
    <?xml version="1.0" encoding="UTF-8"?>

This illustrates how the two terms are used synonymously: both declarations describe the character encoding of the content that follows.
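
As a rough sketch of what a consumer of either declaration does (Python again, not from the original post): it takes the declared name and uses it to decode the raw bytes of the document into characters.

    # Either declaration tells the receiver how to turn the raw bytes back
    # into text; the name 'UTF-8' comes from the charset/encoding attribute
    # in the examples above.
    body = '<p>price: €5</p>'.encode('utf-8')  # bytes as transmitted
    text = body.decode('utf-8')                # decode per charset=UTF-8
    print(text)                                # <p>price: €5</p>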
