

What is Unicode, UTF-8, UTF-16?


Question

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.

In VSS when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?

Please explain in simple terms.

2010/02/11


  • Unicode
    • a set of characters used around the world
  • UTF-8
    • a character encoding capable of encoding all possible characters (called code points) in Unicode
    • code unit is 8 bits
    • uses one to four code units to encode a Unicode code point
    • 00100100 for "$" (one 8-bit unit); 11000010 10100010 for "¢" (two 8-bit units); 11100010 10000010 10101100 for "€" (three 8-bit units)
  • UTF-16
    • another character encoding
    • code unit is 16 bits
    • uses one or two code units to encode a Unicode code point
    • 00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units; see the sketch after this list)
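
As a quick way to verify these byte sequences, here is a minimal Python sketch (Python is just my choice for illustration, not something the answer uses) that prints the UTF-8 and UTF-16 code units for the sample characters:

    # Print the UTF-8 and UTF-16 (big-endian, no BOM) bytes for each sample character.
    for ch in ["$", "¢", "€", "𤭢"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
        print(f"{ch!r}  U+{ord(ch):04X}")
        print("  UTF-8 :", " ".join(f"{b:08b}" for b in utf8))
        print("  UTF-16:", " ".join(f"{b:08b}" for b in utf16))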
2015/01/06

Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]

Since a credible resource is always needed but the official report is massive, I suggest reading the following:

  1. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
  2. To the BMP and beyond! A tutorial by Eric Muller, then Technical Director, later Vice President, at The Unicode Consortium. (The first 20 slides and you are done.)

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but it covers only the Latin script (7 bits per character can represent 128 different characters). Unicode is a standard whose goal is to cover all possible characters in the world (it can hold up to 1,114,112 characters, meaning at most 21 bits per character; Unicode 8.0, the current version at the time of writing, specifies 120,737 characters in total).

The main difference is that an ASCII character can fit in a byte (8 bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

  1. Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called its code point.
  2. An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses one to four 8-bit units, and UTF-16 uses one or two 16-bit units, to cover the entire Unicode range of 21 bits max. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So although UTF-8 uses 1 byte for the Latin script, it needs 3 bytes for later scripts inside the Basic Multilingual Plane, while UTF-16 uses 2 bytes for all of these. And that is their main difference.
  3. Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π
code point: U+03C0
encoding forms (code units):
      UTF-8: CF 80
      UTF-16: 03C0
encoding schemes (bytes):
      UTF-8: CF 80
      UTF-16BE: 03 C0
      UTF-16LE: C0 03

Tip: a hex digit represents 4 bits, so a two-digit hex number represents a byte.
Also take a look at the Plane maps on Wikipedia to get a feeling for the character set layout.
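
To reproduce the π example in code, here is a small Python sketch (my own addition, assuming a Python 3 interpreter) that prints the code point, the encoding forms and the serialized byte sequences:

    def hex_bytes(b):
        return " ".join(f"{x:02X}" for x in b)

    ch = "π"                                  # U+03C0 GREEK SMALL LETTER PI
    print("code point       :", f"U+{ord(ch):04X}")

    utf16be = ch.encode("utf-16-be")
    # Encoding forms: code units
    print("UTF-8 code units :", hex_bytes(ch.encode("utf-8")))   # CF 80
    print("UTF-16 code units:", " ".join(f"{int.from_bytes(utf16be[i:i+2], 'big'):04X}"
                                         for i in range(0, len(utf16be), 2)))  # 03C0
    # Encoding schemes: serialized bytes (note the byte order)
    print("UTF-16BE bytes   :", hex_bytes(utf16be))                    # 03 C0
    print("UTF-16LE bytes   :", hex_bytes(ch.encode("utf-16-le")))     # C0 03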

2015/10/27

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
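
As a rough illustration of the surrogate-pair arithmetic (this Python sketch and the example character 😀 are my own additions, not part of this answer): a supplementary-plane code point is offset by 0x10000 and split into two 10-bit halves.

    cp = ord("😀")                        # U+1F600, outside the Basic Multilingual Plane
    offset = cp - 0x10000                 # 20-bit value to split across two code units
    high = 0xD800 + (offset >> 10)        # high (lead) surrogate
    low  = 0xDC00 + (offset & 0x3FF)      # low (trail) surrogate
    print(f"{high:04X} {low:04X}")        # D83D DE00
    print(" ".join(f"{b:02X}" for b in "😀".encode("utf-16-be")))  # D8 3D DE 00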

2010/07/05

This article explains all the details: http://kunststube.net/encoding/

WRITING TO BUFFER

If you write the symbol あ into a 4-byte buffer with the UTF-8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ into a 4-byte buffer with the UTF-16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language your content uses, this will affect your memory usage accordingly.

e.g. For this particular symbol, UTF-16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But that doesn't mean you must use UTF-16 for the Japanese alphabet.

READING FROM BUFFER

Now if you want to read the above bytes, you have to know what encoding they were written in and decode them back correctly.

e.g. If you decode these bytes: 00000000 11100011 10000001 10000010 as UTF-16, you will end up with the code units 00E3 and 8182 ("ã" followed by a CJK ideograph), not あ.

Note: Encoding and Unicode are two different things. Unicode is the big table with each symbol mapped to a unique code point, e.g. the symbol (letter) あ has the code point 30 42 (hex). An encoding, on the other hand, is an algorithm that converts symbols into a form more suitable for storing them in hardware.

30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the UTF-8 result above in binary.

30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the UTF-16 result above in binary.
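
A quick Python sketch of that mismatch (my own illustration, not from the original answer):

    # あ (U+3042) written into a 4-byte buffer as UTF-8 leaves one leading zero byte.
    buf = bytes([0x00, 0xE3, 0x81, 0x82])

    print(buf.decode("utf-8"))       # a NUL byte followed by あ (the intended character)
    print(buf.decode("utf-16-be"))   # U+00E3 and U+8182 (two wrong characters)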


2018/05/11

Unicode is a standard which maps the characters of all languages to particular numeric values called code points. The reason it does this is that it allows different encodings to be possible using the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using some well-defined formula to produce the encoded string.

Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements, and depending upon the characters you will be dealing with, you should choose the encoding which uses the fewest bytes to encode those characters.
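
For example, a rough Python sketch (my own, with made-up sample strings) comparing encoded sizes for different scripts:

    samples = {
        "English":  "hello world",
        "Greek":    "γειά σου κόσμε",
        "Japanese": "こんにちは世界",
    }
    for name, text in samples.items():
        utf8_len = len(text.encode("utf-8"))
        utf16_len = len(text.encode("utf-16-le"))   # without BOM
        print(f"{name:9s} UTF-8: {utf8_len:2d} bytes  UTF-16: {utf16_len:2d} bytes")

Here UTF-8 wins for Latin text, the two are close for Greek, and UTF-16 is smaller for the Japanese sample.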

For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article:

What every programmer should know about Unicode

2019/05/13

Source: https://stackoverflow.com/questions/2241348
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow