

How to represent characters in binary

We have already seen how to store positive and negative integers, as well as fractional numbers. But our programs will not contain only numbers: in general, they will also contain text.

Representing text has been one of the headaches of computing since its origins. We already saw this when talking about Bytes and the Char type: text encoding was a recurring source of problems.

Fortunately, this is largely a solved problem today, and you will rarely have to worry about it. Still, it is worth knowing how characters are encoded and how to work with them, because sooner or later you will need it.

Characters, or “letters”, are the set of symbols we use to communicate. It is a fairly large set, which includes uppercase letters, lowercase letters, punctuation marks, and the digits of the decimal system themselves.

With numbers we could play with the representation itself; in the end, it was just a change of base. With characters there is no option but to use a translation table: THIS binary number corresponds to THIS letter.

This problem came up as soon as the first computers were being developed. The question was: how big does that table need to be? How many binary digits do we need?

And that is how the ASCII table emerged.

ASCII Representation

ASCII (American Standard Code for Information Interchange) is a standard encoding, dating back to 1963, that assigns a unique integer to each character in the basic English character set.

Each ASCII character is represented by a 7-bit numeric value, which allows a total of 128 different characters (2⁷ = 128).

These numerical values can be represented in binary, allowing ASCII characters to be processed by computers efficiently.

For example, the character ‘A’ has an ASCII value of 65, which is represented in binary as 01000001.
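
To see this for yourself, most languages let you inspect the numeric code of a character directly. Here is a minimal Python sketch using the built-in ord(), chr() and format() functions:

```python
# Inspect the ASCII code of a character and its binary representation
letter = 'A'
code = ord(letter)             # numeric code of the character -> 65
binary = format(code, '08b')   # the same value, as an 8-bit binary string

print(code)     # 65
print(binary)   # 01000001
print(chr(65))  # 'A' (the reverse conversion: code -> character)
```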

The first 32 entries in the ASCII table (codes 0 to 31) are control characters, intended to be interpreted by the machine rather than displayed.

Dec Char
0 NUL (null)
1 SOH (start of heading)
2 STX (start of text)
3 ETX (end of text)
4 EOT (end of transmission)
5 ENQ (enquiry)
6 ACK (acknowledge)
7 BEL (bell)
8 BS (backspace)
9 TAB (horizontal tab)
10 LF (NL line feed, new line)
11 VT (vertical tab)
12 FF (NP form feed, new page)
13 CR (carriage return)
14 SO (shift out)
15 SI (shift in)
16 DLE (data link escape)
17 DC1 (device control 1)
18 DC2 (device control 2)
19 DC3 (device control 3)
20 DC4 (device control 4)
21 NAK (negative acknowledge)
22 SYN (synchronous idle)
23 ETB (end of trans. block)
24 CAN (cancel)
25 EM (end of medium)
26 SUB (substitute)
27 ESC (escape)
28 FS (file separator)
29 GS (group separator)
30 RS (record separator)
31 US (unit separator)
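
Several of these control codes are still in everyday use, typically written as escape sequences. A short Python sketch showing a few of them:

```python
# Some control characters correspond to familiar escape sequences
print(ord('\t'))      # 9  -> TAB (horizontal tab)
print(ord('\n'))      # 10 -> LF  (line feed, new line)
print(ord('\r'))      # 13 -> CR  (carriage return)
print(repr(chr(27)))  # '\x1b' -> ESC (escape), used in terminal control codes
```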

The remaining codes correspond to letters, digits and symbols, as shown in the following table:

Dec Char      Dec Char      Dec Char
32  SPACE     64  @         96  `
33  !         65  A         97  a
34  "         66  B         98  b
35  #         67  C         99  c
36  $         68  D         100 d
37  %         69  E         101 e
38  &         70  F         102 f
39  '         71  G         103 g
40  (         72  H         104 h
41  )         73  I         105 i
42  *         74  J         106 j
43  +         75  K         107 k
44  ,         76  L         108 l
45  -         77  M         109 m
46  .         78  N         110 n
47  /         79  O         111 o
48  0         80  P         112 p
49  1         81  Q         113 q
50  2         82  R         114 r
51  3         83  S         115 s
52  4         84  T         116 t
53  5         85  U         117 u
54  6         86  V         118 v
55  7         87  W         119 w
56  8         88  X         120 x
57  9         89  Y         121 y
58  :         90  Z         122 z
59  ;         91  [         123 {
60  <         92  \         124 |
61  =         93  ]         125 }
62  >         94  ^         126 ~
63  ?         95  _         127 DEL

Extended ASCII table

The ASCII table was very limited in terms of characters. Fortunately, by that point it was already standard in computing for a Byte to be 8 bits. Of these, ASCII used only 7 bits, so there were another 128 codes available to extend it.

The extended ASCII table is an extension of the ASCII standard that raises the number of characters to 256, using codes 128 to 255 for the new entries. These include additional characters such as accented letters, special symbols, and letters used in languages other than English, such as Spanish, French and German, among others.

The extended ASCII table is not a single official standard, but there are several variants that assign different characters to codes from 128 to 255. Some of the most common variants are ISO 8859-1 (also known as Latin-1), ISO 8859-15 (Latin-9), Windows-1252, among others.
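
This is why “strange characters” appear when a file is read with the wrong variant: any code above 127 means different things in different tables. A small Python sketch (the byte values are just illustrative examples):

```python
# The same byte decodes to different characters depending on the variant
print(bytes([0xE9]).decode('latin-1'))  # 'é' in ISO 8859-1
print(bytes([0xE9]).decode('cp1252'))   # 'é' in Windows-1252 too (they agree here)

# Codes 128-159 are where the variants differ the most
print(bytes([0x80]).decode('cp1252'))         # '€' in Windows-1252
print(repr(bytes([0x80]).decode('latin-1')))  # '\x80', a control character in Latin-1
```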

Unicode

As computing became more global, the ASCII character set was insufficient to represent all characters used in different languages and writing systems.

To address this limitation, Unicode was developed, an encoding standard that assigns a unique code to every character used in any language in the world.

Unicode uses 16 bits or more per character, which allows a much wider set of characters to be represented.

For compatibility, the first 128 Unicode characters are identical to the ASCII character set.

For example, the character ’✓’ has a Unicode value of U+2713, which is represented in binary as 10011100010011.
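
The same kind of check used with ASCII also works here. A short Python sketch:

```python
# A Unicode code point is just a bigger number behind each character
check = '✓'
code = ord(check)

print(code)               # 10003 in decimal
print(hex(code))          # 0x2713 -> written as U+2713
print(format(code, 'b'))  # 10011100010011 in binary

# Compatibility: the first 128 code points are exactly the ASCII table
print(ord('A'))           # 65, the same value as in ASCII
```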

  1. Unicode 1.0 (1991): The first official version of Unicode, which included 24,000 characters.

  2. Unicode 1.1 (1993): Added 10,000 additional characters.

  3. Unicode 2.0 (1996): A major revision that added the ability to support bidirectional scripts (such as Arabic and Hebrew), in addition to adding another 35,000 characters.

  4. Unicode 3.0 (1999): Incorporated a large number of additional characters to support languages like Chinese, Japanese and Korean, along with many other symbols and technical characters.

  5. Unicode 3.1 (2001): Introduced minor changes and error corrections.

  6. Unicode 3.2 (2002): Included improvements in the handling of bidirectional scripts and changes in the encoding.

  7. Unicode 4.0 (2003): Added over 96,000 additional characters, including many ideograms for Asian languages.

  8. Unicode 4.1 (2005): Introduced some technical improvements and new encoding standards.

  9. Unicode 5.0 (2006): Added about 6,000 additional characters, including many mathematical and technical symbols.

  10. Unicode 5.1 (2008): A minor version with some corrections and clarifications.

  11. Unicode 5.2 (2009): Added approximately 800 new characters, including characters for mathematics and minority languages.

  12. Unicode 6.0 (2010): Introduced support for emoji characters, in addition to adding many other new characters.

  13. Unicode 6.1 (2012): Added about 7,000 new characters, including many symbols for mathematics and music.

  14. Unicode 6.2 (2012): Introduced support for Burmese and Kaithi characters, in addition to others.

  15. Unicode 6.3 (2013): Added support for Tibetan characters and some other improvements.

  16. Unicode 7.0 (2014): Introduced about 2,834 new characters, including many for minority languages and symbols.

  17. Unicode 8.0 (2015): Added approximately 7,716 new characters, including support for languages such as Cherokee and Meitei Mayek.

  18. Unicode 9.0 (2016): Introduced about 7,500 additional characters, including support for the new emoji standard.

  19. Unicode 10.0 (2017): Added more than 8,500 new characters, including glyphs for languages of the Caucasus and emoji symbols.

  20. Unicode 11.0 (2018): Introduced about 7,864 new characters, including glyphs for the Sindhi alphabet and additional emoji.

  21. Unicode 12.0 (2019): Added several hundred new characters, including additional support for Egyptian hieroglyphs and many new symbols, bringing the total to more than 137,000 encoded characters.

  22. Unicode 12.1 (2019): A minor version with some corrections and improvements.

  23. Unicode 13.0 (2020): Added about 5,930 new characters, including new emoji and symbols.

  24. Unicode 14.0 (2021): Introduced approximately 5,280 new characters, including new emoji and characters from minority languages.

  25. Unicode 15.0 (2022): The most recent version, which added 17,189 new characters, including new emoji, glyphs for African languages and technical symbols.

Currently, the Unicode table has about 150,000 encoded characters. This means that 16 bits are not enough (they only reach 65,536). And this is where UTF comes into play.
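
Before getting to UTF, emoji are an easy way to see code points that no longer fit in 16 bits. A small Python sketch (the emoji chosen is just an example):

```python
# A code point above U+FFFF does not fit in 16 bits
emoji = '😀'              # U+1F600
code = ord(emoji)

print(hex(code))          # 0x1f600
print(code > 0xFFFF)      # True -> it does not fit in 16 bits
print(code.bit_length())  # 17 bits are needed for this code point
```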

UTF Encoding

Unicode and UTF (Unicode Transformation Format) are closely related, but are slightly different concepts:

  • Unicode: Is a character encoding standard that assigns a unique number to each character in almost all known writing systems in the world, including letters, numbers, symbols, and special characters. For example, the letter “A” has a unique number in Unicode, as well as any other character you can imagine.

  • UTF (Unicode Transformation Format): Is a way to encode Unicode code points as byte sequences. UTF defines how these Unicode code points are stored in a computer’s memory or transmitted over a network.

There are several variants of UTF, such as UTF-8, UTF-16, and UTF-32, which differ in how they represent Unicode characters as byte sequences.

Now, regarding the number of bytes that Unicode uses:

  • UTF-8: It is the most common and widely used. In UTF-8, each Unicode character is represented using 1, 2, 3, or 4 bytes. ASCII characters (the first 128 characters of Unicode) are represented with 1 byte in UTF-8, which means that it is compatible with ASCII. Additional Unicode characters use more bytes according to their range.

  • UTF-16: Each Unicode character is represented in UTF-16 using 2 or 4 bytes. Unicode characters that are in the “BMP” (Basic Multilingual Plane) range are represented with 2 bytes, while characters outside the BMP use 4 bytes.

  • UTF-32: It is the simplest format, as it assigns exactly 4 bytes to each Unicode character. This means that any Unicode character, regardless of its range, will be represented with 4 bytes in UTF-32.
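
The difference between the three variants is easy to see by encoding the same characters and counting the resulting bytes. A minimal Python sketch:

```python
# Bytes used by each UTF variant for the same characters
for ch in ['A', 'é', '✓', '😀']:
    utf8  = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-le')  # '-le' avoids counting the 2-byte BOM
    utf32 = ch.encode('utf-32-le')
    print(ch, len(utf8), len(utf16), len(utf32))

# Output:
# A 1 2 4   -> ASCII characters take a single byte in UTF-8
# é 2 2 4
# ✓ 3 2 4
# 😀 4 4 4  -> outside the BMP: 4 bytes even in UTF-16 (surrogate pair)
```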

In summary, the number of bytes that a Unicode character occupies depends on the UTF format being used:

  • UTF-8: 1 to 4 bytes per character.
  • UTF-16: 2 or 4 bytes per character.
  • UTF-32: Always 4 bytes per character.

Therefore, the answer to how many bytes a Unicode character occupies depends on the UTF format used to encode it.