|
Character Encodings
| ISO Encodings | |
| The following character encodings are supported by JDK 1.1.7a and JDK 1.2. All of the encoding names are case-sensitive. Except where noted, the JDK can convert from these encodings to Unicode and vice versa. This does not mean, however, that the JDK comes with fonts to display these character sets. | |
| Encoding Name | Character Encoding |
| ASCII | U.S. ASCII (ISO 646, ANSI X3.4) |
| ISO8859_1 | Latin 1 (Western Europe) |
| ISO8859_2 | Latin 2 (Eastern Europe) |
| ISO8859_3 | Latin 3 (Southern Europe) |
| ISO8859_4 | Latin 4 (Northern Europe) |
| ISO8859_5 | Cyrillic |
| ISO8859_6 | Arabic |
| ISO8859_7 | Greek |
| ISO8859_8 | Hebrew |
| ISO8859_9 | Latin 5 (Turkish) |
| ISO8859_15_FDIS | Updated Latin 1 with Euro |
| Big5 | Traditional Chinese |
| EUC_CN | Simplified Chinese |
| EUC_JP | Japanese |
| EUC_TW | Traditional Chinese |
| GBK | Simplified Chinese |
| ISO2022CN | Chinese |
| ISO2022CN_CNS | Traditional Chinese |
| ISO2022CN_GB | Simplified Chinese |
| ISO2022JP | Japanese |
| ISO2022KR | Korean |
| JIS0201 | Japanese |
| JIS0208 | Japanese |
| JIS0212 | Japanese |
| JISAutoDetect | JIS autodetect (bytes-to-chars only) |
| SJIS | Shift-JIS Japanese |
| Unicode | Marked big-endian Unicode |
| UnicodeBig | Marked big-endian Unicode |
| UnicodeBigUnmarked | Unmarked big-endian Unicode |
| UnicodeLittle | Marked little-endian Unicode |
| UnicodeLittleUnmarked | Unmarked little-endian Unicode |
| UTF8 | Unicode Transformation Format-8 |
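
As a rough illustration of how the names in this table are used, the sketch below passes them to `String(byte[], String)` and `String.getBytes(String)`. The class `EncodingNamesSketch` and its helper methods are hypothetical names introduced only for this example. Whether a particular name is accepted depends on the Java runtime in use, so unsupported names are reported rather than assumed to work.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, not part of the JDK.
final class EncodingNamesSketch {
    /** Converts bytes in the named encoding to a Unicode string. */
    static String decode(byte[] bytes, String encodingName) {
        try {
            return new String(bytes, encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encodingName, e);
        }
    }

    /** Converts a Unicode string to bytes in the named encoding. */
    static byte[] encode(String text, String encodingName) {
        try {
            return text.getBytes(encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encodingName, e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is the Latin 1 byte for e with an acute accent.
        String text = decode(new byte[] { (byte) 0xE9 }, "ISO8859_1");

        // Re-encode the same character in two other encodings from the table.
        byte[] utf8 = encode(text, "UTF8");
        byte[] bigEndian = encode(text, "UnicodeBigUnmarked");

        System.out.println("UTF8 bytes: " + utf8.length);
        System.out.println("UnicodeBigUnmarked bytes: " + bigEndian.length);
    }
}
```

The same encoding names can be passed to `InputStreamReader` and `OutputStreamWriter` when converting streams rather than byte arrays.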

| Unicode 2.0 Block Allocations | ||
| Unicode is a 16-bit international character encoding standard that supports the alphabets of many different languages in addition to a variety of mathematical and geometric shapes. Groups of characters from different alphabets and different origins are assigned contiguous blocks of the character set; this table lists the Unicode 2.0 block allocations. | ||
| Start Code | End Code | Block Name |
| \u0000 | \u007F | Basic Latin |
| \u0080 | \u00FF | Latin-1 Supplement |
| \u0100 | \u017F | Latin Extended-A |
| \u0180 | \u024F | Latin Extended-B |
| \u0250 | \u02AF | IPA Extensions |
| \u02B0 | \u02FF | Spacing Modifier Letters |
| \u0300 | \u036F | Combining Diacritical Marks |
| \u0370 | \u03FF | Greek |
| \u0400 | \u04FF | Cyrillic |
| \u0530 | \u058F | Armenian |
| \u0590 | \u05FF | Hebrew |
| \u0600 | \u06FF | Arabic |
| \u0900 | \u097F | Devanagari |
| \u0980 | \u09FF | Bengali |
| \u0A00 | \u0A7F | Gurmukhi |
| \u0A80 | \u0AFF | Gujarati |
| \u0B00 | \u0B7F | Oriya |
| \u0B80 | \u0BFF | Tamil |
| \u0C00 | \u0C7F | Telugu |
| \u0C80 | \u0CFF | Kannada |
| \u0D00 | \u0D7F | Malayalam |
| \u0E00 | \u0E7F | Thai |
| \u0E80 | \u0EFF | Lao |
| \u0F00 | \u0FBF | Tibetan |
| \u10A0 | \u10FF | Georgian |
| \u1100 | \u11FF | Hangul Jamo |
| \u1E00 | \u1EFF | Latin Extended Additional |
| \u1F00 | \u1FFF | Greek Extended |
| \u2000 | \u206F | General Punctuation |
| \u2070 | \u209F | Superscripts and Subscripts |
| \u20A0 | \u20CF | Currency Symbols |
| \u20D0 | \u20FF | Combining Marks for Symbols |
| \u2100 | \u214F | Letterlike Symbols |
| \u2150 | \u218F | Number Forms |
| \u2190 | \u21FF | Arrows |
| \u2200 | \u22FF | Mathematical Operators |
| \u2300 | \u23FF | Miscellaneous Technical |
| \u2400 | \u243F | Control Pictures |
| \u2440 | \u245F | Optical Character Recognition |
| \u2460 | \u24FF | Enclosed Alphanumerics |
| \u2500 | \u257F | Box Drawing |
| \u2580 | \u259F | Block Elements |
| \u25A0 | \u25FF | Geometric Shapes |
| \u2600 | \u26FF | Miscellaneous Symbols |
| \u2700 | \u27BF | Dingbats |
| \u3000 | \u303F | CJK Symbols and Punctuation |
| \u3040 | \u309F | Hiragana |
| \u30A0 | \u30FF | Katakana |
| \u3100 | \u312F | Bopomofo |
| \u3130 | \u318F | Hangul Compatibility Jamo |
| \u3190 | \u319F | Kanbun |
| \u3200 | \u32FF | Enclosed CJK Letters and Months |
| \u3300 | \u33FF | CJK Compatibility |
| \u4E00 | \u9FFF | CJK Unified Ideographs |
| \uAC00 | \uD7A3 | Hangul Syllables |
| \uD800 | \uDB7F | High Surrogates |
| \uDB80 | \uDBFF | High Private Use Surrogates |
| \uDC00 | \uDFFF | Low Surrogates |
| \uE000 | \uF8FF | Private Use |
| \uF900 | \uFAFF | CJK Compatibility Ideographs |
| \uFB00 | \uFB4F | Alphabetic Presentation Forms |
| \uFB50 | \uFDFF | Arabic Presentation Forms-A |
| \uFE20 | \uFE2F | Combining Half Marks |
| \uFE30 | \uFE4F | CJK Compatibility Forms |
| \uFE50 | \uFE6F | Small Form Variants |
| \uFE70 | \uFEFE | Arabic Presentation Forms-B |
| \uFEFF | \uFEFF | Specials |
| \uFF00 | \uFFEF | Halfwidth and Fullwidth Forms |
| \uFFF0 | \uFFFF | Specials |
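
To look up the block of a given character programmatically, one option is `Character.UnicodeBlock.of(char)`, available since JDK 1.2. The sketch below is only an illustration; the class name `UnicodeBlockSketch` and the sample characters are choices made for this example. Later Java releases follow later versions of the Unicode standard, so the boundaries they report may differ from the Unicode 2.0 allocations listed here.

```java
// Hypothetical helper class, not part of the JDK.
final class UnicodeBlockSketch {
    /** Returns the Unicode block containing c, or null if c is unassigned to any block. */
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // One sample character from each of five blocks in the table.
        char[] samples = { 'A', '\u00E9', '\u03B1', '\u3042', '\u4E2D' };
        for (char c : samples) {
            System.out.println(Integer.toHexString(c) + " -> " + blockOf(c));
        }
    }
}
```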

| Modified UTF-8 Encoding | ||
|
UTF-8 is an efficient encoding of Unicode character strings that recognizes
the fact that the majority of text-based communications are in ASCII, and
therefore optimizes the encoding of these characters.
Strings are encoded as two bytes that specify the length of the string, followed by the encoded string characters. The 2-byte length is written in network byte order (high byte first) and gives the number of bytes in the encoded string characters, not the number of characters in the string:
[lenHI][lenLO] {encoded characters}
The individual characters are encoded according to the following
table. ASCII characters are encoded as a single byte; characters from
\u0080 through \u07FF, which include Greek, Hebrew, and Arabic, are
encoded as two bytes; and all other characters are encoded as three
bytes. The variant of UTF-8 used by Java has one modification: the
character \u0000 is encoded as two bytes rather than one, as shown in
the first row of the table. |
||
| Character | Encoding | |
| \u0000 | [11000000][10000000] (Java) | |
| \u0001 - \u007F | [0][bits 0-6] | |
| \u0080 - \u07FF | [110][bits 6-10] [10][bits 0-5] | |
| \u0800 - \uFFFF | [1110][bits 12-15] [10][bits 6-11] [10][bits 0-5] | |
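
The bit layouts in the table can be sketched in code as follows. This is a minimal illustration rather than the JDK's own implementation: the class `ModifiedUtf8Sketch` and its method names are hypothetical. It writes the 2-byte length followed by the encoded characters, and compares the result with `DataOutputStream.writeUTF()`, which produces this format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper class, not part of the JDK.
final class ModifiedUtf8Sketch {
    /** Encodes s as [lenHI][lenLO] followed by the encoded characters. */
    static byte[] encode(String s) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                body.write(c);                          // [0][bits 0-6]
            } else if (c <= 0x07FF) {                   // also covers the zero character
                body.write(0xC0 | (c >> 6));            // [110][bits 6-10]
                body.write(0x80 | (c & 0x3F));          // [10][bits 0-5]
            } else {
                body.write(0xE0 | (c >> 12));           // [1110][bits 12-15]
                body.write(0x80 | ((c >> 6) & 0x3F));   // [10][bits 6-11]
                body.write(0x80 | (c & 0x3F));          // [10][bits 0-5]
            }
        }
        byte[] encoded = body.toByteArray();
        if (encoded.length > 0xFFFF) {
            throw new IllegalArgumentException("Encoded form exceeds 65535 bytes");
        }
        byte[] out = new byte[encoded.length + 2];
        out[0] = (byte) (encoded.length >> 8);          // lenHI
        out[1] = (byte) encoded.length;                 // lenLO
        System.arraycopy(encoded, 0, out, 2, encoded.length);
        return out;
    }

    /** Produces the same format with the JDK's DataOutputStream, for comparison. */
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One character from each row of the table: ASCII, Greek, CJK, and zero.
        String sample = "A\u03B1\u4E2D\u0000";
        System.out.println(Arrays.equals(encode(sample), viaWriteUTF(sample)));
    }
}
```

Because the length prefix counts bytes, the four-character sample string above has an encoded length of 8: one byte for the ASCII letter, two for the Greek letter, three for the ideograph, and two for the zero character.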