Character Encodings
by Jeff Hunter, Sr. Database Administrator
The following character encodings are supported by JDK 1.1.7a and JDK 1.2. All of the encoding names are case-sensitive. The JDK can successfully convert from these encodings to Unicode and vice-versa. This does not, however, mean that the JDK comes with fonts to display these character sets.
| Encoding Name | Character Encoding |
|---|---|
| ASCII | U.S. ASCII (ISO 646, ANSI X3.4) |
| ISO8859_1 | Latin 1 (Western Europe) |
| ISO8859_2 | Latin 2 (Eastern Europe) |
| ISO8859_3 | Latin 3 (Southern Europe) |
| ISO8859_4 | Latin 4 (Northern Europe) |
| ISO8859_5 | Cyrillic |
| ISO8859_6 | Arabic |
| ISO8859_7 | Greek |
| ISO8859_8 | Hebrew |
| ISO8859_9 | Latin 5 (Turkish) |
| ISO8859_15_FDIS | Updated Latin 1 with Euro |
| Big5 | Traditional Chinese |
| EUC_CN | Simplified Chinese |
| EUC_JP | Japanese |
| EUC_TW | Traditional Chinese |
| GBK | Simplified Chinese |
| ISO2022CN | Chinese |
| ISO2022CN_CNS | Traditional Chinese |
| ISO2022CN_GB | Simplified Chinese |
| ISO2022JP | Japanese |
| ISO2022KR | Korean |
| JIS0201 | Japanese |
| JIS0208 | Japanese |
| JIS0212 | Japanese |
| JISAutoDetect | JIS autodetect (bytes-to-chars only) |
| SJIS | Shift-JIS Japanese |
| Unicode | Marked big-endian Unicode |
| UnicodeBig | Marked big-endian Unicode |
| UnicodeBigUnmarked | Unmarked big-endian Unicode |
| UnicodeLittle | Marked little-endian Unicode |
| UnicodeLittleUnmarked | Unmarked little-endian Unicode |
| UTF8 | Unicode Transfer Format-8 |
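As a minimal sketch of how these encoding names are used, the following converts Latin 1 bytes to a Unicode string and then back out as UTF-8. The class and method names (`EncodingNamesDemo`, `decode`, `encode`) are made up for illustration; the only JDK calls assumed are `new String(byte[], String)` and `String.getBytes(String)`. Current JDKs still accept `ISO8859_1` and `UTF8` as names, but some of the other names in the table may not be recognized by every release.

```java
import java.io.UnsupportedEncodingException;

class EncodingNamesDemo {
    // Decode bytes in the named encoding into a String (Unicode internally).
    public static String decode(byte[] bytes, String encodingName) {
        try {
            return new String(bytes, encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
    }

    // Encode a String into bytes using the named encoding.
    public static byte[] encode(String text, String encodingName) {
        try {
            return text.getBytes(encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
    }

    public static void main(String[] args) {
        // The four bytes of "cafe" with an accented e (0xE9) in Latin 1.
        byte[] latin1 = { (byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9 };
        String text = decode(latin1, "ISO8859_1");
        byte[] utf8 = encode(text, "UTF8");
        // prints "4 chars, 5 UTF-8 bytes"
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```

The accented character occupies one byte in Latin 1 but two bytes in UTF-8, which is why the byte count grows while the character count stays the same.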
Unicode is a 16-bit international character encoding standard that supports the alphabets of many different languages in addition to a variety of mathematical and geometric shapes. Groups of characters from different alphabets and different origins are assigned contiguous blocks of the character set; the following table lists the Unicode 2.0 block allocation.
| Start Code | End Code | Block Name |
|---|---|---|
| \u0000 | \u007F | Basic Latin |
| \u0080 | \u00FF | Latin-1 Supplement |
| \u0100 | \u017F | Latin Extended-A |
| \u0180 | \u024F | Latin Extended-B |
| \u0250 | \u02AF | IPA Extensions |
| \u02B0 | \u02FF | Spacing Modifier Letters |
| \u0300 | \u036F | Combining Diacritical Marks |
| \u0370 | \u03FF | Greek |
| \u0400 | \u04FF | Cyrillic |
| \u0530 | \u058F | Armenian |
| \u0590 | \u05FF | Hebrew |
| \u0600 | \u06FF | Arabic |
| \u0900 | \u097F | Devanagari |
| \u0980 | \u09FF | Bengali |
| \u0A00 | \u0A7F | Gurmukhi |
| \u0A80 | \u0AFF | Gujarati |
| \u0B00 | \u0B7F | Oriya |
| \u0B80 | \u0BFF | Tamil |
| \u0C00 | \u0C7F | Telugu |
| \u0C80 | \u0CFF | Kannada |
| \u0D00 | \u0D7F | Malayalam |
| \u0E00 | \u0E7F | Thai |
| \u0E80 | \u0EFF | Lao |
| \u0F00 | \u0FBF | Tibetan |
| \u10A0 | \u10FF | Georgian |
| \u1100 | \u11FF | Hangul Jamo |
| \u1E00 | \u1EFF | Latin Extended Additional |
| \u1F00 | \u1FFF | Greek Extended |
| \u2000 | \u206F | General Punctuation |
| \u2070 | \u209F | Superscripts and Subscripts |
| \u20A0 | \u20CF | Currency Symbols |
| \u20D0 | \u20FF | Combining Marks for Symbols |
| \u2100 | \u214F | Letterlike Symbols |
| \u2150 | \u218F | Number Forms |
| \u2190 | \u21FF | Arrows |
| \u2200 | \u22FF | Mathematical Operators |
| \u2300 | \u23FF | Miscellaneous Technical |
| \u2400 | \u243F | Control Pictures |
| \u2440 | \u245F | Optical Character Recognition |
| \u2460 | \u24FF | Enclosed Alphanumerics |
| \u2500 | \u257F | Box Drawing |
| \u2580 | \u259F | Block Elements |
| \u25A0 | \u25FF | Geometric Shapes |
| \u2600 | \u26FF | Miscellaneous Symbols |
| \u2700 | \u27BF | Dingbats |
| \u3000 | \u303F | CJK Symbols and Punctuation |
| \u3040 | \u309F | Hiragana |
| \u30A0 | \u30FF | Katakana |
| \u3100 | \u312F | Bopomofo |
| \u3130 | \u318F | Hangul Compatibility Jamo |
| \u3190 | \u319F | Kanbun |
| \u3200 | \u32FF | Enclosed CJK Letters and Months |
| \u3300 | \u33FF | CJK Compatibility |
| \u4E00 | \u9FFF | CJK Unified Ideographs |
| \uAC00 | \uD7A3 | Hangul Syllables |
| \uD800 | \uDB7F | High Surrogates |
| \uDB80 | \uDBFF | High Private Use Surrogates |
| \uDC00 | \uDFFF | Low Surrogates |
| \uE000 | \uF8FF | Private Use |
| \uF900 | \uFAFF | CJK Compatibility Ideographs |
| \uFB00 | \uFB4F | Alphabetic Presentation Forms |
| \uFB50 | \uFDFF | Arabic Presentation Forms-A |
| \uFE20 | \uFE2F | Combining Half Marks |
| \uFE30 | \uFE4F | CJK Compatibility Forms |
| \uFE50 | \uFE6F | Small Form Variants |
| \uFE70 | \uFEFE | Arabic Presentation Forms-B |
| \uFF00 | \uFFEF | Halfwidth and Fullwidth Forms |
| \uFEFF | \uFEFF | Specials |
| \uFFF0 | \uFFFF | Specials |
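The block a character belongs to can also be looked up at run time. The sketch below assumes the `Character.UnicodeBlock.of(char)` method that the JDK has provided since 1.2; the class name `UnicodeBlockDemo` and the helper `blockOf` are hypothetical. Note that a current JDK follows a much later Unicode version than 2.0, so block boundaries reported for some code points may differ from the table above.

```java
class UnicodeBlockDemo {
    // Returns the block containing c as reported by the running JDK,
    // or null if the code point is not a member of any defined block.
    public static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // Samples: Latin capital A, e with acute, Greek alpha, Hebrew alef, a CJK ideograph.
        char[] samples = { 'A', (char) 0x00E9, (char) 0x03B1, (char) 0x05D0, (char) 0x4E2D };
        for (char c : samples) {
            System.out.println(String.format("U+%04X -> %s", (int) c, blockOf(c)));
        }
    }
}
```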
UTF-8 is an efficient encoding of Unicode character strings that recognizes the fact that the majority of text-based communications are in ASCII, and therefore optimizes the encoding of these characters.
Strings are encoded as two bytes that specify the length of the string, followed by the encoded string characters. The 2-byte length is written in network byte order (most significant byte first) and gives the number of bytes in the encoded string, not the number of characters in the string.
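The text above does not name an API, so the following is only a sketch under the assumption that the format described is the one produced by `java.io.DataOutputStream.writeUTF`. It writes a two-character string and reads back the 2-byte length prefix; the class name `WriteUtfDemo` and the helper `writeUtf` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class WriteUtfDemo {
    // Returns the raw bytes that DataOutputStream.writeUTF produces for s.
    public static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream and a short string.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two characters: 'A' (one byte) and e with acute, U+00E9 (two bytes).
        String text = "A" + (char) 0x00E9;
        byte[] bytes = writeUtf(text);
        // Reassemble the big-endian 2-byte length prefix.
        int prefix = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        // prints "length prefix = 3, total bytes = 5"
        System.out.println("length prefix = " + prefix + ", total bytes = " + bytes.length);
    }
}
```

The prefix is 3 even though the string has only 2 characters, because it counts encoded bytes rather than characters.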
The individual characters are encoded according to the following table. ASCII characters are encoded as a single byte; Greek, Hebrew, and Arabic characters are encoded as two bytes; and all other characters are encoded as three bytes. The variant of UTF-8 used by Java has one modification: the character \u0000 is encoded in two bytes, so that no character will be encoded with the byte zero.
| Character | Encoding |
|---|---|
| \u0000 | [11000000][10000000] (Java) |
| \u0001 - \u007F | [0][bits 0-6] |
| \u0080 - \u07FF | [110][bits 6-10] [10][bits 0-5] |
| \u0800 - \uFFFF | [1110][bits 12-15] [10][bits 6-11] [10][bits 0-5] |
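The table can be turned into code almost row by row. The following is a minimal sketch, not the JDK's own implementation: the class name `ModifiedUtf8Sketch` and the methods `encodeChar` and `toHex` are hypothetical, and the encoder handles one 16-bit `char` at a time exactly as the table describes.

```java
class ModifiedUtf8Sketch {
    // Encodes a single 16-bit char following the table above.
    public static byte[] encodeChar(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // One byte: [0][bits 0-6]
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // Two bytes: [110][bits 6-10] [10][bits 0-5].
            // U+0000 also lands here, giving C0 80 (the Java modification).
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // Three bytes: [1110][bits 12-15] [10][bits 6-11] [10][bits 0-5]
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Formats bytes as space-separated uppercase hex, e.g. "CE B1".
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeChar((char) 0x0000))); // prints "C0 80"
        System.out.println(toHex(encodeChar('A')));           // prints "41"
        System.out.println(toHex(encodeChar((char) 0x03B1))); // prints "CE B1" (Greek alpha)
        System.out.println(toHex(encodeChar((char) 0x4E2D))); // prints "E4 B8 AD" (CJK ideograph)
    }
}
```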
Jeffrey Hunter is an Oracle Certified Professional, Java Development Certified Professional, Author, and an Oracle ACE. Jeff currently works as a Senior Database Administrator for The DBA Zone, Inc. located in Pittsburgh, Pennsylvania. His work includes advanced performance tuning, Java and PL/SQL programming, developing high availability solutions, capacity planning, database security, and physical / logical database design in a UNIX, Linux, and Windows server environment. Jeff's other interests include mathematical encryption theory, programming language processors (compilers and interpreters) in Java and C, LDAP, writing web-based database administration tools, and of course Linux. He has been a Sr. Database Administrator and Software Engineer for over 17 years and maintains his own website at: http://www.iDevelopment.info. Jeff graduated from Stanislaus State University in Turlock, California, with a Bachelor's degree in Computer Science.
Copyright (c) 1998-2012 Jeffrey M. Hunter. All rights reserved.
All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express, prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.
I have made every effort and taken great care in making sure that the material included on my web site is technically accurate, but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.
Last modified on Wednesday, 28-Dec-2011 14:09:10 EST