The difference between UTF8 and AL32UTF8

Original post · eygle · 2012-01-18

The differences between UTF8 and AL32UTF8 are:

UTF8 stores Unicode characters with code points > U+FFFF as two surrogate characters, three bytes each

AL32UTF8 stores Unicode characters with code points > U+FFFF as one four-byte character

UTF8 will no longer be updated when new Unicode versions are released; only AL32UTF8 and AL16UTF16 will be.

Because of compatibility problems with pre-9i versions, use UTF8 if you have Oracle8i clients connecting to the database. Use AL32UTF8 in a pure Oracle9i environment.

The UTF8 character set is variable-width: each character in the range U+0000 to U+FFFF is represented by one, two, or three bytes, and a supplementary character is stored as two three-byte surrogates.

UTF8 is a varying-width Unicode encoding, 1-3 bytes per character. It is supported for both database and national character sets. It is a binary superset of US7ASCII. UTF8 corresponds to the Unicode CESU-8 encoding.

AL32UTF8 is a varying-width Unicode encoding, 1-4 bytes per character. It is supported for CHAR, VARCHAR2, LONG and CLOB only (database character set). It is a binary superset of UTF8 (in 9.2 only) and US7ASCII. AL32UTF8 corresponds to the Unicode UTF-8 encoding.
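To make the storage difference concrete, here is a minimal Python sketch that encodes one supplementary character both ways. It assumes that Oracle's UTF8 behaves like CESU-8 and AL32UTF8 like standard UTF-8, as stated above. The helper names `encode_utf8` and `encode_cesu8` are made up for this illustration and are not Oracle APIs.

```python
def encode_utf8(text):
    """Standard UTF-8 (what AL32UTF8 corresponds to)."""
    return text.encode("utf-8")


def encode_cesu8(text):
    """CESU-8 (what Oracle's UTF8 corresponds to); illustrative helper only."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # BMP characters are encoded exactly as in UTF-8 (1-3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split the code point into a UTF-16 surrogate pair, then
            # encode each surrogate on its own as a 3-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U00020000"  # a CJK Extension B character, code point > U+FFFF
print(encode_utf8(sample).hex(" "))   # f0 a0 80 80        (one 4-byte character)
print(encode_cesu8(sample).hex(" "))  # ed a1 80 ed b0 80  (two 3-byte surrogates)
```

For any text without code points above U+FFFF the two functions return identical bytes, which is why the choice only matters for supplementary characters.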

This is what Metalink says in Note 237593.1:

There is a possible problem for 8.1.7 and lower versions: problems connecting to AL32UTF8 databases from older versions (8i and lower).

The default UTF8 character set for 9i/10g is AL32UTF8; however, this character set is NOT recognised by any pre-9i client or server systems.

Oracle recommends that you use UTF8 instead of AL32UTF8 as the database character set if you have 8i (or older) servers and clients connecting to the 9i/10g system, until you can upgrade the older versions.

UTF8 is Unicode revision 3.0 in 8.1.7 and up. AL32UTF8 is Unicode 3.0 in 9.0.1, Unicode 3.1 in 9.2, Unicode 3.2 in 10.1 and Unicode 4.0.1 in 10.2.

Besides the difference in Unicode version, the "big difference" is that AL32UTF8 has built-in support for "Surrogate Pairs" (also known as surrogate characters or "supplementary characters"). Practically, this means that in 99% of cases you can use UTF8 instead of AL32UTF8 without any problem.

There are only a few situations where Surrogate Pairs are already used on the client side. A Windows system with HKSCS2001 (the Hong Kong extension) is one of those. Note that you actually can *store* Surrogate Pairs in UTF8, but each one is stored as two 3-byte characters rather than as one 4-byte character as in AL32UTF8.
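A quick way to judge whether your data falls into the remaining 1% is to look for code points above U+FFFF. The sketch below is a rough illustration in Python, not an Oracle tool: the function names are hypothetical, and the byte counts assume UTF8 stores each supplementary character in 6 bytes and AL32UTF8 in 4, as described above.

```python
def supplementary_chars(text):
    """Return the characters in text whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def storage_bytes(text):
    """Return (bytes under UTF8, bytes under AL32UTF8) for text.

    Assumes text contains no lone surrogates; str.encode would raise
    UnicodeEncodeError on those.
    """
    al32utf8 = len(text.encode("utf-8"))
    # Each supplementary character costs 2 * 3 = 6 bytes in UTF8
    # instead of 4 bytes in AL32UTF8, so add 2 bytes per occurrence.
    utf8 = al32utf8 + 2 * len(supplementary_chars(text))
    return utf8, al32utf8


sample = "\u9999\u6e2f\U00020000"  # two BMP characters plus one supplementary
print(supplementary_chars(sample) != [])  # True: the data uses surrogate pairs
print(storage_bytes(sample))              # (12, 10)
```

If `supplementary_chars` comes back empty for a representative sample of your data, the two character sets would store that data byte for byte the same.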

Note that if you use UTF8 as the database character set now and, in the future, roll out new 9i or higher clients and upgrade all your other databases to 9i or higher, you can simply do an ALTER DATABASE CHARACTER SET to go from UTF8 to AL32UTF8, so downtime will be limited to a few minutes if the need to go to AL32UTF8 should arise. There is no performance impact from staying on UTF8.

NOTE: This note is ONLY relevant if you already have a 9i AL32UTF8 database with data in it. If you still need to create the 9i system, then choose UTF8 instead of AL32UTF8 as the database character set in the Database Creation Assistant.

So, *IF* you already have a 9i system running with AL32UTF8, then you can use the steps in this note to change the database character set to UTF8 without losing data.

You can't simply use "ALTER DATABASE CHARACTER SET" to go from AL32UTF8 to UTF8, because UTF8 is a SUB-set of AL32UTF8 (some code points which are correct in AL32UTF8 are not known in UTF8).

But again, UTF8 *contains* all characters known in AL32UTF8; the difference between them is purely the way some characters are stored (AL32UTF8 is a bit more efficient for some characters).

So you will run into ORA-12712 if you try ALTER DATABASE ...

Quoted from: http://songhefei.itpub.net/post/7281/473958
