Thursday, December 24, 2009

IDENTIFYING THE EXPORT CHARACTER SET


Introduction:
One of the most common NLS-related problems reported to Oracle Worldwide Customer Support is the loss or changing of characters after an export and import. This is almost always due to NLS_LANG being set to the incorrect character set during export.


Role of NLS_LANG During Export/Import:

This is explained in detail in Note 15095.1 but, in brief, export and import are client tools and will work under the character set specified by NLS_LANG. If, for instance, the database is created with a  character set of WE8DEC and NLS_LANG is set to AMERICAN_AMERICA.WE8PC850 then the ascii values of the stored characters in the database will be translated to the values for the same characters in the WE8PC850 character set. The character set defined by NLS_LANG during the export is stored in the export dump file and is used to ensure that the data is translated correctly to the character set defined by NLS_LANG for the import.

Potential Problems:

If NLS_LANG is not set, for instance, export will be done under US7ASCII, the default character set. If the database was built with character set WE8DEC the characters stored in the database will be converted to US7ASCII and any 8-bit characters, having no equivalent in US7ASCII, will be stripped out.

The same problem will be seen if the character set defined by NLS_LANG is not a superset of the one being translated from (ie: the database character set on export or the export file character set on import).

Identifying the Export Character Set:

When investigating problems like these it is useful to check the character set used for the export. As said above, this is held in the export dump file. It can be seen by doing a hex dump of the export file as follows (in Unix):

cat expdat.dmp | od -x | head

This will produce output similar to:

0000000 0300 0145 5850 4f52 543a 5630 372e 3033
0000020 2e30 330a 4454 534f 0a52 5441 424c 4553
0000040 0a31 3032 340a 300a 0020 2020 2020 2020
0000060 2020 2020 2020 2020 2020 2020 2020 2020
*
0000140 2020 2020 2020 2020 4d6f 6e20 4e6f 7620
0000160 3130 2031 343a 3031 3a33 3620 3139 3937
0000200 0a54 4142 4c45 2022 454d 5022 0a43 5245
0000220 4154 4520 5441 424c 4520 2245 4d50 2220

The second and the third byte in the file define the character set used for the export.

In the example above, the second byte is 0x00 and the third byte is 0x01, yielding 0x0001 as the character set ID. This shows that NLS_LANG was set to US7ASCII during the export. The new Oracle8 functions NLS_CHARSET_NAME and NLS_CHARSET_ID can be used to map character set IDs to character set names.

Note that the 16-bit value is stored in the EXP platform endian.

Most unix platforms are big-endian SparcPowerPcPARisc, RS/6000, SGI R4000 systems),  i.e. the most significant byte is showed first. (like above example -> if the file begins with 03xx -> big endian)

=============================================================

On little-endian platforms, (platforms running on Intel/AMD x86 and Alpha mainly) the output will be slightly different as below:

00000000 0003 4501 5058 524f 3a54 2e37 3330
etc.

Here the most significant byte is showed *last* (!)

(if the file begins with xx03 -> little endian)

The first and the forth byte in the file define the character set used for the export.


0000000 0003 4501 5058 524f 3a54 3156 2e30 3130
0000020 302e 0a30 5344 5359 414d 0a4e 5452 4241
0000040 454c 0a53 3138 3239 300a 320a 0a30 0a30
0000060 6703 6903 6703 0100 0000 0000 0000 0000
0000100 0800 2000 2020 2020 2020 2020 2020 2020
0000120 2020 2020 2020 2020 2020 2020 2020 2020
0000140 2020 2020 2020 2020 7553 206e 634f 2074
0000160 3531 3120 3a31 3832 343a 2033 3032 3630
0000200 6574 7473 642e 706d 0000 0000 0000 0000
0000220 0000 0000 0000 0000 0000 0000 0000 0000
In the example above, the first byte is 0x00 and the forth byte is 0x01, yielding 0x0001 as the character set ID. This shows that NLS_LANG was set to US7ASCII during the export. The new Oracle8 functions NLS_CHARSET_NAME and NLS_CHARSET_ID can be used to map character set IDs to character set names.

The values for the most commonly used character sets are below:
ID
ID(Hex)
Name
ID
ID(Hex)
Name
ID
ID(Hex)
Name
1
0001
US7ASCII
2
0002
WE8DEC
3
0003
WE8HP
4
0004
US8PC437
5
0005
WE8EBCDIC37
6
0006
WE8EBCDIC500
7
0007
WE8EBCDIC1140
8
0008
WE8EBCDIC285
9
0009
WE8EBCDIC1146
10
000A
WE8PC850
11
000B
D7DEC
12
000C
F7DEC
13
000D
S7DEC
14
000E
E7DEC
15
000F
SF7ASCII
16
0010
NDK7DEC
17
0011
I7DEC
18
0012
NL7DEC
19
0013
CH7DEC
20
0014
YUG7ASCII
21
0015
SF7DEC
22
0016
TR7DEC
23
0017
IW7IS960
25
0019
IN8ISCII
27
001B
WE8EBCDIC1148
28
001C
WE8PC858
31
001F
WE8ISO8859P1
32
0020
EE8ISO8859P2
33
0021
SE8ISO8859P3
34
0022
NEE8ISO8859P4
35
0023
CL8ISO8859P5
36
0024
AR8ISO8859P6
37
0025
EL8ISO8859P7
38
0026
IW8ISO8859P8
39
0027
WE8ISO8859P9
40
0028
NE8ISO8859P10
41
0029
TH8TISASCII
42
002A
TH8TISEBCDIC
43
002B
BN8BSCII
44
002C
VN8VN3
45
002D
VN8MSWIN1258
46
002E
WE8ISO8859P15
50
0032
WE8NEXTSTEP
61
003D
AR8ASMO708PLUS
70
0046
AR8EBCDICX
72
0048
AR8XBASIC
81
0051
EL8DEC
82
0052
TR8DEC
90
005A
WE8EBCDIC37C
91
005B
WE8EBCDIC500C
92
005C
IW8EBCDIC424
93
005D
TR8EBCDIC1026
94
005E
WE8EBCDIC871
95
005F
WE8EBCDIC284
96
0060
WE8EBCDIC1047
97
0061
WE8EBCDIC1140C
98
0062
WE8EBCDIC1145
99
0063
WE8EBCDIC1148C
110
006E
EEC8EUROASCI
113
0071
EEC8EUROPA3
114
0072
LA8PASSPORT
140
008C
BG8PC437S
150
0096
EE8PC852
152
0098
RU8PC866
153
0099
RU8BESTA
154
009A
IW8PC1507
155
009B
RU8PC855
156
009C
TR8PC857
158
009E
CL8MACCYRILLIC
159
009F
CL8MACCYRILLICS
160
00A0
WE8PC860
161
00A1
IS8PC861
162
00A2
EE8MACCES
163
00A3
EE8MACCROATIANS
164
00A4
TR8MACTURKISHS
165
00A5
IS8MACICELANDICS
166
00A6
EL8MACGREEKS
167
00A7
IW8MACHEBREWS
170
00AA
EE8MSWIN1250
171
00AB
CL8MSWIN1251
172
00AC
ET8MSWIN923
173
00AD
BG8MSWIN
174
00AE
EL8MSWIN1253
175
00AF
IW8MSWIN1255
176
00B0
LT8MSWIN921
177
00B1
TR8MSWIN1254
178
00B2
WE8MSWIN1252
179
00B3
BLT8MSWIN1257
180
00B4
D8EBCDIC273
181
00B5
I8EBCDIC280
182
00B6
DK8EBCDIC277
183
00B7
S8EBCDIC278
184
00B8
EE8EBCDIC870
185
00B9
CL8EBCDIC1025
186
00BA
F8EBCDIC297
187
00BB
IW8EBCDIC1086
188
00BC
CL8EBCDIC1025X
189
00BD
D8EBCDIC1141
190
00BE
N8PC865
191
00BF
BLT8CP921
192
00C0
LV8PC1117
193
00C1
LV8PC8LR
194
00C2
BLT8EBCDIC1112
195
00C3
LV8RST104090
196
00C4
CL8KOI8R
197
00C5
BLT8PC775
198
00C6
DK8EBCDIC1142
199
00C7
S8EBCDIC1143
200
00C8
I8EBCDIC1144
201
00C9
F7SIEMENS9780X
202
00CA
E7SIEMENS9780X
203
00CB
S7SIEMENS9780X
204
00CC
DK7SIEMENS9780X
205
00CD
N7SIEMENS9780X
206
00CE
I7SIEMENS9780X
207
00CF
D7SIEMENS9780X
208
00D0
F8EBCDIC1147
210
00D2
WE8GCOS7
211
00D3
EL8GCOS7
221
00DD
US8BS2000
222
00DE
D8BS2000
223
00DF
F8BS2000
224
00E0
E8BS2000
225
00E1
DK8BS2000
226
00E2
S8BS2000
230
00E6
WE8BS2000E
231
00E7
WE8BS2000
232
00E8
EE8BS2000
235
00EB
CL8BS2000
239
00EF
WE8BS2000L5
241
00F1
WE8DG
251
00FB
WE8NCR4970
261
0105
WE8ROMAN8
262
0106
EE8MACCE
263
0107
EE8MACCROATIAN
264
0108
TR8MACTURKISH
265
0109
IS8MACICELANDIC
266
010A
EL8MACGREEK
267
010B
IW8MACHEBREW
277
0115
US8ICL
278
0116
WE8ICL
279
0117
WE8ISOICLUK
301
012D
EE8EBCDIC870C
311
0137
EL8EBCDIC875S
312
0138
TR8EBCDIC1026S
314
013A
BLT8EBCDIC1112S
315
013B
IW8EBCDIC424S
316
013C
EE8EBCDIC870S
317
013D
CL8EBCDIC1025S
319
013F
TH8TISEBCDICS
320
0140
AR8EBCDIC420S
322
0142
CL8EBCDIC1025C
323
0143
CL8EBCDIC1025R
324
0144
EL8EBCDIC875R
351
015F
WE8MACROMAN8
352
0160
WE8MACROMAN8S
353
0161
TH8MACTHAI
354
0162
TH8MACTHAIS
368
0170
HU8CWI2
380
017C
EL8PC437S
381
017D
EL8EBCDIC875
382
017E
EL8PC737
383
017F
LT8PC772
384
0180
LT8PC774
385
0181
EL8PC869
386
0182
EL8PC851
390
0186
CDN8PC863
401
0191
HU8ABMOD
500
01F4
AR8ASMO8X
504
01F8
AR8NAFITHA711T
505
01F9
AR8SAKHR707T
506
01FA
AR8MUSSAD768T
507
01FB
AR8ADOS710T
508
01FC
AR8ADOS720T
509
01FD
AR8APTEC715T
511
01FF
AR8NAFITHA721T
514
0202
AR8HPARABIC8T
554
022A
AR8NAFITHA711
555
022B
AR8SAKHR707
556
022C
AR8MUSSAD768
557
022D
AR8ADOS710
558
022E
AR8ADOS720
559
022F
AR8APTEC715
560
0230
AR8MSWIN1256
560
0230
AR8MSAWIN
561
0231
AR8NAFITHA721
563
0233
AR8SAKHR706
565
0235
AR8ARABICMAC
566
0236
AR8ARABICMACS
567
0237
AR8ARABICMACT
590
024E
LA8ISO6937
798
031E
WE8DECTST
829
033D
JA16VMS
830
033E
JA16EUC
831
033F
JA16EUCYEN
832
0340
JA16SJIS
833
0341
JA16DBCS
834
0342
JA16SJISYEN
835
0343
JA16EBCDIC930
836
0344
JA16MACSJIS
840
0348
KO16KSC5601
842
034A
KO16DBCS
845
034D
KO16KSCCS
846
034E
KO16MSWIN949
850
0352
ZHS16CGB231280
851
0353
ZHS16MACCGB231280
852
0354
ZHS16GBK
853
0355
ZHS16DBCS
860
035C
ZHT32EUC
861
035D
ZHT32SOPS
862
035E
ZHT16DBT
863
035F
ZHT32TRIS
864
0360
ZHT16DBCS
865
0361
ZHT16BIG5
866
0362
ZHT16CCDC
867
0363
ZHT16MSWIN950
868
0364
ZHT16HKSCS
870
0366
AL24UTFFSS
871
0367
UTF8
872
0368
UTFE
994
03E2
WE16DECTST2
995
03E3
WE16DECTST
996
03E4
KO16TSTSET
997
03E5
JA16TSTSET2
998
03E6
JA16TSTSET
1001
03E9
US16TSTFIXED
1830
0726
JA16EUCFIXED
1832
0728
JA16SJISFIXED
1833
0729
JA16DBCSFIXED
1840
0730
KO16KSC5601FIXED
1842
0732
KO16DBCSFIXED
1850
073A
ZHS16CGB231280FIXED
1852
073C
ZHS16GBKFIXED
1853
073D
ZHS16DBCSFIXED
1860
0744
ZHT32EUCFIXED
1863
0747
ZHT32TRISFIXED
1864
0748
ZHT16DBCSFIXED
1865
0749
ZHT16BIG5FIXED
9996
270C
HZ-GB-2312
9997
270D
ISO2022-KR
9998
270E
ISO2022-CN
9999
270F
ISO2022-JP




select nls_charset_id(value) nls_charset_id,  value
        from  v$nls_valid_values
        where parameter = ‘CHARACTERSET’
        order by nls_charset_id(value);

Gives the nls_charset_id in DECIMAL, so you need to convert it to HEX first.

Warning:  User modifications of export dump files are not supported by Oracle.  The character set information is also held in other places in the export dump file and modifying only the two bytes           may lead to problems with imported data.

NOT WORKING any more with Oracle 9i R2 (9.2) and up, due to changes in the import /export tools          but you can use a 8i exp against a 9i db for example, more info is in Note 132904.1 Compatibility Matrix for Export & Import Between Different Oracle Versions

In some cases it can be useful to modify the character set information held in the dump file. This should not be taken lightly since the character set information is also held in other places.

We STRONGLY advice you to log a NLS tar FIRST to get confirmation that this is a solution for your problem before starting to change this header.

If, after careful consideration of other options and verification by support, you do decide to edit the character set simply use a binary file editor to do so.

You could for example use the freeware Hex Editor XVI32 from
or the shareware File Editor 2000 from

In case of multiple dump files from a single export, you need to modify each and every export dump file. Otherwise, you will error out with ?

IMP-00008: unrecognized statement in the export file:

when opening the second file.

No comments: