INSERT character set problems - Python

Hi
I am running a script that reads data from a local text file, parses it, and then does an INSERT with the parsed values.
The inserts run OK.
But when I retrieve the values from the database, I see strange characters instead of our Norwegian characters æ, ø, å.
Some info:
- The NLS_CHARACTERSET in the database is AL32UTF8.
- The database is Oracle XE 10.
- The OS is CentOS 5.3
- The file is saved as UTF-8 in the Gedit GNOME editor.
- I open the file with open('filename.txt').
- I read all lines into a list with readlines().
- I parse each line using [from:to] string slicing notation.
- And I build the INSERT using "INSERT into mytable values (%d, %s)" % (int(pn), ps).
I also print out the INSERT string just to make sure everything looks OK, and it does; but once the data is inside the database, the characters æøå are garbled.
Regards
Rune

Oracle allows the database client (e.g. Python) character set to be different from the database character set. The Oracle client libraries attempt to convert data between the two character sets. If the mapping isn't possible, characters will often appear as question marks.
Set the character set used by the client with the environment variable NLS_LANG.
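For illustration, here is a minimal sketch of that setup in Python 3 with the cx_Oracle driver (the driver choice, connection details, and slice positions are assumptions, not taken from the original post):

import os
import cx_Oracle

# Declare the client character set before the first connection is made;
# the Oracle client libraries read NLS_LANG when they initialize.
os.environ['NLS_LANG'] = '.AL32UTF8'

conn = cx_Oracle.connect('scott', 'tiger', 'localhost/XE')
cur = conn.cursor()

# Decode the file explicitly as UTF-8 instead of relying on the locale default.
with open('filename.txt', encoding='utf-8') as f:
    for line in f:
        pn, ps = line[0:5], line[5:30].strip()   # illustrative slice positions
        # Bind variables let the driver label the data correctly, avoiding the
        # quoting and encoding surprises of "%"-style string formatting.
        cur.execute("INSERT INTO mytable VALUES (:1, :2)", (int(pn), ps))

conn.commit()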

Related

NLS_LANG part 2 and os codepage to support globalisation at server side.

Hi Billy, orafad, Mr Wolicki (Sergiusz), and everybody else.
This is a follow-up to the previous thread I posted:
Wrong NLS settings But Correct Output ? change for character input? SQLPLUS
Please read it if it interests you.
I have done some tests and I have some related questions with regard to the previous thread I posted.
Mr Wolicki, if you are reading this, would you mind telling me whether I am right or wrong in my understanding of what you replied in the previous thread,
as follows:
1) Comparison and conversion never take place between the OS and the application level (SQL*Plus), even if the ACP (code page) and NLS_LANG (client, SQL*Plus) differ - BUT they do take place between the client (SQL*Plus) and the database level (DB character set). Am I right?
2) Thus, in the previous example, where I entered Thai characters on ACP 874 using SQL*Plus into a VARCHAR2 column whose character set is the same as my NLS_LANG (1252), I got my Thai characters back perfectly.
3) For NVARCHAR2 insertion, conversion (and comparison) is done implicitly in the following sequence: NLS_LANG character set ---> VARCHAR2 (database character set) ---> NVARCHAR2 (national character set). Am I right?
4) Thus, again, if my NLS_LANG is set to the same as the VARCHAR2 (database) character set, no conversion is done, and the conversion from 1252 to NVARCHAR2 (AL16UTF16) does not cause a change in encoding, so I am able to retrieve the Thai characters correctly.
THUS retrieving correct data back doesn't mean your data is stored correctly in the database for retrieval by other clients.
----------------------------------------------------------------------------------------------------------------------
Most important questions (*****)
1) NLS_LANG must be set to the same as your OS code page, so that proper conversion can take place between the client and the DB side.
Now assume I have a Unicode DB (database character set = AL32UTF8).
- I have 2 clients (A and B) that will each insert into the database from their own terminal.
- Client A will insert in 'CHINESE' (NLS_LANG = CHINA.ZHS16GBK)
- Client B, as usual, will insert in 'THAI' (NLS_LANG = TH8TISASCII)
- and I will insert in ENGLISH (NLS_LANG = WE8MSWIN1252)
AND most importantly
- I am required to see the correct values (fonts) of what they inserted, both in CHINESE and THAI, as well as my own insertions.
---------------------------------------------------------------------------------------------------------------------
What should I do if:
1) Client A needs to see data inserted by Client B, but even if it is correctly converted from the DB (based on the NLS_LANG), some of the Thai characters are not included (out of range) in the Chinese encoding - will that happen? Or, even if it is mapped correctly, the Thai text will be shown as Chinese characters - what should I do to correct this?
2) Client B needs to see data inserted by Client A (same as above).
3) Most importantly, I need to see the data inserted by both clients (Chinese and Thai).
Is there a Unicode OS code page that I can use (and set in NLS_LANG) so that I can insert in ENGLISH and yet READ Thai and Chinese as well?
Please advise.
1) Yes, this is true for a SQL*Plus session.
2) Yes, as long as you do not try to manipulate the data. Only simple storage/retrieval works "perfectly". Most other string processing may fail. This includes any attempt to retrieve the same data through JDBC.
3) Yes, for SQL literals and bind variables without specific declaration of the character set form, and generally for SQL*Plus. Bind variables declared as NCHAR (the method depends on the API) are converted from the character set of the variable directly to the national character set of the database. The character set of the variable depends on the API.
4) There is conversion between WE8MSWIN1252 and AL16UTF16 but it is reversible for the Thai codes that you entered.
## THUS retrieving correct data back doesn't mean your data is stored
## correctly in the database for retrieval by other clients.
Right!!!
Regarding your most important questions:
The requirement to set NLS_LANG to the O/S code page is true for SQL*Plus and other applications using traditional OCI access with the default NLS configuration. This type of access is known from Oracle7 times. OCI 8i and newer can be configured in Unicode mode. In Unicode mode, the character set part of NLS_LANG does not matter. Similarly, current Oracle JDBC drivers do not look at NLS_LANG; they always talk to the database in Unicode.
Also, Windows (NT-based) internally works in Unicode UTF-16. The O/S default code page is used only by the older ANSI Win32 API (the API used by SQL*Plus). Modern applications that do not have to support Win9x/Me should use the Unicode Win32 API. Therefore, it is possible to write a C application using OCI in UTF-16 mode, or a Java application using JDBC, that will display all your required languages. For this type of application, the NLS_LANG character set does not matter.
Some font configuration steps may be needed to make sure that all client workstations can display the whole character repertoire. For various advanced reasons, there is no single font that contains all Unicode characters; multilingual output is usually composed of glyphs from multiple fonts.
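As an illustration of such a Unicode-mode client, here is a sketch using the cx_Oracle Python driver (an assumption for demonstration; the same idea applies to OCI in UTF-16 mode and to JDBC):

import cx_Oracle

# With an explicit encoding, the driver talks to the database in Unicode
# and the character set part of NLS_LANG is not used for data conversion.
conn = cx_Oracle.connect('user', 'password', 'host/service',
                         encoding='UTF-8', nencoding='UTF-8')

cur = conn.cursor()
cur.execute("SELECT txt FROM mytable")   # hypothetical table and column
for (value,) in cur:
    print(value)                         # a str, independent of the O/S code page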
-- Sergiusz 
Hi, sorry for the late reply.
Fantastic answer.
I just want to offer my deepest thanks to you.
Hope to learn from you again.
- noob

Wrong NLS settings But Correct Output ? change for character input? SQLPLUS

Hi everybody, my question is a bit long; I spent quite some time typing it, so please bear with me for a while, and I hope you gurus will read it and help me if you can.
I have read in the NLS_LANG FAQ that:
>> The NLS_LANG environment parameter is the simplest way to specify locale behavior for Oracle software. It also indicates the client's character set, which corresponds to the character set for data to be entered or displayed by a client program.
Now, I need to input Thai using my SQL*Plus as well, but apart from that data, I want to read everything else as English. And out of curiosity, to test my understanding of whether NLS_LANG needs to be set for displaying Thai characters during insertion of values (it used to show me weird output when I typed Thai as a value for insertion in SQL*Plus),
I have done the following.
I went to Control Panel / Regional and Language Settings / Advanced tab / Language for non-Unicode programs - I set it to Thai (previously English-United States), and in the Code Page Conversion Tables I ticked everything that has a mapping to Thai. Then I restarted my PC.
I am a Windows XP user. I am running Oracle 10g and I am using Windows SQL*Plus.
My current settings are:
For my OS code page:
In the registry - HKLM\System\CurrentControlSet\Control\Nls - parameter ACP - 874 (Thai)
For my NLS_LANG:
In the registry under my Oracle key, I set it to AMERICAN_AMERICA.WE8MSWIN1252
(instead of AMERICAN_AMERICA.TH8TISASCII)
For my DB (non-Unicode) NLS parameters, my current settings are:
DB Character Set - WE8MSWIN1252
National Character Set - AL16UTF16
------------------------------------------------------------------------------------------------------------------------
Test 1) Does NLS_LANG need to be set for input of a foreign language?
I logged in to the SCOTT/TIGER schema and simply did an insertion using Thai characters - it works.
- So I conclude that, for just INPUTTING a foreign language, changing the OS code page is enough.
However, NLS_LANG is needed so that correct conversion can take place between the OS application layer and the DB layer.
Q1) Am I right to say so? If I am right, here comes the main tricky problem.
------------------------------------------------------------------------------------------------------------------------
From what I know (I am not sure if I am right, so please guide me if I am not), the NLS_LANG character set is set to what your OS code page is, so that Oracle can convert the character set encoding if your character set is different from the DB's.
And so, if my OS code page is different from my NLS_LANG, I might get a different conversion back when retrieving, and rubbish data will result.
Since I have a non-Unicode DB, I will use an NVARCHAR2 column to do the testing.
Using my current settings:
OS code page - 874 Thai
NLS_LANG - AMERICAN_AMERICA.WE8MSWIN1252
DB NVARCHAR2 column - AL16UTF16
Step 1)
I create a table test with an NVARCHAR2 column of size 50.
Step 2)
I insert a Thai word into it.
Step 3)
I run a query, and the query shows me the Thai perfectly.
Q2) Why am I able to read the inserted Thai value correctly, given that my NLS_LANG is the WE8MSWIN1252 character set but my OS code page is 874 Thai? The conversion is done from Unicode to WE8MSWIN1252 - why am I able to read Thai from the WE8MSWIN1252 character set?
------------------------------------------------------------------------------------------------------------------------
What puzzled me even more.......... (Now I try to set my NLS_LANG to the same character set as my OS (874 Thai), as stated in the GUIDE as well.)
NLS_LANG - AMERICAN_AMERICA.TH8TISASCII (changed to Thai)
I drop the table and reboot my OS.
Now I do steps 1 to 3.
But I am not able to see THAI anymore; INSTEAD I am seeing ??. Why so? I am using the same character set as my OS in NLS_LANG, and I can't see the value inserted when AL16UTF16 is converted to TH8TISASCII. But with WE8MSWIN1252 it works, even though it doesn't match my OS code page.
Is some conversion lost between TH8TISASCII and AL16UTF16, and why would WE8MSWIN1252 work? What is the relationship between the three?
Thanks for reading up to this point; your kind advice would be very much appreciated.
-Noob
---------------------------------------------------------------------------------------------------------------------- 
> But I am not able to see THAI anymore; INSTEAD I am seeing ??. Why so? I am
using the same character set as my OS in NLS_LANG, and I can't see the value
inserted when AL16UTF16 is converted to TH8TISASCII. But with WE8MSWIN1252 it
works, even though it doesn't match my OS code page.
Is this perhaps not a mere rendering issue? Assuming that the client understands Unicode, it can display it... but using what font set to render it?
One way to ensure that the data is indeed correct - as you've actually entered it - is to use the DUMP() SQL function, e.g.
SQL> select DUMP(the_column) from the_table
The code page, as I understand it, is how the O/S interprets the character bits - and this determines the rendering to use. In Windows, this means loading a font set into memory to render that character.
If the code page does not "correspond" to the actual Unicode charset, it will render incorrectly. That, however, does not mean that the source data is now incorrect - simply that it is not rendered correctly.
To respond properly to your post, I really should take some time with a few of the paragraphs and follow up with questions. But for now, I'll just focus on the basics--
You should take for granted that the NLS_LANG charset value must correspond to the character set of the actual data source (e.g. file contents, OS code page, application-specific character set). Hence the correct setup in your case, with the ANSI (and OEM, which happens to be the same in this case) code page windows-874 and a Win32-based app such as SQL*Plus, is NLS_LANG=<language>_<territory>.TH8TISASCII (or just NLS_LANG=.TH8TISASCII, but beware of the defaults for the remaining settings). This is how you let Oracle know which character set is in use, thus allowing proper conversion to/from the database character set. A different setting is incorrect; however, it may in some circumstances appear to work (but it doesn't).
You have a database character set of WE8MSWIN1252. Your Oracle Client home NLS_LANG indication of client code page (character set) is WE8MSWIN1252.
So that means a character set not capable of encoding Thai characters (i.e. not matching windows-874), and you are lying to Oracle about the data coming from SQL*Plus (since when you are typing you are using windows-874).
What you appear to have is a classic pass-through setup aka garbage-in, garbage-out; as explained by the nls_lang faq, in An example of a wrong NLS_LANG setup: "When the client NLS_LANG character set is set to the same value as the database character set, Oracle assumes that the data being sent or received are of the same (correct) encoding, so no conversions or validations may occur for performance reasons. The data is just stored as delivered by the client, bit by bit."
On top of that, the database character set (NLS_CHARACTERSET parameter) chosen in your case is not suitable for storing and handling Thai script.
Hope this helps.
-- orafad
A lot of work put into formulating a question deserves a prize. Fredrik already explained some of the issue, so here are further details:
## However NLS_LANG is needed so that correct conversion can take place
## between the OS application layer and the DB layer
NLS_LANG is needed so that correct conversion takes place between client, which receives data from O/S in the O/S character set, and the database, which stores and processes data in the database character set (NLS_CHARACTERSET).
In your test 1, the O/S passes bytes to the client. The O/S thinks the bytes denote TH8TISASCII-encoded characters (ACP), so it selects Thai glyphs from the relevant font files. The client thinks it got WE8MSWIN1252-encoded characters, as it was told so by NLS_LANG. The client knows that the database uses WE8MSWIN1252 to process characters, so it sends the received bytes unmodified to the database. The database stores them. If you select the same data, the database sends the same codes back, and the client thinks it got WE8MSWIN1252. It thinks it should return data to the application and the O/S in WE8MSWIN1252 (NLS_LANG), so it returns them unmodified. SQL*Plus sends the bytes to the O/S (when outputting), and the O/S thinks the bytes are in TH8TISASCII (ACP). It displays them again using Thai glyphs.
You have not seen any problem because you have not asked the database to do any real interpretation of the data. Try applying UPPER or LOWER to the stored data. As Thai letters do not have case but Western European letters with the same binary codes do, UPPER/LOWER will corrupt your data.
In your NVARCHAR2 test, the conversion is more complicated. The data is interpreted as VARCHAR2 until it reaches the server. Therefore, it remains unmodified, as in the previous test. But in the server itself, the INSERT statement notices that the target column is NVARCHAR2 and not VARCHAR2. It applies an implicit conversion from WE8MSWIN1252 to AL16UTF16. But the Thai letters received from the O/S have codes between 0xA1 and 0xFB. If interpreted as WE8MSWIN1252, these codes simply map from 0xNN to 0x00NN in AL16UTF16. They get stored with such mapped values in the column. When you select them back, the opposite conversions happen, and you get your codes back. Again, the result will not be correct if you make the database interpret the codes, for example with UPPER/LOWER.
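This pass-through effect is easy to reproduce outside Oracle; here is a sketch using Python's built-in codecs (TIS-620 is the single-byte Thai encoding underlying TH8TISASCII, and cp1252 corresponds to WE8MSWIN1252):

# Thai letter THO THAHAN, as the O/S sends it in the Thai code page:
thai = 'ท'.encode('tis_620')                        # b'\xb7'

# The client mislabels the byte as WE8MSWIN1252, so the server sees
# U+00B7 (MIDDLE DOT) and widens it to AL16UTF16 as 0x00B7:
stored = thai.decode('cp1252').encode('utf-16-be')  # b'\x00\xb7'

# On the way out the opposite conversions run, so the original byte
# survives and the terminal renders Thai again - for this client only:
assert stored.decode('utf-16-be').encode('cp1252') == thai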
If you change NLS_LANG to the correct value in the NVARCHAR2 test, the client now knows correctly that it got TH8TISASCII codes. It now notices that the database character set is different and tries to convert. As WE8MSWIN1252 encodes no Thai letters, all the letters become the default replacement character, which is '?'.
You may ask why the conversion is to the database character set and not to the national character set (NLS_NCHAR_CHARACTERSET). There are two cases:
1) If you embed the letters in a text literal that is part of an INSERT or UPDATE statement, then they become part of the statement, and SQL and PL/SQL statement text is always processed in the server in the database character set.
2) If you pass the letters in a bind variable, then the database must know that the bind variable is of NVARCHAR2 type. You can declare a bind variable in SQL*Plus as NVARCHAR2, but SQL*Plus does not have an assignment command; you call an anonymous PL/SQL block to assign values to SQL*Plus bind variables. As PL/SQL is also interpreted in the DB character set, you cannot assign a value outside of the DB character set to a bind variable in SQL*Plus.
A solution to this issue exists in 10.2 (both client and server must be 10.2). You set the environment variable (or Registry setting) ORA_NCHAR_LITERAL_REPLACE=TRUE and precede each literal in INSERT or UPDATE with the letter N (e.g. N'This is my Thai text'). The client code encodes such literals into the form known already to users of the SQL UNISTR function. This makes the literals contain ASCII characters only (a method resembling the quoted-printable encoding for e-mails). The server parses the encoded literals, yielding NVARCHAR2 values in AL16UTF16.
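A sketch of that 10.2 mechanism driven from Python with cx_Oracle (an assumed client; the table and text are illustrative, and the variable must be set before the Oracle client libraries initialize):

import os
os.environ['ORA_NCHAR_LITERAL_REPLACE'] = 'TRUE'   # before OCI starts

import cx_Oracle
conn = cx_Oracle.connect('user', 'password', 'host/service')
cur = conn.cursor()

# With literal replacement on, the client re-encodes the N'...' literal into
# an ASCII-only UNISTR-like form, so nothing is lost in the DB character set
# on the way to the NVARCHAR2 column.
cur.execute("INSERT INTO test (ncol) VALUES (N'ทดสอบ')")
conn.commit()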
For users of database versions before 10.2, the solution is to encode Thai manually as arguments to UNISTR calls or to use a tool other than SQL*Plus that can label the data properly as NVARCHAR2.
And the really recommended approach is to use an AL32UTF8 database for multilingual data.
-- Sergiusz 
Hi Mr Wolicki and orafad,
Thank you so much for reading my post and answering.
Now I have a clearer picture, but please, if you guys are still around,
give me one more day to digest the info and run tests, so as to put things into perspective.
I hope you guys are still around by then.
Thanks!
Hi guys,
I have posted another thread with questions coupled to this one,
over here:
NLS_LANG part 2 and os codepage to support globalisation at server side.
Please do help and kindly advise if you are free; it is much appreciated if you do.
Once again, a million thanks for the clear explanations.
Regards,
Noob

Database Character Set Issue

Hi All,
We have a production 10g DB with the character set US7ASCII. This DB stores Arabic data and English data. The production DB is located on an HP Unix operating system.
When I query data from the DB through SQL Developer, the data is shown as junk or unknown characters (square boxes).
Client settings (the workstation from where the query is issued via SQL Developer - Windows XP OS): NLS_LANG = AMERICAN_AMERICA.US7ASCII
On the client workstation the Oracle 10g client is installed, and from there I query data through SQL Developer. The problem is that I am unable to see Arabic characters, in the sense that they are displayed as junk characters. However, English characters and English numeric values are displayed properly.
I tried the following to make sure that the data is not corrupted: I converted the "Name" column to a hex value (RAWTOHEX) and displayed its hex value, then executed the query below in a UTF-8 DB:
select UTL_I18N.RAW_TO_CHAR(hex_value_of_name) from dual;
This displayed the Arabic name properly in the UTF8 DB.
The character set of this production DB cannot be changed at this time. There are many applications based on this DB, and all of them are quite capable of converting the junk data to Arabic for display in the application.
My concern is: what do I need to do to view the Arabic data properly through SQL Developer? Are there any settings that need to be changed on my client workstation?
Thanks in Advance.
Regards,
SK 
There is no way to see Arabic data from a US7ASCII database in SQL Developer. Period.
If your applications are able to see Arabic data from the US7ASCII database, then they use a pass-through configuration (NLS_LANG=.US7ASCII). You can still change the database character set to the correct one (it needs to be determined from context, but AR8MSWIN1256 or AR8ISO8859P6 are the most probable) and change all NLS_LANG values to this new character set. This way, both SQL Developer and the legacy applications would work.
-- Sergiusz 
Is there any character set that I need to use on my client workstation?
I am fine with viewing the data through any SQL tool or anything else, as long as the Arabic data displays properly. But changing the character set of the production DB is impossible at this time.
If I set the character set "ARABIC_UNITED ARAB EMIRATES.AR8MSWIN1256" on my client machine, is there any chance that it will resolve my issue?
If you describe the platform and technology used for the applications that normally enter and view the Arabic data I might be able to give you some more advice.
-- Sergiusz 
Hi Sergiusz,
The platform is an HP (AIX) Unix machine.
Technology: the application is programmed in the COBOL language.
Appreciate your help.
COBOL, wow :)
Can you check the configuration of the terminals used to enter data for this application and see what the configured character set encoding is?
You should theoretically be able to view the data on those terminals if running SQL*Plus with NLS_LANG=.US7ASCII
-- Sergiusz 
I am trying to understand your question:
When you say "Can you check the configuration of the terminals used to enter data for this application", do you actually mean the character set of the DB where the application server resides? Can you please elaborate a little bit?
For the record:
"You should theoretically be able to view the data on those terminals if running SQL*Plus with NLS_LANG=.US7ASCII"
Yes. I understood "terminals" as my client workstation. I am able to see Arabic for some of the data. SQL*Plus displays special characters along with the Arabic characters, so my understanding here is that it is unable to understand some Arabic letters.
Correct me if my understanding is wrong.
Thanks a lot.
Regards,
SK
Does the regional settings option available in Control Panel affect the way Arabic data displays?
In a pass-through configuration (i.e. Arabic data stored in a non-Arabic DB character set, because the client-server conversion does not happen when NLS_LANG=DB character set), the configured character set of the terminal (physical terminal or a terminal emulator program in Windows) that is used to physically enter the data that goes to the database determines the encoding of this data. If you enter the data in a Windows application, this will be AR8MSWIN1256. If you enter data in a Command Prompt window, this will be AR8ADOS720, on Unix it will usually be AR8ISO8859P6 (but not always), etc. Of course, if the application happens to have some character set conversion built in, the encoding of the terminal may differ from the encoding of data in the database.
You have to identify the real character set of data to be able to work on any solution concerning viewing this data outside of the original application.
-- Sergiusz 
Hi Sergiusz,
Character set of the terminal is Windows-1256 encoding.
Thanks for your continued support.
Regards,
SK
If you are sure this is the case for the application itself, that is, the terminal where the data is entered, not the terminal where you want to view the data, then do the following to view the data on your Windows workstation:
1) Open a Command Prompt window.
2) Open its system menu (click on the icon left of the window title).
3) Select Properties
4) On the Font tab, select a TrueType font (e.g. Consolas or Lucida Console) and click OK.
5) In the command prompt, enter:
C:\>chcp 1256
C:\>set NLS_LANG=.US7ASCII
C:\>sqlplus ...
You should be able to view the Arabic data, although SQL*Plus is not good at right-to-left rendering, so the data may come out distorted. You may also spool the data from SQL*Plus into a text file and view the file in Notepad. This may help with legibility.
You should not expect much more. Arabic has complex rendering requirements, and many tools are not good at it in general. If you add an incorrect database character set declaration (cheating) on top of that, you get a configuration where viewing Arabic is very difficult.
-- Sergiusz 
Hi Sergiusz,
As per your suggestion, I tested in the Command Prompt to see whether the Arabic data is viewable. Unfortunately, it was showing unknown characters (?????? - question-mark data).
I will double-check the character set of the machine where the application is installed and will get back to you in a day or two.
This is an application used in a bank to enter data. When you say terminal configuration, do you mean the character set of the machines where employees log in to the application?
or
the application server character set?
Do the regional settings in Control Panel affect the way data is displayed in SQL*Plus?
Thanks for the help...
Regards,
SK
## As per your suggestion, I tested in the Command Prompt to see whether the Arabic data is viewable. Unfortunately, it was showing unknown characters (?????? - question-mark data).
Repeat the test, but also run SELECT USERENV('LANGUAGE') FROM DUAL and let me know the result.
## This is an application used in a bank to enter data. When you say terminal configuration, do you mean the character set of the machines where employees log in to the application?
Yes, this one. Actually, I mean the physical terminal, unless they work on GUI workstations, in which case I mean the configuration of the program on these workstations used to emulate the terminal (I assume they use character mode with COBOL programs).
I suggest you describe the architecture of the bank application precisely. This will help me understand which components are involved and which may influence the character set encoding.
## or the application server character set?
What application server? Do we talk about a web application written in COBOL?
## Do the regional settings in Control Panel affect the way data is displayed in SQL*Plus?
Yes, the Windows system default settings define the default configuration of the Command Prompt window and thus the display character set.
We may also try another approach to identify the character set. Choose some value from the application tables that contains a string with a good set of different characters, including Arabic and possibly some non-ASCII punctuation. Locate this value in the particular column and table in a particular row, and then run the following select:
SELECT DUMP(<column with the value>, 1016) FROM <table with the value> WHERE <condition to select the right row>
and paste the result here.
-- Sergiusz 
Hello Sergiusz,
I tried with Toad; it did not help, everything is shown as junk. The terminal character set will be the default with respect to NLS_LANG. Most of the terminals do not have an Oracle client installed.
## Repeat the test but also run SELECT USERENV('LANGUAGE') FROM DUAL and let me know the result.
I have repeated the test; it still shows unknown characters.
SQL> select userenv('LANGUAGE') from dual;
USERENV('LANGUAGE')
----------------------------------------------------
AMERICAN_AMERICA.US7ASCII
## SELECT DUMP(<column with the value>, 1016) FROM <table with the value> WHERE <condition to select the right row>
SQL> SELECT DUMP(trim(cust_name), 1016),trim(cust_name) FROM table1 where rownum <2;
DUMP(TRIM(CUST_NAME),1016)
--------------------------------------------------------------------------------
TRIM(CUST_NAME)
----------------------------------------
Typ=1 Len=25 CharacterSet=US7ASCII: e5,c8,d1,e8,e3,e7,20,d9,c8,cf,20,c7,e4,d3,e4
,c7,e5,20,c7,e4,d9,e5,e8,cf,e9
åÈÑèãç ÙÈÏ ÇäÓäÇå ÇäÙåèÏé
Regional Settings in my workstation:
Country: United States
Language: English (United States)
Current Language for Non-Unicode Programs: Arabic (Saudi Arabia)
## I suggest you describe the architecture of the bank application precisely. This will help me understand which components are involved and which may influence the character set encoding.
Here the character set follows something called "ASMO", which is taken care of in a COBOL program that calls a function written in the "C" programming language.
Once data (CHAR/VARCHAR) is picked from the DB, it is passed to the "C" function invoked from the COBOL program.
The task of this function is to map valid Arabic characters and English characters based on the data from the DB.
It gets a hex value from the DB, and the character is derived from the hex value. The conversion takes place based on the rules defined by the "ICU" (International Components for Unicode) standards.
Regards,
SK
Hello Sergiusz,
Appreciate your thoughts on resolving this issue.
Regards,
SK

instr, substr & length functions don't work as expected in SJIS

OS: Japanese Windows XP SP2
Character set of text file: SJIS
Oracle: 10g Express Edition
When I try to read and split strings from records read by UTL_FILE.GET_LINE, the position of the delimiter (',') returned by INSTR doesn't indicate the character position but the byte position, and sometimes a position that is neither the character nor the byte position.
The following is the original CSV file content:
aiu,eo,たのし,い,と
aiu,eo,たのし,い,1
探し,"もの",あり,と
探し,"もの",ありが,とう
愛,,うし,oracle
The following are the split strings, which are not what I expected. The first line shows the split strings, and the second line shows the starting positions and delimiter positions, except for the last number, which comes from the LENGTH function.
--------1--------2--------3--------4--------5--------6--------7----
aiu eo たのし い ・
1:4;5:7;8:13;14:16;17:17
aiu eo たのし い 1
1:4;5:7;8:13;14:16;17:17
探し "もの" あり,・
1:5;6:10;11:14
探し "もの" ありが とう
1:5;6:10;11:15;16:18
愛 うし oracle
1:3;4:4;5:9;10:15
Are there any settings necessary to resolve this incident?
10g Express Edition does not have a JA16SJIS version; therefore, I assume that you use the AL32UTF8 version.
Does SELECT USERENV('LANGUAGE') FROM DUAL show .AL32UTF8?
If this is the case, then the problem is incorrect use of UTL_FILE.GET_LINE. UTL_FILE does not support character set conversion, and any file you read with GET_LINE must be encoded in the database character set. Therefore, you should convert the file to Unicode UTF-8 before reading it with UTL_FILE.GET_LINE. Alternatively, use external tables with the ORACLE_LOADER driver, where you can specify JA16SJIS as the data file character set.
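For instance, a one-time conversion of the data file might look like this sketch in Python (file names are illustrative):

# Re-encode the Shift-JIS CSV as UTF-8 so that it matches the AL32UTF8
# database character set that UTL_FILE.GET_LINE assumes.
with open('data_sjis.csv', encoding='shift_jis') as src, \
        open('data_utf8.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)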
-- Sergiusz 
Thank you for the comment.
I checked the database language and it is AL32UTF8,
as you indicated.
I also searched using the keywords 'UTL_FILE JA16SJIS' and found the same comment in Japanese, advising use of a convert function.
I tried to load the same data, and the Loader function in XE works fine.
I'm wondering how the Loader identifies the language and handles it as expected?
For data in separate data files, the character set is determined by the CHARACTERSET clause in the control file. For data directly in the control file (after BEGINDATA), and for data files for which no CHARACTERSET clause is present, the NLS_LANG environment variable (or Registry setting) defines the character set. If no NLS_LANG is defined, US7ASCII is the default (except on EBCDIC mainframes, where WE8EBCDIC1047 or WE8EBCDIC37C is the default).
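For a separate data file, the clause goes right after LOAD DATA; here is a sketch of such a control file (table, column, and file names are illustrative):

LOAD DATA
CHARACTERSET JA16SJIS
INFILE 'data_sjis.csv'
APPEND
INTO TABLE mytable
FIELDS TERMINATED BY ','
(col1, col2, col3)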
-- Sergiusz

Problem with storing Chinese characters when app running on service account

We are having a problem storing Chinese characters in an Oracle database when the program runs under a service account on a Windows 2003 server. We are running Oracle 11g.
I have a non-Unicode-compliant application that reads a data file and stores the data in an Oracle database.
- The database NLS_CHARACTERSET is UTF8
- The Key_Ora1020\NLS_LANG setting in the registry is AMERICAN_AMERICA.UTF8
- The Regional and Language Options "Language for non-Unicode programs" setting is Chinese (PRC) (and I applied it to the default profile)
When I run the application in the foreground it works fine - I can run a query in SQL*Plus against the table and see that the field is updated with Chinese characters, and I can dump the field and verify the data.
When I run the application as a service (initially under the Windows LOCAL SYSTEM account) it doesn't work - after populating the table, the query returns "??????" in the field.
I then changed the service to run under my account (the same one that works interactively), and again, when I run the application as a background service, I get "??????" in the field.
Apparently the input data is getting mapped differently via the code page somewhere when running as a service vs. in the foreground.
Any suggestions on how to troubleshoot this?
So far I have tried setting NLS_LANG as a SYSTEM environment variable - I can't think of what else to try here.
Thanks, Marty 
First, the most important information please: what technology/API does the application use to connect to Oracle?
Then, when checking the correctness of your data in the database, please use SQL Developer and not SQL*Plus. SQL*Plus is susceptible to pass-through aka garbage-in/garbage-out issues.
-- Sergiusz 
The software is proprietary - the only thing the documentation says is that the connection is via 'SOLNET'. I'm using TOAD to look at the data.
I guess I don't understand how Oracle stores/represents UTF8 data. For example, I can reduce the string I am trying to post to the database to one symbol, represented in the UTF8 tables as U+672C = e6 9c ac.
If I look at my text file used as input with a hex editor, I can see the symbol represented in hex as e6 9c ac as I would expect.
When I post that string to the field in Oracle, and then select the value I can see the correct SYMBOL - i.e. everything looks correct.
However, if I dump the value I get: typ = 96, len=2 characterset=UTF8: b1, be
Why am I getting B1, BE (which apparently does represent the symbol, because I see it correctly)? I would expect to get len=3 characterset=UTF8: e6,9c,ac.
Thanks 
915130 wrote:
However, if I dump the value I get: typ = 96, len=2 characterset=UTF8: b1, be
Why am I getting B1, BE (which apparently does represent the symbol because I see it correctly)? I would expect to get len=3 characterset=UTF8: e6,9c,ac
Plus, from the OP:
- The Key_Ora1020\NLS_LANG setting in the registry is AMERICAN_AMERICA.UTF8
- The Regional and Language Options, Language for non-Unicode programs, is Chinese (PRC) (and I applied it to the default profile)
Looking at the above, I'm pretty certain that what you see is due to a classic "gigo" setup.
The client environment (service, GUI login, etc.) is using code page 936, Simplified Chinese, but Oracle is not told so via the character set part of NLS_LANG. Instead, it is incorrectly stated that the client uses UTF-8. Then, since the DB character set is UTF8, the character coded as 0xB1 0xBE is passed through as-is, without proper conversion. This means that code units illegal in UTF-8 end up being stored.
The same thing "shows up" on the way out: no conversion, and the data appears to display correctly when in fact the character data is invalid. Aka garbage-in, garbage-out.
See the B1 lead byte page, position BE, at http://msdn.microsoft.com/en-us/goglobal/cc305153, where it says B1BE = U+672C.
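You can check that mapping directly with Python's built-in cp936 (GBK) codec; a quick sketch:

# The two bytes found in the database, interpreted in code page 936:
print(b'\xb1\xbe'.decode('cp936'))   # 本 (U+672C)

# The correct UTF-8 encoding of the same character is three bytes:
print('本'.encode('utf-8'))          # b'\xe6\x9c\xac'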
To verify the character data stored, you could use SQL Developer, which is a Unicode-aware tool and does not suffer from the "gigo" scenario.
http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html
-- orafad
I don't have a choice - I have to use the non-Unicode-compliant application to post the data. So, if the DB is UTF8 and the regional setting is Simplified Chinese, what should NLS_LANG be set to, to get the data into the database as UTF8 and back out again as local Simplified Chinese?
How was NLS_LANG set up, per the above (registry setting)?
If the application works with the code page when inputting character data, then use the .ZHS16GBK setting to match that requirement.
See e.g. section D.1.1.2, which mentions the setting for the Chinese (PRC) locale in Windows, specifically the client character set part:
- "characterset specifies the encoding used by the client application"
http://docs.oracle.com/cd/E11882_01/install.112/e24186/gblsupp.htm#BCEDGIJC
Also note that when, for example, transferring data using the traditional exp/imp tools, that is a different application that may need a different setting.
I.e., the environment derived from the registry setting may need to be overridden (by setting the env variable manually, in scripts, etc.).
Of course, the NLS_LANG FAQ should be mentioned.
-- orafad
Thanks for your help. I changed NLS_LANG to AMERICAN_AMERICA.ZHS16GBK and kept the regional settings as Simplified Chinese (PRC). The CHCP command returns 936, which I understand to be correct for Simplified Chinese.
So now the application that loads the data works fine if I run it interactively (in Windows) - I can see the correct UTF8 hex values that correspond to the Chinese characters found in my input file, and the data is displayed correctly.
However, if I run the application in the background under a service account, it does not load correctly. All of the Chinese data is loaded as ?'s. I had the application spawn a DOS chcp command to verify the code page when running in the background, and it still says 936.
But it looks to me like the code page is not recognized by an application running under a service account.
Marty 
Where do you set NLS_LANG? In the registry? As an environment variable? If as an environment variable, note that you can set it:
1) system-wide
2) for a specific user
If the latter is the case, it does not apply to a system account.
Werner 
I set it as a system-wide env variable.
Set it in Registry (HKLM/Software/Oracle/<correct Oracle Home key>). Otherwise, you will have the problem described here: http://support.microsoft.com/kb/821761
-- Sergiusz 
Thanks - yes, I set it in the registry and as a system-wide env variable. Following the changes, I rebooted the server. Still a problem.
Let's start from scratch, so that I get the whole, clean picture:
1. What is the database character set? I understand it is UTF8, right? Have you checked this with SELECT * FROM NLS_DATABASE_PARAMETERS?
2. What does the application running under the Local System account actually do?
-- Sergiusz 
Correct:
1) The database is set up with NLS_CHARACTERSET = UTF8
2) The Registry value of NLS_LANG is AMERICAN_AMERICA.ZHS16GBK
3) I’ve added a line in the system env variables NLS_LANG= AMERICAN_AMERICA.ZHS16GBK
4) The Advanced tab on the Regional and Language Options is set to "Chinese (PRC)"
5) When I check the active code page (chcp in dos) it returns 936 (Chinese PRC)
The application uses a 'proprietary' language called VGL. Basically, it is very simple - it reads a record from a file in a folder (the file happens to be a UTF-8 encoded XML file), parses out the tags (it doesn't use any XML functions - it just treats the file as a flat file), assigns the value to a VGL variable of datatype string, and then either creates a new Oracle record or updates an existing Oracle record with the string value.
Since the VGL application is non-Unicode, on the vendor's advice I have set NLS_LANG and the regional settings as above (there was some confusion regarding NLS_LANG, clarified by orafad).
As mentioned, everything now works correctly if I run the application from within Windows on the server. It still fails to map the data to Unicode in the database if I run the app as a local service. The application will need to run as a service because it acts as a 'message' interface between 2 systems. I have run the service both as 'LOCAL SERVICE' and under my own LAN ID (the same ID with which I run the app interactively).
I have asked the app vendor for support, but as of this note they don't know why it is not working when running as a service.
I posted here in the hope that others may have had a similar issue running under a service, and that it might be a Windows/Oracle issue I was not aware of - or some additional server configuration required so that non-Unicode apps running as a service will recognize the code page mapping.
If I can't get the vendor app to work, I may have to write something in .NET to run as a service and provide the same functionality.
Thanks to everyone who has read all this and spent some time thinking about it!
Marty 
Marty,
## the file happens to be a UTF-8 encoded XML file
Unless the application re-encodes the parsed contents into the Chinese code page 936, setting NLS_LANG to ZHS16GBK is actually not a good idea. NLS_LANG should be set to ZHS16GBK for interactive (GUI or Command Prompt) applications that process Chinese keyboard input and screen output, but it does not need to be set to this character set when processing text files in another encoding, such as UTF-8.
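If the loader must keep NLS_LANG=.ZHS16GBK, one conceivable workaround is to re-encode the file before the application reads it; a sketch in Python (file names are illustrative, and whether this is viable depends on the VGL runtime):

# Convert the UTF-8 XML payload to code page 936 so that the bytes actually
# match what NLS_LANG=.ZHS16GBK claims the client sends.
with open('message.xml', encoding='utf-8') as src, \
        open('message_gbk.xml', 'w', encoding='cp936') as dst:
    dst.write(src.read())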
Can you identify what DB interface is used to access Oracle? Is it OCI, ODBC, JDBC? Is the VGL runtime a C/C++ or a Java application?
-- Sergiusz
