Tuesday, January 31, 2012

Talking to multi-lingual DB2 from a .NET application

Database Code Page
IBM DB2 supports storing and processing data in many languages. A lit of supported code sets, and territory codes can be found here.  For example, supported code sets for the Chinese language is shown here
Table 12. China (PRC), territory identifier: CN
Code page Group Code set Territory code Locale Operating system
437 S-1 IBM-437 86 - -
819 S-1 ISO8859-1 86 - -
850 S-1 IBM-850 86 - -
923 S-1 ISO8859-15 86 - -
1051 S-1 roman8 86 - -
1383 D-4 IBM-eucCN 86 zh_CN AIX
1386 D-4 GBK 86 Zh_CN.GBK AIX
1208 N-1 UTF-8 86 ZH_CN AIX
935 D-4 IBM-935 86 - Host
1388 D-4 IBM-1388 86 - Host
1383 D-4 hp15CN 86 zh_CN.hp15CN HP-UX
1386 D-4 GBK 86 zh_CN.GBK Linux
1381 D-4 IBM-1381 86 - OS/2
1386 D-4 GBK 86 - OS/2
1383 D-4 eucCN 86 zh_CN SCO
1383 D-4 eucCN 86 zh_CN.eucCN SCO
1383 D-4 gb2312 86 zh Solaris
1208 N-1 UTF-8 86 zh.UTF-8 Solaris
1381 D-4 IBM-1381 86 - Windows
1386 D-4 GBK 86 - Windows
1392 D-6 GB18030 86 Zh_CN.GB18030 AIX
1392 D-6 GB18030 86 zh_CN.GB18030 Solaris
1392 D-6 GB18030 86 zh_CN.gb18030 HP-UX, Linux
See notes 8 and 10.
The code set (same as character set in Sybase) determine the language(s) that can be stored in the database. The code set maps to a IBM specific code page, which in turn maps to an operating system one. One of the ways to create a database using a specific code set is to issue the following command at the DB2 CLP
db2 => CREATE DATABASE dbname USING CODESET codeset TERRITORY territory COLLATE USING collation
The TERRITORY code is for region-specific support. For example, to create a Chinese database, one can issue
If the CODESET is not specified, the default is Unicode (code set UTF8, code page 1208). A Unicode database can store any language data.
The Group column from the table is made of a letter that identifies the code page type - S-single byte, D-double byte or N-neutral and a number. All the code pages within the same Group can convert to/from each other. All the N-Neutral groups are compatible with each other and all other groups. For example, S-1 code pages can be converted to/from each other but not to/from any other group.
The Operating System column shows the operating system that can natively support the language and code set. You can use any supported code page on any supported platform. For example, you can create a database using code set 1252 on AIX, even though the operating system itself does not support the 1252 code page. This code page, under which the database was created is called database code page
Application Code Page
The code page under which the client application runs is called application code page. For all Windows operating systems, if the DB2CODEPAGE environment variable is not set, the code page is derived from the ANSI code page setting in the Regional Options menu in the Control Panel.
Normally, Windows will not report a Unicode code page in place of the aforementioned ANSI code page. Therefore, a Windows application will typically not behave as a Unicode client. To override this behavior, the DB2CODEPAGE variable can be set to a Unicode code page (1208), which will force the application to behave as a Unicode client.
Code Page Conversion
Character conversion occurs when a client or application runs in a code page that is different from the code page of the database that it accesses. The conversion occurs on the database server machine that receives the data. If the database server receives the data, character conversion is from the application code page to the database code page. If the application machine receives the data, conversion is from the database code page to the application code page.
To determine if the code page conversion is supported, look at the Group column of the Supported territory codes and code pages table. As mentioned previously, conversion between code pages that fall under the same Group are supported. For example, a client using a code page 1252 (Group S-1) can not talk to a Chinese server using a code set IBM-1381 (Group D-4). But, note that a Unicode client (Group N-1) can talk to any DB2.
A DB2 product or platform that does not support, or that does not have support installed, for the desired combination of code pages will raise an SQLCODE -332 (SQLSTATE 57017) when you try to run your application. When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation.
Using the ADO.NET DB2 driver
The ADO.NET driver comes in three flavors - OLE DB .NET Data Provider  ODBC .NET Data Provider and IBM Data Server Provider for .NET (called DB2 .NET Data Provider in older versions). IBM recommends using the last one and as such we’ll be talking about it here.  It is available via the IBM Data Server Driver package. It is also available via other clients such as IBM Data Server Runtime Client and IBM Data Server Client.
As previously mentioned, when DB2CODEPAGE variable is not set, applications on Windows would typically use the ANSI code page (1252) reported by the underlying operating system. Since code page 1252 falls under group S-1, it seemed not possible to communicate with other groups like D-4 which include code sets like GBK and IBM-1381 (typically used for languages like Chinese).
One option would be to set DB2CODEPAGE, but AFAIK ADO.NET exposes no API for doing this and manually setting it to the Unicode (1208) code page is an option but I did not explore it further and I felt it is not the most elegant solution.
Then I stumbled upon documentation for the Unicode support in CLI driver which mentioned that “To be considered a Unicode application, the application must connect to the database using either SQLConnectW() or SQLDriverConnectW(). This will ensure that CLI will consider Unicode the preferred method of communication between itself and the database.” I sort of knew from my past work that ADO.NET drivers internally used the CLI driver to communicate with DB2. You can also see this if you open the IBM.Data.DB2.dll assembly in .NET Reflector and look at the references, you will see unmanaged DLLs like db2app.dll and db2sys.dll. And if you run dumpbin
dumpbin /exports db2app.dll

you can see all the exported CLI/ODBC functions like SQLConnect, SQLAllocEnv, SQLExecute etc.
So now I needed to know if the ADO.NET driver uses SQLConnectW() or SQLDriverConnectW(), and if it did I would know that my .NET application would communicate in Unicode with DB2.
Tracing DB2 Connections
In order to find out this, I searched for a way to trace DB2 connections. One of the methods I found was to use the DB2 CLI trace facility. In the db2cli.ini add the following lines and run a sample program that uses

the DB2 ADO.NET driver to connect to a DB2 database. If you then look at the trace file, you can see the calls that were made by the driver – and you can see that it does in fact call the SQLDriverConnectW() function! Here is some sample of the calls  
SQLInitializeADONET( )
<--- SQL_SUCCESS Time elapsed - +6.540000E-004 seconds

SQLConnectADONET( hEnv=0:1, phDbc=&02a83c34, pSetiInfo=<NULL pointer> )
---> Time elapsed - +2.849800E-002 seconds
SQLAllocHandle( fHandleType=SQL_HANDLE_DBC, hInput=0:1, phOutput=&02a83c34 )
---> Time elapsed - +2.849800E-002 seconds
SQLAllocHandle( phOutput=0:1 )
<--- SQL_SUCCESS Time elapsed - +1.058100E-002 seconds
SQLDriverConnectW( hDbc=0:1, hwnd=0:0, szConnStrIn="protocol=tcpip;hostname=yang;port=50000;database=SAMPLE;uid=administrator;pwd=********;", cbConnStrIn=87, szConnStrOut=<NULL pointer>, cbConnStrOutMax=0, pcbConnStrOut=<NULL pointer>, fDriverCompletion=SQL_DRIVER_NOPROMPT )
---> Time elapsed - +2.883200E-002 seconds
( DBMS NAME="DB2/NT64", Version="09.05.0000", Fixpack="0x26010107" )
( Application Codepage=1252, Database Codepage=1381, Char Send/Recv Codepage=1208, Graphic Send/Recv Codepage=1200 )
SQLDriverConnectW( )
<--- SQL_SUCCESS Time elapsed - +7.274694E+000 seconds
( UID="administrator" )
( PWD="" )
( HOSTNAME="yang" )
( PORT="50000" )
As you can see the application code page is still 1252, but the code page used for communicating with the database is 1208 (UTF-8) for character data and 1200 (UTF-16) for graphic data.
So to conclude, if you are using the IBM Data Server Provider for .NET then it will send/receive data in Unicode and thus will be able to communicate with DB2 servers using any code page (and language).

Refs -

Friday, January 27, 2012

Talking to multi-lingual Sybase from a .NET client


Internationalization is the process of enabling an application to support multiple languages and cultural conventions. An internationalized application uses external files to provide language-specific information at execution time. A single version of a software product can be adapted to different languages or regions, conforming to local requirements and customs without engineering changes.

Localization is the process of adapting an internationalized product to meet the requirements of one particular language or region, for example Spanish, including providing translated system messages; translations for the user interface; and the correct formats for date, time, and currency. One version of a software product may have many localized versions.
Adaptive Server includes the character set definition files and sort order definition files required for data processing support for the major business languages in Western Europe, Eastern Europe, the Middle East, Latin America, and Asia.
Sybase Language Modules provide translated system messages and formats for Chinese (Simplified), French, German, Japanese, Korean, Brazilian Portuguese, and Spanish. By default, Adaptive Server comes with U.S. English message files.
All data is encoded in your server in a special code. For example, the letter “a” is encoded as “97” in decimal. A character set is a specific collection of characters (including alphabetic and numeric characters, symbols, and nonprinting control characters) and their assigned numerical values, or codes. Character sets that are platform-specific and support a subset of languages, for example, the Western European languages, are called native or national character sets. All character sets that come with Adaptive Server, except for Unicode UTF-8, are native character sets. Refer to Supported character sets for list of supported character sets. A script is a writing system, a collection of all the elements that characterize the written form of a human language—for example, Latin, Japanese, or Arabic. Depending on the languages supported by an alphabet or script, a character set can support one or more languages. For example, the Latin alphabet supports the languages of Western Europe (see Group 1 in Table 7-1). On the other hand, the Japanese script supports only one language, Japanese. Therefore, the Group 1 character sets support multiple languages, while many character sets, such as those in Group 101, support only one language. The language or languages that are covered by a character set is called a language group.
Within a client/server network, you can support data processing in multiple languages if all the languages belong to the same language group (see Table 7-1) Unlike the native character sets just described, Unicode is an international character set that supports over 650 of the world’s languages, such as Japanese, Chinese, Russian, French, and German. Unicode allows you to mix different languages from different language groups in the same server, no matter what the platform.  Look at Table 7-1: Supported languages and character sets which enlists supported languages and the character sets.
Selecting the server default character set
When you configure your server, using the Server Config for example , you can change the default character set for the server. The default character set is the character set in which the server stores and manipulates data. Each server can have only one default character set. By default, the installation tool assumes that the native character set of the platform operating system is the server’s default character set. However, you can select any character set supported by Adaptive Server as the default on your server (see Table 7-1). 

Selecting a language for system messages

Any installation of Adaptive Server can use Language Modules containing files of messages in different languages. Adaptive Server provides Language Modules for messages in the following languages: English, Chinese (Simplified), French, German, Japanese, Korean, Brazilian Portuguese, and Spanish. If your client language is not one of these languages, you will see system messages in English, the default language.  Each client can choose to view messages in their own language at the same time, from the same server; for example, one client views system messages in French, another in Spanish, and another in German, as long as the languages are part of the same group as per Table 7-1.  Therefore, for example, if Japanese is your server language, you can display system messages only in Japanese or English. Remember that all language groups can display messages in English. If you use Unicode, you can view system messages in any of the supported languages.
If you wish to know more about the files that affect the internationalization and localization, please refer to Internationalization and localization files section  

Character set conversion in Adaptive Server

In a heterogeneous environment, Adaptive Server may need to communicate with clients running on different platforms using different character sets. Although different character sets may support the same language group (for example, ISO 8858-1 and CP 850 support the group 1 languages), they may encode the same characters differently, and hence the need for character set conversion.  The supported conversions in any particular client/server system depend on the character sets used by the server and its clients. One type of character set conversion occurs if the server uses a native character set as the default; a different type of conversion is used if the server default is Unicode UTF-8.
Conversion for native character sets : Adaptive Server supports character set conversion between native character sets belonging to the same language group. If the server has a native character set as its default, the clients’ character sets must belong to the same language group
Conversion in a Unicode system : Adaptive Server also supports character set conversion between UTF-8 and any native character set that Sybase supports. For example, a client can be using any native character set while the server uses UTF-8 character set. The native character set encoded data sent to the server would be converted to UTF8 at the server. Also, the UTF8  data in the server would be converted to the native character set before being sent to the client. Note however, that each client can view data only in the language supported by its character set.
Character set conversion is implemented on Adaptive Server in two different ways:
  • Adaptive Server direct conversions
  • Unicode conversions
Adaptive Server direct conversions support conversions between two native character sets of the same language group. For example, Adaptive Server supports conversion between CP 437 and CP 850, because both belong to the group 1 language group. Refer to Table 8-1 for details
Unicode conversions exists for all native character sets. When converting between two native character sets, Unicode conversion uses Unicode as an intermediate character set. For example, to convert between the server default character set (CP 437), and the client character set (CP 860), CP 437 is first converted to Unicode; Unicode is then converted to CP 860.

As this example illustrates, Unicode conversions may be used either when the default character set of the server is UTF-8, or a native character set. You must specifically configure your server to use Unicode conversions (unless the server’s default character set is UTF-8). Earlier versions of Adaptive Server used direct conversions, and it is the default method for character set conversions.
Making this work for a .NET client application
In case of a .NET application, the Sybase client (with ADO.NET driver)should be installed (or deployed with the application) before hand. The charset property in the connection string for ADO.NET driver determines the character set that would be used on the client side. 
  1. If left unspecified or set to charset=ServerDefault , then the server’s character set is used. This is a confusing option and I could not really understand the details of how it works. But it seems that if the client does not natively support the servers character set then the ADO.NET connection would fail with a ‘Could not load code page for requested charset’ error.
  2. If set to charset=ClientDefault then you also need to specify the CodePageType of ANSI or OEM. In case of ANSI the character set is derived based upon what is in use by the underlying Windows system (it can be seen/changed from Regional, and Language Settings or in registry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP). For English (and most other Western European languages) on Windows systems, the character set is CP 1252 (Microsoft Windows US (ANSI)). In case of OEM the code page is cp437 (on English Windows). This can be found in Windows registry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage or simply by using the chcp at the command prompt.
  3. Last option is to set it to any desired character set that the client supports.
The multi-lingual Sybase servers against which the .NET application would run could potentially be using  any character set. If the server’s default character set and CP 1252 belong to the same language group (Group 1 from Table 8-1)  or in other words, the server’s default language is from the same group as English (Group 1 from Table 7-1),  then Sybase server can do direct conversion between the two character sets. In that case, the .NET application can successfully talk to the Sybase server by specifying charset=ClientDefault or charset=ServerDefault or without  specifying the charset property at all.  But if the languages are not from the same group, which would be the case when talking to a Chinese or Japanese Sybase from a English Windows client, direct conversion would not be possible. In this case use of Unicode conversion is possible where by CP 1252 is converted to UTF8, which is then converted to whatever native character set is used by the server.  Unicode conversion  can be enabled on the server by running sp_configure with the ‘enable unicode conversion’ option . Refer to this document for more details. By default, it is turned off. However, in this case, characters from the server’s native character set that are not present in the client’s CP 1252 charset would be replaced with some substitution character (such as a ?, a question mark). So, for example, if the .NET application needs to talk to a Chinese Sybase server (where the charset could be gb18030 or eucgb etc), the Chinese characters would show up as series of ?(questions marks).  That’s is no good. But if we use the 3rd option where we specify charset=utf8 the server would convert its native character set to UTF8 which by its nature supports all possible characters from the servers native character set, there would be no need for substitution characters. And since a .NET application natively understands UTF8, it can always specify charset=utf8 and everything should work well. But the use of this option requires that Unicode conversion  be enabled on the server, which is not enabled by default. Hence one approach that can be taken is to try the 1st option and if the error is ‘Could not load code page for requested charset’ (error message no 310061) then try the 3rd option with UTF8 as the charset. The 2nd option is not much use because of possibility of server characters missing from client character set.

Friday, January 6, 2012

Oracle Instant Client and NLS_LANG in Windows environment

When using Oracle Instant client in a multi-lingual environment, the NLS_LANG parameter needs to be set manually. When installing other flavors of Oracle client, the Universal Installer does this for you automatically. The Instant client doesn't have any installer - it is just a bunch of libraries which you unzip into a directory of your choice. Our .NET application uses (also redistributes) the Instant client, so for its proper operation in multi-lingual environment it needs to correctly set NLS_LANG.

NLS_LANG is of the form <language>_<country/region>.<character set>
On Windows, NLS_LANG can be set in Registry or as an environmental variable. On UNIX it is usually set as  an environmental variable. The current locale settings in effect determines the values for language, country/region and the character set. For example, when using a non-unicode console based client like SQLPlus On Windows 7, the current locale settings is determined by the Language for Non-Unicode programs under Control Panel -> Clock, Language and Region -> Region and Language ->Administrative . On my system this setting is English (United States) where English is the language and United States is the country/region . The corresponding terms in Oracle would be AMERICAN (language) and AMERICA (country/region). The character set is determined by the Windows codepage in use - which one can look up in the registry (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\) or use the National Language Support (NLS) API Reference. For English (United States) the ANSI codepage is 1252 and the OEM codepage is 437. The ANSI codepage is used for GUI programs and the OEM codepage is used for console programs. Since SQLPlus is a console program, the OEM codepage of 437 would be used. The corresponding Oracle character set  for this OEM codepage would be US8PC437. The value of NLS_LANG in this case would be AMERICAN_AMERICA.US8PC437

If the client is a .NET application, the current locale can be found  by looking at the CurrentCulture.Name property of System.Threading.Thread.CurrentThread. For example, on one of my test machines with a Chinese version of Windows 2008,  the culture name is zh-CN, which in Oracle's terms translates to 'SIMPLIFIED CHINESE' (language) and 'CHINA' (country/region). And because .NET uses Unicode the character set  in Oracle's terms would be UTF8 or AL32UTF8. The NLS_LANG would thenlook like

Its possible to leave the region/country portion blank, in which case a default matching the language would be chosen by Oracle. To help with mapping the language/region from Microsoft to Oracle here are two references
1. The National Language Support (NLS) API Reference lists all the supported locales on Windows.
2. Oracle's list of supported languages/regions can be found here in its Globalization Support Guide 

Some more useful references on this topic - unfortunately they both require a support account
1. The correct NLS_LANG in a Windows Environment [ID 179133.1] (requires support account)
2. NLS_LANG Explained (How does Client-Server Character Conversion Work?) [ID 158577.1 (requires support account)