Tuesday, January 31, 2012

Talking to multi-lingual DB2 from a .NET application

Database Code Page
IBM DB2 supports storing and processing data in many languages. A lit of supported code sets, and territory codes can be found here.  For example, supported code sets for the Chinese language is shown here
Table 12. China (PRC), territory identifier: CN
Code page Group Code set Territory code Locale Operating system
437 S-1 IBM-437 86 - -
819 S-1 ISO8859-1 86 - -
850 S-1 IBM-850 86 - -
923 S-1 ISO8859-15 86 - -
1051 S-1 roman8 86 - -
1383 D-4 IBM-eucCN 86 zh_CN AIX
1386 D-4 GBK 86 Zh_CN.GBK AIX
1208 N-1 UTF-8 86 ZH_CN AIX
935 D-4 IBM-935 86 - Host
1388 D-4 IBM-1388 86 - Host
1383 D-4 hp15CN 86 zh_CN.hp15CN HP-UX
1386 D-4 GBK 86 zh_CN.GBK Linux
1381 D-4 IBM-1381 86 - OS/2
1386 D-4 GBK 86 - OS/2
1383 D-4 eucCN 86 zh_CN SCO
1383 D-4 eucCN 86 zh_CN.eucCN SCO
1383 D-4 gb2312 86 zh Solaris
1208 N-1 UTF-8 86 zh.UTF-8 Solaris
1381 D-4 IBM-1381 86 - Windows
1386 D-4 GBK 86 - Windows
1392 D-6 GB18030 86 Zh_CN.GB18030 AIX
1392 D-6 GB18030 86 zh_CN.GB18030 Solaris
1392 D-6 GB18030 86 zh_CN.gb18030 HP-UX, Linux
See notes 8 and 10.
The code set (same as character set in Sybase) determine the language(s) that can be stored in the database. The code set maps to a IBM specific code page, which in turn maps to an operating system one. One of the ways to create a database using a specific code set is to issue the following command at the DB2 CLP
db2 => CREATE DATABASE dbname USING CODESET codeset TERRITORY territory COLLATE USING collation
The TERRITORY code is for region-specific support. For example, to create a Chinese database, one can issue
If the CODESET is not specified, the default is Unicode (code set UTF8, code page 1208). A Unicode database can store any language data.
The Group column from the table is made of a letter that identifies the code page type - S-single byte, D-double byte or N-neutral and a number. All the code pages within the same Group can convert to/from each other. All the N-Neutral groups are compatible with each other and all other groups. For example, S-1 code pages can be converted to/from each other but not to/from any other group.
The Operating System column shows the operating system that can natively support the language and code set. You can use any supported code page on any supported platform. For example, you can create a database using code set 1252 on AIX, even though the operating system itself does not support the 1252 code page. This code page, under which the database was created is called database code page
Application Code Page
The code page under which the client application runs is called application code page. For all Windows operating systems, if the DB2CODEPAGE environment variable is not set, the code page is derived from the ANSI code page setting in the Regional Options menu in the Control Panel.
Normally, Windows will not report a Unicode code page in place of the aforementioned ANSI code page. Therefore, a Windows application will typically not behave as a Unicode client. To override this behavior, the DB2CODEPAGE variable can be set to a Unicode code page (1208), which will force the application to behave as a Unicode client.
Code Page Conversion
Character conversion occurs when a client or application runs in a code page that is different from the code page of the database that it accesses. The conversion occurs on the database server machine that receives the data. If the database server receives the data, character conversion is from the application code page to the database code page. If the application machine receives the data, conversion is from the database code page to the application code page.
To determine if the code page conversion is supported, look at the Group column of the Supported territory codes and code pages table. As mentioned previously, conversion between code pages that fall under the same Group are supported. For example, a client using a code page 1252 (Group S-1) can not talk to a Chinese server using a code set IBM-1381 (Group D-4). But, note that a Unicode client (Group N-1) can talk to any DB2.
A DB2 product or platform that does not support, or that does not have support installed, for the desired combination of code pages will raise an SQLCODE -332 (SQLSTATE 57017) when you try to run your application. When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation.
Using the ADO.NET DB2 driver
The ADO.NET driver comes in three flavors - OLE DB .NET Data Provider  ODBC .NET Data Provider and IBM Data Server Provider for .NET (called DB2 .NET Data Provider in older versions). IBM recommends using the last one and as such we’ll be talking about it here.  It is available via the IBM Data Server Driver package. It is also available via other clients such as IBM Data Server Runtime Client and IBM Data Server Client.
As previously mentioned, when DB2CODEPAGE variable is not set, applications on Windows would typically use the ANSI code page (1252) reported by the underlying operating system. Since code page 1252 falls under group S-1, it seemed not possible to communicate with other groups like D-4 which include code sets like GBK and IBM-1381 (typically used for languages like Chinese).
One option would be to set DB2CODEPAGE, but AFAIK ADO.NET exposes no API for doing this and manually setting it to the Unicode (1208) code page is an option but I did not explore it further and I felt it is not the most elegant solution.
Then I stumbled upon documentation for the Unicode support in CLI driver which mentioned that “To be considered a Unicode application, the application must connect to the database using either SQLConnectW() or SQLDriverConnectW(). This will ensure that CLI will consider Unicode the preferred method of communication between itself and the database.” I sort of knew from my past work that ADO.NET drivers internally used the CLI driver to communicate with DB2. You can also see this if you open the IBM.Data.DB2.dll assembly in .NET Reflector and look at the references, you will see unmanaged DLLs like db2app.dll and db2sys.dll. And if you run dumpbin
dumpbin /exports db2app.dll

you can see all the exported CLI/ODBC functions like SQLConnect, SQLAllocEnv, SQLExecute etc.
So now I needed to know if the ADO.NET driver uses SQLConnectW() or SQLDriverConnectW(), and if it did I would know that my .NET application would communicate in Unicode with DB2.
Tracing DB2 Connections
In order to find out this, I searched for a way to trace DB2 connections. One of the methods I found was to use the DB2 CLI trace facility. In the db2cli.ini add the following lines and run a sample program that uses

the DB2 ADO.NET driver to connect to a DB2 database. If you then look at the trace file, you can see the calls that were made by the driver – and you can see that it does in fact call the SQLDriverConnectW() function! Here is some sample of the calls  
SQLInitializeADONET( )
<--- SQL_SUCCESS Time elapsed - +6.540000E-004 seconds

SQLConnectADONET( hEnv=0:1, phDbc=&02a83c34, pSetiInfo=<NULL pointer> )
---> Time elapsed - +2.849800E-002 seconds
SQLAllocHandle( fHandleType=SQL_HANDLE_DBC, hInput=0:1, phOutput=&02a83c34 )
---> Time elapsed - +2.849800E-002 seconds
SQLAllocHandle( phOutput=0:1 )
<--- SQL_SUCCESS Time elapsed - +1.058100E-002 seconds
SQLDriverConnectW( hDbc=0:1, hwnd=0:0, szConnStrIn="protocol=tcpip;hostname=yang;port=50000;database=SAMPLE;uid=administrator;pwd=********;", cbConnStrIn=87, szConnStrOut=<NULL pointer>, cbConnStrOutMax=0, pcbConnStrOut=<NULL pointer>, fDriverCompletion=SQL_DRIVER_NOPROMPT )
---> Time elapsed - +2.883200E-002 seconds
( DBMS NAME="DB2/NT64", Version="09.05.0000", Fixpack="0x26010107" )
( Application Codepage=1252, Database Codepage=1381, Char Send/Recv Codepage=1208, Graphic Send/Recv Codepage=1200 )
SQLDriverConnectW( )
<--- SQL_SUCCESS Time elapsed - +7.274694E+000 seconds
( UID="administrator" )
( PWD="" )
( HOSTNAME="yang" )
( PORT="50000" )
As you can see the application code page is still 1252, but the code page used for communicating with the database is 1208 (UTF-8) for character data and 1200 (UTF-16) for graphic data.
So to conclude, if you are using the IBM Data Server Provider for .NET then it will send/receive data in Unicode and thus will be able to communicate with DB2 servers using any code page (and language).

Refs -

No comments:

Post a Comment