Friday, January 27, 2012

Talking to multi-lingual Sybase from a .NET client


Internationalization is the process of enabling an application to support multiple languages and cultural conventions. An internationalized application uses external files to provide language-specific information at execution time. A single version of a software product can be adapted to different languages or regions, conforming to local requirements and customs without engineering changes.

Localization is the process of adapting an internationalized product to meet the requirements of one particular language or region, for example Spanish, including providing translated system messages; translations for the user interface; and the correct formats for date, time, and currency. One version of a software product may have many localized versions.
Adaptive Server includes the character set definition files and sort order definition files required for data processing support for the major business languages in Western Europe, Eastern Europe, the Middle East, Latin America, and Asia.
Sybase Language Modules provide translated system messages and formats for Chinese (Simplified), French, German, Japanese, Korean, Brazilian Portuguese, and Spanish. By default, Adaptive Server comes with U.S. English message files.
All data on your server is stored as numeric codes. For example, the letter “a” is encoded as “97” in decimal. A character set is a specific collection of characters (including alphabetic and numeric characters, symbols, and nonprinting control characters) and their assigned numerical values, or codes. Character sets that are platform-specific and support a subset of languages, for example, the Western European languages, are called native or national character sets. All character sets that come with Adaptive Server, except for Unicode UTF-8, are native character sets. Refer to “Supported character sets” for a list of supported character sets.
A script is a writing system, a collection of all the elements that characterize the written form of a human language (for example, Latin, Japanese, or Arabic). Depending on the languages supported by an alphabet or script, a character set can support one or more languages. For example, the Latin alphabet supports the languages of Western Europe (see Group 1 in Table 7-1). On the other hand, the Japanese script supports only one language, Japanese. Therefore, the Group 1 character sets support multiple languages, while many character sets, such as those in Group 101, support only one language. The language or languages covered by a character set are called a language group.
Within a client/server network, you can support data processing in multiple languages if all the languages belong to the same language group (see Table 7-1). Unlike the native character sets just described, Unicode is an international character set that supports over 650 of the world’s languages, such as Japanese, Chinese, Russian, French, and German. Unicode allows you to mix different languages from different language groups in the same server, no matter what the platform. Table 7-1, “Supported languages and character sets”, lists the supported languages and their character sets.
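As a small illustration of character codes and language coverage, here is a sketch in Python (used only for illustration; it is not part of Sybase or .NET). It assumes Python's built-in cp1252 codec behaves like a Group 1 native character set; the helper name can_encode is made up for this example.

```python
# Illustration only: Python's built-in codecs stand in for server character sets.

letter = "a"
print(ord(letter))              # 97, the decimal code of "a"
print(letter.encode("cp1252"))  # b'a', i.e. the single byte 0x61 (= 97)


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of `text` has a code in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# CP 1252 covers Western European text but has no codes for Japanese script,
# whereas UTF-8 can represent both.
print(can_encode("français", "cp1252"))
print(can_encode("日本語", "cp1252"))
print(can_encode("日本語", "utf-8"))
```

This is why a native character set can only serve the languages of its own group, while UTF-8 can serve all of them.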
Selecting the server default character set
When you configure your server, using Server Config for example, you can change the default character set for the server. The default character set is the character set in which the server stores and manipulates data. Each server can have only one default character set. By default, the installation tool assumes that the native character set of the platform operating system is the server’s default character set. However, you can select any character set supported by Adaptive Server as the default on your server (see Table 7-1).

Selecting a language for system messages

Any installation of Adaptive Server can use Language Modules containing files of messages in different languages. Adaptive Server provides Language Modules for messages in the following languages: English, Chinese (Simplified), French, German, Japanese, Korean, Brazilian Portuguese, and Spanish. If your client language is not one of these languages, you will see system messages in English, the default language. Each client can view messages in its own language at the same time, from the same server; for example, one client views system messages in French, another in Spanish, and another in German, as long as the languages are part of the same group as per Table 7-1. So if Japanese is your server language, for example, you can display system messages only in Japanese or English. Remember that all language groups can display messages in English. If you use Unicode, you can view system messages in any of the supported languages.
If you wish to know more about the files that affect internationalization and localization, please refer to the “Internationalization and localization files” section.

Character set conversion in Adaptive Server

In a heterogeneous environment, Adaptive Server may need to communicate with clients running on different platforms using different character sets. Although different character sets may support the same language group (for example, ISO 8859-1 and CP 850 both support the Group 1 languages), they may encode the same characters differently, hence the need for character set conversion. The supported conversions in any particular client/server system depend on the character sets used by the server and its clients. One type of character set conversion occurs if the server uses a native character set as the default; a different type of conversion is used if the server default is Unicode UTF-8.
Conversion for native character sets: Adaptive Server supports character set conversion between native character sets belonging to the same language group. If the server has a native character set as its default, the clients’ character sets must belong to the same language group.
Conversion in a Unicode system: Adaptive Server also supports character set conversion between UTF-8 and any native character set that Sybase supports. For example, a client can use any native character set while the server uses the UTF-8 character set. Data sent to the server in the native character set is converted to UTF-8 at the server, and UTF-8 data in the server is converted to the native character set before being sent to the client. Note, however, that each client can view data only in the languages supported by its character set.
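To make "the same characters encoded differently" concrete, here is a short Python sketch. It assumes Python's iso-8859-1 and cp850 codecs match the character sets of the same names mentioned above.

```python
# Illustration only: the same character has different codes in two Group 1 sets.

ch = "é"
latin1_bytes = ch.encode("iso-8859-1")
cp850_bytes = ch.encode("cp850")

print(latin1_bytes)  # b'\xe9'
print(cp850_bytes)   # b'\x82'

# Without conversion, bytes written in one character set are misread in the other.
misread = latin1_bytes.decode("cp850")
print(misread == ch)  # False
```

Character set conversion exists to prevent exactly this kind of misreading between client and server.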
Character set conversion is implemented on Adaptive Server in two different ways:
  • Adaptive Server direct conversions
  • Unicode conversions
Adaptive Server direct conversions support conversions between two native character sets of the same language group. For example, Adaptive Server supports conversion between CP 437 and CP 850, because both belong to the Group 1 language group. Refer to Table 8-1 for details.
Unicode conversions exist for all native character sets. When converting between two native character sets, Unicode conversion uses Unicode as an intermediate character set. For example, to convert between the server default character set (CP 437) and the client character set (CP 860), CP 437 is first converted to Unicode; Unicode is then converted to CP 860.
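The two-step idea of a Unicode conversion can be sketched in Python as follows. This only mimics the concept with Python's own cp437 and cp860 codecs; it is not how Adaptive Server implements it, and the function name convert_via_unicode is invented for the example.

```python
# Illustration only: convert between two native character sets via Unicode.

def convert_via_unicode(data: bytes, source: str, target: str) -> bytes:
    """Decode `data` from `source` into Unicode, then encode it as `target`."""
    text = data.decode(source)   # step 1: native character set -> Unicode
    return text.encode(target)   # step 2: Unicode -> native character set


server_bytes = "Olá".encode("cp437")  # data as stored by a CP 437 server
client_bytes = convert_via_unicode(server_bytes, "cp437", "cp860")
print(client_bytes.decode("cp860"))   # Olá
```

Because every native character set can be mapped to and from Unicode, this route works for any pair of character sets, provided the characters involved exist in the target set.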

As this example illustrates, Unicode conversions may be used whether the default character set of the server is UTF-8 or a native character set. You must specifically configure your server to use Unicode conversions (unless the server’s default character set is UTF-8). Earlier versions of Adaptive Server used direct conversions, which remain the default method for character set conversion.
Making this work for a .NET client application
For a .NET application, the Sybase client (with the ADO.NET driver) should be installed, or deployed with the application, beforehand. The charset property in the connection string for the ADO.NET driver determines the character set used on the client side.
  1. If left unspecified or set to charset=ServerDefault, the server’s character set is used. This is a confusing option and I could not really understand the details of how it works. But it seems that if the client does not natively support the server’s character set, the ADO.NET connection fails with a ‘Could not load code page for requested charset’ error.
  2. If set to charset=ClientDefault, you also need to specify a CodePageType of ANSI or OEM. With ANSI, the character set is the one in use by the underlying Windows system (it can be seen or changed in Regional and Language Settings, or in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP). For English (and most other Western European languages) on Windows, the character set is CP 1252 (Microsoft Windows US (ANSI)). With OEM, the code page is cp437 (on English Windows). This can be found in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage (the OEMCP value) or simply by running chcp at the command prompt.
  3. The last option is to set it to any desired character set that the client supports.
The multi-lingual Sybase servers against which the .NET application runs could be using any character set. If the server’s default character set and CP 1252 belong to the same language group (Group 1 in Table 8-1), or in other words, the server’s default language is from the same group as English (Group 1 in Table 7-1), then the Sybase server can do direct conversion between the two character sets. In that case, the .NET application can talk to the Sybase server by specifying charset=ClientDefault or charset=ServerDefault, or without specifying the charset property at all.
But if the languages are not from the same group, as when talking to a Chinese or Japanese Sybase server from an English Windows client, direct conversion is not possible. Unicode conversion can be used instead: CP 1252 is converted to UTF-8, which is then converted to whatever native character set the server uses. Unicode conversion is turned off by default; it can be enabled on the server by running sp_configure with the ‘enable unicode conversions’ option. Refer to the Sybase documentation for more details.
Even then, characters from the server’s native character set that are not present in the client’s CP 1252 character set are replaced with a substitution character (such as a question mark). So, for example, if the .NET application needs to talk to a Chinese Sybase server (where the character set could be gb18030, eucgb, and so on), the Chinese characters show up as a series of question marks. That is no good.
If instead we use the third option and specify charset=utf8, the server converts its native character set to UTF-8, which can represent every character in the server’s native character set, so no substitution characters are needed. And since a .NET application handles UTF-8 natively, it can always specify charset=utf8 and everything should work well.
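The difference between a CP 1252 client and a UTF-8 client can be imitated in Python. This is only a sketch: it assumes Python's gb18030, cp1252 and utf-8 codecs as stand-ins, and uses Python's errors="replace" mode to mimic the substitution character; the exact substitution behaviour of Adaptive Server may differ.

```python
# Illustration only: what a client sees when Chinese server data is converted
# to CP 1252 versus UTF-8.

server_text = "中文"
server_bytes = server_text.encode("gb18030")  # data in the server's native set

# Client using CP 1252: characters with no CP 1252 code are substituted.
unicode_text = server_bytes.decode("gb18030")
as_cp1252 = unicode_text.encode("cp1252", errors="replace")
print(as_cp1252)  # b'??'

# Client using UTF-8: every character survives the conversion.
as_utf8 = unicode_text.encode("utf-8")
print(as_utf8.decode("utf-8") == server_text)  # True
```

The UTF-8 path corresponds to specifying charset=utf8 on the client.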
Using charset=utf8, however, requires that Unicode conversion be enabled on the server, which it is not by default. Hence one approach is to try the first option and, if it fails with ‘Could not load code page for requested charset’ (error message number 310061), to try the third option with UTF-8 as the character set. The second option is of little use because server characters may be missing from the client character set.
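The try-then-fall-back approach can be sketched as below. The sketch is in Python purely to show the control flow; in a real application this would be written against the Sybase ADO.NET driver. The ConnectError class, the connect callable and the "Data Source=host" base string are hypothetical placeholders; only the charset values and the error number 310061 come from the text above.

```python
# Sketch of the fallback strategy; all names here are hypothetical.

CODE_PAGE_ERROR = 310061  # 'Could not load code page for requested charset'


class ConnectError(Exception):
    """Hypothetical stand-in for the driver's exception type."""

    def __init__(self, number: int, message: str = ""):
        super().__init__(message)
        self.number = number


def connect_with_fallback(connect, base: str):
    """Try the server default charset first; on the code page error, retry with utf8.

    `connect` is any callable that takes a connection string and either
    returns a connection or raises ConnectError.
    """
    try:
        return connect(f"{base};charset=ServerDefault")
    except ConnectError as err:
        if err.number != CODE_PAGE_ERROR:
            raise  # a different failure: do not mask it
        return connect(f"{base};charset=utf8")


# Usage with a fake connector that rejects the server default charset.
attempts = []

def fake_connect(connection_string: str) -> str:
    attempts.append(connection_string)
    if connection_string.endswith("charset=ServerDefault"):
        raise ConnectError(CODE_PAGE_ERROR, "Could not load code page for requested charset")
    return connection_string

result = connect_with_fallback(fake_connect, "Data Source=host")
print(result)  # Data Source=host;charset=utf8
```

Errors other than 310061 are re-raised so that genuine connection problems are not hidden by the retry.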
