Omnis Technical Note TNSQ0028 August 2010

Omnis Character Mapping Explained

for Omnis Studio 5.0.1 and later
by Gary Ashford

Introduction
In this Technote, we attempt to explain the processes used to convert and map Omnis character data when written to and read from an arbitrary database, and to illustrate how the various properties introduced in Studio 5 are used. The intended audience is developers who are porting non-Unicode applications to Studio 5 or who need to access data from non-Unicode databases. For any new applications written using Studio 5, we recommend using DAMs in the default "Unicode mode" ($unicode=kTrue).

Please note that "character mapping" can only be performed when the session object is operating in non-Unicode mode, since character maps apply only to 8-bit data and provide conversion between various ANSI code pages. When the session object is operating in Unicode mode, the conversion to and from the Unicode encoding expected by the database is carried out automatically by the DAM.

Omnis Character Mapping
Historically, Omnis has supported three types of character mapping: using the native character set, the Omnis character set and custom mapping tables (implemented using .in and .out files), described as follows:

  • The Omnis Character Set option causes character data to be passed to and from the DAM in the internal Omnis character set.
  • The Native API Character Set option causes character data to be passed to and from the DAM in the native character set for the platform in question. For example, on Windows this means that data is exchanged between Omnis and the DAM in the ANSI character set. If you are using national characters (characters with a value greater than 127), then this option may be more appropriate for some DAMs, especially if the data you store in the database needs to be accessible to applications other than Omnis. To use this option, you need to understand the character set used by the database server. If the character set is neither the Omnis character set nor the native API character set, then you will need to use a character mapping table to handle national characters... Note that when using a character mapping table, you should select the Omnis Character Set option.

Thus, there is provision for developers who want to write cross platform applications whose data is used exclusively by Omnis and for developers who need their data to be compliant with external applications and specific database character sets.

On the Linux platform, it should be noted that the native character set is ISO8859-P1 (Latin 1), which is a subset of the Windows CP1252 (or "ANSI") character set. Character values in the range 0x80 to 0x9F are not defined or displayable in ISO8859-P1. If an application is to be cross-platform between Windows, Mac and Linux, and the destination character set is ISO8859, then CP1252 characters in the range 0x80 to 0x9F should be avoided or otherwise mapped to different character codes. These include, for example, the Euro currency symbol, hooked f, trademark symbol and oe ligature.
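
The point about the 0x80 to 0x9F range can be checked with a short Python sketch (Python rather than Omnis code). It assumes that Python's cp1252 and iso8859_1 codecs are adequate stand-ins for the Windows and Linux native character sets; the helper name cp1252_only is hypothetical.

```python
# Characters named in the text above, given as Unicode code points.
SAMPLES = {
    "euro sign": "\u20ac",
    "hooked f": "\u0192",
    "trademark sign": "\u2122",
    "oe ligature": "\u0153",
}


def cp1252_only(ch: str) -> bool:
    """Return True if ch exists in CP1252 but has no ISO8859-1 code."""
    ch.encode("cp1252")  # raises UnicodeEncodeError if not in CP1252 at all
    try:
        ch.encode("iso8859_1")
    except UnicodeEncodeError:
        return True
    return False


for name, ch in SAMPLES.items():
    code = ch.encode("cp1252")[0]
    print(f"{name}: CP1252 code 0x{code:02X}, CP1252-only: {cp1252_only(ch)}")
```

Each of the sample characters has a CP1252 code between 0x80 and 0x9F and no ISO8859-1 equivalent, which is why they need to be avoided or remapped in a cross-platform application.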

Character Mapping Diagram
The following (simplified) diagram illustrates the processes involved in converting Omnis character data during insertion into and retrieval from an external database.


Omnis Studio 5.0: Functional flowchart illustrating input and output character mapping


From the above diagram, it is possible to infer the following:

  • When a session object is operating in Unicode mode ($unicode = kTrue), the only conversion which takes place is conversion to and from the database encoding; i.e. conversion from the Omnis UTF32 encoding to "$encoding" upon insertion of data and conversion from "$encoding" to UTF32 upon reading data.

    As of Studio 5.0.1, $encoding is a read-only property and is hard-coded according to the value required by the database/client API being used. For Oracle, for example, this is set to kSessionEncodingUtf16, whereas for MySQL it is set to kSessionEncodingUtf8.
  • 8-bit character mapping requires that character data is first converted to the specified 8-bit ANSI codepage. After any character mapping has been performed, the data must then be converted back to the encoding expected by the client API.

    The $codepage property is used to specify the codepage required and accepts any of the following constant values (see Catalog/F9->Unicode types):

    kUniTypeAnsiArabic
    kUniTypeAnsiBaltic
    kUniTypeAnsiCentralEuropean
    kUniTypeAnsiCyrillic
    kUniTypeAnsiGreek
    kUniTypeAnsiHebrew
    kUniTypeAnsiLatin1
    kUniTypeAnsiThai
    kUniTypeAnsiTurkish
    kUniTypeAnsiVietnamese
    kUniTypeISO8859_1 - kUniTypeISO8859_16


    This means that the DAM will attempt to find any Unicode characters encountered within the specified codepage. Any Unicode characters not catered for by the codepage will be mapped to a "." (0x2E) character. When fetching and converting from these codepages, the DAM assumes that fetched data will consist of characters from the specified codepage. Any incoming 8-bit characters that are not part of the codepage will likewise be mapped to a "." character.

    In addition, kUniTypeNativeCharacters can be assigned to $codepage. When this value is specified, the DAM uses an identity mapping: Outgoing Unicode code points are interpreted directly "as" 8-bit character codes and vice-versa. Using this codepage, any Unicode characters (>0xFF) are converted to a "."  

    Referring to the diagram, the "net" effect of kSessionCharMapNative appears to be conversion from $encoding to $codepage, then back again. Wouldn't it be more efficient to simply skip character mapping in this case? The round trip is deliberate. For example, conversion from UTF-16 to kUniTypeAnsiLatin1 results in characters not present in the Latin1 codepage being eliminated from the data (replaced by "."s). When the data is converted back to UTF-16, this ensures that the database never "sees" characters which may be incompatible with its character set, thus avoiding any potential insertion errors. (Unicode DAMs must still pass data using the API encoding even when the target database supports only non-Unicode data.)
     
  • Once converted to 8-bit data, the old (pre-Studio 5) character mapping rules are applied.
     
    That is, if $charmap is set to kSessionCharMapOmnis or kSessionCharMapTable, outgoing data is converted to the Omnis character set. (On the Mac this step is skipped, as data is assumed to be already in the Omnis character set.) If $charmap is set to kSessionCharMapNative, conversion is also skipped.

    If $charmap is set to kSessionCharMapTable, the custom character map is then applied to the data. (Custom character maps assume that the supplied data will be in the Omnis character set).

    Oracle users: it may be of interest to note that when the Oracle session property $internalcharmapping is set to kFalse, Windows to Omnis and Omnis to Windows character mapping is disabled even when $charmap=kSessionCharMapOmnis or kSessionCharMapTable. Thus, this property enables custom mapping tables to be applied to native character data if required.

  • When data is read from the database, the inverse conversion process is applied.

    Note that when $charmap is set to kSessionCharMapTable, incoming data is assumed to be in the Omnis character set. Omnis character data is converted to the Windows character set after custom mapping has been applied. When $charmap is set to kSessionCharMapNative, no character set conversion is performed.

    When reading data, kSessionCharMapOmnis/kSessionCharMapTable implies that data should be converted from the Omnis character set to the Native character set. (Incoming character set conversion is skipped on the Mac platform). When $codepage is set to kUniTypeNativeCharacters, each byte "becomes" the Unicode codepoint for that character.
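
The codepage conversion steps described above can be sketched in Python. This is an illustration under stated assumptions, not the DAM's implementation: Python's cp1252 codec stands in for kUniTypeAnsiLatin1, "utf-16-le" stands in for an $encoding of kSessionEncodingUtf16, and all function names are hypothetical.

```python
def to_codepage(text: str, codepage: str) -> bytes:
    """Convert Unicode text to 8-bit data, substituting "." (0x2E) for
    any character that the codepage cannot represent."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            out += b"."
    return bytes(out)


def from_codepage(data: bytes, codepage: str) -> str:
    """Convert 8-bit data back to Unicode, substituting "." for any
    byte that is not defined in the codepage."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode(codepage))
        except UnicodeDecodeError:
            chars.append(".")
    return "".join(chars)


def native_identity(text: str) -> bytes:
    """Identity mapping in the style of kUniTypeNativeCharacters: code
    points up to 0xFF become the byte of the same value, and anything
    above 0xFF becomes "."."""
    return bytes(ord(ch) if ord(ch) <= 0xFF else 0x2E for ch in text)


def outgoing_native(text: str, codepage: str = "cp1252",
                    encoding: str = "utf-16-le") -> bytes:
    """Round trip for native mapping: filter through the codepage, then
    re-encode in the encoding expected by the client API."""
    eight_bit = to_codepage(text, codepage)
    return from_codepage(eight_bit, codepage).encode(encoding)


print(to_codepage("\u03a9mega caf\u00e9", "cp1252"))
print(native_identity("\u00e9\u20ac"))
print(outgoing_native("a\u03a9"))
```

The round trip in outgoing_native is what filters out characters that the target character set cannot hold: the Greek omega never reaches the database, only the "." that replaced it.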

Omnis to Windows Character Conversion
The following table shows the legacy mappings for all MacRoman extended characters to a notional Windows character set. This mapping table is inherited from that used by the old-style DAMs and is of uncertain origin. It will be noted that certain characters for which there are corresponding characters in the Windows 1252 character set are not mapped correctly (shown highlighted); notably the dagger, bullet point and trade mark symbols as well as certain accented characters. For other MacRoman characters which legitimately do not exist in the CP1252 character set, unique character codes have been designated. Character codes are shown in both hex and decimal format:

MacRoman character                MacRoman code   Windows code    Corresponding
                                  (MacToWin ->)   (<- WinToMac)   CP1252/ANSI character
Ä  A diaeresis                    80/128          C4/196          Ä
Å  A ring                         81/129          C5/197          Å
Ç  C cedilla                      82/130          C7/199          Ç
É  E acute                        83/131          C9/201          É
Ñ  N tilde                        84/132          D1/209          Ñ
Ö  O diaeresis                    85/133          D6/214          Ö
Ü  U diaeresis                    86/134          DC/220          Ü
á  a acute                        87/135          E1/225          á
à  a grave                        88/136          E0/224          à
â  a circumflex                   89/137          E2/226          â
ä  a diaeresis                    8A/138          E4/228          ä
ã  a tilde                        8B/139          E3/227          ã
å  a ring                         8C/140          E5/229          å
ç  c cedilla                      8D/141          E7/231          ç
é  e acute                        8E/142          E9/233          é
è  e grave                        8F/143          E8/232          è
ê  e circumflex                   90/144          EA/234          ê
ë  e diaeresis                    91/145          EB/235          ë
í  i acute                        92/146          ED/237          í
ì  i grave                        93/147          EC/236          ì
î  i circumflex                   94/148          EE/238          î
ï  i diaeresis                    95/149          EF/239          ï
ñ  n tilde                        96/150          F1/241          ñ
ó  o acute                        97/151          F3/243          ó
ò  o grave                        98/152          F2/242          ò
ô  o circumflex                   99/153          F4/244          ô
ö  o diaeresis                    9A/154          F6/246          ö
õ  o tilde                        9B/155          F5/245          õ
ú  u acute                        9C/156          FA/250          ú
ù  u grave                        9D/157          F9/249          ù
û  u circumflex                   9E/158          FB/251          û
ü  u diaeresis                    9F/159          FC/252          ü
†  dagger                         A0/160          8A/138          Š
°  degree sign                    A1/161          B0/176          °
¢  cent sign                      A2/162          A2/162          ¢
£  pound sign                     A3/163          A3/163          £
§  section sign                   A4/164          A7/167          §
•  bullet                         A5/165          AF/175          ¯
¶  pilcrow sign                   A6/166          B6/182          ¶
ß  sz ligature                    A7/167          DF/223          ß
®  registered sign                A8/168          AE/174          ®
©  copyright sign                 A9/169          A9/169          ©
™  trademark sign                 AA/170          81/129          not defined
´  acute accent                   AB/171          B4/180          ´
¨  diaeresis                      AC/172          A8/168          ¨
≠  not equal to                   AD/173          82/130          ‚
Æ  AE ligature                    AE/174          C6/198          Æ
Ø  O slash                        AF/175          D8/216          Ø
∞  infinity                       B0/176          83/131          ƒ
±  plus-minus sign                B1/177          B1/177          ±
≤  less than or equal to          B2/178          84/132          „
≥  more than or equal to          B3/179          85/133          …
¥  yen sign                       B4/180          A5/165          ¥
µ  micro sign                     B5/181          B5/181          µ
∂  partial differential           B6/182          F0/240          ð
∑  n-ary summation                B7/183          86/134          †
∏  n-ary product                  B8/184          87/135          ‡
π  Greek letter pi                B9/185          88/136          ˆ
∫  integral                       BA/186          89/137          ‰
ª  feminine ordinal indicator     BB/187          AA/170          ª
º  masculine ordinal indicator    BC/188          BA/186          º
Ω  Greek capital omega            BD/189          8B/139          ‹
æ  ae ligature                    BE/190          E6/230          æ
ø  o slash                        BF/191          F8/248          ø
¿  inverted question mark         C0/192          BF/191          ¿
¡  inverted exclamation mark      C1/193          A1/161          ¡
¬  not sign                       C2/194          AC/172          ¬
√  square root                    C3/195          8C/140          Œ
ƒ  hooked f                       C4/196          8D/141          not defined
≈  almost equal to                C5/197          8E/142          Ž
∆  increment                      C6/198          8F/143          not defined
«  double left-pointing angle     C7/199          AB/171          «
»  double right-pointing angle    C8/200          BB/187          »
…  horizontal ellipsis            C9/201          90/144          not defined
   non-breaking space             CA/202          A0/160          non-breaking space
À  A grave                        CB/203          C0/192          À
Ã  A tilde                        CC/204          C3/195          Ã
Õ  O tilde                        CD/205          D5/213          Õ
Œ  OE ligature                    CE/206          94/148          ”
œ  oe ligature                    CF/207          95/149          •
–  en dash                        D0/208          AD/173          soft hyphen
—  em dash                        D1/209          96/150          –
“  double left quotation mark     D2/210          97/151          —
”  double right quotation mark    D3/211          98/152          ˜
‘  single left quotation mark     D4/212          91/145          ‘
’  single right quotation mark    D5/213          92/146          ’
÷  division sign                  D6/214          F7/247          ÷
◊  lozenge                        D7/215          A4/164          ¤
ÿ  y diaeresis                    D8/216          FF/255          ÿ
Ÿ  Y diaeresis                    D9/217          93/147          “
⁄  fraction slash                 DA/218          A6/166          ¦
€  euro sign                      DB/219          80/128          €
‹  left-pointing angle            DC/220          B2/178          ²
›  right-pointing angle           DD/221          B3/179          ³
ﬁ  fi ligature                    DE/222          B7/183          ·
ﬂ  fl ligature                    DF/223          B8/184          ¸
‡  double dagger                  E0/224          B9/185          ¹
·  middle dot                     E1/225          BC/188          ¼
‚  single low-9 quotation mark    E2/226          BD/189          ½
„  double low-9 quotation mark    E3/227          BE/190          ¾
‰  per mille sign                 E4/228          C1/193          Á
Â  A circumflex                   E5/229          C2/194          Â
Ê  E circumflex                   E6/230          C8/200          È
Á  A acute                        E7/231          CA/202          Ê
Ë  E diaeresis                    E8/232          CB/203          Ë
È  E grave                        E9/233          CC/204          Ì
Í  I acute                        EA/234          CD/205          Í
Î  I circumflex                   EB/235          CE/206          Î
Ï  I diaeresis                    EC/236          CF/207          Ï
Ì  I grave                        ED/237          D0/208          Ð
Ó  O acute                        EE/238          D2/210          Ò
Ô  O circumflex                   EF/239          D3/211          Ó
   Apple logo                     F0/240          D4/212          Ô
Ò  O grave                        F1/241          D9/217          Ù
Ú  U acute                        F2/242          DA/218          Ú
Û  U circumflex                   F3/243          DB/219          Û
Ù  U grave                        F4/244          DD/221          Ý
ı  dotless i                      F5/245          DE/222          Þ
ˆ  circumflex accent              F6/246          FD/253          ý
˜  tilde                          F7/247          FE/254          þ
¯  macron                         F8/248          D7/215          ×
˘  breve                          F9/249          9B/155          ›
˙  dot above                      FA/250          9C/156          œ
˚  ring above                     FB/251          9D/157          not defined
¸  cedilla                        FC/252          9E/158          ž
˝  double acute accent            FD/253          9F/159          Ÿ
˛  ogonek                         FE/254          9A/154          š
ˇ  caron                          FF/255          99/153          ™

Although not shown here (to avoid duplication), the inverse table maps from the hybrid Windows character set back to the original MacRoman character codes.

This table exposes a potential problem with cross-platform applications using the Omnis character set in that the highlighted characters will not be "cross-platform". They will appear correctly on Mac or on Windows (depending on which platform they were inserted from), but not both. The workaround for this problem has always been to implement custom mapping tables to handle these additional characters if needed.
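
As a rough illustration of the problem, the following Python sketch compares a few legacy pairs copied from the table above with the CP1252 code of the same character. It assumes that Python's mac_roman and cp1252 codecs match the character sets discussed here; the helper names are hypothetical and only a handful of rows are included.

```python
# (MacRoman code: legacy Windows code) pairs taken from the table above.
LEGACY_MAC_TO_WIN = {
    0x80: 0xC4,  # A diaeresis
    0xA0: 0x8A,  # dagger
    0xA5: 0xAF,  # bullet
    0xAA: 0x81,  # trademark sign
    0xB0: 0x83,  # infinity (no CP1252 equivalent exists)
}


def standard_cp1252_code(mac_code: int):
    """CP1252 code of the same character, or None if CP1252 lacks it."""
    ch = bytes([mac_code]).decode("mac_roman")
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        return None


def mismatched(legacy_map: dict) -> list:
    """MacRoman codes whose character exists in CP1252 but whose legacy
    Windows code is not that character's CP1252 code."""
    result = []
    for mac, legacy in legacy_map.items():
        std = standard_cp1252_code(mac)
        if std is not None and std != legacy:
            result.append(mac)
    return result


for mac in mismatched(LEGACY_MAC_TO_WIN):
    ch = bytes([mac]).decode("mac_roman")
    print(f"0x{mac:02X} {ch!r}: legacy 0x{LEGACY_MAC_TO_WIN[mac]:02X}, "
          f"CP1252 0x{standard_cp1252_code(mac):02X}")
```

A custom mapping table that redirects these codes to their CP1252 equivalents is one way to apply the workaround mentioned above.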

UTF8 Data
Where the database uses the UTF8 encoding (MySQL and PostgreSQL, for example), this poses an additional problem for DAMs operating in non-Unicode mode. Specifically, when a byte value greater than 0x7F is read from the database, should this be treated as a non-Unicode extended character or as the first byte of a multi-byte UTF8 character? (UTF8 bytes greater than 0x7F are used to indicate that one or more additional bytes are required to encode a character. UTF8 characters can use between 1 and 4 bytes.)

Problems can ensue if UTF8 byte sequences are read by a DAM operating in non-Unicode mode and treated as individual extended 8-bit characters. This situation should be avoided by ensuring that you do not access the UTF8 database using a DAM operating in Unicode mode (thus avoiding the possibility of Unicode characters being stored). If the database already contains a mixture of ANSI extended characters and multi-byte UTF8 characters, your best option is to revert to Unicode mode ($unicode=kTrue) and use the $validateutf8 property instead.

In Studio 5, the $validateutf8 session property forces any fetched character data to be validated using the rules for UTF8 encoding. If the byte (or bytes) of data satisfy the rules for UTF8 encoding, that sequence is taken as a UTF8 character. All characters in the data must satisfy these rules for the data to be treated as UTF8. Otherwise the data is treated as non-Unicode and is converted as described earlier. When $unicode is set to kTrue, any character data written back to the database will be converted to UTF8.
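
The validation rule can be sketched in Python as follows. This is a minimal sketch of the behaviour as described, not the DAM's code: Python's strict UTF-8 decoder stands in for the validation step, cp1252 is an assumed fallback codepage, and the function names are hypothetical.

```python
def decode_8bit(data: bytes, codepage: str) -> str:
    """Treat each byte as one codepage character, with "." for any
    byte the codepage does not define."""
    chars = []
    for b in data:
        try:
            chars.append(bytes([b]).decode(codepage))
        except UnicodeDecodeError:
            chars.append(".")
    return "".join(chars)


def decode_fetched(data: bytes, codepage: str = "cp1252") -> str:
    """Accept the data as UTF-8 only if every byte sequence in it is
    valid UTF-8; otherwise treat all of it as 8-bit codepage data."""
    try:
        return data.decode("utf-8")  # strict: any invalid sequence raises
    except UnicodeDecodeError:
        return decode_8bit(data, codepage)


print(decode_fetched("caf\u00e9".encode("utf-8")))  # valid UTF-8 throughout
print(decode_fetched(b"caf\xe9"))                   # lone 0xE9: 8-bit data
print(decode_fetched(b"\xc3\xa9 \xe9"))             # mixed: all taken as 8-bit
```

The last call shows why mixed data is a problem: because one byte fails validation, the valid two-byte sequence before it is also read as two separate 8-bit characters.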

Database Character Conversion
There is one further consideration regarding character conversion, namely any conversion which may be performed by the database and/or client library when reading and writing data. Oracle, for example, has provision for many non-Unicode as well as Unicode character sets, and it is up to the developer to ensure that the target encoding and character set are compatible with the database and that the destination data types are suitable (VARCHAR2 versus NVARCHAR2, for example). Where Oracle is concerned, you are responsible for matching the client character set (specified via the NLS_LANG environment variable) with the character set being used by the DAM. It is the responsibility of the Oracle database to convert between the client character set and the database character set. This is usually possible in all but the most extreme combinations, although it should be noted that writing Windows CP1252 character data to an ISO8859-P1 database (NLS_LANG = AMERICAN_AMERICA.WE8ISO8859P1), for example, will result in the "loss" of character codes in the range 0x80 to 0x9F. (Oracle will convert them to "¿" (0xBF).)
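
The outcome described for the CP1252 to ISO8859-P1 case can be emulated in a few lines of Python. This only reproduces the result stated above and says nothing about how Oracle performs the conversion internally; the function name is hypothetical.

```python
def cp1252_to_iso8859_1(data: bytes) -> bytes:
    """Emulate the stated outcome: bytes in the range 0x80 to 0x9F are
    replaced by 0xBF; all other byte values are the same in both
    character sets and pass through unchanged."""
    return bytes(0xBF if 0x80 <= b <= 0x9F else b for b in data)


stored = cp1252_to_iso8859_1("\u20ac5 caf\u00e9".encode("cp1252"))
print(stored.decode("iso8859_1"))
```

Here the Euro sign (0x80 in CP1252) is lost, while the accented e (0xE9 in both character sets) survives.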

Conclusions
In Studio 5, it is possible to continue accessing non-Unicode databases in a largely cross-platform and cross-application manner. Furthermore, in Studio 5 it is possible to interface with non-Unicode databases using different 8-bit ANSI codepages by making use of the $codepage property in conjunction with the "Native character set". In this manner, you can map to and from the extended characters in a given codepage.

Alternatively, the "Omnis character set" can be used to store non-Unicode data directly, or Omnis character set data can be translated using custom mapping tables, thus retaining the old non-Unicode DAM behaviour. Studio 5 can automatically port non-Unicode data in UTF8 databases to Unicode by detecting Unicode and non-Unicode byte sequences. Once converted, however, care should be taken not to expose non-Unicode applications to Unicode data.

References and Further Reading
The following links may be of interest:

  • About the ANSI/CP1252 code page
  • About the ISO8859 code pages
  • About the MacRoman code page
  • About the UTF8 Unicode encoding
  • About the $validateutf8 property
  • Mixing Unicode and Non-Unicode Data Types with Oracle
  • Mapping Character Sets