Omnis Tech News Issue: 020107
Omnis Studio and Unicode
In the spirit of the New Year, perhaps it is time to reflect a bit on how humanity can all come together in peace and harmony - at least with regard to the technology that is beginning to break down some of the barriers that divide us.
Wouldn't it be nice if the adoption of Unicode could have that effect? Well, it may not be the harbinger of world peace, but it does have the potential for us to be able to better communicate with people of different nations - and it certainly could open new markets for Omnis Studio and the applications we write with it!
A Brief History of Written Communication
While we share the ability to communicate vocally and visually with many of our vertebrate brethren, we humans appear to be unique in the evolution of a written form of communication. This was developed over a long time - and long before we ever had computers. The development of writing systems is just one example of the creativity and diversity of human thought.
We can imagine that the first form of "written" communication was something like the time-honored tradition of using small rocks, twigs, bundles of grass and scratches in the dirt to represent prominent features of the terrain, herds of game animals or groups of "enemies" and a plan of action. ("You go around this way, we'll go around that way and we'll have them cornered...") Of course, we must imagine this because the chance of finding a "fossilized" example of such early communication - and recognizing it as such - is extremely remote after tens or even hundreds of thousands of years. Yet it seems reasonable that we began communicating like this.
Eventually, we learned to dispense with all of the props and just use images scratched in the dirt. We developed standard representations for those things for which we once used the physical "models".
As we spread out in waves of migration from Africa, we diverged in many ways besides geographically. Someone somewhere got the idea to begin leaving more permanent "writings" on the rock walls of canyons and caves. Perhaps there were important meanings to these carved petroglyphs and paintings or perhaps they were just expressions of our innate urge to draw graffiti and leave our personal mark on the world. Whatever the reason for their creation, many show signs of an evolving counting system and standardization (on a local level) of a symbol system.
Once we settled into more permanent villages and towns, there came to be more reasons for us to keep records and to write down transactions and stories. There were also reasons to make these writings more portable. Various writing media and implements emerged - and often the symbols used to form the local standard system were in part determined by the means used to write down the symbols. Cuneiform symbols were very much related to the use of clay tablets and stylus writing implements. Sheets of papyrus or simple paper and reed brushes or quill pens also influenced the symbolic forms we devised. But more important were the symbolic units that we chose to use in our many locales. The invention of printing (only recently) further led to the standardization of written forms.
Some cultures developed writing systems with sets of symbols that represent ideas or that use simplified pictures of things, while others chose to use symbols that represent sounds - either individual consonants and/or vowels or entire syllables. While there was a fair amount of regional borrowing and copying of symbols, geographic and cultural isolation over long periods of time has led to an amazingly diverse set of symbols when viewed as a whole. Much later, anthropologists located many languages that did not have a written form, and symbol systems were developed to at least transcribe the strings of sounds made by the speakers of those languages.
But it is not just words that we symbolize. There have been numerous symbol systems devised for numbering systems, mathematics, various branches of scientific and engineering endeavor, music and rhythmic notation and other fields of human thought. We have been very busy being creative over the past dozen millennia or so!
What Exactly Is Unicode?
Technically speaking, Unicode itself is simply a concept or a "standard", like SQL. It is an attempt to represent all known forms of written communication (past, present and even imaginary in some cases) with a single coding standard - translating each possible written symbol or symbolic element used in textual and graphical human communication to a unique number which can then be easily stored, transmitted and otherwise manipulated by computers and software that subscribe to this standard. This allows us to deal with text in any combination of languages within a single document - or database - without confusion or ambiguity. No small task, given the diversity of the symbolic characters that we have learned to commit to paper (and other media) over the years!
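As a small illustration of the "unique number" idea, here is a sketch in Python (not Omnis notation) that converts between characters and their Unicode code points using the built-in ord() and chr() functions and the standard unicodedata module. The sample characters and the helper name code_point_label are my own choices for the example.

```python
import unicodedata

# Sample characters written as escapes so the code points are unambiguous:
# A, e-acute, Greek capital omega, and a CJK ideograph.
samples = ["A", "\u00e9", "\u03a9", "\u4e2d"]

def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return "U+%04X" % ord(ch)

labels = [code_point_label(ch) for ch in samples]
for ch, label in zip(samples, labels):
    # unicodedata.name() looks up the official character name.
    print(label, unicodedata.name(ch))

# The mapping works in both directions: number -> character -> number.
round_trip = ord(chr(0x1E08))
```

Whatever language or script a character comes from, it is handled the same way: as one number in a single shared numbering scheme.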
The difficulty in attaining this goal is that a number of earlier standards have already been established in various regions around the world for working with the local character set on the less capable computers of the past. Now that computers are much faster, can work with more complex data values and have vastly more memory than decades ago when those local standards were first established, it is a good time to revamp the system. But one obstacle that had to be overcome was how to accommodate all of the existing standards, as well as to include ancient and marginalized (or even vanishing) writing systems, in the least disruptive way.
So the Unicode standard is actually a compromise in a number of ways, with a bit of built-in redundancy forced on it by legacy systems. One of those compromises deals with characters that are a composite of multiple glyph elements. To use the Latin symbol system as an example, there are a small number of basic characters. But these also have upper and lower case versions. And in many languages that use this basic set, one or more diacritical marks (accents and other modifying symbols) may be used with a given symbol to create a new symbol. In many local standards, both the composite symbol and the individual elements used to build it are in the local character set. This becomes an even greater challenge when dealing with logographic and ideographic characters. But the Unicode standard has been devised to accommodate all of these characters - and to provide rules and techniques for constructing, deconstructing and equating composite symbols and their components.
A "Unicode Font" is a font that follows the Unicode encoding standards to identify character glyphs. Such a font also contains rules for constructing valid composite character sequences from elemental symbols. Both TrueType and OpenType fonts support Unicode and use Unicode code points to map the glyphs they contain.
Only a handful of Unicode fonts actually attempt to contain glyphs for all of the possible character positions, though. Most support the basic ASCII set of characters and then one or more sets of glyphs for a related group of language families or special uses (scientific or mathematical symbols, dingbats, etc.). This is reasonable when we consider how much information must be included within the font for holding the rules regarding composition and decomposition of its characters (among other rules) - and the amount of RAM that would be required to hold all of that information when the font is in use.
While we have fonts that contain glyphs associated with specific Unicode values, those fonts must also contain methods for dealing with composite glyph duplication. Some more thought had to be given to the Unicode standard to handle this problem.
In the database world, we are familiar with the concept of normalization. To us, this is the process of designing information structures to remove ambiguity and redundancy to enhance the storage and retrieval of information. The Unicode techniques for character normalization serve a similar purpose - and they are a key part of using Unicode with a database and/or database application.
The idea is to bring each composite character into a normal form for storage, sorting and other purposes, when there may be a number of ways of representing that character. In the extreme case, a character may be a composite of a number of elements - and many subsets of those elements could form valid composite characters on their own. So the composite character might have been created using just the elements, a combination of elementals and simpler composite characters, or it might have been entered using the one Unicode value for the composite itself.
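For example, the character Ḉ ("Latin capital letter C with cedilla and acute" - character number U+1E08) can be composed in a number of ways:
We can simply use the Unicode character number (U+1E08) of its pre-composed form.
We can use the basic letter C (U+0043) followed by the combining cedilla (U+0327) and the combining acute accent (U+0301).
We can use the pre-composed Ç (U+00C7) followed by the combining acute accent (U+0301).
We can use the pre-composed Ć (U+0106) followed by the combining cedilla (U+0327).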
All of these represent the same character, but they are stored and manipulated differently. Normalization brings any one of these combinations into a standard form. But what should that standard be?
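The equivalence of such alternative spellings can be checked in a few lines. This is a sketch in Python rather than Omnis notation, and the four code point sequences for U+1E08 (and the variable names) are ones I chose for the illustration.

```python
import unicodedata

PRECOMPOSED = "\u1E08"            # single pre-composed code point
FROM_ELEMENTS = "C\u0327\u0301"   # C + combining cedilla + combining acute
FROM_C_CEDILLA = "\u00C7\u0301"   # C-with-cedilla + combining acute
FROM_C_ACUTE = "\u0106\u0327"     # C-with-acute + combining cedilla

variants = [PRECOMPOSED, FROM_ELEMENTS, FROM_C_CEDILLA, FROM_C_ACUTE]

# As stored, the four strings have different lengths and do not compare equal.
lengths = [len(v) for v in variants]
distinct_raw = len(set(variants))

# After normalization to one form, only a single distinct value remains.
distinct_normalized = {unicodedata.normalize("NFC", v) for v in variants}

print(lengths, distinct_raw, len(distinct_normalized))
```

Four different stored values collapse to one once they are all brought into the same normal form, which is the property a database comparison or index needs.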
There are two schools of thought: decomposition and composition. Character composition is the process of combining simpler characters into fewer precomposed characters. Decomposition is the opposite process, breaking precomposed characters back into their component pieces.
Decomposition (NFD normalization) is the simpler of the two techniques. Using that technique, any composite character is broken into its individual elements, which are sequenced according to internal rules. This leaves us with a standard form for the character, but potentially many pieces to keep track of.
Composition (NFC normalization) first invokes decomposition (just in case there is a mix of basic and pre-composed elements in the sequence for a character) and then re-composes the character sequence into a single pre-composed character (with the elements being applied in a specific standard order). This reduces the number of characters we need to store and manipulate, but the process could potentially create a different character than the original in some rare cases.
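The two forms can be compared side by side with Python's standard unicodedata module. This is only a sketch under my own choice of sample strings, not Omnis code. The second half shows one well-known instance of the "rare case": the Angstrom sign (U+212B) is recomposed as the ordinary letter A with ring above (U+00C5), so the original code point is not restored.

```python
import unicodedata

def code_points(text):
    """List the code points of a string as U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]

# A partly composed sequence: C-with-cedilla followed by a combining acute.
mixed = "\u00C7\u0301"

nfd_form = unicodedata.normalize("NFD", mixed)  # fully decomposed, ordered
nfc_form = unicodedata.normalize("NFC", mixed)  # decomposed, then recomposed

print(code_points(nfd_form))
print(code_points(nfc_form))

# Rare case: NFC does not give back the character we started with.
angstrom_sign = "\u212B"
recomposed = unicodedata.normalize("NFC", angstrom_sign)
print(code_points(angstrom_sign), "->", code_points(recomposed))
```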
The rules for these processes are defined by the Unicode standard, and applications that are Unicode compatible understand how to apply those rules and invoke those processes.
The Unicode Version of Omnis Studio
Omnis Studio internally uses the broadest Unicode encoding for representing characters: UTF-32. This is a fixed-width 32-bit format that can accommodate any Unicode character code. For most other operations it uses UTF-8, a variable-width encoding of 1 to 4 (8-bit) bytes that can also represent any Unicode character - the standard used by most operating systems, web browsers and fonts. UTF-32 is simply more efficient for internal operations because it is fixed width, so every character is the same "size".
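The size difference between the two encodings is easy to see in a short Python sketch. The sample text is my own, and this says nothing about how Omnis Studio lays out its memory; it only illustrates variable width versus fixed width.

```python
# Four characters that need 1, 2, 3 and 4 bytes respectively in UTF-8:
# A, e-acute, a CJK ideograph, and the musical G clef symbol.
text = "A\u00e9\u4e2d\U0001D11E"

utf8_sizes = [len(ch.encode("utf-8")) for ch in text]
# "utf-32-be" is used so that no byte order mark is added to each result.
utf32_sizes = [len(ch.encode("utf-32-be")) for ch in text]

print(utf8_sizes)
print(utf32_sizes)

# With a fixed width, the Nth character always starts at byte 4 * N,
# so it can be located without scanning the preceding characters.
data = text.encode("utf-32-be")
third = data[4 * 2 : 4 * 3].decode("utf-32-be")
```

That direct indexing is the efficiency the fixed-width format buys, at the cost of using four bytes even for plain ASCII letters.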
When we paste values from the clipboard, Omnis Studio automatically applies NFC normalization to the string being pasted. This reduces the number of character codes within the string and makes data entry more intuitive for the user. If uncomposed (deconstructed or semi-constructed) composite characters exist within a string, additional keystrokes would be required to move through the character using the arrow keys, for example. Also, the insertion point would become increasingly de-coupled from the visual character positions the more deconstructed characters there are in the string. It is highly recommended that we use this same normalization for pre-processing strings before performing operations like sorting or comparisons. Users may still enter character sequences to build composite characters, so we should perform normalization for consistency.
And so Omnis Studio provides us with both nfd() and nfc() functions to perform normalization operations. Each function accepts a single parameter, which is the string to be normalized.
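To show why normalizing before a comparison matters, here is a sketch in Python. The nfc() and nfd() helpers below are Python stand-ins that merely mimic the single-parameter shape described above; they are not the Omnis functions, and the sample strings are my own.

```python
import unicodedata

def nfc(text):
    """Python stand-in: return text in composed (NFC) form."""
    return unicodedata.normalize("NFC", text)

def nfd(text):
    """Python stand-in: return text in decomposed (NFD) form."""
    return unicodedata.normalize("NFD", text)

typed = "cafe\u0301"   # final e followed by a combining acute accent
stored = "caf\u00e9"   # final e-acute as one pre-composed character

# The two strings look identical on screen but differ as stored.
raw_match = typed == stored

# Normalizing both sides to the same form makes the comparison reliable.
nfc_match = nfc(typed) == nfc(stored)
nfd_match = nfd(typed) == nfd(stored)

print(raw_match, nfc_match, nfd_match)
```

Either form works for comparison as long as both operands use the same one.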
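To observe the difference between a pre-composed character and a composite character sequence, we can perform a simple experiment. Create a window with entry fields for two character variables, original and normalized, and a pushbutton whose method decomposes the value of original using nfd(), puts the nfc() result into normalized and reports the length of each in an OK message.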
Now find a Unicode character viewing application (I am using the Character Palette of the Font Viewer within TextEdit on Mac OS X).
Locate an obvious composite character, copy it to the clipboard (which may require applying it to a document first) and then paste it into the field for the original variable. Now click the pushbutton.
When I paste in Ḉ and then click the pushbutton, the result I see in the OK message is that the deconstructed value in original has a length of 3, while the normalized value in normalized has a length of 1. I also notice that it requires 3 uses of the arrow key to move the insertion point in front of this character if it is at the end of the string.
For more incentive to normalize strings in entry fields using nfc(), I clicked behind the character in original and pressed the left arrow key only once. I then began typing. The rightmost character that I typed displayed the acute accent, while the original C-with-all-the-trimmings lost the acute accent. So the "combining acute" gets applied to whatever base character precedes it.
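The same behavior can be imitated outside Omnis Studio. The Python sketch below assumes that the pasted character is U+1E08, that one arrow key press moves past one code point, and that the letter typed is "e"; the variable names original and normalized simply echo the experiment.

```python
import unicodedata

pasted = "\u1E08"  # C with cedilla and acute, as one pre-composed code point

original = unicodedata.normalize("NFD", pasted)        # decomposed value
normalized = unicodedata.normalize("NFC", original)    # recomposed value
lengths = (len(original), len(normalized))

# One left-arrow press from the end puts the insertion point just
# before the combining acute accent. Typing "e" there inserts it
# between the cedilla and the acute.
position = len(original) - 1
edited = original[:position] + "e" + original[position:]

# The combining acute now follows "e", so it attaches to that letter
# and the C keeps only its cedilla.
recomposed = unicodedata.normalize("NFC", edited)

print(lengths)
print(["U+%04X" % ord(ch) for ch in recomposed])
```

After the edit, the string recomposes to C-with-cedilla followed by e-acute, which matches what was seen in the entry field.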
Fortunately, Omnis Studio performs the NFC normalization for us on pasting (which is why we used nfd() to decompose the string in our experiment), but performing this operation on strings imported by other means - or retrieved from existing database records - is good insurance. Also, the automatic NFC normalization is NOT performed when using remote forms and the nfc() function is NOT available in client-side methods. We must perform such operations in server-side methods when using the Omnis Studio web technologies.
Ultimately, future versions of Omnis Studio may well be Unicode-compatible by default. But for now we must use a separate version. Libraries converted to the Unicode version of Omnis Studio are not useable by the non-Unicode version - and there is no reverse conversion. So if you want to create a Unicode version of one of your libraries, make sure that you are converting a copy of that library and not the original (unless you have no desire to have a non-Unicode version any more).
Until Next Time
There is more to tell about the Unicode version of Omnis Studio, but it will have to wait for another issue of Omnis Tech News. I hope that this article has been at least interesting, if not useful, to you. If you are in the business of creating applications for sale, or if you create applications for in-house use either for a company with international branches or for an entity that needs to handle data in multiple languages, you really should begin experimenting with, or simply using, the Unicode version. The world awaits...
© 2007 Copyright of the text and images herein remains with the respective author. No part of this newsletter may be reproduced, transmitted, stored in a retrieval system or translated into any language in any form by any means without the written permission of the author or Raining Data.
Omnis® and Omnis Studio® are registered trademarks, and Omnis 7 is a trademark of Raining Data UK Ltd. Other products mentioned are trademarks or registered trademarks of their corporations. All rights reserved.