11 Feb

Adding Latin and Greek support to utf8.h!

I’ve had a lot of people using my Unicode library utf8.h since its release – thanks to all who’ve found the library useful and provided feedback!

One of the things that I’d scrimped on previously was support for case-insensitive comparisons of letters beyond those in ASCII. I knew little about this, but when a user requested that my library also support accented Latin characters, and later Greek symbols, I jumped at the chance to add support.

The following utf8 functions (utf8casecmp, utf8ncasecmp, utf8casestr, utf8isupper, utf8islower, utf8lwr, and utf8upr) have been modified to support the Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Greek & Coptic Unicode blocks. I’ve also added two new functions, utf8lwrcodepoint and utf8uprcodepoint, that’ll make a single codepoint lower or upper case respectively.

The main logic of how you convert between the lower and upper cases is both slightly concise and utterly disgusting. Let’s take a look at the code to convert a lower case codepoint to an upper case one.

For ASCII characters, and also some Latin and Greek, the upper case codepoints are simply 32 places below the lower case ones:

if ((('a' <= cp) && ('z' >= cp)) ||
    ((0x00e0 <= cp) && (0x00f6 >= cp)) ||
    ((0x00f8 <= cp) && (0x00fe >= cp)) ||
    ((0x03b1 <= cp) && (0x03c1 >= cp)) ||
    ((0x03c3 <= cp) && (0x03cb >= cp))) {
  cp -= 32;
}
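As a minimal, self-contained sketch of that first rule, the range checks can be wrapped in a function and tried on a few codepoints. The name upper_by_offset32 is hypothetical and not part of utf8.h. The gaps between the ranges matter: 0x00f7 is the division sign ÷, which has no case, and 0x03c2 is the final sigma ς, whose upper case Σ (0x03a3) is not 32 places below it.

```c
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: applies only the "subtract 32" rule. */
static int32_t upper_by_offset32(int32_t cp) {
  if ((('a' <= cp) && ('z' >= cp)) ||
      ((0x00e0 <= cp) && (0x00f6 >= cp)) || /* à .. ö */
      ((0x00f8 <= cp) && (0x00fe >= cp)) || /* ø .. þ */
      ((0x03b1 <= cp) && (0x03c1 >= cp)) || /* α .. ρ */
      ((0x03c3 <= cp) && (0x03cb >= cp))) { /* σ .. ϋ */
    cp -= 32;
  }
  return cp; /* anything outside the ranges is returned unchanged */
}
```

For example, upper_by_offset32(0x00e9) (é) gives 0x00c9 (É) and upper_by_offset32(0x03b1) (α) gives 0x0391 (Α), while 0x00f7 (÷) comes back unchanged.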

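The next set of codepoints are offset by 1 between the lower and upper cased variants, with the upper case codepoint sitting directly below the lower case one. Depending on whether the lower case codepoint is odd or even, we have two if statements: the first handles pairs where the lower case codepoint is odd, the second handles pairs where it is even: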

if (((0x0100 <= cp) && (0x012f >= cp)) ||
    ((0x0132 <= cp) && (0x0137 >= cp)) ||
    ((0x014a <= cp) && (0x0177 >= cp)) ||
    ((0x0182 <= cp) && (0x0185 >= cp)) ||
    ((0x01a0 <= cp) && (0x01a5 >= cp)) ||
    ((0x01de <= cp) && (0x01ef >= cp)) ||
    ((0x01f8 <= cp) && (0x021f >= cp)) ||
    ((0x0222 <= cp) && (0x0233 >= cp)) ||
    ((0x0246 <= cp) && (0x024f >= cp)) ||
    ((0x03d8 <= cp) && (0x03ef >= cp))) {
  cp &= ~0x1;
}

if (((0x0139 <= cp) && (0x0148 >= cp)) ||
    ((0x0179 <= cp) && (0x017e >= cp)) ||
    ((0x01af <= cp) && (0x01b0 >= cp)) ||
    ((0x01b3 <= cp) && (0x01b6 >= cp)) ||
    ((0x01cd <= cp) && (0x01dc >= cp))) {
  cp -= 1;
  cp |= 0x1;
}
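To see why the two bit tricks work, here is a small sketch that isolates them. The helper names are hypothetical and not part of utf8.h, and they deliberately leave out the range checks, so they are only meaningful for codepoints inside the ranges listed above.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of utf8.h. */

/* Pairs laid out as (even upper, odd lower): clearing bit 0 maps the lower
   case codepoint onto its upper case neighbour and leaves upper case alone. */
static int32_t upper_even_pair(int32_t cp) {
  return cp & ~0x1;
}

/* Pairs laid out as (odd upper, even lower): step down one, then set bit 0.
   An even lower case codepoint lands on the odd codepoint just below it; an
   odd upper case codepoint steps down to even and is pushed back up. */
static int32_t upper_odd_pair(int32_t cp) {
  cp -= 1;
  cp |= 0x1;
  return cp;
}
```

With these, upper_even_pair(0x0101) (ā) gives 0x0100 (Ā) and upper_odd_pair(0x013a) (ĺ) gives 0x0139 (Ĺ). Passing in a codepoint that is already upper case returns it unchanged in both cases, which is why the ranges can cover both halves of each pair.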

And lastly, for all other codepoints in the ranges that don’t have any sane approach whatsoever, we’ll fire them all into a single big switch statement:

switch (cp) {
  default: break;
  case 0x00ff: cp = 0x0178; break;
  case 0x0180: cp = 0x0243; break;
  case 0x01dd: cp = 0x018e; break;
  case 0x019a: cp = 0x023d; break;
  case 0x019e: cp = 0x0220; break;
  case 0x0292: cp = 0x01b7; break;
  case 0x01c6: cp = 0x01c4; break;
  case 0x01c9: cp = 0x01c7; break;
  case 0x01cc: cp = 0x01ca; break;
  case 0x01f3: cp = 0x01f1; break;
  case 0x01bf: cp = 0x01f7; break;
  case 0x0188: cp = 0x0187; break;
  case 0x018c: cp = 0x018b; break;
  case 0x0192: cp = 0x0191; break;
  case 0x0199: cp = 0x0198; break;
  case 0x01a8: cp = 0x01a7; break;
  case 0x01ad: cp = 0x01ac; break;
  case 0x01b0: cp = 0x01af; break;
  case 0x01b9: cp = 0x01b8; break;
  case 0x01bd: cp = 0x01bc; break;
  case 0x01f5: cp = 0x01f4; break;
  case 0x023c: cp = 0x023b; break;
  case 0x0242: cp = 0x0241; break;
  case 0x037b: cp = 0x03fd; break;
  case 0x037c: cp = 0x03fe; break;
  case 0x037d: cp = 0x03ff; break;
  case 0x03f3: cp = 0x037f; break;
  case 0x03ac: cp = 0x0386; break;
  case 0x03ad: cp = 0x0388; break;
  case 0x03ae: cp = 0x0389; break;
  case 0x03af: cp = 0x038a; break;
  case 0x03cc: cp = 0x038c; break;
  case 0x03cd: cp = 0x038e; break;
  case 0x03ce: cp = 0x038f; break;
  case 0x0371: cp = 0x0370; break;
  case 0x0373: cp = 0x0372; break;
  case 0x0377: cp = 0x0376; break;
  case 0x03d1: cp = 0x03f4; break;
  case 0x03d7: cp = 0x03cf; break;
  case 0x03f2: cp = 0x03f9; break;
  case 0x03f8: cp = 0x03f7; break;
  case 0x03fb: cp = 0x03fa; break;
}

With the above, we can handle all the lower/upper case variants for the Latin and Greek characters requested!
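One consequence of the mappings shown here is that every source and target codepoint outside ASCII lies between 0x0080 and 0x07ff, so each one encodes as exactly two UTF-8 bytes and a string can be upper cased without changing its length. The following is a rough sketch of that idea under stated assumptions, not the actual utf8upr implementation: map_in_place, cp_map_fn, and demo_upper are hypothetical names, the input is assumed to be valid null-terminated UTF-8, and the demo mapping covers only a fragment of the real logic.

```c
#include <stdint.h>

/* Hypothetical callback type: any per-codepoint mapping, such as the
   lower-to-upper logic shown above. */
typedef int32_t (*cp_map_fn)(int32_t);

/* Demo mapping covering only two of the "subtract 32" ranges. */
static int32_t demo_upper(int32_t cp) {
  if ((('a' <= cp) && ('z' >= cp)) || ((0x00e0 <= cp) && (0x00f6 >= cp))) {
    cp -= 32;
  }
  return cp;
}

/* Sketch only, not the utf8.h implementation. Rewrites 1- and 2-byte
   sequences in place and skips over anything longer. Assumes `s` is valid,
   null-terminated UTF-8 and that `map` never changes the encoded length. */
static void map_in_place(unsigned char *s, cp_map_fn map) {
  while (*s != 0) {
    if (*s < 0x80) {                  /* 1 byte: U+0000..U+007F */
      *s = (unsigned char)map(*s);
      s += 1;
    } else if ((*s & 0xe0) == 0xc0) { /* 2 bytes: U+0080..U+07FF */
      int32_t cp = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
      cp = map(cp);
      s[0] = (unsigned char)(0xc0 | (cp >> 6));
      s[1] = (unsigned char)(0x80 | (cp & 0x3f));
      s += 2;
    } else if ((*s & 0xf0) == 0xe0) { /* 3 bytes: left untouched */
      s += 3;
    } else {                          /* 4 bytes: left untouched */
      s += 4;
    }
  }
}
```

Calling map_in_place on the bytes of "café" with demo_upper turns them into the bytes of "CAFÉ": the trailing 0xc3 0xa9 (é) becomes 0xc3 0x89 (É), and the buffer stays the same size.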

I hope these additions prove useful to my users, and if you’ve got any requests yourself, feel free to file them here.
