theodore-s-beers/later
Unicode Collation Algorithm implementation in Zig
This is a Zig implementation of the
Unicode Collation Algorithm, using no
dependencies beyond the standard library. later
passes the conformance tests
for both the Default Unicode Collation Element Table (DUCET) and the root
collation order of the Common Locale Data Repository (CLDR). You can verify this
by running the tests:
zig build test --release=safe
const std = @import("std");
const later = @import("later");
test "sort multilingual list of names" {
const alloc = std.testing.allocator;
var coll = try later.Collator.initDefault(alloc);
defer coll.deinit();
var input = [_][]const u8{
"چنگیز",
"Éloi",
"Ötzi",
"Melissa",
"صدام",
"Mélissa",
"Overton",
"Elrond",
};
const expected = [_][]const u8{
"Éloi",
"Elrond",
"Melissa",
"Mélissa",
"Ötzi",
"Overton",
"چنگیز",
"صدام",
};
std.mem.sort([]const u8, &input, &coll, later.collateComparator);
try std.testing.expectEqualSlices([]const u8, &expected, &input);
}
later
is based on my somewhat more mature UCA implementation in Rust,
feruca. I wanted to learn Zig
properly, and Unicode collation is a good, non-trivial problem to solve toward
that end. Perhaps this collator will also be of use to others in the Zig
community.
It's worth noting that a substantial part of the work of implementing the UCA is
generating maps from Unicode data files. We need to be able to determine
quickly, for a given code point, its canonical decomposition, combining class,
collation weights, and more. I built the maps for later
in a separate
repository, uca-maps-zig.
They're serialized in simple custom binary formats (see src/bin
) and loaded
on-demand in collation.
Another part of the fun in this project was implementing UTF-8 decoding/validation and Unicode form NFD normalization. It turns out the algorithms are well established, and they work nicely, and they aren't all that difficult to code. We stand on the shoulders of giants!