UTF8 and Zig
25 January 2022

So I am messing around with Zig and quite enjoying it. It is certainly at the 0.9 release stage, in the sense that things are not quite there yet, but it seems to be there enough for what I want it to do. I may even attempt a Zig/Rust comparison once I have written a little more code in it. I am not exactly good at Rust, but at the moment I have written a lot more Rust than Zig.

So to get into Zig I need a few small projects to mess with. Practice projects that I can make a mess in rather than something of release quality.

So I thought I would write a UTF8 decoder.

Well, I have to admit I have either forgotten or just never really understood UTF8 encoding. I am not sure which. Either way, I ended up reading the Wikipedia page on UTF8. It is pretty good, but I still struggled to put the information together.

Zig has Unicode support in its standard library, so you really don't need to write your own unless you are learning Zig and want an interesting practice topic.
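For comparison, here is a minimal sketch of leaning on the standard library instead. It assumes a Zig version where std.unicode.Utf8View and its iterator are available; countCodepoints is a name made up for this example, not something from my project.

```zig
const std = @import("std");

// Hypothetical helper: counts the codepoints in `text` using the
// standard library's decoder. Returns an error if `text` is not valid UTF8.
fn countCodepoints(text: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    var count: usize = 0;
    while (it.nextCodepoint()) |_| {
        count += 1;
    }
    return count;
}

pub fn main() !void {
    // "h€llo" with the euro sign written out as its three bytes.
    const text = "h\xE2\x82\xACllo";
    var it = (try std.unicode.Utf8View.init(text)).iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("0x{x}\n", .{cp});
    }
    std.debug.print("codepoints: {}\n", .{try countCodepoints(text)});
}
```

The view validates the whole input up front, so the iterator itself never has to report an error.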

So what do you need to know about UTF8 so that you can read the wiki page and write a decoder with ease? This is not a full description of UTF8, rather snippets of information that can be combined with the wiki page.

UTF8

Decoding

- The first byte uses its leading bits (up to 5 of them, as in 11110xxx) to tell us how many bytes there are in a codepoint.
- Any remaining bytes start with 10xxxxxx.
- So you get 6 bits of the value out of each following byte.
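The first-byte rule can be sketched as a switch over bit patterns. The code further down calls a helper named bytes_to_read that I do not show, so what follows is only a guess at what such a helper might look like; returning 0 for a byte that cannot start a sequence is an assumption made for this sketch.

```zig
const std = @import("std");

// Hypothetical sketch of a lead-byte classifier. It only looks at the
// prefix pattern, so it does not reject overlong forms such as 0xC0.
fn bytes_to_read(first: u8) usize {
    return switch (first) {
        0b0000_0000...0b0111_1111 => 1, // 0xxxxxxx: plain ASCII
        0b1100_0000...0b1101_1111 => 2, // 110xxxxx
        0b1110_0000...0b1110_1111 => 3, // 1110xxxx
        0b1111_0000...0b1111_0111 => 4, // 11110xxx
        else => 0, // 10xxxxxx is a continuation byte; 11111xxx is never used
    };
}

pub fn main() void {
    // 'a', then the first bytes of a 2, 3 and 4 byte sequence.
    const leads = [_]u8{ 0x61, 0xC3, 0xE2, 0xF0 };
    for (leads) |b| {
        std.debug.print("0x{x} -> {}\n", .{ b, bytes_to_read(b) });
    }
}
```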

Extract the value bits from the first byte, and then, if there are more bytes to follow, shift the value left by 6 bits and bitwise OR in the value bits from the next byte, repeating for each following byte.
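As a worked example of the shift-and-or step, take the euro sign, U+20AC, which is encoded as the three bytes E2 82 AC. The function name decode3 is made up for this illustration, and it assumes the input really is a well-formed three-byte sequence.

```zig
const std = @import("std");

// Illustration only: decodes one three-byte sequence by hand,
// with no validation of the input bytes.
fn decode3(b1: u8, b2: u8, b3: u8) u32 {
    // 1110xxxx: keep the low 4 bits of the first byte.
    var value: u32 = b1 & 0b0000_1111; // 0xE2 -> 0x02
    // 10xxxxxx: make room for 6 bits, then OR them in.
    value = (value << 6) | (b2 & 0b0011_1111); // 0x02 << 6 | 0x02 -> 0x82
    value = (value << 6) | (b3 & 0b0011_1111); // 0x82 << 6 | 0x2C -> 0x20AC
    return value;
}

pub fn main() void {
    std.debug.print("0x{x}\n", .{decode3(0xE2, 0x82, 0xAC)});
}
```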

You are probably going to end up with code looking something like this, perhaps with loops and with the mistakes I have made removed.

    fn next(self: *Utf8_iterator) ?u32 {
        if (self.pos < self.data.len) {
            const c1 = self.data[self.pos];
            const num_chars = bytes_to_read(c1);

            switch (num_chars) {
                1 => {
                    self.pos += 1;
                    return c1 & 0x7f;
                },
                2 => {
                    self.pos += 1;
                    const c2 = self.data[self.pos];
                    self.pos += 1;

                    // Mask off the encoding bits
                    const d1: u8 = c1 & @as(u8, 0b0001_1111); // low 5 bits only
                    const d2: u8 = c2 & @as(u8, 0b0011_1111); // low 6 bits only

                    const c: u32 = (@as(u32, d1) << 6) | d2;
                    return c;
                },
                3 => {
                    self.pos += 1;
                    const c2 = self.data[self.pos];
                    self.pos += 1;
                    const c3 = self.data[self.pos];
                    self.pos += 1;
                    // Mask off the encoding bits
                    const d1: u8 = c1 & @as(u8, 0b0000_1111); // low 4 bits only
                    const d2: u8 = c2 & @as(u8, 0b0011_1111);
                    const d3: u8 = c3 & @as(u8, 0b0011_1111);

                    var c: u32 = (@as(u32, d1) << 6) | d2;
                    c = c << 6 | d3;
                    return c;
                },
                4 => {
                    self.pos += 1;
                    const c2 = self.data[self.pos];
                    self.pos += 1;
                    const c3 = self.data[self.pos];
                    self.pos += 1;
                    const c4 = self.data[self.pos];
                    self.pos += 1;
                    // Mask off the encoding bits
                    const d1: u8 = c1 & @as(u8, 0b0000_0111); // low 3 bits only
                    const d2: u8 = c2 & @as(u8, 0b0011_1111);
                    const d3: u8 = c3 & @as(u8, 0b0011_1111);
                    const d4: u8 = c4 & @as(u8, 0b0011_1111);

                    var c: u32 = (@as(u32, d1) << 6) | d2;
                    c = c << 6 | d3;
                    c = c << 6 | d4;
                    return c;
                },
                else => @panic("Ooops maybe invalid utf8, it might not be as we don't support unicode scalar values"),
            }
        }
        return null;
    }
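For what the version with loops might look like, here is a rough sketch rather than a second draft. The struct layout and the bytes_to_read helper are assumptions, since neither is shown in this post, and returning null for bad input instead of panicking is a choice made only for this example. It also adds the bounds check my code above is missing, so truncated input does not index past the end of the data.

```zig
const std = @import("std");

// Hypothetical helper (not shown in the post): sequence length from the
// first byte, or 0 if the byte cannot start a sequence.
fn bytes_to_read(first: u8) usize {
    return switch (first) {
        0b0000_0000...0b0111_1111 => 1,
        0b1100_0000...0b1101_1111 => 2,
        0b1110_0000...0b1110_1111 => 3,
        0b1111_0000...0b1111_0111 => 4,
        else => 0,
    };
}

// Hypothetical stand-in for the iterator struct.
const Utf8_iterator = struct {
    data: []const u8,
    pos: usize = 0,

    // Returns the next codepoint, or null at the end of the data or when
    // the bytes cannot be decoded (the position is left unchanged then).
    fn next(self: *Utf8_iterator) ?u32 {
        if (self.pos >= self.data.len) return null;

        const c1 = self.data[self.pos];
        const len = bytes_to_read(c1);
        // Reject bad first bytes and sequences that run past the end.
        if (len == 0 or self.data.len - self.pos < len) return null;

        // Value bits of the first byte, indexed by sequence length.
        const lead_masks = [_]u8{ 0, 0x7f, 0x1f, 0x0f, 0x07 };
        var c: u32 = c1 & lead_masks[len];

        var i: usize = 1;
        while (i < len) : (i += 1) {
            const b = self.data[self.pos + i];
            // Every following byte must look like 10xxxxxx.
            if ((b & 0b1100_0000) != 0b1000_0000) return null;
            c = (c << 6) | (b & 0b0011_1111);
        }
        self.pos += len;
        return c;
    }
};

pub fn main() void {
    var it = Utf8_iterator{ .data = "h\xE2\x82\xACllo" };
    while (it.next()) |cp| {
        std.debug.print("0x{x}\n", .{cp});
    }
}
```

Like the original, this does not check for overlong encodings or surrogate values, so it is still a practice decoder and not a validating one.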

So it was my first iterator-like function and it went pretty well. I also started to use @as in anger, even though I am not sure I am using it right. This code is very much a first draft, and I don't plan on a second draft, as that is not the point of practice projects.

Zig is meant to be simple, and while I was having to google things because it is still very new to me, I would google and then go "yep, got that" rather than "mmm, need to read more".

I still have a lot to cover in Zig, in particular comptime and getting comfortable with the basics, but so far the experience has been pleasant. Probably the most negative experience has been with the editor. I use Emacs and the experience is not great. I couldn't get the LSP server working. Given that, at the time of writing, Zig is only at v0.9, it is hard to hold that against the language.

It is not until you have written well over 10k lines of code that you get a proper feel for how the language holds together in terms of the trade-offs every language has to make. So here's to the next 9950 :)