Lexical elements: Rune literals pt 1, Intro to Unicode

January 23, 2023

Runes… Oh boy! This is one of the bits of Go that shines for its elegant simplicity, but constantly trips up everyone (myself included). As such, I think this may be a 2-, or maybe even a 3-parter.

Let’s get started.

Rune literals

A rune literal represents a rune constant, an integer value identifying a Unicode code point.
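To make "an integer value identifying a Unicode code point" a bit more concrete, here is a minimal sketch. The helper name `describe` is made up for this illustration; it formats one rune as its code point notation, its decimal value, and the character itself.

```go
package main

import "fmt"

// describe is a hypothetical helper for this post. It shows the three
// common views of a rune: U+XXXX notation, decimal integer, and glyph.
func describe(r rune) string {
	return fmt.Sprintf("%U %d %c", r, r, r)
}

func main() {
	// A rune literal is a constant; assigned with :=, it takes the
	// default type rune, which is an alias for int32.
	r := 'a'

	fmt.Printf("%T\n", r)    // int32
	fmt.Println(describe(r)) // U+0061 97 a
}
```

In other words, `'a'` is just the number 97 with nicer syntax.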

If you’re already familiar with Unicode, and have a strong understanding of what a “code point” is, you can probably skip this one. See you tomorrow! 👋

If you’re not completely sure what a Unicode code point is, stick around… I’ll do my best to untangle this seemingly simple phrase.

First off, what is Unicode? It’s emojis, right? Ehh. If that’s your understanding of Unicode (which would be completely understandable if your career in software is relatively young), then you need a bit more context. I encourage you to read a bit about the history of why Unicode was invented, and the problems it was meant to solve. This PDF, Introduction to Unicode: History of Character Codes, is a good starting point.

But here are the highlights:

Before Unicode, we had many different coding systems. The most popular was ASCII, the American Standard Code for Information Interchange. But since not all of the world is American, this had obvious drawbacks. Different countries or languages would often have their own coding schemes, but this made it very difficult to share documents between regions.

If I wrote a document using the Cyrillic alphabet, then sent it to my colleague in France, he would likely see a jumble of French letters in seemingly random order.

So Unicode came along to Unify all the codes.

One code to rule them all

Great. So now instead of the 128 possible characters in ASCII, we have a virtually unlimited number of characters, right?

Not so fast.

While there’s room in Unicode for more than 1 million individual code points, most are not (yet) defined. But what’s more, Unicode is smarter than ASCII in a number of ways. It is possible to combine Unicode code points to form a single physical character.

For example, if you want to display the Cyrillic letter ў, this can be done by combining the Cyrillic letter U (у) with a combining breve mark (U+0306, which looks like ˘), to give you ў. But while this looks like a single character, and in print terms it is, it’s actually two Unicode code points. As such, this code won’t compile:

	x := rune('ў') // у (U+0443) followed by a combining breve (U+0306)

Because while it visually looks like we’re quoting a single character, that character is composed of two Unicode code points, and (as we’ll see in the next section), a rune literal must be a single Unicode code point.

Now if that’s not confusing enough, this code will compile:

	x := rune('ў')

What’s the difference?

Well, Unicode includes a number of precomposed characters. This is a nice convenience for languages that commonly use a large number of these types of diacritics. But it’s an inconvenience for us programmers. Not only does it mean that of these two character representations, only one is a valid rune, it also introduces certain headaches when trying to compare Unicode strings for equality, or when sorting, etc.
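Here is a minimal sketch of that headache, using only the standard library. The two constants are written with `\u` escapes so the difference survives copy and paste, and the helper name `countCodePoints` is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Both strings render as ў on screen.
const (
	decomposed  = "\u0443\u0306" // у (U+0443) + combining breve (U+0306)
	precomposed = "\u045e"       // ў (U+045E), a single precomposed code point
)

// countCodePoints is a hypothetical helper that reports how many
// Unicode code points (runes) a string contains.
func countCodePoints(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(countCodePoints(decomposed))  // 2
	fmt.Println(countCodePoints(precomposed)) // 1

	// Go compares strings byte by byte, so two strings that look
	// identical on screen are not equal.
	fmt.Println(decomposed == precomposed) // false
}
```

Only the one-code-point form fits in a rune literal, which is why one of the two snippets above compiles and the other doesn't.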

A last note for today, especially for anyone very new to Unicode. This concept of combining code points to add diacritics and other markings to an existing letter is also how Unicode emojis are modified to change skin tone, gender, or other attributes.
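As a rough illustration, here is a sketch that lists the code points in a waving hand emoji, with and without a skin tone modifier. The helper name `codePoints` is made up for this example, and the emoji are written as `\U` escapes so it's clear exactly what each string contains.

```go
package main

import "fmt"

const (
	wave     = "\U0001F44B"           // 👋 waving hand
	waveTone = "\U0001F44B\U0001F3FD" // 👋🏽 waving hand + medium skin tone modifier
)

// codePoints is a hypothetical helper that returns each code point in s
// in U+XXXX notation.
func codePoints(s string) []string {
	var out []string
	// Ranging over a string yields one rune (code point) at a time.
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	fmt.Println(codePoints(wave))     // [U+1F44B]
	fmt.Println(codePoints(waveTone)) // [U+1F44B U+1F3FD]
}
```

So the second emoji, like the decomposed ў, is two code points and would not fit in a single rune literal.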

Unicode is pretty powerful. And confusing at times.

Quotes from The Go Programming Language Specification, Version of June 29, 2022
