Lexical elements: Rune literals pt 1, Intro to Unicode

January 23, 2023

Runes… Oh boy! This is one of bits of Go that shines for its elegant simplicity, but constantly trips up everyone (myself included). As such, I think this may be a 2-, or maybe even a 3-parter.

Let’s get started.

Rune literals

A rune literal represents a rune constant, an integer value identifying a Unicode code point.

If you’re already familiar with Unicode, and have a strong understanding of what a “code point” is, you can probably skip this one. See you tomorrow! 👋

If you’re not completely sure what a Unicode code point is, stick around… I’ll do my best to untangle this seemingly simple phrase.

First off, what is Unicode? It’s emojis, right? Ehh. If that’s your understanding of Unicode, which would be completely understandable, if your career in software is relatively young, then you need a bit more context. I encourage you to read a bit about the history of why Unicode was invented, and the problems it was meant to solve. This PDF, Introduction to Unicode: History of Character Codes is a good starting point.

But here are the highlights:

Before Unocide, we had many different coding systems. The most popular was ASCII, the American Standard Code for Information Interchange. But since not all of the world is American, this had obvious drawbacks. Different countries or languages would often have their own coding schemes, but this made it very difficult to share documents between regions.

If I wrote a document using the Cyrillic alphabet, then sent it to my colleague in France, he would likely see a jumble of French letters in seemingly random order.

So Unicode came along to Unify all the codes.

One code to rule them all

Great. So now instead of 127 possible characters in ASCII, we have a virtually unlimited number of characters, right?

Not so fast.

While there’s room in Unicode for more than 1 million individual code points, most are not (yet) defined. But what’s more, Unicode is smarter than ASCII in a number of ways. It is possible to combine Unicode code points to form a single physical character.

For example, if you want to display the Cyrillic letter Ñž, this can be done by combining the Cyrillic Y (у) with the breve mark (˘), to give you Ñž. But while this looks like a single character, and in print terms it is, it’s actually two Unicode codepoints. As such, this code won’t compile:

	x := rune('ў')

Because while it visually looks like we’re quoting a single character, that character is composed of two Unicode codepoints, and (as we’ll see in the next section), a rune literal must be a single Unicode codepoint.

Now if that’s not confusing enough, this code will compile:

	x := rune('Ñž')

What’s the difference?

Well, Unicode includes a number of precomposed characters. This is a nice convenience for languages that commonly use a large number of these types of diacritics. But it’s an incovenience for us programmers. Not only does it mean that of these two character representations, only one is a valid rune, it also introduces certain headaches when trying to compare Unicode strings for equality, or when sorting, etc.

A last note for today, especially for anyone very new to Unicode. This concept of combining characters to add diacritics and other markings to an existing letter is the same way that Unicode emojis are modified to change skin tone, gender, or other attributes.

Unicode is pretty powerful. And confusing at times.

Quotes from The Go Programming Language Specification, Version of June 29, 2022


Share this

Direct to your inbox, daily. I respect your privacy .

Unsure? Browse the archive .

Related Content


Lexical elements: Rune literals pt 3

Let’s continue our disection of rune literals. If you missed the parts, check them out from Monday when we discussed Unicode, and yesterday when we discussed quoting single characters. Today we’re looking at the various escape sequences supported by the rune literal syntax. Rune literals Several backslash escapes allow arbitrary values to be encoded as ASCII text. There are four ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits; \u followed by exactly four hexadecimal digits; \U followed by exactly eight hexadecimal digits, and a plain backslash \ followed by exactly three octal digits.


Lexical elements: Rune literals pt 2

Let’s continue our exploration of rune literals, which began on Monday. In summary from Monday, a rune in Go represents a Unicode code point. Continuing from there… Rune literals A rune literal is expressed as one or more characters enclosed in single quotes, as in 'x' or '\n'. Within the quotes, any character may appear except newline and unescaped single quote. A single quoted character represents the Unicode value of the character itself, while multi-character sequences beginning with a backslash encode values in various formats.


Valid import paths

Today we’re looking at one of the more essoteric parts of the spec. Import declarations … Implementation restriction: A compiler may restrict ImportPaths to non-empty strings using only characters belonging to Unicode’s L, M, N, P, and S general categories (the Graphic characters without spaces) and may also exclude the characters !"#$%&’()*,:;<=>?[]^`{|} and the Unicode replacement character U+FFFD. Other than what’s stated in the quote, we’re not going to look at an exhaustive list of what is and isn’t guaranteed to be supported by a Go implementation.

Get daily content like this in your inbox!

Subscribe