Lua String Operations

Let’s get into the string operations in Lua. Lua offers a wide range of functions for working with text. As we mentioned earlier, Lua is Unicode agnostic: strings are just sequences of bytes. That is good enough for most things, but there are situations where it just doesn’t work, so we’ll introduce a library for working with UTF-8 in Lua as well. Let’s go over the basic string operations first.

Basic String Functions

Like everything else in Lua, our indexes all start at 1 and not 0! Remember that for everything involving an index. Let’s start with a table and introduce each function as we go.

Commonly Used String Functions
Function | Description
string.len( s ) | Returns the length of “s” in bytes
string.upper( s ) | Returns the conversion of “s” to uppercase
string.lower( s ) | Returns the conversion of “s” to lowercase
string.reverse( s ) | Returns the reverse of “s”
string.sub( s, i [, j ] ) | Returns the substring of “s” starting at “i” and ending at “j” (“j” defaults to the end of the string)
string.rep( s, n ) | Returns “s” repeated n times
string.match( s, pattern [, i] ) | Returns the first substring of “s” which matches “pattern” (see below for patterns), starting the search at an optional index “i”
string.gmatch( s, pattern ) | Returns an iterator over the substrings of “s” which match “pattern” (see below for patterns)
string.gsub( s, pattern, replacement [, n] ) | Returns the string “s” with each instance of “pattern” replaced with “replacement”, takes an optional n which limits the number of replacements
string.find( s, pattern [, index [, boolplain ] ] ) | Returns the start and end index (or nil) of the first match of “pattern” in “s”, starts at an optional “index” and can take an optional bool “boolplain” to turn off patterns and search literally
string.format( s, o1, …, on ) | Returns “s” formatted with the values o1 and beyond, similar to printf in C

We’re going to go over each of these and how to use them. This table is just to get you used to what all is on the table. Things like match and gmatch will be detailed since each of them has different use cases. Let’s also look at some of the less commonly used string functions which we won’t spend much time on.

Less Commonly Used String Functions
Function | Description
string.byte( s [, i [ , j ] ] ) | Returns the numeric byte values of “s”, (optionally) starting at “i” and ending at “j”
string.char( byte1, …, byten ) | Returns the string made up of the bytes provided
string.dump( f ) | Returns the binary representation of a Lua function

Try working with these on your own to see what they do. Here’s a hint: string.byte and string.char convert back and forth.
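As a starting point for that experiment, here is a minimal sketch of the round trip. The string “Lua” and the variable names are just illustrative choices, and the byte values shown assume an ASCII-compatible encoding:

```lua
-- string.byte returns one number per byte in the requested range.
local s = "Lua"
local b1, b2, b3 = string.byte( s, 1, 3 )
print( b1, b2, b3 )

-- string.char turns those numbers back into a string.
local back = string.char( b1, b2, b3 )
print( back )
```

Running this should print the three byte values followed by the original string.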

Syntactic Sugar

Remember in the class lesson where we discussed syntactic sugar? Well, the string library is no exception to using the alternate class form for strings. You can write string.len( s ) or shorten it to s:len(). We’ll touch on these a bit below, but we’re not going to spend any extra time on it. Read the previous lesson on classes to make sure you understand what this is and why this works. This feature is specifically why this lesson is placed after so many other things.
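Here is a small sketch of the two call forms side by side. The one detail that tends to trip people up is that a string literal has to be wrapped in parentheses before the colon can be used on it. The variable names are only for illustration:

```lua
local s = "hello"

-- These two calls are equivalent; the colon passes s as the first argument.
local a = string.upper( s )
local b = s:upper()

-- A string literal needs parentheses before the colon.
local c = ( "hello" ):upper()

print( a, b, c )
```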

The Basic String Functions

Let’s dive into the easiest to learn string functions first.

string.len( s ), string.upper( s ), string.lower( s ), string.reverse( s )

Each of these is easy since it just takes a single parameter. Here’s an example of them in action:

#!/usr/bin/lua5.1

local test = "123456789"
local test2 = "aBcDeFgHiJkLmNoPqRsTuVwXyZ"

print( "length of test: " .. string.len( test ) )
print( "length of test2: " .. test2:len() )
print( "test put in uppercase: " .. test:upper() )
print( "test2 put in uppercase: " .. string.upper( test2 ) )
print( "test2 put in lowercase: " .. test2:lower() )
print( "test reversed: " .. test:reverse() )
print( "test2 reversed in lowercase: " .. string.reverse( test2:lower() ) )

This gets us the following from running it:

./luastring.lua 
length of test: 9
length of test2: 26
test put in uppercase: 123456789
test2 put in uppercase: ABCDEFGHIJKLMNOPQRSTUVWXYZ
test2 put in lowercase: abcdefghijklmnopqrstuvwxyz
test reversed: 987654321
test2 reversed in lowercase: zyxwvutsrqponmlkjihgfedcba

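We’re not going to go into these too deep as each of them is self-explanatory. Notice that string.upper and string.lower leave non-letters such as digits untouched. Remember again that indexes start at 1, so the length of a string is also the index of its last character (as long as every character is a single byte).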

string.sub( s, i [, j ] )

string.sub(…) is extremely useful for extracting known parts of a string. Like Perl and some other languages, you can also use negative indexes. A negative index means start from the back, so -1 would be the last character. Let’s see this all in practice:

#!/usr/bin/lua5.1

local test = "123456789"

print( "test substring starting at 5: " .. test:sub( 5 ) )
print( "test substring from 5 to 8: " .. string.sub( test, 5, 8 ) )
print( "test substring from -3 to -1: " .. string.sub( test, -3, -1 ) )

This gets us:

./luastring2.lua 
test substring starting at 5: 56789
test substring from 5 to 8: 5678
test substring from -3 to -1: 789

If you want to get a single character at a specific position, set “i” and “j” to the same number.
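A minimal sketch of that trick, including a loop that walks a string one character at a time. It treats every byte as one character, which holds for this plain ASCII test string; the table name chars is just an illustrative choice:

```lua
local test = "123456789"

local fifth = test:sub( 5, 5 )    -- same start and end index: one character
local last = test:sub( -1, -1 )   -- negative indexes count from the end

-- Walk the string one character (byte) at a time.
local chars = {}
for i = 1, #test do
	chars[i] = test:sub( i, i )
end

print( fifth, last, table.concat( chars, "," ) )
```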

string.rep( s, n )

This is another trivial string function. Repeat “s” n times. Let’s see a quick example:

#!/usr/bin/lua5.1

print( "repeat 'abcdefg' 3 times: " .. string.rep( "abcdefg", 3 ) )

This gets us:

./luastring3.lua 
repeat 'abcdefg' 3 times: abcdefgabcdefgabcdefg

Patterns and Special Characters

Before we can dive too far into the rest of our common string functions, we need to understand the basics of patterns and how to handle special characters in Lua. Special characters are things like newlines and tabs. Let’s go over special characters first since they’re the easiest.

Common Special Characters
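Escaped Character | Description
\n | Newline
\r | Carriage return
\t | Tab
\\ | Backslash (\)
\" | Double quote (")
\' | Single quote (')
\[ | Left square bracket ([)
\] | Right square bracket (])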

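These are all extremely common, work in virtually all string operations, and should be memorized. “New Line” is common on all OSes, but “Carriage Returns” are used on Windows in conjunction with “New Lines”. I tend to just strip them out with a pattern (see patterns below for more). Be aware that the escaped square brackets are accepted by Lua 5.1 but are rejected as invalid escape sequences by Lua 5.2 and later, where a plain [ or ] does the job. You will also need to take into account the magic characters (see below).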
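Here is a minimal sketch of stripping carriage returns out of Windows-style text. It uses string.gsub, which is covered later in this article, and the sample text and variable names are made up for illustration:

```lua
-- Windows line endings are a carriage return followed by a newline.
local windows_text = "line one\r\nline two\r\n"

-- Replace every carriage return with nothing; gsub also returns the count.
local unix_text, removed = string.gsub( windows_text, "\r", "" )

print( removed )
```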

Uncommon Special Characters
Escaped Character | Description
\a | Bell
\b | Backspace
\f | Form Feed
\v | Vertical Tab

I’ve never used any of these, but I’m including them for completeness’ sake.

Patterns
Pattern | Description | Opposite | Opposite Description
. | Any single character | [none] | [none]
%a | All letters | %A | Anything but letters
%c | All control characters | %C | Anything but control characters
%d | All digits | %D | Anything but digits
%l | All lowercase characters | %L | All non-lowercase characters
%p | All punctuation | %P | Anything but punctuation
%s | All whitespace | %S | All non-whitespace characters
%u | All uppercase characters | %U | All non-uppercase characters
%w | All word characters (alphanumeric) | %W | All non-word characters (non-alphanumeric)
%x | All hexadecimal digits | %X | Anything but hexadecimal digits
%z | The zero character (the byte with value 0, not the digit “0”) | %Z | Anything but the zero character

We’ll get into these in a minute, but first we have to introduce magic characters in the context of patterns.

Magic Characters
Magic Character | Function
% | Introduces a pattern class (such as %d) or escapes a magic character
. | Any character
^ | Beginning of string; “Not” inside character sets
$ | End of string
* | Match 0 or more times (greedy) (see below)
- | Match 0 or more times (non-greedy) (see below)
+ | Match 1 or more times
? | Match 0 or 1 times
[] | A character set (see below)
() | A capture (see below)

Confused yet? Don’t worry, we’re going to go over all of these. The special characters are easy enough, but let’s dive into the magic characters and patterns. Magic characters can be escaped with %, so %. matches a literal period and %% matches a literal percent sign. We have a bit of a chicken-and-egg problem deciding whether to teach patterns or the functions which use them first, so we’re going to start with string.match and then add in our patterns.
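As a quick preview of why escaping matters, here is a small sketch using string.match, which the next section covers. The sample text is made up, and escape_pattern is a hypothetical helper written for this example, not part of the standard library:

```lua
local text = "total: 5+3"

-- Unescaped, "+" means "one or more 5s", so this looks for "53", "553", ...
local unescaped = string.match( text, "5+3" )

-- Escaped with %, "+" is just a plus sign.
local escaped = string.match( text, "5%+3" )

-- Hypothetical helper: put % in front of every punctuation character.
-- The outer parentheses drop gsub's second return value (the count).
local function escape_pattern( s )
	return ( string.gsub( s, "%p", "%%%0" ) )
end

print( unescaped, escaped, escape_pattern( "5+3" ) )
```

Escaping every punctuation character is more than strictly necessary, but it is harmless since % followed by any punctuation character stands for that character.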

string.match( s, pattern [, i] )

Let’s start with a trivial example:

#!/usr/bin/lua5.1

local test = "abc 123 ABC 123 !!! cat d0g -+[]"

print( string.match( test, "123", 8 ) )
print( string.match( test, "ABC" ) )
print( string.match( test, "%-%+%[%]" ) )

This gets us:

./luamatch.lua 
123
ABC
-+[]

That’s great, but pretty much useless. Let’s throw in a pattern or two:

#!/usr/bin/lua5.1

local test = "abc 123 ABC 123 !!! cat d0g -+[]"

print( string.match( test, "%d" ) )
print( string.match( test, "%d%d%d" ) )

This gets us:

./luamatch.lua 
1
123

Let’s throw in some magic characters and spice things up!

#!/usr/bin/lua5.1

local test = "abc 123 ABC 123 !!! cat d0g -+[] 123"

print( string.match( test, "%d+" ) )
print( string.match( test, "%d*" ) )
print( string.match( test, "%d+.-%d+") )
print( string.match( test, "%d+.*%d+") )

This gets us:

./luamatch.lua 
123

123 ABC 123
123 ABC 123 !!! cat d0g -+[] 123

Now we’re just getting weird with this. Let’s go over each line.
print( string.match( test, "%d+" ) ) gets us “123”. We match a digit via %d and look for 1 or more repetitions with our magic character (“+”).
print( string.match( test, "%d*" ) ) gets us “” (an empty string). Remember that “*” matches zero or more times. The string starts with “a”, which is not a digit, so the pattern succeeds right away at index 1 by matching zero digits.
print( string.match( test, "%d+.-%d+" ) ) gets us “123 ABC 123”. We look for at least one digit, then any characters, then at least one more digit. The “-” tells us to match any number of times, but to not be greedy, so we stop at the first run of digits which follows.
print( string.match( test, "%d+.*%d+" ) ) gets us “123 ABC 123 !!! cat d0g -+[] 123”. We do the same thing as before, but “*” is greedy, so we match all the way to the last run of digits in the string.
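The magic character table also lists character sets, captures, and the ^ and $ anchors. Here is a minimal sketch of each with string.match, reusing the test string from the example above. The variable names are only illustrative, and the A-Z range assumes plain ASCII letters:

```lua
local test = "abc 123 ABC 123 !!! cat d0g -+[] 123"

-- Character sets: match any one of the characters listed inside [].
local vowels = string.match( test, "[aeiou]+" )
local upper = string.match( test, "[A-Z]+" )

-- ^ inside a set means "not": one or more non-whitespace characters.
local first_word = string.match( test, "[^%s]+" )

-- Captures: each () returns its own piece of the match.
local word, number = string.match( test, "(%a+) (%d+)" )

-- Anchors: ^ ties the match to the start of the string, $ to the end.
local leading = string.match( test, "^%d+" )
local trailing = string.match( test, "%d+$" )

print( vowels, upper, first_word, word, number, leading, trailing )
```

The anchored search for leading digits returns nil because the string starts with a letter.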

Greedy Matching

Greedy matching means that we match as many characters as possible while still letting the rest of the pattern succeed, while non-greedy matching means we match as few as possible and stop the first time the rest of the pattern matches. In our test string, we have “123”, then a bunch of extra stuff, another “123” midway, and a final “123” at the end. Our pattern matches a digit or more, then anything, then a digit or more. Because “.” matches any character, digits included, the greedy “.*” swallows the middle “123” and keeps going to the last “123”, but the non-greedy “.-” ends as soon as it sees the next batch of digits.

Greedy matching is useful when you want everything between two markers even if the end marker also shows up somewhere in the middle. Non-greedy matching is useful when you want something in between a known set of things and don’t ever want it to spill over. This is useful when you have sequences of similar formatting (think tags in a string or similar). Greedy and non-greedy matching get more important with some of the other string functions.
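To make the tag case concrete, here is a small sketch with a made-up snippet of markup. The parentheses in each pattern are a capture, so only the part inside them is returned:

```lua
local html = "<b>one</b> and <b>two</b>"

-- Greedy: ".*" runs to the LAST closing tag.
local greedy = string.match( html, "<b>(.*)</b>" )

-- Non-greedy: ".-" stops at the FIRST closing tag.
local lazy = string.match( html, "<b>(.-)</b>" )

print( greedy )
print( lazy )
```

The greedy version spills over into the second tag, which is rarely what you want when pulling out one tag at a time.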

string.gmatch( s, pattern )

string.gmatch( s, pattern ) is similar to string.match, except it provides all of the matches. You can iterate over these with a for loop or similar. Let’s look at an example:

#!/usr/bin/lua5.1

local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"

for s in string.gmatch( test, "%d+" ) do
	print( "found: " .. s )
end

This gets us:

./luamatch.lua 
found: 123
found: 456
found: 0
found: 789

string.gsub( s, pattern, replacement [, n] )

string.gsub( s, pattern, replacement [, n] ) is one of the most useful functions in all of Lua for working with strings. It allows you to take a string and replace everything in it which matches a pattern. You get back the new string and the number of replacements made. Here’s an easy example:

#!/usr/bin/lua5.1

local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"

local newstring, replacements = string.gsub( test, "%d+", "[Numbers]" )
print( "Replaced: " .. replacements )
print( "New string: " .. newstring )

This gets us:

./luamatch.lua 
Replaced: 4
New string: abc [Numbers] ABC [Numbers] !!! cat d[Numbers]g -+[] [Numbers]

You can limit the number of replacements by adding in an integer n to limit how many occurrences you affect. This is useful for certain scenarios where you know roughly where the replacements will need to take place. You’ll use things like this in data munging.
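Here is a short sketch of the limit in action on the same test string, plus one detail about the replacement string: inside it, %0 stands for the whole match and %% for a literal percent sign. The second sample string is made up for the example:

```lua
local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"

-- Only the first two runs of digits are replaced.
local limited, count = string.gsub( test, "%d+", "[Numbers]", 2 )
print( count )
print( limited )

-- In the replacement, %0 is the whole match and %% is a literal percent sign.
local percent = string.gsub( "50 or 75", "%d+", "%0%%" )
print( percent )
```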

string.find( s, pattern [, index [, boolplain ] ] )

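string.find returns the beginning index and end index of the first match for a given pattern, or nil if there is no match. The index affects where the search begins, and the boolean for a plain search affects whether the pattern is treated as a pattern (the default) or as literal text with all magic characters turned off.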

Let’s look at a basic example:

#!/usr/bin/lua5.1

function print_range (a, b)
	return a .. " to " .. b
end

local test = "ABCdefGHIjkl123%d%s%wABC"

local i, j = string.find( test, "%d+", 3 )
print( "found '%d+' from " .. i .. " to " .. j )

print( "found 'ABC' at: " .. print_range( test:find( "ABC" ) ) )
print( "found '%d%d%d' at: " .. print_range( test:find( "%d%d%d" ) ) )
print( "found 'ABC' starting at 5: " .. print_range( string.find( test, "ABC", 5 ) ) )
print( "found '%d' at: " .. print_range( string.find( test, '%d', 1, true ) ) .. " with plain search on" )

This gets us:

./stringfind.lua 
found '%d+' from 13 to 15
found 'ABC' at: 1 to 3
found '%d%d%d' at: 13 to 15
found 'ABC' starting at 5: 22 to 24
found '%d' at: 16 to 17 with plain search on

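Note that string.find returns an initial index and an ending index which can be passed straight to string.sub. In the example, we first call string.find with a pattern and a starting index, then pass the results of the other string.find calls to print_range, a small helper we made to keep our formatting easy. Since string.find returns nil when nothing matches, check the result before concatenating it in real code.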

This type of function is extremely useful for writing a basic parser. You can use the index to find specific pieces between certain ranges as you slowly take apart the string when writing a parser. Standard regular expressions and patterns solve a lot, but not everything.
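As a tiny taste of that parsing style, here is a sketch which splits a made-up “key=value” line using string.find with a plain search and string.sub. The line format and variable names are assumptions for the example:

```lua
local line = "name=John Smith"

-- Plain search (fourth argument true) looks for a literal "=".
local i, j = string.find( line, "=", 1, true )

local key, value
if i then
	key = string.sub( line, 1, i - 1 )    -- everything before the "="
	value = string.sub( line, j + 1 )     -- everything after it
end

-- string.find returns nil when nothing matches.
local missing = string.find( line, "%d" )

print( key, value, missing )
```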

string.format( s, o1, …, on )

I saved the best and the worst for last. string.format is one of the most useful functions for outputting things, but can be one of the most complicated string functions to master unless you’ve worked with C. Speaking of which, this almost entirely follows printf. We’ll go over some of it, but it’s best to consult a reference for a lot of it.

string.format takes a string s which contains special sequences to format the string. Each sequence can have associated variables which are passed to be formatted as wanted. Let’s see an example of something easy:

#!/usr/bin/lua5.1

print( string.format( "My name is %s. I'm %d years old!", "John", 22 ) )

This gets us:

./stringfind.lua 
My name is John. I'm 22 years old!

Don’t worry about understanding all of this yet. Just keep reading. string.format is extremely complicated, but also extremely powerful.

Basic Options for string.format

There’s a lot to go over before we can fully use string.format. Let’s see some of the basic options so that we can know what goes in our format string.

Format String | Description
%s | A standard string
%d | A standard integer (base 10)
%i | A standard integer (same as %d)
%f | A standard floating point number
%e | A number represented in exponential notation
%g | A number represented in the shortest form possible (%e or %f)

Other Options for string.format

Format String | Description
%u | Unsigned int
%x | Hexadecimal formatting
%X | %x but uppercase
%o | Octal formatting
%E | %e but uppercase
%G | %g but uppercase
%% | Just print ‘%’

Here’s an example which touches on all of the basics:

#!/usr/bin/lua5.1

print( string.format( "%s : %d : %i : %f : %e : %g", "Example", 123, 456, 1.23, 1234567890.0, 123.9000000 ) )
print( string.format( "%u : %x = %X : %o : %E : %G : %%", 123, 255, 255, 255, 123456789, 123.40000 ) )

This gets us:

./luaformat.lua 
Example : 123 : 456 : 1.230000 : 1.234568e+09 : 123.9
123 : ff = FF : 377 : 1.234568E+08 : 123.4 : %

Notice that string.format can take a variable number of arguments. This is an interesting feature of Lua which allows some functions to take extra arguments as necessary. Many languages require you to pass an array or similar, but Lua has a built in way to handle this.
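You can write functions like this yourself with “...”. Here is a minimal sketch; sum and describe are hypothetical helpers made up for this example, and sum assumes none of its arguments are nil:

```lua
-- "..." collects any extra arguments passed to the function.
local function sum( ... )
	local total = 0
	for _, n in ipairs( { ... } ) do
		total = total + n
	end
	return total
end

-- Extra arguments can be forwarded as-is to another function.
local function describe( fmt, ... )
	return string.format( fmt, ... )
end

print( sum( 1, 2, 3 ) )
print( describe( "%s is %d years old", "John", 22 ) )
```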

Formatting Formats

printf can do some awesome formatting and justification, and so can string.format. Let’s see an example and then we’ll break down what all string.format can do:

#!/usr/bin/lua5.1

print( string.format( "% d vs. % d", 123, -123 ) )
print( string.format( "%+d vs. %+d", 123, -123 ) )

print( string.format( "[%-5d vs. %-5d]", 123, -123 ) )
print( string.format( "[%05d vs. %05d]", 123, -123 ) )
print( string.format( "[%-8.2f vs. %-8.2f]", 123.000123, -123.000123 ) )
print( string.format( "[%08.2f vs. %08.2f]", 123.000123, -123.000123 ) )
print( string.format( "[%#.2f vs. %#.2f]", 123.00000, -123.00000 ) )
print( string.format( "[%.0f vs. %.0f]", 123.01, -123.01 ) )

This gets us:

./luaformat.lua 
 123 vs. -123
+123 vs. -123
[123   vs. -123 ]
[00123 vs. -0123]
[123.00   vs. -123.00 ]
[00123.00 vs. -0123.00]
[123.00 vs. -123.00]
[123 vs. -123]

Let’s break it all down.

Parts of a Format

Each format is built up of the following: %[flag][length][.[precision]][code]. We have gone over the basics of %[code] in our table. Let’s go over the flags and how to format our length and precision.

Flags
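Flag | Description
[single space] [ ] | Puts a space in front of positive numbers, where the minus sign goes for negative numbers
0 | Pads numbers with zeroes instead of spaces for the length
+ | Always shows the sign: + if positive, - if negative (numbers)
- | Left align within the length
# | Alternate form: keeps the decimal point and trailing zeroes which would otherwise be dropped (%f, %e, %g), or adds a 0x or 0 prefix (%x, %o)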

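You can then specify the length and the precision for your formatted item. The length is the minimum number of characters to output, with shorter values padded to fit. The precision is the number of digits after the decimal point for floating point numbers, or the maximum number of characters for strings. Let’s look at an absurd example for a string, where the precision cuts “12345” down to three characters and the length pads the result out to eight: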

print( string.format( "[% 8.3s]", "12345" ) )

This gets us:

[     123]

Almost all of these combinations work, so try them out and see what you get with weird options. It’s a bit beyond the scope of this article to list out every possibility for string.format. There are other options, but I have excluded them due to their rarity, and due to the fact some are not implemented/applicable in Lua.
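To give you something to start experimenting from, here is a small sketch which lines up a made-up table of names and prices with the length, precision, and left align options, and then shows the # flag on its own. The data and variable names are assumptions for the example:

```lua
local rows = { { "apple", 1.5 }, { "kiwi", 12.25 } }

-- %-8s: left aligned, 8 characters wide.
-- %8.2f: right aligned, 8 characters wide, 2 digits after the decimal point.
local lines = {}
for i, row in ipairs( rows ) do
	lines[i] = string.format( "%-8s|%8.2f|", row[1], row[2] )
	print( lines[i] )
end

-- The # flag adds a 0x prefix to %x and keeps the decimal point for %.0f.
local hex = string.format( "%#x", 255 )
local point = string.format( "%#.0f", 123.0 )
print( hex, point )
```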

True Unicode

We’ve gone through all of the cool string features in Lua, and we can work with UTF8 if we don’t need certain things, but what happens when we need true Unicode support? Things break down pretty quickly. The standard functions count and slice bytes, so it’s hard to break strings down on each code point, and lengths and similar come out wrong for any character which takes more than one byte.
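Here is a minimal sketch of the problem using only the standard string functions. The word “héllo” is written with decimal byte escapes so the example doesn’t depend on how the source file is saved; the variable names are just illustrative:

```lua
-- "héllo" in UTF-8: the "é" takes two bytes, 195 and 169.
local word = "h\195\169llo"

-- Five characters on screen, but six bytes as far as Lua is concerned.
local byte_length = string.len( word )

-- Cutting at byte 2 slices the "é" in half and leaves an invalid sequence.
local broken = string.sub( word, 1, 2 )

print( byte_length, #broken )
```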

Luckily, there is a good UTF8 library which can be installed via Luarocks. For this specific library, unlike our Lunajson library, you’ll need a C compiler set up with Luarocks. Make sure you have GCC or similar on your system before trying to use this library.

You want to do the following to get it installed on Debian or macOS:

sudo luarocks install utf8
Installing https://luarocks.org/utf8-1.2-0.src.rock
gcc -O2 -fPIC -I/usr/include/lua5.1 -c lutf8lib.c -o lutf8lib.o
gcc -shared -o utf8.so -L/usr/local/lib lutf8lib.o
utf8 1.2-0 is now installed in /usr/local (license: MIT)

This library doesn’t need much, but works great for dealing with UTF8. I use it in my translation work to split up characters and map Pinyin from dictionaries. For Lua 5.3, the library is called “lua-utf8” (and installed as “luautf8” from Luarocks), since Lua 5.3 already ships its own smaller built in library named utf8.

Per the documentation, it includes the following functions:
utf8.byte
utf8.char
utf8.find
utf8.gmatch
utf8.gsub
utf8.len
utf8.lower
utf8.match
utf8.reverse
utf8.sub
utf8.upper

The usage is the same as our standard string functions. To use it, just use the following line of code:

utf8 = require 'utf8'

Or for Lua 5.3:

utf8 = require 'lua-utf8'
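Here is a hedged sketch of counting characters instead of bytes. It only relies on utf8.len from the function list above, and it loads the library defensively since it may not be installed: it tries “lua-utf8” first and then “utf8” (which on Lua 5.3 and later resolves to the built in library, which also provides utf8.len). If neither can be loaded, the count is simply left as nil:

```lua
-- "héllo" in UTF-8, written with byte escapes: 6 bytes, 5 characters.
local word = "h\195\169llo"

-- Try each module name in turn; pcall keeps a missing library from erroring.
local ok, utf8lib = pcall( require, "lua-utf8" )
if not ok then
	ok, utf8lib = pcall( require, "utf8" )
end

-- Count characters with the library if one was found.
local char_count = nil
if ok then
	char_count = utf8lib.len( word )
end

print( "bytes: " .. #word )
print( "characters: " .. tostring( char_count ) )
```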

Lua Strings

This article covers the basics of string functions in Lua. A lot of these are directly ported from C, but there are some notable differences. I have skipped things which are missing from C and which aren’t commonly used. Most of these functions are fine to use with UTF8, but there are some things missing.

Come back to this article when you need some of these functions. You should try to memorize at least a few of the operations as soon as possible. We’ll have some new exercises coming up to work with different real life examples soon enough.