Let’s get into string operations in Lua. Lua offers a wide range of functions for working with text. As we mentioned earlier, Lua is Unicode agnostic: it treats a string as a sequence of bytes and knows nothing about encodings. That is good enough for most things, but there are situations where it just doesn’t work, so we’ll introduce a library for working with UTF-8 in Lua as well. Let’s go over the basic string operations first.
Basic String Functions
Like everything else in Lua, our indexes all start at 1 and not 0! Remember that for everything involving an index. Let’s start with a table and introduce each function as we go.
Commonly Used String Functions
Function | Description |
---|---|
string.len( s ) | Get the length of s in bytes |
string.upper( s ) | Returns the conversion of “s” to uppercase |
string.lower( s ) | Returns the conversion of “s” to lowercase |
string.reverse( s ) | Returns the reverse of “s” |
string.sub( s [, i [, j ] ] ) | Returns a substring from “s” starting at “i” and ending at “j” |
string.rep( s, n ) | Returns “s” repeated n times |
string.match( s, pattern [, i] ) | Returns the substring from “s” which matches “pattern” (see below for patterns), starting at an optional index “i” |
string.gmatch( s, pattern ) | Returns the substrings as an iterator from “s” which match “pattern” (see below for patterns) |
string.gsub( s, pattern, replacement [, n] ) | Returns the string “s” with each instance of “pattern” replaced with “replacement”, takes an optional n which limits the number of times to do the replacement |
string.find( s, pattern [, index [, boolplain ] ] ) | Returns the start and end index (or nil) for finding the “pattern” in “s”, starts at an optional “index” and can take an optional bool “boolplain” to ignore pattern search and search literally |
string.format( s, o1, …, on ) | Returns the formatting of “s” via options in o1 and beyond, similar to the printf options in C |
We’re going to go over each of these and how to use them. This table is just to get you used to what all is on the table. Things like match and gmatch will be detailed since each of them has different use cases. Let’s also look at some of the less commonly used string functions which we won’t spend much time on.
Less Commonly Used String Functions
Function | Description |
---|---|
string.byte( s [, i [ , j ] ] ) | Return the numeric byte values of “s”, (optionally) starting at “i” and ending at “j” |
string.char( byte1, …, byten ) | Return the string made up of the characters for the byte values provided |
string.dump( f ) | Return the binary representation of a Lua function as a string |
Try working with these on your own to see what they do. Here’s a hint: string.byte and string.char convert back and forth.
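If you want a starting point for that experiment, here is a minimal sketch of the round trip. The variable names are just for illustration, and the byte values shown assume plain ASCII text:

```lua
-- string.byte turns characters into numeric byte values;
-- string.char turns those values back into a string.
local s = "Lua"
local b1, b2, b3 = string.byte( s, 1, 3 )
print( b1, b2, b3 )                        --> 76 117 97 (separated by tabs)

local rebuilt = string.char( b1, b2, b3 )
print( rebuilt )                           --> Lua
```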
Syntactic Sugar
Remember in the class lesson where we discussed syntactic sugar? Well, the string library is no exception to using the alternate class form for strings. You can write string.len( s ) or shorten it to s:len(). We’ll touch on these a bit below, but we’re not going to spend any extra time on it. Read the previous lesson on classes to make sure you understand what this is and why this works. It’s specifically this feature which is why this lesson is placed after so many other things.
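As a quick refresher, here is a small sketch showing that both forms give the same result. The length operator (#) is included as a third spelling you will often see. Note that a string literal has to be wrapped in parentheses before you can call a method on it:

```lua
local s = "hello"

print( string.len( s ) )        --> 5
print( s:len() )                --> 5 (same call, method form)
print( #s )                     --> 5 (length operator)

-- A literal needs parentheses before the colon
print( ( "hello" ):upper() )    --> HELLO
```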
The Basic String Functions
Let’s dive into the easiest to learn string functions first.
string.len( s ), string.upper( s ), string.lower( s ), string.reverse( s )
Each of these is easy since they just take a single parameter. Here’s an example of them in action:
#!/usr/bin/lua5.1
local test = "123456789"
local test2 = "aBcDeFgHiJkLmNoPqRsTuVwXyZ"
print( "length of test: " .. string.len( test ) )
print( "length of test2: " .. test2:len() )
print( "test put in uppercase: " .. test:upper() )
print( "test2 put in uppercase: " .. string.upper( test2 ) )
print( "test2 put in lowercase: " .. test2:lower() )
print( "test reversed: " .. test:reverse() )
print( "test2 reversed in lowercase: " .. string.reverse( test2:lower() ) )
This gets us the following from running it:
./luastring.lua
length of test: 9
length of test2: 26
test put in uppercase: 123456789
test2 put in uppercase: ABCDEFGHIJKLMNOPQRSTUVWXYZ
test2 put in lowercase: abcdefghijklmnopqrstuvwxyz
test reversed: 987654321
test2 reversed in lowercase: zyxwvutsrqponmlkjihgfedcba
We’re not going to go into these too deep as each of them is self-explanatory. Remember that string.len counts bytes and that indexes start at 1, so the last byte of a string with a length of 9 sits at index 9.
string.sub( s [, i [, j ] ] )
string.sub(…) is extremely useful for extracting known parts of a string. Like Perl and some other languages, you can also use negative indexes. A negative index means start from the back, so -1 would be the last character. Let’s see this all in practice:
#!/usr/bin/lua5.1
local test = "123456789"
print( "test substring starting at 5: " .. test:sub( 5 ) )
print( "test substring from 5 to 8: " .. string.sub( test, 5, 8 ) )
print( "test substring from -3 to -1: " .. string.sub( test, -3, -1 ) )
This gets us:
./luastring2.lua
test substring starting at 5: 56789
test substring from 5 to 8: 5678
test substring from -3 to -1: 789
If you want to get a single character at a specific position, set “i” and “j” to the same number.
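Here is a short sketch of that single-character trick, walking a string one position at a time. The variable name is illustrative, and this assumes one byte per character (plain ASCII):

```lua
local word = "Lua"

-- Same number for i and j gives exactly one character
for i = 1, #word do
  print( i, word:sub( i, i ) )
end

-- Negative indexes work here too: the last character
print( word:sub( -1, -1 ) )    --> a
```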
string.rep( s, n )
This is another trivial string function. Repeat “s” n times. Let’s see a quick example:
#!/usr/bin/lua5.1
print( "repeat 'abcdefg' 3 times: " .. string.rep( "abcdefg", 3 ) )
This gets us:
./luastring3.lua
repeat 'abcdefg' 3 times: abcdefgabcdefgabcdefg
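One place string.rep tends to earn its keep is in building separators or padding. A tiny sketch, with made-up variable names:

```lua
local title = "Report"
-- Build an underline exactly as long as the title
local underline = string.rep( "=", #title )

print( title )
print( underline )    --> ======
```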
Patterns and Special Characters
Before we can dive too far into the rest of our common string functions, we need to understand the basics of patterns and how to handle special characters in Lua. Special characters are things like newlines and tabs. Let’s go over special characters first since they’re the easiest.
Common Special Characters
Escaped Character | Description |
---|---|
\n | Newline |
\r | Carriage Return |
\t | Tab |
\\ | \ |
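\" | " |
\' | ' |
\[ | [ |
\] | ] |
These are all extremely common, work in virtually all string operations, and should be memorized. The newline is used on every OS, while Windows pairs a carriage return with each newline. I tend to just strip the carriage returns out with a pattern (see patterns below for more). You will also need to take the magic characters into account (see below). One caution: \[ and \] are not listed among the escape sequences in the Lua reference manual. Lua 5.1 quietly accepts them, but Lua 5.2 and later report an invalid escape sequence, and a plain [ or ] inside a quoted string doesn’t need escaping anyway.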
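Here is a rough sketch of cleaning up Windows line endings. It leans on string.gsub, which is covered further down, and the sample text is invented for the example:

```lua
-- Text with Windows-style line endings (\r\n)
local windows_text = "line one\r\nline two\r\n"

-- Replace every carriage return with nothing
local unix_text, removed = windows_text:gsub( "\r", "" )

print( removed )      --> 2
io.write( unix_text ) -- prints the two lines with plain \n endings
```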
Uncommon Special Characters
Escaped Character | Description |
---|---|
\a | Bell |
\b | Backspace |
\f | Form Feed |
\v | Vertical Tab |
I’ve never used any of these, but I’m including them for completion’s sake.
Patterns
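Character Class | Matches |
---|---|
%a | Letters |
%c | Control characters |
%d | Digits |
%l | Lowercase letters |
%p | Punctuation |
%s | Whitespace |
%u | Uppercase letters |
%w | Alphanumeric characters |
%x | Hexadecimal digits |
%D, %S, %W, … | The uppercase form of a class matches everything the lowercase form does not |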
We’ll get into these in a minute, but first we have to introduce magic characters in the context of patterns.
Magic Characters
Magic Character | Function |
---|---|
% | Introduces a character class (such as %d) or escapes a magic character |
. | Any character |
^ | Beginning of string; “Not” inside character sets |
$ | End of string |
* | Match 0 or more times (greedy) (see below) |
- | Match 0 or more times (non-greedy) (see below) |
+ | Match 1 or more times |
? | Match 0 or 1 times |
[] | A character set (see below) |
() | A character capture (see below) |
Confused yet? Don’t worry, we’re going to go over all of these. The special characters are easy enough, but let’s dive into the magic characters and patterns. Magic characters can be escaped with %. We have a bit of a chicken-and-egg problem deciding whether to teach patterns or the functions that use them first, so we’re going to start with a function, string.match, and then add in our patterns.
string.match( s, pattern [, i] )
Let’s start with a trivial example:
#!/usr/bin/lua5.1
local test = "abc 123 ABC 123 !!! cat d0g -+[]"
print( string.match( test, "123", 8 ) )
print( string.match( test, "ABC" ) )
print( string.match( test, "%-%+%[%]" ) )
This gets us:
./luamatch.lua
123
ABC
-+[]
That’s great, but pretty much useless, let’s throw in a pattern or two:
#!/usr/bin/lua5.1
local test = "abc 123 ABC 123 !!! cat d0g -+[]"
print( string.match( test, "%d" ) )
print( string.match( test, "%d%d%d" ) )
This gets us:
./luamatch.lua
1
123
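Digits aren’t the only thing you can match. Here is a small sketch using a few other character classes and a character set in square brackets. The test string is shortened for the example:

```lua
local test = "abc 123 ABC"

print( string.match( test, "%a+" ) )      --> abc  (letters)
print( string.match( test, "%u+" ) )      --> ABC  (uppercase letters)
print( string.match( test, "%D+" ) )      --> "abc " (anything that is NOT a digit)

-- A set matches any one of the things listed inside the brackets
print( string.match( test, "[%d%u]+" ) )  --> 123  (digits or uppercase letters)
```

The set stops at the space after “123” because a space is neither a digit nor an uppercase letter.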
Let’s throw in some magic characters and spice things up!
#!/usr/bin/lua5.1
local test = "abc 123 ABC 123 !!! cat d0g -+[] 123"
print( string.match( test, "%d+" ) )
print( string.match( test, "%d*" ) )
print( string.match( test, "%d+.-%d+") )
print( string.match( test, "%d+.*%d+") )
This gets us:
./luamatch.lua
123

123 ABC 123
123 ABC 123 !!! cat d0g -+[] 123
Now we’re just getting weird with this. Let’s go over each line.
print( string.match( test, "%d+" ) ) gets us “123”. We match a digit via %d and look for 1 or more repetitions with our magic character (“+”).
print( string.match( test, "%d*" ) ) gets us “” (an empty line in the output). Remember that “*” matches zero or more times. The string starts with “a”, which is not a digit, so %d* matches zero digits right at the first position and the search stops there.
print( string.match( test, "%d+.-%d+" ) ) gets us “123 ABC 123”. We look for at least one digit, then anything, then at least one more digit. The “-” tells us to match any number of times, but to not be greedy.
print( string.match( test, "%d+.*%d+" ) ) gets us “123 ABC 123 !!! cat d0g -+[] 123”. We do the same thing as before, but we’re greedy so we match until the last instance of this.
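The magic character table also mentions captures with (). As a minimal sketch (the sample line and variable names are made up for illustration), wrapping parts of a pattern in parentheses makes string.match return just those parts instead of the whole match:

```lua
local line = "width = 1024"

-- Each pair of parentheses becomes one return value
local key, value = string.match( line, "(%a+)%s*=%s*(%d+)" )
print( key, value )               --> width 1024 (separated by a tab)

-- Captures come back as strings; convert when you need a number
print( tonumber( value ) + 1 )    --> 1025

-- No match at all gives nil
print( string.match( "no digits here", "%d+" ) )   --> nil
```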
Greedy Matching
Greedy matching means a repetition grabs as many characters as it can while still letting the rest of the pattern match, while non-greedy matching grabs as few as it can. Our test string has “123” near the start, another “123” midway, and a final “123” at the end. Both patterns match one or more digits, then anything, then one or more digits. With greedy matching, the “anything” part runs as far as possible, so the match extends all the way to the last “123” (digits are technically “any character” too). With non-greedy matching, it ends as soon as it sees the next batch of digits.
Greedy matching is useful when you want everything between two markers, even if the end marker also shows up somewhere in between. Non-greedy matching is useful when you want something in between a known pair of markers and don’t ever want it to spill over into the next pair. This matters when you have sequences of similar formatting (think tags in a string or similar). Greedy and non-greedy matching get more important with some of the other string functions.
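To make the tag case concrete, here is a small sketch with an invented snippet of markup. The only difference between the two patterns is “*” versus “-”:

```lua
local html = "<b>bold</b> and <i>italic</i>"

-- Greedy: runs to the LAST ">" in the string
print( string.match( html, "<.*>" ) )   --> <b>bold</b> and <i>italic</i>

-- Non-greedy: stops at the FIRST ">" it can
print( string.match( html, "<.->" ) )   --> <b>
```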
string.gmatch( s, pattern )
string.gmatch( s, pattern ) is similar to string.match, except it provides all of the matches. You can iterate over these with a for loop or similar. Let’s look at an example:
#!/usr/bin/lua5.1
local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"
for s in string.gmatch( test, "%d+" ) do
  print( "found: " .. s )
end
This gets us:
./luamatch.lua
found: 123
found: 456
found: 0
found: 789
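string.gmatch gets even more useful with captures, since each pass through the loop then hands you the captured pieces. Here is a sketch that collects key and value pairs into a table. The config string and the settings table are hypothetical:

```lua
local config = "host=localhost, port=8080, mode=debug"
local settings = {}

-- Two captures in the pattern means two loop variables
for key, value in string.gmatch( config, "(%w+)=(%w+)" ) do
  settings[ key ] = value
end

print( settings.host )   --> localhost
print( settings.port )   --> 8080
print( settings.mode )   --> debug
```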
string.gsub( s, pattern, replacement [, n] )
string.gsub( s, pattern, replacement [, n] ) is one of the most useful functions in all of Lua for working with strings. It allows you to take a string and replace something in it from a pattern. You get back a string and the number of replacements made. Here’s an easy example:
#!/usr/bin/lua5.1
local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"
local newstring, replacements = string.gsub( test, "%d+", "[Numbers]" )
print( "Replaced: " .. replacements )
print( "New string: " .. newstring )
This gets us:
./luamatch.lua
Replaced: 4
New string: abc [Numbers] ABC [Numbers] !!! cat d[Numbers]g -+[] [Numbers]
You can limit the number of replacements by adding in an integer n to limit how many occurrences you affect. This is useful for certain scenarios where you know roughly where the replacements will need to take place. You’ll use things like this in data munging.
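Here is a quick sketch of the limit in action, reusing the same test string. The second half shows one more trick: a capture can be referred to in the replacement as %1.

```lua
local test = "abc 123 ABC 456 !!! cat d0g -+[] 789"

-- Only replace the first 2 matches
local limited, count = string.gsub( test, "%d+", "[Numbers]", 2 )
print( count )     --> 2
print( limited )   --> abc [Numbers] ABC [Numbers] !!! cat d0g -+[] 789

-- %1 in the replacement stands for the first capture
local wrapped = string.gsub( "abc 123", "(%d+)", "<%1>" )
print( wrapped )   --> abc <123>
```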
string.find( s, pattern [, index [, boolplain ] ] )
string.find returns the beginning index and end index of the first match for a given pattern, or nil if nothing matches. The index affects where the search begins, and the boolean for a plain search controls whether the pattern is treated as a Lua pattern or as literal text.
Let’s look at a basic example:
#!/usr/bin/lua5.1
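-- Joins a start and end index into "a to b" (errors if string.find found nothing and returned nil)
function print_range (a, b)
  return a .. " to " .. b
end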
local test = "ABCdefGHIjkl123%d%s%wABC"
local i, j = string.find( test, "%d+", 3 )
print( "found '%d+' from " .. i .. " to " .. j )
print( "found 'ABC' at: " .. print_range( test:find( "ABC" ) ) )
print( "found '%d%d%d' at: " .. print_range( test:find( "%d%d%d" ) ) )
print( "found 'ABC' starting at 5: " .. print_range( string.find( test, "ABC", 5 ) ) )
print( "found '%d' at: " .. print_range( string.find( test, '%d', 1, true ) ) .. " with plain search on" )
This gets us:
./stringfind.lua
found '%d+' from 13 to 15
found 'ABC' at: 1 to 3
found '%d%d%d' at: 13 to 15
found 'ABC' starting at 5: 22 to 24
found '%d' at: 16 to 17 with plain search on
Note that this function returns an initial index and an ending index which can be passed to string.sub. In the example, we first store the indexes from a pattern search in i and j, then pass the results of string.find straight to print_range, a small helper we wrote to make our formatting easy.
This type of function is extremely useful for writing a basic parser. You can use the indexes to pull out specific pieces between certain ranges as you slowly take apart the string. Standard regular expressions and patterns solve a lot, but not everything.
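As a rough sketch of that parser idea, here is a hypothetical split helper (it is not part of Lua’s standard library) that walks a string with string.find in plain mode and cuts out the pieces with string.sub. It assumes the delimiter is not an empty string:

```lua
-- split is a hypothetical helper written for this example
local function split( s, delimiter )
  assert( delimiter ~= "", "delimiter must not be empty" )
  local parts = {}
  local start = 1
  while true do
    -- Plain search (4th argument true), so "|" or "." are taken literally
    local i, j = string.find( s, delimiter, start, true )
    if not i then
      -- No more delimiters: keep the rest of the string
      parts[ #parts + 1 ] = string.sub( s, start )
      break
    end
    parts[ #parts + 1 ] = string.sub( s, start, i - 1 )
    start = j + 1
  end
  return parts
end

local fields = split( "name|age|city", "|" )
print( #fields, fields[ 1 ], fields[ 2 ], fields[ 3 ] )
```

Checking for nil before using the indexes is the important habit here, since string.find returns nil when nothing is found.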
string.format( s, o1, …, on )
I saved the best and the worst for last. string.format is one of the most useful for outputting things, but can be one of the most complicated string functions to master unless you’ve worked with C. Speaking of which, this almost entirely follows printf. We’ll go over some of it, but it’s best to consult a reference for a lot of it.
string.format takes a string s which contains special sequences to format the string. Each sequence can have associated variables which are passed to be formatted as wanted. Let’s see an example of something easy:
#!/usr/bin/lua5.1
print( string.format( "My name is %s. I'm %d years old!", "John", 22 ) )
This gets us:
./stringfind.lua
My name is John. I'm 22 years old!
Don’t worry about understanding all of this yet. Just keep reading. string.format is extremely complicated, but also extremely powerful.
Basic Options for string.format
There’s a lot to go over before we can fully use string.format. Let’s see some of the basic options so that we can know what goes in our format string.
Format String | Description |
---|---|
%s | A standard string |
%d | A standard integer (base 10) |
%f | A standard floating point number |
%e | A number represented in exponential notation |
%g | A number represented in the shortest form possible (%e or %f) |
Other Options for string.format
Format String | Description |
---|---|
%u | Unsigned int |
%x | Hexadecimal formatting |
%X | %x but uppercase |
%o | Octal formatting |
%E | %e but uppercase |
%G | %g but uppercase |
%% | Just print ‘%’ |
Here’s an example which touches on all of the basics:
#!/usr/bin/lua5.1
print( string.format( "%s : %d : %i : %f : %e : %g", "Example", 123, 456, 1.23, 1234567890.0, 123.9000000 ) )
print( string.format( "%u : %x = %X : %o : %E : %G : %%", 123, 255, 255, 255, 123456789, 123.40000 ) )
This gets us:
./luaformat.lua
Example : 123 : 456 : 1.230000 : 1.234568e+09 : 123.9
123 : ff = FF : 377 : 1.234568E+08 : 123.4 : %
Notice that string.format can take a variable number of arguments. This is an interesting feature of Lua which allows some functions to take extra arguments as necessary. Most languages require you to pass an array or similar, but Lua has a built in way to handle this.
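You can write your own functions this way too. Here is a minimal sketch: log_line is a hypothetical helper (not something Lua ships with) which uses “...” to collect whatever extra arguments it is given and hands them on to string.format:

```lua
-- log_line is a hypothetical helper written for this example.
-- "..." stands for any number of extra arguments.
local function log_line( fmt, ... )
  local line = string.format( fmt, ... )
  print( line )
  return line
end

log_line( "My name is %s. I'm %d years old!", "John", 22 )

-- select( "#", ... ) counts how many arguments follow it
log_line( "%d arguments were passed", select( "#", "a", "b", "c" ) )
```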
Formatting Formats
printf can do some awesome formatting and justification, and so can string.format. Let’s see an example and then we’ll break down what all string.format can do:
#!/usr/bin/lua5.1
print( string.format( "% d vs. % d", 123, -123 ) )
print( string.format( "%+d vs. %+d", 123, -123 ) )
print( string.format( "[%-5d vs. %-5d]", 123, -123 ) )
print( string.format( "[%05d vs. %05d]", 123, -123 ) )
print( string.format( "[%-8.2f vs. %-8.2f]", 123.000123, -123.000123 ) )
print( string.format( "[%08.2f vs. %08.2f]", 123.000123, -123.000123 ) )
print( string.format( "[%#.2f vs. %#.2f]", 123.00000, -123.00000 ) )
print( string.format( "[%.0f vs. %.0f]", 123.01, -123.01 ) )
This gets us:
./luaformat.lua
 123 vs. -123
+123 vs. -123
[123   vs. -123 ]
[00123 vs. -0123]
[123.00   vs. -123.00 ]
[00123.00 vs. -0123.00]
[123.00 vs. -123.00]
[123 vs. -123]
Let’s break it all down.
Parts of a Format
Each format is built up of the following: %[flag][length][.[precision]][code]. We have gone over the basics of %[code] in our table. Let’s go over the flags and how to format our length and precision.
Flags
Flag | Description |
---|---|
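[single space] [ ] | Puts a space in front of positive numbers, where the minus sign would go (numbers) |
0 | Pads the number with zeroes up to the length (numbers) |
+ | Always shows the sign: + if positive, - if negative (numbers) |
- | Left align within the length |
# | Alternate form, as in C’s printf: keeps trailing zeroes for %g and adds a 0x or 0 prefix for %x or %o (numbers) |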
You can then specify the number of overall digits or characters and the amount of precision for your formatted item. Let’s look at an absurd example for a string:
print( string.format( "[% 8.3s]", "12345" ) )
This gets us:
[     123]
Almost all of these combinations work, so try them out and see what you get with weird options. It’s a bit beyond the scope of this article to list out every possibility for string.format. There are other options, but I have excluded them due to their rarity, and due to the fact some are not implemented/applicable in Lua.
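To give the length and precision options something practical to do, here is a sketch which lines up a small table of data. The rows are invented for the example:

```lua
local rows = {
  { "apple", 3, 0.5 },
  { "watermelon", 12, 3.25 },
}

for _, row in ipairs( rows ) do
  -- %-12s : left-aligned string, 12 wide
  -- %4d   : right-aligned integer, 4 wide
  -- %8.2f : right-aligned float, 8 wide, 2 decimal places
  print( string.format( "%-12s|%4d|%8.2f", row[ 1 ], row[ 2 ], row[ 3 ] ) )
end
```

Every line comes out the same width, so the columns stay aligned no matter how long each value is (as long as it fits in the length you picked).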
True Unicode
We’ve gone through all of the cool string features in Lua, and we can work with UTF-8 as long as we don’t need certain things, but what happens when we need true Unicode support? Things break down pretty quickly. Because the standard functions work on bytes, it’s hard to break a string down into its individual characters. It’s also hard to deal with lengths and similar.
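Here is a small sketch of the length problem using only the standard string library. The word is “café” with the é written out as its two UTF-8 bytes, and the counting trick (count every byte that is not a UTF-8 continuation byte) is a rough workaround that assumes the input is valid UTF-8:

```lua
-- "café": the é is the two bytes 195 and 169 in UTF-8
local word = "caf\195\169"

-- string.len counts bytes, not characters
print( string.len( word ) )   --> 5

-- Count the bytes which are NOT continuation bytes (128 to 191)
local _, chars = string.gsub( word, "[^\128-\191]", "" )
print( chars )                --> 4
```

This kind of trick only gets you so far, and functions like string.reverse or string.upper will still scramble or ignore multi-byte characters.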
Luckily, there is a good UTF8 library which can be installed via Luarocks. For this specific library, unlike our Lunajson library, you’ll need a C compiler set up with Luarocks. Make sure you have GCC or similar on your system before trying to use this library.
You want to do the following to get it installed on Debian or MacOS:
sudo luarocks install utf8
Installing https://luarocks.org/utf8-1.2-0.src.rock
gcc -O2 -fPIC -I/usr/include/lua5.1 -c lutf8lib.c -o lutf8lib.o
gcc -shared -o utf8.so -L/usr/local/lib lutf8lib.o
utf8 1.2-0 is now installed in /usr/local (license: MIT)
This library doesn’t need much, but works great for dealing with UTF-8. I use it in my translation work to split up characters and map Pinyin from dictionaries. For Lua 5.3, this library is called “lua-utf8” (and is installed as “luautf8” from Luarocks).
Per the documentation, it includes the following functions:
utf8.byte
utf8.char
utf8.find
utf8.gmatch
utf8.gsub
utf8.len
utf8.lower
utf8.match
utf8.reverse
utf8.sub
utf8.upper
The usage is the same as our standard string functions. To use it, just use the following line of code:
utf8 = require 'utf8'
Or for Lua 5.3:
utf8 = require 'lua-utf8'
Lua Strings
This article covers the basics of string functions in Lua. A lot of these are directly ported from C, but there are some notable differences. I have skipped things which are missing from C and which aren’t commonly used. Most of these functions are fine to use with UTF8, but there are some things missing.
Review this article for when you need some of these functions. You should try to memorize at least a few of the operations as soon as possible. We’ll have some new exercises coming up to work with different real life examples soon enough.