Modern Perl: Why Perl Rules For Text

Perl, the Swiss Army Chainsaw, may not be a hot new language, but it is still a powerhouse for text processing. I originally learned Perl to work with text in the form of log processing. When I started learning Chinese seriously, I used it to digest massive stores of text to see what I should focus on. CGI and Dancer let me create web applications, which really kick-started my current programming career.

Modern Perl has many reasons it’s so effective with text, and it also supports newer data formats which keep it relevant. Perl isn’t going to be the best choice for every single problem, but it can outclass many languages for text processing.

Perl and Text

Perl was originally created as a general purpose language to help with reports. Reporting requires extracting and processing text, as well as easy ways to shape and output the relevant data. Perl enables one-liners, has text-specific functions which are not necessarily in every (common) language, and has a type system which lends itself to working with data.

Perl One-Liners

One-liners are small programs which can be run from a single line at the shell. This doesn’t sound that powerful, but imagine you want to convert a DOS or Windows file to a Unix or POSIX format, or vice versa. You don’t have to write a full script; you just invoke the language a special way. All you have to do to go from DOS to POSIX is run:

perl -pe 's/\r//g' myfile.dos > myfile.unix
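
Going the other way is just as quick. As a sketch (reusing the same placeholder file names), the classic trick is to add a carriage return in front of each newline:

perl -pe 's/$/\r/' myfile.unix > myfile.dos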

This operation is quick and simple, and can be run on any machine with Perl. You can also dig through logs much more quickly and effectively with Perl. Let’s take the fake log fakelog:

WARN: Line 234 more than 140 characters, which is more than old Tweets could have
ERROR: There's a problem on line 234
PRINT: There are characters on line 234, but no semicolons
PRINT: Some junk no one wants
PRINT: More junk!!
WARN: Oh no, some stuff that doesn't hurt anything
ERROR: There's a big problem on line 234
PRINT: I'm a fake log!
WARN: Here's some problem that will affect you in 202020 if you don't patch me
PRINT: Line 234 ends with a newline
ERROR: Seriously, line 234, it's bad!
PRINT: Did you know that 'today' ends in 'y'?

This log has a lot of useless lines which are just noise when hunting bugs. Let’s filter it down to everything that’s an error:

perl -ne 'print if m/^ERROR/g' fakelog 
ERROR: There's a problem on line 234
ERROR: There's a big problem on line 234
ERROR: Seriously, line 234, it's bad!

Well, we know that something is up with ‘line 234’; let’s see if the log includes anything that can narrow this down further. We just run:

perl -ne 'print if m/line 234/g' fakelog 
ERROR: There's a problem on line 234
PRINT: There are characters on line 234, but no semicolons
ERROR: There's a big problem on line 234
ERROR: Seriously, line 234, it's bad!

By using this method, we can deduce that our issue is a missing semicolon or something similar on line 234. While this is a bit of an exaggeration compared to the logs you’ll deal with, the same process applies to real-life data. Warnings and standard debug output usually don’t mean much, until they suddenly do. A huge number of major issues boil down to some basic debug or verbose output pointing the way to the real problem and its solution when the error message alone doesn’t get you there.
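
In practice you often stack conditions in one pass; for instance, to keep only the error lines that also mention line 234:

perl -ne 'print if m/^ERROR/ && m/line 234/' fakelog
ERROR: There's a problem on line 234
ERROR: There's a big problem on line 234
ERROR: Seriously, line 234, it's bad!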

Despite the snark that Perl is a write-only language, it might as well be one for me given the number of things I throw at one-liners. I write as many projects as I can in Perl, but the glue it provides shows up so often in my day-to-day computing that it outclasses my actual programming on many fronts. Why open a text editor, hit Ctrl+H, and figure out its internal regex flavor when I can just use Perl?
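
For instance, a search and replace across a whole file, leaving a backup copy behind, is a single line (the pattern and file name here are only placeholders):

perl -pi.bak -e 's/oldtext/newtext/g' notes.txt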

Perl Regex

When most coding applications implement regex, they delineate between “regular expressions” and “Perl regular expressions” (which many newer tools now call “advanced regular expressions”). The Perl regular expression engine is insanely powerful. You could theoretically write a basic XML parser with it (just don’t), and it is actually practical to write a basic parser for a format when its rules are relatively predictable.

We’ve used bits and pieces of Perl’s regular expressions in the previous one-liner examples, since the engine lends itself to the task so well. I’ve written several data munging tools with a combination of one-liners and complicated regular expressions. Languages like Lua have pattern matching, but its implementation looks like the first lesson of a full Perl course, and Lua is considered decently capable on this front. Neither system is simple, but Perl just takes it further than most and is arguably more expressive.
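
As a small sketch of what the engine offers beyond plain matching, named captures let you pull structured data straight out of a string. The log line is borrowed from the fake log above; the field names are made up:

#!/usr/bin/perl -l

use strict;
use warnings;

# Named captures land in the %+ hash under the names you give them.
my $line = "ERROR: There's a problem on line 234";

if ( $line =~ m/^(?<level>\w+): .* line (?<line>\d+)/ ) {
    print( "level=" . $+{level} . ", line=" . $+{line} );
}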

String Processing in Perl

Certain built-in functions in Perl just make sense. Things like chomp make working with user input a breeze: it removes the trailing newline that comes from line-based input but isn’t really part of the data. You have to build this from scratch in Lua (and others) if you want it.

Perl:

#!/usr/bin/perl -l

use strict;
use warnings;

# chomp strips the trailing input record separator (a plain newline by default)
my $str = "junk\n";
chomp( $str );
print( "[" . $str . "]" );

Lua:

#!/usr/bin/lua5.1

function chomp( line )
	-- strip any trailing carriage returns and newlines, in case we care about being cross platform
	-- (the extra parentheses drop gsub's second return value, the substitution count)
	return ( string.gsub( line, "[\r\n]*$", "" ) )
end

local str = "test\n"

print( "Shortened str to: " .. chomp( str ) )

This is just a quality of life improvement, but it also shows where the developers placed their effort. Perl makes text work a breeze; other languages require you to build more and more boilerplate. Perl is the first language I reach for when I need to build a parser or reporting process.
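
A minimal sketch of that kind of job, assuming an invented “key = value” config format, might look like this:

#!/usr/bin/perl -l

use strict;
use warnings;

# Parse a made-up "key = value" format into a hash, then report it.
my %config;

while ( my $line = <DATA> ) {
    chomp( $line );
    next if $line =~ m/^\s*(#|$)/;    # skip comments and blank lines
    my ( $key, $value ) = $line =~ m/^\s*(\w+)\s*=\s*(.*?)\s*$/;
    $config{ $key } = $value if defined $key;
}

print( "$_ = $config{$_}" ) for sort keys %config;

__DATA__
# a fake config file
name = example
retries = 3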

There are other basic functions, like join, which exist in languages like C# but aren’t in others like Lua.

Perl:

#!/usr/bin/perl -l

use strict;
use warnings;

my @arr = ( "a", "b", "c" );
my @arr2 = ();
my @arr3 = ( "a", "b" );

print( join( " AND ", @arr ) );
print( join( " AND ", @arr2 ) );
print( join( " AND ", @arr3 ) );

Lua:

#!/usr/bin/lua5.1

function join( tbl, jstring )
	-- the parameter is named tbl so it doesn't shadow Lua's built-in table library
	local str = ""

	local tlength = #tbl

	if( tlength == 0 ) then
		return str
	end

	str = tbl[ 1 ]

	if( tlength == 1 ) then
		return str
	end

	for i = 2, tlength, 1 do
		str = str .. jstring .. tbl[ i ]
	end

	return str
end

local arr = { "a", "b", "c" }
local arr2 = {  }
local arr3 = { "a" }
local arr4 = { "a", "b" }

print( join( arr, " AND " ) )
print( join( arr2, " ANYTHING " ) )
print( join( arr3, " ANYTHING " ) )
print( join( arr4, " AND " ) )

Our Lua function is the lightest version I could write. It doesn’t check that the argument is actually a table, and it skips many other simple handling points. The Perl version just works; the Lua version is a teaching exercise. Languages like C# offer a join function, but their typing makes them a bit weaker than Perl for similar text processing tasks.

Perl Typing and Idioms

Perl works out to being weakly and dynamically typed; use strict doesn’t change that, it just forces you to declare variables before using them. The weak typing means that a scalar is a scalar: we can stick an int where a string was without any ceremony. Perl also allows us to do something most languages won’t: “cat” + 1 equals 1 in Perl, because a string with no leading number numifies to 0. The conversion between a string and a number isn’t something you have to think about, it just happens.
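
A quick sketch of that coercion in action:

#!/usr/bin/perl -l

use strict;
use warnings;
no warnings 'numeric';   # silence the "isn't numeric" warnings these coercions would otherwise trigger

print( "cat" + 1 );      # "cat" numifies to 0, so this prints 1
print( "3 dogs" + 1 );   # a leading number is used, so this prints 4
print( 5 . " cats" );    # numbers stringify just as easily: prints "5 cats"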

Perl’s typing lends itself to data processing at a level most languages don’t reach, and certain language design features make it more intuitive on many fronts.

For instance, all of these are valid:

if( [condition] ) { [do thing]; }
[do thing] if( [condition] );
[do thing] unless [condition];

In practice, that gives us something like:

#!/usr/bin/perl -l

if( 1 + 1 == 2 ) { print( "1 + 1 = 2" ); }
print( "1 + 1 = 2" ) if( 1 + 1 == 2 );
print( "1 + 1 = 2" ) unless 1 + 1 > 2;

Perl Versus the World

With the power of CPAN, Perl has everything it needs to take on XML, JSON, CSV, SQL, etc. There are multiple libraries to handle CSVs, from the most regular to the most obscure. XML and JSON are both well represented in Perl as well. The standard SQL databases, and plenty of obscure ones, almost all have modules for talking to them from Perl.
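
As a quick sketch, JSON handling is a few lines with JSON::PP, which ships with recent Perls (the data here is made up; CPAN offers faster drop-in replacements such as JSON::XS):

#!/usr/bin/perl -l

use strict;
use warnings;
use JSON::PP;

# Round-trip some made-up JSON: decode to a hashref, poke at it, encode it back.
my $json = '{"name":"fakelog","errors":3}';

my $data = decode_json( $json );
print( "errors: " . $data->{errors} );

$data->{checked} = 1;
print( encode_json( $data ) );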

Due to Perl’s other advantages with text processing and regular expressions, it’s usually easy enough to parse a text-based format. I’ve written many parsers to turn human-readable data into other kinds of human-readable data, and been paid well for it. Projects which take weeks in other languages can take nights in Perl. A $10,000-a-year program was replaced with some clever Perl scripts and Free and Open-Source Software (FOSS).

Stacking Languages

Perl was my first real language, and it has been my favorite since. I’ve moved on because I’ve had to, but the language is my point of comparison for every new paradigm. It isn’t perfect, but it beats anything else I’ve ever even considered using for text processing.

Lua is a bit more useful for certain things, but Perl is still the best general purpose scripting language for anything remotely POSIX where you can’t control the installed software. C# is better on Windows, but the language itself will struggle with certain parsing scenarios without a lot of work. C, C++, and many other common languages just sit somewhere on the spectrum below Lua.

No language is necessarily better overall; some are just better for specific tasks. Perl excels at text processing, and no other language manages to come close. Perl may not be right for every project, but almost every large project I’ve been involved in is glued together with it. A one-liner can save you hours, and the regex engine makes you feel like you’re in the future of computing, despite the many reports of the language being dead.

Featured image by Nino Carè from Pixabay