Data Munging: Reverse Engineering a Format

Data munging is the practice of programmatically taking data in one format and converting it into another. It’s extremely important for working with logs and esoteric data formats, and it’s a useful skill everywhere from basic systems automation to more specialized fields like GIS or finance.

We’re going to cover the basics of how to approach data munging, how to look at and think about the reverse engineering process, and a few examples with Lua using real data formats.

What Is Data Munging Good For?

Data munging is one of the most important skills for grabbing the low-hanging fruit in automation. There are countless data formats floating around which are painful and expensive to convert. Writing a tool is (usually) easy if you understand the format you’re converting from and the format you’re converting to.

I had to write many tools when I worked in the energy industry, whether because technology became obsolete, log and data formats fell out of support, or an expensive software license was no longer financially viable to keep renewing. By using data munging, we saved thousands and didn’t have to buy expensive converters, or even worse, buy ancient software and the systems to run it. I’ve also written tools to convert Windows logs and similar formats for automation.

Data munging can be boiled down to “converting between formats with extra steps”. You have to figure out the format you want to go from, the format you want to get to, and how to make it happen. That may include using other converters in the process in order to make the path more practical.

Choosing a Language

The language is usually pretty immaterial to converting a data format. You do have to take into account whether the data is binary or text-based, though. Almost any language can handle either, but you want to assess the built-in functions before committing.

I prefer a statically, strongly typed language (usually C or C++) for working with explicitly binary data, and a dynamically typed language (usually Perl or Lua) for text-based data and mixed binary and text data. Remember, this isn’t a rule, it’s just my personal preference. I use this arrangement because I want binary reads to fail on a type mismatch, since most binary data I convert needs to be type-aware for its final use case.

The Process at a Glance

The data munging process can be broken down to the following steps:
1. Assess the Format
2. Assess Existing Tools
3. Sanitizing Data
4. Plan a Conversion Process
5. Convert the Data
6. Sanity Check the Data

Let’s get into what each of these steps means and entails.

1. Assess the Format

The first thing to do is to break down what format you are working with. Is it binary, text based, or a mix? What kind of header does it have (if any)? Are there comments? All of these factor into how you approach taking apart the format.

Some formats are fixed width, some use a delimiter, and others have tags or other complicated structures which define each line or section. Assess what tools already exist. Converting an extremely esoteric format into a less esoteric one can be worthwhile; then you can write a simpler tool to finalize the process.

The other important part to consider is what format you are going to and why. You have to know what the data will be used for. If you’re archiving data, you probably shouldn’t omit anything. If you’re just trying to extract a few data points, why write a full-on parser if each point is in an easy-to-find place? Don’t spin your wheels for the sake of perfection; get your data out and move on.
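
To make that concrete, here’s the sort of throwaway extraction I mean, sketched in Lua. The log file name and the version= field are made up purely for illustration:

#!/usr/bin/lua5.1

--grab one data point per line and move on, no full parser required
for line in io.lines( "input.log" ) do
	local v = line:match( "version=(%S+)" )
	if v ~= nil then print( v ) end
end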

2. Assess Existing Tools

The more complicated the format, the harder it is to work with. What tools already exist, if any? Are there tools to change the format into something easier to work with? Is there an intermediate format you can convert your esoteric format to, which can then be converted into what you need? This has saved me huge amounts of time when done right. You may still need to do some data munging, but it can reduce the scope of the work substantially.

3. Sanitizing Data

Consider whether you only need parts of the format and what can be cut. Can you do some processing to cut out the complex parts? This can be manual or automated depending on the scope of the changes. Depending on the rules governing the format, it can be faster to just pop the files open and delete an erroneous section by hand than to automate it.

The earlier you sanitize data, the fewer complications will arise. The simpler the format you munge to early on, the easier it is to push through a processing pipeline. If there are several versions of a format with differing levels of complexity, it might be worth targeting an older version to make your tool easier. It’s often worth writing extra tools to reduce unpredictability and complexity at each step.
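
As a sketch of what sanitizing early can look like, here’s a small Lua filter that drops comment lines before the real parser ever sees them. The leading # rule is just an assumption for illustration; use whatever rule your format actually follows:

--strip_comments - drops comment lines so later stages stay simple
--takes: (table of lines) lines
--returns: (table of lines) out
function strip_comments ( lines )
	local out = {}
	
	for i, line in ipairs( lines ) do
		if line:sub( 1, 1 ) ~= "#" then
			table.insert( out, line )
		end
	end
	
	return out
end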

4. Plan a Conversion Process

Plan a pipeline to take format A to format A’. It may require several jumps with preexisting tools and some minor tweaks at each step, or it might require a complex tool to convert the format yourself. You won’t know until you’ve assessed everything from start to finish. Once you have a plan, you can begin attacking the problem.

5. Convert the Data

I try not to write any code (aside from exploratory code) until I have a full plan in place. This is the moneymaking step of the whole process, and every step leading up to it is just preparation. Turn one format into the other with whatever tricks you need. Implement each step in your plan and make sure you can convert the data as you go. If you’ve planned properly, this phase is usually pretty easy (depending on the format).

6. Sanity Check the Data

The last step was the most important in the whole data munging process, but sanity checking the data is almost as important. It’s a bit of a non-issue if you can’t convert the data, but when you do convert it, you need to be sure everything is right. Spot check points in the data to make sure they’re correct. Import the data into a program that uses it to make sure it works, or run it through another converter and see if it throws warnings or errors. Everything you did is in vain if you don’t actually check it.
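
One cheap spot check is comparing record counts before and after a conversion; it won’t prove the values are right, but it catches dropped records immediately. A minimal sketch with placeholder file names:

--count_lines - counts the lines in a file as a crude conversion check
--takes: filein
--returns: (number) n
function count_lines ( filein )
	local n = 0
	
	for line in io.lines( filein ) do
		n = n + 1
	end
	
	return n
end

print( count_lines( "input.csv" ), count_lines( "output.csv" ) )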

Doing It For Real

We’ve covered the theory, but let’s see how it applies to a few test formats. Let’s go over an example involving a basic Comma Separated Values (CSV) file. We’re going to reinvent the wheel with CSV, which is something you should ideally never do (use a pre-made, well-tested library instead). The CSV format is one of the worst for data munging, but we’ll get into that in a minute. We’ll also cover a specialized Chinese to English dictionary, CEDict (released under a CC Attribution-ShareAlike 4.0 International license).

To really get a grip on the code, this assumes you are familiar with string operations, files, classes, tables, and functions in Lua, among the language’s other basics. Lua is very simple (like most newer languages, it’s easy to learn, hard to master), so most of this should be pretty self-explanatory if you have some programming knowledge.

Comma Separated Values – CSV

Let me just start off by saying: if you came here to make your own CSV parser, please reconsider. Do not write your own CSV parser except as an exercise or as an act of desperation in a horribly limited environment (like a Zipit). Unless you’re next level as a coder or have a lot of eyes on the project, writing a CSV parser is a fool’s errand. Warnings aside, let’s look at writing a test CSV parser (but please don’t actually use it for anything real).

1. Assess the Format

CSV is a wildly variable format. It often gets lumped in with fixed-width text formats too. Despite being called a comma separated value format, you can use pretty much anything as a delimiter. Different programs spit out different flavors of CSV, and there’s no easy way to tell what’s what.

Let’s look at some valid examples:

Key,Value,Order
Pookie,Dog,1
Kitty,Cat,2
Flippy,Spatula,3


"Key","Value","Order"
"Pookie","Dog",1
"Kitty","Cat",2
"Flippy","Spatula",3


Key	Value	Order
Pookie	Dog	1
Kitty	Cat	2
Flippy	Spatula	3


"Key",Value,Order
Pookie,"Dog, ""dog""",1
Kitty,Cat,"2"
"Flippy","Spatula",3


All of these are valid CSV files (yes, even the one that uses tabs is still colloquially a CSV). Look at how the quotes are done in the last example as well. This is why CSV is such a hard format to deal with.

Figure out whether you’ll need to deal with the exotic ones or not. You should use a CSV library outside of a learning exercise, but let’s pretend we know our CSV will be sane. We’re only going to deal with something like the first or second example for this exercise.

2. Assess Existing Tools

Let’s skip this since this is a learning example. Again, you shouldn’t ever be writing a CSV parser for production unless you absolutely have to!

3. Sanitizing Data

It can be worth pulling quotes out at this stage if we know we have a fixed-format CSV without embedded quotes. If you don’t have any embedded commas, you can regex double double quotes ("") into single double quotes (") and then regex out the quotes wrapping each field. The simpler you can make the CSV before processing, the simpler you can make the whole process.
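
Here’s a minimal sketch of that kind of pre-pass in Lua, assuming the fields contain no embedded commas and no quotes beyond the escaped kind:

--strip_quotes - collapses escaped quotes, then drops the quotes wrapping each field
--takes: line
--returns: line
function strip_quotes ( line )
	line = line:gsub( '""', '"' ) --double double quotes become single double quotes first
	line = line:gsub( '"([^",]*)"', '%1' ) --then the quotes wrapping whole fields get dropped
	
	return line
end

print( strip_quotes( '"Kitty","Cat","2"' ) ) --prints: Kitty,Cat,2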

4. Plan a Conversion Process

What are we dealing with? Is it a complicated mixed format CSV or is it predictable? Audit the CSVs to see. We don’t want to deal with complicated files when we could simplify the outliers.

5. Convert the Data

Let’s work with the following CSV:

"C1",C2,C3
a,1,alpha
b,2,"beta"


This CSV is luckily very predictable. We may have quotes or we may not, but ultimately they don’t affect the readability of our CSV. That means we can just drop them, as long as we split on the commas.

We get code like the following for processing this CSV:

#!/usr/bin/lua5.1

local csv = "test.csv"

function get_data ( filein )
	local f = io.open( filein, "rb" ) --open our file to check if it exists, this isn't strictly speaking needed, but is cleaner for following our logic
	local data = {}
	
	--if filein exists and is readable, close the probe handle and read all of the lines in; otherwise we fall through and return the empty table
	if f ~= nil then
		f:close()
		for line in io.lines( filein ) do
			table.insert( data, line )
		end
	end
	
	return data
end

function chomp ( line )
	return string.gsub( line, "%s+$", "" )
end

function split ( line, sep )
	local fields = {}
	
	local index = 1
	
	while index <= line:len() do --<= so a single-character final field isn't dropped
		local myEnd = line:find( sep, index, true ) --plain find, so the separator isn't treated as a pattern
		
		if myEnd ~= nil then
			table.insert( fields, line:sub( index, myEnd - 1 ) )
			index = myEnd + 1
		else
			table.insert( fields, line:sub( index, -1 ) )
			break
		end
	end
	
	return fields
end

function clean_field( f )
	local field = f:match( "\"(.*)\"" )

	if field ~= nil then
		return field
	end
	
	return f
end

function join( arr )
	local string = ""

	for i, f in ipairs( arr ) do
		if( arr[ i + 1 ] ~= nil ) then
			string = string .. f .. "; "
		else
			string = string .. f
		end
	end
	
	return string
end

local data = get_data( csv )

for i, line in ipairs( data ) do
	local sLine = split( chomp(line), "," )
	
	local fields = {}
	
	for j, l in ipairs( sLine ) do
		table.insert( fields, j .. ": " .. clean_field( l ) )
	end
	
	print( join( fields ) )
end


We have roughly 80 lines for a basic CSV program that extracts data without doing anything to it. This is actually really clean, simple code all things considered (there are plenty of improvements that can be made, but we’re sticking to simple code). We’re using a clean, predictable CSV, so it’s easy to write for.

Let’s look at a few of the functions. function get_data ( filein ) reads in our CSV and returns the data as an array (a table in Lua) of lines. function chomp ( line ) strips all of the trailing whitespace off the end of a line. function split ( line, sep ) splits our line on the supplied separator. function clean_field( f ) removes the quotes at the beginning and end of a field. Finally, function join( arr ) makes our output look nice by joining the fields in an array together with a fixed separator. We put all of these together to get our data, then we split it up and clean it, and all of this work gets us:

./csv.lua 
1: C1; 2: C2; 3: C3
1: a; 2: 1; 3: alpha
1: b; 2: 2; 3: beta


At least now the data is normalized and easy to work with. This is also an insanely clean CSV. Real life data doesn’t tend to be this clean!

6. Sanity Check the Data

Since our quotes don’t matter, there isn’t much to check. This is an academic exercise, so don’t expect things to be this easy going forward. Working with CSVs is often miserable at best unless you sanitize the input. Keep in mind that we could be handed any flavor of CSV, and neither the file extension nor the header will tell us which one we’ve got.

It is usually computationally cheaper to have something check what kind of file we’re working with and only parse it if it’s sane than to charge ahead blindly. Otherwise, we alert that we have a weird file. What do you do when you have to deal with a comma inside quotes? Things get a lot more interesting then!
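
As a sketch of what that check might look like, here’s a rough delimiter sniff that counts commas against tabs on a line. It assumes the first line of the file is representative, which is far from guaranteed in the wild:

--sniff_separator - guesses the delimiter by counting candidates on one line
--takes: line
--returns: (string) separator
function sniff_separator ( line )
	local _, commas = line:gsub( ",", "," ) --gsub's second return value is the match count
	local _, tabs = line:gsub( "\t", "\t" )
	
	if tabs > commas then
		return "\t"
	end
	
	return ","
end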

Working With CEDict

1. Assess the Format

The CEDict format is pretty easy; let’s look at a small snippet below:

# CC-CEDICT
# Community maintained free Chinese-English dictionary.
# 
# Published by MDBG
# 
# License:
# Creative Commons Attribution-ShareAlike 4.0 International License
# https://creativecommons.org/licenses/by-sa/4.0/
# 
# Referenced works:
# CEDICT - Copyright (C) 1997, 1998 Paul Andrew Denisowski
# 
# CC-CEDICT can be downloaded from:
# https://www.mdbg.net/chinese/dictionary?page=cc-cedict
#[…]
#! version=1
#! subversion=0
#! format=ts
#! charset=UTF-8
#! entries=116908
#! publisher=MDBG
#! license=https://creativecommons.org/licenses/by-sa/4.0/
#! date=2019-04-19T05:57:09Z
#! time=1555653429
#[…]
㧯 㧯 [lao3] /round-bottomed wicker basket/(dialect) to lift/to carry on one's shoulder/
㧱 拿 [na2] /old variant of 拿[na2]/
#[…]
一 一 [yi1] /one/1/single/a (article)/as soon as/entire/whole/all/throughout/"one" radical in Chinese characters (Kangxi radical 1)/also pr. [yao1] for greater clarity when spelling out numbers digit by digit/
一一 一一 [yi1 yi1] /one by one/one after another/
一一對應 一一对应 [yi1 yi1 dui4 ying4] /one-to-one correspondence/
一一映射 一一映射 [yi1 yi1 ying4 she4] /bijective map (i.e. map between sets in math. that is one-to-one and onto)/one-to-one correspondence/
一丁不識 一丁不识 [yi1 ding1 bu4 shi2] /illiterate/ignorant/
#[…]
上龍 上龙 [shang4 long2] /pliosaurus/
下 下 [xia4] /down/downwards/below/lower/later/next (week etc)/second (of two parts)/to decline/to go down/to arrive at (a decision, conclusion etc)/measure word to show the frequency of an action/
下一代 下一代 [xia4 yi1 dai4] /the next generation/
下一個 下一个 [xia4 yi1 ge5] /the next one/
#[…]


Anything with #[…] is just there to show that we’ve omitted a bunch of text. What do you see when you look at this format? Well, first off, we can guess that anything starting with a # is a comment, anything with a #! is metadata, and everything else is a word or phrase to parse.

This format is easy and predictable, but it’s best to hunt down a format specification if you can. Not everything will have one, but if it does, you can know exactly what to expect rather than having to reverse engineer the format.

Purpose and Format

We know that a # means either a comment or metadata is about to come. Do we care, for our purposes? I used CEDict to write a quick translation tool, so I really don’t need the comments or metadata. We can safely ignore both and skip anything with a # at the beginning.

The format includes both traditional and simplified Chinese characters. The header metadata tells us that each entry begins with “[traditional] [simplified]” (that’s the #! format=ts line), and we can also verify this if we know Chinese. We also know that a space (or whitespace) is used as a separator between blocks.

Our third piece is the pronunciation in Pinyin, and the fourth is the definition. Since space is a separator, the format uses an opening and closing pattern around these two fields (since they can contain spaces). The pronunciation is surrounded by [ and ] while the definition(s) are surrounded by (and separated by) /.
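
Before writing the full parser, we can probe this layout against a single sample line. A quick sketch: Lua’s stock patterns work byte by byte rather than character by character, but that’s fine here because every delimiter we care about is plain ASCII:

local line = "一一 一一 [yi1 yi1] /one by one/one after another/"
local tc, sc, py, def = line:match( "(.-) (.-) %[(.-)%] /(.*)/" )

print( tc, sc, py, def ) --prints: 一一  一一  yi1 yi1  one by one/one after another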

2. Assess Existing Tools

There are other tools that can do this, but this is a learning exercise. There also isn’t a much easier format we could reduce this to. The only exception is if we are intentionally developing a dictionary or something with a different standard. Converting to your own format is an easy way to increase technical debt if you don’t have a good reason for doing so.

3. Sanitizing Data

This is a sane format, so we don’t need to do anything to clean it before processing. Depending on the language you plan to use, it might be worth doing something to strip comments and metadata or similar. You may want a separate parser to track the metadata for certain applications. I just want a basic dictionary, so none of that matters for my use case.

4. Plan a Conversion Process

Our format works out to each line being one of:

#Comment

or

#!Metadata

or

[Traditional Character] [Simplified Character] [[Pronunciation]] /[Definition1]/[Definition2]/[…]/


Since we don’t care about comments or metadata, we can just exclude anything which begins with #. We also have a formula we can use to split a definition line into its parts. All we need to do is exclude the two types we don’t care about and break the last type down into what we want.

5. Convert the Data

We’re going to build this in Lua. There are two basic functions we need: something to read our data, and something to process it.

Read:
--get_data - gets our data from our file and returns a more workable table of data
--takes: filein
--returns: [table] data
function get_data ( filein )
	local f = io.open( filein, "rb" ) --open our file to check if it exists, this isn't strictly speaking needed, but is cleaner for following our logic
	local data = {}
	
	--if filein exists and is readable, close the probe handle and read all of the lines in; otherwise we fall through and return the empty table
	if f ~= nil then
		f:close()
		for line in io.lines( filein ) do
			table.insert( data, line )
		end
	end
	
	return data
end


We open our filein once just to do a quick and dirty check of whether it’s there or not. We then use io.lines to get each line and load it into our data table. If we can’t access the file, we return the empty table.

Parse:
utf8 = require( 'utf8' ) --we use this because we're working with utf8 data

--[…]

--chomp - returns a string with all endlines removed
--takes: line
--returns: line
function chomp ( line )
	return utf8.gsub( line, "\n$", "" )
end

--[…]

--convert_dict - converts a table into a hash table of our data
--takes: (table split by line) dictd
--returns: (hash table with ["trad"] and ["simp"], with each character) dict
function convert_dict ( dictd )
	local dict = { ["trad"] = {}, ["simp"] = {} }
	
	for i, line in ipairs ( dictd ) do
		local cline = chomp( line ) --local so we don't leak a global on each pass
		
		if cline ~= nil and utf8.len( cline ) >= 1 then
			if utf8.sub( cline, 1, 1 ) ~= "#" then
				local tc, sc, py, def = utf8.match( cline, "(.*) (.*) %[(.*)%] %/(.*)%/" )
				--if you aren't familiar with Lua, think of it this way with each () as what we're pulling out
				--local tc = utf8.match( cline, "(.*) .* %[.*%] %/.*%/" )
				--local sc = utf8.match( cline, ".* (.*) %[.*%] %/.*%/" )
				--local py = utf8.match( cline, ".* .* %[(.*)%] %/.*%/" )
				--local def = utf8.match( cline, ".* .* %[.*%] %/(.*)%/" )
				
				if dict[ "trad" ][ tc ] == nil then dict[ "trad" ][ tc ] = {} end
				if dict[ "simp" ][ sc ] == nil then dict[ "simp" ][ sc ] = {} end
				
				table.insert( dict[ "trad" ][ tc ], { ["tc"] = tc, ["sc"] = sc, ["py"] = py, ["def"] = def } )
				table.insert( dict[ "simp" ][ sc ], { ["tc"] = tc, ["sc"] = sc, ["py"] = py, ["def"] = def } )
			end
		end
	end
	
	return dict
end


First off, we pull in a utf8 module. On Lua 5.2 and earlier this has to be a third-party module, and the utf8 library that ships with Lua 5.3+ doesn’t provide these pattern functions anyway. There’s no point reinventing the wheel here. We also set up a chomp function to get rid of newlines at the end.

We then employ a convert_dict function to turn our CEDict text into something we can work with. We instantiate a table for our data with the keys “trad” and “simp”. We then go through each line, make sure it isn’t just whitespace or otherwise erroneous, and then check whether it’s a comment or metadata.

Once we know it’s neither, we use Lua patterns to extract the relevant data. We put the pieces into a few variables and then append them to a table acting as an array. This matters because a Chinese character can have multiple pronunciations and meanings, so we want to collect every entry for a character rather than just the last one. We also store the definition with each entry before returning so we can separate each field out later. We populate both a traditional and a simplified character table so we can use the dictionary with either.
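
To see why this shape is convenient, here’s what a lookup against the finished table can look like. This is just a usage sketch; the file name is a placeholder for wherever you saved the dictionary:

local dict = convert_dict( get_data( "cedict.txt" ) ) --placeholder file name

--print every entry for one simplified character, since a character can have several readings
for i, entry in ipairs( dict[ "simp" ][ "下" ] or {} ) do
	print( entry[ "py" ], entry[ "def" ] )
end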

More Processing!

We then have to extract our data, or otherwise process it, but the process remains pretty easy. We read in our whole dictionary (it’s small enough that we don’t need to care about RAM), and then we process it. This workflow is traditionally considered wasteful, but how much RAM do you have and how much do you need? You need to be a little more considerate with larger data files, but our sample is small and negligible (9MB). Don’t try to optimize before you know what you’re trying to do and why.
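
If the file were big enough that slurping it all in mattered, the read step can collapse into the parse step: process each line as io.lines hands it over instead of building the intermediate table first. A rough sketch of that shape:

for line in io.lines( "cedict.txt" ) do --same placeholder file name
	--chomp, skip comments, and pattern match here, one entry at a time
end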

6. Sanity Check the Data

Cool, now we have a bunch of data in a Lua table. What use is it? The original scope of my program was to convert characters to Pinyin. That makes a great test and shows how accurate we are. I leave this as an exercise for the reader, because your use case will dictate how you test.

You can take the data, throw some characters at it, and see whether you get standard Pinyin back or not. Throw in some junk that shouldn’t be in the dictionary and see how your conversion handles it. Once you know your data is solid and reliable, you can either export it in a new format or use it to do whatever work you need.
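
Sticking with the dict table convert_dict builds, a spot check along those lines might look like this sketch (the junk probe is deliberately nonsense):

--known characters should resolve; junk should come back empty
local probes = { "一", "下一代", "xyzzy" }

for i, p in ipairs( probes ) do
	local hits = dict[ "simp" ][ p ]
	
	if hits ~= nil then
		print( p .. ": " .. #hits .. " entries" )
	else
		print( p .. ": no match" )
	end
end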

Data Munging for Fun and for Profit

We’ve gone over the basic process and what it implies. Keep in mind, we used trivial examples. When you get into data munging, you’ll probably end up with a lot messier code to deal with a given format. Don’t try to future-proof the code unless you expect more data in the given format.

Remember to follow the 6 steps to process data. Assess the format you’re working with and assess existing tools. Sanitize data separately to reduce complexity. Plan a conversion process, and then implement it with any necessary glue. Make sure to check the data afterwards.

Data munging is easy and data munging is hard. Automate your job and simplify the conversion process where possible. Keep in mind the end goal to avoid complicating your work. Only convert what you need to convert unless there’s a reason to back up a format. It’s not the sexiest coding, but it’s reliable and challenging.

Data munging is both profitable and rewarding. It provides a challenge and it pays well. Learn to work with esoteric formats and format specifications. I bought my house off the back of data munging. There are countless tools which sell for good money. What will data munging get you?
