The Basics of Big O Notation

Big O notation expresses how an algorithm's running time grows relative to the size of its input as that input gets arbitrarily large. This gives us a way to describe how an algorithm behaves in best case and worst case scenarios.

This isn’t going to tell us exactly how long an algorithm takes to run (for reasons we’ll get into in a bit), but it gives us a way to rank and compare algorithms as a whole, and a sense of the scale involved. Big O notation doesn’t really matter for a small data set on most systems, but it comes into play with large data sets. Let’s get into some of the basics of the notation.

Actual Big O Notation

If you’ve seen things like O( n² ) or O( n log₂ n ), then you’ve seen the basic language of Big O notation. n is the size of our input. If you plug in an arbitrary n, you can see how quickly the cost grows.

We are going to cover four common Big O classifications. We have constant time with O( 1 ), which means the algorithm always takes a fixed amount of time to run. We have linear time with O( n ), which means the running time scales directly with n. We have linearithmic time with O( n log₂ n ), which grows a bit faster than linear time. Finally, we have quadratic time with O( n² ), which grows with the square of n. This is just the tip of the iceberg, as pretty much anything which can be represented mathematically can be used.

Big O            n      Output
O( 1 )           1      1
O( 1 )           10     1
O( 1 )           100    1
O( 1 )           1000   1
O( n )           1      1
O( n )           10     10
O( n )           100    100
O( n )           1000   1000
O( n log₂ n )    1      1*
O( n log₂ n )    10     33.22
O( n log₂ n )    100    664.39
O( n log₂ n )    1000   9965.78
O( n² )          1      1
O( n² )          10     100
O( n² )          100    10,000
O( n² )          1000   1,000,000

I did change O( n log₂ n ) to 1 when n is 1 (marked with the asterisk) because it effectively works out to that in practice, but mathematically it should be 0. Big O notation doesn’t really apply to small values, and we’re doing this just to see what is happening with our numeric trends.
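If you want to reproduce these trends yourself, here is a quick sketch of my own (using math.log( n ) / math.log( 2 ) for log₂ n, which works even in older Lua versions that lack a base argument):

function trends()
	-- Print the output of each classification for the input sizes above
	for _, n in ipairs( { 1, 10, 100, 1000 } ) do
		local linearithmic = n * ( math.log( n ) / math.log( 2 ) )
		print( n, 1, n, linearithmic, n * n )
	end
end

trends()

Note that the linearithmic column prints 0 for n = 1, matching the mathematical value rather than the practical one in the table.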

Constant time is only going to apply to a fixed number of operations, such as touching a single value. It doesn’t take the data size into account at all (and so it’s not actually doing anything with the bulk of our data unless it’s a best case scenario). Linear time scales exactly with our data. Linearithmic time grows faster than linear time but nowhere near quadratic time. Quadratic time scales the worst of all four (but there can be worse!).

Calculating the Size in Big O

Let’s go over roughly how we calculate the size of Big O notation. Also, note that for these examples, we aren’t going to include any real error handling. Pretend that all data has been checked previously and that it all makes sense unless otherwise noted.

Constant time just means there is a fixed amount of work per data set. A function like the following would have a Big O notation of O( 1 ) in Lua:

function algorithm( array )
	print( array[ 1 ] )
end

No matter how big array is, we only do a fixed number of operations which is independent of the size of the data set. We can also do:

function algorithm1( array )
	print( array[ 1 ] .. " + " .. array[ 2 ] .. " + " .. array[ 3 ] )
end

This is still going to be O( 1 ) because the cost is effectively fixed. This comes into play more later when we throw out constants.

Linear time means that our complexity scales linearly with our data. Every additional item is an additional operation. An example of a function which uses linear time of O( n ) would be:

function algorithm( array )
	-- Visit each item exactly once, in order
	for _, item in ipairs( array ) do
		print( item )
	end
end

This example just iterates over every item in our array and prints each item. As the array grows, so does the complexity, but each extra item is a single extra action.

Linearithmic time, O( n log₂ n ), is a bit harder to show compactly. See my article on quicksort for an algorithm which runs in O( n log₂ n ) for its average case.
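For a rough illustration here, a minimal merge sort sketch (my own example, not the quicksort from that article) also runs in O( n log₂ n ): the array is halved log₂ n times, and each level of halving does O( n ) work merging:

function mergeSort( array )
	if #array <= 1 then return array end

	-- Split the array into two halves (log₂ n levels of splitting)
	local mid = math.floor( #array / 2 )
	local left, right = {}, {}
	for i = 1, mid do left[ i ] = array[ i ] end
	for i = mid + 1, #array do right[ i - mid ] = array[ i ] end

	left = mergeSort( left )
	right = mergeSort( right )

	-- Merge the sorted halves back together in a single O( n ) pass
	local result, l, r = {}, 1, 1
	while l <= #left and r <= #right do
		if left[ l ] <= right[ r ] then
			result[ #result + 1 ] = left[ l ]
			l = l + 1
		else
			result[ #result + 1 ] = right[ r ]
			r = r + 1
		end
	end
	while l <= #left do result[ #result + 1 ] = left[ l ]; l = l + 1 end
	while r <= #right do result[ #result + 1 ] = right[ r ]; r = r + 1 end
	return result
end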

Quadratic time, O( n² ), is basically the worst of the common algorithm types (it’s a specific case of polynomial time). For every extra item, you have n extra steps, which adds up fast as you can see from the chart above. An example of this is:

function algorithm( array )
	-- For each item, iterate over the whole array again: n × n operations
	for i in ipairs( array ) do
		for j in ipairs( array ) do
			print( array[ i ] .. " " .. array[ j ] )
		end
	end
end

Another example of quadratic time is the bubble sort method. Bubble sort does have its use cases though, as it can complete in O( n ) time for the right kind of data, as sketched below.
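Here is a minimal bubble sort sketch (my own illustration) showing where that O( n ) best case comes from: if a full pass makes no swaps, the array is already sorted and we can stop after one pass.

function bubbleSort( array )
	for i = 1, #array - 1 do
		local swapped = false
		-- Bubble the largest remaining item to the end of the unsorted part
		for j = 1, #array - i do
			if array[ j ] > array[ j + 1 ] then
				array[ j ], array[ j + 1 ] = array[ j + 1 ], array[ j ]
				swapped = true
			end
		end
		-- No swaps means the array is sorted: best case is one pass, O( n )
		if not swapped then break end
	end
	return array
end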

The Best of Times

Big O notation is also typically calculated as the average runtime for an algorithm: how does this algorithm behave on an arbitrary data set? Some algorithms can be very efficient if the data is shaped a certain way, but you typically want to know what the average case or worst case is. Big O implies the worst case unless specified otherwise. An algorithm can take up to as long as the Big O notation says, but it will typically take less time. You also have algorithms like quicksort which have a few possibilities for their worst case, but will typically run better than expected because their worst case is rare and conditional.

The following algorithm is technically O( n ):

function algorithm( array )
	-- Return the index of the first 3 we find
	for i, item in ipairs( array ) do
		if item == 3 then return i end
	end
end

What if the first item is 3? Do we go all the way through, or do we just exit? The exit condition can make this O( 1 ) in the best case, but it is going to tend towards n / 2 iterations on average (which we simplify to O( n )) and exactly n iterations in the worst case. Calculate out your Big O notation based on the algorithm taking the maximum possible iterations to run through an arbitrary data set.

Throwing Out the Rest

Big O notation focuses on the trend rather than calculating everything out exactly. For arbitrarily large sets, anything small being added or any kind of constant is going to be negligible. Remove the constants and lower-order terms, leaving only the fastest-growing term. That’s why our function from before which did 3 individual fixed operations was still O( 1 ) rather than O( 3 ). Some people will (mistakenly) write this as O( k ), but this is not considered standard for Big O notation.

Let’s take this function as an example:

function algorithm( array )
	-- One O( n ) pass
	for h in ipairs( array ) do
		print( array[ h ] )
	end

	-- One O( n² ) pass
	for i in ipairs( array ) do
		for j in ipairs( array ) do
			print( array[ i ] .. " " .. array[ j ] )
		end
	end
end

You could say it works out to O( n² + n ), but Big O does not concern itself with the smaller term. With 1,000 elements, this would be 1,001,000 operations instead of 1,000,000 based on our earlier chart, which is a rounding error in the grand scheme of things. Small sub-optimizations could easily outdo this minor difference in real code, so Big O notation keeps the largest growth rate and throws out the rest. If we repeat our nested loop twice in the same function, we end up with O( 2n² ), but again, the constant is not significant and is thrown out.
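To make that last constant concrete, here is a hypothetical sketch with the nested loop repeated, which does 2n² prints but still reduces to O( n² ):

function algorithm2( array )
	-- First full n × n pass: n² operations
	for i in ipairs( array ) do
		for j in ipairs( array ) do
			print( array[ i ] .. " " .. array[ j ] )
		end
	end

	-- Second identical pass: 2n² operations total, still O( n² )
	for i in ipairs( array ) do
		for j in ipairs( array ) do
			print( array[ i ] .. " " .. array[ j ] )
		end
	end
end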

Applying Big O Notation

Big O notation is one of the best tools for seeing how scalable an algorithm is. Even if you throw hardware at a problem, how much can you throw at it before you can’t keep up? It really doesn’t matter for small things, but even small operations can add up to a lot of overhead. Use Big O notation to analyze algorithms to see whether they’re efficient or not.

For instance, using bubble sort with a lot of data is a fool’s errand unless it’s largely sorted (which tends towards its best case scenario). A naive quicksort with mostly sorted data can be as bad as bubble sort with random data. Know what an algorithm is good at and how that applies to its Big O representation, but don’t forget what you’re doing with the algorithm. Also, keep in mind that neither makes a big difference on a modern machine when run on 15 items.

Big O notation doesn’t just apply to runtime either. It can apply to memory, IO, etc. Quicksort for instance is going to be O( n ) for its memory footprint since the whole array is loaded into memory to sort. This can make it a deal-breaker if you’re trying to sort an 8GB block of data on a desktop unless you have a beefy machine.

Optimizing

There’s a reason Big O notation throws out the constants and smaller terms. Trying to prematurely optimize can end up with worse results than focusing on making the principal algorithm more efficient at scale. For arbitrarily large data, O( n² + 100n ) is going to be the same as O( n² ) (which is what Big O reduces it to). For smaller data sets though, this begins to matter.

Even though it’s not actually proper for Big O notation, using a constant of k or similar can be desirable for benchmarking algorithms after the initial optimization phase. An algorithm of O( kn ) is going to take longer than O( n ). This is not going to matter for the math side or for the trend, but a user is going to notice if O( n² ) takes 5 seconds and O( 2n² ) takes 10. If you can get it down to some algorithm which uses O( n log₂ n ), even better, but sometimes you just can’t. That’s when you start splitting hairs.
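As an illustration of that kind of hair splitting, here is a minimal timing sketch of my own (os.clock() is standard Lua and measures CPU time) comparing two hypothetical functions with the same O( n ) growth but different constant factors:

-- Two stand-ins with the same O( n ) growth but different constants:
-- the second makes the same pass over the data twice
function singlePass( array )
	local sum = 0
	for _, v in ipairs( array ) do sum = sum + v end
	return sum
end

function doublePass( array )
	local sum = 0
	for _, v in ipairs( array ) do sum = sum + v end
	for _, v in ipairs( array ) do sum = sum + v end
	return sum
end

-- Time a single run of fn over array
function benchmark( fn, array )
	local start = os.clock()
	fn( array )
	return os.clock() - start
end

local data = {}
for i = 1, 1000000 do data[ i ] = math.random( 1000 ) end

print( "O( n ):  " .. benchmark( singlePass, data ) .. "s" )
print( "O( 2n ): " .. benchmark( doublePass, data ) .. "s" )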

Optimize at scale first, and then go for the smaller bits and pieces. Make the pieces tied to the data more efficient and you make the algorithm more efficient at scale. If you know a data set is never going to be large, or the constants get to be large enough, it can be worth going deeper; if the set is small enough, focus on what works easily. If you’ve already implemented quicksort, why reinvent the wheel to add in a bubble sort for data which is always going to be nearly sorted if there’s a whopping 20 entries max and the sort doesn’t impact the user experience?

Conclusion

Big O notation is extremely powerful, but it’s important not to jump the gun and try to prematurely optimize. Use Big O notation to track down scale-level efficiencies and refine from there. Big O notation is not just limited to run time; it can be used to analyze memory, IO, etc. depending on what you’re doing. Sometimes it’s worth taking the slower algorithm to address memory concerns in an embedded environment.

There are a lot more complexity classes which we didn’t touch on, but the ones listed are the most common for standard algorithms. There are some other types like factorial time (O( n! )) and exponential time (O( kⁿ )) which are uncommon (but not unheard of).

Look at the best and worst case scenarios, the average scenario, and what kind of data you’re dealing with. An average scenario which doesn’t exist for your use case is usually useless for you. A best case which is exceedingly rare isn’t worth anything either. Aim for the use case which fits your data and which works with the purpose of what you’re doing. Applying Big O notation analysis to your coding will make you more efficient. Throw out that slow algorithm for one which is mathematically more efficient and understand why you’re doing it.
