A FSharp implementation of Huffman compression algorithm
Implementating Huffman algorithm is a good practice to progress in algorithms and is a good opportunity to train on working methods such as TDD.
This article is based on the gist I wrote to train me.
Huffman compression method can be decomposed in 3 steps:
computing a dictionary with frequency of occurrences
building a tree representing frequencies and associated path
storage of tree and path of each occurrence bit per bit
In next examples, we will work on the string “aaabbffppppeee kk aa”
Computing frequencies
At first, we can imagine a dictionary of char * int containing each char frequency.
It don’t like to work with chars when I implement binary data traitment algorithms.
So, I created an empty function getting (byte * int) list
Then I wrote a test I executed each time I could compile:
A simple function to work with bytes list could be:
I did want to try compression on real files, so I couldn’t load entire content in a byte list.
Next function is computing frequencies from a seekable stream.
Building the tree
My tree model is simply:
The cost is is representing total occurencies count contained by child branches or leaf.
The swith method helps to browse sub branches of a tree node.
When the model is written, we can write tests:
After tests writting, we can implement a function populating the tree from frequencies:
Calculated paths are:
occurrency
path option
binary
frequency
a
Some({[True;False]})
10
5
p
Some({[False; True]})
01
4
e
Some({[True; True; False]})
110
3
b
Some({[False; False; False]})
000
2
f
Some({[False; False; True]})
001
2
k
Some({[True; True; True; True]})
1111
2
’ ‘
Some({[True; True; True; False]})
1110
2
This table is demonstrating that implementation is correct.
Occurrences whose frequencies are higher have the shortest paths.
A diagram summarizing this table could be like the following:
The occurrency ‘a’ will be coded in 2 bits.
We can not write less than 8 bits in a stream. (with the WriteByte method)
So I wrote a tiny BitWritter:
The buffer is a simple byte.
The write method increases the len and shift bits of buffer.
When len is equal to 8 bits, we write the byte in the stream.