# Huffman Encoding

If there were ever a data compression method to take the world by storm, it would be Huffman encoding. In fact, this was the method that got me into computational methods to begin with. I distinctly remember sitting in my data compression class and talking about the great information theorist Claude Shannon and Robert Fano, when suddenly my professor introduced a new kid to the mix: David Huffman. He managed to rip the heart out of the methods described by leaders of the field and create a data compression method that was easier to understand and implement, while also providing more robust results, and apparently this was all done for a school project!

It was in that moment, I knew I would never amount to anything. I have since accepted that fact and moved on.

Huffman encoding follows from the problem described in the Data Compression section. We have a string that we want to encode into bits. Huffman encoding ensures that our encoded bitstring is as small as possible without losing any information. Because it is both lossless and guarantees the smallest possible bit length, it outright replaces both Shannon and Shannon-Fano encoding in most cases, which is a little weird because the method was devised while Huffman was taking a course from Fano, himself!

The idea is somewhat straightforward in principle, but a little difficult to code in practice. By creating a binary tree of the input alphabet, every branch can be provided a unique bit representation simply by assigning a binary value to each child and reading to a character in a leaf node if starting from the root node.

So now the question is: how do we create a binary tree? Well, here we build it from the bottom up like so:

1. Order all characters according to the frequency they appear in the input bitstring, with the most frequent character at the top of the list. Be sure to keep track of the frequencies, too!
2. Add the smallest two values together to create a new node with a new frequency.
3. Keep doing step 2 until the tree is complete.
4. Read the tree backwards from the root node and concatenate the final bitstring codeword. Keep all codewords and put them into your final set of codewords (sometimes called a codebook)
5. Encode your phrase with the codebook.

And that's it. Here's an image of what this might look like for the phrase bibbity_bobbity: This will create a codebook that looks like this:

Character Bit Representation
b 0
i 100
t 101
y 110
o 1110
_ 1111

and bibbity_bobbity becomes 01000010010111011110111000100101110. As mentioned this uses the minimum number of bits possible for encoding. The fact that this algorithm is both conceptually simple and provably useful is rather extraordinary to me and is why Huffman encoding will always hold a special place in my heart.

## Video Explanation

Here is a quick video explanation for Huffman encoding:

## Example Code

In code, this can be a little tricky. It requires a method to continually sort the nodes as you add more and more nodes to the system. The most straightforward way to do this in some languages is with a priority queue, but depending on the language, this might be more or less appropriate. In addition, to read the tree backwards, some sort of Depth First Search needs to be implemented. Whether you use a stack or straight-up recursion also depends on the language, but the recursive method is a little easier to understand in most cases.

using Test

# This is for the PriorityQueue
using DataStructures

struct Leaf
weight::Int64
key::Char
end

struct Branch
right::Union{Leaf, Branch}
left::Union{Leaf, Branch}
weight::Int64
end

const Node = Union{Leaf, Branch}

function codebook_recurse!(leaf::Leaf, code::String,
dict::Dict{Char,String})
dict[leaf.key] = code
end

function codebook_recurse!(branch::Branch, code::String,
dict::Dict{Char,String})
codebook_recurse!(branch.left, string(code, "1"), dict)
codebook_recurse!(branch.right, string(code, "0"), dict)
end

# This will depth-first search through the tree
# to create bitstrings for each character.
# Note: Any depth-first search method will work
# This outputs encoding Dict to be used for encoding
function create_codebook(n::Node)
codebook = Dict{Char,String}()
if isa(n, Leaf)
codebook[n.key]="0"
else
codebook_recurse!(n, "", codebook)
end
return codebook
end

# This outputs huffman tree to generate dictionary for encoding
function create_tree(phrase::String)

# creating weights
weights = PriorityQueue()
for i in phrase
temp_string = string(i)
weights[temp_string] += 1
else
weights[temp_string] = 1
end
end

# Creating all nodes to iterate through
nodes = PriorityQueue{Node, Int64}()
while(length(weights) > 0)
weight = peek(weights)
key = dequeue!(weights)
temp_node = Leaf(weight, key)
enqueue!(nodes, temp_node, weight)
end

while(length(nodes) > 1)
node1 = dequeue!(nodes)
node2 = dequeue!(nodes)
temp_node = Branch(node1, node2, node1.weight + node2.weight)
enqueue!(nodes, temp_node, temp_node.weight)
end

huffman_tree = dequeue!(nodes)
return huffman_tree

end

function encode(codebook::Dict{Char, String}, phrase::String)
final_bitstring = ""
for i in phrase
final_bitstring = final_bitstring * codebook[i]
end

return final_bitstring
end

function decode(huffman_tree::Node, bitstring::String)
current = huffman_tree
final_string = ""
for i in bitstring
if isa(huffman_tree, Branch)
if (i == '1')
current = current.left
else
current = current.right
end

if (!isa(current, Branch))
final_string *= string(current.key)
current = huffman_tree
end
else
final_string *= string(huffman_tree.key)
end
end

return final_string
end

function two_pass_huffman(phrase::String)
huffman_tree = create_tree(phrase)
codebook = create_codebook(huffman_tree)
bitstring = encode(codebook, phrase)
final_string = decode(huffman_tree, bitstring)
return final_string
end

@testset "b-string tests" begin
@test two_pass_huffman("b") == "b"
@test two_pass_huffman("bbbbbbbb") == "bbbbbbbb"
@test two_pass_huffman("bibbity bobbity") == "bibbity bobbity"
end

extern crate itertools;

use std::cmp::{Ord, Ordering, PartialOrd};
use std::collections::{BinaryHeap, HashMap};

use itertools::Itertools;

#[derive(Debug)]
enum HuffmanTree {
Branch {
count: i32,
left: Box<HuffmanTree>,
right: Box<HuffmanTree>,
},
Leaf {
count: i32,
value: char,
},
}

impl PartialEq for HuffmanTree {
fn eq(&self, other: &Self) -> bool {
self.count() == other.count()
}
}

impl Eq for HuffmanTree {}

impl PartialOrd for HuffmanTree {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
other.count().partial_cmp(&self.count())
}
}

impl Ord for HuffmanTree {
fn cmp(&self, other: &Self) -> Ordering {
other.count().cmp(&self.count())
}
}

#[derive(Debug)]
struct Codebook {
codebook: HashMap<char, String>,
tree: HuffmanTree,
}

impl HuffmanTree {
pub fn from(input: &str) -> Self {
let counts = input.chars().fold(HashMap::new(), |mut map, c| {
*map.entry(c).or_insert(0) += 1;
map
});
let mut queue = counts
.iter()
.map(|(&value, &count)| HuffmanTree::Leaf { value, count })
.collect::<BinaryHeap<HuffmanTree>>();

while queue.len() > 1 {
let left = queue.pop().unwrap();
let right = queue.pop().unwrap();
queue.push(HuffmanTree::Branch {
count: left.count() + right.count(),
left: Box::new(left),
right: Box::new(right),
})
}

queue.pop().expect("The Huffman tree has to have a root")
}

pub fn count(&self) -> i32 {
match *self {
HuffmanTree::Branch { count, .. } => count,
HuffmanTree::Leaf { count, .. } => count,
}
}

pub fn make_codebook(self) -> Codebook {
let mut codebook = HashMap::new();
self.dfs(String::from(""), &mut codebook);
Codebook {
codebook,
tree: self,
}
}

pub fn decode(&self, input: &str) -> String {
let mut result = String::from("");
let mut start = 0;
while !input[start..].is_empty() {
start += self.decode_dfs(&input[start..], &mut result);
}
result
}

fn decode_dfs(&self, input: &str, result: &mut String) -> usize {
let current = input.chars().next();
match *self {
HuffmanTree::Branch { ref left, .. } if current == Some('0') => {
1 + left.decode_dfs(&input[1..], result)
}
HuffmanTree::Branch { ref right, .. } if current == Some('1') => {
1 + right.decode_dfs(&input[1..], result)
}
HuffmanTree::Leaf { value, .. } => {
result.push(value);
0
}
_ => panic!("Unexpected end of input"),
}
}

fn dfs(&self, code: String, codebook: &mut HashMap<char, String>) {
match *self {
HuffmanTree::Branch {
ref left,
ref right,
..
} => {
left.dfs(code.clone() + "0", codebook);
right.dfs(code.clone() + "1", codebook);
}
HuffmanTree::Leaf { value, .. } => {
codebook.insert(value, code);
}
}
}
}

impl Codebook {
fn encode(&self, input: &str) -> String {
input.chars().map(|c| &self.codebook[&c]).join("")
}

fn decode(&self, input: &str) -> String {
self.tree.decode(input)
}
}

fn main() {
let input = "bibbity bobbity";

let tree = HuffmanTree::from(input);
let codebook = tree.make_codebook();
let encoded = codebook.encode(input);
let decoded = codebook.decode(&encoded);

// Uncomment this line if you want to see the codebook/tree
// println!("{:#?}", codebook);
println!("{}", encoded);
println!("{}", decoded);
}

// Made by Guston and edited by Gathros
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

struct tree {
struct tree* left;
struct tree* right;

int count;
char value;
};

struct bitstring_builder {
char str;
int next_index;
};

struct codebook {
char* codes;
};

struct heap {
struct tree** data;
size_t length;
size_t capacity;
};

bool is_leaf(const struct tree* t) {
return !t->left && !t->right;
}

void swap(struct tree** lhs, struct tree** rhs) {
struct tree* tmp = *lhs;
*lhs = *rhs;
*rhs = tmp;
}

/* The two concat functions are horribly inefficient */
void concat(char** dst, const char* src) {
size_t dst_len = strlen(*dst);
size_t src_len = strlen(src);
*dst = realloc(*dst, src_len + dst_len + 1);
strcat(*dst, src);
}

void concat_char(char** dst, char c) {
size_t len = strlen(*dst);
*dst = realloc(*dst, len + 2);
(*dst)[len] = c;
(*dst)[len + 1] = '\0';
}

char* duplicate(const char* src) {
size_t length = strlen(src);
char* dst = malloc(length + 1);
memcpy(dst, src, length + 1);
return dst;
}

void heap_push(struct heap* heap, struct tree* value) {
if (heap->capacity == heap->length) {
heap->capacity = heap->capacity == 0 ? 4 : heap->capacity * 2;
heap->data = realloc(heap->data, heap->capacity * sizeof(struct tree*));
}
heap->data[heap->length++] = value;

size_t index = heap->length - 1;
while (index) {
size_t parent_index = (index - 1) / 2;
if (heap->data[parent_index]->count <= heap->data[index]->count) {
break;
}

swap(&heap->data[parent_index], &heap->data[index]);
index = parent_index;
}
}

struct tree* heap_pop(struct heap* heap) {
if (!heap->length) {
return NULL;
}

struct tree* result = heap->data;
swap(&heap->data, &heap->data[--heap->length]);

size_t index = 0;
for (;;) {
size_t target = index;
size_t left = 2 * index + 1;
size_t right = left + 1;

if (left < heap->length &&
heap->data[left]->count < heap->data[target]->count) {
target = left;
}

if (right < heap->length &&
heap->data[right]->count < heap->data[target]->count) {
target = right;
}

if (target == index) {
break;
}

swap(&heap->data[index], &heap->data[target]);
index = target;
}

return result;
}

void heap_free(struct heap* heap) {
free(heap->data);
}

struct tree* generate_tree(const char* str) {
int counts = { 0 };

for (; *str != '\0'; ++str) {
counts[(unsigned char)*str] += 1;
}

struct heap heap = { 0 };
for (size_t i = 0; i < sizeof(counts) / sizeof(int); ++i) {
if (counts[i]) {
struct tree* tree = calloc(1, sizeof(struct tree));
tree->value = (char)i;
tree->count = counts[i];
heap_push(&heap, tree);
}
}

if (heap.length == 1) {
struct tree* leaf = heap_pop(&heap);
struct tree* root = calloc(0, sizeof(struct tree));
root->left = leaf;
root->count = leaf->count;
heap_free(&heap);
return root;
}

while (heap.length > 1) {
struct tree* left = heap_pop(&heap);
struct tree* right = heap_pop(&heap);
struct tree* branch = calloc(1, sizeof(struct tree));
branch->count = left->count + right->count;
branch->left = left;
branch->right = right;
heap_push(&heap, branch);
}

struct tree* root = heap_pop(&heap);
heap_free(&heap);
return root;
}

void tree_free(struct tree* tree) {
if (!tree) return;
tree_free(tree->left);
tree_free(tree->right);
free(tree);
}

void codebook_recurse(const struct tree* tree,
struct bitstring_builder* builder,
struct codebook* codebook) {
if (!tree) {
return;
}

if (is_leaf(tree)) {
builder->str[builder->next_index] = '\0';
codebook->codes[(unsigned char)tree->value] = duplicate(builder->str);
return;
}

builder->str[builder->next_index++] = '0';
codebook_recurse(tree->left, builder, codebook);
builder->next_index -= 1;

builder->str[builder->next_index++] = '1';
codebook_recurse(tree->right, builder, codebook);
builder->next_index -= 1;
}

struct codebook generate_codebook(const struct tree* tree) {
struct codebook codebook = { .codes = { 0 } };
struct bitstring_builder builder = { .str = { 0 }, .next_index = 0 };
codebook_recurse(tree, &builder, &codebook);
return codebook;
}

void codebook_free(struct codebook* codebook) {
int size = sizeof(codebook->codes) / sizeof(codebook->codes);
for (int i = 0; i < size; ++i) {
free(codebook->codes[i]);
}
}

const char* get_code(const struct codebook* codebook, char c) {
return codebook->codes[(unsigned char)c];
}

char* encode(const char* input, struct tree** huffman_tree,
struct codebook* codebook) {
*huffman_tree = generate_tree(input);
*codebook = generate_codebook(*huffman_tree);

char* result = duplicate(get_code(codebook, *input));

input += 1;

for (; *input; ++input) {
concat(&result, get_code(codebook, *input));
}

return result;
}

const char* decode_recurse(const char* input, const struct tree* tree,
char** result) {
if (!tree) {
return input;
}

if (is_leaf(tree)) {
concat_char(result, tree->value);
return input;
}

if (*input == '0') {
return decode_recurse(input + 1, tree->left, result);
} else {
return decode_recurse(input + 1, tree->right, result);
}
}

char* decode(const char* input, const struct tree* tree) {
char* result = calloc(1, 1);
do {
input = decode_recurse(input, tree, &result);
} while (*input);
return result;
}

int main() {
struct tree* tree;
struct codebook codebook;

char* encoded = encode("bibbity bobbity", &tree, &codebook);
char* decoded = decode(encoded, tree);

printf("Codebook:\n");
for (int i = 0; i < 256; ++i) {
if (codebook.codes[i]) {
printf("%c %s\n", (char)i, codebook.codes[i]);
}
}

printf("%s\n", encoded);
printf("%s\n", decoded);

tree_free(tree);
codebook_free(&codebook);
free(encoded);
free(decoded);
return 0;
}

import qualified Data.Map as M
import           Data.List (insert, sort)

data Tree a = Leaf Int a
| Node Int (Tree a) (Tree a)
deriving (Show, Eq)

freq :: Tree a -> Int
freq (Leaf i _)   = i
freq (Node i _ _) = i

instance (Eq a) => Ord (Tree a) where
compare t1 t2 = compare (freq t1) (freq t2)

getFrequencies :: Ord a => [a] -> [(Int, a)]
getFrequencies = toSortedList . M.fromListWith (+) . flip zip (repeat 1)
where toSortedList = sort . map swap . M.toList
swap (a, i) = (i, a)

buildTree :: (Ord a) => [a] -> Maybe (Tree a)
buildTree = build . map (uncurry Leaf) . getFrequencies
where build []         = Nothing
build [t]        = Just t
build (t1:t2:ts) = build $insert (Node (freq t1 + freq t2) t1 t2) ts data Bit = Zero | One instance Show Bit where show Zero = "0" show One = "1" encode :: (Ord a) => [a] -> (Maybe (Tree a), [Bit]) encode s = (tree, msg) where tree = buildTree s msg = concatMap (table M.!) s table = case tree of Nothing -> M.empty Just t -> M.fromList$ mkTable (t, [])
mkTable (Leaf _ a, p)     = [(a, reverse p)]
mkTable (Node _ t1 t2, p) = concatMap mkTable [(t1,  Zero:p), (t2, One:p)]

decode :: (Ord a) => Maybe (Tree a) -> [Bit] -> [a]
decode Nothing _ = []
decode (Just t) m = path t m
where path (Leaf _ a) m            = a : path t m
path (Node _ t1 _) (Zero: m) = path t1 m
path (Node _ _ t2) (One: m)  = path t2 m
path _ _                     = []

main = do
let msg = "bibbity bobbity"
(tree, encoded) = encode msg
decoded = decode tree encoded
putStrLn $"Endoding \"" ++ msg ++ "\": " ++ concatMap show encoded putStrLn$ "Length: " ++ (show $length encoded) putStrLn$ "Decoding: " ++ decoded

##### HuffmanCoding.cs
// submitted by Julian Schacher (jspp), thanks to gustorn for the help
using System;
using System.Collections.Generic;
using System.Linq;

namespace HuffmanCoding
{
public class EncodingResult
{
public string BitString { get; set; }
public Dictionary<char, string> Dictionary { get; set; }
public HuffmanCoding.Node Tree { get; set; }

public EncodingResult(string bitString, Dictionary<char, string> dictionary, HuffmanCoding.Node tree)
{
this.BitString = bitString;
this.Dictionary = dictionary;
this.Tree = tree;
}
}

public class HuffmanCoding
{
// The Node class used for the Huffman Tree.
public class Node : IComparable<Node>
{
public Node LeftChild { get; set; }
public Node RightChild { get; set; }
public string BitString { get; set; } = "";
public int Weight { get; set; }
public string Key { get; set; }

public bool IsLeaf => LeftChild == null && RightChild == null;

// Creates a leaf. So just a node is created with the given values.
public static Node CreateLeaf(char key, int weight) => new Node(key.ToString(), weight, null, null);
// Creates a branch. Here a node is created by adding the keys and weights of both childs together.
public static Node CreateBranch(Node leftChild, Node rightChild) => new Node(leftChild.Key + rightChild.Key, leftChild.Weight + rightChild.Weight, leftChild, rightChild);
private Node(string key, int weight, Node leftChild, Node rightChild)
{
this.Key = key;
this.Weight =  weight;
this.LeftChild = leftChild;
this.RightChild = rightChild;
}

public int CompareTo(Node other) => this.Weight - other.Weight;
}

// Node with biggest value at the top.
class NodePriorityList
{
public int Count => nodes.Count;

private List<Node> nodes = new List<Node>();

public NodePriorityList() { }
public NodePriorityList(List<Node> givenNodes)
{
this.nodes = givenNodes.ToList();
this.nodes.Sort();
}

{
var index = this.nodes.BinarySearch(newNode);
if (index < 0)
this.nodes.Insert(~index, newNode);
else
this.nodes.Insert(index, newNode);
}

public Node Pop()
{
var result = this.nodes;
this.nodes.RemoveAt(0);
return result;
}
}

public EncodingResult Encode(string input)
{
var root = CreateTree(input);
var dictionary = CreateDictionary(root);
var bitString = CreateBitString(input, dictionary);

return new EncodingResult(bitString, dictionary, root);
}

public string Decode(EncodingResult result)
{
var output = "";
Node currentNode = result.Tree;
foreach (var bit in result.BitString)
{
// Go down the tree.
if (bit == '0')
currentNode = currentNode.LeftChild;
else
currentNode = currentNode.RightChild;

if (currentNode.IsLeaf)
{
output += currentNode.Key;
currentNode = result.Tree;
}
}
return output;
}

private Node CreateTree(string input)
{
// Create a List of all characters and their count in input by putting them into nodes.
var nodes = input
.GroupBy(c => c)
.Select(n => Node.CreateLeaf(n.Key, n.Count()))
.ToList();

// Convert list of nodes to a NodePriorityList.
var nodePriorityList = new NodePriorityList(nodes);

// Create Tree.
while (nodePriorityList.Count > 1)
{
// Pop the two nodes with the smallest weights from the nodePriorityList and create a parentNode with the CreateBranch method. (This method adds the keys and weights of the childs together.)
var leftChild = nodePriorityList.Pop();
var rightChild = nodePriorityList.Pop();
var parentNode = Node.CreateBranch(leftChild, rightChild);

}

return nodePriorityList.Pop();
}

private Dictionary<char, string> CreateDictionary(Node root)
{
// We're using a string instead of a actual bits here, since it makes the code somewhat more readable and this is an educational example.
var dictionary = new Dictionary<char, string>();
CreateDictionary(root, "", dictionary);
return dictionary;

void CreateDictionary(Node node, string bitString, Dictionary<char, string> localDictionary)
{
if (node.IsLeaf)
else
{
if (node.LeftChild != null)
CreateDictionary(node.LeftChild, bitString + '0', localDictionary);
if (node.RightChild != null)
CreateDictionary(node.RightChild, bitString + '1', localDictionary);
}
}
}

private string CreateBitString(string input, Dictionary<char, string> dictionary)
{
// We're using a string right here. While no compression is achieved with a string, it's the easiest way to display what the compressed result looks like. Also this is just an educational example.
var bitString = "";
foreach (var character in input)
bitString += dictionary[character];

return bitString;
}
}
}

##### Program.cs
// submitted by Julian Schacher (jspp), thanks to gustorn for the help
using System.Collections;
using System.Collections.Generic;

namespace HuffmanCoding
{
class Program
{
static void Main(string[] args)
{
var huffmanCoding = new HuffmanCoding();

var result = huffmanCoding.Encode("bibbity bobbity");
// The bitStrings are just strings and provide no compression. Look in HuffmanCoding.cs for explanation.
// Print dictionary.
foreach (var entry in result.Dictionary)
System.Console.WriteLine($"{entry.Key} {entry.Value}"); // Print BitString. System.Console.WriteLine($"{result.BitString} count: {result.BitString.Length}");

var originalString = huffmanCoding.Decode(result);
System.Console.WriteLine(originalString);
}
}
}

local function frequency_array(str)
-- Collect all frequency values into a dict
local map = {}
for c in str:gmatch(".") do -- Iterate over each character in str
map[c] = (map[c] or 0) + 1 -- Increment map[c] (default 0) by 1
end

-- We have a dict of frequencies but we want it in a sorted list
-- Dump each key value pair into an array
local arr = {}
for k, v in pairs(map) do
arr[#arr + 1] = {k, v}
end
table.sort(arr, function(a, b) return a > b end) -- Sort by frequency descending
return arr
end

local function build_huffman_tree(message)

if #message == 0 then return end

local freq = frequency_array(message)

while #freq > 1 do -- Repeat until we have only 1 node

-- Take two of the least frequent nodes
local node1, node2 = table.remove(freq), table.remove(freq)

-- Group node values in first index, and sum of node frequencies in second
local node3 = { {node1, node2 }, node1 + node2 }

local i = 1
while i <= #freq and freq[i] <= node3 do -- Sorted insertion, faster than inserting then sorting again
i = i + 1
end

table.insert(freq, i, node3)
end

return freq -- Return value of only element in freq array
end

local function _create_codebook(node, codebook, code)
if not node then
return
elseif type(node) == "string" then
codebook[node] = code -- if node is a leaf then add it to codebook
else
_create_codebook(node, codebook, code .. "0") -- Left side
_create_codebook(node, codebook, code .. "1") -- Right side
end
end

local function create_codebook(tree)
local codebook = {}
_create_codebook(tree, codebook, "")
return codebook
end

local function huffman_encode(codebook, message)
local encoded_chars = {}
for c in message:gmatch(".") do -- Iterate over each character in message
encoded_chars[#encoded_chars + 1] = codebook[c]
end
return table.concat(encoded_chars) -- table.concat to avoid slow string bufferin
end

local function _huffman_decode(node, bitstring, i)
if type(node) == "string" then
return node, i -- If it's a leaf node then return the value along with the next bit to read
end
if bitstring:sub(i, i) == "0" then
return _huffman_decode(node, bitstring, i + 1) -- If it's 0 traverse down the left side
elseif bitstring:sub(i, i) == "1" then
return _huffman_decode(node, bitstring, i + 1) -- If it's 1 traverse down the right side
end
end

local function huffman_decode(tree, bitstring)
-- i is the current position in the bitstring, we can track which bit we are to look at next without using string.sub
local decoded_chars, i = {}, 1
while i <= #bitstring do
decoded_chars[#decoded_chars + 1], i = _huffman_decode(tree, bitstring, i)
end

return table.concat(decoded_chars)
end

local message = "bibbity_bobbity"

local tree = build_huffman_tree(message)
local codebook = create_codebook(tree)

local bitstring = huffman_encode(codebook, message)
print("Encoded: " .. bitstring)

print("Decoded: " .. huffman_decode(tree, bitstring))

#include <algorithm>
#include <array>
#include <bitset>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <limits>
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <iostream>

using std::begin;
using std::end;

namespace huffman {

[[noreturn]] inline void unreachable() {
std::cerr << "this should never happen\n";
std::abort();
}

// --- interface ---
class codebook {
struct node {
int frequency;

node(int freq) : frequency(freq) {}
virtual ~node() = 0;
};

// never null
using node_ptr = std::unique_ptr<node const>;
using bitstring = std::vector<bool>;

// this is a flatmap between char and a bitstring
// there should only ever be one character with a given
// value at any time.
using encoder_t = std::vector<std::pair<char, bitstring>>;

struct leaf final : node {
char key;

leaf(int freq, char key) : node(freq), key(key) {}
};

struct branch final : node {
node_ptr lhs;
node_ptr rhs;

branch(node_ptr lhs, node_ptr rhs)
: node(lhs->frequency + rhs->frequency), lhs(std::move(lhs)),
rhs(std::move(rhs)) {}
};

// this allows us to share [codebook]s among encoded strings
struct data {
node_ptr decoder;
encoder_t encoder;
};
std::shared_ptr<data const> underlying_;

public:
template <typename Iter>
codebook(Iter const first, Iter const last);

template <typename Iter>
std::vector<bool> encode(Iter first, Iter last) const;

template <typename Iter>
std::string decode(Iter first, Iter last) const;
};

struct encoded_string {
codebook codes;
std::vector<bool> string;

explicit encoded_string(std::string const& s)
: codes(begin(s), end(s)), string(codes.encode(begin(s), end(s))) {}

encoded_string(codebook codes, std::string const& s)
: codes(codes), string(codes.encode(begin(s), end(s))) {}

std::string decoded() const {
return codes.decode(begin(string), end(string));
}
};

// --- implementation ---
inline codebook::node::~node() {}

inline std::vector<bool> with_new_bit(std::vector<bool> bits, bool b) {
bits.push_back(b);
return bits;
}

template <typename Iter>
codebook::codebook(Iter const first, Iter const last) {
struct helper {
static node_ptr make_decoder(Iter const first, Iter const last) {
// in this part of the function, we build up a frequency list
// sorted by frequency, descending
auto freq = std::vector<leaf>();

std::for_each(first, last, [&freq](char c) {
auto const it = std::find_if(
begin(freq), end(freq), [c](auto const& p) { return p.key == c; });
if (it != end(freq)) {
// it's already in the list
it->frequency += 1;
} else {
// it's not already in the list
freq.emplace_back(1, c);
}
});

if (freq.empty()) {
throw std::invalid_argument("attempted to codebook an empty range");
}

std::sort(begin(freq), end(freq), [](auto const& lhs, auto const& rhs) {
return lhs.frequency > rhs.frequency;
});

auto ret = std::vector<std::unique_ptr<node>>();
std::transform(
begin(freq), end(freq), std::back_inserter(ret), [](auto const l) {
return std::make_unique<leaf>(l);
});

while (ret.size() > 1) {
auto rhs = std::move(ret.back());
ret.pop_back();
auto lhs = std::move(ret.back());
ret.pop_back();

auto new_node =
std::make_unique<branch>(std::move(lhs), std::move(rhs));
auto const new_freq = new_node->frequency;

// look for the first element with a smaller frequency
auto const it =
std::find_if(begin(ret), end(ret), [new_freq](auto const& n) {
return n->frequency < new_freq;
});
// and insert the new_node before that element
ret.insert(it, std::move(new_node));
// in this way, we keep the list sorted by frequency
}

return std::move(ret.back());
}

static void
encoder_rec(node const* cur, std::vector<bool> bits, encoder_t& out) {
if (auto l = dynamic_cast<leaf const*>(cur)) {
out.push_back(std::make_pair(l->key, std::move(bits)));
} else if (auto b = dynamic_cast<branch const*>(cur)) {
encoder_rec(b->lhs.get(), with_new_bit(bits, 0), out);
encoder_rec(b->rhs.get(), with_new_bit(std::move(bits), 1), out);
} else {
unreachable();
}
}

static encoder_t make_encoder(node const& decoder) {
auto ret = encoder_t();

encoder_rec(&decoder, std::vector<bool>(), ret);

return ret;
}
};

auto decoder = helper::make_decoder(first, last);
auto encoder = helper::make_encoder(*decoder);
underlying_ = std::make_shared<data const>(
data{std::move(decoder), std::move(encoder)});
}

template <typename Iter>
std::vector<bool> codebook::encode(Iter const first, Iter const last) const {
std::vector<bool> ret;

auto& encoder = underlying_->encoder;
std::for_each(first, last, [&ret, &encoder](char c) {
auto const it =
std::find_if(begin(encoder), end(encoder), [c](auto const& p) {
return p.first == c;
});
if (it != end(encoder)) {
auto const& code = it->second;
std::copy(begin(code), end(code), std::back_inserter(ret));
} else {
throw std::invalid_argument(
"The range has a character which was not in the huffman set");
}
});

return ret;
}

template <typename Iter>
std::string codebook::decode(Iter const first, Iter const last) const {
std::string ret;

node const* const top = underlying_->decoder.get();

// returns a pair:
// the second member is the decoded character
// the first member is the place we've gotten to in the range
// i.e., if  is an 'a', and we have
// [it, last) = { 0, 1, 1, 0 }
// we return (it', 'a') such that
// [it', last) = { 1, 1, 0 }
auto decode_single =
[top](Iter it, Iter const last) -> std::pair<Iter, char> {
node const* current_node = top;

for (; it != last; ++it) {
if (auto l = dynamic_cast<leaf const*>(current_node)) {
return std::make_pair(it, l->key);
} else if (auto b = dynamic_cast<branch const*>(current_node)) {
if (*it) {
current_node = b->rhs.get();
} else {
current_node = b->lhs.get();
}
} else {
unreachable();
}
}

if (auto l = dynamic_cast<leaf const*>(current_node)) {
return std::make_pair(last, l->key);
} else {
throw std::invalid_argument(
"The range was not encoded with this huffman set");
}
};

for (auto it = first; it != last;) {
auto p = decode_single(it, last);
it = p.first;
ret.push_back(p.second);
}

return ret;
}

} // namespace huffman

int main() {
std::string to_be_encoded = R"(bibbity bobbity)";

auto encoded = huffman::encoded_string(to_be_encoded);

std::cout << "Encoded, the string looks like: ";
for (bool b : encoded.string) {
std::cout << b;
}
std::cout << "\nand decoded, the string looks like: " << encoded.decoded();
std::cout << "\n\nAs opposed to the original, which is "
<< to_be_encoded.size() * 8 << " bits, the encoded has size "
<< encoded.string.size() << ".\n";
}

;; earthfail
(ns experiments.core)

;; get a vector with chars and frequencies

(defn tree-string [st]
"take a string st and return the huffmantree with the frequency of
each character included"
;; vector of [character frequency] pair
;; for every char in string, added it to hash-map
;; with value one if it doesn't exist or increment its value
(def cf-vec (vec
(reduce (fn [m c]
(assoc m c (inc (get m c 0))))
{}
st)))
;; make a sorted list with nodes with bigger frequencies first
;; take the last two which will help in dividing the tree
;; the first and last elements before and after
;; the smallest two in the tree shouldn't change
(loop [tree (sort-by last > cf-vec)]
(if (< (count tree) 2)
(first tree) ; only of tree is one node or nil
(let [sorted-tree (sort-by last > tree)
mid (take-last 2 sorted-tree)
set-mid (set mid)
func (complement (partial contains? set-mid))
firsty (take-while func tree)
[middle lasty] (split-at 2
(drop-while func tree))]
(recur
(concat
firsty
;; make a list with the two element in one list and
;; the sum of their frequencies e.g
;; '(((node1 f1) (node2 f2)) f1+f2)
(list (list middle (reduce #(+ %1 (last %2)) 0 middle)))
lasty))))))

(defn remove-freq [tree]
"remove the frequencies in the huffmantree tree"
(cond
(char? tree) tree                 ; check if this is a branch
;; if the tree is a node and frequency then ignore frequency
(integer? (second tree)) (remove-freq (first tree)) ;remove the frequency
;; if the tree consists of two nodes then apply to both and combine
:else (list (remove-freq (first tree))
(remove-freq (second tree)))))

(defn hash-tree [tree]
"make a hashmap with code for each letter as key and the letter as
value"
(cond
(char? tree) {"" tree}
:else
(let [left-map (hash-tree (first tree))
right-map (hash-tree (second tree))
func #(apply hash-map         ; apply hash-map because
; interleave return a seq
(interleave
(map (partial str %2) (keys %1)) ;add 0 or 1
;to the start
;of the keys
(vals %1)))]
;; add "0" to the keys of left nodes and "1" to the right nodes
(merge (func left-map "0") (func right-map "1")))))

(defn coder [s hash-coder]
"take a string s and return a coded string"
(apply str (map hash-coder s)))

(defn decoder [s hash-decoder]
"takes a string s and a hash-map hash-decoder and decode s"
;; code keyword in hashmap is for storing codes untill they are
;; complete and can be decoded with the decoder
(get (reduce (fn [m code]             ; reduce return {:message
; message,:code _}
(let [new-code (str (m :code) code)]
(if-let  [letter (get hash-decoder new-code)]
;; if there is a letter then add it to :message
;; and revert :code to empty
(assoc (update m :message #(str % letter))
:code "")
;; if there is not a letter then just add the
;; code letter to the :code
(update m :code #(str % code)))))
{:message "",:code ""}
s)
:message))           ;extract :message  value
;; ----------------EXAMPLE----------------
(def st "(bibbity bobbity)")

(def hash-decoder (->>
st
tree-string
remove-freq
hash-tree))
(def hash-coder (clojure.set/map-invert hash-decoder))
(println "coding...")
(def code (coder st hash-coder))
(clojure.pprint/pprint code)

(println "\ndecoding...")
(clojure.pprint/pprint (decoder code hash-decoder))

# Huffman Encoding
# Python 2.7+
# Submitted by Matthew Giallourakis

from collections import Counter

# constructs the tree
def build_huffman_tree(message):

# get sorted list of character and frequency pairs
frequencies = Counter(message)
trees = frequencies.most_common()

# while there is more than one tree
while len(trees) > 1:

# pop off the two trees of least weight from the trees list
tree_left,weight_left = trees.pop()
tree_right,weight_right = trees.pop()

# combine the nodes and add back to the nodes list
new_tree = [tree_left, tree_right]
new_weight = weight_left + weight_right

# find the first tree that has a weight smaller than new_weight and returns its index in the list
# If no such tree can be found, use len(trees) instead to append
index = next((i for i, tree in enumerate(trees) if tree < new_weight), len(trees))

# insert the new tree there
trees.insert(index, (new_tree, new_weight))

# Return an empty list if the message was empty, else the tree
huffman_tree = [] if not trees else trees
return huffman_tree

# constructs the mapping with recursion
def build_codebook(tree, code=''):
# Check whether our tree contains more than 1 element or not
if not tree:
return []
elif type(tree) != list:
return [(tree, '0')]

codebook = []

# split the tree
left_tree, right_tree = tree

# if the left node has children, find the mapping of those children
# else pair the character with the current code + 0
if type(left_tree) is list:
codebook += build_codebook(left_tree, code+'0')
else:
codebook.append((left_tree, code+'0'))

# if the right node has children, find the mapping of those children
# else pair the character with the current code + 1
if type(right_tree) is list:
codebook += build_codebook(right_tree, code+'1')
else:
codebook.append((right_tree, code+'1'))
return codebook

# encodes the message
def huffman_encode(codebook, message):

encoded_message = ''

# build a char -> code dictionary
forward_dict = dict(codebook)

# replace each character with its code
for char in message:
encoded_message += forward_dict[char]

return encoded_message

# decodes a message
def huffman_decode(codebook, encoded_message):

decoded_message = ''
key = ''

# build a code -> char dictionary
inverse_dict = dict([(v, k) for k, v in codebook])

# for each bit in the encoding
# if the bit is in the dictionary, replace the bit with the paired character
# else look at the bit and the following bits together until a match occurs
# move to the next bit not yet looked at
for index, bit in enumerate(encoded_message):
key += bit
if key in inverse_dict:
decoded_message += inverse_dict[key]
key = ''

return decoded_message

def main():

# test example
message = 'bibbity_bobbity'
tree = build_huffman_tree(message)
codebook = build_codebook(tree)
encoded_message = huffman_encode(codebook, message)
decoded_message = huffman_decode(codebook, encoded_message)

print('message: ' + message)
print('huffman tree: ' + str(tree))
print('codebook: ' + str(codebook))
print('encoded message: ' + encoded_message)
print('decoded message: ' + decoded_message)

# prints the following:
#
#  message: bibbity_bobbity
#  huffman_tree: ['b', [[['_', 'o'], 'y'], ['t', 'i']]]
#  codebook: [('b', '0'), ('_', '1000'), ('o', '1001'),
#             ('y', '101'), ('t', '110'), ('i', '111')]
#  encoded_message: 01110011111010110000100100111110101
#  decoded_message: bibbity_bobbity

if __name__ == '__main__':
main()

function encode(str) {
const tree = createTree(str);
const codebook = createCodebook(tree);
return {
string: [...str].map(c => codebook[c]).join(""),
tree,
codebook
};

function createTree(str) {
const chars = [...str];
const charCounts = chars.reduce((counts, char) => {
counts[char] = (counts[char] || 0) + 1;
return counts;
}, {});

const nodes = Object.entries(charCounts).map(([key, weight]) => ({ key, weight }));
const priorityQueue = makeQueue(nodes);
while (priorityQueue.data.length > 1) {
const left = priorityQueue.dequeue();
const right = priorityQueue.dequeue();
priorityQueue.enqueue({ weight: left.weight + right.weight, left, right });
}
return priorityQueue.dequeue();
}

function createCodebook(tree) {
return recurse(tree, "", {});

function recurse(node, bitstring, dict) {
if (!node.left && !node.right) {
dict[node.key] = bitstring;
} else {
if (node.left) {
recurse(node.left, bitstring + "0", dict);
}

if (node.right) {
recurse(node.right, bitstring + "1", dict);
}
}
return dict;
}
}
}

function decode(bitstring, tree) {
const result = [];
let node = tree;

for (const bit of [...bitstring]) {
node = bit === "0" ? node.left : node.right;
if (!node.left && !node.right) {
result.push(node.key);
node = tree;
}
}

return result.join("");
}

// This queue implementation is horribly inefficient, but a proper, heap-based implementation would
// be longer that the algorithm itself
function makeQueue(iterable) {
return {
data: [...iterable].sort((a, b) => a.weight - b.weight),
enqueue(value) {
const target = this.data.findIndex(x => x.weight > value.weight);
if (target === -1) {
this.data.push(value);
} else {
this.data = [...this.data.slice(0, target), value, ...this.data.slice(target)];
}
},
dequeue() {
return this.data.shift();
}
};
}

const encoded = encode("bibbity bobbity");
const decoded = decode(encoded.string, encoded.tree);
console.log(encoded.string);
console.log(decoded);

import java.util.*;

class Huffman {
public static void main(String[] args) {
HuffmanTree huffmanTree = new HuffmanTree("bibbity_bobbity");
huffmanTree.createTree();
String encoded = huffmanTree.encode();
System.out.println("Encoded String: " + encoded);
System.out.println("Decoded String: " + huffmanTree.decode(encoded));
}
}

class TreeNode {
String letter = "";
int frequency = 0;
TreeNode left = null, right = null;

public TreeNode(String letter, int frequency) {
this.letter = letter;
this.frequency = frequency;
}

public TreeNode(int frequency, TreeNode left, TreeNode right) {
this.frequency = frequency;
this.left = left;
this.right = right;
}
}

class HuffmanTree {
private Map<String, Integer> frequencyMap = new HashMap<>();
private Map<String, String> codeBook = new HashMap<>(), reverseCodeBook = new HashMap<>();
private TreeNode root;
private String stringToEncode;

public HuffmanTree(String stringToEncode) {
this.stringToEncode = stringToEncode;
}

public void createTree() {
for (int i = 0; i < stringToEncode.length(); i++) {
String key = Character.toString(stringToEncode.charAt(i));
if (!frequencyMap.containsKey(key)) {
frequencyMap.put(key, 1);
} else {
int frequency = frequencyMap.get(key) + 1;
frequencyMap.replace(key, frequency);
}
}
Queue<TreeNode> priorityQueue = new PriorityQueue<>(Comparator.comparingInt(o -> o.frequency));
for (Map.Entry<String, Integer> m : frequencyMap.entrySet()) {
}
while (priorityQueue.size() > 1) {
TreeNode left = priorityQueue.remove();
TreeNode right = priorityQueue.remove();
priorityQueue.add(new TreeNode(left.frequency + right.frequency, left, right));
}
root = priorityQueue.remove();
}

private void traverse(TreeNode node, StringBuilder code) {
if (node.left == null && node.right == null) {
codeBook.put(node.letter, code.toString());
}
if (node.left != null) {
traverse(node.left, code.append(0));
code.deleteCharAt(code.length() - 1);
}
if (node.right != null) {
traverse(node.right, code.append(1));
code.deleteCharAt(code.length() - 1);
}
}

public void printCodeBook() {
System.out.println("Code Book");
for (Map.Entry<String, String> m : codeBook.entrySet()) {
System.out.println(m.getKey() + "\t" + m.getValue());
}
System.out.println();
}

private void CodeBookReverse() {
for (Map.Entry<String, String> m : codeBook.entrySet()) {
reverseCodeBook.put(m.getValue(), m.getKey());
}
}

public String encode() {
traverse(root, new StringBuilder());
StringBuilder encode = new StringBuilder();
for (int i = 0; i < stringToEncode.length(); i++) {
String k = Character.toString(stringToEncode.charAt(i));
encode.append(codeBook.get(k));
}
printCodeBook();
return encode.toString();
}

public String decode(String encoded) {
StringBuilder decoded = new StringBuilder(), key = new StringBuilder();
CodeBookReverse();
for (int i = 0; i < encoded.length(); i++) {
key = key.append(encoded.charAt(i));
if (reverseCodeBook.containsKey(key.toString())) {
decoded.append(reverseCodeBook.get(key.toString()));
key = new StringBuilder();
}
}
return decoded.toString();
}
}

package main

import (
"container/heap"
"fmt"
)

type node struct {
freq  int
char  rune
left  *node
right *node
}

type codebook map[rune]string
type nodeHeap []*node

func (n nodeHeap) Len() int           { return len(n) }
func (n nodeHeap) Less(i, j int) bool { return n[i].freq > n[j].freq }
func (n nodeHeap) Swap(i, j int)      { n[i], n[j] = n[j], n[i] }

func (n *nodeHeap) Push(x interface{}) {
if node, ok := x.(*node); ok {
*n = append(*n, node)
} else {
fmt.Printf("I got a node of Type %T\n", x)
}
}

func (n *nodeHeap) Pop() interface{} {
old := *n
l := len(old)
x := old[l-1]
*n = old[0 : l-1]
return x
}

func buildTree(message string) *node {
freqMap := make(map[rune]*node)
h := new(nodeHeap)
heap.Init(h) // really needed?

for _, char := range message {
if _, ok := freqMap[char]; ok {
freqMap[char].freq++
} else {
newNode := new(node)
newNode.freq = 1
newNode.char = char
freqMap[char] = newNode
heap.Push(h, newNode)
}
}

for h.Len() > 1 {
left, right := h.Pop().(*node), h.Pop().(*node)
branch := new(node)
branch.freq = right.freq + left.freq
branch.left = left
branch.right = right
heap.Push(h, branch)
}

root := heap.Pop(h).(*node)
return root
}

func codebookRecurse(node *node, cb *codebook, code []rune) {
if node == nil {
return
}

if node.left == nil && node.right == nil {
(*cb)[node.char] = string(code)
}

code = append(code, '0')
codebookRecurse(node.left, cb, code)
code = append(code[:len(code)-1], '1')
codebookRecurse(node.right, cb, code)
}

func encode(message string) (string, *node, codebook) {
ret := ""
root := buildTree(message)
cb := generateCodebook(root)
for _, char := range message {
ret += cb[char]
}

return ret, root, cb
}

func decode(message string, root *node) string {
cur := root
ret := ""

for _, char := range message {
if cur == nil {
return message
}

switch string(char) {
case "0":
if cur.left == nil {
ret += string(cur.char)
cur = root.left
} else {
cur = cur.left
}
case "1":
if cur.right == nil {
ret += string(cur.char)
cur = root.right
} else {
cur = cur.right
}
}
}

if cur.char != 0 {
ret += string(cur.char)
}

return ret
}

func generateCodebook(root *node) codebook {
cb := make(codebook)
codeArr := make([]rune, 0)
codebookRecurse(root, &cb, codeArr)
return cb
}

func main() {
enc, root, cb := encode("bibbity_bobbity")
fmt.Println("Codebook:")
for r, c := range cb {
fmt.Println(string(r), "->", c)
}
fmt.Println("\nEncoded:", enc)
fmt.Println("Decoded:", decode(enc, root))
}

.intel_syntax noprefix

# System V calling convention cheatsheet
# Params: rdi, rsi, rdx, rcx, r8, r9, xmm0-7
# Return: rax (int 64 bits), rax:rdx (int 128 bits), xmm0 (float)
# Callee cleanup: rbx, rbp, r12-15
# Scratch: rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11

.section .rodata
text:       .string "bibbity bobbity"
original:   .string "Original message: %s\n"
encoded:    .string "Encoded message: "
decoded:    .string "Decoded message: %s\n"

.equ bitstr_len,       32
.equ bitstr_size,      40
.equ codebook_size,    256 * bitstr_size

.equ tree_left,        0
.equ tree_right,       8
.equ tree_count,       16
.equ tree_value,       20
.equ tree_size,        24

.equ heap_len,         0
.equ heap_data,        4
.equ heap_size,        512 * 8 + 16         # 512 ptrs + 4 byte length + 12 byte padding
.equ counts_size,      256 * 4

.equ msg_len,          0
.equ msg_data,         8
.section .text
.global main
.extern printf, calloc, malloc, memset, puts

main:
push   r12
push   r13
sub    rsp, codebook_size + 16              # 8 extra bytes for the Huffman-tree ptr, 8 bytes for padding

# Print the original text
mov    rdi, OFFSET original
mov    rsi, OFFSET text
xor    rax, rax
call   printf

# First encode the text. This will also initialize the Huffman-tree and the codebook
mov    rdi, OFFSET text
mov    rsi, rsp
lea    rdx, [rsp + codebook_size]
call   encode
mov    r12, rax                             # Save the returned message ptr

# Print the codebook and the encoded message
mov    rdi, rsp
call   print_codebook
mov    rdi, OFFSET encoded
xor    rax, rax
call   printf
mov    rdi, r12
call   print_message

# Decode and print the message
mov    rdi, r12
mov    rsi, QWORD PTR [rsp + codebook_size]
call   decode
mov    r13, rax
mov    rdi, OFFSET decoded
mov    rsi, r13
xor    rax, rax
call   printf

# Free allocated resources
mov    rdi, r12
call   free
mov    rdi, r13
call   free
mov    rdi, QWORD PTR [rsp + codebook_size]
call   free_tree

pop    r13
pop    r12

# Indiciate success with a 0 exit code
xor    rax, rax
ret

# rdi - text
# rsi - codebook ptr
# rdx - Huffman-tree ptr
# RET rax - encoded message ptr
encode:
push   r12
push   r13
push   r14
mov    r12, rdi                             # Save the original arguments
mov    r13, rsi
mov    r14, rdx
call   generate_tree                        # The text is already in rdi
mov    QWORD PTR [r14], rax                 # Save the Huffman-tree's root
mov    rdi, r13                             # Set up the parameters for codebook generation: codebook ptr, Huffman-tree root
mov    rsi, rax
call   generate_codebook
xor    rax, rax
xor    r14, r14                             # We'll use r14 to keep track of the length of the message
mov    rcx, r12                             # Make a copy of the pointer to the message to be encoded
encode_calculate_length:
mov    al, BYTE PTR [rcx]
test   al, al                               # If we're at the terminating null character then we're ready to encode
jz     encode_message
lea    rdx, [rax + 4*rax]                   # We get the codebook entry at the specific index
lea    r8, [r13 + 8*rdx]
add    r14, QWORD PTR [r8 + bitstr_len]     # And add the encoded word length to the total
inc    rcx
jmp    encode_calculate_length
encode_message:
mov    rdi, 1
lea    rsi, [r14 + 7]                       # Calculate the number of bytes we need to allocate to fit all the bits
shr    rsi, 3                               # length % 8 rounded up = (length + 8 - 1) / 8
lea    rsi, [rsi + 8]                       # Make space for an 8-byte length field
call   calloc                               # Allocate the necessary memory, the message will be in rax
mov    QWORD PTR [rax], r14                 # Save the length of the message
# Registers:
#   - r12: text
#   - r13: codebook_ptr
#   - rax: message ptr
#   - free to use: rdi, rsi, rcx, rdx, r8, r9, r10, r11, r14
xor    r8, r8                               # Bit offset
lea    r9, [rax + 8]                        # 8-byte message block
encode_message_bits:
xor    rdi, rdi                             # We need to clear rdi because moving a single byte to dil doesn't do so
mov    dil, BYTE PTR [r12]                  # Iterate the message again
test   dil, dil                             # If we're at the the null terminator we're done
jz     encode_done
lea    rdx, [rdi + 4*rdi]                   # Get the codebook entry
lea    r10, [r13 + 8*rdx]
mov    r11, QWORD PTR [r10 + bitstr_len]    # Load the bitstring length
lea    r14, [r10]                           # The bitstring qword we're currently processing
encode_message_bits_qword:
mov    rdi, QWORD PTR [r14]                 # Calculate the first mask: [code qword] << [bit offset]
mov    rsi, rdi                             # Get a second copy of the code's current qword
mov    rcx, r8
shl    rdi, cl
or     QWORD PTR [r9], rdi                  # Apply the mask to the current block
mov    rcx, 64                              # Calculate the second mask: [code qword] >> [64 - bit offset]
sub    rcx, r8
shr    rsi, cl
mov    rcx, r11                             # Copy the code length so we can manipulate it without destroying the original value
sub    rcx, 64
jle    encode_message_bits_try_overflow     # If the length was less than or equal to 64, check if the code qword would overflow the current message block
mov    r11, rcx                             # We wanted to subtract 64 from the code length anyway
lea    r9, [r9 + 8]                         # Load the next message block
or     QWORD PTR [r9], rsi                  # Save the second mask to the new message block
jmp    encode_message_bits_qword
encode_message_bits_try_overflow:
add    rcx, r8                              # Calculate [code length] + [bit offset] - 64
jl     encode_calculate_new_bit_offset      # If the result is less than 0 then we have no remaining bits -> calculate the new bit offset
mov    r8, rcx                              # Otherwise this also happens to be our new bit offset
lea    r9, [r9 + 8]                         # Load the next message block
or     QWORD PTR [r9], rsi                  # Save the second mask to the new message block
inc    r12                                  # Go to the next character in the input
jmp    encode_message_bits
encode_calculate_new_bit_offset:
lea    r8, [r8 + r11]                       # Calculate the bit offset for the next code qword
inc    r12
jmp    encode_message_bits
encode_done:
pop    r14
pop    r13
pop    r12
ret

# rdi - encoded message
# rsi - Huffman-tree root (ptr)
# RET rax - the decoded message
decode:
push   r12
push   r13
push   r14
mov    r12, rdi
mov    r13, rsi
mov    rdi, QWORD PTR [r12]                 # Load the length of the message
mov    r14, rdi                             # We'll use the length of the message as a loop counter later
lea    rdi, [rdi + 1]                       # The null terminator
call   malloc                               # This will usually be more than enough memory to contain the whole decoded message (we don't handle pathological cases right now)
mov    rdi, r12                             # The single-character decoder doesn't touch rdi so we can hoist it before the loop
xor    rcx, rcx
mov    rdx, rax                             # The current byte in the output string
decode_loop:
cmp    rcx, r14                             # The encoded message bit counter
jge    decode_done
mov    rsi, r13                             # The current node in the Huffman-tree
decode_loop_char:
test   rsi, rsi                             # If the Huffman-tree node is null then we reached a dead-end -> start over
jz     decode_loop
cmp    QWORD PTR [rsi + tree_left], 0       # If the node has either a left or a right child, treat it as a branch
jnz    decode_loop_char_branch
cmp    QWORD PTR [rsi + tree_right], 0
jnz    decode_loop_char_branch
mov    r9d, DWORD PTR [rsi + tree_value]    # Load the value in this node in case the next iteration needs it
mov    BYTE PTR [rdx], r9b                  # And save it to the output
lea    rdx, [rdx + 1]                       # Advance the output string
jmp    decode_loop
decode_loop_char_branch:
mov    r9, rcx                              # First, load the byte of the message the current bit is in
shr    r9, 3
mov    r10b, BYTE PTR [rdi + r9 + msg_data]
mov    r11, rcx                             # Save rcx in another register temporarily so we can restore it without push/pop
and    rcx, 7
shr    r10, cl                              # Get the bit we're interested in to position 0
lea    rcx, [r11 + 1]                       # Restore rcx and immediately add 1 to get the next bit to decode
and    r10, 0x1                             # Zero out all other bits
mov    r8, rsi
mov    rsi, QWORD PTR [r8 + tree_left]      # Take the left branch for 0, the right branch for a non-zero bit
cmovnz rsi, QWORD PTR [r8 + tree_right]
jmp    decode_loop_char
decode_done:
mov    BYTE PTR [rdx], 0                    # Write the null terminator at the end of the string
pop    r14
pop    r13
pop    r12
ret

# rdi - The starting address of the codebook we want to generate
# rsi - Huffman-tree root (ptr)
generate_codebook:
push   r12
sub    rsp, bitstr_size + 16                # 16 extra bytes for alignment
mov    r12, rsi
xorps  xmm0, xmm0                           # Create a 0-initialized bitstring. This will be
movaps XMMWORD PTR [rsp], xmm0              # used in the recursive function calls
movaps XMMWORD PTR [rsp + 16], xmm0
mov    QWORD PTR [rsp + 32], 0
xor    rsi, rsi
mov    rdx, codebook_size
call   memset
mov    rdi, rax
mov    rsi, r12
mov    rdx, rsp
call   generate_codebook_recurse
pop    r12
ret

# rdi - The codebook's starting address
# rsi - The current Huffman-tree node
# rdx - The bitstring used for code generation
generate_codebook_recurse:
push   rbp
push   r12
push   r13
test   rsi, rsi                             # If we reached a null pointer we're done
jz     generate_codebook_recurse_done
mov    r12, rsi
cmp    QWORD PTR [r12 + tree_left], 0       # If at least one of the children is not null
jnz    generate_codebook_branch             # then we need to treat the current node as a branch
cmp    QWORD PTR [r12 + tree_right], 0
jnz    generate_codebook_branch
mov    r8d, DWORD PTR [r12 + tree_value]    # Get the value of the current node
movaps xmm0, XMMWORD PTR [rdx]              # Get the values of the current bitstring into some registers
movaps xmm1, XMMWORD PTR [rdx + 16]
mov    r9, QWORD PTR [rdx + 32]
lea    rax, [r8 + 4*r8]                     # The index calculation needs to add 40 * index. With lea arithmetic this can be represented as
lea    r10, [rdi + 8*rax]                   # base address + 8 * (5 * index). This is done in two lea instructions
movups XMMWORD PTR [r10], xmm0              # And copy the data over to it
movups XMMWORD PTR [r10 + 16], xmm1
mov    QWORD PTR [r10 + 32], r9
jmp    generate_codebook_recurse_done
generate_codebook_branch:
# First, calculate the necessary indices and bitmask to use for the bitstring
mov    r13, QWORD PTR [rdx + bitstr_len]    # Load the current length of the bitstring
mov    rcx, r13                             # This will be used to index into the bitstring data. We'll need two copies for it
shr    r13, 6                               # We first get which 64 bit chunk of the bitstring we want to modify
and    rcx, 63                              # Then the bit we want to change
mov    rbp, 1                               # Generate the mask we'll use to set the correct bit
shl    rbp, cl
or     QWORD PTR [rdx + 8*r13], rbp         # Set the bit
inc    QWORD PTR [rdx + bitstr_len]         # Increase the bitstring length
mov    rsi, QWORD PTR [r12 + tree_right]
call   generate_codebook_recurse
# Now we move on to the left branch: rbx - left child, r13 - bitstring index, rbp - mask
not    rbp
and    QWORD PTR [rdx + 8*r13], rbp
mov    rsi, QWORD PTR [r12 + tree_left]
call   generate_codebook_recurse
dec    QWORD PTR [rdx + bitstr_len]         # Decrease the bitstring length
generate_codebook_recurse_done:
pop    r13
pop    r12
pop    rbp
ret

# rdi - text
# RET rax - Huffman-tree root (ptr)
generate_tree:
push   r12
push   r13
sub    rsp, 5128                            # 1024 bytes for the char counts, 4 bytes for heap length, 4096 bytes for the heap, 4 byte padding
mov    r12, rdi                             # Save the original text so it doesn't get clobbered
mov    rdi, rsp                             # Zero out the character counts and the heap length
xor    rsi, rsi
mov    rdx, 1040
call   memset
xor    rax, rax
generate_tree_count_chars:
mov    al, BYTE PTR [r12]
test   al, al
jz     generate_tree_leaves_setup
inc    DWORD PTR [rsp + 4*rax]
inc    r12
jmp    generate_tree_count_chars
generate_tree_leaves_setup:
mov    r12, 255                             # The loop counter. We can only get here if the "test" on line 301 resulted in a zero so the next jl instruction will do the right thing
generate_tree_leaves:
jl     generate_tree_one_leaf               # If not then it's time to generate the branches
mov    r13d, DWORD PTR [rsp + 4*r12]        # Load the count at the ith position
test   r13d, r13d                           # And check if it's zero
jz     generate_tree_leaves_counters        # If it is we can skip this iteration
mov    rdi, 1                               # If not, we need to allocate a new leaf node
mov    rsi, tree_size
call   calloc
mov    DWORD PTR [rax + tree_value], r12d   # Save the value and the count to the tree
mov    DWORD PTR [rax + tree_count], r13d
lea    rdi, [rsp + counts_size]             # Then push it onto the heap
mov    rsi, rax
call   heap_push
generate_tree_leaves_counters:
dec    r12                                  # Decrement the loop counter and start over
jmp    generate_tree_leaves
generate_tree_one_leaf:
cmp    DWORD PTR [rsp + counts_size], 1     # Check if there is only one element in the heap
jne    generate_tree_branches
lea    rdi, [rsp + counts_size]             # Get the element
call   heap_pop
mov    r12, rax
mov    rdi, tree_size                       # Create the new tree node, the pointer to it will be in rax
call   malloc
mov    QWORD PTR [rax + tree_left], r12     # Save element in the left node
mov    ecx, DWORD PTR [r12 + tree_count]    # Save element count in branch
mov    DWORD PTR [rax + tree_count], ecx
jmp    generate_tree_ret                    # Returning
generate_tree_branches:
cmp    DWORD PTR [rsp + counts_size], 1     # Check if there are still at least two elements in the heap
jle    generate_tree_done                   # If not, we're done
lea    rdi, [rsp + counts_size]             # Get the left child
call   heap_pop
mov    r12, rax
lea    rdi, [rsp + counts_size]             # Get the right child
call   heap_pop
mov    r13, rax
mov    rdi, tree_size                       # Create the new tree node, the pointer to it will be in rax
call   malloc
mov    ecx, DWORD PTR [r12 + tree_count]    # The new node's count: left count + right count
add    ecx, DWORD PTR [r13 + tree_count]
mov    QWORD PTR [rax + tree_left], r12     # Save the new node's fields: left, right, count (leave value unititialized, it shouldn't be used with branch nodes)
mov    QWORD PTR [rax + tree_right], r13
mov    DWORD PTR [rax + tree_count], ecx
lea    rdi, [rsp + counts_size]             # Add the branch to the heap
mov    rsi, rax
call   heap_push
jmp    generate_tree_branches
generate_tree_done:
lea    rdi, [rsp + counts_size]             # The tree's root will be in rax after the pop
call   heap_pop
generate_tree_ret:
pop    r13
pop    r12
ret

# rdi - heap ptr
# rsi - tree ptr
heap_push:
lea    rax, QWORD PTR [rdi + heap_data]     # We load the heap's data ptr and length to the respective registers
mov    ecx, DWORD PTR [rdi + heap_len]      # Load the current length
lea    edx, [ecx + 1]                       # First, calculate the new length (length + 1)
mov    DWORD PTR [rdi + heap_len], edx      # Then save it
mov    QWORD PTR [rax + 8*rcx], rsi         # And finally add the new value at the end of the array
heap_push_sift_up:
test   rcx, rcx                             # Test if we got to the root (index == 0)
jz     heap_push_done
lea    rdx, [rcx - 1]                       # Calculate the parent index: (index - 1) / 2
shr    rdx, 1
lea    r8, [rax + 8*rcx]                    # Get the pointer to the current and parent elements
lea    r9, [rax + 8*rdx]
mov    r10, QWORD PTR [r8]                  # Load the current and the parent elements
mov    r11, QWORD PTR [r9]
mov    esi, DWORD PTR [r10 + tree_count]    # Load the current tree's count
cmp    DWORD PTR [r11 + tree_count], esi    # If parent count <= current count
jle    heap_push_done                       # Then we're done
mov    QWORD PTR [r8], r11                  # Otherwise swap the two elements
mov    QWORD PTR [r9], r10
mov    rcx, rdx
jmp    heap_push_sift_up
heap_push_done:
ret

# rdi - heap ptr
# RET rax - tree ptr
heap_pop:
mov    r8d, DWORD PTR [rdi + heap_len]      # Load the heap's length
test   r8d, r8d                             # If it's 0 then the heap's empty
jz     heap_empty
lea    rdx, [rdi + heap_data]               # Get the heap's data ptr
mov    rax, QWORD PTR [rdx]                 # The return value will be the tree's current root
lea    r8d, [r8d - 1]                       # Calculate the new length
mov    DWORD PTR [rdi + heap_len], r8d      # And save it
mov    rsi, QWORD PTR [rdx + 8*r8]          # Load the element we're going to swap with the root
mov    QWORD PTR [rdx], rsi                 # Swap the root and the last element
mov    QWORD PTR [rdx + 8*r8], rax
xor    r9, r9                               # The loop index
heap_pop_sift_down:
mov    rcx, r9                              # Save the target index at the start of the loop
lea    r10, [r9 + r9 + 1]                   # The left child index
lea    r11, [r9 + r9 + 2]                   # The right child index
cmp    r10, r8
jge    heap_pop_check_right
mov    rdi, QWORD PTR [rdx + 8*r10]         # Load the left child
mov    rsi, QWORD PTR [rdx + 8*rcx]         # Load the target
mov    esi, DWORD PTR [rsi + tree_count]    # Load the target tree count
cmp    DWORD PTR [rdi + tree_count], esi    # If the left tree count < target tree count
jge    heap_pop_check_right
mov    rcx, r10
heap_pop_check_right:
cmp    r11, r8
jge    heap_pop_compare_indices
mov    rdi, QWORD PTR [rdx + 8*r11]         # Load the right child
mov    rsi, QWORD PTR [rdx + 8*rcx]         # Load the target
mov    esi, DWORD PTR [rsi + tree_count]    # Load the target tree count
cmp    DWORD PTR [rdi + tree_count], esi    # If the right tree count < target tree count
jge    heap_pop_compare_indices
mov    rcx, r11
heap_pop_compare_indices:
cmp    r9, rcx                              # If the target index == current index we're done
je     heap_pop_done
mov    rdi, QWORD PTR [rdx + 8*r9]          # Otherwise we swap the values
mov    rsi, QWORD PTR [rdx + 8*rcx]
mov    QWORD PTR [rdx + 8*r9], rsi
mov    QWORD PTR [rdx + 8*rcx], rdi
mov    r9, rcx
jmp    heap_pop_sift_down
heap_empty:
xor    rax, rax                             # Return a null pointer to indicate the heap was empty
heap_pop_done:
ret

# rdi - codebook start ptr
print_codebook:
push   rbx
push   r12
sub    rsp, 272                             # The bitstring we're going to print
mov    r12, rdi
xor    rbx, rbx                             # Save the loop counter into a register that doesn't get clobbered
print_codebook_loop:
cmp    rbx, 255
jg     print_codebook_done
lea    rax, [rbx + 4*rbx]                   # We get the codebook entry at the specific index
lea    r10, [r12 + 8*rax]
mov    rdx, QWORD PTR [r10 + bitstr_len]    # Load the length of the bitstring
test   rdx, rdx                             # If it's zero then the codepoint didn't exist in the original alphabet, skip
jz     print_codebook_counters
print_codebook_char:
mov    BYTE PTR [rsp], bl                   # First, the character we're printing the code for
mov    WORD PTR [rsp + 1], 0x203a           # Then ": "
mov    BYTE PTR [rsp + rdx + 3], 0x00       # At the end add the null terminator
print_codebook_generate_binary:
dec    rdx
jl     print_codebook_binary
mov    r9, rdx                              # Two copies of the loop counter
mov    rcx, rdx
shr    r9, 6                                # Calculate the bitstring part we're going to load
and    rcx, 63                              # The bit we're interested in
mov    rsi, QWORD PTR [r10 + r9]            # One of the 4, 64 bit parts of the bitstring we're going to print
shr    rsi, cl                              # Get the relevant bit into the 0th position
and    rsi, 1                               # Mask the rest of the bits
add    rsi, '0'                             # Convert it to ASCII
mov    BYTE PTR [rsp + rdx + 3], sil        # And copy it into the string
jmp    print_codebook_generate_binary
print_codebook_binary:
mov    rdi, rsp                             # Print the current bitstring
call   puts
print_codebook_counters:
inc    rbx                                  # And go to the next codebook entry
jmp    print_codebook_loop
print_codebook_done:
pop    r12
pop    rbx
ret

# rdi - message ptr
# This would run out of stack space for long messages but it will do for now
print_message:
push   r12
push   r13
mov    r12, rdi
mov    r13, QWORD PTR [rdi]                 # Get the length of the message
lea    rdi, [r13 + 1]                       # For the length of the string we'll need an additional the null terminator
call   malloc
xor    rdx, rdx
print_message_generate_string:
cmp    rdx, r13
jge    print_message_puts
mov    r8, rdx                              # Get two copies of the current index
mov    rcx, rdx
shr    r8, 3                                # We first get the byte we want to print
mov    r10b, BYTE PTR [r12 + r8 + msg_data]
and    rcx, 7                               # Then the bit in that byte
shr    r10, cl
and    r10, 0x1                             # Mask it so only the bit we're interested in is visible
add    r10, '0'                             # Convert it to ASCII
mov    BYTE PTR [rax + rdx], r10b           # Write it into the printable string
inc    rdx
jmp    print_message_generate_string
print_message_puts:
mov    BYTE PTR [rax + rdx], 0x00           # Write the null terminator
mov    rdi, rax                             # And print the string
call   puts
pop    r13
pop    r12
ret

# rdi - tree ptr
free_tree:
push   rbx
mov    rbx, rdi
test   rbx, rbx                             # When the tree ptr we're trying to free is already null we reached the termination condition
jz     free_tree_done
mov    rdi, [rbx + tree_left]               # Otherwise free the left child first
call   free_tree
mov    rdi, [rbx + tree_right]              # Then the right child
call   free_tree
mov    rdi, rbx                             # And finally, the node itself
call   free
free_tree_done:
pop    rbx
ret

import scala.collection.mutable.{Map, PriorityQueue}

object HuffmanEncoding {

trait Node {
var weight: Int
}

case class Leaf(char: Char, var weight: Int) extends Node

case class Branch(left: Node, right: Node, var weight: Int) extends Node

def createTree(phrase: String): Option[Node] = {

val tree = PriorityQueue[Node]()(Ordering.by(-_.weight))
tree ++= phrase
.groupBy(identity)
.mapValues(_.length)
.map{
case (char, count) => Leaf(char, count)
}

while (tree.size > 1) {
val node1 = tree.dequeue()
val node2 = tree.dequeue()
tree += Branch(node1, node2, node1.weight + node2.weight)
}

}

def createCodeBook(maybeRoot: Option[Node]): Map[Char, String] = {
val codeBook = Map[Char, String]()

def codeBookRecurse(node: Node, code: String): Unit =
node match {
case Leaf(symbol, _) => codeBook.put(symbol, code)
case Branch(left, right, _) => {
codeBookRecurse(left, code + "0")
codeBookRecurse(right, code + "1")
}
}

maybeRoot.foreach(c => codeBookRecurse(c, ""))

codeBook
}

def encode(phrase: String, codeBook: Map[Char, String]): String = {
phrase.flatMap(c => codeBook.getOrElse(c, "?"))
}

def decode(encoded: String, maybeRoot: Option[Node]): String = {
val root = maybeRoot.getOrElse(Leaf('?', 0))
var currentNode = root

def chooseTreeBranch(bit: Char) =
currentNode match {
case Branch(left, right, _) =>
currentNode = if (bit == '0') left else right
case _ =>
}

def maybeGetACharacter =
currentNode match {
case Leaf(c, _) => {
currentNode = root
Some(c)
}
case _ => None
}

encoded
.flatMap(bit => {
chooseTreeBranch(bit)
maybeGetACharacter
})
}

def main(args: Array[String]): Unit = {
val originalText = "bibbity_bobbity"
println("Original Text: " + originalText)

val tree = createTree(originalText)
val codeBook = createCodeBook(tree)
println("CodeBook is: " + codeBook)

val encoded = encode(originalText, codeBook)
println("Encoded text: " + encoded)

val decoded = decode(encoded, tree)
println("Decoded text: " + decoded)

}

}


The code snippet was taken from this scratch project from collections import Counter, deque
from bisect import bisect

class Tree

data Empty() from Tree
data Leaf(char, n is int) from Tree:
def __str__(self):
return f'Leaf({self.char}, {self.n})'

__repr__ = __str__

data Node(left is Tree, right is Tree) from Tree:
def __str__(self):
return f'Node({str(self.left)}, {str(self.right)})'
__repr__ = __str__

def weight(Tree()) = 0
addpattern def weight(Leaf(char, n)) = n
addpattern def weight(Node(left, right)) = weight(left) + weight(right)

def build_huffman_tree(message):

# get sorted list of character and frequency pairs
frequencies = Counter(message)
trees = frequencies.most_common() |> map$(t -> Leaf(*t)) |> reversed |> deque if not trees: return Empty() # while there is more than one tree while len(trees) > 1: # pop off the two trees of least weight from the trees list tree_left = trees.popleft() tree_right = trees.popleft() # combine the nodes and add back to the nodes list new_tree = Node(tree_left, tree_right) # find the first tree that has a weight smaller than new_weight # and returns its index in the list. # If no such tree can be found, use len(trees) instead to append index = bisect(trees |> map$(weight) |> list, weight(new_tree))

# insert the new tree there
trees.insert(index, new_tree)

huffman_tree = trees
return huffman_tree

def build_codebook(Empty(), code='') = []
addpattern def build_codebook(Leaf(char, n), code='') = [(char, code)]
addpattern def build_codebook(Node(left, right), code='') =
build_codebook(left, code+'0') + build_codebook(right, code+'1')

def huffman_encode(codebook, message):

if len(codebook) == 1:
return '0' * len(message)

# build a char -> code dictionary
forward_dict = dict(codebook)

return ''.join(message |> map\$(forward_dict[]))

def huffman_decode(codebook, encoded_message):

decoded_message = []
key = ''

if not codebook:
return ''
elif len(codebook) == 1:
return codebook * len(encoded_message)

# build a code -> char dictionary
inverse_dict = dict((v, k) for k, v in codebook)

# for each bit in the encoding
# if the bit is in the dictionary, replace the bit with the paired
# character else look at the bit and the following bits together
# until a match occurs move to the next bit not yet looked at.
if encoded_message == '':
return inverse_dict['']

for bit in encoded_message:
key += bit
if key in inverse_dict:
decoded_message.append(inverse_dict[key])
key = ''

return ''.join(decoded_message)

if __name__ == '__main__':
# test example
message = 'bibbity_bobbity'
tree = build_huffman_tree(message)
codebook = build_codebook(tree)
encoded_message = huffman_encode(codebook, message)
decoded_message = huffman_decode(codebook, encoded_message)

print('message:', message)
print('huffman tree:', tree)
print('codebook:', codebook)
print('encoded message:', encoded_message)
print('decoded message:', decoded_message)

# prints the following:
#
#  message: bibbity_bobbity
#  huffman_tree: Node(Leaf(b, 6), Node(Node(Leaf(y, 2), Leaf(t, 2)),
#                     Node(Node(Leaf(o, 1), Leaf(_, 1)), Leaf(i, 3))))
#  codebook: [('b', '0'), ('y', '100'), ('t', '101'),
#             ('o', '1100'), ('_', '1101'), ('i', '111')]
#  encoded_message: 01110011110110011010110000111101100
#  decoded_message: bibbity_bobbity


##### Text 