Jump to content

Huffman encoding/decoding + helpers for binary data


Recommended Posts

Introduction

The following UDF can be used to compress a string into a binary via Huffman encoding. To keep the overhead for the huff table integrated in the binary as small as possible, the canonical form of the huffman table was implemented.

The UDF accesses many functionalities of the UDF "BinaryHelpers.au3". This is also part of the repository and offers functionalities for simplified handling of binary data in AutoIt.

How to use?

the following script:

#include "Huffman.au3"

; read out the autoitscript.com page as example string
Global $sString = BinaryToString(InetRead("https://autoitscript.com"))
ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF)

; encode the string with a canonical huffman encoding
Global $bEncoded = _huff_encodeString($sString)
ConsoleWrite("Encoded Size:  " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _
    "Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF)

; decode the string out of the binary
Global $sStringDecoded = _huff_decodeBinary($bEncoded)

; check if original and decoded string are the same
ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF)

produces:

Original Size: 157081 Bytes
Encoded Size:  102833 Bytes
Compress ratio: 34.5 %
$sString == $sStringDecoded: True


How to further increase the compression potential?

Huffman encoding is just a step to be able to compress data. In combination with further steps you can compress the data even more. For example, you can pre-compress the data at the text level and only then encode the data using Huffman.

The following extends the above example with a (simple) word repetition compression:

Spoiler
#include "Huffman.au3"

; read out the autoitscript.com page as example string
Global $sString = BinaryToString(InetRead("https://autoitscript.com"))
ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF)

; precompress at text level by word repetition:
Global $sStringCompressed = _strcmp_compressByWordRepetition($sString)

; encode the string with a canonical huffman encoding
Global $bEncoded = _huff_encodeString($sStringCompressed)
ConsoleWrite("Encoded Size:  " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _
    "Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF)

; decode the string out of the binary
Global $sStringDecoded = _huff_decodeBinary($bEncoded)

; decompress at text level
$sStringDecoded = _strcmp_decompressByWordRepetition($sStringDecoded)

; check if original and decoded string are the same
ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF)







; Tokenbasierte Kompression - ersetzt wiederholende Teile durch Nummern im Text
Func _strcmp_compressByWordRepetition($sString)
    Local   $mTmp[3], $aTmp[2], $mPos, $iLen, _
            $mWords[], _
            $iPos = 1, _
            $sWord, $iWord = 0

    ; Bereits vorhandene Präfixnummern escapen:
    $sString = StringRegExpReplace($sString, '(?s)\b(\d+)', Chr(27) & "$1")

    ; Einzelne Wörter aus dem String extrahieren
    Do
        $aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b)', 1, $iPos)
        IF @error Then ExitLoop
        $iPos = @extended
        $sWord = $aMatch[0]

        If MapExists($mWords, $sWord) Then
            $aTmp = $mWords[$sWord]
            $mTmp = $aTmp[1]
            MapAppend($mTmp, $iPos - StringLen($sWord))
            $aTmp[1] = $mTmp
        Else
            Local $mTmp[]
            Local $aTmp[2] = [$iWord, $mTmp]
        EndIf
        $mWords[$sWord] = $aTmp

        $iWord += 1
    Until 0

    ; bereite die notwendigen Replaces auf
    Local $aReplaces[$iWord][3], $iIndRepl = 0
    For $sWord In MapKeys($mWords)
        $iLen = StringLen($sWord)
        $aTmp = $mWords[$sWord]
        $iWord = $aTmp[0]
        $mPos = $aTmp[1]

        If Ubound($mPos) < 1 Then ContinueLoop
        If $iLen <= Stringlen($iWord) Then ContinueLoop

        For $i In MapKeys($mPos)
            $iPos = $mPos[$i]

            $aReplaces[$iIndRepl][0] = $iPos
            $aReplaces[$iIndRepl][1] = $iLen
            $aReplaces[$iIndRepl][2] = $iWord

            $iIndRepl += 1

        Next
    Next
    Redim $aReplaces[$iIndRepl][3]
    _ArraySort($aReplaces)

    ; Wörter durch Nummern im Text selbst ersetzen
    For $i = UBound($aReplaces) - 1 To 0 Step -1
        $sString =  StringLeft($sString, $aReplaces[$i][0] - 1) & _
                    $aReplaces[$i][2] & _
                    StringTrimLeft($sString, $aReplaces[$i][0] + $aReplaces[$i][1] -1)
    Next

    Return $sString
EndFunc


; Tokenbasierte Dekompression - stellt einen mit _strcmp_compressByWordRepetition() komprimierten String wieder her
Func _strcmp_decompressByWordRepetition($sString)
    Local   $aTmp[3], _
            $mWords[]
            $iPos = 1

    ; Einzelne Wörter aus dem String extrahieren
    Do
        $aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b|\b(?<!\x1B)\d+\b)', 1, $iPos)
        IF @error Then ExitLoop
        $iPos = @extended

        $aTmp[0] = $aMatch[0]
        $aTmp[1] = $iPos - StringLen($aMatch[0])
        $aTmp[2] = StringLen($aMatch[0])
        MapAppend($mWords, $aTmp)
    Until 0

    Local $aWords = _map_IntMap2Array($mWords)
    $aWords = _ArrayAinATo2d($aWords)

    ; Nummern wieder durch Buchstaben ersetzen
    Local $iWord
    For $i = UBound($aWords) - 1  To 0 Step -1
        If Not StringRegExp($aWords[$i][0], "^\d+$", 0) Then ContinueLoop

        $iWord = Int($aWords[$i][0])
        If $iWord >= UBound($aWords) Then ContinueLoop
        ;~ ConsoleWrite($iWord & @CRLF)
        $sString =  StringLeft($sString, $aWords[$i][1] - 1) & _
                    $aWords[$iWord][0] & _
                    StringTrimLeft($sString, $aWords[$i][1] + $aWords[$i][2] -1)
    Next

    ; Escapte Zahlen wieder unescapen:
    $sString = StringRegExpReplace($sString, '(?s)\x1B(\d+)', "$1")

    Return $sString
EndFunc


; #FUNCTION# ======================================================================================
; Name ..........: _map_IntMap2Array()
; Description ...: returns the values of a map only (useful for array-list like maps produced by MapAppend())
; Syntax ........: _map_IntMap2Array(ByRef $aMap)
; Parameters ....: ByRef $aMap - a map variable
; Return values .: 1D-array containing the values
; Author ........: aspirinjunkie
; Modified ......: 2022-07-13
; =================================================================================================
Func _map_IntMap2Array(ByRef $aMap)
    Local $aRet[UBound($aMap)]
    Local $iInd = 0
    For $vEl In $aMap
        $aRet[$iInd] = $vEl
        $iInd += 1
    Next
    Return $aRet
EndFunc   ;==>_map_IntMap2Array


; #FUNCTION# ======================================================================================
; Name ..........: _ArrayAinATo2d()
; Description ...: Convert a Arrays in Array into a 2D array
; Syntax ........: _ArrayAinATo2d(ByRef $A)
; Parameters ....: $A             - the arrays in array which should be converted
; Return values .: Success: a 2D Array build from the input array
;                  Failure: Null
;                     @error = 1: $A is'nt an 1D array
;                            = 2: $A is empty
;                            = 3: first element isn't a array
; Author ........: AspirinJunkie
; =================================================================================================
Func _ArrayAinATo2d(ByRef $A)
    If UBound($A, 0) <> 1 Then Return SetError(1, UBound($A, 0), Null)
    Local $N = UBound($A)
    If $N < 1 Then Return SetError(2, $N, Null)
    Local $u = UBound($A[0])
    If $u < 1 Then Return SetError(3, $u, Null)

    Local $a_Ret[$N][$u]

    For $i = 0 To $N - 1
        Local $t = $A[$i]
        If UBound($t) > $u Then ReDim $a_Ret[$N][UBound($t)]
        For $j = 0 To UBound($t) - 1
            $a_Ret[$i][$j] = $t[$j]
        Next
    Next
    Return SetExtended($N, $a_Ret)
EndFunc   ;==>_ArrayAinATo2d

 

which leads to the following result:

Original Size: 157081 Bytes
Encoded Size:  73333 Bytes
Compress ratio: 53.3 %
$sString == $sStringDecoded: True

 

>>sourcecode and download on github<<

 

Link to comment
Share on other sites

It looks and work good but I don't really like things that depends on other things. At least in your case there are needed just 2 files and after all this is just a personal preference. Well done!

When the words fail... music speaks.

Link to comment
Share on other sites

So I read up about this. Its the type of coding that is also found in zip archives, MP3's , jpg file formats...can you set the compression ratio without having to update the UDF??

Kind Regards
Skeletor

"Coffee: my defense against going postal."

Microsoft Office Splash Screen | Basic Notepad Program (Beginner) | Transparent Splash Screen | Full Screen UI

Link to comment
Share on other sites

Yes, Huffman encoding is part of the compression algorithms of jpg, zip (deflate) and mp3.
Important: it is only ONE component of several! These algorithms consist of further compression steps.
First of all, Huffman encoding is a lossless encoding by itself.
From the fact that mp3 and jpg on the other hand are lossy, you can already see that it is not enough to encode a picture via Huffman and you already have a jpg.
These algorithms are more complex.

The idea behind the Huffman is the following: 
Let's take any text in ANSI encoding.
Each letter of the text is represented by 8 bits.
But if you see that e.g. in a text the "e" occurs 500 times and the "x" only 1x, then you could come to the idea that it would be more space-saving to have an encoding that has a different number of bits per element - depending on its frequency.
If you would need only 3 bits for the "e" and 15 bits for the "x", then you need on average fewer bits for the same information.

There were some approaches to this but the Huffman code stands out here because it determines the optimal distribution.
For a given set, it gives the encoding that saves the most space for its frequency distribution.
So you get different results depending on the input but for this input every time the same.
Therefore it should be clear now, why one cannot set a compression rate with the Huffman coding.

Link to comment
Share on other sites

Cool

You can test it out on "enwik9" if you heard of that, a compression competition that has been running for years with €500.000 in prizes (like, if you beat the previous record by 1% you get a certain cut of the 500k). A 1 gigabyte file containing the first pages of wikipedia that has to be compressed.

Fabrice Bellard, maker of TinyCC and FFmpeg has also participated in it, cant remember his score though.

http://prize.hutter1.net/index.htm

Edited by Werty

Some guy's script + some other guy's script = my script!

Link to comment
Share on other sites

Well, I don't need to compete. Only with a Huffman and some token-based text compression you won't get far.
Apart from the fact that AutoIt would capitulate with this amount of data.

But maybe to understand what this UDF is supposed to do: It is not supposed to be a compression algorithm in AutoIt.
On the one hand it should help to understand the Huffman coding, because an algorithm in a language known to you is often better to understand than a long text about it.
On the other hand it should serve as a template for own developments.
In most compression algorithms the data are coded after a pretreatment finally with the Huffman code. So that one does not have to worry about this and can concentrate instead on the magic before, one can take this UDF.

I have tried to demonstrate this with the 2nd example.
There I implemented a corresponding preprocessing.
This is relatively simple because it only replaces word repetitions with numeric tokens.
I also wrote a variant which produces better compression rates, because it encodes not only whole words but also only prefixes. However, this is much slower and is not suitable as a simple example.
With this example it should be clear that the huffman encoding is only one of several components.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...