AspirinJunkie Posted May 22, 2023 Share Posted May 22, 2023 Introduction The following UDF can be used to compress a string into a binary via Huffman encoding. To keep the overhead for the huff table integrated in the binary as small as possible, the canonical form of the huffman table was implemented. The UDF accesses many functionalities of the UDF "BinaryHelpers.au3". This is also part of the repository and offers functionalities for simplified handling of binary data in AutoIt. How to use? the following script: #include "Huffman.au3" ; read out the autoitscript.com page as example string Global $sString = BinaryToString(InetRead("https://autoitscript.com")) ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF) ; encode the string with a canonical huffman encoding Global $bEncoded = _huff_encodeString($sString) ConsoleWrite("Encoded Size: " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _ "Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF) ; decode the string out of the binary Global $sStringDecoded = _huff_decodeBinary($bEncoded) ; check if original and decoded string are the same ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF) produces: Original Size: 157081 Bytes Encoded Size: 102833 Bytes Compress ratio: 34.5 % $sString == $sStringDecoded: True How to further increase the compression potential? Huffman encoding is just a step to be able to compress data. In combination with further steps you can compress the data even more. For example, you can pre-compress the data at the text level and only then encode the data using Huffman. The following extends the above example with a (simple) word repetition compression: Spoiler expandcollapse popup#include "Huffman.au3" ; read out the autoitscript.com page as example string Global $sString = BinaryToString(InetRead("https://autoitscript.com")) ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF) ; precompress at text level by word repetition: Global $sStringCompressed = _strcmp_compressByWordRepetition($sString) ; encode the string with a canonical huffman encoding Global $bEncoded = _huff_encodeString($sStringCompressed) ConsoleWrite("Encoded Size: " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _ "Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF) ; decode the string out of the binary Global $sStringDecoded = _huff_decodeBinary($bEncoded) ; decompress at text level $sStringDecoded = _strcmp_decompressByWordRepetition($sStringDecoded) ; check if original and decoded string are the same ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF) ; Tokenbasierte Kompression - ersetzt wiederholende Teile durch Nummern im Text Func _strcmp_compressByWordRepetition($sString) Local $mTmp[3], $aTmp[2], $mPos, $iLen, _ $mWords[], _ $iPos = 1, _ $sWord, $iWord = 0 ; Bereits vorhandene Präfixnummern escapen: $sString = StringRegExpReplace($sString, '(?s)\b(\d+)', Chr(27) & "$1") ; Einzelne Wörter aus dem String extrahieren Do $aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b)', 1, $iPos) IF @error Then ExitLoop $iPos = @extended $sWord = $aMatch[0] If MapExists($mWords, $sWord) Then $aTmp = $mWords[$sWord] $mTmp = $aTmp[1] MapAppend($mTmp, $iPos - StringLen($sWord)) $aTmp[1] = $mTmp Else Local $mTmp[] Local $aTmp[2] = [$iWord, $mTmp] EndIf $mWords[$sWord] = $aTmp $iWord += 1 Until 0 ; bereite die notwendigen Replaces auf Local $aReplaces[$iWord][3], $iIndRepl = 0 For $sWord In MapKeys($mWords) $iLen = StringLen($sWord) $aTmp = $mWords[$sWord] $iWord = $aTmp[0] $mPos = $aTmp[1] If Ubound($mPos) < 1 Then ContinueLoop If $iLen <= Stringlen($iWord) Then ContinueLoop For $i In MapKeys($mPos) $iPos = $mPos[$i] $aReplaces[$iIndRepl][0] = $iPos $aReplaces[$iIndRepl][1] = $iLen $aReplaces[$iIndRepl][2] = $iWord $iIndRepl += 1 Next Next Redim $aReplaces[$iIndRepl][3] _ArraySort($aReplaces) ; Wörter durch Nummern im Text selbst ersetzen For $i = UBound($aReplaces) - 1 To 0 Step -1 $sString = StringLeft($sString, $aReplaces[$i][0] - 1) & _ $aReplaces[$i][2] & _ StringTrimLeft($sString, $aReplaces[$i][0] + $aReplaces[$i][1] -1) Next Return $sString EndFunc ; Tokenbasierte Dekompression - stellt einen mit _strcmp_compressByWordRepetition() komprimierten String wieder her Func _strcmp_decompressByWordRepetition($sString) Local $aTmp[3], _ $mWords[] $iPos = 1 ; Einzelne Wörter aus dem String extrahieren Do $aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b|\b(?<!\x1B)\d+\b)', 1, $iPos) IF @error Then ExitLoop $iPos = @extended $aTmp[0] = $aMatch[0] $aTmp[1] = $iPos - StringLen($aMatch[0]) $aTmp[2] = StringLen($aMatch[0]) MapAppend($mWords, $aTmp) Until 0 Local $aWords = _map_IntMap2Array($mWords) $aWords = _ArrayAinATo2d($aWords) ; Nummern wieder durch Buchstaben ersetzen Local $iWord For $i = UBound($aWords) - 1 To 0 Step -1 If Not StringRegExp($aWords[$i][0], "^\d+$", 0) Then ContinueLoop $iWord = Int($aWords[$i][0]) If $iWord >= UBound($aWords) Then ContinueLoop ;~ ConsoleWrite($iWord & @CRLF) $sString = StringLeft($sString, $aWords[$i][1] - 1) & _ $aWords[$iWord][0] & _ StringTrimLeft($sString, $aWords[$i][1] + $aWords[$i][2] -1) Next ; Escapte Zahlen wieder unescapen: $sString = StringRegExpReplace($sString, '(?s)\x1B(\d+)', "$1") Return $sString EndFunc ; #FUNCTION# ====================================================================================== ; Name ..........: _map_IntMap2Array() ; Description ...: returns the values of a map only (useful for array-list like maps produced by MapAppend()) ; Syntax ........: _map_IntMap2Array(ByRef $aMap) ; Parameters ....: ByRef $aMap - a map variable ; Return values .: 1D-array containing the values ; Author ........: aspirinjunkie ; Modified ......: 2022-07-13 ; ================================================================================================= Func _map_IntMap2Array(ByRef $aMap) Local $aRet[UBound($aMap)] Local $iInd = 0 For $vEl In $aMap $aRet[$iInd] = $vEl $iInd += 1 Next Return $aRet EndFunc ;==>_map_IntMap2Array ; #FUNCTION# ====================================================================================== ; Name ..........: _ArrayAinATo2d() ; Description ...: Convert a Arrays in Array into a 2D array ; Syntax ........: _ArrayAinATo2d(ByRef $A) ; Parameters ....: $A - the arrays in array which should be converted ; Return values .: Success: a 2D Array build from the input array ; Failure: Null ; @error = 1: $A is'nt an 1D array ; = 2: $A is empty ; = 3: first element isn't a array ; Author ........: AspirinJunkie ; ================================================================================================= Func _ArrayAinATo2d(ByRef $A) If UBound($A, 0) <> 1 Then Return SetError(1, UBound($A, 0), Null) Local $N = UBound($A) If $N < 1 Then Return SetError(2, $N, Null) Local $u = UBound($A[0]) If $u < 1 Then Return SetError(3, $u, Null) Local $a_Ret[$N][$u] For $i = 0 To $N - 1 Local $t = $A[$i] If UBound($t) > $u Then ReDim $a_Ret[$N][UBound($t)] For $j = 0 To UBound($t) - 1 $a_Ret[$i][$j] = $t[$j] Next Next Return SetExtended($N, $a_Ret) EndFunc ;==>_ArrayAinATo2d which leads to the following result: Original Size: 157081 Bytes Encoded Size: 73333 Bytes Compress ratio: 53.3 % $sString == $sStringDecoded: True >>sourcecode and download on github<< kurtykurtyboy and funkey 2 Link to comment Share on other sites More sharing options...
Andreik Posted May 22, 2023 Share Posted May 22, 2023 It looks and work good but I don't really like things that depends on other things. At least in your case there are needed just 2 files and after all this is just a personal preference. Well done! Link to comment Share on other sites More sharing options...
Skeletor Posted May 22, 2023 Share Posted May 22, 2023 So I read up about this. Its the type of coding that is also found in zip archives, MP3's , jpg file formats...can you set the compression ratio without having to update the UDF?? Kind RegardsSkeletor "Coffee: my defense against going postal." Microsoft Office Splash Screen | Basic Notepad Program (Beginner) | Transparent Splash Screen | Full Screen UI Link to comment Share on other sites More sharing options...
AspirinJunkie Posted May 22, 2023 Author Share Posted May 22, 2023 Yes, Huffman encoding is part of the compression algorithms of jpg, zip (deflate) and mp3. Important: it is only ONE component of several! These algorithms consist of further compression steps. First of all, Huffman encoding is a lossless encoding by itself. From the fact that mp3 and jpg on the other hand are lossy, you can already see that it is not enough to encode a picture via Huffman and you already have a jpg. These algorithms are more complex. The idea behind the Huffman is the following: Let's take any text in ANSI encoding. Each letter of the text is represented by 8 bits. But if you see that e.g. in a text the "e" occurs 500 times and the "x" only 1x, then you could come to the idea that it would be more space-saving to have an encoding that has a different number of bits per element - depending on its frequency. If you would need only 3 bits for the "e" and 15 bits for the "x", then you need on average fewer bits for the same information. There were some approaches to this but the Huffman code stands out here because it determines the optimal distribution. For a given set, it gives the encoding that saves the most space for its frequency distribution. So you get different results depending on the input but for this input every time the same. Therefore it should be clear now, why one cannot set a compression rate with the Huffman coding. Link to comment Share on other sites More sharing options...
Werty Posted May 22, 2023 Share Posted May 22, 2023 (edited) Cool You can test it out on "enwik9" if you heard of that, a compression competition that has been running for years with €500.000 in prizes (like, if you beat the previous record by 1% you get a certain cut of the 500k). A 1 gigabyte file containing the first pages of wikipedia that has to be compressed. Fabrice Bellard, maker of TinyCC and FFmpeg has also participated in it, cant remember his score though. http://prize.hutter1.net/index.htm Edited May 22, 2023 by Werty Some guy's script + some other guy's script = my script! Link to comment Share on other sites More sharing options...
AspirinJunkie Posted May 23, 2023 Author Share Posted May 23, 2023 Well, I don't need to compete. Only with a Huffman and some token-based text compression you won't get far. Apart from the fact that AutoIt would capitulate with this amount of data. But maybe to understand what this UDF is supposed to do: It is not supposed to be a compression algorithm in AutoIt. On the one hand it should help to understand the Huffman coding, because an algorithm in a language known to you is often better to understand than a long text about it. On the other hand it should serve as a template for own developments. In most compression algorithms the data are coded after a pretreatment finally with the Huffman code. So that one does not have to worry about this and can concentrate instead on the magic before, one can take this UDF. I have tried to demonstrate this with the 2nd example. There I implemented a corresponding preprocessing. This is relatively simple because it only replaces word repetitions with numeric tokens. I also wrote a variant which produces better compression rates, because it encodes not only whole words but also only prefixes. However, this is much slower and is not suitable as a simple example. With this example it should be clear that the huffman encoding is only one of several components. Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now