Help Encoding Unicode


Solved by lowbattery

For a long time I've been using this code to escape Unicode text:

Execute(StringRegExpReplace($sData, '(.)', '(AscW("$1")>127?"\\u"&StringLower(Hex(AscW("$1"),4)):"$1")&') & "''")

But it seems to return incorrect data on certain characters. For example:

🐒

Should encode to:

\ud83d\udc12

But that code is encoding it to:

\ud83d

As another example:

💧

Should encode to:

\ud83d\udca7

But the AutoIt code is encoding it to:

\ud83d

Any ideas on what's going on and how I can adjust the AutoIt code to properly encode the emojis?

  • Solution

Wow. Claude Sonnet 3.5 doesn't fail to impress me!

Func EncodeUnicode($sText)
    Local $sResult = ""
    Local $iLen = StringLen($sText)
    Local $i = 1

    While $i <= $iLen
        Local $iCode = AscW(StringMid($sText, $i, 1))

        ; Check if this is a high surrogate (first part of a surrogate pair)
        If $iCode >= 0xD800 And $iCode <= 0xDBFF And $i < $iLen Then
            ; Get the low surrogate (second part)
            Local $iLowSurrogate = AscW(StringMid($sText, $i + 1, 1))

            If $iLowSurrogate >= 0xDC00 And $iLowSurrogate <= 0xDFFF Then
                ; Valid surrogate pair - encode both parts
                $sResult &= "\u" & StringLower(Hex($iCode, 4)) & "\u" & StringLower(Hex($iLowSurrogate, 4))
                $i += 2 ; Skip the next character as we've already processed it
                ContinueLoop
            EndIf
        EndIf

        ; Handle regular characters: escape anything above ASCII, copy the rest as-is
        If $iCode > 127 Then
            $sResult &= "\u" & StringLower(Hex($iCode, 4))
        Else
            $sResult &= StringMid($sText, $i, 1)
        EndIf

        $i += 1
    WEnd

    Return $sResult
EndFunc

; Helper function to test the encoding.
; The emoji literals need the script file saved in a Unicode encoding
; that AutoIt recognizes (for example UTF-8 with BOM).
Func TestEncoding()
    Local $aTestCases[][2] = [ _
        ["🐒", "\ud83d\udc12"], _
        ["💧", "\ud83d\udca7"], _
        ["Hello 👋 World", "Hello \ud83d\udc4b World"], _
        ["🌍", "\ud83c\udf0d"], _
        ["😀", "\ud83d\ude00"] _
    ]

    For $i = 0 To UBound($aTestCases) - 1
        Local $sInput = $aTestCases[$i][0]
        Local $sExpected = $aTestCases[$i][1]
        Local $sResult = EncodeUnicode($sInput)
        ConsoleWrite("Input: " & $sInput & @CRLF)
        ConsoleWrite("Expected: " & $sExpected & @CRLF)
        ConsoleWrite("Got: " & $sResult & @CRLF)
        ; == compares case-sensitively, so a wrong hex case is not reported as a match
        ConsoleWrite("Match: " & ($sResult == $sExpected) & @CRLF & @CRLF)
    Next
EndFunc

; Run the test cases
TestEncoding()
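
A note on why the function above works where the one-liner did not, with a shorter sketch. StringLen, StringMid and AscW all operate on single UTF-16 code units, which the function already relies on when it reads the low surrogate at $i + 1. Under that assumption both halves of a surrogate pair are already above 127, so escaping every code unit individually should give the same output without a special branch. The name EncodeUnicodeSimple is made up for illustration, and this is only a sketch, not a tested replacement:

```autoit
; Hypothetical reduced variant: escape every UTF-16 code unit above 127.
; Assumes StringMid/AscW work on single UTF-16 code units, so each half
; of a surrogate pair is visited (and escaped) on its own.
Func EncodeUnicodeSimple($sText)
    Local $sResult = ""
    Local $sChar, $iCode

    For $i = 1 To StringLen($sText)
        $sChar = StringMid($sText, $i, 1)
        $iCode = AscW($sChar)
        If $iCode > 127 Then
            $sResult &= "\u" & StringLower(Hex($iCode, 4))
        Else
            $sResult &= $sChar
        EndIf
    Next

    Return $sResult
EndFunc

; Build the monkey emoji (U+1F412) from its two code units, so the
; script's file encoding does not matter for this check.
ConsoleWrite(EncodeUnicodeSimple(ChrW(0xD83D) & ChrW(0xDC12)) & @CRLF)
```

The explicit surrogate check in EncodeUnicode is still the better starting point if unpaired surrogates ever need to be handled differently from valid pairs.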


There is an interesting post from @jchd where he explains why we can't use the two values of a surrogate pair (for example 0xD83D and 0xDE2D) in a RegEx pattern: "the AutoIt RegEx engine has already merged them to the actual codepoint 0x1F62D" (which corresponds to the "Loudly Crying Face" emoji).

So matching the corresponding emoji with a pattern like \x{1F62D} works (tested), but if I'm not wrong, the two separate values (high and low surrogate) aren't really helpful in an AutoIt RegEx?
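
A minimal sketch of that test, assuming the emoji is built from its two code units with ChrW so the script's file encoding plays no part. The variable name is illustrative; going by the result described above, the call should report a match:

```autoit
; "Loudly Crying Face" (U+1F62D) as the two UTF-16 code units stored in the string
Local $sCrying = ChrW(0xD83D) & ChrW(0xDE2D)

; Match against the merged code point rather than the separate surrogates.
; With the default flag, StringRegExp returns 1 for a match and 0 otherwise.
ConsoleWrite("Matches \x{1F62D}: " & StringRegExp($sCrying, "\x{1F62D}") & @CRLF)
```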

I found the explanation of the merging process on the Wikipedia UTF-16 page. Look at the dinosaur emoji in the upper right corner of the page: the merging process is explained there, and the same explanation appears again further down the page.
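
For reference, the merge is plain arithmetic: subtract 0xD800 from the high surrogate, shift it left by 10 bits, add the low surrogate minus 0xDC00, then add 0x10000. For 0xD83D and 0xDE2D that is 0x10000 + 0x3D * 0x400 + 0x22D = 0x1F62D. A small sketch of the same formula (the function name is made up for illustration):

```autoit
; Combine a high/low surrogate pair into the code point it encodes,
; using the standard UTF-16 formula. Illustrative helper, not a built-in.
Func SurrogatesToCodePoint($iHigh, $iLow)
    ; BitShift with a negative count shifts left: keep the 10 payload bits
    ; of each half and add the 0x10000 offset.
    Return 0x10000 + BitShift($iHigh - 0xD800, -10) + ($iLow - 0xDC00)
EndFunc

; Print the merged code point of the pair from the example as 5 hex digits
ConsoleWrite(Hex(SurrogatesToCodePoint(0xD83D, 0xDE2D), 5) & @CRLF)
```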

jchd, thanks for the [\p{Cc}] pattern, which matches control characters (not only those below ASCII 0x20); it may help at times. I also experimented with [\p{Cf}] (format characters), which worked too. As for the [\p{Cs}] pattern for surrogates, I was only able to match a character whose code is between 0xDC00 and 0xDFFF (which is in fact the range of low surrogate values, but I guess it shouldn't be really useful in a RegEx pattern).

It is impossible to use a RegEx pattern with the high surrogate values [\x{D800}-\x{DBFF}], as it generates an immediate error. Are these character codes invalid by themselves in Unicode (?)

@lowbattery Thanks for sharing the code above, it's instructive :)
