lowbattery Posted November 12

I've been using this code for a long time to escape Unicode text:

    Execute(StringRegExpReplace($sData, '(.)', '(AscW("$1")>127?"\\u"&StringLower(Hex(AscW("$1"),4)):"$1")&') & "''")

But it seems to return incorrect data for certain characters. For example:

🐒 should encode to: \ud83d\udc12
But that code encodes it to: \ud83d

As another example:

💧 should encode to: \ud83d\udca7
But the AutoIt code encodes it to: \ud83d

Any ideas on what's going on and how I can adjust the AutoIt code to properly encode the emojis?
lowbattery Posted November 12 (Author, Solution)

Wow. Claude Sonnet 3.5 doesn't fail to impress me!

    Func EncodeUnicode($sText)
        Local $sResult = ""
        Local $iLen = StringLen($sText)
        Local $i = 1

        While $i <= $iLen
            Local $iCode = AscW(StringMid($sText, $i, 1))

            ; Check if this is a high surrogate (first part of surrogate pair)
            If $iCode >= 0xD800 And $iCode <= 0xDBFF And $i < $iLen Then
                ; Get the low surrogate (second part)
                Local $iLowSurrogate = AscW(StringMid($sText, $i + 1, 1))

                If $iLowSurrogate >= 0xDC00 And $iLowSurrogate <= 0xDFFF Then
                    ; Valid surrogate pair - encode both parts
                    $sResult &= "\u" & StringLower(Hex($iCode, 4)) & "\u" & StringLower(Hex($iLowSurrogate, 4))
                    $i += 2 ; Skip the next character as we've already processed it
                    ContinueLoop
                EndIf
            EndIf

            ; Handle regular characters
            If $iCode > 127 Then
                $sResult &= "\u" & StringLower(Hex($iCode, 4))
            Else
                $sResult &= StringMid($sText, $i, 1)
            EndIf

            $i += 1
        WEnd

        Return $sResult
    EndFunc

    ; Helper function to test the encoding
    Func TestEncoding()
        Local $aTestCases[][2] = [ _
            ["🐒", "\ud83d\udc12"], _
            ["💧", "\ud83d\udca7"], _
            ["Hello 👋 World", "Hello \ud83d\udc4b World"], _
            ["🌍", "\ud83c\udf0d"], _
            ["😀", "\ud83d\ude00"] _
        ]

        For $i = 0 To UBound($aTestCases) - 1
            Local $sInput = $aTestCases[$i][0]
            Local $sExpected = $aTestCases[$i][1]
            Local $sResult = EncodeUnicode($sInput)

            ConsoleWrite("Input: " & $sInput & @CRLF)
            ConsoleWrite("Expected: " & $sExpected & @CRLF)
            ConsoleWrite("Got: " & $sResult & @CRLF)
            ; == is the case-sensitive string comparison, so upper-case hex digits would not count as a match
            ConsoleWrite("Match: " & ($sResult == $sExpected) & @CRLF & @CRLF)
        Next
    EndFunc

Edited November 12 by lowbattery
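The expected strings in the test table can be cross-checked outside AutoIt. The following is a minimal Python sketch, not part of the original post: `escape_non_ascii` is a hypothetical helper name, and it assumes the target format is one lower-case `\uXXXX` escape per UTF-16 code unit, as in the examples above. It derives the escapes by encoding each character to UTF-16 rather than by testing surrogate ranges by hand.

```python
def escape_non_ascii(text):
    """Escape every non-ASCII character as \\uXXXX, one escape per UTF-16 code unit.

    Hypothetical helper for illustration; mirrors the output format used in the thread.
    """
    out = []
    for ch in text:
        if ord(ch) <= 127:
            out.append(ch)  # ASCII passes through unchanged
            continue
        # Big-endian UTF-16: 2 bytes for BMP characters, 4 bytes (a surrogate pair) otherwise.
        # "surrogatepass" lets a lone surrogate through instead of raising an error.
        units = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(units), 2):
            unit = int.from_bytes(units[i:i + 2], "big")
            out.append("\\u%04x" % unit)
    return "".join(out)


if __name__ == "__main__":
    for sample in ["\U0001F412", "\U0001F4A7", "Hello \U0001F44B World"]:
        print(escape_non_ascii(sample))
```

Because Python strings are indexed by code point while AutoIt strings are indexed by UTF-16 code unit, the pair handling sits in the encoding step here instead of in an explicit range check.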
pixelsearch Posted November 13

There is an interesting post from @jchd where he explains why we can't use the two values of a surrogate pair (for example 0xD83D and 0xDE2D) in a RegEx pattern: "the AutoIt RegEx engine has already merged them to the actual codepoint 0x1F62D" (which corresponds to the "Loudly Crying Face" emoji).

So matching the corresponding emoji with a pattern like \x{1F62D} works (tested), but if I'm not wrong, the two separate values (high and low surrogate) aren't really helpful in an AutoIt RegEx?

I found an explanation of the merging process on the Wikipedia UTF-16 page: look at the dinosaur emoji at the upper right corner of the page, where the merging process is explained, and the same explanation appears again lower on the page.

jchd, thanks for the [\p{Cc}] pattern, which matches control characters (not only those below ASCII 0x20); it may help at times. I also experimented with [\p{Cf}], which worked too (format characters).

Concerning the pattern [\p{Cs}] for surrogates, I was only able to match a character whose code is in the range 0xDC00 to 0xDFFF (which is in fact the range of low surrogate values, but that shouldn't be really useful in a RegEx pattern, I guess).

It is impossible to use a RegEx pattern with the high surrogate values [\x{D800}-\x{DBFF}], as it generates an immediate error; these character codes are invalid by themselves in Unicode (?)

@lowbattery Thanks for sharing the code above, it's instructive.

Edited November 15 by pixelsearch (typo)
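The merging arithmetic mentioned above can be written out in a few lines. This is a minimal Python sketch of the standard UTF-16 surrogate formulas, not code from the thread; `to_surrogates` and `from_surrogates` are hypothetical names chosen for illustration.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Merge a (high, low) surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # The pair quoted above merges to the "Loudly Crying Face" code point.
    print(hex(from_surrogates(0xD83D, 0xDE2D)))
    # The monkey emoji from the first post splits into its two code units.
    print([hex(u) for u in to_surrogates(0x1F412)])
```

If the RegEx engine does hand over the merged character, this would presumably also explain the output of the original one-liner: `(.)` captures both code units of the emoji as a single character, and AscW then reports only the first unit, 0xD83D.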