Opened 5 years ago
Last modified 8 months ago
#3731 assigned Bug
Binary() performs hidden and wrong conversion on strings
Reported by: | jchd18 | Owned by: | Jon |
---|---|---|---|
Milestone: | | Component: | AutoIt |
Version: | 3.3.14.5 | Severity: | None |
Keywords: | | Cc: | |
Description
One would expect Binary(<string>) to return the binary image of <string>, but that is not the case at all.
The string below contains the first 5 ASCII letters, a space and the corresponding 5 Greek letters.
ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)
In memory the string looks like this:
0041 0042 0043 0044 0045 0020 0391 0392 0393 0394 0395
and this is what one would expect from invoking Binary(), since AutoIt uses UCS2 (UTF16-LE limited to the Unicode BMP).
Instead we get something completely unusable: the Greek letters Alpha, Beta, Delta and Epsilon appear as question marks (0x3F, no equivalent in ASCII), but the letter Gamma surprisingly gets converted to ASCII G (0x47).
0x4142434445203F3F473F3F
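For comparison, the expected UTF16 image can be requested explicitly. A minimal sketch, assuming the documented StringToBinary() encoding flags (2 = UTF16 little endian, 3 = UTF16 big endian) and a script file saved in an encoding that preserves the Greek literals (for example UTF-8 with BOM):

```autoit
Local $s = "ABCDE ΑΒΓΔΕ"

; Flag 2: UTF16 little endian, i.e. the in-memory byte layout of the string
ConsoleWrite(StringToBinary($s, 2) & @LF)

; Flag 3: UTF16 big endian, same byte order as the 16-bit values listed above
ConsoleWrite(StringToBinary($s, 3) & @LF)

; Default conversion, the behaviour reported in this ticket
ConsoleWrite(Binary($s) & @LF)
```

The first two lines of output can be checked against the 16-bit values listed above; the last one reproduces the lossy result.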
Attachments (0)
Change History (6)
comment:1 Changed 5 years ago by Jpm
comment:2 Changed 5 years ago by jchd18
The doc is indeed incomplete, but there are a number of very unexpected "conversions" elsewhere in the range > 0xFF (maybe even in the range [0x7F,0xFF] depending on local codepage), making Binary(<string>) deceptive.
    ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

    Local $c, $b, $u
    For $i = 0x100 To 0xFFFF
        $c = ChrW($i)
        $b = Binary($c)
        If $b <> "0x3F" Then _U8ConsoleWrite(Hex($i, 4) & @TAB & $c & " --> " & @TAB & $b & @TAB & ChrW($b))
    Next

    ; Unicode-aware ConsoleWrite (set console to UTF8 for decent result)
    Func _U8ConsoleWrite($s)
        ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
    EndFunc   ;==>_U8ConsoleWrite
For instance some codepoints are converted, but not all possible and not always right:
β (lowercase Greek beta) turned into ß (German eszett) ?!?!?
Γ -> G but γ (lowercase Greek gamma) isn't converted
Many codepoints are unexpectedly converted to control characters!
I'm not completely against attempts to convert, say, Ā to A in a distinct function, but at least this has to be clearly documented AND it's better to have it right and consistent (that is much, much harder than it looks). In any case, a function named Binary shouldn't emasculate anything and, OTOH, an attempt to map UCS2 > 0x7F to the local Windows codepage is doomed to failure.
All in all I doubt a simple approach can be really satisfactory. From this point of view, _StringToHex() [which produces hex of the string in UTF8] and StringToASCIIArray() [which returns an array of codepoints] are more robust.
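The two alternatives named above can be compared side by side. A minimal sketch, assuming StringToASCIIArray() with its default encoding returns one UTF16 code unit per element and that flag 4 of StringToBinary()/BinaryToString() selects UTF8; the sample string $s is only an illustration, and the script file is assumed to be saved in an encoding that preserves the Greek literals:

```autoit
Local $s = "β Γ γ" ; sample characters taken from the cases discussed in this ticket

; Code points, one array element per character
Local $a = StringToASCIIArray($s)
For $i = 0 To UBound($a) - 1
    ConsoleWrite(Hex($a[$i], 4) & " ")
Next
ConsoleWrite(@LF)

; UTF8 bytes of the same string
Local $u8 = StringToBinary($s, 4)
ConsoleWrite($u8 & @LF)

; Round trip: decoding the UTF8 bytes should give back the original string
ConsoleWrite((BinaryToString($u8, 4) == $s) & @LF)
```

Since both calls name their encoding explicitly, the result should not depend on the local Windows codepage.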
comment:3 Changed 5 years ago by Jpm
- Owner set to Jon
- Status changed from new to assigned
I leave it to Jon to decide whether to change only the doc or to change the code to follow your recommendation.
comment:4 Changed 5 years ago by jchd18
Fine. The issue finally boils down to: "what should be the correct semantics of Binary when applied to a native (UCS2) AutoIt string?"
comment:5 Changed 4 years ago by Jpm
- Owner changed from Jon to Jpm
comment:6 Changed 8 months ago by Jpm
- Owner changed from Jpm to Jon
Hi,
I thought I might have a solution to propose to Jon,
but given jchd's remark I conclude that I do not.
So I will return the decision to Jon.
Certainly the doc is incomplete: it does not state that the conversion is to bytes, not to UCS2.