Opened 6 years ago

Last modified 2 years ago

#3731 assigned Bug

Binary() performs hidden and wrong conversion on strings

Reported by: jchd18
Owned by: Jon
Milestone:
Component: AutoIt
Version: 3.3.14.5
Severity: None
Keywords:
Cc:

Description

One would expect Binary(<string>) to return the binary image of <string>, but that is not at all what happens.
The string below contains the first five uppercase Latin letters, a space and the corresponding five Greek letters.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

In memory the string looks like this:

0041 0042 0043 0044 0045 0020 0391 0392 0393 0394 0395

and this is what one would expect from invoking Binary(), since AutoIt uses UCS2 (UTF16-LE limited to the Unicode BMP.)

Instead we get something completely unusable. First, the Greek letters Alpha, Beta, Delta and Epsilon appear as question marks (no equivalent in ASCII), but the letter Gamma surprisingly gets converted to ASCII G.

0x4142434445203F3F473F3F
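
For comparison, the expected byte image can be requested explicitly with StringToBinary() and its UTF16-LE flag. The following is only a minimal sketch, assuming the goal is to inspect the in-memory UCS2 bytes. The variable name $sTest is illustrative, and the Greek letters are built with ChrW() so the result does not depend on how the script file is encoded.

```autoit
; Build "ABCDE " followed by Greek capital Alpha..Epsilon (U+0391..U+0395).
; $sTest is a hypothetical name used only for this illustration.
Local $sTest = "ABCDE " & ChrW(0x0391) & ChrW(0x0392) & ChrW(0x0393) & ChrW(0x0394) & ChrW(0x0395)

; Flag 2 = UTF16 little endian: two bytes per character, low byte first.
ConsoleWrite("UTF16-LE: " & StringToBinary($sTest, 2) & @LF)

; Flag 1 = ANSI (the default of StringToBinary): one byte per character.
ConsoleWrite("ANSI:     " & StringToBinary($sTest, 1) & @LF)

; Binary() on the same string, for side-by-side comparison.
ConsoleWrite("Binary(): " & Binary($sTest) & @LF)
```

Whether the ANSI line matches the Binary() line on a given machine is something to check rather than assume, since the result may depend on the local codepage.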

Attachments (0)

Change History (6)

comment:1 by J-Paul Mesnage, 6 years ago

Certainly the doc is incomplete: it does not say that the conversion is to bytes rather than to UCS2.

comment:2 by jchd18, 6 years ago

The doc is indeed incomplete, but there are a number of very unexpected "conversions" elsewhere in the range > 0xFF (maybe even in the range [0x7F,0xFF] depending on local codepage), making Binary(<string>) deceptive.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

Local $c, $b, $u
For $i = 0x100 To 0xFFFF
	$c = ChrW($i)
	$b = Binary($c)
	If $b <> "0x3F" Then _U8ConsoleWrite(Hex($i, 4) & @TAB & $c & "    -->     " & @TAB & $b & @TAB & ChrW($b))
Next

; Unicode-aware ConsoleWrite (set console to UTF8 for decent result)
Func _U8ConsoleWrite($s)
	ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc   ;==>_U8ConsoleWrite

For instance, some codepoints are converted, but not all that could be, and not always correctly:

β (lowercase Greek beta) turned into ß (German eszett) ?!?!?
Γ -> G but γ (lowercase Greek gamma) isn't converted
Many codepoints are unexpectedly converted to control characters!

I'm not completely against attempts to convert, say, Ā to A in a distinct function, but at least this has to be clearly documented AND it's better to have it right and consistent (which is much, much harder than it looks). In any case, a function named Binary shouldn't emasculate anything and, OTOH, an attempt to map UCS2 > 0x7F to the local Windows codepage is doomed to fail.

All in all I doubt a simple approach can be really satisfactory. From this point of view, _StringToHex() [which produces hex of the string in UTF8] and StringToASCIIArray() [which returns an array of codepoints] are more robust.
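
As an illustration of the two alternatives just mentioned, here is a short sketch, assuming the standard String.au3 UDF is available for _StringToHex(). The variable names $sTest and $aCodes are illustrative only.

```autoit
#include <String.au3> ; provides _StringToHex()

; Same test string as in the description, built with ChrW() to avoid
; any dependency on the script file's encoding. Names are hypothetical.
Local $sTest = "ABCDE " & ChrW(0x0391) & ChrW(0x0392) & ChrW(0x0393) & ChrW(0x0394) & ChrW(0x0395)

; Hex dump of the string as produced by _StringToHex().
ConsoleWrite(_StringToHex($sTest) & @LF)

; One array element per character, holding its numeric code.
Local $aCodes = StringToASCIIArray($sTest)
For $i = 0 To UBound($aCodes) - 1
	ConsoleWrite(Hex($aCodes[$i], 4) & " ")
Next
ConsoleWrite(@LF)
```

Neither call goes through the lossy codepage mapping seen with Binary(), which is why they are more predictable for non-ASCII input.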

comment:3 by J-Paul Mesnage, 6 years ago

Owner: set to Jon
Status: new → assigned

I leave the final decision to Jon: change only the doc, or change the code to follow your recommendation.

comment:4 by jchd18, 6 years ago

Fine. The issue finally boils down to: "what should be the correct semantics of Binary when applied to a native (UCS2) AutoIt string?"

comment:5 by J-Paul Mesnage, 5 years ago

Owner: changed from Jon to J-Paul Mesnage

comment:6 by J-Paul Mesnage, 2 years ago

Owner: changed from J-Paul Mesnage to Jon

Hi,
I thought I might have a solution to propose to Jon,
but given jchd's remark I conclude that I do not.
So I will return the decision to Jon.
