UTF-8 Strings in AutoIt

AXLE · April 28, 2019

I am trying to find information on using UTF-8 strings in AutoIt. After searching extensively I cannot find anything conclusive on this topic. What I need to do is FileRead() into a string variable (or array) and keep the UTF-8 encoding. Some articles, and even the Help documentation on FileOpen(), suggest that current versions of AutoIt can read and store UTF-8 internally, but my test of reading a web page containing UTF-8 encoded characters into a variable fails.

Can AutoIt use strings encoded as UTF-8, and if so, how?

If not, does anyone know of a UDF, or a C/Win-API routine, that would allow a UTF-8 array to be used in AutoIt?

What does AutoIt use internally for strings? Is it converting the UTF-8 file to a UCS-2 string in the variable?

The following is an example which fails for me.

;UTF-8 Tests
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>
#include <WinAPIFiles.au3>

;https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
;Also all checked in Notepad++ UTF-8 Encoding (Many Characters are scrambled)
Local $sFile1 = "UTF-8 test file.htm"; 414 Lines | 76,412 characters. "UTF-8 test file.htm" = "/UTF-8-demo.html"
Local $sFile2 = "test2.html"

Local $hfile1 = FileOpen($sFile1, BitOr($FO_READ, $FO_UTF8_NOBOM))
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileOpen1", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

Local $sAm_I_UFT_8 = FileRead($hfile1, -1);Does not appear to read UTF-8 characters correctly from the "UTF-8 test file.htm"
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileRead", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

FileClose($hfile1)

Local $sAm_I_Still_UTF_8 = $sAm_I_UFT_8 ;Are these two strings stored internally as UTF-8 ?
If @error Then
    MsgBox($MB_SYSTEMMODAL, "String=String", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

Local $iStrLen1 = StringLen($sAm_I_UFT_8)
Local $iStrLen2 = StringLen($sAm_I_Still_UTF_8)
MsgBox($MB_SYSTEMMODAL, "String Length of $sAm_I_UFT_8", $iStrLen1); 414 Lines | 70,174 characters
MsgBox($MB_SYSTEMMODAL, "String Length of $sAm_I_Still_UTF_8", $iStrLen2); 414 Lines | 70,174 characters

Local $hfile2 = FileOpen($sFile2, BitOR($FO_OVERWRITE, $FO_BINARY))
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileOpen2", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

FileWrite($hfile2, $sAm_I_Still_UTF_8) ;If $sAm_I_Still_UTF_8 is actual UTF-8 it should be an exact copy of the original "UTF-8 test file.htm"
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileWrite", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf
FileClose($hfile2)

Edited April 28, 2019 by AXLE
Additional information

argumentum · April 28, 2019

The answer to your question is yes, AutoIt can use UTF-8 ( https://www.autoitscript.com/autoit3/docs/intro/unicode.htm )

Look inside AutoIt3Wrapper.au3, and look for $UTFtype.
That may help.

AXLE · April 28, 2019

Thanks argumentum. The above link seems to refer more to the encoding of the script itself than to internal "types". There is much suggestion that UTF types will be detected automatically by FileOpen(), FileRead() etc., but I can't confirm any of this at the moment. The example above shows the file being loaded into the variable with some other encoding whose character count does not match that of the original UTF-8 test file. Also, I can't make sense of how to use what appear to be pre-compiler directives ("Look inside AutoIt3Wrapper.au3, and look for $UTFtype") within my script. Is there any information or documentation on forcing the use of UTF-8 variable types in my scripts?

Any further assistance will be appreciated.

Axle

jchd · April 28, 2019

Native AutoIt strings use UCS2, i.e. a subset of UTF16-LE restricted to the BMP.

AutoIt File* functions can detect (read) or be forced to write UTF8 files, depending on options. The resulting data read will be UCS2 encoded (except if reading binary of course).

8 hours ago, AXLE said:

;Does not appear to read UTF-8 characters correctly from the "UTF-8 test file.htm"

I suspect it does read UTF8 correctly.
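One way to check this is a small round trip: write known UTF-8 bytes to a file, read them back as text, and compare. The following is a minimal sketch rather than anything posted in the thread; the temp file name and the sample text are assumptions chosen for illustration.

```autoit
#include <FileConstants.au3>
#include <StringConstants.au3>

; Hypothetical file name, used only for this demonstration.
Local $sPath = @TempDir & "\utf8_roundtrip_demo.txt"

; Sample text built with ChrW() so it does not depend on the script's own encoding:
; "caf" + U+00E9 (e acute) + space + U+0416 (Cyrillic Zhe).
Local $sOriginal = "caf" & ChrW(0x00E9) & " " & ChrW(0x0416)

; Write the raw UTF-8 bytes in binary mode, so nothing is converted on output.
Local $hFile = FileOpen($sPath, BitOR($FO_OVERWRITE, $FO_BINARY))
If $hFile = -1 Then Exit 1
FileWrite($hFile, StringToBinary($sOriginal, $SB_UTF8))
FileClose($hFile)

; Read the file back as UTF-8 text; the result is a native AutoIt string.
$hFile = FileOpen($sPath, BitOR($FO_READ, $FO_UTF8_NOBOM))
If $hFile = -1 Then Exit 1
Local $sRead = FileRead($hFile)
FileClose($hFile)

ConsoleWrite("File size in bytes:   " & FileGetSize($sPath) & @CRLF)
ConsoleWrite("StringLen after read: " & StringLen($sRead) & @CRLF)
ConsoleWrite("Same text as written: " & ($sRead == $sOriginal) & @CRLF)

FileDelete($sPath)
```

The sample text occupies 8 bytes as UTF-8 (the two non-ASCII characters take two bytes each) but only 6 UTF-16 code units, so the two counts are expected to differ even when the text has been read correctly.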

1 hour ago, AXLE said:

As with the above example it shows that the file is being loaded into the variable with some other type of encoding that is not a character count equivalent of the original UTF-8 test file.

A single codepoint can use from 1 to 4 bytes in UTF8, whereas it consists of only one 16-bit word in UCS-2 and 1 to 2 16-bit words in UTF16. Hence it is no surprise that, in general, [UTF8 file byte count] ≠ [UCS2 codepoints].
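The difference between the two counts can be seen per character. This is a small sketch, assuming three BMP code points picked purely for illustration; it uses the built-in StringToBinary() with the $SB_UTF8 flag to obtain the UTF-8 bytes of each one.

```autoit
#include <StringConstants.au3>

; Code points chosen for illustration: "A", e acute, euro sign (all inside the BMP).
Local $aCodepoints[3] = [0x0041, 0x00E9, 0x20AC]

For $i = 0 To UBound($aCodepoints) - 1
    ; One native (16-bit) character per code point.
    Local $sChar = ChrW($aCodepoints[$i])
    ; The same character encoded as UTF-8 bytes.
    Local $bUtf8 = StringToBinary($sChar, $SB_UTF8)
    ConsoleWrite("U+" & Hex($aCodepoints[$i], 4) & _
            ": StringLen = " & StringLen($sChar) & _
            ", UTF-8 bytes = " & BinaryLen($bUtf8) & _
            " (" & String($bUtf8) & ")" & @CRLF)
Next
```

Each of these characters is a single code unit in the native string, while its UTF-8 form takes one, two and three bytes respectively.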

During AutoIt's internal UTF-8 to UCS2 conversion, codepoints above the BMP are emasculated since they would need an extra 16-bit word to be represented in UTF-16. Said otherwise, AutoIt doesn't recognize and handle UTF16 surrogates. This may be a serious problem for people whose script (= writing system) lies in one of the growing number of planes/blocks beyond the BMP (SMP, SIP, SSP, private planes). Yet the BMP can encode a large number of scripts: https://en.wikipedia.org/wiki/Unicode_block

That said, AutoIt offers ways to convert between any two of {UCS2, UTF8, ANSI (any codepage), Windows (any codepage), OEM, double-byte and any codepage supported by your Windows version}. It is also possible to build strings in "beyond BMP" UTF16-LE (manually or programmatically), which Windows or other Unicode-aware applications will handle gracefully, provided the appropriate font(s) is used. But keep in mind that most AutoIt string functions won't handle UTF16 surrogates correctly.
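Building a "beyond BMP" string programmatically comes down to the standard UTF-16 surrogate arithmetic. The helper below is a hypothetical sketch (the name _CodepointToUtf16 is not an AutoIt or UDF function), and U+1F600 is just an example code point.

```autoit
; Hypothetical helper: return the UTF-16 representation of a Unicode code point
; as a native AutoIt string (one code unit inside the BMP, two above it).
Func _CodepointToUtf16($iCodepoint)
    If $iCodepoint < 0x10000 Then Return ChrW($iCodepoint)
    Local $iOffset = $iCodepoint - 0x10000
    Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; upper 10 bits -> high surrogate
    Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF) ; lower 10 bits -> low surrogate
    Return ChrW($iHigh) & ChrW($iLow)
EndFunc   ;==>_CodepointToUtf16

; U+1F600 corresponds to the surrogate pair D83D DE00.
Local $sBeyondBmp = _CodepointToUtf16(0x1F600)

ConsoleWrite("StringLen: " & StringLen($sBeyondBmp) & @CRLF)
ConsoleWrite("Code units: " & Hex(AscW(StringMid($sBeyondBmp, 1, 1)), 4) & " " & _
        Hex(AscW(StringMid($sBeyondBmp, 2, 1)), 4) & @CRLF)
```

StringLen() counts 16-bit code units, so a single character above the BMP is counted as two. That is the kind of mismatch to expect from string functions that are not surrogate-aware.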

This site offers a load of information, examples, data, applets about Unicode: https://r12a.github.io/scripts/tutorial/part3 (also check docs and apps links.)

Don't hesitate to post if you encounter any encoding issue.

NOTE about Unicode planes:
BMP = Basic Multilingual Plane
SMP = Supplementary Multilingual Plane
SIP = Supplementary Ideographic Plane
SSP = Supplementary Special-purpose Plane

Edited April 28, 2019 by jchd

AXLE · April 28, 2019

Thank you very much for your excellent reply, jchd. You confirmed what I had originally believed about AutoIt's internal types: all AutoIt strings are UCS-2 (the MS variant of UTF-16), and UTF-8 documents are read and converted to AutoIt's internal type. The documentation at https://www.autoitscript.com/autoit3/docs/intro/unicode.htm isn't overly clear on this and had me a little confused, as I thought maybe AutoIt had introduced native support for UTF-8 types.

I can quite likely achieve what I need with UCS-2, keeping in mind possible conversion "gotchas". Alternatively, I'll have a go at creating a native C UTF-8 string DLL library/UDF based on something like ICU or another light UTF-8 header library.

Thanks for the assistance, it is very much appreciated.

Axle

jchd · April 28, 2019

ICU is a really huge hog and doesn't easily solve all the issues that arise in practice with Unicode.

Out of mere curiosity, why do you think you need a set of functions for UTF8?

As far as I can think, this is more or less all you need:

Func _CodepageToString($sCP, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
    Local $tText = DllStructCreate("byte[" & StringLen($sCP) & "]")
    DllStructSetData($tText, 1, $sCP)
    Local $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "ptr", 0, "int", 0)
    Local $tWstr = DllStructCreate("wchar[" & $aResult[0] & "]")
    $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "struct*", $tWstr, "int", $aResult[0])
    Return DllStructGetData($tWstr, 1)
EndFunc   ;==>_CodepageToString

Func _StringToCodepage($sStr, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

Supply 65001 as the codepage to convert between UTF8 and native strings. It's trivial to change the default codepage to UTF8 instead of OEM.

Convert your UTF8 or codepage input data to native strings, process and massage them ad nauseam, convert them if necessary to the output codepage and you're done.
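For the UTF-8 case specifically, the same convert, process, convert-back workflow can also be sketched with the built-in BinaryToString() and StringToBinary() functions and the $SB_UTF8 flag. This is offered as an alternative illustration, not as a replacement for the functions above; the input bytes are a made-up sample.

```autoit
#include <StringConstants.au3>

; Sample input: the UTF-8 bytes of "naive" with a diaeresis on the i (C3 AF).
Local $bInput = Binary("0x6E61C3AF7665")

; 1. Convert the UTF-8 bytes to a native AutoIt string.
Local $sNative = BinaryToString($bInput, $SB_UTF8)

; 2. Process the text with ordinary string functions.
Local $sProcessed = StringUpper($sNative)

; 3. Convert the result back to UTF-8 bytes for output.
Local $bOutput = StringToBinary($sProcessed, $SB_UTF8)

ConsoleWrite("Input bytes:  " & String($bInput) & @CRLF)
ConsoleWrite("Native chars: " & StringLen($sNative) & @CRLF)
ConsoleWrite("Output bytes: " & String($bOutput) & @CRLF)
```

The DllCall-based functions remain the more general tool, since they accept any codepage and not only UTF-8.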

Edited April 28, 2019 by jchd

AXLE · April 28, 2019

Hi jchd, ICU is huge, I was looking at a few single header libraries like https://github.com/sheredom/utf8.h Just for C type width and string manipulation, maybe wrap it up in a dll for convenience. Just thinking outside the box for a moment (future projects etc). My Unicode knowledge is still in a learning phase.

For now I just need to do some inline Base64 data URIs for html pages and some direct image to b64 conversions. Most of this will be as Binary and ANSI 7bit anyways, so the code page shouldn't really matter. Main thing was just confirming that AutoIt still uses UCS-2 as its internal type so I can test and check for type conversion "Gotchas" along the way. Do enough of these codepage conversions and almost certain mojibake will happen sooner or later lol. Would rather it sooner so I can correct it :)
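A Base64 data URI of this kind could be produced along the following lines. This is a minimal sketch and not the modified CryptBinaryToString UDF mentioned later in the thread: the helper name _Base64Encode is hypothetical, the MIME type is a placeholder, and it assumes the Win32 API CryptBinaryToStringW in crypt32.dll with the CRYPT_STRING_BASE64 and CRYPT_STRING_NOCRLF flags.

```autoit
; Hypothetical helper: Base64-encode binary data via CryptBinaryToStringW.
; Returns "" and sets @error on failure.
Func _Base64Encode($bData)
    Local Const $CRYPT_STRING_BASE64 = 0x00000001
    Local Const $CRYPT_STRING_NOCRLF = 0x40000000
    Local $iFlags = BitOR($CRYPT_STRING_BASE64, $CRYPT_STRING_NOCRLF)

    Local $iLen = BinaryLen($bData)
    If $iLen = 0 Then Return ""
    Local $tIn = DllStructCreate("byte[" & $iLen & "]")
    DllStructSetData($tIn, 1, $bData)

    ; First call: ask for the required output size in characters.
    Local $aCall = DllCall("crypt32.dll", "bool", "CryptBinaryToStringW", "struct*", $tIn, _
            "dword", $iLen, "dword", $iFlags, "ptr", 0, "dword*", 0)
    If @error Or Not $aCall[0] Then Return SetError(1, 0, "")

    ; Second call: fill a buffer of that size with the Base64 text.
    Local $tOut = DllStructCreate("wchar[" & $aCall[5] & "]")
    $aCall = DllCall("crypt32.dll", "bool", "CryptBinaryToStringW", "struct*", $tIn, _
            "dword", $iLen, "dword", $iFlags, "struct*", $tOut, "dword*", $aCall[5])
    If @error Or Not $aCall[0] Then Return SetError(2, 0, "")
    Return DllStructGetData($tOut, 1)
EndFunc   ;==>_Base64Encode

; Three ASCII bytes standing in for real image data.
Local $bPayload = StringToBinary("Man")
Local $sUri = "data:application/octet-stream;base64," & _Base64Encode($bPayload)
ConsoleWrite($sUri & @CRLF)
```

Because the input is handled as binary and the Base64 alphabet is plain ASCII, no codepage conversion is involved in this step. The three bytes of "Man" correspond to "TWFu" in Base64, which makes a convenient sanity check.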

Also, thank you for the information above. From my research I was of the belief that the 65001 codepage is only available in Windows 10, and only to some console and internal functions prior to W10. It would be nice if everything was just UTF-8 or byte code.

jchd · April 29, 2019

11 hours ago, AXLE said:

From my research I was of the belief that 65001 codepage is only available in windows 10, and to some console and internal functions prior to W10.

Microsoft inaugurated Unicode support in Windows with an upgrade to Win 9x and was one of the very first large software companies to do so.
With Win NT, system calls used UCS-2, and with Win 2000 and up the encoding settled on UTF16-LE.

What's new with Win 10 is indeed that you can select the local codepage to be 65001 (UTF-8) system-wide, not only for the DOS console (CHCP) and for use in conversion functions (as in the code I posted above). That only changes the behavior of system calls that are explicitly ANSI, ending in *A, which then consider the byte string as UTF8 data. The encoding used in all other primitives is UTF16-LE and will remain such, until UTF32 becomes a good incentive to sell more memory and storage (just guessing here).

In short: apps designed to run on 99.5% of today's PCs should use the conversion functions above for converting codepage input and output when required, but everything else (main code) remains UCS2 (BMP of Unicode).

You'll find a number of Base-64 related posts when searching here.

If you find a use for that, note that the regexp support functions (PCRE1) accept the (*UCP) switch (see StringRegExp help).

Local $String = "Sample simple english text 한국어    텍스트의 예 טקסט עברית ירושלים русский образец អត្ថបទថៃ"
Local $aLang = StringRegExp($String, "(*UCP)(\p{Hangul}+(?:\s+\p{Hangul}+)*)", 1)
If not @error Then MsgBox(64, "Korean text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Khmer}+(?:\s+\p{Khmer}+)*)", 1)
If not @error Then MsgBox(64, "Khmer text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Latin}+(?:\s+\p{Latin}+)*)", 1)
If not @error Then MsgBox(64, "Latin text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Hebrew}+(?:\s+\p{Hebrew}+)*)", 1)
If not @error Then MsgBox(64, "Hebrew text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Cyrillic}+(?:\s+\p{Cyrillic}+)*)", 1)
If not @error Then MsgBox(64, "Cyrillic text found", $aLang[0])

AXLE · April 29, 2019

Awesome 😃 thanks for the information jchd. I think I have it all nailed atm. I am using a modded version of CryptBinaryToString by trancexx (Added an extra parameter flag for String|Binary mode). The Binary B64 Enc/Dec on images is byte perfect, and the W3C UTF-8 test page is converting byte perfect, so... so far all good

As far as UTF-8 text manipulation Dll/UDF goes, I've added it to my whiteboard along with the many other projects that I will get to as time permits. I have a few large tertiary assessment modules coming up on web programming, so maybe I will slip it in amidst that.

qwert · April 29, 2019

@AXLE: I haven't followed this thread in detail, but I want to mention a couple of things that I'm reminded of whenever I see discussions about UTF/encoding in AU3.

First, make sure you confirm the character encoding of your script, itself. I recall fighting all kinds of problems when there was a mismatch in strings declared in the script and the encoding of the file it was processing. Once I got everything on the same page, things got a lot easier.

Second, get yourself a copy of XVI32 so you can quickly check the encoding of individual files. Again, I recall chasing ghosts when the file wasn't actually what I was declaring it as in the Read/Write statements.
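Both checks can also be done quickly from a script. The sketch below assumes the built-in FileGetEncoding() function and the $FO_* constants from FileConstants.au3; it inspects the running script itself, so substitute any other path as needed.

```autoit
#include <FileConstants.au3>

; Check the running script itself; replace with any file path of interest.
Local $sPath = @ScriptFullPath

; Ask AutoIt which text encoding it detects for the file.
Local $iEncoding = FileGetEncoding($sPath)
Local $sName
Switch $iEncoding
    Case $FO_UTF8
        $sName = "UTF-8 with BOM"
    Case $FO_UTF8_NOBOM
        $sName = "UTF-8 without BOM"
    Case $FO_UTF16_LE
        $sName = "UTF-16 LE"
    Case $FO_ANSI
        $sName = "ANSI"
    Case Else
        $sName = "other or unknown (" & $iEncoding & ")"
EndSwitch
ConsoleWrite("Detected encoding: " & $sName & @CRLF)

; Dump the first bytes in hex to look for a byte order mark:
; EF BB BF is the UTF-8 BOM, FF FE the UTF-16 LE BOM.
Local $hFile = FileOpen($sPath, BitOR($FO_READ, $FO_BINARY))
If $hFile <> -1 Then
    ConsoleWrite("First bytes: " & String(FileRead($hFile, 4)) & @CRLF)
    FileClose($hFile)
EndIf
```

A hex editor still shows more, but this is enough to catch a script or data file that is not in the encoding declared in the FileOpen() flags.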

Hope these help.


AXLE · April 30, 2019

Yeah, I try to keep all coding in ANSI or UTF-8. I use both XVI32 and HxD (I prefer HxD most times). Even the Hex Editor plugin for Notepad++ is OK for a quick check on the fly.

At the moment all my conversions (B64 Enc/Dec), in both binary and text, are coming out byte perfect with no mojibake. Text conversion tests are based on the W3C UTF-8 test page (https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html) and all good so far. Thanks for the pointers just the same.
