The problem of internal representation of characters has been plaguing
the computer industry since IT became widespread.
Initially every company used its own conventions and tables to represent
text and symbols, making interoperability a nightmare. The growing demand
for support of more symbols, control characters and non-Latin scripts made
the situation even worse.
Character sets and their possible encodings resemble playing cards: tarot
and poker don't use the same set of cards. Next, once a set is chosen, one
must create a design (a representation or encoding) for each card so that
every player recognizes them instantly.
Today, all character sets fall into two families: Unicode and codepages.
The question of the representation of strings in memory or files using a given character set arose when IT started to use non-simple codepages.
Native AutoIt strings use the UCS-2 character set and encoding. It is the
subset of Unicode limited to the BMP (Basic Multilingual Plane), the first
64k Unicode codepoints. This encoding uses 16-bit code units (each
character is represented by an unsigned short value) where codepoints in
range U+D800..U+DFFF (surrogates in UTF-16) are not treated specially:
they are handled as ordinary 16-bit values.
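A small sketch of what this means in practice, assuming a character outside the BMP is stored as its two UTF-16 surrogate values (the variable name is illustrative only). The string is expected to hold two units, each readable on its own:

```autoit
; Sketch: AutoIt counts 16-bit units, not Unicode codepoints.
; U+1F600 lies outside the BMP; in UTF-16 it is the pair D83D DE00.
Local $sOutsideBMP = ChrW(0xD83D) & ChrW(0xDE00)

; Number of 16-bit units held by the string
ConsoleWrite("Units: " & StringLen($sOutsideBMP) & @LF)

; Each unit can be read back individually, like any other character
For $i = 1 To StringLen($sOutsideBMP)
    ConsoleWrite("Unit " & $i & ": 0x" & Hex(AscW(StringMid($sOutsideBMP, $i, 1)), 4) & @LF)
Next
```

String functions that cut or index by position (StringMid, StringLeft and so on) can therefore split such a pair in the middle.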
Note that Windows has been handling Unicode for a very long time: Win 3.x,
Win95 and NT added a DLL to handle UCS-2; XP and later handle UTF-16LE.
However, some applications need to process strings using other encodings.
Converting between a codepage and native UCS-2 AutoIt strings
You can use the functions below to perform the desired conversion. Codepage
identifier 65001 means UTF8 but you can pass any identifier supported by
Windows.
A list of codepages supported by Windows can be found here:
https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
; To convert a native AutoIt string (UCS-2) to some codepage (by default UTF8):
Func _StringToCodepage($sStr, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = 65001
    ; Nothing to convert: avoid creating a zero-length buffer
    If StringLen($sStr) = 0 Then Return ""
    ; First call: query the required buffer size in bytes
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    ; Second call: convert into the buffer
    $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc ;==>_StringToCodepage
; To convert a string from a given codepage (by default UTF8) to a native AutoIt string (UCS-2):
Func _CodepageToString($sCP, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = 65001
    ; Nothing to convert: avoid creating a zero-length buffer
    If StringLen($sCP) = 0 Then Return ""
    ; Copy the codepage string into a byte buffer
    Local $tText = DllStructCreate("byte[" & StringLen($sCP) & "]")
    DllStructSetData($tText, 1, $sCP)
    ; First call: query the required buffer size in 16-bit characters
    Local $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "ptr", 0, "int", 0)
    Local $tWstr = DllStructCreate("wchar[" & $aResult[0] & "]")
    ; Second call: convert into the buffer
    $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "struct*", $tWstr, "int", $aResult[0])
    Return DllStructGetData($tWstr, 1)
EndFunc ;==>_CodepageToString
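Both functions above rely on the same two-step pattern: a first API call with a null output buffer returns the required size, and a second call performs the conversion. The sketch below isolates the first step. The helper name _CodepageByteCount is hypothetical (it is not part of AutoIt or of the functions above) and error handling is kept minimal:

```autoit
; Hypothetical helper: number of bytes needed to store $sStr in codepage $iCodepage.
; Returns 0 and sets @error if the size query fails.
Func _CodepageByteCount($sStr, $iCodepage = 65001)
    If StringLen($sStr) = 0 Then Return 0
    ; Null output buffer and size 0: the API only reports the required size
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, _
            "wstr", $sStr, "int", StringLen($sStr), "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    If @error Or $aResult[0] = 0 Then Return SetError(1, 0, 0)
    Return $aResult[0]
EndFunc ;==>_CodepageByteCount

; Usage: the euro sign is one 16-bit unit in a native AutoIt string
Local $sEuro = ChrW(0x20AC)
ConsoleWrite("UTF-8 (65001): " & _CodepageByteCount($sEuro) & " byte(s)" & @LF)
ConsoleWrite("Windows-1252:  " & _CodepageByteCount($sEuro, 1252) & " byte(s)" & @LF)
```

The same character needs three bytes in UTF-8 (E2 82 AC) but a single byte (80) in codepage 1252, which is why the size has to be queried per codepage rather than derived from StringLen.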
If you only need to convert native AutoIt strings to/from UTF8 (a very common use), you can use this:
$sMyString = "Hello Χαίρετε こんにちは Привет xin chào हैलो مرحبا 你好 שלום வணக்கம்"
$sUTF8String = BinaryToString(StringToBinary($sMyString, 4), 1)
; reverse conversion:
$sMyStringBack = BinaryToString(StringToBinary($sUTF8String, 1), 4)
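To see what the UTF8 conversion actually produces, it can help to look at the raw bytes. This is a minimal sketch with illustrative variable names; flag 4 selects UTF8 in both StringToBinary and BinaryToString:

```autoit
; Sketch: inspect the UTF-8 bytes of a single character (the euro sign, U+20AC)
Local $sEuro = ChrW(0x20AC)

; Encode the native string as UTF-8 binary data
Local $dUTF8 = StringToBinary($sEuro, 4)
ConsoleWrite("Bytes:  " & String($dUTF8) & @LF)
ConsoleWrite("Length: " & BinaryLen($dUTF8) & @LF)

; Decode the binary data back to a native string and compare (case-sensitive)
Local $sBack = BinaryToString($dUTF8, 4)
ConsoleWrite("Round trip identical: " & ($sBack == $sEuro) & @LF)
```

One 16-bit unit in the native string becomes three bytes in UTF8 here (E2 82 AC), so byte counts and StringLen generally differ.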
It is a good idea to use the default UTF8 encoding for your source files:
your strings will display verbatim in both your source code and in Windows
controls.
It is also a good idea to set the SciTE4AutoIt3 console to UTF8 if you
ever need to display characters or symbols not found in your default
Windows codepage.
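One common way to do this, assuming a standard SciTE4AutoIt3 installation, is to add the following setting to the user options file (SciTEUser.properties, which SciTE opens from Options > Open User Options File). The exact file location may differ on your system:

```properties
# Display the output pane (console) as UTF-8
output.code.page=65001
```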
To send UTF8 strings to the SciTE console, you can use this function:
; Unicode-aware ConsoleWrite for UTF8 SciTE console
Func _ConsoleWrite($s)
    ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc ;==>_ConsoleWrite
In addition, if your program may use either the compiled CUI interface or the uncompiled SciTE console (e.g. for debugging), you can use this:
; Indirect Unicode-aware function for UTF8 SciTE or CUI consolewrite
Func __ConsoleWrite($s)
    Return (@Compiled ? _CUI_ConsoleWrite : _ConsoleWrite)($s)
EndFunc ;==>__ConsoleWrite
; Function for UTF16 CUI consolewrite
Func _CUI_ConsoleWrite(ByRef $s)
    ; The console handle is obtained once, on the first call
    Local Static $hCon = _CUI_ConsoleInit()
    ; Length is given in characters: the string plus the trailing @LF
    DllCall("kernel32.dll", "bool", "WriteConsoleW", "handle", $hCon, "wstr", $s & @LF, "dword", StringLen($s) + 1, "dword*", 0, "ptr", 0)
    Return
EndFunc ;==>_CUI_ConsoleWrite
; Helper function for CUI consolewrite
Func _CUI_ConsoleInit()
    DllCall("kernel32.dll", "bool", "AllocConsole")
    ; -11 is STD_OUTPUT_HANDLE
    Return DllCall("kernel32.dll", "handle", "GetStdHandle", "int", -11)[0]
EndFunc ;==>_CUI_ConsoleInit
For instance, run this code sample with the above functions. The greeting should display correctly in every language, identically in the MsgBox and in the console (SciTE or CUI):
$sMyString = "Hello Χαίρετε こんにちは Привет xin chào हैलो مرحبا 你好 שלום வணக்கம்"
__ConsoleWrite($sMyString)
MsgBox(0, "", $sMyString)