How do I convert strings between ANSI and UTF-8 / 16

jchd · April 15, 2009

Hello group,

I've been trying almost everything possible to convert strings between ANSI (I personally use the Latin-1 codepage) and UTF-8 or UTF-16, but I've had no real success so far.

I need this because I have to deal with a pure ANSI database using an ODBC layer, a complex GUI interface and two separate SQLite3 (v3.6.13) databases (one utf-8 and one utf-16). Given that this will be routinely used with a significant dataflow volume, I'd like to know which would be the most practical (and if possible efficient) way.

I may be misunderstanding obvious things, but it seems to me that many standard UDFs and built-in functions still work in ANSI mode. Do simple GUI controls (say, InputBoxes) deliver ANSI or UTF-8 strings? What data format does the SQLite3 interface expect?

In the same vein, why are UTF-16 (improperly called Unicode) strings passed to and from DLL calls through obscure structures instead of the wstr type? Aren't wstrs first-class citizens?

Another question haunting me: how can I hex-dump a string without going into any conversion? StringToBinary is not an option since it forces you to declare which format the string is using, which is precisely what I need to know!
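
One way to look inside a string without naming a source encoding is to print the numeric value of each character. The following is only a minimal sketch: it assumes a Unicode build of AutoIt, where native strings are held as UTF-16, and `_DumpCodeUnits` is a hypothetical helper name, not part of any standard UDF.

```autoit
; Hypothetical helper: list the UTF-16 code unit of each character as 4 hex digits.
Func _DumpCodeUnits($sText)
    Local $sOut = ""
    For $i = 1 To StringLen($sText)
        ; AscW returns the code unit of the first character of its argument.
        $sOut &= Hex(AscW(StringMid($sText, $i, 1)), 4) & " "
    Next
    Return StringStripWS($sOut, 2) ; 2 = strip trailing whitespace
EndFunc   ;==>_DumpCodeUnits

; ChrW builds the accented character from its code point, so the result does not
; depend on the encoding in which the script file is saved.
ConsoleWrite(_DumpCodeUnits("Caf" & ChrW(0x00E9)) & @CRLF) ; 0043 0061 0066 00E9
```

If a value such as 00C3 followed by 00A9 shows up where a single accented letter was expected, the string most likely holds UTF-8 bytes that were read as ANSI.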

I really can't understand how all this is supposed to be used in simple or more complex developments!

I apologize for asking that much, but it's a consequence of me wasting ___way___ too much time trying to solve these issues.

Warm thanks in advance for any help.

jchd · April 16, 2009

I'm bumping this post in the hope that someone can help.

WideBoyDixon · April 16, 2009

Well, I've been playing around with _WinAPI_MultiByteToWideChar() and _WinAPI_WideCharToMultiByte() but try as I might I couldn't get them to work. I've extracted both functions from the WinAPI.au3 file and tweaked them a little to get them working. Below is an example of converting from ANSI to UTF-8. For UTF-16, you'd be as well taking the result from _WBD_WinAPI_MultiByteToWideChar() and re-using it in your function calls with DllStructGetPtr(). I hope this makes sense.

MsgBox(64, "UTF-8", _ConvertAnsiToUtf8("Café á ©®"), 5)

Exit

Func _ConvertAnsiToUtf8($sText)
    Local $tUnicode = _WBD_WinAPI_MultiByteToWideChar($sText)
    If @error Then Return SetError(@error, 0, "")
    Local $sUtf8 = _WBD_WinAPI_WideCharToMultiByte(DllStructGetPtr($tUnicode), 65001)
    If @error Then Return SetError(@error, 0, "")
    Return SetError(0, 0, $sUtf8)
EndFunc   ;==>_ConvertAnsiToUtf8

Func _WBD_WinAPI_MultiByteToWideChar($sText, $iCodePage = 0, $iFlags = 0)
    Local $iText, $pText, $tText

    ; Reserve one wchar per input character plus the terminating null.
    $iText = StringLen($sText) + 1
    $tText = DllStructCreate("wchar[" & $iText & "]")
    $pText = DllStructGetPtr($tText)
    ; Parameter types follow the Win32 prototype: UINT, DWORD, LPCSTR, int, LPWSTR, int.
    DllCall("Kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodePage, "dword", $iFlags, "str", $sText, "int", $iText, "ptr", $pText, "int", $iText)
    If @error Then Return SetError(@error, 0, $tText)
    Return $tText
EndFunc   ;==>_WBD_WinAPI_MultiByteToWideChar

Func _WBD_WinAPI_WideCharToMultiByte($pUnicode, $iCodePage = 0)
    Local $aResult, $tText, $pText

    ; First call: with no output buffer, the API returns the required size in bytes
    ; (including the terminating null, because the source length is -1).
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodePage, "dword", 0, "ptr", $pUnicode, "int", -1, "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    If @error Then Return SetError(@error, 0, "")
    $tText = DllStructCreate("char[" & ($aResult[0] + 1) & "]")
    $pText = DllStructGetPtr($tText)
    ; Second call: do the conversion. The last two arguments are pointers and must be null for UTF-8 (65001).
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodePage, "dword", 0, "ptr", $pUnicode, "int", -1, "ptr", $pText, "int", $aResult[0], "ptr", 0, "ptr", 0)
    If @error Then Return SetError(@error, 0, "")
    ; Reading the char[] field returns the converted bytes as an ANSI string.
    Return DllStructGetData($tText, 1)
EndFunc   ;==>_WBD_WinAPI_WideCharToMultiByte

WBD

PsaltyDS · April 17, 2009

There is no one-to-one relationship between the two, so "converting" is not involved unless you limit to a small subset of Unicode values that happen to have ANSI analogues. For all the rest, it will be a matter of "translating" (as in Arabic to English), not "converting" (as in Decimal to Hex).

How would you convert ü to ANSI? What if there is a mix of French accents, Hebrew characters, and Greek scientific constants?

^_^

jchd · April 17, 2009

Hi PsaltyDS, nice to see you on board!

I'm in no way confusing conversion with translation. Indeed, ü belongs both to Unicode (of course it does!) and to the Latin-1 ANSI codepage.

We have the right to expect a conversion between some codepage (say, Latin-1) and Unicode. In this precise case, Latin-1 ü has the hex representation FC, while its Unicode code point is U+00FC, whose UTF-8 representation is C3 BC. Such a one-to-one, bidirectional conversion is obviously limited to the subset Unicode ∩ ANSI codepage = ANSI codepage (since Unicode is the full code universe).
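
The byte values quoted above can be checked with the built-in StringToBinary and BinaryToString functions. This is only a sketch, assuming a Unicode build of AutoIt where native strings are UTF-16; the ANSI line depends on the system codepage and gives FC only on a Latin-1 / Windows-1252 machine.

```autoit
; ChrW(0x00FC) builds ü from its code point, so the result does not depend on
; the encoding in which the script file is saved.
Local $sChar = ChrW(0x00FC)

ConsoleWrite("UTF-8:    " & StringToBinary($sChar, 4) & @CRLF) ; flag 4 = UTF-8, gives 0xC3BC
ConsoleWrite("UTF-16LE: " & StringToBinary($sChar, 2) & @CRLF) ; flag 2 = UTF-16 little endian, gives 0xFC00
ConsoleWrite("ANSI:     " & StringToBinary($sChar, 1) & @CRLF) ; flag 1 = current ANSI codepage

; Decoding the UTF-8 bytes gives back the same character.
Local $sBack = BinaryToString(Binary("0xC3BC"), 4)
ConsoleWrite("Round trip OK: " & ($sBack == $sChar) & @CRLF)
```

The same two functions work on whole strings, so a Latin-1 to UTF-8 conversion can be written as a decode with flag 1 followed by an encode with flag 4, provided the machine's ANSI codepage is the one the data was written in.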

When converting the other way round, the generally accepted practice is to substitute a placeholder for any Unicode input character that has no code in the working ANSI codepage.
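
The placeholder behaviour can be observed by calling WideCharToMultiByte directly and reading back its lpUsedDefaultChar flag. This is a rough sketch, not a recommended wrapper: the code page 1252 and the 8-byte buffer are arbitrary choices for illustration, and the placeholder is whatever default character the code page defines (normally a question mark).

```autoit
; Convert "a" + Hebrew alef (U+05D0) to Windows-1252, which has no code for alef.
Local $sText = "a" & ChrW(0x05D0)
Local $tOut = DllStructCreate("byte[8]")
Local $aRet = DllCall("kernel32.dll", "int", "WideCharToMultiByte", _
        "uint", 1252, "dword", 0, "wstr", $sText, "int", -1, _
        "ptr", DllStructGetPtr($tOut), "int", DllStructGetSize($tOut), _
        "ptr", 0, "int*", 0)
If @error Then
    ConsoleWrite("DllCall failed, @error = " & @error & @CRLF)
ElseIf $aRet[0] = 0 Then
    ConsoleWrite("Conversion failed" & @CRLF)
Else
    ; $aRet[0] counts the bytes written, including the terminating null.
    ConsoleWrite("Bytes: " & BinaryMid(DllStructGetData($tOut, 1), 1, $aRet[0] - 1) & @CRLF)
    ; $aRet[8] is the lpUsedDefaultChar flag: non-zero if a placeholder was substituted.
    ConsoleWrite("Placeholder used: " & ($aRet[8] <> 0) & @CRLF)
EndIf
```

Checking the flag rather than searching the output for a question mark avoids confusing a substituted character with a genuine question mark in the input.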

BTW, I can see no use for a "mixed ANSI string" having distinct elements from two or more codepages. Codepaged sets are a bit like ix86 segmented addressing, where the contents of SP alone don't define an unambiguous address: only SS:SP does, but once SS (the codepage) is fixed you can't access more than 64 KB (O memories!).

What drove me crazy is the fact that, as the previous post pointed out, there are untold problems getting kernel32.dll's WideCharToMultiByte and its sister MultiByteToWideChar to work as they should through the WinAPI functions. As I understand it, UTF-16 "strings" returned from DllCalls are not strings but rather structures containing a hex representation of UTF-16 strings. I suppose that is a side effect of AutoIt's typelessness. But then it's impossible to pass such parameters to a third-party library function that expects genuine wstrs and is unaware of the exotic hex format.
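
For comparison, here is a minimal sketch of handing a native AutoIt string straight to a wide-character API through the "wstr" parameter type, with no intermediate structure. It assumes a Unicode build of AutoIt; lstrlenW from kernel32.dll is used only because it is a simple function that takes a UTF-16 string.

```autoit
; Pass a native AutoIt string directly to a wide-character (W) API using "wstr".
Local $sText = "Caf" & ChrW(0x00E9) ; "Café"
Local $aRet = DllCall("kernel32.dll", "int", "lstrlenW", "wstr", $sText)
If @error Then
    ConsoleWrite("DllCall failed, @error = " & @error & @CRLF)
Else
    ; $aRet[0] holds the return value: the length in UTF-16 code units.
    ConsoleWrite("lstrlenW reports " & $aRet[0] & " code units" & @CRLF)
EndIf
```

If the DLL receives the string intact, the reported length matches StringLen($sText), which is 4 here.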

Also I've struggled to understand the following definition of the wstr parameter type in DllCall function: "a UNICODE wide character string (converted to/from an ANSI string during the call if needed)." This hardly makes any sense to me as it is and I believe it should be ( clarified | rephrased | corrected | removed ).

As a sidenote, yes I do have a French accent ... I'm a français de France!

PsaltyDS · April 20, 2009

Can you post a short runnable example of one of the conversion issues you have? This stuff gets over my head too fast for me to figure out that many cases. Maybe something like reading from a short GUI example and a short SQLite example and comparing values in the way you intend?

^_^
