Posted (edited)

I am writing a script that reads a text file created by an old MS-DOS application that uses codepage 437 (in North America) or 850 elsewhere. I want to convert this text to the standard Windows codepage, 1252, and then use ClipPut to put it in the Windows clipboard.

I've found this message which shows me how to convert OEM to ANSI:

What I'm not clear about is how to force this to convert from either 850 or 437. I've got it working perfectly well on a US-based (437) system, but I don't know how to make it work with 850. Or am I misunderstanding how this operates?

Edited by Edward Mendelson
Posted

Hi,

maybe the following script helps you a bit with the codepages...

; Convert an ANSI string to Unicode using each code page in turn


#include <WinAPI.au3>

;http://www.kostis.net/charsets/cp850.htm

;http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
$page = "037  IBM037  IBM EBCDIC US-Canada|437  IBM437  OEM United States|500  IBM500  IBM EBCDIC International|708  ASMO-708  Arabic (ASMO 708)|709    Arabic (ASMO-449+, BCON V4)|710    Arabic - Transparent Arabic|720  DOS-720  Arabic (Transparent ASMO); Arabic (DOS)|737  ibm737  OEM Greek (formerly 437G); Greek (DOS)|775  ibm775  OEM Baltic; Baltic (DOS)|850  ibm850  OEM Multilingual Latin 1; Western European (DOS)|852  ibm852  OEM Latin 2; Central European (DOS)|855  IBM855  OEM Cyrillic (primarily Russian)|857  ibm857  OEM Turkish; Turkish (DOS)|858  IBM00858  OEM Multilingual Latin 1 + Euro symbol|860  IBM860  OEM Portuguese; Portuguese (DOS)|861  ibm861  OEM Icelandic; Icelandic (DOS)|862  DOS-862  OEM Hebrew; Hebrew (DOS)|863  IBM863  OEM French Canadian; French Canadian (DOS)|864  IBM864  OEM Arabic; Arabic (864)|865  IBM865  OEM Nordic; Nordic (DOS)|866  cp866  OEM Russian; Cyrillic (DOS)|869  ibm869  OEM Modern Greek; Greek, Modern (DOS)|870  IBM870  IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2|874  windows-874  ANSI/OEM Thai (same as 28605, ISO 8859-15); Thai (Windows)|875  cp875  IBM EBCDIC Greek Modern|932  shift_jis  ANSI/OEM Japanese; Japanese (Shift-JIS)|936  gb2312  ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)|949  ks_c_5601-1987  ANSI/OEM Korean (Unified Hangul Code)|950  big5  ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)|1026  IBM1026  IBM EBCDIC Turkish (Latin 5)|1047  IBM01047  IBM EBCDIC Latin 1/Open System|1140  IBM01140  IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)|1141  IBM01141  IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)|1142  IBM01142  IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)|1143  IBM01143  IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)|1144  IBM01144  IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)|1145  IBM01145  IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)|1146  IBM01146  IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)|1147  IBM01147  IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)|1148  IBM01148  IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)|1149  IBM01149  IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)|1200  utf-16  Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications|" & _
        "1201  unicodeFFFE  Unicode UTF-16, big endian byte order; available only to managed applications|1250  windows-1250  ANSI Central European; Central European (Windows)|1251  windows-1251  ANSI Cyrillic; Cyrillic (Windows)|1252  windows-1252  ANSI Latin 1; Western European (Windows)|1253  windows-1253  ANSI Greek; Greek (Windows)|1254  windows-1254  ANSI Turkish; Turkish (Windows)|1255  windows-1255  ANSI Hebrew; Hebrew (Windows)|1256  windows-1256  ANSI Arabic; Arabic (Windows)|1257  windows-1257  ANSI Baltic; Baltic (Windows)|1258  windows-1258  ANSI/OEM Vietnamese; Vietnamese (Windows)|1361  Johab  Korean (Johab)|10000  macintosh  MAC Roman; Western European (Mac)|10001  x-mac-japanese  Japanese (Mac)|10002  x-mac-chinesetrad  MAC Traditional Chinese (Big5); Chinese Traditional (Mac)|10003  x-mac-korean  Korean (Mac)|10004  x-mac-arabic  Arabic (Mac)|10005  x-mac-hebrew  Hebrew (Mac)|10006  x-mac-greek  Greek (Mac)|10007  x-mac-cyrillic  Cyrillic (Mac)|10008  x-mac-chinesesimp  MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)|10010  x-mac-romanian  Romanian (Mac)|10017  x-mac-ukrainian  Ukrainian (Mac)|10021  x-mac-thai  Thai (Mac)|10029  x-mac-ce  MAC Latin 2; Central European (Mac)|10079  x-mac-icelandic  Icelandic (Mac)|10081  x-mac-turkish  Turkish (Mac)|10082  x-mac-croatian  Croatian (Mac)|12000  utf-32  Unicode UTF-32, little endian byte order; available only to managed applications|12001  utf-32BE  Unicode UTF-32, big endian byte order; available only to managed applications|20000  x-Chinese_CNS  CNS Taiwan; Chinese Traditional (CNS)|20001  x-cp20001  TCA Taiwan|20002  x_Chinese-Eten  Eten Taiwan; Chinese Traditional (Eten)|20003  x-cp20003  IBM5550 Taiwan|20004  x-cp20004  TeleText Taiwan|20005  x-cp20005  Wang Taiwan|20105  x-IA5  IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)|20106  x-IA5-German  IA5 German (7-bit)|20107  x-IA5-Swedish  IA5 Swedish (7-bit)|20108  x-IA5-Norwegian  IA5 Norwegian (7-bit)|20127  us-ascii  US-ASCII (7-bit)|20261  x-cp20261  T.61|20269  x-cp20269  ISO 6937 Non-Spacing Accent|20273  IBM273  IBM EBCDIC Germany|20277  IBM277  IBM EBCDIC Denmark-Norway|20278  IBM278  IBM EBCDIC Finland-Sweden|20280  IBM280  IBM EBCDIC Italy|20284  IBM284  IBM EBCDIC Latin America-Spain|20285  IBM285  IBM EBCDIC United Kingdom|20290  IBM290  IBM EBCDIC Japanese Katakana Extended|20297  IBM297  IBM EBCDIC France|20420  IBM420  IBM EBCDIC Arabic|20423  IBM423  IBM EBCDIC Greek|20424  IBM424  IBM EBCDIC Hebrew|20833  x-EBCDIC-KoreanExtended  IBM EBCDIC Korean Extended|" & _
        "20838  IBM-Thai  IBM EBCDIC Thai|20866  koi8-r  Russian (KOI8-R); Cyrillic (KOI8-R)|20871  IBM871  IBM EBCDIC Icelandic|20880  IBM880  IBM EBCDIC Cyrillic Russian|20905  IBM905  IBM EBCDIC Turkish|20924  IBM00924  IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)|20932  EUC-JP  Japanese (JIS 0208-1990 and 0121-1990)|20936  x-cp20936  Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)|20949  x-cp20949  Korean Wansung|21025  cp1025  IBM EBCDIC Cyrillic Serbian-Bulgarian|21027    (deprecated)|21866  koi8-u  Ukrainian (KOI8-U); Cyrillic (KOI8-U)|28591  iso-8859-1  ISO 8859-1 Latin 1; Western European (ISO)|28592  iso-8859-2  ISO 8859-2 Central European; Central European (ISO)|28593  iso-8859-3  ISO 8859-3 Latin 3|28594  iso-8859-4  ISO 8859-4 Baltic|28595  iso-8859-5  ISO 8859-5 Cyrillic|28596  iso-8859-6  ISO 8859-6 Arabic|28597  iso-8859-7  ISO 8859-7 Greek|28598  iso-8859-8  ISO 8859-8 Hebrew; Hebrew (ISO-Visual)|28599  iso-8859-9  ISO 8859-9 Turkish|28603  iso-8859-13  ISO 8859-13 Estonian|28605  iso-8859-15  ISO 8859-15 Latin 9|29001  x-Europa  Europa 3|38598  iso-8859-8-i  ISO 8859-8 Hebrew; Hebrew (ISO-Logical)|50220  iso-2022-jp  ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)|50221  csISO2022JP  ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)|50222  iso-2022-jp  ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)|50225  iso-2022-kr  ISO 2022 Korean|50227  x-cp50227  ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)|50229    ISO 2022 Traditional Chinese|50930    EBCDIC Japanese (Katakana) Extended|50931    EBCDIC US-Canada and Japanese|50933    EBCDIC Korean Extended and Korean|50935    EBCDIC Simplified Chinese Extended and Simplified Chinese|50936    EBCDIC Simplified Chinese|50937    EBCDIC US-Canada and Traditional Chinese|50939    EBCDIC Japanese (Latin) Extended and Japanese|51932  euc-jp  EUC Japanese|51936  EUC-CN  EUC Simplified Chinese; Chinese Simplified (EUC)|51949  euc-kr  EUC Korean|51950    EUC Traditional Chinese|52936  hz-gb-2312  HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)|54936  GB18030  Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)|57002  x-iscii-de  ISCII Devanagari|57003  x-iscii-be  ISCII Bengali|57004  x-iscii-ta  ISCII Tamil|57005  x-iscii-te  ISCII Telugu|57006  x-iscii-as  ISCII Assamese|57007  x-iscii-or  ISCII Oriya|57008  x-iscii-ka  ISCII Kannada|57009  x-iscii-ma  ISCII Malayalam|57010  x-iscii-gu  ISCII Gujarati|57011  x-iscii-pa  ISCII Punjabi|65000  utf-7  Unicode (UTF-7)|65001  utf-8  Unicode (UTF-8)"

$a = StringSplit($page, "|", 2)
$struct = DllStructCreate("byte[512]") ; room for the UTF-16 output
$ansi = ""
For $i = 1 To 255  ; fill the ANSI string with every character of the current character set
    $ansi &= Chr($i)
Next

For $b In $a   ; loop over all code pages
    $codepage = Number($b)   ; code page number (leading digits of the entry)
    $string = _WinAPI_MultiByteToWideCharEx($ansi, DllStructGetPtr($struct), $codepage, $MB_USEGLYPHCHARS)   ; convert the ANSI string using this code page
    ; if $string = 0 there may have been an error or the text is not representable; otherwise it is the number of characters
    ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $string = ' & $string & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
    If $string <> 0 Then MsgBox(262144, "Codepage Nr:" & $b, "Original string:" & @CRLF & $ansi & @CRLF & @CRLF & "Codepage " & $codepage & @CRLF & BinaryToString(DllStructGetData($struct, 1), 2)) ;### Debug MSGBOX
Next
Posted

Andy G,

That is an excellent script - thank you for it.

But perhaps I can restate my problem:

My script can read a text file created either in codepage 437 or codepage 850. My script will know which code page the text file was created in. (The filename will either be clip437.txt or clip850.txt.) I want to be able to convert that file into the Windows code page 1252, whether or not the current system codepage is 437 or 850 or something else.

Is there an API call that I can make, or some other AutoIt feature, that will convert a text string from 437 or 850 to 1252?

Thank you for any help.

Posted

The answer to your last question (as it is stated) is NO, this is plain impossible in the general case.

Here's why: a codepage is a _convention_ where you decide to assign a given position (or hex value) to a given character in a table of limited size (typically 256 positions, leaving aside the multi-byte codepages used in some Asian regions). That means that, for instance, position 0xD8 means the uppercase U ogonek (Ų) in the Windows Baltic codepage, the uppercase Cyrillic Sha (Ш) in the Windows Cyrillic codepage, and the uppercase slashed O (Ø) in the Windows Turkish codepage as well as in the Windows Latin-1 (1252) codepage. The problem is that many characters above 0x7F in a given codepage have no equivalent in another codepage.
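
To see this in practice, here is a minimal sketch that decodes the same byte, 0xD8, under the four Windows codepages just mentioned. It assumes the four-argument form of _WinAPI_MultiByteToWideChar from WinAPI.au3 (last argument True to get a string back) and hands the byte over in a DllStruct so that AutoIt does not reinterpret it first. The variable names are only illustrative.

```autoit
#include <WinAPI.au3>

; One data byte (0xD8) followed by an explicit terminating zero byte.
Local $tByte = DllStructCreate("byte[2]")
DllStructSetData($tByte, 1, 0xD8, 1)
DllStructSetData($tByte, 1, 0, 2)

; Baltic, Cyrillic, Turkish and Latin 1 Windows code pages.
Local $aCodePages[4] = [1257, 1251, 1254, 1252]

For $iCodePage In $aCodePages
    ; Decode the byte with the given code page; True asks for a string result.
    Local $sChar = _WinAPI_MultiByteToWideChar($tByte, $iCodePage, 0, True)
    If @error Then
        ConsoleWrite($iCodePage & ": conversion failed" & @CRLF)
    Else
        ; Print the Unicode code point the byte was mapped to.
        ConsoleWrite($iCodePage & ": U+" & Hex(AscW($sChar), 4) & @CRLF)
    EndIf
Next
```

If the mappings described above hold, the byte should come out as U+0172 for 1257, U+0428 for 1251, and U+00D8 for both 1254 and 1252.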

This is why Unicode was created (about 20 years ago!). Using Unicode, you can be sure that there is no possible mis-interpretation of a given character, as every known past, present or future character used by humanity has been assigned or will be assigned its own codepoint.

In short, my advice is to read the codepage text, convert it into a Unicode string(s) and stick with the Unicode representation to ascertain every character will be interpreted as intended, whatever underlying system/user setting is used.

Posted (edited)

In short, my advice is to read the codepage text, convert it into a Unicode string(s) and stick with the Unicode representation to ascertain every character will be interpreted as intended, whatever underlying system/user setting is used.

But that is exactly what I am trying to do! I probably didn't make it clear that I was asking how to do exactly that. I'm grateful to you for confirming that what I want to do is exactly what I should be doing.

I should have made this clearer. My script will start by reading a text file that has been output by WordPerfect for DOS in either codepage 437 or codepage 850 - the early versions of WordPerfect can't output ANSI or Unicode text.

If the text from WordPerfect is in codepage 437, and if the Windows system is in North America, then a simple OEM to ANSI conversion will make it easy for me to get Unicode text. That's because Windows checks the OEMCP setting in the registry to see what the local DOS code page should be. (Changing this setting takes more than writing to the registry - it also requires a reboot.)
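
For reference, a short sketch of how a script could check which OEM and ANSI codepages the running system uses. GetOEMCP and GetACP are standard kernel32 functions. The registry path shown is an assumption about where the OEMCP setting mentioned above is stored, and reading it is only informational.

```autoit
; Ask Windows for the active OEM (DOS) code page, e.g. 437 or 850.
Local $aOEM = DllCall("kernel32.dll", "uint", "GetOEMCP")
If @error Then
    ConsoleWrite("GetOEMCP call failed" & @CRLF)
    Exit 1
EndIf

; Ask Windows for the active ANSI code page, e.g. 1252.
Local $aANSI = DllCall("kernel32.dll", "uint", "GetACP")
If @error Then
    ConsoleWrite("GetACP call failed" & @CRLF)
    Exit 1
EndIf

ConsoleWrite("OEM code page in use:  " & $aOEM[0] & @CRLF)
ConsoleWrite("ANSI code page in use: " & $aANSI[0] & @CRLF)

; Assumed location of the OEMCP setting; RegRead returns it as a string.
Local $sRegOEM = RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage", "OEMCP")
If Not @error Then ConsoleWrite("OEMCP in registry:     " & $sRegOEM & @CRLF)
```

A plain OEM to ANSI conversion follows the first value, which is why it only gives the right result when the file's codepage happens to match the system's.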

Similarly, if the text from WordPerfect is in codepage 850, and if the Windows system is in Western Europe, then a simple OEM to ANSI conversion will make it easy for me to get Unicode text.

However, for various reasons, the user of this script may not have the technical ability to force his WordPerfect setup into using the correct code page. So it's possible that the user will output codepage 437 text in a Western European system, or he might output codepage 850 text in a North American system. In that case, a simple OEM to ANSI conversion won't work, and I want to be able to handle that situation also.

So my question still is: is there a way to convert the contents of the WordPerfect-created text file from codepage 850 or 437 to ANSI (which in this case is directly convertible into Unicode)? In other words, what I want to do is 100 percent exactly what you are suggesting that I do. I am asking how to do it reliably.

P.S. I know that one answer is to use a third-party utility (the Windows port of the Linux iconv program), but I think there must be a way to accomplish this using the Windows API. I simply don't know what it is.

Edited by Edward Mendelson
Posted

OK, I was reading what you wrote a bit too strictly.

Here's the Windows function you need, along with information from this page. It's available as part of the WinAPI.au3 UDF:

So you would use something like:

$OutputUnicodeString = _WinAPI_MultiByteToWideChar($sInputText, $iCodePage, 0, True)
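
Put together for the clip437.txt / clip850.txt case, this could look roughly like the sketch below. It is only an outline under a few assumptions: _ReadDosText is a hypothetical helper name, the file is read in binary mode so AutoIt does not apply the system codepage first, and the bytes are passed in a zero-terminated DllStruct rather than as a string.

```autoit
#include <FileConstants.au3>
#include <WinAPI.au3>

; Hypothetical helper: read a DOS text file as raw bytes and decode it
; from the given code page (437 or 850) into an AutoIt (UTF-16) string.
Func _ReadDosText($sPath, $iCodePage)
    Local $hFile = FileOpen($sPath, $FO_READ + $FO_BINARY)
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $dBytes = FileRead($hFile)
    FileClose($hFile)

    Local $iLen = BinaryLen($dBytes)
    If $iLen = 0 Then Return SetError(0, 0, "")

    ; Copy the bytes into a buffer with one extra zero byte at the end,
    ; because the wrapper treats its input as a null-terminated string.
    Local $tBuffer = DllStructCreate("byte[" & ($iLen + 1) & "]")
    DllStructSetData($tBuffer, 1, $dBytes)
    DllStructSetData($tBuffer, 1, 0, $iLen + 1)

    ; Decode with the requested code page; True asks for a string result.
    Local $sText = _WinAPI_MultiByteToWideChar($tBuffer, $iCodePage, 0, True)
    If @error Then Return SetError(2, 0, "")
    Return SetError(0, 0, $sText)
EndFunc   ;==>_ReadDosText

; The file name tells the script which code page the text uses.
Local $sFile = "clip850.txt"
Local $iCodePage = 437
If StringInStr($sFile, "850") Then $iCodePage = 850

Local $sUnicode = _ReadDosText($sFile, $iCodePage)
If @error Then
    ConsoleWrite("Could not read or convert " & $sFile & @CRLF)
Else
    ; Place the converted text on the Windows clipboard.
    ClipPut($sUnicode)
EndIf
```

Because the codepage is passed explicitly, the result should not depend on the system's OEMCP setting, and the clipboard receives Unicode text, so a separate conversion to 1252 should not be needed.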
