emendelson Posted November 6, 2010
I am writing a script that reads a text file created by an old MS-DOS application that uses codepage 437 (in North America) or 850 elsewhere. I want to convert this text to the standard Windows codepage, 1252, and then use ClipPut to put it in the Windows clipboard.
I've found this message, which shows me how to convert OEM to ANSI.
What I'm not clear about is how to force this to convert from either 850 or 437. I've got it working perfectly well on a US-based (437) system, but I don't know how to make it work with 850. Or am I misunderstanding how this operates?
Edited November 6, 2010 by Edward Mendelson
AndyG Posted November 6, 2010
Hi, maybe the following script helps you a bit with the codepages...

; Convert an ANSI string using each code page in turn
#include <WinAPI.au3>
;http://www.kostis.net/charsets/cp850.htm
;http://msdn.microsoft.com/en-us/library/dd317756(v=VS.85).aspx
$page = "037 IBM037 IBM EBCDIC US-Canada|437 IBM437 OEM United States|500 IBM500 IBM EBCDIC International|708 ASMO-708 Arabic (ASMO 708)|709 Arabic (ASMO-449+, BCON V4)|710 Arabic - Transparent Arabic|720 DOS-720 Arabic (Transparent ASMO); Arabic (DOS)|737 ibm737 OEM Greek (formerly 437G); Greek (DOS)|775 ibm775 OEM Baltic; Baltic (DOS)|850 ibm850 OEM Multilingual Latin 1; Western European (DOS)|852 ibm852 OEM Latin 2; Central European (DOS)|855 IBM855 OEM Cyrillic (primarily Russian)|857 ibm857 OEM Turkish; Turkish (DOS)|858 IBM00858 OEM Multilingual Latin 1 + Euro symbol|860 IBM860 OEM Portuguese; Portuguese (DOS)|861 ibm861 OEM Icelandic; Icelandic (DOS)|862 DOS-862 OEM Hebrew; Hebrew (DOS)|863 IBM863 OEM French Canadian; French Canadian (DOS)|864 IBM864 OEM Arabic; Arabic (864)|865 IBM865 OEM Nordic; Nordic (DOS)|866 cp866 OEM Russian; Cyrillic (DOS)|869 ibm869 OEM Modern Greek; Greek, Modern (DOS)|870 IBM870 IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2|874 windows-874 ANSI/OEM Thai (same as 28605, ISO 8859-15); Thai (Windows)|875 cp875 IBM EBCDIC Greek Modern|932 shift_jis ANSI/OEM Japanese; Japanese (Shift-JIS)|936 gb2312 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)|949 ks_c_5601-1987 ANSI/OEM Korean (Unified Hangul Code)|950 big5 ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)|1026 IBM1026 IBM EBCDIC Turkish (Latin 5)|1047 IBM01047 IBM EBCDIC Latin 1/Open System|1140 IBM01140 IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)|1141 IBM01141 IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)|1142 IBM01142 IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)|1143 IBM01143 IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)|1144 IBM01144 IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)|1145 IBM01145 IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)|1146 IBM01146 IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)|1147 IBM01147 IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)|1148 IBM01148 IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)|1149 IBM01149 IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)|1200 utf-16 Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications|" & _
        "1201 unicodeFFFE Unicode UTF-16, big endian byte order; available only to managed applications|1250 windows-1250 ANSI Central European; Central European (Windows)|1251 windows-1251 ANSI Cyrillic; Cyrillic (Windows)|1252 windows-1252 ANSI Latin 1; Western European (Windows)|1253 windows-1253 ANSI Greek; Greek (Windows)|1254 windows-1254 ANSI Turkish; Turkish (Windows)|1255 windows-1255 ANSI Hebrew; Hebrew (Windows)|1256 windows-1256 ANSI Arabic; Arabic (Windows)|1257 windows-1257 ANSI Baltic; Baltic (Windows)|1258 windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows)|1361 Johab Korean (Johab)|10000 macintosh MAC Roman; Western European (Mac)|10001 x-mac-japanese Japanese (Mac)|10002 x-mac-chinesetrad MAC Traditional Chinese (Big5); Chinese Traditional (Mac)|10003 x-mac-korean Korean (Mac)|10004 x-mac-arabic Arabic (Mac)|10005 x-mac-hebrew Hebrew (Mac)|10006 x-mac-greek Greek (Mac)|10007 x-mac-cyrillic Cyrillic (Mac)|10008 x-mac-chinesesimp MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)|10010 x-mac-romanian Romanian (Mac)|10017 x-mac-ukrainian Ukrainian (Mac)|10021 x-mac-thai Thai (Mac)|10029 x-mac-ce MAC Latin 2; Central European (Mac)|10079 x-mac-icelandic Icelandic (Mac)|10081 x-mac-turkish Turkish (Mac)|10082 x-mac-croatian Croatian (Mac)|12000 utf-32 Unicode UTF-32, little endian byte order; available only to managed applications|12001 utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications|20000 x-Chinese_CNS CNS Taiwan; Chinese Traditional (CNS)|20001 x-cp20001 TCA Taiwan|20002 x_Chinese-Eten Eten Taiwan; Chinese Traditional (Eten)|20003 x-cp20003 IBM5550 Taiwan|20004 x-cp20004 TeleText Taiwan|20005 x-cp20005 Wang Taiwan|20105 x-IA5 IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)|20106 x-IA5-German IA5 German (7-bit)|20107 x-IA5-Swedish IA5 Swedish (7-bit)|20108 x-IA5-Norwegian IA5 Norwegian (7-bit)|20127 us-ascii US-ASCII (7-bit)|20261 x-cp20261 T.61|20269 x-cp20269 ISO 6937 Non-Spacing Accent|20273 IBM273 IBM EBCDIC Germany|20277 IBM277 IBM EBCDIC Denmark-Norway|20278 IBM278 IBM EBCDIC Finland-Sweden|20280 IBM280 IBM EBCDIC Italy|20284 IBM284 IBM EBCDIC Latin America-Spain|20285 IBM285 IBM EBCDIC United Kingdom|20290 IBM290 IBM EBCDIC Japanese Katakana Extended|20297 IBM297 IBM EBCDIC France|20420 IBM420 IBM EBCDIC Arabic|20423 IBM423 IBM EBCDIC Greek|20424 IBM424 IBM EBCDIC Hebrew|20833 x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended|" & _
        "20838 IBM-Thai IBM EBCDIC Thai|20866 koi8-r Russian (KOI8-R); Cyrillic (KOI8-R)|20871 IBM871 IBM EBCDIC Icelandic|20880 IBM880 IBM EBCDIC Cyrillic Russian|20905 IBM905 IBM EBCDIC Turkish|20924 IBM00924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)|20932 EUC-JP Japanese (JIS 0208-1990 and 0121-1990)|20936 x-cp20936 Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)|20949 x-cp20949 Korean Wansung|21025 cp1025 IBM EBCDIC Cyrillic Serbian-Bulgarian|21027 (deprecated)|21866 koi8-u Ukrainian (KOI8-U); Cyrillic (KOI8-U)|28591 iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)|28592 iso-8859-2 ISO 8859-2 Central European; Central European (ISO)|28593 iso-8859-3 ISO 8859-3 Latin 3|28594 iso-8859-4 ISO 8859-4 Baltic|28595 iso-8859-5 ISO 8859-5 Cyrillic|28596 iso-8859-6 ISO 8859-6 Arabic|28597 iso-8859-7 ISO 8859-7 Greek|28598 iso-8859-8 ISO 8859-8 Hebrew; Hebrew (ISO-Visual)|28599 iso-8859-9 ISO 8859-9 Turkish|28603 iso-8859-13 ISO 8859-13 Estonian|28605 iso-8859-15 ISO 8859-15 Latin 9|29001 x-Europa Europa 3|38598 iso-8859-8-i ISO 8859-8 Hebrew; Hebrew (ISO-Logical)|50220 iso-2022-jp ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)|50221 csISO2022JP ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)|50222 iso-2022-jp ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)|50225 iso-2022-kr ISO 2022 Korean|50227 x-cp50227 ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)|50229 ISO 2022 Traditional Chinese|50930 EBCDIC Japanese (Katakana) Extended|50931 EBCDIC US-Canada and Japanese|50933 EBCDIC Korean Extended and Korean|50935 EBCDIC Simplified Chinese Extended and Simplified Chinese|50936 EBCDIC Simplified Chinese|50937 EBCDIC US-Canada and Traditional Chinese|50939 EBCDIC Japanese (Latin) Extended and Japanese|51932 euc-jp EUC Japanese|51936 EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)|51949 euc-kr EUC Korean|51950 EUC Traditional Chinese|52936 hz-gb-2312 HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)|54936 GB18030 Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)|57002 x-iscii-de ISCII Devanagari|57003 x-iscii-be ISCII Bengali|57004 x-iscii-ta ISCII Tamil|57005 x-iscii-te ISCII Telugu|57006 x-iscii-as ISCII Assamese|57007 x-iscii-or ISCII Oriya|57008 x-iscii-ka ISCII Kannada|57009 x-iscii-ma ISCII Malayalam|57010 x-iscii-gu ISCII Gujarati|57011 x-iscii-pa ISCII Punjabi|65000 utf-7 Unicode (UTF-7)|65001 utf-8 Unicode (UTF-8)"
$a = StringSplit($page, "|", 2)
$struct = DllStructCreate("byte[512]") ; room for the UTF-16 output
$ansi = ""
For $i = 1 To 255 ; fill the ANSI string with every character of the current character set
    $ansi &= Chr($i)
Next
For $b In $a ; all code pages
    $codepage = Number($b) ; code page number (leading digits of the entry)
    $string = _WinAPI_MultiByteToWideCharEx($ansi, DllStructGetPtr($struct), $codepage, $MB_USEGLYPHCHARS) ; convert the ANSI string using this code page
    ; if $string = 0, there was an error or the text cannot be represented; otherwise it is the number of characters
    ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $string = ' & $string & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
    If $string <> 0 Then MsgBox(262144, "Codepage no.: " & $b, "Original string:" & @CRLF & $ansi & @CRLF & @CRLF & "Codepage " & $codepage & @CRLF & BinaryToString(DllStructGetData($struct, 1), 2)) ;### Debug MSGBOX
Next
CNCMONKEY Posted November 6, 2010
Excellent work. Wow, how simple you made it look. Thanks!
emendelson Posted November 6, 2010
Andy G,
That is an excellent script - thank you for it. But perhaps I can restate my problem:
My script can read a text file created either in codepage 437 or codepage 850. My script will know which code page the text file was created in. (The filename will either be clip437.txt or clip850.txt.)
I want to be able to convert that file into the Windows code page 1252, whether or not the current system codepage is 437 or 850 or something else.
Is there an API call that I can make, or some other AutoIt feature, that will convert a text string from 437 or 850 to 1252?
Thank you for any help.
jchd Posted November 6, 2010
The answer to your last question (as it is stated) is NO, this is plain impossible in the general case.
Here's why: a codepage is a _convention_ where you decide to assign a given position (or hex value) to a given character in a table of limited size (typically 256 positions, leaving aside the multi-byte codepages used in some Asian regions). That means that, for instance, position 0xD8 means the uppercase U ogonek (Ų) in the Windows Baltic codepage, the uppercase Cyrillic Sha (Ш) in the Windows Cyrillic codepage, and the uppercase slashed O (Ø) in the Windows Turkish codepage as well as in the Windows Latin-1 1252 codepage.
The problem is that many characters above 0x7F in a given codepage have no equivalent in another codepage. This is why Unicode was created (about 20 years ago!). Using Unicode, you can be sure that there is no possible misinterpretation of a given character, as every known past, present or future character used by humanity has been assigned or will be assigned its own codepoint.
In short, my advice is to read the codepage text, convert it into Unicode string(s) and stick with the Unicode representation to make sure every character will be interpreted as intended, whatever the underlying system or user setting.
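To make the point about position 0xD8 concrete, here is a minimal sketch in AutoIt. It assumes a WinAPI.au3 in which _WinAPI_MultiByteToWideChar accepts a DllStruct as input; the variable names are made up for illustration. It decodes the same single byte under the four Windows codepages named above, plus the two DOS codepages from the question, and prints the Unicode code point each one yields.

```autoit
#include <WinAPI.au3>

; One data byte (0xD8) followed by a zero byte that terminates the input.
; $tByte and the names below are illustrative, not from the thread.
Local $tByte = DllStructCreate("byte[2]")
DllStructSetData($tByte, 1, 0xD8, 1)
DllStructSetData($tByte, 1, 0, 2)

; Baltic, Cyrillic, Turkish, Latin-1, then the two DOS code pages.
Local $aCodePages[6] = [1257, 1251, 1254, 1252, 437, 850]

For $iCodePage In $aCodePages
    ; The last argument (True) asks for the result as an AutoIt string.
    Local $sChar = _WinAPI_MultiByteToWideChar($tByte, $iCodePage, 0, True)
    ConsoleWrite("Code page " & $iCodePage & ": U+" & Hex(AscW($sChar), 4) & @CRLF)
Next
```

Printing code points rather than the characters themselves keeps the output independent of the console's own encoding.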
emendelson Posted November 7, 2010
"In short, my advice is to read the codepage text, convert it into a Unicode string(s) and stick with the Unicode representation to ascertain every character will be interpreted as intended, whatever underlying system/user setting is used."
But that is exactly what I am trying to do! I probably didn't make it clear that I was asking how to do exactly that. I'm grateful to you for confirming that what I want to do is exactly what I should be doing.
I should have made this clearer. My script will start by reading a text file that has been output by WordPerfect for DOS in either codepage 437 or codepage 850 - the early versions of WordPerfect can't output ANSI or Unicode text.
If the text from WordPerfect is in codepage 437, and if the Windows system is in North America, then a simple OEM to ANSI conversion will make it easy for me to get Unicode text. That's because Windows checks the OEMCP setting in the registry to see what the local DOS code page should be. (This setting can't be changed merely by writing to the registry - it also requires a reboot.)
Similarly, if the text from WordPerfect is in codepage 850, and if the Windows system is in Western Europe, then a simple OEM to ANSI conversion will make it easy for me to get Unicode text.
However, for various reasons, the user of this script may not have the technical ability to force his WordPerfect setup into using the correct code page. So it's possible that the user will output codepage 437 text on a Western European system, or he might output codepage 850 text on a North American system. In that case, a simple OEM to ANSI conversion won't work, and I want to be able to handle that situation also.
So my question still is: is there a way to convert the contents of the WordPerfect-created text file from codepage 850 or 437 to ANSI (which in this case is directly convertible into Unicode)? In other words, what I want to do is 100 percent exactly what you are suggesting that I do. I am asking how to do it reliably.
P.S. I know that one answer is to use a third-party utility (the Windows port of the Linux iconv program), but I think there must be a way to accomplish this using the Windows API. I simply don't know what it is.
Edited November 7, 2010 by Edward Mendelson
jchd Posted November 7, 2010
OK, I was reading what you wrote in a bit too strict a way.
Here's the Windows function you need, along with information from this page. It's available as part of the WinAPI.au3 UDF.
So you would use something like:
$OutputUnicodeString = _WinAPI_MultiByteToWideChar($sInputText, $iCodePage, 0, True)
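For the workflow described earlier in the thread (clip437.txt or clip850.txt, then ClipPut), the call above might be wrapped roughly as follows. This is only a sketch under stated assumptions: _OemFileToUnicode is a hypothetical helper name, not part of any UDF, and the file is read as raw bytes into a DllStruct so that AutoIt does not reinterpret the bytes in the system codepage before the conversion happens.

```autoit
#include <WinAPI.au3>

; Hypothetical helper (not from the thread): read a DOS text file as raw
; bytes and return it as a native AutoIt (UTF-16) string.
Func _OemFileToUnicode($sPath, $iCodePage)
    Local $hFile = FileOpen($sPath, 16) ; 16 = binary read mode
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $dBytes = FileRead($hFile)
    FileClose($hFile)

    Local $iLen = BinaryLen($dBytes)
    If $iLen = 0 Then Return ""

    ; Buffer is one byte longer than the data; the extra byte is set to 0
    ; so the API sees a terminated string.
    Local $tBytes = DllStructCreate("byte[" & ($iLen + 1) & "]")
    DllStructSetData($tBytes, 1, $dBytes)
    DllStructSetData($tBytes, 1, 0, $iLen + 1)

    Local $sText = _WinAPI_MultiByteToWideChar($tBytes, $iCodePage, 0, True)
    If @error Then Return SetError(2, 0, "")
    Return $sText
EndFunc

; The file name tells the script which code page the text was written in.
Local $sFile = "clip850.txt"
Local $iCodePage = 437
If StringInStr($sFile, "850") Then $iCodePage = 850

Local $sUnicode = _OemFileToUnicode($sFile, $iCodePage)
If Not @error Then ClipPut($sUnicode)
```

Because the source codepage is passed explicitly, the result should not depend on the machine's OEMCP setting. Note that the conversion stops at the first zero byte in the file, which is normally fine for plain DOS text.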
emendelson Posted November 8, 2010
jchd,
That is exactly what I was hoping to find. It works perfectly. Thank you!