czardas Posted March 18, 2013

Hmm, it seems strange. The regexp pattern seems to be working though.

    If StringRegExp(ChrW(0x1A), "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]") Then MsgBox(0, "Error", "_broken segment_")
leuce (Author) Posted March 18, 2013

I don't think it is a memory problem either, because if I reduce the TMX file to just the few TUs surrounding the invalid character (i.e. total file size less than 1000 characters), the script doesn't catch the invalid character either.
trancexx Posted March 18, 2013

The \x{FFFD} character will get you in trouble with regexp. Go one below. And don't tell anyone I told you that.
jchd Posted March 19, 2013 (edited)

Codepoint U+FFFD is perfectly valid by itself (it means "an invalid character got replaced by this special codepoint"), provided that:
1) you don't object that the file contains this codepoint by itself
2) you don't object that invalid sequences in the input file get converted into this codepoint
3) the implementation of PCRE support doesn't do anything surprising

Anyway, I'm surprised you say that the pattern doesn't work to detect 0x1A:

    Local $m, $s = "abc" & Chr(0x1A) & "def"
    ; The class lists the valid XML characters; [^...] matches anything outside it.
    If StringRegExp($s, "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]") Then
        $m = "Failed"
    Else
        $m = "Passed"
    EndIf
    ConsoleWrite($m & @LF)

@trancexx, can you elaborate further?

Edited March 19, 2013 by jchd
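For readers who want to try the same check outside AutoIt, here is a minimal sketch in Python (not the thread's language; the function name `find_invalid_xml_chars` is made up for illustration). It assumes the goal is the XML 1.0 character range quoted above. Because Python strings are sequences of code points rather than UTF-16 units, the sketch also adds the range U+10000 to U+10FFFF, which XML 1.0 allows and which the 16-bit pattern in the thread does not need to spell out.

```python
import re

# Anything outside the XML 1.0 character range. The first three ranges mirror
# the pattern quoted in the thread; the last one (assumption of this sketch)
# covers code points above U+FFFF, which Python sees as single characters.
INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters outside the XML 1.0 range."""
    return [(m.start(), ord(m.group())) for m in INVALID_XML_CHAR.finditer(text)]

sample = "abc" + chr(0x1A) + "def"
print(find_invalid_xml_chars(sample))            # [(3, 26)]
print(find_invalid_xml_chars("abc\u201adef"))    # []
```

Reporting the position as well as the code point makes it easier to locate the offending TU in a large TMX file.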
trancexx Posted March 19, 2013 (edited)

jchd wrote: "@trancexx, can you elaborate further?"

The problem is for code points U+D800 to U+DFFF. Your pattern (the \x{FFFD} part) will cause wrong results here. Even though the behavior is explainable (you did it actually), it may be seen as unexpected. You see, \x{FFFD} matches both U+FFFD and the whole range from U+D800 to U+DFFF. That's what your pattern explicitly tries not to do. So it's either \x20-\x{FFFD} or the last hex is \x{FFFC}.

Edited March 19, 2013 by trancexx
jchd Posted March 19, 2013

That's true if the input file is UTF-16 encoded and contains codepoints > U+FFFF (those which use surrogates). Since the OP said he reads UTF-8 text, there should be no surrogates in the input file.

Yet a question remains, hinted at by my points 1) and 2): should invalid UTF-8 combinations not already converted to U+FFFD inside the input stream be considered charset errors? If no, then merging the ranges as trancexx did is fine; otherwise we need to parse the UTF-8 ourselves, byte by byte, and check that condition.

Anyway, "native" U+FFFD in the input should NOT be excluded from the valid XML charset range, since it is explicitly allowed as a handy placeholder. Also note that any other Unicode-conformant program reading a Unicode text file containing invalid UTF-* sequences will actually replace them with U+FFFD instance(s), so merging the ranges into \x20-\x{FFFD} is probably the simplest way to behave.
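The distinction between a "native" U+FFFD and one produced by a lenient decoder can be sketched as follows. This is Python, not AutoIt, and the two helper names are hypothetical. It relies only on the standard `bytes.decode` error modes: strict decoding raises on an invalid sequence, while `errors="replace"` substitutes U+FFFD.

```python
def decode_lenient(data):
    """Decode UTF-8, replacing each invalid sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")

def is_strictly_valid_utf8(data):
    """True only if the bytes are well-formed UTF-8 (no replacement needed)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "ab", a stray 0xFF byte (never valid in UTF-8), then "cd".
broken = b"ab\xffcd"
# The same text with a U+FFFD that was already stored in the file (EF BF BD).
native = "ab\ufffdcd".encode("utf-8")

print(decode_lenient(broken) == decode_lenient(native))   # True
print(is_strictly_valid_utf8(broken))                     # False
print(is_strictly_valid_utf8(native))                     # True
```

After lenient decoding the two inputs are indistinguishable, which is why telling them apart requires checking the raw bytes before any replacement happens.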
czardas Posted March 19, 2013 (edited)

I don't follow the logic here at all. Why would so many ill-formed variants (U+D800 - U+DFFF) be needed? Why substitute them at all? What kind of logic is there to this? To me this seems like a waste of resources (perhaps because I don't understand it, or maybe I'm just not ready to understand it).

Edited March 19, 2013 by czardas
leuce (Author) Posted March 19, 2013

Okay, confession time: I investigated my non-matching \x1A problem and it now appears that the program that I used to test the TMX file interpreted \x201A as \x001A. Well, that is my guess, based on the error message I get from that program (it complains about an unexpected file end at that position). When I remove all instances of \x201A from the file, it stops complaining about it.
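One way such a mix-up could happen is a program keeping only the low byte of a 16-bit value. This is only a guess about the external tool, not something the thread confirms. A small Python sketch of that hypothesis (the helper `low_byte` is made up for illustration):

```python
def low_byte(code_point):
    """Keep only the lowest 8 bits of a code point, as a buggy truncation would."""
    return code_point & 0xFF

quote = "\u201a"  # SINGLE LOW-9 QUOTATION MARK

print(hex(low_byte(ord(quote))))      # 0x1a
print(quote.encode("utf-16-le"))      # b'\x1a '
print(quote.encode("utf-8"))          # b'\xe2\x80\x9a'
```

The UTF-8 bytes contain no 0x1A at all, so if truncation is the cause it would have to happen after the text was decoded to 16-bit units. Byte 0x1A is also the old DOS end-of-file marker (Ctrl+Z), which would fit an "unexpected file end" message.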
jchd Posted March 19, 2013

czardas,

This range is for surrogates, i.e. 16-bit values that are reserved for encoding codepoints from upper planes ( > U+FFFF) when using UTF-16 encoding. These values are "non-characters" by themselves, since they must be associated in pairs to encode a codepoint outside plane 0. When a text file uses UTF-8 these values shouldn't occur, as UTF-8 provides its own means to encode codepoints. Hence a conforming conversion from UTF-16 to UTF-8 will never produce a codepoint in this range. Pathological programs can however let non-characters appear in their output stream.

The U+FFFD codepoint (the so-called "replacement character") is the default codepoint to indicate an invalid character during a conversion: you can see it in some ill-formed web pages as a white question mark on a black diamond-shaped background. By itself it is not an error but merely a trace that an earlier error produced something that couldn't be represented. In short, a valid Unicode text may contain occurrences of U+FFFD and these are not particularly toxic for subsequent processing.

When a conforming program reads Unicode text and discovers invalid sequences, there are two common ways to handle the situation: either halt with an error OR replace every invalid sequence by a replacement character. Thus there are two sources of U+FFFD: replacement already done at an earlier point and actual errors in the encoding of the text stream. Both read as U+FFFD to a conforming program (my points 1) and 2) above). Note that there are more complex conditions which make a particular sequence invalid, for instance overlong sequences.
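The size of the surrogate range follows from the arithmetic of UTF-16 pairs. Here is a minimal sketch in Python (the function name `to_surrogate_pair` is hypothetical) of the standard UTF-16 split of a code point above U+FFFF into a high and a low surrogate:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> U+DC00..U+DFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                 # 0xd83d 0xde00
```

Each half carries 10 bits, so 1024 high values times 1024 low values are needed to reach every code point up to U+10FFFF. That is why 2048 values are reserved rather than a single one.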
jchd Posted March 19, 2013

leuce,

U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful. That or replace the program that fails on this codepoint.
czardas Posted March 19, 2013 (edited)

Thanks jchd. It's a good explanation of what is happening. I'm still not sure of the need for such a large range. I would have thought one character would be sufficient. I need to read more about it. I read somewhere that some of these surrogates can be used in programming (for whatever purpose the programmer decides). I'm not sure about it, but I'll read up on it. Much appreciated.

Update to post: I found the answer to my question about surrogates here: http://en.wikipedia.org/wiki/Plane_%28Unicode%29

Edited March 20, 2013 by czardas
leuce (Author) Posted March 19, 2013 (edited)

jchd wrote: "U+201A is a low quote style character. Instead of suppressing it you should perhaps replace it by something similar and meaningful."

That is exactly what I'm going to try to do. The program also flounders on U+221A but it accepts the entity &#x221A;. Do you know if one can replace all characters that end on 1A with similarly named entities? What would the regex be for that, I wonder... something like this, perhaps?

    $tmxfileread = StringRegExpReplace ($tmxfileread, "\x{([a-f0-9][a-f0-9])1a}", "&#x$11a;")

Edited March 19, 2013 by leuce
jchd Posted March 19, 2013

I can't succeed in making $n (or \n, standing for nth capture replacement) interpolate in replacement pattern. Testing even a simpler match pattern like \x{00[0-9]9} doesn't match anything in abc999def. So I'm doubting that those expressions work in PCRE.

About \x{hhh...}, the PCRE doc says: "If characters other than hexadecimal digits appear between \x{ and }, or if there is no terminating }, this form of escape is not recognized. Instead, the initial \x will be interpreted as a basic hexadecimal escape, with no following digits, giving a character whose value is zero."

You may want to select the smallest subset of these codepoints which are worth expanding as hex entities:

CodePoint  GeneralCategory  CharacterName
---------  ---------------  -------------
001A       Cc               <control>
011A       Lu               LATIN CAPITAL LETTER E WITH CARON
021A       Lu               LATIN CAPITAL LETTER T WITH COMMA BELOW
031A       Mn               COMBINING LEFT ANGLE ABOVE
041A       Lu               CYRILLIC CAPITAL LETTER KA
051A       Lu               CYRILLIC CAPITAL LETTER QA
061A       Mn               ARABIC SMALL KASRA
071A       Lo               SYRIAC LETTER HETH
091A       Lo               DEVANAGARI LETTER CA
0A1A       Lo               GURMUKHI LETTER CA
0B1A       Lo               ORIYA LETTER CA
0C1A       Lo               TELUGU LETTER CA
0D1A       Lo               MALAYALAM LETTER CA
0E1A       Lo               THAI CHARACTER BO BAIMAI
0F1A       So               TIBETAN SIGN RDEL DKAR GCIG
101A       Lo               MYANMAR LETTER YA
111A       Lo               HANGUL CHOSEONG RIEUL-HIEUH
121A       Lo               ETHIOPIC SYLLABLE MI
131A       Lo               ETHIOPIC SYLLABLE GGI
141A       Lo               CANADIAN SYLLABICS WEST-CREE WAA
151A       Lo               CANADIAN SYLLABICS WEST-CREE SHWI
161A       Lo               CANADIAN SYLLABICS SAYISI JI
191A       Lo               LIMBU LETTER SSA
1A1A       Mc               BUGINESE VOWEL SIGN O
1B1A       Lo               BALINESE LETTER JA
1C1A       Lo               LEPCHA LETTER YA
1D1A       Ll               LATIN LETTER SMALL CAPITAL TURNED R
1E1A       Lu               LATIN CAPITAL LETTER E WITH TILDE BELOW
1F1A       Lu               GREEK CAPITAL LETTER EPSILON WITH PSILI AND VARIA
201A       Ps               SINGLE LOW-9 QUOTATION MARK
211A       Lu               DOUBLE-STRUCK CAPITAL Q
221A       Sm               SQUARE ROOT
231A       So               WATCH
241A       So               SYMBOL FOR SUBSTITUTE
251A       So               BOX DRAWINGS UP HEAVY AND LEFT LIGHT
261A       So               BLACK LEFT POINTING INDEX
271A       So               HEAVY GREEK CROSS
281A       So               BRAILLE PATTERN DOTS-245
291A       Sm               RIGHTWARDS ARROW-TAIL
2A1A       Sm               INTEGRAL WITH UNION
2B1A       So               DOTTED SQUARE
2C1A       Lu               GLAGOLITIC CAPITAL LETTER PE
2D1A       Ll               GEORGIAN SMALL LETTER CAN
2E1A       Pd               HYPHEN WITH DIAERESIS
2F1A       So               KANGXI RADICAL CLIFF
301A       Ps               LEFT WHITE SQUARE BRACKET
311A       Lo               BOPOMOFO LETTER A
321A       So               PARENTHESIZED HANGUL PHIEUPH A
331A       So               SQUARE KURUZEIRO
A01A       Lo               YI SYLLABLE BIET
A11A       Lo               YI SYLLABLE TIT
A21A       Lo               YI SYLLABLE GGAT
A31A       Lo               YI SYLLABLE SOP
A41A       Lo               YI SYLLABLE JJI
A51A       Lo               VAI SYLLABLE CEE
A61A       Lo               VAI SYMBOL DANG
A71A       Lm               MODIFIER LETTER LOWER RIGHT CORNER ANGLE
A81A       Lo               SYLOTI NAGRI LETTER PHO
A91A       Lo               KAYAH LI LETTER RA
AA1A       Lo               CHAM LETTER PA
F91A       Lo               CJK COMPATIBILITY IDEOGRAPH-F91A
FA1A       Lo               CJK COMPATIBILITY IDEOGRAPH-FA1A
FC1A       Lo               ARABIC LIGATURE KHAH WITH HAH ISOLATED FORM
FD1A       Lo               ARABIC LIGATURE SHEEN WITH YEH FINAL FORM
FF1A       Po               FULLWIDTH COLON
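A list like the one above can also be generated rather than typed. The following is a minimal sketch in Python using the standard `unicodedata` module. The function name is made up, and the exact result depends on the Unicode version bundled with the interpreter. Unlike the table, it also returns the CJK and Hangul syllable code points, which are assigned but have algorithmic names.

```python
import unicodedata

def bmp_code_points_ending_in(low_byte):
    """List assigned BMP code points whose low byte equals low_byte.

    Skips unassigned (Cn), surrogate (Cs) and private-use (Co) code points.
    """
    result = []
    for cp in range(low_byte, 0x10000, 0x100):
        if unicodedata.category(chr(cp)) not in ("Cn", "Cs", "Co"):
            result.append(cp)
    return result

cps = bmp_code_points_ending_in(0x1A)
print(["{:04X}".format(cp) for cp in cps[:5]])
# ['001A', '011A', '021A', '031A', '041A']
```

Because the low byte is tested numerically, no regular expression has to "look inside" a hex escape, which is what the PCRE documentation quoted above rules out.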
leuce (Author) Posted March 20, 2013

jchd wrote: "I can't succeed in making $n (or \n, standing for nth capture replacement) interpolate in replacement pattern. ... You may want to select the smallest subset of these codepoints which are worth expanding as hex entities..."

Well, the script I'm writing is something that users will use when they know that there is something wrong with their file (the script is a file fixer), so I suppose they won't mind waiting a bit for it to complete. I'm probably going to have to match all of these characters individually, and not just a presumed useful subset of them, because in the file that I tested this week there were many characters that were completely unexpected for the language combination (I suspect the source text was OCR'ed -- for example, I saw the word "iPad" in it in which the "i" looks like an "i" to the human eye but it is really something completely different).

Anyway, I tried this:

    $o = "x"   ; log of the entities that were substituted
    $n = 0     ; number of patterns that did not occur in the file
    $m = FileRead (FileOpen ("test.tmx", 32))
    MsgBox (0, "", @extended, 0)   ; number of characters read
    $arr = StringSplit ("\x{011A}|\x{021A}|\x{031A}|\x{041A}|\x{051A}|\x{061A}|\x{071A}|\x{091A}|\x{0A1A}|\x{0B1A}|\x{0C1A}|\x{0D1A}|\x{0E1A}|\x{0F1A}|\x{101A}|\x{111A}|\x{121A}|\x{131A}|\x{141A}|\x{151A}|\x{161A}|\x{1C1A}|\x{1D1A}|\x{1E1A}|\x{1F1A}|\x{201A}|\x{211A}|\x{221A}|\x{231A}|\x{241A}|\x{251A}|\x{261A}|\x{271A}|\x{281A}|\x{291A}|\x{2A1A}|\x{2B1A}|\x{2C1A}|\x{2D1A}|\x{2E1A}|\x{2F1A}|\x{301A}|\x{311A}|\x{321A}|\x{331A}|\x{A01A}|\x{A11A}|\x{A21A}|\x{A31A}|\x{A41A}|\x{A51A}|\x{A61A}|\x{A71A}|\x{A81A}|\x{A91A}|\x{AA1A}|\x{F91A}|\x{FA1A}|\x{FC1A}|\x{FD1A}|\x{FF1A}", "|", 1)
    For $i = 1 to $arr[0]
        If StringRegExp ($m, $arr[$i]) Then
            ; Extract the hex digits between { and } to build the entity, e.g. &#x201A;
            $a = StringSplit ($arr[$i], "{", 1)
            $b = StringSplit ($a[2], "}", 1)
            $c = "&#x" & $b[1] & ";"
            $m = StringRegExpReplace ($m, $arr[$i], $c)
            $o = $o & "|" & $c
        Else
            $n = $n + 1
        EndIf
    Next
    MsgBox (0, "", $n & " __ " & $o, 0)
    FileWrite (FileOpen ("test2.tmx", 34), $m)

...and it works. On my computer it takes the script 9 seconds to read a 350 MB TMX file, and the rest of the script takes less than a minute, making 6 replacements of one character and 10 replacements of another character.

Thanks again for all your help, guys!

Samuel
czardas Posted March 20, 2013

Good thread, good questions - all very informative.
jchd Posted March 20, 2013

Another way:

    Local $s = ChrW(0x1A) & 'abc' & ChrW(0x1A) & 'def' & ChrW(0x221A) & 'ghi' & ChrW(0x331A) & 'jkl' & ChrW(0x331A)
    ; The replacement text is itself an AutoIt expression; Execute() then evaluates it.
    ; Note: this breaks if $s contains a single quote.
    Local $t = Execute("'" & StringRegExpReplace($s, _
            "(?x)" & _
            "([" & _
            "\x{001A}\x{011A}\x{021A}\x{031A}\x{041A}\x{051A}\x{061A}\x{071A}\x{091A}\x{0A1A}\x{0B1A}\x{0C1A}\x{0D1A}\x{0E1A}\x{0F1A}" & _
            "\x{101A}\x{111A}\x{121A}\x{131A}\x{141A}\x{151A}\x{161A}\x{191A}\x{1A1A}\x{1B1A}\x{1C1A}\x{1D1A}\x{1E1A}\x{1F1A}" & _
            "\x{201A}\x{211A}\x{221A}\x{231A}\x{241A}\x{251A}\x{261A}\x{271A}\x{281A}\x{291A}\x{2A1A}\x{2B1A}\x{2C1A}\x{2D1A}\x{2E1A}\x{2F1A}" & _
            "\x{301A}\x{311A}\x{321A}\x{331A}" & _
            "\x{A01A}\x{A11A}\x{A21A}\x{A31A}\x{A41A}\x{A51A}\x{A61A}\x{A71A}\x{A81A}\x{A91A}\x{AA1A}" & _
            "\x{F91A}\x{FA1A}\x{FC1A}\x{FD1A}\x{FF1A}" & _
            "])", _
            "&#x' & Hex(AscW('$1'), 4) & ';") & "'")
    ConsoleWrite($t & @LF)
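In languages whose regex replace accepts a callback, the Execute() trick is not needed. As a rough comparison, here is a minimal Python sketch (function names are hypothetical, and this is not what the thread's scripts do) that computes the entity from each matched character. It deliberately leaves U+001A itself untouched, because XML 1.0 does not allow that character even when written as a character reference.

```python
import re

def to_hex_entity(match):
    """Return &#xHHHH; for a character whose code point ends in 1A, else the character."""
    ch = match.group()
    if ord(ch) & 0xFF == 0x1A:
        return "&#x{:04X};".format(ord(ch))
    return ch

def escape_1a_code_points(text):
    """Replace BMP characters U+011A, U+021A, ... U+FF1A with hex character references."""
    return re.sub(r"[\u0100-\uFFFF]", to_hex_entity, text)

print(escape_1a_code_points("abc\u201adef\u221aghi"))
# abc&#x201A;def&#x221A;ghi
```

The callback sees one character at a time, so quotes or other special characters elsewhere in the text cannot interfere with the replacement.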