leuce Posted March 18, 2013

G'day everyone

I'm trying to test whether strings contain only certain characters, which I specify in hexadecimal format. However, I have no idea how to write the regular expression, and all my tinkering produces the wrong results.

What I'm ultimately trying to accomplish is to test whether a string contains only valid XML 1.0 characters. The valid characters that I'm trying to evaluate are:

\x0009
\x000A
\x000D
\x0020-\xD7FF
\xE000-\xFFFD
\x10000-\x10FFFF

I want to specify them all as a single variable ($sValid), which I will then include in the regular expression, as follows:

    If StringRegExp($aArray[$i], "\A[" & $sValid & "]*\Z") Then
        ; Then $aArray[$i] contains only valid characters
    EndIf

The file that I read the input from is UTF8 (but does that matter?). How would I have to write the variable $sValid to let the regular expression work?

Thanks
Samuel

czardas Posted March 18, 2013 (edited)

I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF).

operator64 ArrayWorkshop

leuce (Author) Posted March 18, 2013

> I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF).

If I understand correctly, the last two sets are outside the range of the third-last set. Or... what don't I understand? :-)

czardas Posted March 18, 2013 (edited)

I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate?

Edit: Perhaps a silly question.

Melba23 (Moderators) Posted March 18, 2013

leuce,

Can you please post an example string so that we can see the format.

M23

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

My UDFs:
ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ---------- Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

leuce (Author) Posted March 18, 2013

> I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate? Edit: Perhaps a silly question.

The minus is my attempt at indicating a range. Ah, now I understand your original question: yes, the minus means "to", in the same way as one might have [a-zA-Z] in a regular expression.

I need to find out if the string I'm evaluating contains any character that is not one of those valid characters.

leuce (Author) Posted March 18, 2013 (edited)

> Can you please post an example string so that we can see the format.

Sure, but I don't know if it will help. It is a TMX file, which is XML. The original file is actually UTF16LE, and the XML is 1.0. But I resaved it as UTF8 because I thought that that was required for AutoIt's regex. Here is one string that will be evaluated:

    <tu>
    <tuv xml:lang="NL-NL">
    <seg><cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"><cf size="32">Van bachelor naar master</cf><cf size="28"> <br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/></seg>
    </tuv>
    <tuv xml:lang="EN-GB">
    <seg><cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"><cf size="32">From Bachelor's to Master's</cf><cf size="28"> <br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/></seg>
    </tuv>
    </tu>

The file I'm trying to process is an XML file with invalid characters in it. I want to save as much of the file as possible while removing the invalid characters. Perhaps there is a freeware program somewhere on the internet that can do it too.

Added: Just in case the example is confusing, let me say that the hexadecimal colour codes in that text are not the characters that I'm trying to match. I'm trying to match individual characters. The above text contains only valid characters, but some of the strings may contain invalid ones.

AZJIO Posted March 18, 2013

leuce,

[a-zA-Z] ???
[a-fA-F] Yes?
[a-fA-F0-9]+
(?i)[0-9A-F]+

x10000-x10FFFF ???
x[0-9A-Fa-f]+ Yes?

Melba23 (Moderators) Posted March 18, 2013

leuce,

This appears to do what you want, but I am not totally confident as I have never used the multiple-digit hex pattern before. Note that I have reversed the logic: if you get a match then there is an unwanted character in the string; no match means that the string is good. That should allow you to use it as a RegExpReplace pattern to strip the unwanted characters if you decide to go that way:

    (?i)(\x{000[12345678bcef]}|\x{001\d}|\x{d[89abcdef]\d\d}|\x{fff[ef]})

(?i) - Case insensitive
( | | | | | ) - Look for any of these alternatives, which are:
\x{000[12345678bcef]} - Any 000# character other than 0009, 000A, 000D
\x{001\d} - Any 001# character
\x{d[89abcdef]\d\d} - Any D8## to DF## character
\x{fff[ef]} - FFFE and FFFF

Give it a try and see how you get on.

M23

leuce (Author) Posted March 18, 2013

> Note that I have reversed the logic - if you get a match then there is an unwanted character in the string; no match means that the string is good.

I was actually thinking the same thing: then I don't have to match the entire string, only one character in the string.

The curly brackets are what I was after. I did not know exactly how to write the hex characters in a regular expression. I had thought that AutoIt would treat a hexadecimal character as a single unit in regular expressions, so I'm surprised to learn that something like "x{fff[ef]})" is possible.

Do you know if these regular expressions work on all files, or only on UTF8 files?

Thanks
Samuel

jchd Posted March 18, 2013

leuce,

FYI, whatever text encoding is used, once you load text into AutoIt strings the data is converted to a subset of UTF-16LE called UCS-2. It's (roughly) the restriction of Unicode to its plane 0, i.e. codepoints that fit into a single 16-bit representation. Hence the extra range x10000-x10FFFF is irrelevant to AutoIt (and, honestly, I seriously doubt you would encounter such codepoints in real-world data). Also, surrogates will end up being remapped to an invalid character in the conversion process occurring during file read.

The following pattern will match any invalid or excluded codepoint:

    [^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

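A minimal sketch of how the negated character class above might be used, both to detect and to strip invalid characters. The sample strings and variable names ($sGood, $sBad, $sClean) are made up for illustration, and the sketch assumes the AutoIt regex engine accepts the \x{....} escapes as jchd describes.

```autoit
; Hypothetical sample data: ChrW(0x1A) is the control character U+001A,
; which is not a valid XML 1.0 character.
Local $sGood = "Van bachelor naar master"
Local $sBad = "Van bachelor" & ChrW(0x1A) & " naar master"

; jchd's pattern: any character that is NOT tab, LF, CR,
; U+0020 to U+D7FF, or U+E000 to U+FFFD.
Local $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"

; In its default mode StringRegExp returns 1 if the pattern matches, 0 if not.
ConsoleWrite("Good string has invalid characters: " & StringRegExp($sGood, $sInvalid) & @CRLF)
ConsoleWrite("Bad string has invalid characters:  " & StringRegExp($sBad, $sInvalid) & @CRLF)

; Remove every invalid character; @extended holds the number of replacements made.
Local $sClean = StringRegExpReplace($sBad, $sInvalid, "")
ConsoleWrite("Removed " & @extended & " character(s), result: " & $sClean & @CRLF)
```

Because the test only needs one offending character to match, it does not have to anchor the pattern or consume the whole string.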
leuce (Author) Posted March 18, 2013

> x10000-x10FFFF ???

The reason for that notation is that I had thought that AutoIt would treat the hexadecimal character as a unit within regex. In other words, that it would treat "x10000" as a single entity and "x10FFFF" as a single entity, and that something like [x10000-x10FFFF] would be the same as something like [a-z].

Melba23 (Moderators) Posted March 18, 2013

leuce,

> I'm surprised to learn that something like "x{fff[ef]})" is possible

To be honest, so was I! The curly brackets are described on the Help file page for StringRegExp. You will need to ask someone like jchd about the encoding; that is well above my hobbyist level.

M23

Edit: I see he has already responded and given you a much more compact pattern.

jchd Posted March 18, 2013

That's correct, but you need to use the correct hex syntax used in PCRE: either \x** or \x{******}, where some of the asterisks (shown in bold red in the original post) are optional.

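As a sketch of how the originally requested $sValid variable could be written with that hex syntax, the following reuses the whole-string test from the first post. The two array items are hypothetical test data, and the supplementary range (x10000-x10FFFF) is left out on the assumption, stated earlier in the thread, that AutoIt strings are limited to 16-bit codepoints.

```autoit
; Valid XML 1.0 characters within the 16-bit range, written with PCRE hex escapes.
Local $sValid = "\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}"

; Hypothetical test data; ChrW(0x0B) is a vertical tab, which XML 1.0 does not allow.
Local $aArray[2] = ["From Bachelor's to Master's", "From Bachelor's" & ChrW(0x0B) & " to Master's"]

For $i = 0 To UBound($aArray) - 1
    ; Same test as in the first post: the whole string must consist of valid characters.
    If StringRegExp($aArray[$i], "\A[" & $sValid & "]*\Z") Then
        ConsoleWrite("Item " & $i & ": only valid characters" & @CRLF)
    Else
        ConsoleWrite("Item " & $i & ": contains at least one invalid character" & @CRLF)
    EndIf
Next
```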
jchd Posted March 18, 2013 (edited)

Sorry to add to myself. Note that while "x{fff[ef]})" is indeed possible either by itself or within alternation (like "abc|def|x{fff[ef]})"), the syntax won't work inside a character class (inside square brackets).

Also, alternation is much slower than a character class: alternation works on complete sub-expressions while character classes work on individual characters.

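To illustrate the difference between the two forms, here is a small sketch that matches the same two codepoints once with a character class and once with alternation. The test string is invented for the example, and no timing is measured here; the speed remark is jchd's, not something this snippet demonstrates.

```autoit
; Hypothetical test string containing U+FFFE, one of the excluded codepoints.
Local $sTest = "abc" & ChrW(0xFFFE) & "def"

; Character class: each character is compared against the set.
Local $sClass = "[\x{FFFE}\x{FFFF}]"

; Alternation: each branch is tried as a separate sub-expression.
Local $sAlt = "\x{FFFE}|\x{FFFF}"

; Both calls should report the same result for the same input.
ConsoleWrite("Class match:       " & StringRegExp($sTest, $sClass) & @CRLF)
ConsoleWrite("Alternation match: " & StringRegExp($sTest, $sAlt) & @CRLF)
```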
Melba23 (Moderators) Posted March 18, 2013

jchd,

Nice pattern - I did not realise that you could do this:

    \x20-\x{D7FF}

Learning point for today.

M23

czardas Posted March 18, 2013

I have to agree, very informative jchd.

leuce (Author) Posted March 18, 2013 (edited)

Thanks, jchd, for the expression.

Perhaps one of you can tell me what is wrong with my script, because I know for a fact that there is at least one x1A character in the file (possibly more, or other non-valid characters), but the script doesn't catch it. I know that there is an x1A character because I've seen it (it took a while to track it down in my 350 MB XML file, but it is there).

My script is this:

    #cs

    TMX Fixer (per-segment)

    1. Read input file (TMX), then split by </tu>.
    2. For each array item:
    2.1 If it contains an invalid character, write it to an error file (TXT, one file per error).
    2.2 If it does not contain an invalid character, write it to the output file (TMX).

    Note: splitting by </tu> means that the head and the first TU are both in array item 1, but we assume that there are no invalid characters in the head or in the first TU.

    #ce

    MsgBox (0, "TMX Fixer (per-segment)", "The per-segment version of TMX Fixer examines every TU (aka translation unit, aka segment) individually and saves only segments that contain no invalid characters to the output TMX file, and saves removed segments with invalid characters to separate error files.", 0)

    $pathtotmxfile = FileOpenDialog ("Select TMX file", @ScriptDir, "TMX (*.tmx)|All files (*.*)")
    $tmxfileopen = FileOpen ($pathtotmxfile, 32)
    $tmxfileread = FileRead ($tmxfileopen)
    MsgBox (0, "TMX file read", @extended & " characters were read.", 0)

    $tmxfilearray = StringSplit ($tmxfileread, "</tu>", 1)
    MsgBox (0, "TMX file split", "TMX was split into approximately " & $tmxfilearray[0] - 1 & " translation units.", 0)

    $outputtmxfileopen = FileOpen ($pathtotmxfile & "_output.tmx", 34)

    Global $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
    ; Global $sInvalid = "[\x00-\x08\x0B\x0C\x0E-\x1F\x{FFFE}\x{FFFF}]"

    For $i = 1 to $tmxfilearray[0]
        If StringRegExp ($tmxfilearray[$i], $sInvalid) Then
            ; $tmxfilearray[$i] = StringRegExpReplace ($tmxfilearray[$i], $sInvalid, "!!!$1!!!") ; when the rest works
            $roguefileopen = FileOpen ($pathtotmxfile & "_broken segment_" & $i, 34)
            FileWrite ($roguefileopen, $tmxfilearray[$i] & "</tu>")
            FileClose ($roguefileopen)
        Else
            FileWrite ($outputtmxfileopen, $tmxfilearray[$i] & "</tu>")
        EndIf
        If IsInt ($i/10000) Then
            ToolTip ("Currently at unit " & $i)
        EndIf
    Next

czardas Posted March 18, 2013

Are you sure you are reading all of the file? 350 MB is big and I'm not sure if StringSplit can handle this, although I would expect some out-of-memory message.

leuce (Author) Posted March 18, 2013 (edited)

> Are you sure you are reading all of the file? 350MB is big and I'm not sure if StringSplit can handle this. Although I would expect some out of memory message.

It reads it all (that's why the script tells the user how many characters are read, etc.). The TMX file has about 650 000 </tu> tags in it, and the script reports that number to the user. The script reports about 185 million characters for both a UTF16LE file (350 MB) and a UTF8 file (180 MB), which sounds about right.

Running the whole script takes about a minute (I'm not sure how much the fact that I have 6 GB RAM on a quad-core 64-bit computer has to do with it). Anyway, the first x1A character occurs at string number 115003, so the script should write an error file by then.

Could it be that the script is running too fast and therefore "misses" the match? That would be very odd...
