RichardL Posted October 23, 2022

Some help with a regex please. I want to select blocks of text and the pattern I have is doing that almost correctly. The problem is the first item also includes everything before that in the source.

    #include <Debug.au3>
    $sStr = "aBa¬aCa¬aCa¬aCa¬aCa¬aCa"
    $sPatn = "(?U)a.*C.*¬"
    $sAry = StringRegExp($sStr, $sPatn, 3)
    _DebugArrayDisplay($sAry)

Output:

    aBa¬aCa¬
    aCa¬
    aCa¬
    aCa¬

I thought (?U) made it not greedy, so it shouldn't do that? (The ¬ are replacing @CRLF from reading a file; I could use the original if easier.)

OJBakker Posted October 23, 2022

I cannot reproduce the output you are reporting. I have tested with the script below. The only changes are the @CRLF and a check for @error.

    #include <Debug.au3>
    $sStr = "aBa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa" & @CRLF & "aCa"
    $sPatn = "(?U)a.*C.*\r\n"
    $sAry = StringRegExp($sStr, $sPatn, 3)
    If @error Then MsgBox(Default, "ERROR", "@error:" & @error)
    _DebugArrayDisplay($sAry)

RichardL Posted October 23, 2022 Author

Yes, that works. Thanks.

jchd Posted October 23, 2022

The output of your original snippet is perfectly correct and expected: PCRE does exactly what you asked it to do. Let's see (I insert a bar | to denote where we are inside the subject and the pattern):

    |aBa¬aCa¬aCa¬aCa¬aCa¬aCa
    |(?U)a.*C.*¬

First the option is parsed and memorized:

    |aBa¬aCa¬aCa¬aCa¬aCa¬aCa
    (?U)|a.*C.*¬

Then:

    a|Ba¬aCa¬aCa¬aCa¬aCa¬aCa
    (?U)a|.*C.*¬

    aBa¬a|Ca¬aCa¬aCa¬aCa¬aCa
    (?U)a.*|C.*¬

    aBa¬aC|a¬aCa¬aCa¬aCa¬aCa
    (?U)a.*C|.*¬

    aBa¬aCa|¬aCa¬aCa¬aCa¬aCa
    (?U)a.*C.*|¬

    aBa¬aCa¬|aCa¬aCa¬aCa¬aCa
    (?U)a.*C.*¬|

First match found: aBa¬aCa¬

Remember that . (dot) doesn't match a line break in the example posted as answer. Yet it matches ¬ in your own example.

--
This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)
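
The difference between the two forms of the data can be checked with a small AutoIt sketch. It is not from the thread: the variable names are made up for illustration, and ChrW(0xAC) is used to build the ¬ character so that the script does not depend on the encoding of the source file. It assumes the default setting, where . does not match a newline sequence.

```autoit
; Build the placeholder character (U+00AC) without relying on the source encoding.
Local $sSep = ChrW(0xAC)

; The same six blocks, once joined with the placeholder and once with @CRLF.
Local $sPlaceholder = StringReplace("aBa|aCa|aCa|aCa|aCa|aCa", "|", $sSep)
Local $sRealBreaks = StringReplace("aBa|aCa|aCa|aCa|aCa|aCa", "|", @CRLF)

; The placeholder is an ordinary character, so . may run across it.
Local $aPlaceholder = StringRegExp($sPlaceholder, "(?U)a.*C.*" & $sSep, 3)

; With real line breaks, . stops at each newline (DotAll is off by default).
Local $aRealBreaks = StringRegExp($sRealBreaks, "(?U)a.*C.*\r\n", 3)

; Flag 3 returns an array only when something matched, so check before indexing.
If IsArray($aPlaceholder) Then ConsoleWrite("Placeholder, first match length: " & StringLen($aPlaceholder[0]) & @CRLF)
If IsArray($aRealBreaks) Then ConsoleWrite("Real breaks, first match length: " & StringLen($aRealBreaks[0]) & @CRLF)
```

Going by the outputs reported in the thread, the first placeholder match should span two blocks (aBa¬aCa¬), while the @CRLF version should start at the first aCa line.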

pixelsearch Posted October 23, 2022

Note: before posting what follows, I just saw @jchd answered while I was preparing my looong post. No offence jchd, I'm posting my answer as I wrote it, then I'll read your post, promised! And my apologies if I write erroneous comments below.

Hi everybody,
I'm a newbie at RegEx but anyway, let's try some comments and explore the preceding posts more deeply:

@RichardL
If you change your pattern from...

    $sPatn = "(?U)a.*C.*¬"

...to

    $sPatn = "(?U)a[^¬]*C.*¬"

...then the output should be correct, because it will return matches made of an "a", followed by any characters (except ¬), followed by "C" etc., all this being ungreedy. So the change from "a.*" to "a[^¬]*" should return a correct output.

@OJBakker
Glad you made it! It's interesting to experiment on your pattern to force it to return exactly the same issue as OP, by changing this...

    $sPatn = "(?U)a.*C.*\r\n"

...to that:

    $sPatn = "(?Us)a.*C.*\r\n"

Now it returns exactly the result OP indicated, because (?s) "Single-line or DotAll" was added!

From the AutoIt help file:

> By default, DotAll is off hence . does not match a newline sequence.

That's why your pattern worked: when "a.*" met the 1st "\n" in the string, it didn't match (as no "C" had been found yet), so the engine started to search for the 1st match after "\n".

As (?s) changes this behavior, . then matches a newline sequence and the output will be the same as OP's... who doesn't want this output at all.

Let's hope our RegEx gurus will add some nice comments, as they're used to.

By the way, I read this in the AutoIt help file, topic StringRegExp:

> Quantifiers (or repetition specifiers) specify how many of the preceding character, class, reference or group are expected to match.

As I didn't understand what "reference" meant in this sentence, I found this MS page, Quantifiers in Regular Expressions, where we can read:

> Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

They don't mention references. So what does this "reference" mean in the AutoIt help file, when applied to quantifiers?

Thanks... and now let's immediately read jchd's post.
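
The two pattern changes described in this post can be tried side by side with a short sketch. It is only an illustration: the variable names are hypothetical, the separator is built with ChrW(0xAC) to sidestep source-encoding issues, and _ArrayToString from Array.au3 is used purely for display.

```autoit
#include <Array.au3>

Local $sSep = ChrW(0xAC) ; the placeholder character (U+00AC)
Local $sPlaceholder = StringReplace("aBa|aCa|aCa|aCa|aCa|aCa", "|", $sSep)
Local $sRealBreaks = StringReplace("aBa|aCa|aCa|aCa|aCa|aCa", "|", @CRLF)

; Negated class: the stretch between "a" and "C" may not contain the separator.
Local $aFixed = StringRegExp($sPlaceholder, "(?U)a[^" & $sSep & "]*C.*" & $sSep, 3)

; (?s) lets . match newlines, so the first match can again start at "aBa".
Local $aDotAll = StringRegExp($sRealBreaks, "(?Us)a.*C.*\r\n", 3)

; StringRegExp returns an array only on success, so guard each use.
If IsArray($aFixed) Then ConsoleWrite("Negated class: " & _ArrayToString($aFixed, " | ") & @CRLF)
If IsArray($aDotAll) Then ConsoleWrite("DotAll, first match length: " & StringLen($aDotAll[0]) & @CRLF)
```

With the negated class, an attempt that starts at "aBa" runs into the separator before it finds a "C" and fails, so every match should be a single aCa block. With (?s) the first match is again free to start at "aBa", as described above.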

mikell Posted October 24, 2022 (edited)

> 16 hours ago, pixelsearch said:
> So what does this "reference" mean in AutoIt helpfile, when applied to quantifiers ?

Reference means back reference or subroutine call.
So says the bible:

> Repetition is specified by quantifiers, which can follow any of the following items:
> (...)
> a back reference (see next section)
> a parenthesized subpattern (including assertions)
> a subroutine call to a subpattern (recursive or otherwise)

Edited October 24, 2022 by mikell

pixelsearch Posted October 24, 2022 (edited)

Thanks mikell, it's an interesting link.

By the way, it seems to me that OP perhaps had something else in mind when he wrote:

> On 10/23/2022 at 2:25 PM, RichardL said:
> I thought (?U) made it not greedy, so it shouldn't do that?

Look at this simple example:

    #include <Debug.au3>
    $sStr = "1a2a3c4c5c6"

    $sPatn = "a.*c" ; greedy (default) returns a2a3c4c5c
    $sAry = StringRegExp($sStr, $sPatn, 3)
    _DebugArrayDisplay($sAry, "greedy")

    $sPatn = "(?U)a.*c" ; ungreedy (aka lazy) returns a2a3c
    $sAry = StringRegExp($sStr, $sPatn, 3)
    _DebugArrayDisplay($sAry, "ungreedy")

If a user expects to match "a3c" with the ungreedy pattern of this example, then it doesn't work. "(?U)a.*c" doesn't mean "As we are ungreedy, then anchor to the last 'a' found before 'c' and grab everything between this last 'a' and the 1st 'c' following it."

With this kind of pattern, no matter the greediness (on or off), the anchor is always done on the first 'a' found in the string; the length of the match then depends on the greediness (longer when on, shorter when off).

Please be kind enough to correct this explanation if it's wrong or obscure, thanks.

Edit: I got a pattern that returns "a3c" in this last example:

    $sPatn = "(?U)a[^a]*c" ; ungreedy returns a3c (yes !)

This one does behave like "As we are ungreedy, then anchor to the last 'a' found before 'c' and grab everything between this last 'a' and the 1st 'c' following it."

Edited October 24, 2022 by pixelsearch

mikell Posted October 24, 2022

> 52 minutes ago, pixelsearch said:
> the anchor is always done on the first 'a' found in the string

"the regex engine is eager to return a match" (Jan Goyvaerts)

jchd Posted October 24, 2022

This isn't exactly an anchor question, but simply a question of satisfying the pattern, backtracking and restarting the pattern after a failure.

    |1a2a3c4c5c6
    |(?U)a[^a]*c
Option is parsed once

    |1a2a3c4c5c6
    (?U)|a[^a]*c
1 doesn't match a in pattern

    1|a2a3c4c5c6
    (?U)|a[^a]*c
a matches

    1a|2a3c4c5c6
    (?U)a|[^a]*c
2 matches [^a]*

    1a2|a3c4c5c6
    (?U)a[^a]*|c
not followed by c in subject: pattern failed
backtrack to 2 and restart pattern from there

    1a|2a3c4c5c6
    (?U)|a[^a]*c
2 doesn't match a

    1a2|a3c4c5c6
    (?U)|a[^a]*c
a3c matches a[^a]*c: success

    1a2a3c|4c5c6
    (?U)a[^a]*c|
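
The walkthrough above suggests that the start of a match is decided by the first position where the whole pattern can succeed, while greediness only affects how far the match extends. A minimal sketch to check this on the same subject (the variable names are made up for illustration; the third pattern, without (?U), is an addition that does not appear in the thread):

```autoit
Local $sStr = "1a2a3c4c5c6"

; Each pattern is tried at position 0, then 1, and so on; the first position
; where the whole pattern can succeed decides where the match starts.
Local $aDotLazy = StringRegExp($sStr, "(?U)a.*c", 3)      ; . may cross the second "a"
Local $aClassLazy = StringRegExp($sStr, "(?U)a[^a]*c", 3) ; the class may not cross an "a"
Local $aClassGreedy = StringRegExp($sStr, "a[^a]*c", 3)   ; same start, greedy quantifier

; StringRegExp returns an array only on success, so guard each use.
If IsArray($aDotLazy) Then ConsoleWrite("(?U)a.*c    : " & $aDotLazy[0] & @CRLF)
If IsArray($aClassLazy) Then ConsoleWrite("(?U)a[^a]*c : " & $aClassLazy[0] & @CRLF)
If IsArray($aClassGreedy) Then ConsoleWrite("a[^a]*c     : " & $aClassGreedy[0] & @CRLF)
```

The first two results are the ones reported earlier in the thread (a2a3c and a3c). The greedy variant of the class pattern starts at the same second "a" but runs on to the last "c" it can reach without crossing another "a".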

pixelsearch Posted October 24, 2022

@jchd these last explanations about "satisfying the pattern, backtracking and restarting the [whole ?] pattern after a failure." were very interesting. The word "anchor" I used was surely inappropriate (as it's a "RegEx word") but you certainly understood what I was trying to explain.

Let's add some more details at the end of your explanations (just to look at the place of the vertical bars and prepare my question to come):

    ...
    1a2|a3c4c5c6
    (?U)|a[^a]*c
a matches

    1a2a|3c4c5c6
    (?U)a|[^a]*c
3 matches

    1a2a3|c4c5c6
    (?U)a[^a]*|c
c matches, so a3c matches a[^a]*c: success

    1a2a3c|4c5c6
    (?U)a[^a]*c|

Now let's try this on another example with a different subject ("33" instead of "3") and a different quantifier, {2} instead of *:

    #include <Debug.au3>
    $sStr = "1a2a33c4c5c6"
    $sPatn = "(?U)a[^a]{2}c" ; matches a33c (greedy or not)
    $sAry = StringRegExp($sStr, $sPatn, 3)
    If @error Then MsgBox(0, "StringRegExp", "@error:" & @error) ; error 1 = no matches
    _DebugArrayDisplay($sAry, "Result")

    ...
    1a2|a33c4c5c6
    (?U)|a[^a]{2}c
a matches

    1a2a|33c4c5c6
    (?U)a|[^a]{2}c
3 matches

    1a2a3|3c4c5c6
    (?U)a[^a]{2}|c
What now?

As I moved the vertical bar in pattern and subject for each step (as you did with a * quantifier), how would the engine continue now that the quantifier is {2}?
I mean, if we move the vertical bar after each character match when the quantifier is *, as in [^a]*, should it be different when the quantifier is {2}, as in [^a]{2}?

Sorry if the question looks too simple.

jchd Posted October 24, 2022

No problem. When the subject and pattern reach this state, everything before being the same as previously:

    1a2a|33c4c5c6
    (?U)a|[^a]{2}c

the pattern expects 2 characters that are not a, and 33 rightly matches that expectation.

    1a2a33|c4c5c6
    (?U)a[^a]{2}|c

Then the pattern requires c and we have a match as well, with the string a33c.

pixelsearch Posted October 24, 2022 (edited)

Thanks jchd.
So finally, it would be the same behavior as you just explained when "33" and * are used.

When the subject and pattern reach this state, everything before being the same as previously:

    1a2a|33c4c5c6
    (?U)a|[^a]*c

"The pattern expects 0 or more characters (not a) followed by c" and 33 rightly matches that expectation:

    1a2a33|c4c5c6
    (?U)a[^a]*|c

No vertical bar between 3's lol...
... or why not something like this during the possible "multicharacter checking phase", moving one vertical bar to the right (in the subject) but not the other vertical bar (in the pattern) until the checking phase ends:

    1a2a3|3c4c5c6
    (?U)a|[^a]*c

Glad we have you here.

Edited October 24, 2022 by pixelsearch (modified comment)

jchd Posted October 25, 2022

It's because when the pattern is compiled to low-level PCRE internal primitives, a sequence like [^a]{2}c will immediately (or so) detect partial match or failure, as a quasi block operation. That's why it isn't always possible to follow progression in the subject and pattern with bars like we can do in simple examples.

Also, PCRE uses by default a number of optimizations which in most use cases cut down the number of pointless backtracking steps.

Read the bible for details and the source for more gory details!

RichardL Posted October 28, 2022 Author

Well, thank you everyone. That's all fascinating and more than I was expecting. Sadly I won't remember it all but will bookmark it for the next time I'm using a regex.