hawkair Posted August 3, 2023

Hi

I have a text like this ($txt =):

    <div class="titlereference-overview-section">
    Directors:
    <ul class="ipl-inline-list">
    <li class="ipl-inline-list__item">
    <a href="/name/nm8681530">John Smith</a>,
    <a href="/name/nm8681530">Jim </a>,
    <a href="/name/nm8681530">Jack</a>
    </li>
    <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more »</a>
    </li>
    </ul>
    </div>
    <div class="titlereference-overview-section">
    Writers:
    <ul class="ipl-inline-list">
    <li class="ipl-inline-list__item">
    <a href="/name/nm8681530">Kirsten</a>,
    <a href="/name/nm8681530">Jessica</a>,
    <a href="/name/nm8681530">Maya</a>
    </li>
    <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more »</a>
    </li>
    </ul>
    </div>
    <div class="titlereference-overview-section">
    Stars:
    <ul class="ipl-inline-list">
    <li class="ipl-inline-list__item">
    <a href="/name/nm0001772">Patrick Stewart</a>,
    <a href="/name/nm0403335">Michelle Hurd</a>,
    <a href="/name/nm0005394">Jeri Ryan</a>
    </li>
    <li class="ipl-inline-list__item">
    <a href="/title/tt8806524/fullcredits" class=>See more »</a>
    </li>
    </ul>

I want to get the writers. This code

    $aWriters = StringRegExp($txt, '<a href="/name/nm.*?">([^<]*)</a>', 3)

gets all names. This code

    #include <Array.au3> ; needed for _ArrayToString
    ; To check quickly, copy the text then run the code
    $txt = ClipGet()
    $txt = StringRegExpReplace($txt, "(?s)^.*Writers", "")
    $txt = StringRegExpReplace($txt, "(?s)</ul>.*", "")
    $aWriters = StringRegExp($txt, '<a href="/name/nm.*?">([^<]*)</a>', 3)
    MsgBox(262144, "Writers", _ArrayToString($aWriters, ","))

deletes the text before "Writers" and everything after the end of the Writers section, then gets all the writers' names.

Note that the "Stars" section may not always follow "Writers".

Can I do this with a single RegExp command?
mikell Posted August 3, 2023 (edited) Solution

You can remove all the unwanted parts with a single StringRegExpReplace (SRER):

    $txt = ClipGet()
    $s = StringRegExpReplace($txt, '(?s)^.*Writers(.*?/nm\d+">)|\R<a(?1)|</a>|\s+</li>.*$', "")
    MsgBox(0, "", $s)

Edit: It's a cute challenge but, IMHO, your multipart solution is somehow more versatile.

Edit2: Much nicer, this gets the names in a 1D array (and BTW it is a better answer to the question in the topic title):

    #include <Array.au3> ; needed for _ArrayDisplay
    $txt = ClipGet()
    $aWriters = StringRegExp($txt, '(?s)(?:.*?Writers|\G(?!</a>\s+</li>)).*?/nm\d+">([^<]*)', 3)
    _ArrayDisplay($aWriters)

Edited August 3, 2023 by mikell
hawkair Posted August 4, 2023 (edited) Author

mikell, thank you. It works exactly as I want. I have no words...

Now I can merrily go off into my cave and have fun figuring out how it does it.

Edit: I used Google Translate and the AutoIt help file as a dictionary and got the following:

    StringRegExp($txt, '(?s)(?:.*?Writers|\G(?!</a>\s+</li>)).*?/nm\d+">([^<]*)', 3)

- Find all text up to "Writers" (not captured),
- or, starting at this position (\G), continue only while what follows is not '</a>\s+</li>',
- then comes the actual pattern to match: '.*?/nm\d+">([^<]*)'

Edited August 4, 2023 by hawkair
mikell Posted August 4, 2023 (edited)

4 hours ago, hawkair said:
> Now I can merrily go off into my cave and have fun figuring out how it does it

Sorry, I didn't comment on this \G magic.

The definition of \G in the helpfile is not very clear, better look at this one, especially "\G matches at the end of the previous match".

The conditions in the title question, 'start matching after "String1" and stop matching after "String2"', are defined in the two parts of the alternation:

    StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!</a>\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

How it works:
- using the left part of the alternation and the final pattern, the regex runs up to 'Writers', then searches for and finds "Kirsten"
- then, using the right part of the alternation, \G matches right after "Kirsten"; the assertion is true, so the regex resumes searching and finds "Jessica"
- \G matches right after "Jessica"; in the same way the regex keeps on searching and finds "Maya"
- \G matches right after "Maya", but at this position the condition is no longer fulfilled (the text that follows is '</a>\s+</li>'), so the match attempt fails and the function returns the results collected so far

Edited August 4, 2023 by mikell: typo(s)
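The walkthrough above can be tried without the clipboard. This is only a rough sketch: `$sHtml` is a hypothetical, trimmed copy of the OP's subject (the `<div>`/`<ul>` wrappers and the "See more" links are left out), and the pattern is the one from mikell's post. Given the steps described above, it should print Kirsten, Jessica, Maya.

```autoit
#include <Array.au3> ; for _ArrayToString

; Hypothetical reduced subject: each section ends with </a> + line break + </li>
Local $sHtml = 'Directors:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm8681530">John Smith</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Jack</a>' & @CRLF & '</li>' & @CRLF & _
        'Writers:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm8681530">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Jessica</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Maya</a>' & @CRLF & '</li>' & @CRLF & _
        'Stars:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm0001772">Patrick Stewart</a>' & @CRLF & '</li>'

; Left branch: jump to "Writers" (first match only).
; Right branch: \G resumes at the end of the previous match,
; unless the list item closes right there.
Local $sPattern = '(?sx) (?: ^.*?Writers | \G (?!</a>\s+</li>) ) .*?/nm\d+">([^<]*)'

Local $aWriters = StringRegExp($sHtml, $sPattern, 3) ; 3 = array of global matches
If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ConsoleWrite(_ArrayToString($aWriters, ", ") & @CRLF)
EndIf
```

Checking `@error` right after the call matters here, because with flag 3 a failed search does not return an array.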
pixelsearch Posted August 4, 2023

@mikell very nice! Yesterday, I really felt you'd come back with your Edit2 to suggest a solution with \G or similar. jchd once wrote that he should think more about this \G thing.

As you wrote, the \G explanation isn't really clear in the help file; that's why it took me time (with your help) to write the "pseudo help file" entry in RegExp Quick Tester, especially as it had to be a short one-liner explanation:

[image: RegExp Quick Tester screenshot]

In the previous pic, as a writer can be named... "Writers" (I found some people named "Writers" on Google!), I added some tests in the left part of the alternation, e.g. writers:\s+ instead of writers, no big deal. Note that the order of the tests in the alternation is important: writers: first, on the left side of the alternation, and \G on the right side.

Now if you don't mind, I have 2 questions:

1) In case "Writers:" isn't found in the subject, can we add something in the right part of the alternation so that the regex engine returns nothing? Because currently, if you change "Writers:" to "Wrs:" in the subject, for example, then this is returned (with OP's subject and your pattern, or in my previous pic), and it would be better to avoid it:

    John Smith
    Jim
    Jack

I don't think we can add a positive lookbehind (e.g. requiring "Writers:" to be found before each \G match), because a lookbehind doesn't accept a variable-length pattern. Maybe \K or something else? If nothing can be done easily, then question 2 may bring the answer:

2) A test shows that a positive lookahead (instead of the negative lookahead found in your pattern or in the pic above) solves this kind of situation... but I don't understand why:

    (?is)(?:.*?writers:\s+|\G</a>(?=,)).*?/nm\d+">([^<]*)

With this positive lookahead, if "Writers:" isn't found in the subject, then nothing is returned (which is a good thing!).

So the 2nd question is: when "Writers:" isn't found in the subject, why does the negative lookahead return results (which are confusing) when the positive lookahead doesn't return anything (which seems more correct)?

Thanks for reading
mikell Posted August 4, 2023

2 hours ago, pixelsearch said:
> In case "Writers:" isn't found in the subject

Ahhh yesss, I didn't pay attention to this.

The answer to your questions is written in the definition of this nice \G spot: "matches at the beginning of the subject string OR at the end of the previous match".

To solve the problem here, we just have to prevent \G from matching at the beginning of the string:

    StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!\A|</a>\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

So the answer to the 2nd question becomes obvious now: in your positive-lookahead pattern, \G must be followed by </a>, which is never the case at the beginning of the string.

BTW I still prefer the negative lookahead, which allows defining a limit as the OP asked for.
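A small comparison of the two patterns on a subject without a "Writers" label may make the effect of `\A` easier to see. This is a sketch under assumptions: `$sHtml` is a hypothetical trimmed subject, `_Names` is a helper invented for this example, and the label is renamed to "Wrs" as pixelsearch suggested. Following the thread, the unguarded pattern should report the directors and the guarded one should report no match.

```autoit
#include <Array.au3> ; for _ArrayToString

; Hypothetical reduced subject with the label already renamed to "Wrs:"
Local $sHtml = 'Directors:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm8681530">John Smith</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Jack</a>' & @CRLF & '</li>' & @CRLF & _
        'Wrs:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm8681530">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Maya</a>' & @CRLF & '</li>'

; Without the guard, \G also matches at offset 0 of the subject
Local $sUnguarded = '(?sx) (?: ^.*?Writers | \G (?!</a>\s+</li>) ) .*?/nm\d+">([^<]*)'
; With \A inside the negative lookahead, the \G branch is rejected at offset 0
Local $sGuarded = '(?sx) (?: ^.*?Writers | \G (?!\A|</a>\s+</li>) ) .*?/nm\d+">([^<]*)'

ConsoleWrite("unguarded: " & _Names($sHtml, $sUnguarded) & @CRLF)
ConsoleWrite("guarded:   " & _Names($sHtml, $sGuarded) & @CRLF)

; Hypothetical helper: run a global match and join the captures
Func _Names($sSubject, $sPattern)
    Local $aNames = StringRegExp($sSubject, $sPattern, 3)
    If @error Then Return "(no match)"
    Return _ArrayToString($aNames, ", ")
EndFunc
```

The helper reads `@error` immediately after `StringRegExp`, since any later function call would reset it.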
pixelsearch Posted August 4, 2023

Well done mikell, that \A| is really cool! We'll have to remember to always use it, to make sure no false result is ever returned when the "string to search" is not found in the subject:

    (StringtoSearch|\G(?!\A|...))

If I'm not mistaken, what "saves" us in OP's subject is the fact that there is no comma after Maya (the last writer), while there is always a comma after each preceding writer (Kirsten & Jessica). If a comma followed Maya, then this would have been wrongly returned:

    Kirsten
    Jessica
    Maya
    Patrick Stewart
    Michelle Hurd
    Jeri Ryan

But well, in that case you surely would have found another working pattern.
mikell Posted August 4, 2023

1 hour ago, pixelsearch said:
> what "saves" us in OP's subject is the fact there is no comma after Maya

Not really. The purpose here was to find a correct way to define the limit where matching stops, and in this case it is defined by the whole subpattern </a>\s+</li>.

But well, if you include an optional comma in the pattern, then you can add a comma after Maya (or remove the other commas) in the subject string and it will still work:

    StringRegExp($txt, '(?sx) (?: ^.*?Writers | \G (?!\A|</a>,?\s+</li>) ) .*?/nm\d+">([^<]*)', 3)

Different requirements, different solutions.
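To see the optional comma at work, here is a hedged sketch with a hypothetical subject in which a comma has been added after Maya. `$sHtml` and the `_Names` helper are made up for the illustration; the two patterns are the ones quoted in the thread. Based on the posts above, the strict pattern should run on into the Stars section, while the `,?` variant should stop after Maya.

```autoit
#include <Array.au3> ; for _ArrayToString

; Hypothetical reduced subject: note the extra comma after Maya
Local $sHtml = 'Writers:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm8681530">Kirsten</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Jessica</a>,' & @CRLF & _
        '<a href="/name/nm8681530">Maya</a>,' & @CRLF & '</li>' & @CRLF & _
        'Stars:' & @CRLF & '<li>' & @CRLF & _
        '<a href="/name/nm0001772">Patrick Stewart</a>,' & @CRLF & _
        '<a href="/name/nm0005394">Jeri Ryan</a>' & @CRLF & '</li>'

; Stop condition requires </a> directly followed by whitespace and </li>
Local $sStrict = '(?sx) (?: ^.*?Writers | \G (?!\A|</a>\s+</li>) ) .*?/nm\d+">([^<]*)'
; Stop condition tolerates one comma between </a> and the whitespace
Local $sLenient = '(?sx) (?: ^.*?Writers | \G (?!\A|</a>,?\s+</li>) ) .*?/nm\d+">([^<]*)'

ConsoleWrite("strict:  " & _Names($sHtml, $sStrict) & @CRLF)
ConsoleWrite("lenient: " & _Names($sHtml, $sLenient) & @CRLF)

; Hypothetical helper: run a global match and join the captures
Func _Names($sSubject, $sPattern)
    Local $aNames = StringRegExp($sSubject, $sPattern, 3)
    If @error Then Return "(no match)"
    Return _ArrayToString($aNames, ", ")
EndFunc
```

The design point is that the stop condition lives entirely in the negative lookahead, so adapting to a different markup only means editing that one subpattern.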
pixelsearch Posted August 4, 2023

@mikell Thx for the explanation. While you're still there, here is what could be a complete explanation of my 2nd question from the post above (please correct me if I'm wrong), when Writers: isn't found in this subject:

    <a href="/name/nm8681530">John Smith</a>,
    <a href="/name/nm8681530">Jim </a>,
    <a href="/name/nm8681530">Jack</a>
    <a href="/name/nm8681530">Kirsten</a>,
    <a href="/name/nm8681530">Jessica</a>,
    <a href="/name/nm8681530">Maya</a>
    <a href="/name/nm0001772">Patrick Stewart</a>,
    <a href="/name/nm0403335">Michelle Hurd</a>,
    <a href="/name/nm0005394">Jeri Ryan</a>

Pattern with negative lookahead:

    (?is)(?:^.*?Writers:|\G(?!</a>\s+)).*?/nm\d+">([^<]*)

Result:

    John Smith
    Jim
    Jack

Negative lookahead: the left part of the alternation didn't match (Writers: wasn't found), so the position goes back to the beginning of the string before the right part of the alternation is processed. The negative lookahead is then True, because there is no </a> at the very beginning of the string, so the end of the pattern (outside the alternation) is processed and John Smith is a match. Now back to the right part of the alternation, the \G part, where the negative lookahead is True again (because there is no </a> followed by whitespace after John Smith; in fact the presence of the comma makes the negative lookahead True), and Jim is a match. Then Jack is a match and that's all, because the negative lookahead is now False (Jack is followed by </a> and whitespace; it's the lack of a comma that makes the negative lookahead False), and that's why there were 3 results.

Pattern with positive lookahead:

    (?is)(?:^.*?Writers:|\G</a>(?=,)).*?/nm\d+">([^<]*)

Result: None

Positive lookahead: it seems easier to explain. Same beginning: the left part of the alternation didn't match (Writers: wasn't found), so the position goes back to the beginning of the string before the right part of the alternation is processed. The right part fails immediately, as there is no </a> at the very beginning of the string, so the regex fails; the end of the pattern isn't processed, because neither branch of the alternation matched.

BTW, I wonder if the very 1st ^ in the pattern is mandatory; it seems to work the same with or without it (?)
mikell Posted August 5, 2023

14 hours ago, pixelsearch said:
> I wonder if the very 1st ^ in pattern is mandatory

It is not mandatory, but it is recommended. Rex says here: "the regex style guide recommends using anchors whenever possible, even when your regex would match without them"