element72, posted September 23, 2015

Say I'm looking at an HTML source that somewhere contains

    <td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>

I want to capture fox (or cat), but only between <td> and </td>. I just keep getting it wrong. My best effort is

    $array = StringRegExp($HTML, "(?i)<td>([(cat)(fox)]?)</td>")

ViciousXUSMC, posted September 23, 2015

There are better ways to parse HTML than regex, but this is the first way I got working. The regex pros are sure to come post some better ways.

    $sString = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"
    $sNewString = StringRegExpReplace($sString, "(?i)<td>(.*?)</td>", "$1")
    MsgBox(0, "", $sNewString)

element72, posted September 23, 2015

Say I have many different symbols and characters between <td> and </td>, but all I want to find out is whether fox or cat appears between them, while ignoring everything else.

ViciousXUSMC, posted September 23, 2015 (edited)

Yeah, I thought you wanted everything between the <td> tags, so I need to fix it. Almost, but not quite: it will not pick up the repeats when a keyword occurs more than once between <td> and </td>.

    #Include <Array.au3>
    $sString = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"
    $aNewString = StringRegExp($sString, "(?i)<td>.*?(cat|fox).*?</td>", 3)
    _ArrayDisplay($aNewString)

element72, posted September 23, 2015 (edited)

This should be possible with StringRegExp(), right? Or can it be done with a different function, such as the IE UDF?

element72, posted September 24, 2015

Help me... please. Still can't figure it out...

    StringRegExp($HTML, "(?i)<td>([(fox)(cat)(.*)][(fox)(cat)(.*)][(fox)(cat)(.*)]?)</td>",3)

jguinch, posted September 24, 2015

This?

    #Include <Array.au3>
    $HTML = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"
    $array = StringRegExp($HTML, "(?i)<td>[^<]*(cat|fox)[^<]*</td>", 3)
    _ArrayDisplay($array)

ViciousXUSMC, posted September 24, 2015 (edited)

It's missing the last set, where both cat and fox are in a single string between <td> and </td>. I tried looking online for how to "repeat" the capture but could not quite figure it out. I assume some kind of back reference would be needed. I've been eyeing this thread myself since I could not figure it out. The closest I found was this, but none of it worked: http://www.regular-expressions.info/captureall.html

jguinch, posted September 24, 2015

Sorry, I didn't see that the last element contains two results. Try this one:

    #Include <Array.au3>
    $HTML = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"
    $aRes = StringRegExp($HTML, "(?i)(fox|cat)(?=[^<]*<\/td>)", 3)
    _ArrayDisplay($aRes)

ViciousXUSMC, posted September 24, 2015

Works, not sure how, lol.

    (?i)              case insensitive
    (fox|cat)         match and capture fox or cat
    (?=[^<]*<\/td>)   look ahead and match, but do not capture: any non-< character 0 or more times, then </td> once

I just do not understand how this one picks up the duplicate cat/fox matches when the other regex was not able to. I'll save this for my notes, though.

mikell, posted September 24, 2015 (edited)

ViciousXUSMC,
Hehe

    "(fox|cat)[^<]*</td>"

=> matches a string containing fox|cat AND zero or more non-< chars up to </td>. When done, the search continues after the </td>.

    "(fox|cat)(?=[^<]*<\/td>)"

=> matches fox|cat followed by 0 or more non-< chars and </td>. When done, the search continues after the match.

It's the magic of the lookahead, which is a zero-length assertion.
BTW, [^<]* is used to check that there is no HTML tag between fox|cat and </td>.

Edit: Note that you can also do it like this:

    "(?i)(?=(fox|cat)[^<]*</td>)"

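The difference mikell describes can be seen by running both patterns side by side. The following is only a minimal sketch using the sample string from this thread; the variable names ($sHTML, $aConsuming, $aLookahead) are made up for illustration.

```autoit
#include <Array.au3>

; Sample input taken from the thread.
Local $sHTML = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"

; Consuming pattern: each match swallows everything up to and including </td>,
; so the search resumes after the cell and a second keyword in that cell is skipped.
Local $aConsuming = StringRegExp($sHTML, "(?i)(fox|cat)[^<]*</td>", 3)

; Lookahead pattern: only the keyword itself is consumed, so the search
; resumes right after the keyword and can find another one in the same cell.
Local $aLookahead = StringRegExp($sHTML, "(?i)(fox|cat)(?=[^<]*</td>)", 3)

_ArrayDisplay($aConsuming, "Consuming: one hit per cell")
_ArrayDisplay($aLookahead, "Lookahead: every hit")
```

With this input, the consuming pattern should return three captures (one per cell) and the lookahead pattern four, because the last cell contains both CAT and fox.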
ViciousXUSMC, posted September 24, 2015

Thumbs up! I tested it on regex101 and, yes, the way you explained it makes sense. I didn't think you could match like that, only looking forward and ignoring what is behind, but it's a clever solution.

mikell, posted September 24, 2015

Hmm, yes, but as usual in such topics about regex, the OP doesn't provide enough details about the source... thus the regex fails if there are HTML formatting tags inside the cell:

    <td><b>CATdss foxdf sdf</b></td>

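The failure mikell points out can be reproduced with a short check. This is a sketch only, using the lookahead pattern posted earlier in the thread; the variable names and the console messages are illustrative. It relies on StringRegExp setting @error when flag 3 finds no match.

```autoit
; Cell content wrapped in a formatting tag, as in mikell's example.
Local $sHTML = "<td><b>CATdss foxdf sdf</b></td>"

; [^<]* cannot cross the "<" of </b>, so the lookahead never reaches </td>.
Local $aRes = StringRegExp($sHTML, "(?i)(fox|cat)(?=[^<]*</td>)", 3)

If @error Then
    ConsoleWrite("No match: a nested tag sits between the keyword and </td>" & @CRLF)
Else
    ConsoleWrite("Matches found: " & UBound($aRes) & @CRLF)
EndIf
```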
jguinch, posted September 24, 2015

Yes, for me this kind of regex should be done in two passes for more flexibility:

    #Include <Array.au3>
    $HTML = "<td>aaaa ©FOXaaa ttt</td> <td>aaacat bbbb aaa</td> <td>CATdss foxdf sdf</td>"
    For $elem In StringRegExp($HTML, "<td>(.+?)</td>", 3)
        $aRes = StringRegExp($elem, "(?i)(cat|fox)", 3)
        If IsArray($aRes) Then _ArrayDisplay($aRes)
    Next

element72, posted October 1, 2015

mikell wrote:
> Hmm, yes, but as usual in such topics about regex, the OP doesn't provide enough details about the source... thus the regex fails if there are HTML formatting tags: <td><b>CATdss foxdf sdf</b></td>

Good point. How would I solve this kind of problem? Or how would I include that in the regexp?

mikell, posted October 1, 2015

    #Include <Array.au3>
    $HTML = "<td>aaaa <b>©FOX</b>aaa ttt</td> <td>aaacat bbbb aaa</td> <td> <i><strong>CAT</i></strong>dssfox df sdf</td>"
    $aRes = StringRegExp($HTML, "(?i)(fox|cat)(?=.*?</td>)", 3)
    _ArrayDisplay($aRes)

Using 2 passes, as jguinch said, could be a more robust solution though; it should be tested.
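
One way to try the two-pass idea on input with nested tags is sketched below. It is an untested-in-the-thread variation on jguinch's loop, not a confirmed solution: the (?s) option and the @error guard are additions, and the variable names are illustrative. Note that the second pass would also match "cat" or "fox" inside a tag name or attribute within the cell.

```autoit
#include <Array.au3>

; Sample input with nested formatting tags, taken from mikell's post.
Local $sHTML = "<td>aaaa <b>©FOX</b>aaa ttt</td> <td>aaacat bbbb aaa</td> <td> <i><strong>CAT</i></strong>dssfox df sdf</td>"

; Pass 1: capture the raw content of each cell.
; (?s) lets the dot also match line breaks in multi-line HTML.
Local $aCells = StringRegExp($sHTML, "(?is)<td>(.+?)</td>", 3)

If Not @error Then
    For $sCell In $aCells
        ; Pass 2: find every keyword inside this one cell, nested tags or not.
        Local $aHits = StringRegExp($sCell, "(?i)(cat|fox)", 3)
        If IsArray($aHits) Then _ArrayDisplay($aHits, "Hits in one cell")
    Next
EndIf
```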