Gianni, posted February 6, 2015

How would you remove everything between the < (less-than) and > (greater-than) brackets, brackets included, and leave only what is outside them?

For example, from the following piece of an HTML table, only the content inside the cell (Cell Two, marked in green in the original post) should remain, while everything enclosed in < and > pairs should be removed:

    <td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2"><font size="2" color="#000000" face="verdana"><b>Cell Two</b></font></td>

Thanks for any solution.

Jos (Developers), posted February 6, 2015 (marked as solution)

    StringRegExpReplace($YourString,"(?U)\<.*\>","")

Jos

MikahS, posted February 6, 2015

    StringRegExpReplace($sString, "(?U)(<.*>)", "")

Gianni (author), posted February 6, 2015

Wow! It seems to work very well! Thanks a lot, Jos.

Gianni (author), posted February 6, 2015 (edited February 6, 2015 by Chimp)

Wow, your version works great as well, MikahS. Thanks a lot to you too.

P.S. I marked Jos's post as the solution because he was faster. Many thanks to both.

MikahS, posted February 6, 2015

My pleasure.

jdelaney, posted February 6, 2015 (edited February 6, 2015 by jdelaney)

    $oDOMObj.innertext

Gianni (author), posted February 6, 2015

jdelaney wrote: $oDOMObj.innertext

Thanks jdelaney, but I'm working on a table extractor that starts from raw HTML, not from a browser or DOM objects. Thanks for the idea all the same.

Gianni (author), posted February 8, 2015 (edited February 8, 2015 by Chimp)

I'm back on this. The regexp above fails if the checked string is multiline (it contains @CR or @CRLF) and the opening and closing brackets are on different lines. For example, the following is not parsed correctly:

    <TD> Hello
    <IMG src="../images/icon.gif" alt=
    "Hello pic"></TD>

Instead of only the word Hello, the two lines below it also remain in the result. Could someone tell me how to modify the regexp posted above so that it also catches and deletes text enclosed between < and > when the two brackets are on different lines?

Thanks a lot.

iamtheky, posted February 8, 2015

    StringRegExpReplace(stringstripws($YourString, 8),"(?U)\<.*\>","")

Gianni (author), posted February 8, 2015 (edited February 8, 2015 by Chimp)

iamtheky wrote: StringRegExpReplace(stringstripws($YourString, 8),"(?U)\<.*\>","")

Thanks boththose, but I do not want to remove the @CR characters when they are outside the < and >. Done your way, every @CR is removed, including those outside the brackets. For example, in

    <TD>Hello
    Good morning
    <IMG src="../images/icon.gif" alt=
    "Hello pic"></TD>

the @CR between Hello and Good morning should remain. Is there a way?

Here is a simple reproducer to show the problem:

    Local $sHtml, $sHtml2
    $sHtml = '<TD>Hello'
    $sHtml &= @CRLF & 'Good morning'
    $sHtml &= @CRLF & '<IMG src="../images/icon.gif" alt='
    $sHtml &= @CRLF & '"Hello pic">'
    $sHtml &= @CRLF & ' </TD>'
    $sHtml2 = '<TD>Hello' & @CR & 'Good morning<IMG src="../images/icon.gif" alt="Hello pic"> </TD>'
    MsgBox(0, "string with < and > on different lines", $sHtml)
    MsgBox(0, "Parsed string", StringRegExpReplace($sHtml, "(?U)\<.*\>", "")) ; < and > on different lines, parse fails
    MsgBox(0, "string with < and > on same line", $sHtml2)
    MsgBox(0, "Parsed string", StringRegExpReplace($sHtml2, "(?U)\<.*\>", "")) ; < and > on same line, parse OK

SmOke_N (Moderators), posted February 8, 2015

Is this what you're looking for?

    StringRegExpReplace($sHtml2, "(?s)(<.*?>)(.*?)(<\s*/.*?>)", "$2")

mikell, posted February 8, 2015

You must use (?s) to allow the dot to match newlines:

    StringRegExpReplace($sHtml2, '(?s)<.*?>', "")
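
Editor's sketch, not part of the original post: the snippet below compares the lazy pattern with and without (?s) on a string shaped like Gianni's reproducer. It relies on the dot not matching line-break characters unless (?s) is set, which is standard PCRE behaviour; the variable names are illustrative only.

```autoit
; Test string shaped like the reproducer: the <IMG ...> tag spans two lines.
Local $sHtml = '<TD>Hello' & @CRLF & _
        'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & _
        '"Hello pic">' & @CRLF & _
        ' </TD>'

; Without (?s) the dot stops at a line break, so the tag that spans
; two lines is never matched and stays in the result.
Local $sNoDotAll = StringRegExpReplace($sHtml, '<.*?>', '')

; With (?s) the dot also matches @CR and @LF, so the lazy .*? can run
; across the line break up to the first closing >.
Local $sDotAll = StringRegExpReplace($sHtml, '(?s)<.*?>', '')

ConsoleWrite('Without (?s):' & @CRLF & $sNoDotAll & @CRLF & @CRLF)
ConsoleWrite('With (?s):' & @CRLF & $sDotAll & @CRLF)
```

Because only the text between < and > is consumed, the line break between Hello and Good morning is left untouched in both results.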

SmOke_N (Moderators), posted February 8, 2015

Ahh, I thought he wanted to keep the img one... By the way, as demonstrated above, you don't have to escape the angle brackets.
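
Editor's sketch, not part of the original post: in PCRE, < and > are ordinary characters, so the escaped and unescaped patterns should behave the same. The comparison below uses a made-up one-line test string.

```autoit
; Hypothetical test string, taken from the cell content in the first post.
Local $sCell = '<b>Cell Two</b>'

; Same pattern with and without backslashes before the angle brackets.
Local $sEscaped = StringRegExpReplace($sCell, '(?U)\<.*\>', '')
Local $sPlain = StringRegExpReplace($sCell, '(?U)<.*>', '')

; == is AutoIt's case-sensitive string comparison.
ConsoleWrite('Escaped:   ' & $sEscaped & @CRLF)
ConsoleWrite('Unescaped: ' & $sPlain & @CRLF)
ConsoleWrite('Identical: ' & ($sEscaped == $sPlain) & @CRLF)
```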

mikell, posted February 8, 2015 (edited February 8, 2015 by mikell)

May I add, I'm not a great fan of the (?U) option, because it makes ALL the + or * quantifiers in the expression lazy. By the way, the usual (and recommended) workaround is

    StringRegExpReplace($sHtml2, '<[^>]+>', "")

where [^>]+ means: one or more characters that are not ">".

Jan Goyvaerts explains this:

"In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: <[^>]+>. The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code. Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a script that you are writing."

http://www.regular-expressions.info/repeat.html
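
Editor's sketch, not part of the original post: the negated character class applied to a string shaped like Gianni's multiline reproducer. A negated class such as [^>] also matches line-break characters, so no (?s) option is needed. The second string is an assumed edge case that the thread does not discuss.

```autoit
; Test string shaped like the reproducer: the <IMG ...> tag spans two lines.
Local $sHtml = '<TD>Hello' & @CRLF & _
        'Good morning' & @CRLF & _
        '<IMG src="../images/icon.gif" alt=' & @CRLF & _
        '"Hello pic">' & @CRLF & _
        ' </TD>'

; [^>] matches any character except >, line breaks included, so the
; quantifier can stay greedy and still stop at the first closing >.
Local $sText = StringRegExpReplace($sHtml, '<[^>]+>', '')
ConsoleWrite($sText & @CRLF)

; Assumed edge case, not from the thread: a > inside an attribute value
; ends the match early and leaves part of the tag behind.
Local $sTricky = '<img alt="a > b">text'
ConsoleWrite(StringRegExpReplace($sTricky, '<[^>]+>', '') & @CRLF)
```

The edge case is one reason a regex like this suits quick clean-up of simple markup more than arbitrary HTML.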

Gianni (author), posted February 8, 2015

Thanks a lot mikell, it works great! Thanks also for the explanation (although I do not understand much of what you're talking about). Thanks again.