pollop Posted January 9, 2011 (edited)

Hi,

I'm wondering whether it's possible to use a RegExp to "parse" HTML. Here's what I want to do. For example, I have the following text:

    <div> Blabla <div> <div> Blabla </div> </div> Blabla </div>

I'd like to get what's inside the outermost div with a regexp, i.e. something that would return:

    Blabla <div> <div> Blabla </div> </div> Blabla

I wrote a function that counts the opening <div> and closing </div> tags and keeps searching until the two numbers are equal, but I think a "simple" regexp would be much more efficient. What do you think?

Thanks a lot for your help, and sorry for my bad English.

Pollop.
Mat Posted January 9, 2011 (edited)

Unfortunately not. Some languages, like Lua, have pattern matching that can do this, but not AutoIt. If you know how many divs there are, or if it's always the first and last tags you want, then you could do it like this:

    StringRegExpReplace($sInput, "(?s).*?<div>(?:\r\n)*(.*)(?:\r\n)*</div>.*", "\1")

Alternatives are the XmlDomWrapper UDF that's on the forum somewhere, which uses MSXML and lets you run XPath queries, or sticking with what you already have.

Edit: Found the link to the XmlDomWrapper UDF.
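To make the "first and last tags" restriction concrete, here is a minimal sketch of that greedy approach. It uses a slightly simplified pattern of the same kind (the anchors and the sample strings are assumptions made for this illustration, not part of Mat's post) and shows why it only works when the outermost div is the only top-level div in the input.

```autoit
; Sketch only: greedy capture between the FIRST <div> and the LAST </div>.
; Assumes plain <div> tags without attributes.
Local $sPattern = "(?s)^.*?<div>(.*)</div>.*$"

; One outer div: the capture is its inner content.
Local $sNested = "<div> Blabla <div> <div> Blabla </div> </div> Blabla </div>"
ConsoleWrite("[" & StringRegExpReplace($sNested, $sPattern, "$1") & "]" & @CRLF)

; Two sibling divs: the greedy capture runs from the first div into the
; second one, so the result is no longer the content of a single div.
Local $sSiblings = "<div>one</div> text <div>two</div>"
ConsoleWrite("[" & StringRegExpReplace($sSiblings, $sPattern, "$1") & "]" & @CRLF)
```

With the sibling input the capture includes the first closing tag and the second opening tag, which is exactly the case where counting tags (or a recursive pattern) is needed instead.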
pollop Posted January 9, 2011 (Author)

Thanks a lot for the reply... I think I'm going to keep using my solution. Here it is, in case someone needs something like it:

    ; Returns the HTML found after $sStart, up to the closing tag that balances it.
    ; LogError() is not a built-in function; it is assumed to be a user-defined logging helper.
    Func HtmlBetween($sText, $sStart, $sEndTag = "</div>")
        Local $sCountUp
        Local $sCountDown
        Switch $sEndTag
            Case "</div>"
                $sCountUp = "<div"
                $sCountDown = "</div>"
            Case "</span>"
                $sCountUp = "<span"
                $sCountDown = "</span>"
            Case "</ul>"
                $sCountUp = "<ul"
                $sCountDown = "</ul>"
            Case "</li>"
                $sCountUp = "<li"
                $sCountDown = "</li>"
            Case "</a>"
                $sCountUp = "<a"
                $sCountDown = "</a>"
            Case Else
                LogError("Func HtmlBetween: Wrong type tag")
                Return False
        EndSwitch

        ; We begin by deleting what's before the start.
        Local $iStartPos = StringInStr($sText, $sStart)
        If $iStartPos = 0 Then
            LogError("Func HtmlBetween: Can't find the start")
            Return False
        EndIf
        $sText = StringTrimLeft($sText, $iStartPos + StringLen($sStart) - 1)

        ; We now search for the content. The start tag counts as the first opening tag.
        Local $iNumberUp = 1
        Local $iNumberDown = 0
        Local $iUp
        Local $iDown
        While $iNumberDown <> $iNumberUp
            ; Position of the next opening tag and the next closing tag not yet counted
            $iUp = StringInStr($sText, $sCountUp, 0, $iNumberUp)
            $iDown = StringInStr($sText, $sCountDown, 0, $iNumberDown + 1)
            If $iUp > 0 And $iUp < $iDown Then
                $iNumberUp += 1
            ElseIf $iDown > 0 Then
                $iNumberDown += 1
            Else
                LogError("Func HtmlBetween: Can't parse HTML, number of open tags != number of closing tags")
                Return False
            EndIf
        WEnd

        ; We get everything that's before the last closing tag
        Return StringLeft($sText, $iDown - 1)
    EndFunc
Mat Posted January 9, 2011 (edited)

You may want to have a look at this, which I wrote. It checks whether tags are opened and closed in the right order, but could easily be modified to do what you want. It needs a bit more error checking to detect when more tags are closed than opened or vice versa, but it works:

    #include <Array.au3>

    $s = '<a href="www.google.com"><span>This is a test</a></span>'
    MsgBox(0, $s, _HTML_Check($s))

    $s = '<a href="www.google.com"><span>This is a test</span></a>'
    MsgBox(0, $s, _HTML_Check($s))

    Func _HTML_Check($sString)
        ; $aStack[0] holds the number of open tags; the tag names follow it.
        Local $aStack[1] = [0]
        Local $sTemp, $sLast

        For $i = 1 To StringLen($sString)
            If StringMid($sString, $i, 1) = "<" Then
                ; Collect the tag name (including a leading "/" for closing tags)
                $sTemp = ""
                While 1
                    $i += 1
                    If $i > StringLen($sString) Or (Not StringIsAlNum(StringMid($sString, $i, 1)) And StringMid($sString, $i, 1) <> "/") Then ExitLoop
                    $sTemp &= StringMid($sString, $i, 1)
                WEnd
                ConsoleWrite($sTemp & @LF)

                If StringLeft($sTemp, 1) = "/" Then
                    ; Closing tag: it must match the most recently opened tag
                    $sTemp = StringTrimLeft($sTemp, 1)
                    $sLast = _ArrayPop($aStack)
                    $aStack[0] -= 1
                    If $sTemp <> $sLast Then Return SetError(1, 0, "Expected closing tag for '" & $sLast & "' tag. Got closing tag for '" & $sTemp & "' instead.")
                Else
                    ; Opening tag: push it onto the stack
                    If Not _HTML_IsTag($sTemp) Then Return SetError(1, 0, "Unrecognized tag: '" & $sTemp & "'")
                    _ArrayAdd($aStack, $sTemp)
                    $aStack[0] += 1
                EndIf
            EndIf
        Next

        Return "Success"
    EndFunc   ;==>_HTML_Check

    Func _HTML_IsTag($sTag)
        ; Add a switch or lookup and see if sTag is a proper tag.
        ; I just assume it is for now.
        Return True
    EndFunc   ;==>_HTML_IsTag

Edit: Just found this:
jchd Posted January 10, 2011 (edited)

I beg to differ with Mat's assertion that AutoIt PCRE can't do this. Using the pattern

    (?imsx) <div> ( ( (?>(?<=<div>).*(?=</div>)) | (?R) )+ ) </div>

and the input

    <html><div>ab<div>cd<div>abcd<div>cdef</div><div>cdef1</div> <div> Blabla <div> <div> Blabla </div> </div> Blabla </div> <div>cdef2</div>efgh</div>gh</div>ef</div></html>

you get the wanted result. AutoIt PCRE _does_ support recursion. Note that recursing with multi-character boundaries (like HTML opening/closing tag pairs) is less trivial than with single-character boundaries (e.g. parentheses), but it surely can be done.

I don't necessarily mean that the solution above is the best thing since sliced bread, but it does work.
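As a further illustration of PCRE recursion with multi-character delimiters, here is a hedged sketch using a different, commonly seen form of balanced pattern (it is not jchd's pattern). It assumes bare `<div>` tags without attributes, and the sample input is made up for this example. Each level matches either text containing no div tag or, through `(?R)`, a complete nested `<div>...</div>`.

```autoit
; Sketch only: balanced <div>...</div> matching via PCRE recursion.
; Assumption: tags are exactly "<div>" and "</div>" (no attributes).
Local $sInput = "intro <div> Blabla <div> <div> Blabla </div> </div> Blabla </div> <div>sibling</div>"

; (?s)                       dot also matches newlines
; <div>                      opening tag
; ( ... )                    group 1: content of the outer div
;   (?:(?!</?div>).)++       a run of characters that starts no div tag (possessive)
;   | (?R)                   or a whole nested div, matched by recursing into the pattern
; </div>                     the closing tag that balances the opening one
Local $sPattern = "(?s)<div>((?:(?:(?!</?div>).)++|(?R))*)</div>"

; Flag 1 returns an array with the captured groups of the first match.
Local $aMatch = StringRegExp($sInput, $sPattern, 1)
If @error Then
    ConsoleWrite("No balanced <div> found" & @CRLF)
Else
    ConsoleWrite("[" & $aMatch[0] & "]" & @CRLF)
EndIf
```

For this sample input the capture should be the content of the first outer div only, without the trailing sibling. Note that if the first `<div>` in the input is never closed, the engine moves on and matches a later, inner div instead, so unbalanced input still needs separate checking.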
Mat Posted January 10, 2011 (edited)

Yes, you're right. I was talking about balanced matching like Lua has with '%bxy', which matches something beginning with x and ending in y with the same number of each. I should have known that it would be possible to do it some other way. I'd still say that XmlDomWrapper.au3 is a better solution, but then I've never really liked using regex a lot.

Edit: I also worked a bit on my example:

    #include <Array.au3>
    #include <String.au3>

    $s = BinaryToString(InetRead("http://www.isup.me/autoitscript.com"))
    MsgBox(0, "www.google.com", _HTML_Check($s) & @CRLF & @extended)

    Func _HTML_Check($sString)
        ; $aStack[0] holds the number of open tags; the tag names follow it.
        Local $aStack[1] = [0]
        Local $sTemp, $sLast
        Local $iLine = 1 ; Current line number, returned in @extended on error

        $sString = StringStripCR($sString)

        For $i = 1 To StringLen($sString)
            If StringMid($sString, $i, 1) = @LF Then
                $iLine += 1
            ElseIf StringMid($sString, $i, 1) = "<" Then
                ; Collect the tag name (including a leading "/" for closing tags)
                $sTemp = ""
                While 1
                    $i += 1
                    If StringMid($sString, $i, 1) = @LF Then $iLine += 1
                    If $i > StringLen($sString) Or (Not StringIsAlNum(StringMid($sString, $i, 1)) And StringMid($sString, $i, 1) <> "/") Then ExitLoop
                    $sTemp &= StringMid($sString, $i, 1)
                WEnd

                If StringMid($sString, StringInStr($sString, ">", 1, 1, $i) - 1, 1) = "/" Then ; Self closing
                    If Not _HTML_IsTag($sTemp) Then Return SetError(1, $iLine, "Unrecognized tag: '" & $sTemp & "'")
                    ConsoleWrite(_StringRepeat("|", $aStack[0]) & "-" & $sTemp & @LF)
                    ContinueLoop
                EndIf

                If StringLeft($sTemp, 1) = "/" Then
                    ; Closing tag: it must match the most recently opened tag
                    $sTemp = StringTrimLeft($sTemp, 1)
                    If $aStack[0] = 0 Then Return SetError(1, $iLine, "Unexpected closing tag: '" & $sTemp & "'")
                    $sLast = _ArrayPop($aStack)
                    $aStack[0] -= 1
                    If $sTemp <> $sLast Then Return SetError(1, $iLine, "Expected closing tag for '" & $sLast & "' tag. Got closing tag for '" & $sTemp & "' instead.")
                ElseIf $sTemp = "" Then
                    ; No tag name after "<": nothing to track
                Else
                    ; Opening tag: print it indented by depth and push it onto the stack
                    If Not _HTML_IsTag($sTemp) Then Return SetError(1, $iLine, "Unrecognized tag: '" & $sTemp & "'")
                    ConsoleWrite(_StringRepeat("|", $aStack[0]) & "-" & $sTemp & @LF)
                    _ArrayAdd($aStack, $sTemp)
                    $aStack[0] += 1
                EndIf
            EndIf
        Next

        Return "Success"
    EndFunc   ;==>_HTML_Check

    Func _HTML_IsTag($sTag)
        ; Add a switch or lookup and see if sTag is a proper tag.
        ; I just assume it is for now.
        Return True
    EndFunc   ;==>_HTML_IsTag
jchd Posted January 10, 2011

I fully agree that parsing such input (especially HTML, where whitespace can appear almost anywhere) with regexps is not the best solution. For HTML, navigating the IE objects is probably the most robust way; after all, a browser engine is particularly well suited to parsing HTML.

There are nonetheless countless situations where using intermediate to advanced regexp features is a reasonable, efficient and reliable approach. I only mentioned the recursion possibility here to that effect.

To be honest, I hadn't used PCRE recursion for some time and had to try a couple of times before coming up with a working pattern, because the tags are multi-character. Real regexp gurus would find that simple one really trivial...