Azevedo Posted February 22, 2015 (edited)
Hello. Is it possible to parse an HTML string using some DOM library in AU3?
water Posted February 22, 2015
First thing that comes to my mind is the IE UDF that comes with AutoIt.
Azevedo Posted February 22, 2015 (Author)
I'm not using IE's engine. I'm getting the HTTP stream (the HTML code) into a string. Probably there isn't a DOM for that. Then I'll use RegEx.
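A minimal sketch of the RegEx route mentioned above, assuming the goal is simply to pull the href values out of <a> tags in an HTML string. The sample string and the pattern are illustrative assumptions, not something from the thread; the pattern only handles double-quoted href attributes, and a regular expression like this is not a substitute for a real HTML parser.

```autoit
#include <StringConstants.au3>

; Hypothetical sample input; in practice this would be the downloaded page source.
Local $sHtml = '<p><a href="http://example.com/one">One</a> and <a class="x" href="/two">Two</a></p>'

; (?i) makes the match case-insensitive.
; The single capture group collects the value of each double-quoted href attribute.
Local $aLinks = StringRegExp($sHtml, '(?i)<a\s[^>]*?href="([^"]*)"', $STR_REGEXPARRAYGLOBALMATCH)

If @error Then
    ; StringRegExp sets @error when the pattern matches nothing (or is invalid).
    ConsoleWrite("No links found" & @CRLF)
Else
    ; With the global-match flag the result is a 0-based array of captured values.
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i] & @CRLF)
    Next
EndIf
```

This works on the string alone, so nothing is rendered or fetched beyond the HTML itself. It will miss single-quoted or unquoted attributes, which is one of the usual limits of this approach.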
SmOke_N (Moderators) Posted February 22, 2015 (edited)
So essentially you want to build a browser? If not, maybe it's time you use a browser if you want to use browser objects? Edit: I say this because they have already done all that work for you.
Azevedo Posted February 25, 2015 (Author, edited)
The purpose is to automate some online tasks without using IE. By using IE's engine I would be compromising privacy, since it keeps history and cache. IE would also load web components (Flash, JavaScript, images), which is not what I want. Besides, I don't want to depend on IE's interface or engine.
SmOke_N (Moderators) Posted February 25, 2015
Unfortunately, there's no "DOM" au3... although it sounds like a fun and extremely lengthy project. I know chimp worked on a raw HTML table parser though. If you got a group of decent coders together for the project, I might be willing to add to the mix. But now you know why I suggested the IE engine. There are always methods to clean up as well, but if you're doing this on client machines, your project may be too delicate, and the need for a complete DOM parser eludes me at the moment.
Gianni Posted February 25, 2015 (Solution)
Hi Azevedo, just 3 days ago, as SmOke_N said in the previous post, I posted a UDF to parse tables from raw HTML. It relies on an internal core function that I wrote for table extraction, but it was designed to serve a more general purpose as well: extracting the portions of code related to specific HTML tags. Maybe it can be useful for your project too. In short, that function returns a sort of collection of the portions of the page's code related to a specific HTML tag. Of course it can be enhanced and refined, but it can be a starting point. An example is better than many words:

```autoit
#include <Array.au3>

Local $sHtml = BinaryToString(InetRead("http://www.autoitscript.com")) ; get the raw source

Local $aMyTags = _ParseTags($sHtml, "<a", "</a>") ; collection of <a> tags
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<script", "</script>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<div", "</div>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<style", "</style>")
_ArrayDisplay($aMyTags)

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: searches for and extracts all portions of html code within opening and closing tags, inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> lines,
;                  one in each element (even if they are nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml    - A string value containing the html listing
;                  $sOpening - A string value indicating the opening tag
;                  $sClosing - A string value indicating the closing tag
; Return values .: success: a 1D, 1-based array containing all the portions of html code representing the element
;                           element [0] of the array (and @extended as well) contains the count of found elements
;                  failure: an empty string, and sets @error as follows:
;                           @error: 1 - required tags are not present in the passed HTML
;                                   2 - error while parsing tags (opening and closing tags are not balanced)
;                                   3 - error while parsing tags (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; it finds how many of such tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended now holds the number of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that opening <tag and closing </tag> tags are balanced (as they should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of the $i-th occurrence of the <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag this is
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of the $i-th occurrence of the </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag this is
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the html
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no such tags on this HTML page
    EndIf
EndFunc   ;==>_ParseTags
```
Azevedo Posted February 26, 2015 (Author)
Thanks chimp, smoke. Chimp's function will help me in some cases! Thanks!