Posted (edited)

Here's a UDF to get information from any HTML (XML) source, without any browser.

_HTML.au3

Current functions:

- _HTML_ExtractURLVar

- _HTML_Get

- _HTML_GetAllImageSrc

- _HTML_GetAllLinks

- _HTML_GetImageSrc

- _HTML_GetLink

- _HTML_GetSource

- _HTML_GetTable

- _HTML_GetText

- _HTML_GetURLVar

- _HTML_ImageSave

- _HTML_Search

Different search-modes:

$_HTML_SEARCHMODE = 1 ; (0 = Compare / 1 = Substring / RegExp) (2 = Compare / 3 = Substring / String-compare)

Example 1: (getting the url of the favicon of a page)

#include <_html.au3>

$sHTML = _HTML_GetSource("http://autoit.de/index.php?page=Portal")
$sIconURL = _HTML_Get($sHTML, "link", "href", "image/x-icon", "type")
MsgBox(64,"",$sIconURL)

Example 2:

#Region Includes
#include <Array.au3>
#include <_HTML.au3>
#EndRegion Includes

$_HTML_SEARCHMODE = 1

Main()

Func Main()

    Local $HTML = _HTML_GetSource("http://autoit.de/index.php?page=Portal")

    MsgBox(0, "", _HTML_GetURLVar($HTML, "page", "Mitglieder", "title") & @CRLF)
    MsgBox(0, "", _HTML_GetText($HTML, "div", "cont.*erCont", "class", 5) & @CRLF)
    MsgBox(0, "", _HTML_GetImageSrc($HTML, "controllcenterImage") & @CRLF)
    MsgBox(0, "", _HTML_GetLink($HTML, "loginButton") & @CRLF)

    Local $a = _HTML_GetAllLinks($HTML)
    _ArrayDisplay($a)

    $a = _HTML_GetAllLinks($HTML, '\.com')
    _ArrayDisplay($a)

    $a = _HTML_GetAllImageSrc($HTML, 'wcf/images/')
    _ArrayDisplay($a)
EndFunc   ;==>Main

The UDF:

_HTML.au3

V1.01:

Small bug fix in _HTML_GetTable

Edited by Stilgar
Posted

The UDF looks interesting. Today I thought of a reason why I might want to retrieve search engine results for a program I'm making, although I have decided against the idea in this instance. It would only have been necessary if the user somehow bypassed the filters I am designing, which reject meaningless (or ambiguous) user input. Searching for meaningless strings on Google won't be a useful feature. I might find some other uses for this though. :x

Posted

Thanks for sharing!

It brings to mind a concept, which I tentatively call Webpage Pruner.

You parse the contents of an HTML file into a GUI with multiple tabs. The idea is that different portions of the web page inhabit different tabs, and at the click of a button you can remove a portion from the HTML file or export it to a new one. In other words, a limited visual editor. Unlike a web page editor, it's not really trying to interpret anything ... just mostly dealing with text & images ... though you could provide for links & tables, etc. That way you could remove all the crap and end up with a pretty bare-bones web page.

Anyone is welcome to have a go at this if they like, as it could just remain an idea in my mind ... until it quietly slips away ....

Posted (edited)

I have been using this UDF, but I never got a result from the _HTML_GetTable function. Can you post an example of it?

Here is the fixed UDF: "_HTML.zip"

EDIT: new attachment in second post below.

Edited by mLipok
Attachment removed - quota cleanup


Posted

@mLipok you forgot an _ArrayDisplay($r,'$r') on line 295

The _HTML_GetTable function fails on tables with rowspan and/or colspan.

I post here an example that I've ported and slightly adapted from my post (https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/) about data extraction from HTML tables. If you select table 2 from the example page and click the "To array by Stilgar" button, the data extracted by the _HTML_GetTable function is not correct. The same issue occurs with table 7.
If you click the "To array by Chimp" button instead, my function is used and the data is extracted correctly.

Here is the example script to test data extraction from HTML tables:
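
To make the span problem a bit more concrete, here is a minimal sketch of how the rowspan/colspan values of a single cell can be read with StringRegExp. The sample cell string and the variable names are made up for illustration; the two patterns are the ones used in _HtmlTableWriteToArray in the script further down. A parser that ignores these attributes places the following cells in the wrong columns, which is presumably what goes wrong with tables 2 and 7.

```autoit
; Hypothetical sample cell, only for illustration
Local $sCell = '<td rowspan="2" colspan=3>Apples</td>'

; Flag 1 = return an array of matches (not an array if there is no match)
Local $aRow = StringRegExp($sCell, "(?i)rowspan\s*=\s*[""']?\s*(\d+)", 1)
Local $aCol = StringRegExp($sCell, "(?i)colspan\s*=\s*[""']?\s*(\d+)", 1)

; A cell without the attribute spans exactly one row / one column
Local $iRowspan = IsArray($aRow) ? Number($aRow[0]) : 1
Local $iColspan = IsArray($aCol) ? Number($aCol[0]) : 1

ConsoleWrite("rowspan=" & $iRowspan & ", colspan=" & $iColspan & @CRLF) ; rowspan=2, colspan=3
```

The ternary operator needs a reasonably recent AutoIt version; on older versions an If/Else block does the same job.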

; #include <_HtmlTable2Array.au3> ; <--- udf by chimp. already included at bottom of this demo
; see here: https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/
;
#include <_HTML.au3> ; <--- udf by Stilgar
#include <GUIConstantsEx.au3>
#include <EditConstants.au3>
#include <WindowsConstants.au3>
#include <array.au3>
#include <IE.au3>

Local $oIE1 = _IECreateEmbedded(), $oIE2 = _IECreateEmbedded()
Local $sHtml_File, $iIndex, $aTables, $aMyArray
GUICreate("Html tables to array demo", 1000, 450, (@DesktopWidth - 1000) / 2, (@DesktopHeight - 450) / 2 _
        , $WS_OVERLAPPEDWINDOW + $WS_CLIPSIBLINGS + $WS_CLIPCHILDREN)
GUICtrlCreateObj($oIE1, 010, 10, 480, 360) ; left browser
GUICtrlCreateTab(500, 10, 480, 360)
GUICtrlCreateTabItem("view table")
GUICtrlCreateObj($oIE2, 502, 33, 474, 335) ; right browser
GUICtrlCreateTabItem("view html")
Local $idLabel_HtmlTable = GUICtrlCreateInput("", 502, 33, 474, 335, $ES_MULTILINE + $ES_AUTOVSCROLL)
GUICtrlSetFont(-1, 10, 0, 0, "Courier new")
GUICtrlCreateTabItem("")

Local $idInputUrl = GUICtrlCreateInput("", 10, 380, 440, 20)
Local $idButton_Go = GUICtrlCreateButton("Go", 455, 380, 25, 20)
Local $idButton_Load = GUICtrlCreateButton("Load html from disk", 10, 410, 480, 30)

Local $idButton_Prev = GUICtrlCreateButton("Prev  <-", 510, 380, 100, 30)
Local $idLabel_NunTable = GUICtrlCreateLabel("00 / 00", 615, 380, 40, 30)
GUICtrlSetFont(-1, 9, 700)
Local $idButton_Next = GUICtrlCreateButton("Next  ->", 660, 380, 100, 30)
Local $idButton_Array2 = GUICtrlCreateButton("To array by Stilgar", 770, 380, 100, 30)
Local $idButton_Array = GUICtrlCreateButton("To array by Chimp", 880, 380, 100, 30)

GUISetState(@SW_SHOW) ;Show GUI
_IEDocWriteHTML($oIE2, "<HTML></HTML>")
GUICtrlSetData($idInputUrl, "http://www.mojotoad.com/sisk/projects/HTML-TableExtract/tables.html") ; example page
; GUICtrlSetData($idInputUrl, "http://www.devbevy.com/tutorials/html/table-rowspan-colspan.htm") ; <thead> & <tfoot> problem #table 6
ControlClick("", "", $idButton_Go)
; _IEAction($oIE1, "stop")

Do; Waiting for user to close the window
    $iMsg = GUIGetMsg()
    Select
        Case $iMsg = $idButton_Go
            _IENavigate($oIE1, GUICtrlRead($idInputUrl))
            ; _IEAction($oIE1, "stop")
            $aTables = _HtmlTableGetList(_IEBodyReadHTML($oIE1))
            If Not @error Then
                ; _ArrayDisplay($aTables, "Tables contained in this html")
                $iIndex = 1
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                ControlClick("", "", $idButton_Prev)
                _IEAction($oIE2, "stop")
            Else
                MsgBox(0, 0, "@error " & @error)
            EndIf
        Case $iMsg = $idButton_Load
            ConsoleWrite("$idButton_Load" & @CRLF)
            $sHtml_File = FileOpenDialog("Choose an html file", @ScriptDir & "\", "html page (*.htm;*.html)")
            If Not @error Then
                GUICtrlSetData($idInputUrl, $sHtml_File)
                ControlClick("", "", $idButton_Go)
            EndIf
        Case $iMsg = $idButton_Next
            If IsArray($aTables) Then
                $iIndex += $iIndex < $aTables[0]
                GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0])
                GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex])
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                _IEAction($oIE2, "stop")
            EndIf
        Case $iMsg = $idButton_Prev
            If IsArray($aTables) Then
                $iIndex -= $iIndex > 1
                GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0])
                GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex])
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                _IEAction($oIE2, "stop")
            EndIf
        Case $iMsg = $idButton_Array
            If IsArray($aTables) Then
                $aMyArray = _HtmlTableWriteToArray($aTables[$iIndex], 0, 1)
                If Not @error Then _ArrayDisplay($aMyArray)
            EndIf
        Case $iMsg = $idButton_Array2
            If IsArray($aTables) Then
                $aMyArray = _HTML_GetTable($aTables[$iIndex]) ; <---- function from _HTML.au3 (can give wrong results)
                If Not @error Then _ArrayDisplay($aMyArray)
            EndIf

    EndSelect
Until $iMsg = $GUI_EVENT_CLOSE
GUIDelete()
Exit
;
; function included here for a quicker "load & execute" of this demo.
; It should be loaded with #include <_HtmlTable2Array.au3> at the beginning instead
; see here: https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/
;
; #include-once
; #include <array.au3>
;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetList
; Description ...: Finds and enumerates all the html tables contained in an html listing (even if nested).
;                  if the optional parameter $i_index is passed, then only that table is returned
; Syntax ........: _HtmlTableGetList($sHtml[, $i_index = -1])
; Parameters ....: $sHtml               - A string value containing an html page listing
;                  $i_index             - [optional] An integer value indicating the number of the table to be returned (1 based)
;                                         with the default value of -1 an array with all found tables is returned
; Return values .: Success:               Returns a 1D, 1-based array containing all tables (or the single requested table) found in the html.
;                                         element [0] (and @extended as well) contains the number of tables found (or 0 if no tables are returned)
;                                         if an error occurs then an empty string is returned and @error is set as follows:
;                                         @error:   1 - no tables are present in the passed HTML
;                                                   2 - error while parsing tables, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tables, (open/close mismatch error)
;                                                   4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetList($sHtml, $i_index = -1)
    Local $aTables = _ParseTags($sHtml, "<table", "</table>")
    If @error Then
        Return SetError(@error, 0, "")
    ElseIf $i_index = -1 Then
        Return SetError(0, $aTables[0], $aTables)
    Else
        If $i_index > 0 And $i_index <= $aTables[0] Then
            Local $aTemp[2] = [1, $aTables[$i_index]]
            Return SetError(0, 1, $aTemp)
        Else
            Return SetError(4, 0, "") ; bad index
        EndIf
    EndIf
EndFunc   ;==>_HtmlTableGetList

; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableWriteToArray
; Description ...: It writes values from an html table to a 2D array. It tries to take care of the rowspan and colspan formats
; Syntax ........: _HtmlTableWriteToArray($sHtmlTable[, $bFillSpan = False[, $bExtractTH = True]])
; Parameters ....: $sHtmlTable          - A string value containing the html code of the table to be parsed
;                  $bFillSpan           - [optional] Default is False. If span areas have to be filled by repeating the data
;                                         contained in the first cell of the span area
;                  $bExtractTH          - [optional] Default is True. This is an experimental parameter
;                                                    This parameter indicates if Table Headers have to be extracted to the array.
;                                                    This option can generate misalignments in the array if is set to false due
;                                                    to the "holes" left empty by the ignored <th> and </th> tags.
;                                                    (Best DON'T set this option to False !)
; Return values .: Success:               2D array containing data as from the html table
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no table content is present in the passed HTML
;                                                   2 - error while parsing rows and/or columns, (opening and closing tags are not balanced)
;                                                   3 - error while parsing rows and/or columns, (open/close mismatch error)
; ===============================================================================================================================
Func _HtmlTableWriteToArray($sHtmlTable, $bFillSpan = False, $bExtractTH = True)
    If $bExtractTH Then ;extract also TableHeaders as normal data?
        $sHtmlTable = StringReplace(StringReplace($sHtmlTable, "<th", "<td"), "</th>", "</td>") ; th becomes td
    EndIf
    ; rows of the wanted table
    Local $iError, $aTempEmptyRow[2] = [1, ""]
    Local $aRows = _ParseTags($sHtmlTable, "<tr", "</tr>") ; $aRows[0] = nr. of rows
    If @error Then Return SetError(@error, 0, "")
    Local $aCols[$aRows[0] + 1], $aTemp
    For $i = 1 To $aRows[0]
        $aTemp = _ParseTags($aRows[$i], "<td", "</td>")
        $iError = @error
        If $iError = 1 Then ; check if it's an empty row
            $aTemp = $aTempEmptyRow ; Empty Row
        Else
            If $iError Then Return SetError($iError, 0, "")
        EndIf
        If $aCols[0] < $aTemp[0] Then $aCols[0] = $aTemp[0] ; $aTemp[0] = max nr. of columns in table
        $aCols[$i] = $aTemp
    Next
    Local $aResult[$aRows[0]][$aCols[0]], $iStart, $iEnd, $aRowspan, $aColspan, $iSpanY, $iSpanX, $iSpanRow, $iSpanCol, $iMarkerCode, $sCellContent
    Local $aMirror = $aResult
    For $i = 1 To $aRows[0] ;      scan all rows in this table
        $aTemp = $aCols[$i] ; <td ..> xx </td> .....
        For $ii = 1 To $aTemp[0] ; scan all cells in this row
            $iSpanY = 0
            $iSpanX = 0
            $iY = $i - 1 ; zero base index for vertical ref
            $iX = $ii - 1 ; zero based indexes for horizontal ref
            ; following RegExp kindly provided by SadBunny in this post:
            ; http://www.autoitscript.com/forum/topic/167174-how-to-get-a-number-located-after-a-name-from-within-a-string/?p=1222781
            $aRowspan = StringRegExp($aTemp[$ii], "(?i)rowspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of rowspan
            If IsArray($aRowspan) Then
                $iSpanY = $aRowspan[0] - 1
                If $iSpanY + $iY > $aRows[0] Then
                    $iSpanY -= $iSpanY + $iY - $aRows[0] + 1
                EndIf
            EndIf
            ;
            $aColspan = StringRegExp($aTemp[$ii], "(?i)colspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of colspan
            If IsArray($aColspan) Then $iSpanX = $aColspan[0] - 1
            ;
            $iMarkerCode += 1 ; code to mark this span area or single cell
            If $iSpanY Or $iSpanX Then
                $iX1 = $iX
                For $iSpY = 0 To $iSpanY
                    For $iSpX = 0 To $iSpanX
                        $iSpanRow = $iY + $iSpY
                        If $iSpanRow > UBound($aMirror, 1) - 1 Then
                            $iSpanRow = UBound($aMirror, 1) - 1
                        EndIf
                        $iSpanCol = $iX1 + $iSpX
                        If $iSpanCol > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                            ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                        EndIf
                        ;
                        While $aMirror[$iSpanRow][$iX1 + $iSpX] ; search first free column
                            $iX1 += 1 ; $iSpanCol += 1
                            If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                                ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                                ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                            EndIf
                        WEnd
                    Next
                Next
            EndIf
            ;
            $iX1 = $iX
            ; following RegExp kindly provided by mikell in this post:
            ; http://www.autoitscript.com/forum/topic/167309-how-to-remove-from-a-string-all-between-and-pairs/?p=1224207
            $sCellContent = StringRegExpReplace($aTemp[$ii], '<[^>]+>', "")
            For $iSpX = 0 To $iSpanX
                For $iSpY = 0 To $iSpanY
                    $iSpanRow = $iY + $iSpY
                    If $iSpanRow > UBound($aMirror, 1) - 1 Then
                        $iSpanRow = UBound($aMirror, 1) - 1
                    EndIf
                    While $aMirror[$iSpanRow][$iX1 + $iSpX]
                        $iX1 += 1
                        If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][$iX1 + $iSpX + 1]
                            ReDim $aMirror[$aRows[0]][$iX1 + $iSpX + 1]
                        EndIf
                    WEnd
                    $aMirror[$iSpanRow][$iX1 + $iSpX] = $iMarkerCode ; 1
                    If $bFillSpan Then $aResult[$iSpanRow][$iX1 + $iSpX] = $sCellContent
                Next
                $aResult[$iY][$iX1] = $sCellContent
            Next
        Next
    Next
    ; _ArrayDisplay($aMirror, "Debug")
    Return SetError(0, $aResult[0][0], $aResult)
EndFunc   ;==>_HtmlTableWriteToArray

;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetWriteToArray
; Description ...: extract the html code of the required table from the html listing and copy the data of the table to a 2D array
; Syntax ........: _HtmlTableGetWriteToArray($sHtml[, $iWantedTable = 1[, $bFillSpan = False[, $bExtractTH = True]]])
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $iWantedTable        - [optional] An integer value. The nr. of the table to be parsed (default is first table)
;                  $bFillSpan           - [optional] Default is False. If all span areas have to be filled by repeating the data
;                                         contained in the first cell of the span area
;                  $bExtractTH          - [optional] Default is True. This is an experimental parameter
;                                                    This parameter indicates if Table Headers have to be extracted to the array.
;                                                    This option can generate misalignments in the array if is set to false due
;                                                    to the "holes" left empty by the ignored <th> and </th> tags.
;                                                    (Best DON'T set this option to False !)
; Return values .: Success:               2D array containing data from the wanted html table.
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no tables are present in the passed HTML
;                                                   2 - error while parsing tables, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tables, (open/close mismatch error)
;                                                   4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetWriteToArray($sHtml, $iWantedTable = 1, $bFillSpan = False, $bExtractTH = True)
    Local $aSingleTable = _HtmlTableGetList($sHtml, $iWantedTable)
    If @error Then Return SetError(@error, 0, "")
    Local $aTableData = _HtmlTableWriteToArray($aSingleTable[1], $bFillSpan, $bExtractTH) ; pass the optional parameters through
    If @error Then Return SetError(@error, 0, "")
    Return SetError(0, $aTableData[0][0], $aTableData)
EndFunc   ;==>_HtmlTableGetWriteToArray

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: searches and extract all portions of html code within opening and closing tags inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> lines. one in each element (even if are nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $sOpening            - A string value indicating the opening tag
;                  $sClosing            - A string value indicating the closing tag
; Return values .: Success:               a 1D, 1-based array containing all the portions of html code representing the element
;                                         element [0] of the array (and @extended as well) contains the count of found elements
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no such tags are present in the passed HTML
;                                                   2 - error while parsing tags, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tags, (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; it finds how many of such tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended holds the nr. of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that opening <tag and closing </tag> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contains the collection of <tag ..... </tag> from the html
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no of such tags on this HTML page
    EndIf
EndFunc   ;==>_ParseTags
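
For anyone who only wants the array and not the GUI, a minimal sketch of calling the functions above directly could look like this. It assumes the functions have been saved as _HtmlTable2Array.au3 (the file name mentioned in the comments at the top of the demo) somewhere on the include path; the inline HTML string is a made-up sample, and any HTML source containing a table could be used instead.

```autoit
#include <Array.au3>
#include <_HtmlTable2Array.au3> ; assumption: the functions above saved under this name

; Hypothetical sample page with one small table that uses a rowspan
Local $sHtml = '<html><body><table>' & _
        '<tr><th>Name</th><th>Qty</th></tr>' & _
        '<tr><td rowspan="2">Apples</td><td>1</td></tr>' & _
        '<tr><td>2</td></tr>' & _
        '</table></body></html>'

; Extract table nr. 1 into a 2D array
Local $aData = _HtmlTableGetWriteToArray($sHtml, 1)
Local $iError = @error ; save it before any other function call resets @error

If $iError Then
    ConsoleWrite("Table extraction failed, @error = " & $iError & @CRLF)
Else
    _ArrayDisplay($aData, "Table 1")
EndIf
```

The @error codes are the ones listed in the function headers above (1 = no table found, 4 = table index out of range, and so on).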


Posted

@mLipok you forgot an _ArrayDisplay($r,'$r') on line 295

Thanks, I fixed it and changed the example.

 __HTML.ZIP

 

 


  • 4 years later...
Posted
On 6/21/2015 at 3:58 AM, Chimp said:

@mLipok you have forgot an _ArrayDisplay($r,'$r') on line 295

The _HTML_GetTable function fails on tables that use rowspan and/or colspan.

Below is an example that I ported and adapted from my post (https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/) about extracting data from HTML tables. If you select table 2 on the example page and click the "To array by Stilgar" button, the data extracted by _HTML_GetTable is not correct. The same happens with table 7.
If you click the "To array by Chimp" button instead, which uses my function, the data is extracted correctly.

Here is the example script to test data extraction from HTML tables:

; #include <_HtmlTable2Array.au3> ; <--- udf by chimp. already included at bottom of this demo
; see here: https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/
;
#include <_HTML.au3> ; <--- udf by Stilgar
#include <GUIConstantsEx.au3>
#include <EditConstants.au3>
#include <WindowsConstants.au3>
#include <array.au3>
#include <IE.au3>

Local $oIE1 = _IECreateEmbedded(), $oIE2 = _IECreateEmbedded()
Local $sHtml_File, $iIndex, $aTables, $aMyArray, $iMsg
GUICreate("Html tables to array demo", 1000, 450, (@DesktopWidth - 1000) / 2, (@DesktopHeight - 450) / 2 _
        , BitOR($WS_OVERLAPPEDWINDOW, $WS_CLIPSIBLINGS, $WS_CLIPCHILDREN))
GUICtrlCreateObj($oIE1, 010, 10, 480, 360) ; left browser
GUICtrlCreateTab(500, 10, 480, 360)
GUICtrlCreateTabItem("view table")
GUICtrlCreateObj($oIE2, 502, 33, 474, 335) ; right browser
GUICtrlCreateTabItem("view html")
Local $idLabel_HtmlTable = GUICtrlCreateInput("", 502, 33, 474, 335, BitOR($ES_MULTILINE, $ES_AUTOVSCROLL))
GUICtrlSetFont(-1, 10, 0, 0, "Courier new")
GUICtrlCreateTabItem("")

Local $idInputUrl = GUICtrlCreateInput("", 10, 380, 440, 20)
Local $idButton_Go = GUICtrlCreateButton("Go", 455, 380, 25, 20)
Local $idButton_Load = GUICtrlCreateButton("Load html from disk", 10, 410, 480, 30)

Local $idButton_Prev = GUICtrlCreateButton("Prev  <-", 510, 380, 100, 30)
Local $idLabel_NunTable = GUICtrlCreateLabel("00 / 00", 615, 380, 40, 30)
GUICtrlSetFont(-1, 9, 700)
Local $idButton_Next = GUICtrlCreateButton("Next  ->", 660, 380, 100, 30)
Local $idButton_Array2 = GUICtrlCreateButton("To array by Stilgar", 770, 380, 100, 30)
Local $idButton_Array = GUICtrlCreateButton("To array by Chimp", 880, 380, 100, 30)

GUISetState(@SW_SHOW) ;Show GUI
_IEDocWriteHTML($oIE2, "<HTML></HTML>")
GUICtrlSetData($idInputUrl, "http://www.mojotoad.com/sisk/projects/HTML-TableExtract/tables.html") ; example page
; GUICtrlSetData($idInputUrl, "http://www.devbevy.com/tutorials/html/table-rowspan-colspan.htm") ; <thead> & <tfoot> problem #table 6
ControlClick("", "", $idButton_Go)
; _IEAction($oIE1, "stop")

Do; Waiting for user to close the window
    $iMsg = GUIGetMsg()
    Select
        Case $iMsg = $idButton_Go
            _IENavigate($oIE1, GUICtrlRead($idInputUrl))
            ; _IEAction($oIE1, "stop")
            $aTables = _HtmlTableGetList(_IEBodyReadHTML($oIE1))
            If Not @error Then
                ; _ArrayDisplay($aTables, "Tables contained in this html")
                $iIndex = 1
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                ControlClick("", "", $idButton_Prev)
                _IEAction($oIE2, "stop")
            Else
                MsgBox(0, 0, "@error " & @error)
            EndIf
        Case $iMsg = $idButton_Load
            ConsoleWrite("$idButton_Load" & @CRLF)
            $sHtml_File = FileOpenDialog("Choose an html file", @ScriptDir & "\", "html page (*.htm;*.html)")
            If Not @error Then
                GUICtrlSetData($idInputUrl, $sHtml_File)
                ControlClick("", "", $idButton_Go)
            EndIf
        Case $iMsg = $idButton_Next
            If IsArray($aTables) Then
                $iIndex += $iIndex < $aTables[0]
                GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0])
                GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex])
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                _IEAction($oIE2, "stop")
            EndIf
        Case $iMsg = $idButton_Prev
            If IsArray($aTables) Then
                $iIndex -= $iIndex > 1
                GUICtrlSetData($idLabel_NunTable, "Table" & @CRLF & $iIndex & " / " & $aTables[0])
                GUICtrlSetData($idLabel_HtmlTable, $aTables[$iIndex])
                _IEBodyWriteHTML($oIE2, "<html>" & $aTables[$iIndex] & "</html>")
                _IEAction($oIE2, "stop")
            EndIf
        Case $iMsg = $idButton_Array
            If IsArray($aTables) Then
                $aMyArray = _HtmlTableWriteToArray($aTables[$iIndex], 0, 1)
                If Not @error Then _ArrayDisplay($aMyArray)
            EndIf
        Case $iMsg = $idButton_Array2
            If IsArray($aTables) Then
                $aMyArray = _HTML_GetTable($aTables[$iIndex]) ; <---- function from _HTML.au3 (can give wrong results)
                If Not @error Then _ArrayDisplay($aMyArray)
            EndIf

    EndSelect
Until $iMsg = $GUI_EVENT_CLOSE
GUIDelete()
Exit
;
; function included here for a quicker "load & execute" of this demo.
; It should be loaded with #include <_HtmlTable2Array.au3> at the beginning instead
; see here: https://www.autoitscript.com/forum/topic/167679-read-data-from-html-tables-from-raw-html-source/
;
; #include-once
; #include <array.au3>
;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetList
; Description ...: Finds and enumerates all the html tables contained in an html listing (even if nested).
;                  if the optional parameter $i_index is passed, then only that table is returned
; Syntax ........: _HtmlTableGetList($sHtml[, $i_index = -1])
; Parameters ....: $sHtml               - A string value containing an html page listing
;                  $i_index             - [optional] An integer value indicating the number of the table to be returned (1 based)
;                                         with the default value of -1 an array with all found tables is returned
; Return values .: Success:               Returns a 1D, 1-based array containing all tables (or the single requested table) found in the html.
;                                         element [0] (and @extended as well) contains the number of tables found (or 0 if no tables are returned)
;                  Failure:               an empty string is returned and @error is set as follows:
;                                         @error:   1 - no tables are present in the passed HTML
;                                                   2 - error while parsing tables, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tables, (open/close mismatch error)
;                                                   4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetList($sHtml, $i_index = -1)
    Local $aTables = _ParseTags($sHtml, "<table", "</table>")
    If @error Then
        Return SetError(@error, 0, "")
    ElseIf $i_index = -1 Then
        Return SetError(0, $aTables[0], $aTables)
    Else
        If $i_index > 0 And $i_index <= $aTables[0] Then
            Local $aTemp[2] = [1, $aTables[$i_index]]
            Return SetError(0, 1, $aTemp)
        Else
            Return SetError(4, 0, "") ; bad index
        EndIf
    EndIf
EndFunc   ;==>_HtmlTableGetList

; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableWriteToArray
; Description ...: It writes values from an html table to a 2D array. It tries to take care of the rowspan and colspan formats
; Syntax ........: _HtmlTableWriteToArray($sHtmlTable[, $bFillSpan = False[, $bExtractTH = True]])
; Parameters ....: $sHtmlTable          - A string value containing the html code of the table to be parsed
;                  $bFillSpan           - [optional] Default is False. If span areas have to be filled by repeating the data
;                                         contained in the first cell of the span area
;                  $bExtractTH          - [optional] Default is True. This is an experimental parameter
;                                                    This parameter indicates if Table Headers have to be extracted to the array.
;                                                    This option can generate misalignments in the array if is set to false due
;                                                    to the "holes" left empty by the ignored <th> and </th> tags.
;                                                    (Best DON'T set this option to False !)
; Return values .: Success:               2D array containing data as from the html table
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no table content is present in the passed HTML
;                                                   2 - error while parsing rows and/or columns, (opening and closing tags are not balanced)
;                                                   3 - error while parsing rows and/or columns, (open/close mismatch error)
; ===============================================================================================================================
Func _HtmlTableWriteToArray($sHtmlTable, $bFillSpan = False, $bExtractTH = True)
    If $bExtractTH Then ;extract also TableHeaders as normal data?
        $sHtmlTable = StringReplace(StringReplace($sHtmlTable, "<th", "<td"), "</th>", "</td>") ; th becomes td
    EndIf
    ; rows of the wanted table
    Local $iError, $aTempEmptyRow[2] = [1, ""]
    Local $aRows = _ParseTags($sHtmlTable, "<tr", "</tr>") ; $aRows[0] = nr. of rows
    If @error Then Return SetError(@error, 0, "")
    Local $aCols[$aRows[0] + 1], $aTemp
    For $i = 1 To $aRows[0]
        $aTemp = _ParseTags($aRows[$i], "<td", "</td>")
        $iError = @error
        If $iError = 1 Then ; check if it's an empty row
            $aTemp = $aTempEmptyRow ; Empty Row
        Else
            If $iError Then Return SetError($iError, 0, "")
        EndIf
        If $aCols[0] < $aTemp[0] Then $aCols[0] = $aTemp[0] ; $aTemp[0] = max nr. of columns in table
        $aCols[$i] = $aTemp
    Next
    Local $aResult[$aRows[0]][$aCols[0]], $iStart, $iEnd, $aRowspan, $aColspan, $iSpanY, $iSpanX, $iSpanRow, $iSpanCol, $iMarkerCode, $sCellContent
    Local $aMirror = $aResult
    For $i = 1 To $aRows[0] ;      scan all rows in this table
        $aTemp = $aCols[$i] ; <td ..> xx </td> .....
        For $ii = 1 To $aTemp[0] ; scan all cells in this row
            $iSpanY = 0
            $iSpanX = 0
            $iY = $i - 1 ; zero base index for vertical ref
            $iX = $ii - 1 ; zero based indexes for horizontal ref
            ; following RegExp kindly provided by SadBunny in this post:
            ; http://www.autoitscript.com/forum/topic/167174-how-to-get-a-number-located-after-a-name-from-within-a-string/?p=1222781
            $aRowspan = StringRegExp($aTemp[$ii], "(?i)rowspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of rowspan
            If IsArray($aRowspan) Then
                $iSpanY = $aRowspan[0] - 1
                If $iSpanY + $iY > $aRows[0] Then
                    $iSpanY -= $iSpanY + $iY - $aRows[0] + 1
                EndIf
            EndIf
            ;
            $aColspan = StringRegExp($aTemp[$ii], "(?i)colspan\s*=\s*[""']?\s*(\d+)", 1) ; check presence of colspan
            If IsArray($aColspan) Then $iSpanX = $aColspan[0] - 1
            ;
            $iMarkerCode += 1 ; code to mark this span area or single cell
            If $iSpanY Or $iSpanX Then
                $iX1 = $iX
                For $iSpY = 0 To $iSpanY
                    For $iSpX = 0 To $iSpanX
                        $iSpanRow = $iY + $iSpY
                        If $iSpanRow > UBound($aMirror, 1) - 1 Then
                            $iSpanRow = UBound($aMirror, 1) - 1
                        EndIf
                        $iSpanCol = $iX1 + $iSpX
                        If $iSpanCol > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                            ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                        EndIf
                        ;
                        While $aMirror[$iSpanRow][$iX1 + $iSpX] ; search first free column
                            $iX1 += 1 ; $iSpanCol += 1
                            If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                                ReDim $aResult[$aRows[0]][UBound($aResult, 2) + 1]
                                ReDim $aMirror[$aRows[0]][UBound($aMirror, 2) + 1]
                            EndIf
                        WEnd
                    Next
                Next
            EndIf
            ;
            $iX1 = $iX
            ; following RegExp kindly provided by mikell in this post:
            ; http://www.autoitscript.com/forum/topic/167309-how-to-remove-from-a-string-all-between-and-pairs/?p=1224207
            $sCellContent = StringRegExpReplace($aTemp[$ii], '<[^>]+>', "")
            For $iSpX = 0 To $iSpanX
                For $iSpY = 0 To $iSpanY
                    $iSpanRow = $iY + $iSpY
                    If $iSpanRow > UBound($aMirror, 1) - 1 Then
                        $iSpanRow = UBound($aMirror, 1) - 1
                    EndIf
                    While $aMirror[$iSpanRow][$iX1 + $iSpX]
                        $iX1 += 1
                        If $iX1 + $iSpX > UBound($aMirror, 2) - 1 Then
                            ReDim $aResult[$aRows[0]][$iX1 + $iSpX + 1]
                            ReDim $aMirror[$aRows[0]][$iX1 + $iSpX + 1]
                        EndIf
                    WEnd
                    $aMirror[$iSpanRow][$iX1 + $iSpX] = $iMarkerCode ; 1
                    If $bFillSpan Then $aResult[$iSpanRow][$iX1 + $iSpX] = $sCellContent
                Next
                $aResult[$iY][$iX1] = $sCellContent
            Next
        Next
    Next
    ; _ArrayDisplay($aMirror, "Debug")
    Return SetError(0, $aResult[0][0], $aResult)
EndFunc   ;==>_HtmlTableWriteToArray

;
; #FUNCTION# ====================================================================================================================
; Name ..........: _HtmlTableGetWriteToArray
; Description ...: extract the html code of the required table from the html listing and copy the data of the table to a 2D array
; Syntax ........: _HtmlTableGetWriteToArray($sHtml[, $iWantedTable = 1[, $bFillSpan = False[, $bExtractTH = True]]])
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $iWantedTable        - [optional] An integer value. The nr. of the table to be parsed (default is first table)
;                  $bFillSpan           - [optional] Default is False. If all span areas have to be filled by repeating the data
;                                         contained in the first cell of the span area
;                  $bExtractTH          - [optional] Default is True. This is an experimental parameter
;                                                    This parameter indicates if Table Headers have to be extracted to the array.
;                                                    This option can generate misalignments in the array if is set to false due
;                                                    to the "holes" left empty by the ignored <th> and </th> tags.
;                                                    (Best DON'T set this option to False !)
; Return values .: Success:               2D array containing data from the wanted html table.
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no tables are present in the passed HTML
;                                                   2 - error while parsing tables, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tables, (open/close mismatch error)
;                                                   4 - invalid table index request (requested table nr. is out of boundaries)
; ===============================================================================================================================
Func _HtmlTableGetWriteToArray($sHtml, $iWantedTable = 1, $bFillSpan = False, $bExtractTH = True)
    Local $aSingleTable = _HtmlTableGetList($sHtml, $iWantedTable)
    If @error Then Return SetError(@error, 0, "")
    Local $aTableData = _HtmlTableWriteToArray($aSingleTable[1], $bFillSpan, $bExtractTH) ; pass the caller's options through
    If @error Then Return SetError(@error, 0, "")
    Return SetError(0, $aTableData[0][0], $aTableData)
EndFunc   ;==>_HtmlTableGetWriteToArray

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: Searches for and extracts all portions of html code within opening and closing tags inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> lines, one in each element (even if they are nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $sOpening            - A string value indicating the opening tag
;                  $sClosing            - A string value indicating the closing tag
; Return values .: Success:               a 1D, 1-based array containing all the portions of html code representing the element
;                                         element [0] of the array (and @extended as well) contains the count of found elements
;                  Failure:               An empty string and sets @error as follows:
;                                         @error:   1 - no such tags are present in the passed HTML
;                                                   2 - error while parsing tags, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tags, (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; it finds how many of such tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended now holds the nr. of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that opening <tag and closing </tag> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the html
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no such tags on this HTML page
    EndIf
EndFunc   ;==>_ParseTags
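
For a quicker check without the GUI, the span handling can also be tried directly on a small inline table. This is only a minimal sketch: the test table is made up for illustration, and it assumes the functions above are saved as _HtmlTable2Array.au3 in the include path (otherwise paste them below the snippet).

```autoit
#include <Array.au3>
#include <_HtmlTable2Array.au3> ; assumption: the UDF listed above, saved under this name

; Hypothetical test table: the first cell spans two rows, the last cell spans two columns
Local $sTable = '<table>' & _
        '<tr><td rowspan="2">A</td><td>B</td><td>C</td></tr>' & _
        '<tr><td colspan="2">D</td></tr>' & _
        '</table>'

; $bFillSpan = True repeats the content of a spanning cell across its whole span area
Local $aData = _HtmlTableWriteToArray($sTable, True)
If @error Then
    ConsoleWrite("Parsing failed, @error = " & @error & @CRLF)
Else
    _ArrayDisplay($aData, "rowspan / colspan test")
EndIf
```

If the span handling works as described, the array should have 2 rows and 3 columns, with A, B, C in the first row and A, D, D in the second.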

 

I don't know why it went wrong:

"C:\Program Files\AutoIt3\Include\IE.au3" (1654) : ==> ??????????.:
$oObject.document.Write($sHTML)
$oObject.document^ ERROR

Posted (edited)

Hi @Letraindusoir

In the listing above, comment out line 36:

; _IEDocWriteHTML($oIE2, "<HTML></HTML>")

Also, since web links often disappear, you can test with this other link on line 37: https://html.com/tables/rowspan-colspan/ (until this one disappears from the web too):

GUICtrlSetData($idInputUrl, "https://html.com/tables/rowspan-colspan/") ; example page

You can see the extraction differences by selecting tables 2 and 3.
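
As a possible alternative to commenting the line out, here is a minimal sketch that loads about:blank into the embedded control before writing to it. The assumption (not verified against this demo) is that the error comes from the embedded control having no document object until something has been navigated; the window title and the HTML string are just placeholders.

```autoit
#include <GUIConstantsEx.au3>
#include <IE.au3>

Local $oIE = _IECreateEmbedded()
GUICreate("Embedded IE sketch", 500, 400)
GUICtrlCreateObj($oIE, 10, 10, 480, 380)
GUISetState(@SW_SHOW)

; Assumption: navigating to about:blank first gives the control a document to write into
_IENavigate($oIE, "about:blank")
_IEDocWriteHTML($oIE, "<html><body><table><tr><td>ready</td></tr></table></body></html>")

Do ; wait for the user to close the window
Until GUIGetMsg() = $GUI_EVENT_CLOSE
GUIDelete()
```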

 

Edited by Chimp

 


  • 3 months later...
Posted (edited)

 

I tried the corrected script (with line 36 commented out), but I get this message:

Quote

--> IE.au3 T3.0-2 Error from function _IELoadWait, $_IESTATUS_InvalidDataType

Even though the site exists, the tab always shows the message "page not reachable" or "exploration canceled".

What should be corrected?

 

Edited by BLNJ000
