Posted

Normally I load HTML files into the browser and grab the HTML through the $oIE object, then run the DOM functions on each file loaded in the browser. The problem is that I need to process several hundred HTML files, each around 100 KB, and loading each one into the browser takes a long time.

So I load the HTML files into a string, for example $str, and try to run the DOM functions on that string. Of course, a string is not an object, so it won't cooperate.

Is there a way to fake it out, or to convert the string $str into an object, so the DOM functions can do their thing?

I am loading $str the old-fashioned way: I open the file, read it line by line, and append each line with the following:

$str = $str & $line

That way each file loads in maybe 1 or 2 seconds, while loading the same file in the browser may take 7 or 8 seconds.

Can anyone help?

Posted

tommytx,

This will get you started.

This is some code that I use to test scraping routines. The source is a simple text file downloaded with InetGet().

#include <StaticConstants.au3>
#include <WindowsConstants.au3>
#include <IE.au3>
#include <array.au3>
#include <string.au3>

#AutoIt3Wrapper_Add_Constants=n

local $fln = 'k:\sd\sd0100\nba\boxes\400440940'         ; this is a file downloaded with inetget

filedelete(@tempdir & '\tmp.txt')
filewrite(@tempdir & '\tmp.txt',_do_tbls( fileread($fln) ))
shellexecute(@tempdir & '\tmp.txt')

func _do_tbls($html)

    $html = stringreplace($html,@crlf,'')
    $html = stringreplace($html,@cr,'')

    Local $o_htmlfile = ObjCreate('HTMLFILE'), $str

    If Not IsObj($o_htmlfile) Then Return SetError(-1)

    ; Write the HTML string into the in-memory document so it can be queried through the DOM
    $o_htmlfile.open()
    $o_htmlfile.write($html)
    $o_htmlfile.close()

    Local $otbls = _IETagnameGetCollection($o_htmlfile, 'TABLE')
    if not isobj($otbls) then return seterror(-2)

    Local $otitles = _IETagnameGetCollection($o_htmlfile, 'TITLE')
    if not isobj($otitles) then return seterror(-3)

    for $otitle in $otitles
        ConsoleWrite($otitle.innertext & @LF)
    next

    Local $odivs = _IETagnameGetCollection($o_htmlfile, 'DIV')
    if not isobj($odivs) then return seterror(-4)

    for $odiv in $odivs
        ConsoleWrite('!----  ' & 'id = ' & $odiv.id  & ' classname= ' & $odiv.classname  &  ' title = ' & $odiv.title & @LF)
        $str &= $odiv.innertext & @LF
    next

    for $otbl in $otbls

        ConsoleWrite(stringformat('ID = %-30sTITLE = %-30sSUMMARY = %-30s',$otbl.id, $otbl.title, $otbl.summary) & @lf)

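        ; Convert the table to a 2D array; skip tables that cannot be converted
        Local $a10 = _IETableWriteToArray($otbl,true)

        if not isarray($a10) then continueloop

        _arraydisplay($a10)

        ; Flatten the array into text: cells separated by a backtick, one line per outer index
        for $i = 0 to ubound($a10,1) - 1
            for $j = 0 to ubound($a10,2) - 1
                $str &= $a10[$i][$j] & '`'
            Next
            $str &= @LF
        next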
    next

    return $str

endfunc
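
Boiled down to the part that answers the question, the idea is roughly the sketch below: read the whole file with a single FileRead() call instead of appending line by line, then hand the string to an HTMLFILE object so it can be queried like a document. The file name sample.html and the helper _HtmlStringToDoc are made up for illustration; the open/write/close calls are the same ones used in the function above. Treat it as a starting point, not tested against your files.

```autoit
; Minimal sketch: parse an HTML string through the DOM without opening a browser.
; $sPath is a hypothetical file name; point it at one of your own files.
Local $sPath = @ScriptDir & '\sample.html'

; FileRead() with no count argument returns the whole file in one call,
; which replaces the line-by-line $str = $str & $line loop.
Local $sHtml = FileRead($sPath)
If @error Then
    ConsoleWrite('Could not read ' & $sPath & @LF)
    Exit 1
EndIf

Local $oDoc = _HtmlStringToDoc($sHtml)
If Not IsObj($oDoc) Then
    ConsoleWrite('Could not create the HTMLFILE object' & @LF)
    Exit 2
EndIf

; From here on $oDoc is a document object, so DOM calls work on it.
ConsoleWrite('Title:  ' & $oDoc.title & @LF)
ConsoleWrite('Tables: ' & $oDoc.getElementsByTagName('table').length & @LF)

; Hypothetical helper: wraps the open/write/close calls shown earlier.
; Returns the document object, or 0 if the object could not be created.
Func _HtmlStringToDoc($sHtml)
    Local $oDoc = ObjCreate('HTMLFILE')
    If Not IsObj($oDoc) Then Return SetError(1, 0, 0)
    $oDoc.open()
    $oDoc.write($sHtml)
    $oDoc.close()
    Return $oDoc
EndFunc
```

To handle several hundred files, the same helper could be called inside a loop over the file list, creating a fresh object for each file.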

kylomas

