Posted (edited)

This is my first UDF so go easy on me =) It requires Larry's awesome RealFileReading functions, which I hope become part of the standard distro. If you don't have those yet, just paste them at the bottom of any script that uses this function. (update: Scrape.au3 attached with all needed code. Just drop it in your include dir!)

The syntax looks like this:

_ScreenScrape( 'URL', 'String before', 'String after', [..., 'before', 'after', 'before', 'after',...] )

Let's jump straight to the examples so you can see how easy it is: (if you're not sure what screen scraping is, try reading this article)

Examples

;;;scrape google for the number of web pages they index

$google = _ScreenScrape('http://www.google.com','Searching ',' web pages')
MsgBox(1,'',$google)

;;;scrape microsoft for the last time they updated their homepage

$microsoft = _ScreenScrape('http://www.microsoft.com','Last Updated: ',' Pacific Time')
MsgBox(1,'',$microsoft)

;;;scrape wikipedia for the total number of articles

$wikipedia = _ScreenScrape('http://en.wikipedia.org/','Statistics">','</a> articles.')
MsgBox(1,'',$wikipedia)

;;;advanced: scrape the wikipedia statistics page for six things at once!!
#include <array.au3>

Global $wikipediaStatistics = _ScreenScrape('http://en.wikipedia.org/wiki/Special:Statistics', _ 
                                           'Wikipedia currently has <b>','</b> <a href="/wiki/Wikipedia:What_is_an_article"', _
                                           'Including these, we have <b>','</b> pages.</p>', _ 
                                           '<p>Users have made <b>', '</b> edits since July 2002', _ 
                                           'an average of <b>','</b> edits per page.</p>', _ 
                                           '<p>We have <b>','</b> registered users', _ 
                                           'of which <b>','</b> are <a hr')
_ArrayDisplay($wikipediaStatistics,'')

_ScreenScrape:

;===============================================================================
;
; Function Name:    _ScreenScrape
; Description:    Easily screen scrape any web page for the text you want
; Parameter(s):     $ssURL  - The website to scrape
;                   $ss_1   - The string occurring before the text you want
;                   $ss_2   - The string occurring after the text you want
;                   ...
;                   $ss_19  - The string occurring before the text you want
;                   $ss_20  - The string occurring after the text you want
; Requirement(s):   _UnFormat, _RealFileClose, _RealFileRead, _RealFileOpen
; Return Value(s):  A string if there is only one before/after pair.
;                   An array of strings if there is more than one pair.
; Author(s):        Alterego http://www.br1an.net
; Note(s):        Woot!
;
;===============================================================================

Func _ScreenScrape($ssURL, $ss_1, $ss_2, $ss_3 = 0, $ss_4 = 0, $ss_5 = 0, $ss_6 = 0, $ss_7 = 0, $ss_8 = 0, $ss_9 = 0, $ss_10 = 0, $ss_11 = 0, $ss_12 = 0, $ss_13 = 0, $ss_14 = 0, $ss_15 = 0, $ss_16 = 0, $ss_17 = 0, $ss_18 = 0, $ss_19 = 0, $ss_20 = 0)
    Local $ss_NumParam = @NumParams
    Local $ss_CountOdd = 1
    Local $ss_CountEven = 2
    ; Number of complete before/after pairs (the URL is not part of a pair)
    Local $ss_Half = Int(($ss_NumParam - 1) / 2)
    Local $ss_Data[$ss_NumParam + 1]
    Local $ss_Return[$ss_Half]
    Local $ss_TemporaryStore
    ; Copy the marker strings $ss_1, $ss_2, ... into $ss_Data[1], $ss_Data[2], ...
    For $ss_Primer = 1 To $ss_NumParam - 1
        $ss_Data[$ss_Primer] = _UnFormat (Eval('ss_' & String($ss_Primer)))
    Next
    ; Download the page to a randomly named temp file (Local, so it does not leak into the caller's script)
    Local $file = @TempDir & "\" & Random(500000, 1000000, 1) & ".scrape"
    InetGet($ssURL, $file, 1, 0)
    Local $ss_Handle = _RealFileOpen ($file)
    Local $ss_ReadOnce = _RealFileRead ($ss_Handle, FileGetSize($file))
    Local $ss_PermanentStore = _UnFormat ($ss_ReadOnce[0])
    For $ss_Scrape = 0 To $ss_Half - 1
        $ss_TemporaryStore = $ss_PermanentStore
        ; Drop everything up to and including the "before" string
        $ss_TemporaryStore = StringTrimLeft($ss_TemporaryStore, StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountOdd], 1, 1) + StringLen($ss_Data[$ss_CountOdd]) - 1)
        ; Drop everything from the "after" string onwards
        $ss_TemporaryStore = StringTrimRight($ss_TemporaryStore, StringLen($ss_TemporaryStore) - StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountEven]) + 1)
        $ss_CountOdd = $ss_CountOdd + 2
        $ss_CountEven = $ss_CountEven + 2
        $ss_Return[$ss_Scrape] = $ss_TemporaryStore
    Next
    _RealFileClose ($ss_Handle)
    FileDelete($file)
    If UBound($ss_Return) = 1 Then
        Return $ss_Return[0]
    Else
        Return $ss_Return
    EndIf
EndFunc
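
The before/after extraction at the heart of the loop above can be tried on its own, without downloading anything. The following is only a sketch: _TextBetween is a hypothetical name (it is not in scrape.au3), it assumes a current AutoIt release (three-argument SetError, +=), and unlike the original it searches case-sensitively for both markers and reports a missing marker through @error.

```autoit
; Hypothetical helper, not part of scrape.au3.
; Returns the text between $sBefore and $sAfter.
; On failure returns "" and sets @error: 1 = $sBefore missing, 2 = $sAfter missing.
Func _TextBetween($sText, $sBefore, $sAfter)
    ; Position of the first case-sensitive match of the "before" marker.
    Local $iStart = StringInStr($sText, $sBefore, 1)
    If $iStart = 0 Then Return SetError(1, 0, "")
    $iStart += StringLen($sBefore)

    ; Only the text after the "before" marker is searched for the "after" marker.
    Local $sRest = StringMid($sText, $iStart)
    Local $iEnd = StringInStr($sRest, $sAfter, 1)
    If $iEnd = 0 Then Return SetError(2, 0, "")

    Return StringLeft($sRest, $iEnd - 1)
EndFunc

; Made-up sample text, just to show the call.
ConsoleWrite(_TextBetween('Searching 12345 web pages', 'Searching ', ' web pages') & @CRLF)
```

Checking @error after the call lets a script tell "page layout changed" apart from "the value really was empty".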

scrape.au3

Edited by Alterego
Posted

Multiple lines here, about 15.

"I thoroughly disapprove of duels. If a man should challenge me, I would take him kindly and forgivingly by the hand and lead him to a quiet place and kill him." - Mark Twain
Patient: "It hurts when I do $var_" Doctor: "Don't do $var_" - Lar.
Posted (edited)

With this update (see original post) you can scrape the same page for several things at once, and still in only one line of code! The fastest way to test it is to download scrape.au3 to your include dir and use that.

PS: even with this complete rewrite all old syntax still works. backwards compatibility baby :)

Changelog

22 March 05: Complete rewrite allowing one to scrape the same page for multiple strings
19 March 05: Minor fixes
Edited by Alterego
Posted

Hmm... Maybe I'm missing the obvious here, or I'm jumping ahead because I'm excited that this could be a big time-saver for me, so I'm overlooking the details, but I am missing the _ArrayDisplay function...

Am I doing something stupid here, or is there something else that should be included? :)

Writing damaged code since 1996.

Posted

my apologies. i added that to the example.  my test script environment has all the includes in by default so i overlooked it

<{POST_SNAPBACK}>

No problem, I just commented that stuff out and tried the rest as it was. Thanks for the reply SteveR. :D

I played with this a little bit, but most of what I work with are web pages and it's not really in my best interest to have code interjected in my scrape results, such as line breaks and text formatting. It would be really cool to have a script remove all of the code from a document before/after scraping. Maybe something that finds the first < then the next > and counts the spaces in between then trims the middle out to remove all of the obvious/standard bits of HTML.
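
The "find the first < then the next >" idea can be sketched with plain string functions, no regular expressions needed. _StripTags is a hypothetical name, not something from scrape.au3, and this is a rough cut under simple assumptions: every < starts a tag, an unterminated < throws away the rest of the text, and the contents of script or style blocks are left in place.

```autoit
; Hypothetical helper: remove everything from each "<" up to the next ">".
Func _StripTags($sHtml)
    Local $sOut = "", $iOpen, $iClose
    While 1
        $iOpen = StringInStr($sHtml, "<")
        If $iOpen = 0 Then ExitLoop
        ; Keep the text in front of the tag.
        $sOut &= StringLeft($sHtml, $iOpen - 1)
        ; Drop that text and the "<" itself.
        $sHtml = StringTrimLeft($sHtml, $iOpen)
        $iClose = StringInStr($sHtml, ">")
        If $iClose = 0 Then
            ; Unterminated tag: discard whatever is left.
            $sHtml = ""
            ExitLoop
        EndIf
        ; Drop the tag body and the closing ">".
        $sHtml = StringTrimLeft($sHtml, $iClose)
    WEnd
    ; Whatever remains contains no "<", so it is plain text.
    Return $sOut & $sHtml
EndFunc

ConsoleWrite(_StripTags('<p>Users have made <b>42</b> edits</p>') & @CRLF)
```

Running it over the page before scraping changes what the before/after strings have to look like, so it is probably easier to clean each scraped result afterwards.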

I will try to play with this a little, but if someone beats me to it I won't be upset. :)

Excellent work so far! I am glad someone else is working on this!

Writing damaged code since 1996.

Posted (edited)

Func _html2txt($html)
    $Html2TxT = StringRegExpReplace($html, "<.[^>]*>" , "")
    Return $Html2TxT
EndFunc;==>html2txt

written by supergg02. i use it quite often and it works well. you must be using the latest beta for StringRegExpReplace

i also strip all @CR, @LF, and @CRLF from both your input strings and the document to make matching easier
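
Stripping the line breaks could look roughly like this. The thread does not show what _UnFormat in scrape.au3 actually does, so _StripLineBreaks below is a hypothetical stand-in for the idea, not a copy of it.

```autoit
; Hypothetical helper: remove every carriage return and line feed.
Func _StripLineBreaks($sText)
    ; StringStripCR removes all @CR characters; StringReplace then removes all @LF.
    Return StringReplace(StringStripCR($sText), @LF, "")
EndFunc

; A marker typed on one line now matches HTML that was wrapped in the page source.
Local $sPage = '<p>We have <b>' & @CRLF & '1000</b> registered users'
ConsoleWrite(_StripLineBreaks($sPage) & @CRLF)
```

The same call has to be applied to the before/after strings as well, otherwise a marker that contains a line break can never match the cleaned page.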

Edited by Alterego
Posted

Func _html2txt($html)
    Return StringRegExpReplace($html, "<.[^>]*>" , "")
EndFunc;==>html2txt

<{POST_SNAPBACK}>

Thanks Alterego!

I would also like to thank Larry for acting as the "Mr Clean"-inspired image would suggest and cleaning the code up. You're like the wise code janitor picking up after all of us. We appreciate it! :)

Writing damaged code since 1996.

Posted (edited)

could this be used to parse the *entire* contents of a page?

for example, I've used a VB and now a vb.net application which inputs the entire web page and then proceeds to parse out one(1) character at a time and sends the pure text as an SMS (only to GSM enabled) phone message

could I process with this, for example http://www.cnn.com/ ?

(BTW, I downloaded the latest beta and the html2txt throws an error, unknown function name ??)

Edited by AutoIt
Posted

not sure why regexes don't work for you... this is the best you're gonna get without doing some serious keyword processing to filter out JavaScript:

$file = @HomeDrive & '\cnn.txt' 
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)

The best solution is to use the browser lynx, which returns not-bad-at-all output

$file = @HomeDrive & '\cnn.txt'
RunWait(@ComSpec & ' /c lynx -dump --accept_all_cookies -nolist http://www.cnn.com > ' & $file)
$text = StringStripWS(StringStripWS(StringStripWS(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)),1),2),4)
ClipPut($text)

you'll notice the final example does not contain a regex, so you can just use that as soon as you download lynx and put it in \Windows\

Posted

----Off Topic----

there is a lynx for windows :)

----Off Topic----

If anyone remembers me, I am back. Maybe to stay, maybe not.
Things I am proud of: Pong! in AutoIt | Searchbar
My website: F.R.I.E.S.
A little website that is trying to get started: http://thepiratelounge.net/ (not mine)
The newbies need to stop stealing avatars!!! It is confusing!!

Posted

thanks for the sample, tried that and it throws an "unknown function error"

if I comment out your last sample, the original search works fine but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function" and it points to the line where the cnn.com parsing starts or in the case of html2txt the line where that function starts

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

uninstalled previous version, installed the 3.1.1 and still no joy

Posted

----Off Topic----

there is a lynx for windows  :)

----Off Topic----

<{POST_SNAPBACK}>

sure. you can either run cygwin or download one compiled for win

thanks for the sample, tried that and it throws an "unknown function error"

if I comment out your last sample, the original search works fine but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function" and it points to the line where the cnn.com parsing starts or in the case of html2txt the line where that function starts

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/

uninstalled previous version, installed the 3.1.1 and still no joy

<{POST_SNAPBACK}>

right, i don't know why that is, but you should be able to get along fine using the last code example i provided, as it only uses functions in the last stable distribution.
Posted

I'm also having some trouble, permit me to post a couple questions and comments

1. I installed version 3.1.1 from the public beta download

2. The original sample from Alterego works (ie. get number from google.com)

3. html2txt and the www.cnn.com sample does not work, it shows "unknown function error"

4. I use an include to have all of larry's excellent file handling routines

this piece of code does not work due to the "StringRegExpReplace" being an unknown function

#include <E:\Program Files\AutoIt3\Examples\English\FileInclude.au3>

$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)

Posted

they removed StringRegExpReplace from all releases I believe, so if you don't have it now you aren't going to get it. I'm not sure why this choice was made.

alternative is to use the lynx example i provided. it returns better output anyway. lynx requires no installation. just download it from somewhere and drop it in \windows\
