Alterego Posted March 19, 2005 (edited)

This is my first UDF so go easy on me =) It requires Larry's awesome RealFileReading functions, which I hope become part of the standard distro. If you don't have those yet, just paste them at the bottom of a script running this function. (Update: Scrape.au3 is attached with all needed code. Just drop it in your include dir!)

The syntax looks like this:

_ScreenScrape( 'URL', 'String before', 'String after', [..., 'before', 'after', 'before', 'after',...] )

Let's jump straight to the examples so you can see how easy it is (if you're not sure what screen scraping is, try reading this article).

Examples

```autoit
; assumes the attached Scrape.au3 is in your include dir
#include <Scrape.au3>

;;;scrape google for the number of web pages they index
$google = _ScreenScrape('http://www.google.com','Searching ',' web pages')
MsgBox(1,'',$google)

;;;scrape microsoft for the last time they updated their homepage
$microsoft = _ScreenScrape('http://www.microsoft.com','Last Updated: ',' Pacific Time')
MsgBox(1,'',$microsoft)

;;;scrape wikipedia for the total number of articles
$wikipedia = _ScreenScrape('http://en.wikipedia.org/','Statistics">','</a> articles.')
MsgBox(1,'',$wikipedia)

;;;advanced: scrape the wikipedia statistics page for six things at once!!
#include <array.au3>
Global $wikipediaStatistics = _ScreenScrape('http://en.wikipedia.org/wiki/Special:Statistics', _
        'Wikipedia currently has <b>','</b> <a href="/wiki/Wikipedia:What_is_an_article"', _
        'Including these, we have <b>','</b> pages.</p>', _
        '<p>Users have made <b>', '</b> edits since July 2002', _
        'an average of <b>','</b> edits per page.</p>', _
        '<p>We have <b>','</b> registered users', _
        'of which <b>','</b> are <a hr')
_ArrayDisplay($wikipediaStatistics,'')
```

_ScreenScrape:

```autoit
;===============================================================================
;
; Function Name:    _ScreenScrape
; Description:      Easily screen scrape any web page for the text you want
; Parameter(s):     $ssURL  - The website to scrape
;                   $ss_1   - The string occurring before the text you want
;                   $ss_2   - The string occurring after the text you want
;                   ...
;                   $ss_19  - The string occurring before the text you want
;                   $ss_20  - The string occurring after the text you want
; Requirement(s):   _UnFormat, _RealFileClose, _RealFileRead, _RealFileOpen
; Return Value(s):  If only one result, returns a string. If more than one
;                   result, returns an array
; Author(s):        Alterego http://www.br1an.net
; Note(s):          Woot!
;
;===============================================================================
Func _ScreenScrape($ssURL, $ss_1, $ss_2, $ss_3 = 0, $ss_4 = 0, $ss_5 = 0, $ss_6 = 0, $ss_7 = 0, $ss_8 = 0, $ss_9 = 0, $ss_10 = 0, $ss_11 = 0, $ss_12 = 0, $ss_13 = 0, $ss_14 = 0, $ss_15 = 0, $ss_16 = 0, $ss_17 = 0, $ss_18 = 0, $ss_19 = 0, $ss_20 = 0)
    Local $ss_NumParam = @NumParams
    Local $ss_CountOdd = 1
    Local $ss_CountEven = 2
    Local $ss_Half = Int($ss_NumParam / 2) ; number of before/after pairs
    Local $ss_Data[$ss_NumParam + 1]
    Local $ss_Return[$ss_Half]
    Local $ss_TemporaryStore
    ; $ss_Data[0] stays empty so the indexes line up with $ss_1 .. $ss_20
    For $ss_Primer = 1 To $ss_NumParam - 1
        $ss_Data[$ss_Primer] = _UnFormat (Eval('ss_' & String($ss_Primer)))
    Next
    ; download the page to a randomly named temp file
    Local $file = @TempDir & "\" & Random(500000, 1000000, 1) & ".scrape"
    InetGet($ssURL, $file, 1, 0)
    Local $ss_Handle = _RealFileOpen ($file)
    Local $ss_ReadOnce = _RealFileRead ($ss_Handle, FileGetSize($file))
    Local $ss_PermanentStore = _UnFormat ($ss_ReadOnce[0])
    For $ss_Scrape = 0 To $ss_Half - 1
        $ss_TemporaryStore = $ss_PermanentStore
        ; cut everything up to and including the "before" string
        $ss_TemporaryStore = StringTrimLeft($ss_TemporaryStore, StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountOdd], 1, 1) + StringLen($ss_Data[$ss_CountOdd]) - 1)
        ; cut everything from the "after" string onwards
        $ss_TemporaryStore = StringTrimRight($ss_TemporaryStore, StringLen($ss_TemporaryStore) - StringInStr($ss_TemporaryStore, $ss_Data[$ss_CountEven]) + 1)
        $ss_CountOdd = $ss_CountOdd + 2
        $ss_CountEven = $ss_CountEven + 2
        $ss_Return[$ss_Scrape] = $ss_TemporaryStore
    Next
    _RealFileClose ($ss_Handle)
    FileDelete($file)
    If UBound($ss_Return) = 1 Then
        Return $ss_Return[0]
    Else
        Return $ss_Return
    EndIf
EndFunc   ;==>_ScreenScrape
```

Attachment: scrape.au3

Edited March 30, 2005 by Alterego

This dynamic web page is powered by AutoIt 3.
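For anyone who wants to see the before/after matching on its own, here is a minimal sketch of just that step, with no download and no RealFile functions involved. It is not Alterego's code: `_BetweenSketch` is a hypothetical helper name, the sample string is made up, and the three-argument form of `SetError` assumes a current AutoIt release rather than the 2005 builds discussed in this thread.

```autoit
; Minimal sketch of the before/after extraction step only.
; _BetweenSketch is a hypothetical name, not part of Scrape.au3.
Func _BetweenSketch($sText, $sBefore, $sAfter)
    ; Locate the marker that precedes the wanted text (case-sensitive).
    Local $iStart = StringInStr($sText, $sBefore, 1)
    If $iStart = 0 Then Return SetError(1, 0, "")
    ; Drop everything up to and including the "before" marker.
    Local $sRest = StringTrimLeft($sText, $iStart + StringLen($sBefore) - 1)
    ; Locate the marker that follows the wanted text.
    Local $iEnd = StringInStr($sRest, $sAfter, 1)
    If $iEnd = 0 Then Return SetError(2, 0, "")
    ; Whatever sits in front of the "after" marker is the result.
    Return StringLeft($sRest, $iEnd - 1)
EndFunc   ;==>_BetweenSketch

; Made-up sample input, shaped like the Wikipedia statistics example.
Global $sPage = "<p>We have <b>12345</b> registered users</p>"
MsgBox(0, "Sketch", _BetweenSketch($sPage, "We have <b>", "</b> registered")) ; shows 12345
```

Unlike the posted function, this sketch sets @error when a marker is missing instead of returning whatever the trim happens to leave, which may be easier to debug when a site changes its markup.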
MHz Posted March 19, 2005

Interesting. But using code tags would enable people to copy the code correctly.
steveR Posted March 19, 2005

I don't know why the codebox makes it all one line, that is so dumb.

AutoIt3 online docs: Use it... Know it... Live it... | MSDN library | global Help and Support | Windows: Just another pane in the glass.
Insolence Posted March 19, 2005

Multiple lines here, about 15.

"I thoroughly disapprove of duels. If a man should challenge me, I would take him kindly and forgivingly by the hand and lead him to a quiet place and kill him." - Mark Twain
Patient: "It hurts when I do $var_" Doctor: "Don't do $var_" - Lar.
Alterego (Author) Posted March 22, 2005 (edited)

With this update (see original post) you can scrape the same page for several things at once, and still with only one line of code! The fastest way to test it is to download scrape.au3 to your include dir and use that.

PS: even with this complete rewrite, all old syntax still works. Backwards compatibility, baby.

Changelog
22 March 05: Complete rewrite allowing one to scrape the same page for multiple strings
19 March 05: Minor fixes

Edited March 22, 2005 by Alterego
Alterego (Author) Posted March 25, 2005

I've received several PMs asking for examples of using this, and also of generating RSS feeds. AutoIt is powering this page, so that should help.
cybie Posted March 29, 2005

Hmm... Maybe I'm missing the obvious here, or I'm jumping ahead because I'm excited that this could be a big time-saver for me, so I'm overlooking the details, but I am missing the _ArrayDisplay function... Am I doing something stupid here, or is there something else that should be included?

Writing damaged code since 1996.
steveR Posted March 29, 2005

The _ArrayDisplay() UDF is part of the array.au3 file. To use the UDF functions, you have to #include it in your script. Example:

```autoit
#include <array.au3>
$array = StringSplit("foo,bar", ",")
_ArrayDisplay($array, "test")
```
Alterego (Author) Posted March 30, 2005

my apologies. i added that to the example. my test script environment has all the includes in by default, so i overlooked it
cybie Posted March 30, 2005

> my apologies. i added that to the example. my test script environment has all the includes in by default so i overlooked it

No problem, I just commented that stuff out and tried the rest as it was. Thanks for the reply, SteveR.

I played with this a little bit, but most of what I work with are web pages, and it's not really in my best interest to have code interjected in my scrape results, such as line breaks and text formatting. It would be really cool to have a script remove all of the code from a document before/after scraping. Maybe something that finds the first < then the next >, counts the characters in between, then trims the middle out to remove all of the obvious/standard bits of HTML.

I will try to play with this a little, but if someone beats me to it I won't be upset. Excellent work so far! I am glad someone else is working on this!
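The idea described above (find the first <, find the next >, and cut out what lies between) can be sketched with plain string functions, so it does not depend on any regular-expression support. This is only an illustration under stated assumptions: `_StripTagsSketch` is a hypothetical name that does not come from Scrape.au3, and the sketch treats every `<...>` pair as a tag, so text inside script or style blocks is left in the output.

```autoit
; Hypothetical helper: remove everything between "<" and the next ">".
Func _StripTagsSketch($sHtml)
    Local $sOut = "", $iOpen, $iClose
    While 1
        $iOpen = StringInStr($sHtml, "<")
        If $iOpen = 0 Then ExitLoop ; no more tags
        ; Keep the text in front of the tag.
        $sOut = $sOut & StringLeft($sHtml, $iOpen - 1)
        ; Drop that text and the "<" itself.
        $sHtml = StringTrimLeft($sHtml, $iOpen)
        $iClose = StringInStr($sHtml, ">")
        If $iClose = 0 Then
            ; Unclosed "<": treat it as literal text and stop.
            $sOut = $sOut & "<"
            ExitLoop
        EndIf
        ; Drop the tag body and the closing ">".
        $sHtml = StringTrimLeft($sHtml, $iClose)
    WEnd
    ; Whatever is left contains no complete tag.
    Return $sOut & $sHtml
EndFunc   ;==>_StripTagsSketch

MsgBox(0, "Sketch", _StripTagsSketch("<p>We have <b>12345</b> users</p>")) ; shows: We have 12345 users
```

Running the stripper on the scraped result rather than on the whole page keeps the before/after strings usable, since those often contain tags themselves.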
Alterego (Author) Posted March 30, 2005 (edited)

```autoit
Func _html2txt($html)
    $Html2TxT = StringRegExpReplace($html, "<.[^>]*>" , "")
    Return $Html2TxT
EndFunc   ;==>_html2txt
```

written by supergg02. i use it quite often and it works well. you must be using the latest beta for StringRegExpReplace.

i also strip all @CR, @LF, and @CRLF both from your input and from the document to make matching easier

Edited March 30, 2005 by Alterego
cybie Posted March 30, 2005

```autoit
Func _html2txt($html)
    Return StringRegExpReplace($html, "<.[^>]*>" , "")
EndFunc   ;==>_html2txt
```

Thanks Alterego! I would also like to thank Larry for acting as the "Mr Clean"-inspired image would suggest and cleaning the code up. You're like the wise code janitor picking up after all of us. We appreciate it!
AutoIt Posted March 31, 2005 (edited)

could this be used to parse the *entire* contents of a page?

for example, I've used a VB and now a vb.net application which reads in the entire web page, parses it one character at a time, and sends the pure text as an SMS message (only to GSM-enabled phones)

could I process, for example, http://www.cnn.com/ with this?

(BTW, I downloaded the latest beta and the html2txt throws an error, unknown function name ??)

Edited March 31, 2005 by AutoIt
Alterego (Author) Posted March 31, 2005

not sure why regexes don't work for you... this is the best you're gonna get without doing some serious keyword processing to filter JavaScript:

```autoit
$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)
```

The best solution is to use the browser lynx, which returns not-bad-at-all output:

```autoit
$file = @HomeDrive & '\cnn.txt'
RunWait(@ComSpec & ' /c lynx -dump --accept_all_cookies -nolist http://www.cnn.com > ' & $file)
$text = StringStripWS(StringStripWS(StringStripWS(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)),1),2),4)
ClipPut($text)
```

you'll notice the final example does not contain a regex, so you can just use that as soon as you download lynx and put it in \Windows\
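The nested one-liner in the first example above is easier to follow when unrolled into steps. The sketch below is an illustration, not a replacement posted by anyone in the thread: `_CleanTextSketch` is a hypothetical name, the sample string is made up, and the tag-removal step assumes an AutoIt release in which StringRegExpReplace is available.

```autoit
; Hypothetical helper showing the same cleanup one step at a time.
Func _CleanTextSketch($sRaw)
    ; 1. Remove line breaks so the text becomes one long line.
    Local $sText = StringReplace($sRaw, @CRLF, "")
    $sText = StringReplace($sText, @LF, "")
    $sText = StringStripCR($sText)
    ; 2. Remove anything that looks like a tag (same pattern as _html2txt).
    $sText = StringRegExpReplace($sText, "<.[^>]*>", "")
    ; 3. Strip leading (1), trailing (2) and repeated (4) whitespace in one call.
    Return StringStripWS($sText, 1 + 2 + 4)
EndFunc   ;==>_CleanTextSketch

; Made-up sample input.
ClipPut(_CleanTextSketch("<p>Hello   <b>world</b></p>" & @CRLF))
```

Adding the StringStripWS flags together does in one call what the three nested calls do in sequence. For the lynx variant, step 2 can simply be left out.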
zcoacoaz Posted March 31, 2005

----Off Topic---- there is a lynx for windows ----Off Topic----

If anyone remembers me, I am back. Maybe to stay, maybe not.
Things I am proud of: Pong! in AutoIt | Searchbar
My website: F.R.I.E.S.
A little website that is trying to get started: http://thepiratelounge.net/ (not mine)
The newbies need to stop stealing avatars!!! It is confusing!!
AutoIt Posted March 31, 2005

thanks for the sample, tried that and it throws an "unknown function" error

if I comment out your last sample, the original search works fine, but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function", and it points to the line where the cnn.com parsing starts or, in the case of html2txt, the line where that function starts

I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/ , uninstalled the previous version, installed 3.1.1, and still no joy
Alterego (Author) Posted March 31, 2005

> ----Off Topic---- there is a lynx for windows ----Off Topic----

sure. you can either run cygwin or download one compiled for win

> thanks for the sample, tried that and it throws an "unknown function" error
> if I comment out your last sample, the original search works fine, but the moment I try either html2txt or the sample for cnn.com you posted, bam! "unknown function", and it points to the line where the cnn.com parsing starts or, in the case of html2txt, the line where that function starts
> I downloaded the beta 3.1.1 from http://www.autoitscript.com/autoit3/files/beta/autoit/ , uninstalled the previous version, installed 3.1.1, and still no joy

right, i don't know why that is, but you should be able to get along fine using the last code example i provided, as it only uses functions in the last stable distribution.
AutoIt Posted March 31, 2005 (edited)

the error points to line 3:

StringRegExpReplace

apparently that is an "unknown function"

hmm... I've now tried this with version 3.1.0 and 3.1.1 (latest beta), then I read this page: http://www.autoitscript.com/forum/index.ph...&st=0&p=68496

Edited March 31, 2005 by AutoIt
Guest Nina Posted March 31, 2005

I'm also having some trouble, permit me to post a couple of questions and comments:

1. I installed version 3.1.1 from the public beta download
2. The original sample from Alterego works (i.e. getting the number from google.com)
3. html2txt and the www.cnn.com sample do not work; they show an "unknown function" error
4. I use an include to have all of Larry's excellent file handling routines

This piece of code does not work because "StringRegExpReplace" is an unknown function:

```autoit
#include <E:\Program Files\AutoIt3\Examples\English\FileInclude.au3>
$file = @HomeDrive & '\cnn.txt'
InetGet('http://www.cnn.com', $file, 1)
$text = StringStripWS(StringStripWS(StringStripWS(StringRegExpReplace(StringStripCR(StringReplace(StringReplace(FileRead($file, FileGetSize($file)), @CRLF, '', 0, 0), @LF, '', 0, 0)), "<.[^>]*>", ""),1),2),4)
ClipPut($text)
```
Alterego (Author) Posted March 31, 2005

they removed StringRegExpReplace from all releases, I believe, so if you don't have it now you aren't going to get it. I'm not sure why this choice was made. the alternative is to use the lynx example i provided. it returns better output anyway. lynx requires no installation. just download it from somewhere and drop it in \windows\