AcidicChip Posted November 19, 2005 (edited)

This script is designed to spider along the web and gather media URLs. It's still in the works; I just thought I'd post it to see if I could get some help in making it better. Input and comments are more than welcome.

There are 2 things it currently lacks that I know of.

1) Speed. The script tends to slow down as it goes.
2) An accurate way to determine if the URL is audio, video, or an image. I tried several ways to get the URL headers to retrieve the server's Content-Type output, but each was either too slow [using ObjCreate("winhttp.winhttprequest.5.1")] or caused the script to freeze after so many checks (TCPConnect on port 80, getting the first 1024 bytes and parsing them for the "Content-Type").

Release Notes
=================================================================
Version: 0.21 - Date: 2005-11-20
---------------------------------------
CHANGE: Collected URLs are now stored in a .txt file and read back line by line from it (works faster than an array).
ADDED: Start URL text box.
ADDED: History buffer that saves the last 1024 URLs collected, to check against to prevent hitting the same URLs (capped at 1024 to prevent slowdowns).
CHANGE: When audio or video files are found, the file's root folder is added to the list to be spidered.
; ----------------------------------------------------------------------------
;
; AutoIt Version: 3.1.1.87
; Author: AcidicChip <acidicchip@acidicchip.com>
;
; Script Name: Web Media Spider
; Script Version: 0.21
;
; Script Function:
;   Spider the web and gather media file URLs
;
; ----------------------------------------------------------------------------

Opt("GUIOnEventMode", 1)
Opt("TrayIconDebug", 1)

#include <Array.au3>
#include <GUIConstants.au3>

Dim $collected[1]
Dim $urls[1]
Dim $urlon = 0
Dim $urlnum = 0
Dim $imagenum = 0
Dim $audionum = 0
Dim $videonum = 0

#region "GUI"
GUICreate("Media Spider", 600, 100)

$lblAction = GUICtrlCreateLabel("Action:", 0, 3, 35, 20)
$txtAction = GUICtrlCreateInput("", 40, 0, 560, 20)
GUICtrlSetState($txtAction, $GUI_DISABLE)

$lblURL = GUICtrlCreateLabel("URL:", 0, 23, 35, 20)
$txtURL = GUICtrlCreateInput("", 40, 20, 560, 20)
GUICtrlSetState($txtURL, $GUI_DISABLE)

$prgPercent = GUICtrlCreateProgress(0, 40, 560, 20)
$txtPercent = GUICtrlCreateInput("0%", 560, 40, 40, 20)
GUICtrlSetState($txtPercent, $GUI_DISABLE)

$lblURLs = GUICtrlCreateLabel("URLs:", 0, 63, 35, 20)
$txtURLs = GUICtrlCreateInput("0", 40, 60, 75, 20)
GUICtrlSetState($txtURLs, $GUI_DISABLE)

$lblAudio = GUICtrlCreateLabel("Audio:", 125, 63, 35, 20)
$txtAudio = GUICtrlCreateInput("0", 160, 60, 75, 20)
GUICtrlSetState($txtAudio, $GUI_DISABLE)

$lblImages = GUICtrlCreateLabel("Images:", 245, 63, 36, 20)
$txtImages = GUICtrlCreateInput("0", 285, 60, 75, 20)
GUICtrlSetState($txtImages, $GUI_DISABLE)

$lblVideos = GUICtrlCreateLabel("Videos:", 370, 63, 35, 20)
$txtVideos = GUICtrlCreateInput("0", 410, 60, 75, 20)
GUICtrlSetState($txtVideos, $GUI_DISABLE)

$lblHistory = GUICtrlCreateLabel("History:", 490, 63, 35, 20)
$txtHistory = GUICtrlCreateInput("0", 530, 60, 75, 20)
GUICtrlSetState($txtHistory, $GUI_DISABLE)

$lblStartURL = GUICtrlCreateLabel("Start URL:", 0, 83, 50, 20)
$txtStartURL = GUICtrlCreateInput("http://www.myspace.com/acidicchip", 55, 80, 490, 20)
$btnStartStop = GUICtrlCreateButton("Start", 550, 80, 50, 20)

GUISetState(@SW_SHOW)
GUISetOnEvent($GUI_EVENT_CLOSE, "GUIClose")
GUICtrlSetOnEvent($btnStartStop, "GUIStartStop")
#endregion "GUI"

Func GUIClose()
    Exit
EndFunc   ;==>GUIClose

Func GUIStartStop()
    If GUICtrlRead($btnStartStop) == "Start" Then
        GUICtrlSetData($btnStartStop, "Stop")
        GUICtrlSetState($txtStartURL, $GUI_DISABLE)
        FileDelete("spider.urls.txt")
        GetURLs(GUICtrlRead($txtStartURL))
        Do
            ;$url = $urls[1]
            $urlon = $urlon + 1
            $url = FileReadLine("spider.urls.txt", $urlon)
            ;_ArrayDelete($urls, 1)
            $urlnum = $urlnum - 1
            GetURLs($url)
        Until $urlnum <= 0 Or GUICtrlRead($btnStartStop) == "Start"
        ;Until UBound($urls) <= 1 Or GUICtrlRead($btnStartStop) == "Start"
    Else
        GUICtrlSetData($btnStartStop, "Start")
        GUICtrlSetState($txtStartURL, $GUI_ENABLE)
    EndIf
EndFunc   ;==>GUIStartStop

While 1
    Sleep(250)
Wend

Func Status($action, $url, $percent)
    GUICtrlSetData($txtAction, $action)
    If $url <> "" Then GUICtrlSetData($txtURL, $url)
    GUICtrlSetData($prgPercent, $percent)
    GUICtrlSetData($txtPercent, $percent & "%")
    GUICtrlSetData($txtURLs, $urlnum)
    ;GUICtrlSetData($txtURLs, UBound($urls))
    GUICtrlSetData($txtAudio, $audionum)
    GUICtrlSetData($txtImages, $imagenum)
    GUICtrlSetData($txtVideos, $videonum)
    GUICtrlSetData($txtHistory, UBound($collected))
EndFunc   ;==>Status

Func _ArrayParse($str, $before, $after)
    Return StringRegExp($str, "(?i)" & $before & "(.*?)" & $after, 3)
EndFunc   ;==>_ArrayParse

Func AddURL($url)
    If Not WasCollected($url) Then
        _ArrayAdd($collected, $url)
        ;_ArrayAdd($urls, $url)
        FileWriteLine("spider.urls.txt", $url)
        $urlnum = $urlnum + 1
    EndIf
EndFunc   ;==>AddURL

Func WasCollected($url)
    $return = False
    For $i = 1 To Ubound($collected) - 1 Step 1
        If $collected[$i] == $url Then
            $return = True
            ExitLoop
        EndIf
    Next
    If Not $return And UBound($collected) >= 1024 Then _ArrayDelete($collected, 1)
    Return $return
EndFunc   ;==>WasCollected

Func GetURI($url)
    $uri = StringMid($url, 1, StringInStr($url, "://")) & "//"
    $turl = StringMid($url, StringLen($uri) + 1)
    If StringInStr($turl, "?") Then
        $temp = StringSplit($turl, "?")
        $turl = $temp[1]
        $temp = StringSplit($turl, "/")
        $uri = $uri & $temp[1] & "/"
        For $i = 2 To UBound($temp) - 1 Step 1
            If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
            $uri = $uri & $temp[$i] & "/"
        Next
        If Not InetGetSize(StringLeft($uri, StringLen($uri) - 1)) Then
            $uri = StringMid($url, 1, StringInStr($url, "://")) & "//"
            $temp = StringSplit($turl, "?")
            $turl = $temp[1]
            $temp = StringSplit($turl, "/")
            $uri = $uri & $temp[1] & "/"
            For $i = 2 To UBound($temp) - 2 Step 1
                If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
                $uri = $uri & $temp[$i] & "/"
            Next
        EndIf
    Else
        $temp = StringSplit($turl, "/")
        $uri = $uri & $temp[1] & "/"
        For $i = 2 To UBound($temp) - 1 Step 1
            If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
            $uri = $uri & $temp[$i] & "/"
        Next
    EndIf
    Return $uri
EndFunc   ;==>GetURI

Func GetURLs($url)
    $uri = GetURI($url)
    $file = "spider.html.txt"
    Status("Downloading", $url, 0)
    $filesize = InetGetSize($url)
    $lastsize = 0
    $strikes = 0
    InetGet($url, $file, 1, 1)
    While @InetGetActive
        If $lastsize == @InetGetBytesRead Then $strikes = $strikes + 1
        If $strikes >= 30 Then ExitLoop
        $lastsize = @InetGetBytesRead
        Status("Downloading", $url, Round(($lastsize / $filesize) * 100))
        Sleep(250)
    Wend
    $html = FileRead($file, FileGetSize($file))
    FileDelete($file)
    Status("Parsing URLs", $url, 0)
    $tags = _ArrayParse($html, "<a", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <A> Tags for URLs", $url, Round(($i / (UBound($tags) - 1)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
    $tags = _ArrayParse($html, "<img", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <IMG> Tags for URLs", $url, Round(($i / (UBound($tags) - 1)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
    $tags = _ArrayParse($html, "<embed", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <EMBED> Tags for URLs", $url, Round(($i / (UBound($tags) - 1)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
EndFunc   ;==>GetURLs

Func CheckURL($uri, $str, $ref)
    If StringInStr($str, "href=") Then
        $turl = GetAttr($str, "href=")
        If Not StringInStr(StringLeft($turl, 10), "://") Then
            If StringLeft($turl, 1) == "/" Then
                $turl = $uri & StringMid($turl, 2)
            Else
                $turl = $uri & $turl
            EndIf
        EndIf
        CheckType($turl, $ref)
    EndIf
    If StringInStr($str, "src=") Then
        $turl = GetAttr($str, "src=")
        If Not StringInStr(StringLeft($turl, 10), "://") Then
            If StringLeft($turl, 1) == "/" Then
                $turl = $uri & StringMid($turl, 2)
            Else
                $turl = $uri & $turl
            EndIf
        EndIf
        CheckType($turl, $ref)
    EndIf
EndFunc   ;==>CheckURL

Func GetAttr($str, $attr)
    If StringInStr($str, $attr & '"') Then
        $temp = _ArrayParse($str, $attr & '"', '"')
        If UBound($temp) == 1 Then Return $temp[0]
    ElseIf StringInStr($str, $attr & "'") Then
        $temp = _ArrayParse($str, $attr & "'", "'")
        If UBound($temp) == 1 Then Return $temp[0]
    ElseIf StringInStr($str, $attr) Then
        $temp = StringMid($str, StringInStr($str, $attr) + StringLen($attr))
        If StringInStr($temp, " ") Then
            $temp = StringMid($temp, 1, StringInStr($temp, " ") - 1)
        EndIf
        Return $temp
    EndIf
EndFunc   ;==>GetAttr

Func CheckType($url, $ref)
    If StringRight($url, 4) == ".jpg" Or _
            StringRight($url, 4) == ".gif" Or _
            StringRight($url, 4) == ".png" Or _
            StringRight($url, 4) == ".bmp" Then
        FileWriteLine("spider.images.log", $url & @TAB & $ref)
        $imagenum = $imagenum + 1
    ElseIf StringRight($url, 4) == ".mp3" Or _
            StringRight($url, 4) == ".rbs" Then
        FileWriteLine("spider.audio.log", $url & @TAB & $ref)
        $audionum = $audionum + 1
        AddURL(GetURI($url))
    ElseIf StringRight($url, 4) == ".avi" Or _
            StringRight($url, 4) == ".wmv" Or _
            StringRight($url, 4) == ".mpg" Or _
            StringRight($url, 5) == ".mpeg" Then
        FileWriteLine("spider.video.log", $url & @TAB & $ref)
        $videonum = $videonum + 1
        AddURL(GetURI($url))
    ElseIf StringRight($url, 4) == ".exe" Or _
            StringRight($url, 4) == ".zip" Or _
            StringRight($url, 4) == ".rar" Or _
            StringRight($url, 4) == ".tar" Then
        ;Do Nothing
    Else
        AddURL($url)
    EndIf
EndFunc   ;==>CheckType

Keep in mind that this is my first script, and I am a complete newbie to AutoIt, so my code syntax may be a little dirty.

Edited November 22, 2005 by AcidicChip
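On the second open point in the first post (telling audio, video and images apart by the server's Content-Type rather than by file extension), the idea can be sketched roughly as below. This is only a sketch under assumptions: it targets AutoIt 3.3.x rather than the 3.1.1.87 beta the script was written for, it uses a HEAD request with short timeouts through the same WinHTTP COM object the post mentions, and the names MediaKindFromContentType, GetContentType and _OnComError are made up for illustration. Whether this is fast enough for the spider is untested; the post reports that a WinHTTP approach was too slow.

```autoit
; Sketch only: classify a URL by the Content-Type its server reports.
; Assumes AutoIt 3.3.x and the WinHTTP COM object; all function names are hypothetical.

Global $g_bComError = False
Global $g_oComErrorHandler = ObjEvent("AutoIt.Error", "_OnComError")

Func _OnComError($oError)
    ; Record the failure so a COM error (timeout, bad host) does not abort the script.
    $g_bComError = True
EndFunc   ;==>_OnComError

; Map a Content-Type value to "image", "audio", "video", or "" for anything else.
Func MediaKindFromContentType($sType)
    Local $aKinds[3] = ["image", "audio", "video"]
    $sType = StringLower(StringStripWS($sType, 3)) ; 3 = strip leading and trailing whitespace
    For $i = 0 To UBound($aKinds) - 1
        If StringLeft($sType, StringLen($aKinds[$i]) + 1) = $aKinds[$i] & "/" Then Return $aKinds[$i]
    Next
    Return ""
EndFunc   ;==>MediaKindFromContentType

; Request headers only (HEAD) so the media file itself is never downloaded.
Func GetContentType($sURL)
    $g_bComError = False
    Local $oHTTP = ObjCreate("WinHttp.WinHttpRequest.5.1")
    If Not IsObj($oHTTP) Then Return SetError(1, 0, "")
    $oHTTP.SetTimeouts(2000, 2000, 2000, 2000) ; resolve, connect, send, receive (milliseconds)
    $oHTTP.Open("HEAD", $sURL, False)
    $oHTTP.Send()
    If $g_bComError Then Return SetError(2, 0, "")
    Local $sType = $oHTTP.GetResponseHeader("Content-Type")
    If $g_bComError Then Return SetError(3, 0, "")
    Return $sType
EndFunc   ;==>GetContentType
```

A call such as MediaKindFromContentType(GetContentType($url)) could then stand in for the extension checks in CheckType() when the extension is missing or misleading. The 2-second timeouts are an arbitrary choice, and some servers answer HEAD requests incorrectly or not at all, so falling back to the extension check seems sensible.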
layer Posted November 19, 2005
Wow! A wonderful first script. I haven't tried it yet, but it looks really nice. I've always wanted to make a webspider... Nice!
AcidicChip Posted November 20, 2005
New version released. It's a lot faster, and I'm still trying to make it even faster. Look to the first post for the latest code.
gamerman2360 Posted November 20, 2005
WOW, that's awesome. I've always wondered what a spider would look like. Would FileOpen() make anything faster?
AcidicChip Posted November 20, 2005
Quote: "WOW, that's awesome. I've always wondered what a spider would look like. Would FileOpen() make anything faster?"
I'm using "FileWriteLine" and "FileReadLine" for the URL queue, instead of an array that causes the script to slow down once the array gets pretty big. I don't see that same technique being beneficial anywhere else.
One of the biggest keys to this bot gathering a good amount of media links would be its starting point, lol. I don't know what was the bigger challenge: writing the bot, or finding a good starting URL for the bot to gather the links from.
killaz219 Posted November 20, 2005
If that is your first script then I can't wait until you're not a "newbie" anymore.
AcidicChip Posted November 21, 2005
Quote: "If that is your first script then I can't wait until you're not a "newbie" anymore."
What do you mean?
themax90 Posted November 21, 2005
Very nice, I like it. One thing to work on, however: either code really accurately and cleanly so it's easier to read, or run Tidy on your script. Tidy can be found in SciTE, the AutoIt text editor. Search the forums for it. I really like this, it's great!
AcidicChip Posted November 21, 2005
Quote: "Very nice, I like it. One thing to work on, however: either code really accurately and cleanly so it's easier to read, or run Tidy on your script. Tidy can be found in SciTE, the AutoIt text editor. Search the forums for it. I really like this, it's great!"
I ran Tidy just now with the "Indent + Proper Case" option, and the only difference I see is that each EndFunc now has a ";==> FUNCNAME" comment. Everything else looked exactly the same, including indents.
Did I use it incorrectly?
layer Posted November 21, 2005
Quote: "What do you mean?"
It was a compliment. He meant that you've written such a great first script that he can't wait until you're an advanced scripter... I still think I'm a newbie, however.
AcidicChip Posted November 21, 2005
Quote: "It was a compliment. He meant that you've written such a great first script that he can't wait until you're an advanced scripter... I still think I'm a newbie, however."
Ah, well thanks killaz219. I'm a PHP and VB6/.NET developer, so I'm not new to the development scene, just new to the AutoIt aspect of developing.
gamerman2360 Posted November 30, 2005
Quote: "I'm using "FileWriteLine" and "FileReadLine" for the URL queue, instead of an array that causes the script to slow down once the array gets pretty big. I don't see that same technique being beneficial anywhere else. One of the biggest keys to this bot gathering a good amount of media links would be its starting point, lol. I don't know what was the bigger challenge: writing the bot, or finding a good starting URL for the bot to gather the links from."
I meant that "FileWriteLine" and "FileReadLine", when given a filename, both open and close the file for the duration of the command, which causes it to be opened and closed a lot. FileOpen() will open it once and leave a handle to write to the file. The thing is, if you did that, I don't think you could read the file with anything other than the script, like if you wanted to check on it using Notepad or something.
I wonder if it's possible to also have robot exclusion on this. Does this robot have a name?
AcidicChip Posted December 4, 2005
Quote: "I meant that "FileWriteLine" and "FileReadLine", when given a filename, both open and close the file for the duration of the command, which causes it to be opened and closed a lot. FileOpen() will open it once and leave a handle to write to the file. The thing is, if you did that, I don't think you could read the file with anything other than the script, like if you wanted to check on it using Notepad or something. I wonder if it's possible to also have robot exclusion on this. Does this robot have a name?"
Naw, no name for it...
Doing a FileOpen and using the handle to read/write might be a faster solution. I'll give it a shot.
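The FileOpen() suggestion discussed above can be sketched as follows. It is a minimal illustration rather than a drop-in change to the spider: the file name and the Queue* function names are hypothetical, and the demo closes the write handle before reading so that it does not depend on how a given AutoIt version shares or buffers a file that is open twice. The read side uses FileReadLine on a handle without a line number, which continues from the previous position instead of rescanning the file from line 1 the way FileReadLine("spider.urls.txt", $urlon) has to.

```autoit
; Sketch only: a file-backed URL queue that keeps one write handle open
; instead of passing the filename to FileWriteLine on every call.
; The file name and all Queue* function names are hypothetical.

Global Const $QUEUE_FILE = @TempDir & "\spider.queue.demo.txt"
Global $g_hQueueWrite = -1

Func QueueOpen()
    $g_hQueueWrite = FileOpen($QUEUE_FILE, 2) ; 2 = write mode, erase previous contents
    Return $g_hQueueWrite <> -1
EndFunc   ;==>QueueOpen

Func QueueAdd($sURL)
    ; Writing through the handle avoids one open/close cycle per line.
    Return FileWriteLine($g_hQueueWrite, $sURL)
EndFunc   ;==>QueueAdd

Func QueueClose()
    FileClose($g_hQueueWrite)
    $g_hQueueWrite = -1
EndFunc   ;==>QueueClose

; Read every queued line through a single read handle and join them with "|".
Func QueueReadAll()
    Local $sAll = "", $sLine
    Local $hRead = FileOpen($QUEUE_FILE, 0) ; 0 = read mode
    If $hRead = -1 Then Return SetError(1, 0, "")
    While 1
        $sLine = FileReadLine($hRead) ; no line number: continue from the last position
        If @error Then ExitLoop ; end of file or read error
        $sAll &= $sLine & "|"
    WEnd
    FileClose($hRead)
    Return $sAll
EndFunc   ;==>QueueReadAll

QueueOpen()
QueueAdd("http://example.com/a/")
QueueAdd("http://example.com/b/")
QueueClose() ; closing flushes the buffered lines to disk
ConsoleWrite(QueueReadAll() & @CRLF)
FileDelete($QUEUE_FILE)
```

The spider reads and writes the queue in alternation, so a real version would probably keep both handles open and call FileFlush() on the write handle before each read. That interleaved behaviour, and whether Notepad can open the file in the meantime, is worth testing on the AutoIt version in use rather than taking from this sketch.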
gamerman2360 Posted December 4, 2005
How about AutoItBot? If there was a name, there could be a way to exclude it from websites as soon as that kind of thing was added.
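Robot exclusion, as raised above, usually means honouring a site's robots.txt file. The sketch below is deliberately simplified and is not part of the posted script: it understands only "User-agent" and "Disallow" prefix rules (no Allow lines, wildcards or crawl delays), it treats rules for "*" and rules for the named bot as both applying, and RobotsAllows is a hypothetical function name. "AutoItBot" is simply the name suggested in this thread.

```autoit
; Sketch only: a simplified robots.txt check (User-agent and Disallow prefixes only).
; RobotsAllows() is a hypothetical name; "AutoItBot" is the name suggested in this thread.

Func RobotsAllows($sRobotsTxt, $sBotName, $sPath)
    Local $aLines = StringSplit(StringStripCR($sRobotsTxt), @LF)
    Local $bApplies = False, $bInRules = False
    Local $sLine, $iHash, $iColon, $sField, $sValue
    For $i = 1 To $aLines[0]
        $sLine = $aLines[$i]
        ; Drop trailing comments.
        $iHash = StringInStr($sLine, "#")
        If $iHash Then $sLine = StringLeft($sLine, $iHash - 1)
        ; Every meaningful line has the form "field: value".
        $iColon = StringInStr($sLine, ":")
        If Not $iColon Then ContinueLoop
        $sField = StringLower(StringStripWS(StringLeft($sLine, $iColon - 1), 3))
        $sValue = StringStripWS(StringMid($sLine, $iColon + 1), 3)
        Switch $sField
            Case "user-agent"
                ; A User-agent line that follows rule lines starts a new group.
                If $bInRules Then
                    $bApplies = False
                    $bInRules = False
                EndIf
                If $sValue = "*" Or StringInStr($sBotName, $sValue) Then $bApplies = True
            Case "disallow"
                $bInRules = True
                ; An empty Disallow value means "nothing is disallowed".
                If $bApplies And $sValue <> "" And StringLeft($sPath, StringLen($sValue)) == $sValue Then Return False
        EndSwitch
    Next
    Return True
EndFunc   ;==>RobotsAllows
```

In the spider, the robots.txt text for a host would have to be fetched once (for example with InetRead() in AutoIt 3.3.x, wrapped in BinaryToString()) and cached, and RobotsAllows() called with the path part of each URL before GetURLs() downloads it. Sending a matching User-Agent header is a separate step that this sketch does not cover.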
ivan Posted April 20, 2009
You deserve a medal for this mate. Cheers.
Marlo Posted April 20, 2009
3 and a half years overdue eh...
JackDinn Posted April 28, 2009 (edited)
What a great script! But a somewhat unrelated question: it works fine, but I can't stop it. I have just started looking at and using OnEventMode 1, and was looking at this code to see how it works (pretty simple really). When I run this script I can click Start and it gets to the "GUIStartStop" func, but while it's running I hit Stop and it doesn't get into the "GUIStartStop" func. Also, while running it won't register the GUISetOnEvent($GUI_EVENT_CLOSE, "GUIClose"): when I try to close the GUI it again doesn't get to the "GUIClose" func. I have searched the script for anything that might be turning off the GUICtrlSetOnEvent or changing it, but can't find anything. Sorry if I missed something really simple. Thx all.
Edited April 28, 2009 by JackDinn
corgano Posted April 28, 2009
Very nice! How would I make it download images instead? This would be awesome...
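One possible answer to the question above, sketched under assumptions: the script already writes each image URL to spider.images.log as URL, a tab, then the referring page, so a second pass could download those files. The code below assumes AutoIt 3.3.x, where a blocking InetGet() returns the number of bytes downloaded. The helper names LocalNameFromURL and DownloadLoggedImages and the target folder are hypothetical.

```autoit
; Sketch only: download every URL listed in spider.images.log.
; Assumes AutoIt 3.3.x (blocking InetGet returns the byte count); helper names are hypothetical.

; Derive a local file name from the last path segment of a URL.
Func LocalNameFromURL($sURL)
    Local $iQuery = StringInStr($sURL, "?")
    If $iQuery Then $sURL = StringLeft($sURL, $iQuery - 1) ; drop the query string
    Local $sName = StringMid($sURL, StringInStr($sURL, "/", 0, -1) + 1) ; text after the last "/"
    ; Replace characters that are not legal in Windows file names.
    $sName = StringRegExpReplace($sName, '[\\/:*?"<>|]', "_")
    If $sName = "" Then $sName = "unnamed"
    Return $sName
EndFunc   ;==>LocalNameFromURL

; Returns the number of files saved; sets @error to 1 if the log cannot be opened.
Func DownloadLoggedImages($sLog, $sFolder)
    Local $hLog = FileOpen($sLog, 0) ; 0 = read mode
    If $hLog = -1 Then Return SetError(1, 0, 0)
    DirCreate($sFolder)
    Local $iSaved = 0, $sLine, $aParts
    While 1
        $sLine = FileReadLine($hLog)
        If @error Then ExitLoop ; end of file or read error
        $aParts = StringSplit($sLine, @TAB) ; [1] = image URL, [2] = page it was found on
        If $aParts[1] = "" Then ContinueLoop
        If InetGet($aParts[1], $sFolder & "\" & LocalNameFromURL($aParts[1])) > 0 Then $iSaved += 1
    WEnd
    FileClose($hLog)
    Return $iSaved
EndFunc   ;==>DownloadLoggedImages
```

It would be called as DownloadLoggedImages("spider.images.log", @ScriptDir & "\images") after a crawl. Two images with the same file name overwrite each other in this sketch, so a real version would need to make names unique, for instance by prefixing a counter.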
JackDinn Posted April 28, 2009 (edited)
Hmm, I think I found what the problem is: "When an event calls a function in OnEvent mode, no other event will be executed until that function returns."
So it's because the other functions are called from within GUIStartStop() that it can never get back to GUIStartStop() by event call until it has returned from all the other funcs, back to where it was first called from. That's why I can start it (the first OnEvent call is fine), but after that you cannot call GUIStartStop() again, or GUIClose(), until the first call finishes. In this case it doesn't finish for the whole run (not quite sure how long), until it gets back to the little While/WEnd loop it was initially called from.
http://www.autoitscript.com/forum/index.ph...&hl=OnEvent
Edited April 28, 2009 by JackDinn
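A common way around the behaviour described above is to keep the event handlers tiny: they only flip a flag, and the main loop does the long-running work one step at a time. The sketch below shows that pattern on its own, not a patched version of the spider. It assumes AutoIt 3.3.x (hence GUIConstantsEx.au3), and CrawlOneURL() is a hypothetical stand-in for a single GetURLs() call.

```autoit
; Sketch only: keep OnEvent handlers short so Stop and Close stay responsive.
; CrawlOneURL() is a hypothetical stand-in for one GetURLs() call from the spider.
#include <GUIConstantsEx.au3>

Opt("GUIOnEventMode", 1)

Global $g_bRunning = False, $g_iDone = 0

Global $g_hGUI = GUICreate("OnEvent demo", 220, 60)
Global $g_idButton = GUICtrlCreateButton("Start", 10, 10, 80, 25)
Global $g_idCount = GUICtrlCreateLabel("0", 110, 15, 100, 20)
GUISetOnEvent($GUI_EVENT_CLOSE, "OnClose")
GUICtrlSetOnEvent($g_idButton, "OnStartStop")
GUISetState(@SW_SHOW)

; The work happens here, outside any event handler, one unit per iteration.
While 1
    If $g_bRunning Then
        CrawlOneURL()
    Else
        Sleep(100)
    EndIf
WEnd

Func OnStartStop()
    ; Only toggle the flag and the caption, then return immediately.
    $g_bRunning = Not $g_bRunning
    If $g_bRunning Then
        GUICtrlSetData($g_idButton, "Stop")
    Else
        GUICtrlSetData($g_idButton, "Start")
    EndIf
EndFunc   ;==>OnStartStop

Func OnClose()
    Exit
EndFunc   ;==>OnClose

Func CrawlOneURL()
    Sleep(250) ; pretend to download and parse one page
    $g_iDone += 1
    GUICtrlSetData($g_idCount, $g_iDone)
EndFunc   ;==>CrawlOneURL
```

Applied to the spider, the Do...Until loop in GUIStartStop() would move into the main While loop. A Stop click would then take effect after the page currently being processed, since CrawlOneURL() here (or GetURLs() there) still runs to completion before the flag is checked again.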