Creator Posted December 2, 2007 (edited)

Here are a few examples for creating your own web crawler/spider using the free ActiveX component from Chilkat. I found it very fast and easy to use.

Free download of the ActiveX: http://www.chilkatsoft.com/download/SpiderActiveX.msi
Reference for the ActiveX (if you don't want to wait for me to post more examples): http://www.chilkatsoft.com/refdoc/xSpiderRef.html

Examples include:
Getting Started Spidering a Site.au3
Extract HTML Title, Description, Keywords.au3
Fetch robots.txt for a Site.au3
Avoid URLs Matching Any of a Set of Patterns.au3
Setting a Maximum Response Size.au3
Setting a Maximum URL Length.au3
Using the Disk Cache.au3
Crawling the Web.au3
Get Referenced Domains.au3
A Simple Web Crawler.au3

Did I mention it's fully robots.txt compliant!!
Have fun!

More examples to come. Examples added as a new zip file:
Get Base Domains
GetBaseDomain
CanonicalizeUrl
Avoiding Outbound Links Matching Patterns
Must-Match Patterns

These examples are a port of the VBScript examples on the Chilkat site.
Updated the zip with A simple webcrawler.au3 (it crawls a Google directory... how ironic).

Spider_Examples.zip
Spider_Examples_2.zip

Edited December 3, 2007 by Creator
James Posted December 2, 2007

I was actually trying to do the same thing through Google. I will check them out now.
Creator (Author) Posted December 2, 2007 (edited)

Added A simple webcrawler.au3, which crawls a Google directory and is pretty much complete. If you want to build a full HTML index, you can find the complete HTML in the LastHtml property of a crawled URL. Just imagine doing an offline search of the AutoIt forums with all keywords allowed.

Edited December 2, 2007 by Creator
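To make the "full HTML index" idea concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not the Chilkat API: the `pages` dictionary stands in for the HTML you would collect from each crawled URL (for example from the LastHtml property), and the names `html_to_words`, `build_index`, and `search` are hypothetical helpers invented for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_words(html):
    """Return the set of lowercase alphanumeric words in the page text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(extractor.chunks).lower()))


def build_index(pages):
    """Map each word to the set of URLs whose page contains it.

    `pages` is assumed to be {url: html}, filled in by whatever crawler you use.
    """
    index = defaultdict(set)
    for url, html in pages.items():
        for word in html_to_words(html):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

Once the index is built, `search(index, "autoit spider")` returns only the pages containing both words, which is the kind of offline keyword search described above.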
jvanegmond Posted December 3, 2007

Before I check it out and download it, have you included options to ignore robots.txt?
Creator (Author) Posted December 3, 2007 (edited)

The ActiveX component obeys robots.txt natively. If you want to ignore robots.txt, you can't use the component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply with that (for security, privacy, performance, etc.).

-edit- Here is a little more information on how "bad" robots (which ignore robots.txt) get banned: http://www.fleiner.com/bots/

Edited December 3, 2007 by Creator
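For anyone wondering what robots.txt compliance amounts to in practice, here is a minimal sketch in Python using the standard-library `urllib.robotparser`. It only illustrates the idea and says nothing about how the Chilkat component implements it internally; the sample robots.txt text, the `MyCrawler` user agent, and the helper names `build_parser` and `allowed_urls` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (an assumption for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots.txt rules allow this agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "http://example.com/index.html",
        "http://example.com/private/report.html",
    ]
    for url in allowed_urls(parser, "MyCrawler", candidates):
        print(url)
```

A polite crawler runs every candidate URL through a check like this before fetching it, so anything under a disallowed path is never requested.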
jvanegmond Posted December 3, 2007

Makes sense. I might have wanted to try it for personal use, just to gather some data from websites that I normally would not have found, but it is good to stay compliant with robots.txt.
ptrex Posted December 3, 2007

@Creator Great!! I'm always fond of ActiveX components. Good job.

Regards,
ptrex
Creator (Author) Posted December 3, 2007 (edited)

Updated the first post with a new zip file. It contains the following examples:
Get Base Domains
GetBaseDomain
CanonicalizeUrl
Avoiding Outbound Links Matching Patterns
Must-Match Patterns

That's it!! Now you should be well on your way to building a nice little crawler.

Edited December 12, 2007 by Creator
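The ideas behind those examples (canonicalizing a URL, taking its base domain, and filtering URLs with avoid and must-match patterns) can be sketched in Python with the standard library alone. This is not the Chilkat API and its exact rules may differ: the function names are hypothetical, the base-domain logic naively keeps the last two host labels (so it is wrong for suffixes such as co.uk), and patterns are assumed to use `*` as a wildcard.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize_url(url):
    """Lowercase scheme and host, drop the default port and the fragment,
    and use "/" when the path is empty."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def naive_base_domain(url):
    """Return the last two labels of the host name (a naive assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def should_crawl(url, avoid=(), must_match=()):
    """Reject URLs matching any avoid pattern; if must-match patterns are
    given, accept only URLs matching at least one of them.

    Note: fnmatch also treats "?" and "[...]" as special characters.
    """
    if any(fnmatchcase(url, pattern) for pattern in avoid):
        return False
    if must_match and not any(fnmatchcase(url, pattern) for pattern in must_match):
        return False
    return True
```

For example, `should_crawl(url, avoid=["*login*"])` skips any URL containing "login", and canonicalizing before you compare URLs keeps the crawler from visiting the same page twice under slightly different spellings.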
DicatoroftheUSA Posted December 6, 2007

I have an au3 script to share and I am trying to get to ten posts in the examples forums. So I am saying thanks to everyone who makes AU3 scripts that will be useful to me. Thanks!
Fabry Posted January 1, 2008 (edited)

Does it work through a proxy? My output.txt file is empty.

Edited January 1, 2008 by Fabry
coffeeturtle Posted April 18, 2013

Hello! I know this is an old thread, but is there a way to flag links that exist but are dead? Or does the report in output.txt contain only good links? If so, is there a way I can filter or search for dead links specifically? Thanks!
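The thread does not say whether output.txt lists only good links, so the following is just one possible approach, sketched in Python with the standard library rather than with the Chilkat component. It sends a HEAD request to each collected link and flags those that fail to connect or return a 4xx/5xx status. The function names are hypothetical, and note one limitation: some servers reject HEAD requests (for example with 405), so this sketch may flag a working link as dead.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Return "ok" for 2xx/3xx codes; "dead" for errors or no response (None)."""
    if status is not None and 200 <= status < 400:
        return "ok"
    return "dead"


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if the
    server could not be reached at all."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error code
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def find_dead_links(urls, get_status=fetch_status):
    """Return the URLs classified as dead, preserving their order."""
    return [url for url in urls if classify_status(get_status(url)) == "dead"]
```

Feeding it the list of links gathered by a crawl, such as the lines of a report file, gives the dead ones back as a separate list that can be written to its own file.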