
Posted (edited)

Here are a few examples for creating your own web crawler/spider using the free ActiveX component from Chilkat.

I found it very fast and easy to use.

Free download of the ActiveX: http://www.chilkatsoft.com/download/SpiderActiveX.msi

Reference for the ActiveX (if you don't want to wait for me to post more examples :) ): http://www.chilkatsoft.com/refdoc/xSpiderRef.html

Examples include:

  • Getting Started Spidering a Site.au3
  • Extract HTML Title, Description, Keywords.au3
  • Fetch robots.txt for a Site.au3
  • Avoid URLs Matching Any of a Set of Patterns.au3
  • Setting a Maximum Response Size.au3
  • Setting a Maximum URL Length.au3
  • Using the Disk Cache.au3
  • Crawling the Web.au3
  • Get Referenced Domains.au3
  • A Simple Web Crawler.au3
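If you want a feel for the component before opening the zip, here is a minimal getting-started sketch in AutoIt, based on the Chilkat reference linked above. The ProgID passed to ObjCreate() is an assumption (check what SpiderActiveX.msi registers on your machine), and the domain is just an example:

$oSpider = ObjCreate("ChilkatSpider.ChilkatSpider.1") ; ProgID assumption - adjust if ObjCreate() fails
If Not IsObj($oSpider) Then Exit MsgBox(16, "Spider", "Could not create the ActiveX object.")

$oSpider.Initialize("www.chilkatsoft.com")            ; restrict the crawl to this domain
$oSpider.AddUnspidered("http://www.chilkatsoft.com/") ; seed URL to start from

; CrawlNext() fetches one page per call and returns 1 on success
For $i = 1 To 10
    If $oSpider.CrawlNext() <> 1 Then ExitLoop ; queue empty or fetch failed
    ConsoleWrite($oSpider.LastUrl & " - " & $oSpider.LastHtmlTitle & @CRLF)
Next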

Did I mention it's fully robots.txt compliant!

Have fun!

More examples to come!

Examples added in a new zip file:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns
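As a taste of the two URL-utility methods in that list, a hedged sketch (same ProgID assumption as in the first example; the inputs are illustrative):

$oSpider = ObjCreate("ChilkatSpider.ChilkatSpider.1") ; ProgID assumption
If Not IsObj($oSpider) Then Exit

; CanonicalizeUrl() normalizes a URL into a canonical form
ConsoleWrite($oSpider.CanonicalizeUrl("HTTP://www.ChilkatSoft.com:80/index.html#frag") & @CRLF)

; GetBaseDomain() reduces a hostname to its registrable base domain
ConsoleWrite($oSpider.GetBaseDomain("news.chilkatsoft.com") & @CRLF)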

These examples are ports of the VBScript examples on the Chilkat site.

Updated the zip with A Simple Web Crawler.au3 (it crawls a Google directory... how ironic ^_^).

Spider_Examples.zip

Spider_Examples_2.zip

Edited by Creator
Posted (edited)

Added A Simple Web Crawler.au3, which crawls a Google directory and is pretty much complete.

If you want to build a full HTML index, you can find the complete HTML of a crawled URL in the LastHtml property.

Just imagine doing an offline search of the AutoIt forums with all keywords allowed :)
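A rough sketch of that idea (assumes $oSpider was created and seeded as in the getting-started example; the file naming is purely illustrative):

; Save the full HTML of each crawled page for offline indexing
Local $iPage = 0
While $oSpider.CrawlNext() = 1
    $iPage += 1
    FileWrite("page_" & $iPage & ".html", $oSpider.LastHtml) ; LastHtml = complete HTML of the page just crawled
WEnd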

Edited by Creator
Posted (edited)

Before I download it and check it out, have you included options to ignore robots.txt?

The ActiveX natively complies with robots.txt. If you want to ignore it, you can't use this component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply with that (security, privacy, performance, etc.).

:)

-edit- Here's a little more information on how "bad" robots (those that ignore robots.txt) get banned: http://www.fleiner.com/bots/
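If you're curious which rules the spider will obey for a site, the reference also exposes the robots.txt fetch directly. A minimal sketch (same ProgID assumption as before):

$oSpider = ObjCreate("ChilkatSpider.ChilkatSpider.1") ; ProgID assumption
If Not IsObj($oSpider) Then Exit

$oSpider.Initialize("www.chilkatsoft.com")
; FetchRobotsText() returns the robots.txt the component honors for this domain
ConsoleWrite($oSpider.FetchRobotsText() & @CRLF)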

Edited by Creator
Posted

Makes sense. I might want to try it for personal use, just to gather some data from websites that I normally would not have found, but it's good to stay compliant with robots.txt.
Posted (edited)

Updated the first post with a new zip file. It contains the following examples:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns

That's it! Now you should be well on your way to building a nice little crawler :)
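In rough AutoIt terms, the pattern examples come down to calls like these (a sketch, not a full port; the wildcard patterns are illustrative):

$oSpider = ObjCreate("ChilkatSpider.ChilkatSpider.1") ; ProgID assumption
If Not IsObj($oSpider) Then Exit
$oSpider.Initialize("www.chilkatsoft.com")
$oSpider.AddUnspidered("http://www.chilkatsoft.com/")

$oSpider.AddAvoidPattern("*?id=*")               ; skip within-site URLs matching this wildcard
$oSpider.AddAvoidOutboundLinkPattern("*google*") ; skip outbound links matching this wildcard
$oSpider.AddMustMatchPattern("*.html")           ; only follow within-site URLs matching a must-match pattern

While $oSpider.CrawlNext() = 1
    ConsoleWrite($oSpider.LastUrl & @CRLF)
WEnd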

Edited by Creator
  • 4 weeks later...
  • 5 years later...
Posted

Hello! I know this is an old thread, but is there a way to flag links that exist but are dead?

Or does the report in output.txt list only good links? If so, is there a way I can filter or search for dead links specifically?

Thanks!
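Not an answer from the component itself, but one possible workaround in plain AutoIt: re-check each URL from output.txt and flag the ones that no longer respond. InetGetSize() returns 0 and sets @error when a URL can't be fetched, so something like this could split out the dead links (the file names are illustrative):

Local $hIn = FileOpen("output.txt", 0)
If $hIn = -1 Then Exit
While 1
    Local $sUrl = FileReadLine($hIn)
    If @error Then ExitLoop   ; end of file
    InetGetSize($sUrl, 1)     ; option 1 = force a fresh request, no cache
    If @error Then FileWrite("dead_links.txt", $sUrl & @CRLF)
WEnd
FileClose($hIn)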
