Posted

Hello,

I am pulling information from Yellow Pages and seem to be having an issue.

I want to pull any websites that are not internal links or yellowpages.com links.

Here is my current code:

#include <IE.au3>
#include <Array.au3>
#include <File.au3>
#include <String.au3>
#include <INet.au3>
#include <Excel.au3>

$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us on finding the next URL.
$i = 1 ; This will keep track of how many listings we have pulled.

; _INetGetSource returns the page source as a string, not a download handle,
; so there is nothing to pass to InetClose afterwards.
$YellowPages = _INetGetSource($YellowPagesUrl) ; Pulls the data from the address.

$YellowPagesWebsite = _StringBetween($YellowPages, '<a href="', '"') ; Lists all links found on the Yellow Pages page.

_ArrayDisplay($YellowPagesWebsite)
Posted

Hmm, this is the second time this has happened: it didn't include what I put in my message after the code.

I want this to exclude anything that is a /ofiheif.html type of link or anything that is a yellowpages.com/ type of link.

How would I do this?

Thanks

Posted (edited)

Does this work for you?

#include <Array.au3>

$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us on finding the next URL.

$YellowPages = BinaryToString(InetRead($YellowPagesUrl)) ; Pulls the data from the address and converts it to a string.
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3) ; Flag 3 returns all matches as an array.
_ArrayDisplay($YellowPagesWebsite)

This matches only links starting with "http://" and excludes those on www.yellowpages.com.
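
To try the pattern without hitting the site, here is a minimal sketch that runs it against a small hard-coded snippet. The markup and the example.com address are made up for illustration; the real page source will look different. The negative lookahead `(?!www\.yellowpages\.com)` is what rejects the yellowpages.com link, and the internal links are skipped because they do not start with "http://".

```autoit
#include <Array.au3>

; Hypothetical sample markup standing in for the downloaded page source.
Local $sHtml = '<a href="/phoenix-az/mip/some-store">internal</a>' & @CRLF & _
        '<a href="http://www.yellowpages.com/about">yellowpages</a>' & @CRLF & _
        '<a href="#top">anchor</a>' & @CRLF & _
        '<a href="http://www.example.com/">external</a>'

; Flag 3 returns every match of the capture group as an array.
Local $aLinks = StringRegExp($sHtml, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3)

If @error Then
    ; StringRegExp sets @error when nothing matched.
    ConsoleWrite("No external links found" & @CRLF)
Else
    _ArrayDisplay($aLinks) ; Should list only the example.com link.
EndIf
```

Checking @error before using the result avoids passing a non-array to _ArrayDisplay when the page has no external links.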

 

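Or this:

$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="((?![^"]*yellowpages)[^/#][^"]+)', 3)

This one also works for links that are not in "http://" format: it skips anything starting with "/" or "#" and anything containing "yellowpages". The lookahead is limited to [^"]* so that it only inspects the link itself and not the rest of the line.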

Edited by jguinch
Posted (edited)

If you want a more manageable solution, you can also do it like this:

$YellowPages = StringReplace($YellowPages, 'href="http://www.yellowpages', "")
$YellowPages = StringReplace($YellowPages, 'href="http://ads', "")
; etc
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#][^"]+)', 3)
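
If the "; etc" list grows, one option is to keep the unwanted prefixes in an array and loop over them. This is only a sketch of that idea: the sample markup, the example.com and example.net addresses, and the variable names are assumptions for illustration, not part of the original script.

```autoit
#include <Array.au3>

; Hypothetical sample markup standing in for the downloaded page source.
Local $sHtml = '<a href="/phoenix-az/mip/some-store">internal</a>' & @CRLF & _
        '<a href="http://www.yellowpages.com/about">yellowpages</a>' & @CRLF & _
        '<a href="http://ads.example.net/banner">ad</a>' & @CRLF & _
        '<a href="http://www.example.com/">external</a>'

; Prefixes to strip; add one entry per unwanted host.
Local $aExclude[2] = ['href="http://www.yellowpages', 'href="http://ads']

; Removing the prefix breaks the '<a href="' pattern for those links,
; so the regular expression below no longer matches them.
For $sPrefix In $aExclude
    $sHtml = StringReplace($sHtml, $sPrefix, "")
Next

; [^/#] skips internal links that start with "/" or "#".
Local $aLinks = StringRegExp($sHtml, '<a href="([^/#][^"]+)', 3)
If Not @error Then _ArrayDisplay($aLinks) ; Should list only the example.com link.
```

Adding another host to exclude is then a one-line change to the array instead of another StringReplace call.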
Edited by mikell
