Medic873 Posted March 27, 2014

Hello, I am pulling information from Yellow Pages and am having an issue. I want to pull any websites that are not internal links or yellowpages.com links. Here is my current code:

    #include <IE.au3>
    #include <array.au3>
    #include <File.au3>
    #include <string.au3>
    #include <INet.au3>
    #include <Excel.au3>

    $YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us on finding the next URL.
    $i = 1 ; This will keep track of how many profiles we have pulled from linkedin
    $YellowPages = _INetGetSource($YellowPagesUrl) ; Pulls the data from the address
    InetClose($YellowPages) ; Closes the connection to linkedin
    $YellowPagesWebsite = _StringBetween($YellowPages, '<a href="', '"') ; List out all yellow pages links
    _ArrayDisplay($YellowPagesWebsite)
Medic873 (Author) Posted March 27, 2014

Hmm, this is the second time this has happened: it didn't include what I put in my message after the code. I want this to exclude anything that is a /ofiheif.html type of link or anything that is a yellowpages.com/ type of link. How would I do this? Thanks
jguinch Posted March 27, 2014 (edited)

Does this work for you?

    #include <array.au3>

    $YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us on finding the next URL.
    $YellowPages = BinaryToString(InetRead($YellowPagesUrl)) ; Pulls the data from the address
    $YellowPagesWebsite = StringRegExp($YellowPages, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3)
    _ArrayDisplay($YellowPagesWebsite)

This matches only links starting with "http://" and excludes yellowpages.com.

Or this, for links not in "http://" format:

    $YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#](?!.*yellowpages)[^"]+)', 3)
mikell Posted March 27, 2014 (edited)

If you want a more manageable solution, you can also do it like this:

    $YellowPages = StringReplace($YellowPages, 'href="http://www.yellowpages', "")
    $YellowPages = StringReplace($YellowPages, 'href="http://ads', "") ; etc
    $YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#][^"]+)', 3)
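For anyone who wants to try the filtering without hitting the live site, here is a minimal sketch that runs the same kind of pattern against a small hard-coded string. The sample markup, the example.com / example.org addresses, and the variable names $sHtml and $aLinks are made up for illustration and are not from the posts above. It is a variation on the patterns suggested in this thread, not a replacement for them: it accepts https as well as http, and the lookahead uses [^"]* instead of .* so it only inspects the current href value rather than the rest of the line.

```autoit
; Hypothetical sample markup standing in for the downloaded page source.
Local $sHtml = '<a href="/phoenix-az/mip/some-store">internal</a>' & @CRLF & _
        '<a href="http://www.yellowpages.com/about">yellowpages</a>' & @CRLF & _
        '<a href="#top">anchor</a>' & @CRLF & _
        '<a href="http://www.example.com/">external</a>' & @CRLF & _
        '<a href="https://shop.example.org/pets">external https</a>'

; Flag 3 returns an array of all captured groups.
; https?:// skips relative links and anchors; the negative lookahead
; rejects any href whose value contains "yellowpages.com".
Local $aLinks = StringRegExp($sHtml, '<a href="(https?://(?![^"]*yellowpages\.com)[^"]+)', 3)

If @error Then
    ; StringRegExp sets @error when nothing matched, so $aLinks is not an array here.
    ConsoleWrite("No external links found" & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i] & @CRLF)
    Next
EndIf
```

With this sample string, only the two example.com / example.org links should be printed. One trade-off to be aware of: because the lookahead checks the whole href value, an outside link that merely mentions yellowpages.com in its query string would also be dropped. Swap the ConsoleWrite loop for _ArrayDisplay($aLinks) (with #include <Array.au3>) if you prefer the grid view used earlier in the thread.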