Chance Posted October 20, 2012

I have a huge list of proxies with ports in IP:PORT format. I need to extract all the proxies that use a port in the range 80-8081, including common ports like 3128 and 8080, and ignore all the rest. The regular expression I'm using is "(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,4})". How can I do this using a single capturing regexp statement?
Chance Posted October 20, 2012 (edited)

OK, so I think I know what needs to be done.

((?:\d{1,3}\.){3}\d{1,3}:(?:8080|3128|80|8081))

I just read through the help file; apparently it's just that simple...

Edited October 20, 2012 by FlutterShy
Chance Posted October 25, 2012

Well, I read up some more, and this is what I've got (taken from the net):

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements]))

Apparently, this regexp validates the IP address to make sure that it's not in one of those funny ranges. I had to do this because with these IPs you just don't know what is going to get thrown at you.

Still, I do have a problem. I have a huge list of IP:PORT addresses and I want a regexp that only picks up valid addresses, but I'm not smart enough to write it. ( ._.)

1.237.43.340:80
421.535.123.123:8080
53.12.2.2:8080
14.55.01.255:443
164.77.82.21:80
202.149.78.234:8080
60.2.227.123:3128

From the above examples, the regexp will pick only 5, I think, one of which should not be picked up: "14.55.01.255:443", specifically because of the 01 octet. If I remember correctly, octets shouldn't begin with a 0. I'm not smart enough to develop a regexp that filters out fake IP addresses like those. ;_;

Could anyone be so kind as to help me out?
Beege Posted October 25, 2012 (edited)

That's kind of tricky. Zero can be valid, and so can one; that's why the function passes it. (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) just verifies that it's a value from 0 to 255. If you have a bunch like that, I would add a second check for it, so something like:

(?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not stringregexp($ip, '\.0\d?')

Edited October 25, 2012 by Beege
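Beege's two-step idea (a permissive 0-255 match followed by a second check that rejects leading-zero octets) can be tried out quickly. The following is only a minimal sketch in Python, whose `re` module stands in for AutoIt's StringRegExp. The names `PRIMARY`, `LEADING_ZERO` and `filter_proxies` are hypothetical, the port placeholder is replaced by a generic 1 to 5 digit port, and the second check is written as "a 0 followed by another digit at the start of an octet", which is an assumption about what was intended.

```python
import re

# Assumption: the 0-255 octet check quoted in the thread, with a generic
# 1-5 digit port standing in for the "[some port|statements]" placeholder.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
PRIMARY = re.compile(r"\b(?:%s\.){3}%s:\d{1,5}\b" % (OCTET, OCTET))

# Second check: an octet that starts with 0 and has at least one more digit.
LEADING_ZERO = re.compile(r"(?:^|\.)0\d")


def filter_proxies(text):
    """Return IP:PORT candidates whose octets have no leading zero."""
    candidates = PRIMARY.findall(text)
    return [c for c in candidates if not LEADING_ZERO.search(c)]


sample = (
    "1.237.43.340:80 421.535.123.123:8080 53.12.2.2:8080 "
    "14.55.01.255:443 164.77.82.21:80 202.149.78.234:8080 60.2.227.123:3128"
)
print(filter_proxies(sample))
```

On the sample list from the previous post this keeps four entries and drops "14.55.01.255:443". Because the second check runs on each match rather than on one IP at a time, it still works when thousands of lines are scanned in one pass.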
Chance Posted October 25, 2012 (edited)

Beege wrote:
    If you have a bunch like that, I would add a second check for it, so something like:
    (?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not stringregexp($ip, '\.0\d?')

I'm sorry, I think that's exactly what I just posted... ( ._.) Unless I'm missing something. It's still picking up exactly what the other RegExp was picking up. I'm sorry, I just don't know much regexp.

Edited October 25, 2012 by FlutterShy
Robjong Posted October 25, 2012 (edited)

Hi, give this a try. It does allow .0 or .1 but not .01, and the port can be between 80 and 8081.

'\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[0-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'

Edited October 29, 2012 by Robjong
BrewManNH Posted October 25, 2012

If you're looking for a way to validate an IP address, try this snippet that I came up with; it will tell you whether an IPv4 address is valid or not. Validating the port number after that is simple.
Chance Posted October 25, 2012

BrewManNH wrote:
    If you're looking for a way to validate an IP address, try this snippet that I came up with; it will tell you whether an IPv4 address is valid or not. Validating the port number after that is simple.

Thanks, I'll find some use for this.

Beege wrote:
    That's kind of tricky. Zero can be valid, and so can one; that's why the function passes it. (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) just verifies that it's a value from 0 to 255. If you have a bunch like that, I would add a second check for it, so something like:
    (?i)((?:(?:25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(?:[some port|statements])) and not stringregexp($ip, '\.0\d?')

Sorry, I didn't see exactly what you were doing the first time I read it. As it turns out, this does work; it's just that I need to keep everything within the regular expression, because I can't throw a single IP at it at a time. This is meant to go through thousands at once.

Robjong wrote:
    Hi, give this a try. It does allow .0 or .1 but not .01, and the port can be between 80 and 8081.
    '\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'

YES! This got me on track. It keeps those funny addresses out flawlessly as far as I can tell so far. I modified it a bit and threw a huge list of IPs I know are valid at it, and it picked them all up. Then I mixed in some tricky ones with zeros and odd ranges, and it filtered those out perfectly.

(?i)((?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0):(?:8[0-9]{3}|8[0-9]|3128|28134|54321|45612|443))

The reason I add "(?i)" is that I'm assuming it helps the RegExp operation go a little faster, since it's not checking for case sensitivity.
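One way to see what the extra `1[0-9][0-9]` alternative changes is to test each octet sub-pattern against every value from 0 to 255. The sketch below does this in Python as a stand-in for AutoIt's StringRegExp; the names `ROBJONG_OCTET`, `CHANCE_OCTET`, `missing_values` and `accepts` are hypothetical, and the two patterns are the octet alternations as quoted in this post with nothing else added.

```python
import re

# Octet alternation from Robjong's quoted pattern, and Chance's modified one.
ROBJONG_OCTET = r"25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0"
CHANCE_OCTET = r"25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0"


def missing_values(octet_pattern):
    """List the values 0..255 that the octet pattern fails to match in full."""
    rx = re.compile(r"(?:%s)" % octet_pattern)
    return [n for n in range(256) if not rx.fullmatch(str(n))]


def accepts(octet_pattern, text):
    """True if the whole text is matched by the octet pattern."""
    return re.fullmatch(r"(?:%s)" % octet_pattern, text) is not None


print(missing_values(ROBJONG_OCTET))  # the values 100 through 109
print(missing_values(CHANCE_OCTET))   # []
print(accepts(CHANCE_OCTET, "01"))    # False: leading zero is rejected
```

Under these assumptions the quoted alternation skips 100 to 109, because `1?[1-9][0-9]?` needs a non-zero second digit after a leading 1, and the added `1[0-9][0-9]` branch closes that gap while still rejecting forms such as 01 or 256.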
Robjong Posted October 25, 2012 (edited)

Chance wrote:
    The reason I add "(?i)" is that I'm assuming it helps the RegExp operation go a little faster, since it's not checking for case sensitivity.

That assumption is wrong; the pattern contains no characters that have a case difference.

If you go with your pattern, you should add a word boundary "\b" to the end of it, otherwise it might still match some ports you do not want. For example, your pattern matches any 8xxx and 8x port, but it would now also match the first two digits of 8xx numbers, resulting in non-existent or non-working ip:port combinations.

Edit: removed unintentional smiley.

Edited October 25, 2012 by Robjong
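The effect of the trailing boundary is easy to reproduce. This is a minimal Python sketch (standing in for AutoIt's StringRegExp) that uses the port alternation from Chance's pattern; the IP part is deliberately simplified to `\d{1,3}` octets, which is an assumption made for brevity, and the names `without_b` and `with_b` are hypothetical.

```python
import re

# Port alternation taken from Chance's pattern.
PORTS = r"(?:8[0-9]{3}|8[0-9]|3128|28134|54321|45612|443)"
# Simplified IP part (assumption: range checks left out to keep this short).
IP = r"(?:\d{1,3}\.){3}\d{1,3}"

without_b = re.compile(IP + ":" + PORTS)
with_b = re.compile(IP + ":" + PORTS + r"\b")

line = "10.1.2.3:812"
print(without_b.findall(line))  # ['10.1.2.3:81'] - port 812 is cut down to 81
print(with_b.findall(line))     # [] - the truncated match is refused

print(with_b.findall("10.1.2.3:8080"))  # ['10.1.2.3:8080']
```

Without `\b` the engine is satisfied as soon as `8[0-9]` has consumed "81", so the result is an address and port that never appeared in the list.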
Bowmore Posted October 25, 2012 (edited)

Chance wrote:
    The reason I add "(?i)" is that I'm assuming it helps the RegExp operation go a little faster, since it's not checking for case sensitivity.

I've not done any tests on this, but I would expect a case-insensitive RegEx to be slightly slower, if there is any difference. Somewhere in the RegEx engine's code it will have to do a comparison something like this:

If char = A or char = a Then

rather than just

If char = A Then

Edited October 25, 2012 by Bowmore
Beege Posted October 26, 2012

Robjong wrote:
    Hi, give this a try. It does allow .0 or .1 but not .01, and the port can be between 80 and 8081.
    '\b(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0)\.){3}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|0):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b'

Ha! I knew you'd be able to nail this one!
Chance Posted October 26, 2012

Robjong wrote:
    If you go with your pattern, you should add a word boundary "\b" to the end of it, otherwise it might still match some ports you do not want. For example, your pattern matches any 8xxx and 8x port, but it would now also match the first two digits of 8xx numbers, resulting in non-existent or non-working ip:port combinations.

I've read over the help file and I have yet to understand what this "word boundary" thing does, so I performed various tests and found that it reduces the number of legit IP:PORT addresses the pattern picks up, while not using it allows the RegExp pattern to work flawlessly. The IPs in my tests were separated only by line-break characters (CR+LF), and there were about two hundred of them.

Bowmore wrote:
    I've not done any tests on this, but I would expect a case-insensitive RegEx to be slightly slower, if there is any difference. Somewhere in the RegEx engine's code it will have to do a comparison something like this: If char = A or char = a Then, rather than just If char = A Then

Robjong wrote:
    That assumption is wrong; the pattern contains no characters that have a case difference.

OK, so I performed 4 tests using a file about 1,500,000 lines long, with an IP:PORT on each line, and the results weren't what I was expecting...

;Case Insensitive = 64181.3263976287, 68533.5281439401
;Case Sensitive = 63767.234662506, 63963.4948017136
Robjong Posted October 26, 2012 (edited)

Beege wrote:
    Ha! I knew you'd be able to nail this one!

Haha, I was waiting for you to do it. (Hi, btw.)

Chance wrote:
    I've read over the help file and I have yet to understand what this "word boundary" thing does, so I performed various tests and found that it reduces the number of legit IP:PORT addresses the pattern picks up, while not using it allows the RegExp pattern to work flawlessly. The IPs in my tests were separated only by line-break characters (CR+LF), and there were about two hundred of them.

It is really quite easy. As its name suggests, the word boundary is related to the word character class ( \w ). It matches the boundary of a word but not an actual character (it is a zero-width assertion); in other words, it matches between a word character ( A-Z a-z 0-9 _ ) and a non-word character.

For example, if you were to match groups of 3 digits, you might write a pattern like this:

    #include <Array.au3>
    $aMatches = StringRegExp("123 456 7890", "\d{3}", 3) ; matches 0:123, 1:456, 2:789
    _ArrayDisplay($aMatches)

This matches "123", "456" and "789". Now you can see the problem: the "789" was not originally a group of 3 digits. Now let's try it with boundaries:

    #include <Array.au3>
    $aMatches = StringRegExp("123 456 7890", "\b\d{3}\b", 3) ; matches 0:123, 1:456
    _ArrayDisplay($aMatches)

I hope this clears it up a bit.

Chance wrote:
    OK, so I performed 4 tests using a file about 1,500,000 lines long, with an IP:PORT on each line, and the results weren't what I was expecting...

I'm betting you did not start the test with an SRE call that was left out of the timings, to start up the engine? (The first SRE call is significantly slower.)

Edit: tidy.

Edited October 26, 2012 by Robjong
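The warm-up point applies to any regex timing test, not only AutoIt's. As a rough illustration, the Python sketch below times the same pattern with and without the case-insensitive flag, running one untimed call first and keeping the best of a few repeats. Everything here is an assumption made for the sketch: the simplified pattern, the synthetic 20,000-line data set, and the helper name `time_findall`. The timings it prints depend on the machine and say nothing certain about AutoIt's engine.

```python
import re
import time

# Simplified IP:PORT pattern (assumption: no range checks, timing only).
PATTERN = r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{2,5}\b"

# Synthetic data: 20,000 CR+LF separated IP:PORT lines.
data = "\r\n".join(
    "10.%d.%d.1:8080" % (i % 256, i // 256 % 256) for i in range(20000)
)


def time_findall(flags, repeats=3):
    """Return (match count, best elapsed seconds) for PATTERN with flags."""
    rx = re.compile(PATTERN, flags)
    rx.findall(data)  # warm-up call, deliberately not timed
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = len(rx.findall(data))
        best = min(best, time.perf_counter() - start)
    return count, best


n_cs, t_cs = time_findall(0)
n_ci, t_ci = time_findall(re.IGNORECASE)
print("case sensitive:   %d matches, %.4f s" % (n_cs, t_cs))
print("case insensitive: %d matches, %.4f s" % (n_ci, t_ci))
```

Taking the best of several runs after a warm-up reduces the influence of one-off start-up costs, which is the effect Robjong suspects in the figures above.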
Chance Posted October 29, 2012 (edited)

It was still catching fake addresses because of the 0 allowed in the group that repeats 3 times, so obviously the pattern has to be even bigger and more monstrous to work correctly.

((?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9])\.(?:(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0)\.){2}(?:25[0-5]|2[0-4][0-9]|1?[1-9][0-9]?|1[0-9][0-9]|0):(?:8[0-9]{3}|3128|28134|54321|45612|443|8[0-9]))

Edit: OK, so after reading the comments below, it's become obvious that addresses with a leading 0 are not fake. The problem is that I'd rather ignore them, because some people like to keep lists with every octet padded to 3 characters. I've tested over 200,000 and found that these rarely ever work, so I find it better to skip them entirely and not waste any time.

Edited October 29, 2012 by FlutterShy
BrewManNH Posted October 29, 2012

I've never found a 100% reliable regex that will validate every possible IP address without errors. There was a thread on (I think) codeproject where someone tried to solicit the best regex to do it, and after about 50 tries they never came up with a bulletproof way to do it in one line. There are just far too many variables, exceptions and allowances, from what I saw. That is why I created my IP address validator function. It may not be lightning fast, but at least it works.
Robjong Posted October 29, 2012 (edited)

BrewManNH wrote:
    I've never found a 100% reliable regex that will validate every possible IP address without errors. There was a thread on (I think) codeproject where someone tried to solicit the best regex to do it, and after about 50 tries they never came up with a bulletproof way to do it in one line. There are just far too many variables, exceptions and allowances, from what I saw. That is why I created my IP address validator function. It may not be lightning fast, but at least it works.

I wrote this SRE version based on the same rules your snippet enforces; as far as I can tell it works.

    ;===============================================================================
    ; Description.......: Check if a given IP address is a valid IPv4 address.
    ; Parameter(s)......: $sIP - The IPv4 address to validate.
    ; Requirement.......:
    ; Return Value(s)...: Success - 1
    ;                     Failure - 0, and sets @error to 1
    ; Author(s).........: Robjong (SRE version of _ValidIP by BrewManNH : http://www.autoitscript.com/wiki/Snippets_%28_Internet_%29#ValidIP.28.29_.7E_Author_-_BrewManNH)
    ; Remarks ..........: This will accept an IP address that is 4 octets long, and contains only numbers and falls within
    ;                     valid IP address values. Class A networks can't start with 0 or 127. 169.xx.xx.xx is reserved and is
    ;                     invalid and any address that starts above 239, ex. 240.xx.xx.xx is reserved. The address range
    ;                     224-239 is reserved as well for Multicast groups but can be a valid IP address range if you're using
    ;                     it as such. Any IP address ending in 0 or 255 is also invalid for an IP.
    ;===============================================================================
    Func _IsValidIPv4($sIP)
        Local $fRes = StringRegExp($sIP, "\A(?!(127|169|0{1,3})\.)(2[0-3]\d|[01]?\d\d?)(\.(25[0-5]|2[0-4]\d|[01]?\d\d?)){2}\.(25[0-4]|2[0-4]\d|1\d\d|0?[1-9]\d?|0{0,2}[1-9])\z")
        Return SetError(Not $fRes, 0, $fRes)
    EndFunc   ;==>_IsValidIPv4

I also noticed that your version allows addresses like 01.02.03.04; that should not be allowed, should it? (It should, see the next post.)

Another thing I was curious about was this line:

    $dString &= StringRight(Hex($aArray[$I]), 2)

Is there a reason you are not using it like this?

    Hex($aArray[$I], 2)

To get back on topic, this should help with parsing the proxy list:

    Func _ParseProxyList($sString)
        ; Flag 3 returns an array of all matches found in $sString.
        Return StringRegExp($sString, "\b(?!(?:127|169|0{1,3})\.)(?:2[0-3]\d|1\d\d|0?\d\d?)(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){2}\.(?:25[0-4]|2[0-4]\d|[01]?[1-9]\d?|0{0,2}[1-9]):(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b", 3)
    EndFunc   ;==>_ParseProxyList

Edit 1: credits + cleaning
Edit 2: made groups non-capturing
Edit 3: updated source to allow for leading zero (see next posts)
Edit 4: cleaned patterns up a bit

Edited October 30, 2012 by Robjong
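For readers who want to try the list-parsing pattern outside AutoIt, here is a minimal Python sketch that applies the same pattern text with `re.findall`. It is a port under the assumption that Python's `re` treats this pattern the way AutoIt's PCRE-based StringRegExp does; the function name `parse_proxy_list` is hypothetical, and the sample is the list posted earlier in the thread.

```python
import re

# Same pattern text as in _ParseProxyList, compiled with Python's re module.
PROXY_RE = re.compile(
    r"\b(?!(?:127|169|0{1,3})\.)"                      # no 127., 169. or 0. first octet
    r"(?:2[0-3]\d|1\d\d|0?\d\d?)"                      # first octet, up to 239
    r"(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d\d?)){2}"        # second and third octets
    r"\.(?:25[0-4]|2[0-4]\d|[01]?[1-9]\d?|0{0,2}[1-9])"  # last octet, as written in the thread
    r":(?:[8-9]\d|[1-9]\d\d|[1-7]\d{3}|80(?:[0-7]\d|8[01]))\b"  # port 80-8081
)


def parse_proxy_list(text):
    """Return every IP:PORT in text that the pattern accepts."""
    return PROXY_RE.findall(text)


sample = (
    "1.237.43.340:80 421.535.123.123:8080 53.12.2.2:8080 "
    "14.55.01.255:443 164.77.82.21:80 202.149.78.234:8080 60.2.227.123:3128"
)
print(parse_proxy_list(sample))
```

On this sample the sketch returns four entries. Note that "14.55.01.255:443" is dropped here because the last octet is 255, not because of the leading zero in 01, which this version of the pattern allows.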
BrewManNH Posted October 29, 2012

A class A IP address range goes from 1 to 127, so 01.02.03.04 is a valid IP address. As to the Hex statement, either is equally valid; yours is probably the better choice, with fewer commands to parse.

As to your RegEx, it fails if any of the octets start with a zero, yet that's a perfectly valid IP address, as the leading zero is ignored when setting an IP address or when you try to ping one.
Robjong Posted October 29, 2012

OK, thanks for the response. The pattern fails for leading zeros because, as my first question might have given away, I was under the impression that they were invalid. I should have known better. I will update the script in my previous post after I have some dinner.
Robjong Posted October 29, 2012

I have updated the functions in my previous post; they now allow leading zeros.
Chance Posted November 25, 2012 (edited)

OK, the point of my thread is that the proxies most likely to be valid and working are accepted. A small minority of the proxies this skips might work, but most don't. I've tested a lot, and I mean A LOT! This is the best I could come up with:

((?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d)\.(?:(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0)\.){2}(?:25[0-5]|2[0-4]\d|1?[1-9]\d?|1\d\d|0):(?:312[8-9]|28134|54321|45612|443|1\d{2,3}|9\d{3}|8\d{1,3}))

The trick is the port. I don't know why, but these ports seem to work more often than others.

Now, I'm not saying that anyone above me is wrong, and to be honest I'm not too experienced with this stuff, but this has yielded the best results so far, possibly because of the port filtering part.

Edited November 25, 2012 by FlutterShy
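Since the port part is described as the trick, it can help to list exactly which ports that alternation accepts. The Python sketch below (again a stand-in for AutoIt's StringRegExp; the names `PORT` and `accepted` are hypothetical) checks every port number from 1 to 65535 against the port alternation alone, requiring a full match.

```python
import re

# Port alternation from the final pattern, tested on its own.
PORT = re.compile(
    r"(?:312[8-9]|28134|54321|45612|443|1\d{2,3}|9\d{3}|8\d{1,3})"
)

# Every port number whose decimal form is matched in full.
accepted = [p for p in range(1, 65536) if PORT.fullmatch(str(p))]

print(len(accepted))       # 3216
print(80 in accepted)      # True
print(8080 in accepted)    # True
print(90 in accepted)      # False
print(2000 in accepted)    # False
```

Under a full-match reading, this set is no longer the 80-8081 range from the first post: ports such as 90 or 2000 are excluded, while 8082 to 8999 and 9000 to 9999 are included. Without a trailing `\b`, the pattern as posted can additionally match just the leading digits of a longer port.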