youtuber (Posted September 22, 2018, edited)

I want to strip everything to the left and right of the URLs in these strings. I have tried a few patterns, but none of them worked.

    Local $aURLs[6] = ["-_ http://autoitscript.com---", _
            " https://www.autoitscript.com => ", _
            "1-http://www.autoitscript.com", _
            "- www.autoitscript.com -", _
            "- www.autoitscript.org - _", _
            "-#$%& www.autoitscript.net -"]

    For $i = 0 To 5
        $RegExp = StringRegExpReplace($aURLs[$i], 'How should the pattern be?','')
        $RegExp = StringRegExpReplace($aURLs[$i], '','How should the pattern be?')
        ConsoleWrite($RegExp & @CRLF)
    Next

I did a sample test, but it failed:

    $pattern = '(.com|\.net|\.org)(.*)'
    $pattern2 = '(*.)(http://|https://|www.)'

    $aURLs = "-_ http://autoitscript.com---" & @CRLF & _
            " https://www.autoitscript.com => " & @CRLF & _
            "1-http://www.autoitscript.com" & @CRLF & _
            "- www.autoitscript.com -" & @CRLF & _
            "- www.autoitscript.org - _" & @CRLF & _
            "-#$%& www.autoitscript.net -"

    $RegExp = StringRegExpReplace($aURLs, $pattern,'$1')
    $RegExp = StringRegExpReplace($aURLs, $pattern2,'$1')
    ConsoleWrite($RegExp & @CRLF)

Edited September 22, 2018 by youtuber
TheXman (Posted September 22, 2018, edited)

Ordinarily I would ask to see some of your attempts to help you understand where you were having an issue. That's because I prefer to help you learn rather than to just give you solutions. But I'm bored at the moment. Here are just a couple of ways to do it.

    example1()
    example2()

    Func example1()
        Local $aURLs[6] = [ _
            "-_ http://autoitscript.com---", _
            " https://www.autoitscript.com => ", _
            "1-http://www.autoitscript.com", _
            "- www.autoitscript.com -", _
            "- www.autoitscript.org - _", _
            "-#$%& www.autoitscript.net -" _
            ]

        ConsoleWrite("Example1" & @CRLF)
        For $i = 0 To 5
            $RegExp = StringRegExpReplace($aURLs[$i], ".*?((?:https?://|www).*?[.](?:com|net|org)).*","\1")
            ConsoleWrite($RegExp & @CRLF)
        Next
    EndFunc

    Func example2()
        Local $aResult[0]
        Local $aURLs[6] = [ _
            "-_ http://autoitscript.com---", _
            " https://www.autoitscript.com => ", _
            "1-http://www.autoitscript.com", _
            "- www.autoitscript.com -", _
            "- www.autoitscript.org - _", _
            "-#$%& www.autoitscript.net -" _
            ]

        ConsoleWrite("Example2" & @CRLF)
        For $i = 0 To 5
            $aResult = StringRegExp($aURLs[$i], "(?:https?://|www).*?[.](?:com|net|org)", 1)
            If IsArray($aResult) Then ConsoleWrite($aResult[0] & @CRLF)
        Next
    EndFunc

Edited September 22, 2018 by TheXman
youtuber (Posted September 22, 2018)

Well, what will happen if a URL is more specific?

    Local $aURLs[6] = [ _
        "-_ http://www.international.in---", _
        "- https://www.communications.com => ", _
        "1-http://www.networksupport.net", _
        "--- www.organizasion.org -", _
        "- www.information.info - _", _
        "-#$%& www.autoitscript.com -" _
        ]
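These samples appear to be chosen so that each host name starts with letters that also spell a TLD ("communications" starts with "com", "networksupport" with "net", and so on). A minimal sketch of what that does to the lazy `.*?[.](?:com|net|org)` part of the example1 pattern follows. The variable names are illustrative; the pattern and sample strings are taken from the posts above.

```autoit
; Sketch only: runs the example1 pattern against three of the new samples.
Local $aSamples[3] = [ _
        "- https://www.communications.com => ", _
        "1-http://www.networksupport.net", _
        "--- www.organizasion.org -"]

Local $sPattern = ".*?((?:https?://|www).*?[.](?:com|net|org)).*"

For $i = 0 To UBound($aSamples) - 1
    ; The lazy .*? stops at the FIRST ".com", ".net" or ".org" it can find,
    ; which here is the beginning of the host name, not the real TLD.
    ConsoleWrite(StringRegExpReplace($aSamples[$i], $sPattern, "\1") & @CRLF)
Next

; Console output:
;   https://www.com
;   http://www.net
;   www.org
```

Strings with a TLD outside the `com|net|org` list (such as ".in" or ".info") do not match at all, so `StringRegExpReplace` returns them unchanged.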
TheXman (Posted September 22, 2018)

I'm not going to play that game with you. I pointed you in the right direction. Now it is time for you to put in a little effort. If you encounter an obstacle, then come back with your attempt(s), clearly state what your issue is, provide your code or a workable example, and someone will probably help you.
iamtheky (Posted September 22, 2018, edited)

More fun with pipes, but I can make up roughly 400 ways to make it fail. I fear the problem is not yet clearly defined. Also, you should totally be showing us what you've tried.

    Local $aURLs[6] = [ _
        "-_ http://www.international.in---", _
        "- https://www.communications.com => ", _
        "1-http://www.networksupport.net", _
        "--- www.organizasion.org -", _
        "- www.information.info - _", _
        "-#$%& www.autoitscript.com -" _
        ]

    for $i = 0 to ubound($aURLs) - 1
        msgbox(0, '' , stringregexp($aURLs[$i] ,"(h.*?www\..*?\.\w\w+|www\..*?\.\w\w+)" , 3)[0])
    next

Edited September 22, 2018 by iamtheky
youtuber (Posted September 22, 2018, edited)

I'm not trying anything but I'm looking for the best way to extract url addresses in a complex html or txt file. I know the simple patterns won't meet my needs once I go deeper.

Do you think this is the best way?

    (h.*?www\..*?\.\w\w+|www\..*?\.\w\w+)

I wonder if @mikell has an idea for us.

@iamtheky, it really is not my fault that URLs can look like this:

    Local $aURLs[8] = [ _
        "-_ http://www.international.in.us---", _
        "- https://www.communications.com.fr => ", _
        "1-http://www.networksupport.net.us", _
        "--- www.organizasion.org -", _
        "- www.information.info - _", _
        "-#$%& www.autoitscript.com -", _
        "- https://www.autoit-script.com.fr/ -" _
        ]

Edited September 22, 2018 by youtuber
mikell (Posted September 22, 2018)

Aren't the previous answers correct? I personally think they are.

You can always make a regex fail, which is why I totally agree with what iamtheky said.

BTW, this one

    $RegExp = StringRegExpReplace($aURLs[$i], '.*?((?:https?://|www)[.\w+]+).*', "$1")

is nothing but a mix of the previous ones. It works ... and obviously may fail, depending on the addresses, the context, etc.
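Both halves of that remark can be seen in a small sketch. The sample strings are taken from the earlier posts and the variable names are illustrative. The key detail is that `[.\w+]+` is a character class: it accepts dots, word characters, and plus signs, and stops at anything else, including a hyphen.

```autoit
; Sketch only: the mixed pattern applied to three strings from the thread.
Local $aSamples[3] = [ _
        "-_ http://www.international.in.us---", _
        "--- www.organizasion.org -", _
        "- https://www.autoit-script.com.fr/ -"]

Local $sPattern = '.*?((?:https?://|www)[.\w+]+).*'

For $i = 0 To UBound($aSamples) - 1
    ConsoleWrite(StringRegExpReplace($aSamples[$i], $sPattern, "$1") & @CRLF)
Next

; Console output:
;   http://www.international.in.us
;   www.organizasion.org
;   https://www.autoit      <- cut off at the hyphen, which the class does not allow
```

The first two samples come out clean because nothing is tied to a fixed TLD list. The third shows the kind of context-dependent failure mikell mentions.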
Deye (Posted September 23, 2018, edited)

This looks like one clean way of doing it:

    #include <Array.au3> ; for _ArrayDisplay; $aURLs is the array from the posts above

    For $i = 0 To UBound($aURLs) - 1
        $aURLs[$i] = StringTrimLeft($aURLs[$i], _by($aURLs[$i]))
        $aURLs[$i] = StringTrimRight($aURLs[$i], _by(StringReverse($aURLs[$i])))
    Next

    _ArrayDisplay($aURLs)

    Func _by($sValue)
        Local $aRet = StringRegExp($sValue, '(^.?[\W_]+)\w()', 3)
        If Not @error Then Return StringLen($aRet[0])
    EndFunc

Edit: a small fix for an extra space introduced when running TheXman's (next post) array example.

Edited September 23, 2018 by Deye
Jury (Posted September 23, 2018, edited)

mistake

Edited September 23, 2018 by Jury. Reason: mistake
TheXman (Posted September 23, 2018, edited)

22 hours ago, youtuber said: I'm not trying anything but I'm looking for the best way to extract url addresses in a complex html or txt file.

Here is one more example that uses an RFC 3986 compliant character set. This regular expression will not handle URLs that do not start with "http://", "https://", or "www.". For example, it will not find "autoitscript.com" but it will find "https://autoitscript.com".

Like others have said, you have not adequately defined what, EXACTLY, you are looking for. Without a specific, all-encompassing definition, all of our suggestions may miss certain cases. All of the suggestions are based on the data that you provided. Maybe you can help us help you by providing one of your "complex" html or text files so that we can see what you are working with. This back-and-forth, what-if way of getting to your solution is a waste of time and effort.

    #include <Constants.au3>
    #include <Array.au3>

    example()

    Func example()
        Local $aResult[0]
        Local $aURLs = [ _
            "-_ http://autoitscript.com---" , " https://www.autoitscript.com => ", _
            "1-http://www.autoitscript.com" , "- https://autoitscript.com -", _
            "- www.autoitscript.com -" , "- www.autoitscript.org - _", _
            "- www.autoitscript.org... - _" , "-#$%& www.autoitscript.net -", _
            "-_ http://www.international.in---" , "- https://www.communications.com => ", _
            "1-http://www.networksupport.net" , "--- www.organizasion.org -", _
            "- www.information.info - _" , "-#$%& www.autoitscript.com -", _
            "-_ http://www.international.in.us---", "- https://www.communications.com.fr => ", _
            "1-http://www.networksupport.net.us" , "--- www.organizasion.org -", _
            "- www.information.info/test.html - _", _
            "-#$%& www.autoitscript.com/this&20is%20a%20test.html -", _
            "-#$%& https://www.autoitscript.com/this&20is%20a%20test.html -", _
            "- https://www.autoit-script.com.fr/ -" _
            ]

        ;Parse URLs using RFC 3986 Compliant Character Set
        $aResult = StringRegExp( _
            _ArrayToString($aURLs, @CRLF), _
            "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
            $STR_REGEXPARRAYGLOBALMATCH)

        If IsArray($aResult) Then _ArrayDisplay($aResult)
    EndFunc

Edited September 23, 2018 by TheXman. Reason: removed "|" from the RFC 3986 character set in the regular expression.
youtuber (Posted September 23, 2018)

@TheXman It's really great, thank you, this is a very good pattern. But what is the difference between these two, and which should I use?

    (?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
    (?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]
TheXman (Posted September 23, 2018, edited)

29 minutes ago, youtuber said: It's really great, thank you, this is a very good pattern. But what is the difference between these two, and which should I use?

You're welcome. You should use the most current version. As stated in the edit comment on my post, I removed the "|" symbol from the set of characters that are valid in a URL. It was added in error. I corrected it so that the character set matches the spec as closely as possible. If you look at my previous post, you will also see that I created a hyperlink to the RFC.

From my previous post:

Edited September 23, 2018 by TheXman
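The practical effect of dropping "|" can be sketched with a made-up input in which two URLs are separated only by a pipe. The sample text and variable names below are assumptions for illustration; the two patterns are the ones quoted above.

```autoit
#include <StringConstants.au3>

; Made-up sample: two URLs with only a "|" between them.
Local $sText = "see www.autoitscript.com|www.autoitscript.org for details"

Local $sWithPipe    = "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]"
Local $sWithoutPipe = "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]"

Local $aOld = StringRegExp($sText, $sWithPipe, $STR_REGEXPARRAYGLOBALMATCH)
Local $aNew = StringRegExp($sText, $sWithoutPipe, $STR_REGEXPARRAYGLOBALMATCH)

; With "|" in the class, the pipe is swallowed and both URLs fuse into one match.
ConsoleWrite("with |   : " & UBound($aOld) & " match(es)" & @CRLF)
; Without it, the match ends at the pipe and the second URL is found separately.
ConsoleWrite("without |: " & UBound($aNew) & " match(es)" & @CRLF)

; Console output:
;   with |   : 1 match(es)
;   without |: 2 match(es)
```

Inside a character class "|" is a literal character, not alternation, which is why leaving it in widens what counts as part of a URL.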
youtuber (Posted September 23, 2018)

Understood, thanks. I prepared a similar pattern for it, but I guess I failed:

    (?i)\b(?:https?:\/\/|www\.)((?:[a-zA-Z\x{00a1}-\x{ffff}0-9.\-])+(?:\.[a-zA-Z]{2,63}))
Deye (Posted September 23, 2018, edited)

As already suggested, there will always be some way to make anything fail.

Add an extra "&" to the end and it flunks

I believe the stream cannot be handled so cleanly when it is processed in only one direction, so the idea of treating the two ends separately might still be a better way. Still, my example needed some extra proofing:

    #include <File.au3>

    Local $aURLs = [ _
        "&##$%&http://www.networksupport.net.us&##$%& - _", _
        " =https://www.autoitscript.com/forum/topic/195819-clean-up-both-right-and-left-of-url/?tab=comments#comment-1403743&##$%&", " https://www.autoitscript.com => ", _
        "- www.information.info/test.html - _", _
        "-#$%& www.autoitscript.com/this&20is%20a%20test.html -", _
        "- https://www.autoit-script.com.fr&##$%&##$%&##$##$%&" _
        ]

    For $i = 0 To UBound($aURLs) - 1
        $aURLs[$i] = StringTrimLeft($aURLs[$i], _by($aURLs[$i]))
        $aURLs[$i] = StringTrimRight($aURLs[$i], _by(StringReverse($aURLs[$i])))
    Next

    _ArrayDisplay($aURLs)

    Func _by($sValue)
        Local $aRet = StringRegExp($sValue, '(^.?\d?[\W_]+)\w()', 3)
        If Not @error Then Return StringLen($aRet[0])
    EndFunc

Edited September 23, 2018 by Deye
TheXman (Posted September 23, 2018)

41 minutes ago, Deye said: Add an extra "&" to the end and it flunks

Yes, as previously stated, there are many edge cases that would break the regular expression that I provided.

On 9/22/2018 at 3:57 PM, youtuber said: I'm looking for the best way to extract url addresses in a complex html or txt file.

The original poster said that he was trying to parse URLs from "complex" html or text files. I'm not sure what a "complex" html or text file is, but your solution appears to rely on the input being an array of pre-parsed data. That means that your solution is not viable at all, if run against a file, without additional parsing. My last example assumes that it would be run against a file, not an array. It also successfully parsed out all of the examples that had been supplied up to the point at which I suggested it.
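Running the pattern against a file rather than an array could look roughly like the sketch below. The file name `sample.html` and the variable names are hypothetical; the pattern is the corrected one from the earlier post. `FileRead` and `StringRegExp` both report failure through `@error`, which the sketch checks.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Hypothetical input file next to the script; replace with a real path.
Local $sFile = @ScriptDir & "\sample.html"

; Read the whole file into one string.
Local $sData = FileRead($sFile)
If @error Then
    ConsoleWrite("Could not read " & $sFile & @CRLF)
    Exit 1
EndIf

; Collect every match in the text as a flat array.
Local $aFound = StringRegExp($sData, _
        "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
        $STR_REGEXPARRAYGLOBALMATCH)

If @error Then
    ConsoleWrite("No URLs found in " & $sFile & @CRLF)
Else
    _ArrayDisplay($aFound, "URLs found")
EndIf
```

Because the global-match flag scans the whole string, no per-line splitting is needed. The edge cases discussed above, such as trailing junk characters that are themselves valid in a URL, still apply.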
Deye (Posted September 24, 2018)

TheXman,

Your examples were reading off an array, but I dig your full intention, seeing it also in the OP. I guess some of us can get easily distracted at times.

These are just examples, but to help emphasize the original intention, in code it could have been put like so:

    #include <File.au3>

    Local $sData = ' - www.autoitscript.org... - _-#$%& _www.autoitscript.net -,' & @CRLF & _
        ' -_ http://www.international.in---- https://www.communications.com => , _' & @CRLF & _
        ' 1-http://www.networksupport.net&##$%--- www.organizasion.org -, _' & @CRLF & _
        ' - www.information.info - _-#$%& www.autoitscript.com -, _' & @CRLF & _
        ' -_ http://www.international.in.us&##-##$%&--- https://www.communications.com.fr&##$%$##$%& => , _' & @CRLF & _
        ' 1-http.networksupport.net.us- w-- _' & @CRLF & _
        ' - www.information.info/test.html, - _ _' & @CRLF & _
        ' -#$%& www.autoitscript.com/this&20is%20a%20test.html _-, ' & @CRLF & _
        ' -#$%& https://www.autoitscript.com/this&20is%20a%20test.html -&##$%&####$%&https://www.autoitscript.com/forum/topic/?tab=comments#comment-1403807&%$#3546737$aResult, _' & @CRLF & _
        ' $aResult- https://www.autoit-script.com.fr/ _ -$aResult' & @CRLF & _
        ' ]' & @CRLF

    Local $aResult = StringRegExp($sData, "(?i)\b(?:https?://|www\.)[-A-Z0-9+&@#/%=~_$?!:,.]*[A-Z0-9+&@#/%=~_$]", _
        $STR_REGEXPARRAYGLOBALMATCH)
    _ArrayDisplay($aResult)

    ; Yet Another Example
    $aResult = StringRegExp($sData, '(?i)(?:https?://|www\.)+[\w.?+=&%@#!:\-/]+\w', 3)
    $aResult[1] &= " <= Previously missed "
    _ArrayDisplay($aResult)