leuce Posted December 19, 2016 (edited)

Hello

I'm trying to get a script to search a string for a search query on a whole-words-only basis. This means that I would use something like StringRegExp ($string, "\b" & $query & "\b"). However, I have no control over what the $query will be -- it may contain characters that "mean" something in regular expressions. For example, it may contain a backslash or a fullstop, but I don't want the backslash or fullstop to mean what they usually mean in regular expressions.

Is there a way to use regular expression search while specifying that a certain portion of it should always be read literally? Or is my only solution to make a list of potential special characters and then escape them?

Here's a sample code, in case my explanation above is insufficiently clear:

    $string1 = "asdf bcd asdf"
    $string2 = "asdf .c. asdf"
    $query = ".c."
    $foo1 = StringRegExp ($string1, "\b" & $query & "\b", 1)
    MsgBox (0, "", $foo1[0], 0) ; I want it to fail, but it returns "bcd"
    $foo2 = StringRegExp ($string2, "\b" & $query & "\b", 1)
    MsgBox (0, "", $foo2[0], 0) ; I want it to return ".c.", but it fails

If my only solution is to escape characters, do you happen to know of a ready list of characters that must be escaped? For the moment it appears to me to be "[](){}?.\^$*+|".

Thanks
Samuel
czardas Posted December 19, 2016 (edited)

Look at \Q .... \E in the help file (these turn off special characters between the two instructions). If these exact sequences appear in the pattern to be tested, then you might need to make some replacements in the test string first, and change the pattern accordingly. You will have to experiment and see what suits your requirements.

BTW '.c.' is not adjacent to any word character, so using '\b' won't work in this example.

    Local $string2 = "asdf .c. asdf"
    Local $query = ".c."
    Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
    MsgBox (0, "", $foo2[0], 0) ; returns ".c."

operator64 ArrayWorkshop
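For readers following along, here is a minimal sketch of the \Q ... \E approach applied to both test strings from the first post. It uses StringRegExp's default flag 0, which returns 1 or 0 rather than an array, so a failed match cannot cause a subscript error on something like $foo1[0]. The variable names are illustrative only, and the sketch assumes the query does not itself contain the sequence \E.

```autoit
; Sketch: treat the whole query as literal text with \Q ... \E.
; Assumption: the query does not contain the sequence \E.
Local $sQuery = ".c."
Local $sPattern = "\Q" & $sQuery & "\E"

; Flag 0 (the default) returns 1 on a match and 0 otherwise.
ConsoleWrite(StringRegExp("asdf bcd asdf", $sPattern) & @CRLF) ; 0: the dots no longer act as wildcards
ConsoleWrite(StringRegExp("asdf .c. asdf", $sPattern) & @CRLF) ; 1: the literal ".c." is found
```

Note that this only makes the query literal; it does not yet restrict matches to whole words.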
iamtheky Posted December 19, 2016

Why is StringInStr insufficient for this task?
czardas Posted December 19, 2016

From this small example, we don't exactly know what the full task is. However it is always good to question one's approach, and if using StringInStr() is plausible, then it may turn out to be better.
leuce Posted December 19, 2016

Thanks for all your replies. The script that I write will perform searches in paragraphs from files. The user specifies the search query, and he specifies e.g. whether it is case-sensitive or not, whether only whole words should be matched or not, etc.

If the user were to specify "match whole words only", then StringInStr can't be used. For example, in the string "the overt rover is over", a StringInStr query for "over" will always match "overt", "rover" and "over". But if the user wants to match only "over" (i.e. whole words only), and not "overt" and "rover", which contain "over" but aren't "over" by themselves, then StringInStr can't be used (AFAIK).
iamtheky Posted December 19, 2016

    Case ;Whole Words
        $Frmt_UserInput = " " & $userstring & " "
        stringinstr($string , $Frmt_UserInput)

but regex works too.
czardas Posted December 19, 2016 (edited)

The part about word boundaries needs a clear definition. If the search string begins or ends with a symbol, then you need to define what counts as a boundary. You might want to define spaces, or the start and end of the source string, as boundaries.
iamtheky Posted December 19, 2016

Especially with literals. Without a decent knowledge of the target we can edge-case it for days, especially if it's not all English.
leuce Posted December 19, 2016

Well, I thought that since AutoIt does contain the concept of "word boundaries" in StringRegExp, I might as well make use of it. But if it turns out to be unreliable, then my other option would be (as iamtheky and czardas appear to point to) to specify word boundaries myself. These would be spaces, tabs, line breaks, all punctuation marks, and probably hyphens too. And obviously things would get extra complicated once you get to non-Latin scripts, but (and perhaps I should have said so, sorry) my intended user uses a language that uses a Latin script. ATM I'm the only user :-p

I'm calling it a night, but here (attached) is what I have at this stage (not yet taking into account any of your comments... that's for tomorrow).

Attachment: WFTM delete segs.zip
czardas Posted December 19, 2016 (edited)

All boundaries can be combined as alternatives ==> (?:\A|\b|\s) means any of these occurrences. (They cannot go in a character class such as [\A\b\s]: inside square brackets the anchors lose their meaning, and \b there stands for a backspace character.) If you have problems, post what you have tried.
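As an illustration of defining the boundary yourself, one possible sketch uses a pair of lookarounds: the query must not be preceded or followed by a word character. This is only one definition of "whole word" (an assumption, not something settled in this thread); swapping \w for \S in both lookarounds would instead require whitespace or the string ends on each side. The queries are hardcoded and wrapped in \Q ... \E, and the variable names are illustrative.

```autoit
; Sketch: "whole word" defined as "no word character directly before or after".
; (?<!\w) and (?!\w) are zero-width, so they also succeed at the start and end of the string.
Local $sPatternOver = "(?<!\w)\Qover\E(?!\w)"
ConsoleWrite(StringRegExp("the overt rover is over", $sPatternOver) & @CRLF) ; 1: only the final "over" qualifies
ConsoleWrite(StringRegExp("the overt rover", $sPatternOver) & @CRLF)         ; 0: "over" occurs only inside longer words

; Unlike \b, this definition also works when the query starts or ends with a symbol.
Local $sPatternDots = "(?<!\w)\Q.c.\E(?!\w)"
ConsoleWrite(StringRegExp("asdf .c. asdf", $sPatternDots) & @CRLF)   ; 1
ConsoleWrite(StringRegExp("asdf x.c.x asdf", $sPatternDots) & @CRLF) ; 0: letters touch the query on both sides
```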
jchd Posted December 19, 2016 (edited)

Our help file indicates which characters count as word characters for \b. Also, using (*UCP) you can significantly extend what \b means.

I also wanted to add that \b is in no way "unreliable". It just may not be the criterion you need, but as czardas points out, there are ways to overcome this.
iamtheky Posted December 19, 2016

Also, if you say the safe word, regexp will stop whatever it is doing to you.
leuce Posted January 6, 2017

On 12/19/2016 at 11:45 PM, iamtheky said:
> Case ;Whole Words
>     $Frmt_UserInput = " " & $userstring & " "
>     stringinstr($string , $Frmt_UserInput)
> but regex works too.

Thanks for the reply, but your example assumes that the only word boundary character is a space. Words can also be bounded by punctuation :-)
leuce Posted January 6, 2017

On 12/19/2016 at 11:03 PM, czardas said:
> Look at \Q .... \E in the help file (turn off special characters between these instructions).
>     Local $string2 = "asdf .c. asdf"
>     Local $query = ".c."
>     Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
>     MsgBox (0, "", $foo2[0], 0) ; returns ".c."

Thanks, I did not realise that a variable's content is treated as ordinary pattern text, special characters included, when it is used in a regular expression. I had thought that I could literalise characters by placing them in a variable. Your tip to use \Q and \E to literalise characters in the regular expression is helpful.
czardas Posted January 6, 2017 (edited)

If the search criteria contains the sequence '\E', then you will need to first make replacements using StringReplace(). A good replacement character would be one from the private Unicode range. In ArrayWorkshop, I used ChrW(57344) [= U+E000] as a replacement character. Of course you might have to undo the replacements afterwards. This depends on how your code is written.
jguinch Posted January 6, 2017

46 minutes ago, czardas said:
> If the search criteria contains the sequence '\E', then you will need to first make replacements using StringReplace().

In the case where the search criteria contains \E, why not just replace \E with \E\\E\Q in the pattern? So it looks like \Q\E\\E\Q\E? (I'm not sure I understand what you mean.)
czardas Posted January 6, 2017 (edited)

Maybe, because there's a very small chance that U+E000 is also within the source or the search string, although I can't think of many legitimate reasons for private-range Unicode characters to be part of a search pattern. There's still a chance that the sequence '\E\\E\Q' already exists, and things can get kind of messy. That's why I prefer my imperfect solution.

Edit: After thinking about it, I believe your solution should work regardless.
jguinch Posted January 6, 2017

22 minutes ago, czardas said:
> Edit: After thinking about it, I believe your solution should work.

I think so, too.
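The replacement suggested above can be sketched as a small helper. The function name _QuoteForRegExp is hypothetical (it is not part of AutoIt or of any UDF mentioned in this thread), and the sample strings are made up for illustration. The one detail worth noting is the case-sensitive StringReplace: with the default case-insensitive search, a harmless "\e" in the query would be replaced as well.

```autoit
; Hypothetical helper: wrap $sText in \Q ... \E so it is matched literally,
; even when $sText itself contains the sequence \E.
Func _QuoteForRegExp($sText)
    ; Close the quoted section at each \E, emit an escaped backslash and a plain E,
    ; then reopen the quoted section. casesense = 1, so only an upper-case \E is replaced.
    Local $sSafe = StringReplace($sText, "\E", "\E\\E\Q", 0, 1)
    Return "\Q" & $sSafe & "\E"
EndFunc

Local $sQuery = "a\Eb.c"
Local $sPattern = _QuoteForRegExp($sQuery) ; \Qa\E\\E\Qb.c\E
ConsoleWrite(StringRegExp("xx a\Eb.c xx", $sPattern) & @CRLF) ; 1: the query is found as literal text
ConsoleWrite(StringRegExp("xx a\Ebxc xx", $sPattern) & @CRLF) ; 0: the dot stays literal
```

Because no placeholder character is involved, nothing has to be undone afterwards, and the result can be combined with whatever boundary definition is chosen.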