Specify characters to be interpreted literally in StringRegExp

leuce · December 19, 2016

Hello

I'm trying to get a script to search a string for a search query on a whole-words-only basis.

This means that I would use something like StringRegExp ($string, "\b" & $query & "\b"). However, I have no control over what the $query will be -- it may contain characters that "mean" something in regular expressions. For example, it may contain a backslash or a fullstop, but I don't want the backslash or fullstop to mean what they usually mean in regular expressions.

Is there a way to use regular expression search while specifying that a certain portion of it should always be read literally? Or is my only solution to make a list of potential special characters and then escape them?

Here's some sample code, in case my explanation above is insufficiently clear:

$string1 = "asdf bcd asdf"
$string2 = "asdf .c. asdf"
$query = ".c."
$foo1 = StringRegExp ($string1, "\b" & $query & "\b", 1)
MsgBox (0, "", $foo1[0], 0) ; I want it to fail, but it returns "bcd"
$foo2 = StringRegExp ($string2, "\b" & $query & "\b", 1)
MsgBox (0, "", $foo2[0], 0) ; I want it to return ".c.", but it fails

If my only solution is to escape characters, do you happen to know of a ready list of characters that must be escaped? For the moment it appears to me to be "[](){}?.\^$*+|".

Thanks

Samuel

Edited December 19, 2016 by leuce

czardas · December 19, 2016

Look at \Q .... \E in the help file (turn off special characters between these instructions). If these exact sequences appear in the pattern to be tested, then you might need to make some replacements in the test string first, and change the pattern accordingly. You will have to experiment and see what suits your requirements.

BTW '.c.' begins and ends with a non-word character and is surrounded by spaces, so there is no word boundary at either end and '\b' won't work in this example.

Local $string2 = "asdf .c. asdf"
Local $query = ".c."
Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
MsgBox (0, "", $foo2[0], 0) ; returns ".c." because the query is now read literally
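
To make the contrast with the original attempt explicit, here is a minimal sketch that runs the same literal query against both sample strings. The helper name _LiteralPattern is made up for illustration (it is not a built-in), and the sketch assumes the query does not itself contain the sequence \E.

```autoit
; Sketch only: _LiteralPattern is a hypothetical helper name, not a built-in.
; Assumes the query does not itself contain the sequence \E.
Func _LiteralPattern($sQuery)
    Return "\Q" & $sQuery & "\E"
EndFunc

Local $sQuery = ".c."
Local $aStrings[2] = ["asdf bcd asdf", "asdf .c. asdf"]

For $sString In $aStrings
    ; Flag 0 (the default) returns 1 for a match and 0 for no match,
    ; so there is no array to index when nothing is found.
    Local $iFound = StringRegExp($sString, _LiteralPattern($sQuery))
    ConsoleWrite('"' & $sString & '" -> ' & $iFound & @CRLF)
Next
```

The first string should report 0 and the second 1, since the dots are no longer wildcards.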

Edited December 19, 2016 by czardas

iamtheky · December 19, 2016

Why is StringInStr insufficient for this task?

czardas · December 19, 2016

From this small example, we don't exactly know what the full task is. However it is always good to question one's approach, and if using StringInStr() is plausible, then it may turn out to be better.

leuce · December 19, 2016

Thanks for all your replies.

The script that I am writing will perform searches in paragraphs from files. The user specifies the search query, and also specifies e.g. whether it is case-sensitive or not, whether only whole words should be matched or not, etc. If the user were to specify "match whole words only", then StringInStr can't be used.

For example, in the string "the overt rover is over", a StringInStr query for "over" will always match "overt", "rover" and "over". But if the user wants to match only "over" (i.e. whole words only), and not "overt" and "rover", which contain "over" but aren't "over" by themselves, then StringInStr can't be used (AFAIK).
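
A rough sketch of how those user options might map onto a pattern, counting matches in the example sentence. The function name _FindQuery and its parameters are illustrative only; it assumes the query contains no \E sequence and begins and ends with a word character, so that \b is meaningful.

```autoit
; Hypothetical helper; the name and parameters are illustrative.
; Assumes $sQuery contains no \E sequence and begins and ends with a word character.
Func _FindQuery($sText, $sQuery, $bWholeWord = False, $bCaseSensitive = False)
    Local $sPattern = "\Q" & $sQuery & "\E"
    If $bWholeWord Then $sPattern = "\b" & $sPattern & "\b"
    If Not $bCaseSensitive Then $sPattern = "(?i)" & $sPattern
    ; Flag 3 returns an array of all matches, or sets @error if none are found.
    Local $aMatches = StringRegExp($sText, $sPattern, 3)
    If @error Then Return 0
    Return UBound($aMatches)
EndFunc

ConsoleWrite(_FindQuery("the overt rover is over", "over") & @CRLF)       ; 3
ConsoleWrite(_FindQuery("the overt rover is over", "over", True) & @CRLF) ; 1
```

Returning a count rather than the matches themselves is just a choice for the demonstration; the same pattern could be passed to StringRegExp with any other flag.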

iamtheky · December 19, 2016

Case ;Whole Words

$Frmt_UserInput = " " & $userstring & " "

stringinstr($string , $Frmt_UserInput)

but regex works too.

czardas · December 19, 2016

The part about word boundaries needs a clear definition. If the search string begins or ends with a symbol, then you need to define what a boundary is. You might want to define spaces, or the start and end of the source string, as boundaries.
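
One way to express "spaces, or the start and end of the source string" is with lookarounds. This is only a sketch under that particular definition of a boundary (whitespace or string edge); punctuation is not treated as a boundary here, and the query is assumed not to contain \E.

```autoit
; Sketch: treat whitespace and the start/end of the string as the only boundaries.
; (?<!\S) = not preceded by a non-space character; (?!\S) = not followed by one.
Local $sQuery = ".c."
Local $sPattern = "(?<!\S)\Q" & $sQuery & "\E(?!\S)"

ConsoleWrite(StringRegExp("asdf .c. asdf", $sPattern) & @CRLF)   ; 1: bounded by spaces
ConsoleWrite(StringRegExp("asdf x.c.x asdf", $sPattern) & @CRLF) ; 0: embedded in a longer token
ConsoleWrite(StringRegExp(".c.", $sPattern) & @CRLF)             ; 1: bounded by start and end
```

Because both lookarounds are zero-width, the match itself contains only the query text, not the surrounding spaces.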

Edited December 19, 2016 by czardas

iamtheky · December 19, 2016

Especially with literals. Without a decent knowledge of the target we can edge-case it for days, especially if it's not all English.

leuce · December 19, 2016

Well, I thought that since AutoIt does contain the concept of "word boundaries" in StringRegExp, I might as well make use of it. But if it turns out to be unreliable, then my other option would be (as iamtheky and czardas appear to point to) to specify word boundaries myself. These would be spaces, tabs, line breaks, all punctuation marks, and probably hyphens too.

And obviously things would get extra complicated once you get to non-Latin scripts, but (and perhaps I should have said so, sorry) my intended user uses a language that uses a Latin script. ATM I'm the only user :-p

I'm calling it a night, but here (attached) is what I have at this stage (not yet taking into account any of your comments... that's for tomorrow).

WFTM delete segs.zip

czardas · December 19, 2016

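All boundaries can be combined in an alternation ==> (?:\A|\b|\s) matches any of these occurrences. (They cannot go in a character class: inside [...], \b means backspace and \A is no longer an anchor.) If you have problems, post what you have tried.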

Edited December 19, 2016 by czardas

jchd · December 19, 2016

Our help file lists which characters count as word characters for \b. Also, using (*UCP) you can significantly extend what \b means.

I also wanted to add that \b is in no way "unreliable". It just may not be the criterion you need, but as czardas points out, there are ways to overcome this.
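
A small experiment to see the effect of (*UCP) on \b. The test string is invented for illustration. My understanding is that without (*UCP) the accented letter is not treated as a word character, so a boundary is found before "over", while with (*UCP) it is treated as a letter and no boundary is found; it is worth running this to confirm on your AutoIt version.

```autoit
; Sketch: compare \b with and without (*UCP) when the neighbour is a non-ASCII letter.
; ChrW(0x0159) is the letter U+0159 (r with caron); it is built with ChrW to avoid
; depending on the script file's encoding.
Local $sText = "p" & ChrW(0x0159) & "over"
Local $sQuery = "over"

Local $iPlain = StringRegExp($sText, "\b\Q" & $sQuery & "\E\b")
Local $iUcp = StringRegExp($sText, "(*UCP)\b\Q" & $sQuery & "\E\b")

ConsoleWrite("without (*UCP): " & $iPlain & @CRLF)
ConsoleWrite("with (*UCP):    " & $iUcp & @CRLF)
```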

Edited December 19, 2016 by jchd

iamtheky · December 19, 2016

Also if you say the safe word regexp will stop whatever it is doing to you

leuce · January 6, 2017

On 12/19/2016 at 11:45 PM, iamtheky said:

Case ;Whole Words

$Frmt_UserInput = " " & $userstring & " "

stringinstr($string , $Frmt_UserInput)

but regex works too.

Thanks for the reply, but your example assumes that the only word boundary character is a space. Words can also be bounded by punctuation :-)

leuce · January 6, 2017

On 12/19/2016 at 11:03 PM, czardas said:
Look at \Q .... \E in the help file (turn off special characters between these instructions).
Local $string2 = "asdf .c. asdf"
Local $query = ".c."
Local $foo2 = StringRegExp($string2, "(\Q" & $query & "\E)", 1)
MsgBox (0, "", $foo2[0], 0) ; returns ".c." because the query is now read literally

Thanks, I did not realise that a variable's content is simply joined into the pattern and interpreted like any other regular expression text. I had thought that I could literalise characters by placing them in a variable. Your tip to use \Q and \E to literalise characters in the regular expression is helpful.

czardas · January 6, 2017

If the search criteria contains the sequence '\E', then you will need to first make replacements using StringReplace(). A good replacement character would be one from the private Unicode range. In ArrayWorkshop, I used ChrW(57344) [= U+E000] as a replacement character. Of course you might have to undo the replacements afterwards. This depends on how your code is written.

Edited January 6, 2017 by czardas

jguinch · January 6, 2017

46 minutes ago, czardas said:

If the search criteria contains the sequence '\E', then you will need to first make replacements using StringReplace().

In the case where the search criteria contains \E, why not just replace \E with \E\\E\Q in the pattern? So it looks like \Q\E\\E\Q\E? (I'm not sure I understand what you mean.)
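
As a sketch of that replacement (the helper name _QuoteLiteral is made up for illustration): each \E in the query closes the quoted section, is matched as an escaped backslash plus a plain E, and then a new quoted section is opened.

```autoit
; Hypothetical helper implementing the replacement suggested above.
Func _QuoteLiteral($sQuery)
    ; Last argument 1 = case-sensitive, so only \E (not \e) is rewritten.
    Return "\Q" & StringReplace($sQuery, "\E", "\E\\E\Q", 0, 1) & "\E"
EndFunc

Local $sQuery = "a\Eb.c"
ConsoleWrite(_QuoteLiteral($sQuery) & @CRLF) ; \Qa\E\\E\Qb.c\E
ConsoleWrite(StringRegExp("xx a\Eb.c xx", _QuoteLiteral($sQuery)) & @CRLF) ; 1
ConsoleWrite(StringRegExp("xx a\Ebxc xx", _QuoteLiteral($sQuery)) & @CRLF) ; 0
```

The case-sensitive flag matters because StringReplace is case-insensitive by default and would otherwise also rewrite \e.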

czardas · January 6, 2017

Maybe, because there's a very small chance that U+E000 is also within the source, or the search string. Although I can't think of many legitimate reasons for private range Unicode characters to be part of a search pattern. There's still a chance that the sequence '\E\\E\Q' already exists, and things can get kind of messy. That's why I prefer my imperfect solution.

Edit: After thinking about it, I believe your solution should work regardless.

Edited January 6, 2017 by czardas

jguinch · January 6, 2017

22 minutes ago, czardas said:

Edit: After thinking about it, I believe your solution should work.

I think, too..
