Check if a string fits a given regular expression pattern.
StringRegExp ( "test", "pattern" [, flag = 0 [, offset = 1]] )
test | The subject string to check |
pattern | The regular expression to match. |
flag | [optional] A number to indicate how the function behaves. See below for details. The default is 0. |
offset | [optional] The string position to start the match (starts at 1). The default is 1. |
Flag | Values |
$STR_REGEXPMATCH (0) | Returns 1 (match) or 0 (no match). (Default). |
$STR_REGEXPARRAYMATCH (1) | Return array of matches. |
$STR_REGEXPARRAYFULLMATCH (2) | Return array of matches including the full match (Perl / PHP style). |
$STR_REGEXPARRAYGLOBALMATCH (3) | Return array of global matches. |
$STR_REGEXPARRAYGLOBALFULLMATCH (4) | Return an array of arrays containing global matches including the full match (Perl / PHP style). |
Flag = $STR_REGEXPMATCH (0):
@error: | Meaning |
2: | Bad pattern. @extended = offset of error in pattern. |
Flag = $STR_REGEXPARRAYMATCH (1) or $STR_REGEXPARRAYFULLMATCH (2):
@error: | Meaning |
0: | Array is valid. Check @extended for next offset. |
1: | Array is invalid. No matches. |
2: | Bad pattern, array is invalid. @extended = offset of error in pattern. |
Flag = $STR_REGEXPARRAYGLOBALMATCH (3) or $STR_REGEXPARRAYGLOBALFULLMATCH (4):
@error: | Meaning |
0: | Array is valid. |
1: | Array is invalid. No matches. |
2: | Bad pattern, array is invalid. @extended = offset of error in pattern. |
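The error codes above can be checked as in the following minimal sketch. The subject strings and the deliberately broken pattern "(abc" are made up for illustration; @error and @extended are copied into variables straight away because a later function call may overwrite them.

```autoit
#include <StringConstants.au3>

; Hypothetical bad pattern: the opening parenthesis is never closed.
Local $vResult = StringRegExp("abc", "(abc", $STR_REGEXPMATCH)
Local $iError = @error, $iExtended = @extended
If $iError = 2 Then
	ConsoleWrite("Bad pattern, error at pattern offset " & $iExtended & @CRLF)
EndIf

; Valid pattern that finds nothing: with an array flag, @error is 1.
Local $aNone = StringRegExp("abc", "(\d+)", $STR_REGEXPARRAYMATCH)
Local $iNoMatch = @error
If $iNoMatch = 1 Then ConsoleWrite("No match, returned array is invalid" & @CRLF)
```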
for testing various StringRegExp() patterns - Thanks steve8tch. Credit: w0uter
The flag parameter can have one of 5 values ($STR_REGEXPMATCH (0) through $STR_REGEXPARRAYGLOBALFULLMATCH (4)).
$STR_REGEXPMATCH (0) | Returns 1 (true) if the pattern was found, or 0 (false) if it was not. |
$STR_REGEXPARRAYMATCH (1) $STR_REGEXPARRAYFULLMATCH (2) | Find the first match and return captured groups in an array; when the pattern has no capturing groups, the first match is returned in the array. |
$STR_REGEXPARRAYGLOBALMATCH (3) $STR_REGEXPARRAYGLOBALFULLMATCH (4) | Fill the array with all matching instances. |
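As a rough illustration of how the flags differ, the sketch below runs one pattern against one subject with flags 0 to 3. The subject string and variable names are invented for this example.

```autoit
#include <StringConstants.au3>

; Hypothetical sample data: two key=value pairs.
Local $sSubject = "key1=10; key2=20"
Local $sPattern = "(\w+)=(\d+)"

; Flag 0: only reports whether the pattern matches (1 or 0).
Local $iFound = StringRegExp($sSubject, $sPattern, $STR_REGEXPMATCH)
ConsoleWrite("Flag 0: " & $iFound & @CRLF)

; Flag 1: captured groups of the first match only.
Local $aFirst = StringRegExp($sSubject, $sPattern, $STR_REGEXPARRAYMATCH)
ConsoleWrite("Flag 1: " & $aFirst[0] & ", " & $aFirst[1] & @CRLF)

; Flag 2: like flag 1, but element 0 holds the full match.
Local $aFirstFull = StringRegExp($sSubject, $sPattern, $STR_REGEXPARRAYFULLMATCH)
ConsoleWrite("Flag 2: " & $aFirstFull[0] & @CRLF)

; Flag 3: captured groups of every match, in one flat array.
Local $aAll = StringRegExp($sSubject, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
ConsoleWrite("Flag 3: " & UBound($aAll) & " elements" & @CRLF)
```

Here flag 1 yields "key1" and "10", flag 2 additionally yields "key1=10" in element 0, and flag 3 yields four elements (both groups of both matches).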
(*CR) | Carriage return (@CR). |
(*LF) | Line feed (@LF). |
(*CRLF) | Carriage return immediately followed by linefeed (@CRLF). |
(*ANYCRLF) | Any of @CRLF, @CR or @LF. This is the default newline convention. |
(*ANY) | Any Unicode newline sequence: @CRLF, @LF, VT, FF, @CR or \x85. |
(*BSR_ANYCRLF) | By default \R matches @CRLF, @CR or @LF only. |
(*BSR_UNICODE) | Changes \R to match any Unicode newline sequence: @CRLF, @LF, VT, FF, @CR or \x85. |
(?i) | Caseless: matching becomes case-insensitive from that point on. By default, matching is case-sensitive. When UCP is enabled casing applies to the entire Unicode plane 0, else applies by default to ASCII letters A-Z and a-z only. |
(?m) | Multiline: ^ and $ match at newline sequences within data. By default, multiline is off. |
(?s) | Single-line or DotAll: . matches anything including a newline sequence. By default, DotAll is off hence . does not match a newline sequence. |
(?U) | Ungreedy: quantifiers become lazy (non-greedy) from that point on. By default, matching is greedy - see below for further explanation. |
(?x) | eXtended: whitespaces outside character classes are ignored and # starts a comment up to the next solid newline in pattern. Meaningless whitespaces between components make regular expressions much more readable. By default, whitespaces match themselves and # is a literal character. |
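The option letters above can be tried out with a short sketch like the following. All subject strings are invented; each pair of calls differs only in the option placed at the start of the pattern.

```autoit
#include <StringConstants.au3>

; (?i): case-insensitive matching.
Local $iCase = StringRegExp("HELLO", "hello")
Local $iNoCase = StringRegExp("HELLO", "(?i)hello")

; (?m): ^ also matches just after a newline sequence inside the subject.
Local $sLines = "one" & @CRLF & "two"
Local $iSingle = StringRegExp($sLines, "^two")
Local $iMulti = StringRegExp($sLines, "(?m)^two")

; (?s): the dot also matches newline sequences.
Local $iDot = StringRegExp($sLines, "one.+two")
Local $iDotAll = StringRegExp($sLines, "(?s)one.+two")

; (?x): whitespace in the pattern is ignored, which allows visual spacing.
Local $aDate = StringRegExp("2024-05-17", "(?x) (\d{4}) - (\d{2}) - (\d{2})", $STR_REGEXPARRAYMATCH)

ConsoleWrite($iCase & $iNoCase & $iSingle & $iMulti & $iDot & $iDotAll & @CRLF)
ConsoleWrite($aDate[0] & "/" & $aDate[1] & "/" & $aDate[2] & @CRLF)
```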
\a | Represents "alarm", the BEL character (Chr(7)). |
\cX | Represents "control-X", where X is any 7-bit ASCII character. For example, "\cM" represents ctrl-M, same as \x0D or \r (Chr(13)). |
\e | Represents the "escape" control character (Chr(27)). Not to be confused with the escaping of a character! |
\f | Represents "formfeed" (Chr(12)). |
\n | Represents "linefeed" (@LF, Chr(10)). |
\r | Represents "carriage return" (@CR, Chr(13)). |
\t | Represents "tab" (@TAB, Chr(9)). |
\ddd | Represents character with octal code ddd, OR backreference to capturing group number ddd in decimal. For example, ([a-z])\1 would match a doubled letter. Best avoided as it can be ambiguous! Favor the hex representations below. |
\xhh | Represents Unicode character with hex codepoint hh: "\x7E" represents a tilde, "~". |
\x{hhhh} | Represents Unicode character with hex codepoint hhhh: "\x{20AC}" represents the Euro symbol, "€" (ChrW(0x20AC)). |
\x | where x is non-alphanumeric, stands for a literal x. Used to represent metacharacters literally: "\.\[" represents a dot followed by a left square bracket, ".[". |
\Q ... \E | Verbatim sequence: metacharacters lose their special meaning between \Q and \E: "\Q(.)\E" matches "(.)" and is equivalent to, but more readable than, "\(\.\)". |
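One possible use of \Q ... \E is searching for text that happens to contain metacharacters. The sketch below uses a made-up literal; note that this simple wrapping would not be safe if the literal itself contained "\E".

```autoit
; Hypothetical literal text containing the metacharacters + ( . )
Local $sLiteral = "1+1=2 (approx.)"

; Used directly as a pattern, + acts as a quantifier and ( ) as a group,
; so the pattern does not match its own text.
Local $iRaw = StringRegExp($sLiteral, $sLiteral)

; Wrapped in \Q ... \E, every character is taken verbatim.
Local $iQuoted = StringRegExp($sLiteral, "\Q" & $sLiteral & "\E")

ConsoleWrite("Raw: " & $iRaw & ", quoted: " & $iQuoted & @CRLF)
```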
. | Matches any single character except, by default, a newline sequence. Matches newlines as well when option (?s) is active. |
\d | Matches any decimal digit (any Unicode decimal digit in any language when UCP is enabled). |
\D | Matches any non-digit. |
\h | Matches any horizontal whitespace character (see table below). |
\H | Matches any character that is not a horizontal whitespace character. |
\N | Matches any character except a newline sequence regardless of option (?s). |
\p{ppp} | Only when UCP is enabled: matches any Unicode character having the property ppp. E.g. "\b\p{Cyrillic}+" matches any Cyrillic word; "\p{Sc}" matches any currency symbol. See reference documentation for details. |
\P{ppp} | Only when UCP is enabled: matches any Unicode character not having the property ppp. |
\R | Matches a newline sequence as defined by the currently active (*BSR_...) setting. By default \R matches "(?>\r\n|\n|\r)" where "(?>...)" is an atomic group, making the sequence "\r\n" (@CRLF) unbreakable. |
\s | Matches any whitespace character (see table below). |
\S | Matches any non-whitespace character. |
\v | Matches any vertical whitespace character (see table below). |
\V | Matches any character that is not a vertical whitespace character. |
\w | Matches any "word" character: any digit, any letter or underscore "_" (any Unicode digit, any Unicode letter in any language or underscore "_" when UCP is enabled). |
\W | Matches any non-word character. |
\X | Only when UCP is enabled: matches any Unicode extended grapheme cluster - an unbreakable sequence of codepoints which represent a single character for the user. As a consequence \X may match more than one character in the subject string, contrary to all other sequences in this table. |
[ ... ] | Matches any character in the explicit set: "[aeiou]" matches any lowercase vowel. A contiguous (in Unicode codepoint increasing order) set can be defined by putting a hyphen between the starting and ending characters: "[a-z]" matches any lowercase ASCII letter. To include a hyphen (-) in a set, put it as the first or last character of the set or escape it (\-). Notice that the pattern "[A-z]" is not the same as "[A-Za-z]": the former is equivalent to "[A-Z\[\\\]^_`a-z]". To include a closing bracket in a set, use it as the first character of the set or escape it: "[][]" and "[\[\]]" will both match either "[" or "]". Note that in a character class, only \d, \D, \h, \H, \p{}, \P{}, \s, \Q...\E, \S, \v, \V, \w, \W, and \x sequences retain their special meaning, while \b means the backspace character (Chr(8)). |
[^ ... ] | Matches any character not in the set: "[^0-9]" matches any non-digit. To include a literal caret (^) in a set, put it anywhere but at the beginning of the set or escape it (\^). |
[:alnum:] | ASCII letters and digits (same as [^\W_] or [A-Za-z0-9]). When UCP is enabled: Unicode letters and digits (same as [^\W_] or \p{Xan}). |
[:alpha:] | ASCII letters (same as [^\W\d_] or [A-Za-z]). When UCP is enabled: Unicode letters (same as [^\W\d_] or \p{L}). |
[:ascii:] | ASCII characters (same as [\x00-\x7F]). |
[:blank:] | Space or Tab (@TAB) (same as \h or [\x09\x20]). When UCP is enabled: Unicode horizontal whitespaces (same as \h). |
[:cntrl:] | ASCII control characters (same as Chr(0) ... Chr(31) and Chr(127)). |
[:digit:] | ASCII decimal digits (same as \d or [0-9]). When UCP is enabled: Unicode decimal digits (same as \d or \p{Nd}). |
[:graph:] | ASCII printing characters, excluding space (same as Chr(33) ... Chr(126)). |
[:lower:] | ASCII lowercase letters (same as [a-z]). When UCP is enabled: Unicode lowercase letters (same as \p{Ll}). |
[:print:] | ASCII printing characters, including space (same as Chr(32) ... Chr(126)). |
[:punct:] | ASCII punctuation characters, [:print:] excluding [:alnum:] and space, (33-47, 58-64, 91-96, 123-126). |
[:space:] | ASCII white space (same as [\h\x0A-\x0D]). [:space:] is not quite the same as \s: it includes VT (Chr(11)). |
[:upper:] | ASCII uppercase letters (same as [A-Z]). When UCP is enabled: Unicode uppercase letters (same as \p{Lu}). |
[:word:] | ASCII "Word" characters (same as \w or [[:alnum:]_]). When UCP is enabled: Unicode "word" characters (same as \w or [[:alnum:]_] or \p{Xwd}). |
[:xdigit:] | Hexadecimal digits (same as [0-9A-Fa-f]). |
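The POSIX classes listed above are written inside a character class, so a pattern contains double brackets such as "[[:alpha:]]". A small sketch with invented subject strings:

```autoit
; Letters only: fails because the subject also contains digits.
Local $iAlpha = StringRegExp("abc123", "^[[:alpha:]]+$")

; Letters and digits: succeeds.
Local $iAlnum = StringRegExp("abc123", "^[[:alnum:]]+$")

; A POSIX class can be combined with other set members, here an underscore.
Local $iIdent = StringRegExp("my_var1", "^[[:alpha:]_][[:alnum:]_]*$")

; Hexadecimal digits only.
Local $iHex = StringRegExp("DEADbeef", "^[[:xdigit:]]+$")

ConsoleWrite($iAlpha & $iAlnum & $iIdent & $iHex & @CRLF)
```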
( ... ) | Capturing group. The elements in the group are treated in order and can be repeated as a block. E.g. "(ab)+c" will match "abc" or "ababc", but not "abac". Capturing groups remember the text they matched for use in backreferences and they populate the optionally returned array. They are numbered starting from 1 in the order of appearance of their opening parenthesis. Capturing groups may also be treated as subroutines elsewhere in the pattern, possibly recursively. |
(?<name> ... ) | Named capturing group. Can be later referenced by name as well as by number. Avoid using the name "DEFINE" (see "conditional patterns"). |
(?: ... ) | Non-capturing group. Does not record the matching characters in the array and cannot be re-used as backreference. |
(?| ... ) | Non-capturing group with reset. Resets capturing group numbers in each top-level alternative it contains: "(?|(Mon)|(Tue)s|(Wed)nes|(Thu)rs|(Fri)|(Sat)ur|(Sun))day" matches a weekday name and captures its abbreviation in group number 1. |
(?> ... ) | Atomic non-capturing group: locks and never backtracks into (gives back from) what has been matched (see also Quantifiers and greediness below). Atomic groups, like possessive quantifiers, are always greedy. |
(?# ... ) | Comment group: always ignored (but may not contain a closing parenthesis, hence comment groups are not nestable). |
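The weekday pattern from the (?| ... ) row can be exercised as follows. This is only a sketch with arbitrarily chosen day names; whichever alternative matches, the abbreviation lands in group 1, so the returned array has a single element.

```autoit
#include <StringConstants.au3>

; Pattern taken from the table above: group numbers are reset in each alternative.
Local $sDayPattern = "(?|(Mon)|(Tue)s|(Wed)nes|(Thu)rs|(Fri)|(Sat)ur|(Sun))day"

Local $aSat = StringRegExp("Saturday", $sDayPattern, $STR_REGEXPARRAYMATCH)
Local $aWed = StringRegExp("Wednesday", $sDayPattern, $STR_REGEXPARRAYMATCH)

ConsoleWrite($aSat[0] & " (" & UBound($aSat) & " element)" & @CRLF)
ConsoleWrite($aWed[0] & " (" & UBound($aWed) & " element)" & @CRLF)
```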
? | 0 or 1, greedy. |
?+ | 0 or 1, possessive. |
?? | 0 or 1, lazy. |
* | 0 or more, greedy. |
*+ | 0 or more, possessive. |
*? | 0 or more, lazy. |
+ | 1 or more, greedy. |
++ | 1 or more, possessive. |
+? | 1 or more, lazy. |
{x} | exactly x. |
{x,y} | at least x and no more than y, greedy. |
{x,y}+ | at least x and no more than y, possessive. |
{x,y}? | at least x and no more than y, lazy. |
{x,} | x or more, greedy. |
{x,}+ | x or more, possessive. |
{x,}? | x or more, lazy. |
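To make the difference between greedy, lazy and possessive quantifiers concrete, here is a small sketch on invented subject strings. Because the patterns contain no capturing group, element 0 of the returned array holds the full match.

```autoit
#include <StringConstants.au3>

Local $sHtml = "<b>one</b> <b>two</b>"

; Greedy: .* grabs as much as possible, so the match runs to the last </b>.
Local $aGreedy = StringRegExp($sHtml, "<b>.*</b>", $STR_REGEXPARRAYMATCH)

; Lazy: .*? grabs as little as possible, so the match stops at the first </b>.
Local $aLazy = StringRegExp($sHtml, "<b>.*?</b>", $STR_REGEXPARRAYMATCH)

; Possessive: a++ consumes every "a" and never gives one back,
; so the trailing "a" in the pattern can no longer match.
Local $iBacktrack = StringRegExp("aaa", "a+a")
Local $iPossessive = StringRegExp("aaa", "a++a")

ConsoleWrite("Greedy: " & $aGreedy[0] & @CRLF)
ConsoleWrite("Lazy:   " & $aLazy[0] & @CRLF)
ConsoleWrite("a+a: " & $iBacktrack & ", a++a: " & $iPossessive & @CRLF)
```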
X|Y | Matches either subpattern X or Y: "ac|dc|ground" matches "ac" or "dc" or "ground". |
\n | References a previous capturing group by its absolute number. WARNING: if no group number n exists, it evaluates as the character with value n provided n is a valid octal value, else errors out. Due to this ambiguity, this form is not recommended. Favor the next forms for a safe semantic. |
\gn | References a previous capturing group by its absolute number. |
\g{n} | References a previous capturing group by its absolute number. Similar to above but clearly delimits where n ends: useful when the following character(s) is(are) digits. |
\g-n | References a previous capturing group by its relative number. |
\k<name> | References a previous capturing group by its name. |
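A typical use of backreferences is spotting a repeated word. The sketch below uses a made-up sentence and shows the numbered form \g{1} and the named form \k side by side; the group name "word" is arbitrary.

```autoit
#include <StringConstants.au3>

Local $sText = "this is is a test"

; Numbered backreference: \g{1} must match the same text that group 1 captured.
Local $aDup = StringRegExp($sText, "\b(\w+)\s+\g{1}\b", $STR_REGEXPARRAYMATCH)

; The same idea with a named group and a named backreference.
Local $aDupNamed = StringRegExp($sText, "\b(?<word>\w+)\s+\k<word>\b", $STR_REGEXPARRAYMATCH)

ConsoleWrite("Repeated word: " & $aDup[0] & @CRLF)
ConsoleWrite("Repeated word (named group): " & $aDupNamed[0] & @CRLF)
```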
(?R) or (?0) | Recurses into the entire regular expression. |
(?n) | Calls subpattern by absolute number. |
(?+n) | Calls subpattern by relative number. |
(?-n) | Calls subpattern by relative number. |
(?&name) | Calls subpattern by name. |
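Recursion is mostly useful for nested constructs. As a sketch, the pattern below (an illustration, not taken from this page) matches balanced parentheses by recursing into the whole expression with (?R); the subject string is invented.

```autoit
#include <StringConstants.au3>

; An opening parenthesis, then any mix of non-parenthesis runs or nested
; balanced groups (via recursion into the whole pattern), then a closing one.
Local $sBalanced = "\((?:[^()]++|(?R))*\)"

; No capturing groups are used, so each array element is a full match.
Local $aGroups = StringRegExp("f(a(b)c) + g(d)", $sBalanced, $STR_REGEXPARRAYGLOBALMATCH)

For $i = 0 To UBound($aGroups) - 1
	ConsoleWrite($aGroups[$i] & @CRLF)
Next
```

With this subject the array holds "(a(b)c)" and "(d)".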
^ | Outside a character class, the caret matches at the start of the subject text, and also just after a non-final newline sequence if option (?m) is active. By default the newline sequence is any of @CRLF, @CR or @LF (see (*ANYCRLF)). Inside a character class, a leading ^ complements the class (excludes the characters listed there). |
$ | Outside a character class, the dollar matches at the end of the subject text, and also just before a newline sequence if option (?m) is active. Inside a character class, $ means itself, a dollar sign. |
\A | Matches only at the absolute beginning of subject string, irrespective of the multiline option (?m). Will never match if offset is not 1. |
\G | Matches when the current position is the first matching position in subject. |
\z | Matches only at end of subject string, irrespective of the multiline option (?m). |
\Z | Matches only at end of subject string, or before a newline sequence at the end, irrespective of the multiline option (?m). |
\b | Matches at a "word" boundary, i.e. between characters not both \w or \W. See \w for the meaning of "word". Inside a character class, \b means "backspace" (Chr(8)). |
\B | Matches when not at a word boundary. |
(?=X) | Positive look-ahead: matches when the subpattern X matches starting at the current position. |
(?!X) | Negative look-ahead: matches when the subpattern X does not match starting at the current position. |
(?<=X) | Positive look-behind: matches when the subpattern X matches characters preceding the current position. Pattern X must match a fixed-length string, i.e. may not use any indefinite quantifier * + or ?. |
(?<!X) | Negative look-behind: matches when the subpattern X does not match characters preceding the current position. Pattern X must match a fixed-length string, i.e. may not use any indefinite quantifier * + or ?. |
\K | Resets start of match at the current point in subject string. Note that groups already captured are left alone and still populate the returned array; it is therefore always possible to backreference to them later on. Action of \K is similar but not identical to a look-behind, in that \K can work on alternations of varying lengths. |
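The following sketch contrasts a look-ahead with \K on an invented price string. In both cases the text matched by the assertion or preceding \K is left out of the reported match.

```autoit
#include <StringConstants.au3>

Local $sPrices = "Price: 100 USD, 200 EUR"

; Positive look-ahead: digits only when followed by " USD"; the currency
; itself is checked but not included in the match.
Local $aUsd = StringRegExp($sPrices, "\d+(?= USD)", $STR_REGEXPARRAYGLOBALMATCH)

; \K: "Price: " must be present, but the reported match starts after it.
Local $aAfter = StringRegExp($sPrices, "Price: \K\d+", $STR_REGEXPARRAYMATCH)

ConsoleWrite("Look-ahead: " & $aUsd[0] & " (" & UBound($aUsd) & " match)" & @CRLF)
ConsoleWrite("\K: " & $aAfter[0] & @CRLF)
```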
(?(condition)yes-pattern) | Allows conditional execution of pattern. |
(?(condition)yes-pattern|no-pattern) | Chooses between distinct patterns depending on the result of (condition). |
(n) | Tests whether the capturing group with absolute number n matched. |
(+n) | Tests whether the capturing group with relative number +n matched. |
(-n) | Tests whether the capturing group with relative number -n matched. |
(<name>) | Tests whether the capturing group with name name matched. |
(R) | Tests whether any kind of recursion occurred. |
(Rn) | Tests whether the most recent recursion was for capturing group with absolute number n. |
(R&name) | Tests whether the most recent recursion was for capturing group with name name. |
(DEFINE) | Used without no-pattern, permits definition of a subroutine useable from elsewhere. "(?x) (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )" defines a subroutine named "byte" which matches any component of an IPv4 address. Then an actual address can be matched by "\b (?&byte) (\.(?&byte)){3} \b". |
(assertion) | Here assertion is a positive or negative, look-ahead or look-behind assertion. |
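The (DEFINE) example from the table can be assembled into one pattern and tested as below. The two pattern fragments come from the table; joining them into a single string and the sample addresses are choices made for this sketch.

```autoit
; (?x) allows the spacing; the DEFINE group only declares the "byte" subroutine
; and matches nothing by itself.
Local $sIPv4 = "(?x) (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )" & _
		" \b (?&byte) (\.(?&byte)){3} \b"

Local $iValid = StringRegExp("192.168.1.254", $sIPv4)
Local $iInvalid = StringRegExp("256.1.1.1", $sIPv4)

ConsoleWrite("192.168.1.254 -> " & $iValid & @CRLF)
ConsoleWrite("256.1.1.1     -> " & $iInvalid & @CRLF)
```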
(?J) | Enables duplicate group or subroutine names (not discussed further here). |
(?X) | Causes some out-of-context sequences to raise an error, instead of being benign. |
(*J) | Enables Javascript compatibility (not discussed further here). |
(*LIMIT_MATCH=n) | Limits number of matches to n. |
(*LIMIT_RECURSION=n) | Limits recursion to n levels. |
(*NO_START_OPT) | Disables several optimizations (not discussed further here). |
(*ACCEPT) | Forces an immediate match success in the current subroutine or top-level pattern. |
(*FAIL) or (*F) | Forces an immediate match failure. |
(*MARK:name) or (*:name) | (See reference documentation.) |
(*COMMIT) | (See reference documentation.) |
(*PRUNE) | (See reference documentation.) |
(*PRUNE:name) | (See reference documentation.) |
(*SKIP) | (See reference documentation.) |
(*SKIP:name) | (See reference documentation.) |
(*THEN) | (See reference documentation.) |
(*THEN:name) | (See reference documentation.) |
StringInStr, StringRegExpReplace
#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aArray = 0, _
		$iOffset = 1, $iOffsetStart
While 1
	$iOffsetStart = $iOffset
	$aArray = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '(?i)<test>(.*?)</test>', $STR_REGEXPARRAYMATCH, $iOffset)
	If @error Then ExitLoop
	$iOffset = @extended
	For $i = 0 To UBound($aArray) - 1
		MsgBox($MB_SYSTEMMODAL, "Opt 1 at " & $iOffsetStart, $aArray[$i])
	Next
WEnd
#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aArray = 0, _
		$iOffset = 1, $iOffsetStart
While 1
	$iOffsetStart = $iOffset
	$aArray = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '(?i)<test>(.*?)</test>', $STR_REGEXPARRAYFULLMATCH, $iOffset)
	If @error Then ExitLoop
	$iOffset = @extended
	For $i = 0 To UBound($aArray) - 1 Step 2
		MsgBox($MB_SYSTEMMODAL, "Option 2 at " & $iOffsetStart, $aArray[$i] & @TAB & "captured = " & $aArray[$i + 1])
	Next
WEnd
#include <Array.au3>
#include <StringConstants.au3>

Local $aArray = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '(?i)<test>(.*?)</test>', $STR_REGEXPARRAYGLOBALMATCH)
#cs
	1st Capturing Group (.*?)
	. matches any character (except for line terminators)
	*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
#ce
_ArrayDisplay($aArray, "Option 3 Results")
#include <Array.au3>
#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

Local $aArray = StringRegExp('F1oF2oF3o', '(F.o)*?', $STR_REGEXPARRAYGLOBALFULLMATCH)
#cs
	1st Capturing Group (F.o)*?
	*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
#ce
_ArrayDisplay($aArray, "Opt - 4 Results")

Local $aMatch = 0
For $i = 0 To UBound($aArray) - 1
	; Each element of $aArray is itself an array holding one match.
	$aMatch = $aArray[$i]
	If UBound($aMatch) > 1 Then
		_ArrayDisplay($aMatch, "Array #" & $i)
	Else
		MsgBox($MB_SYSTEMMODAL, "Array #" & $i, "'" & $aMatch[0] & "' StringLen = " & StringLen($aMatch[0]))
	EndIf
Next
#include <Array.au3>

_Example()

Func _Example()
	Local $sHTML = _
			'<select id="OptionToChoose">' & @CRLF & _
			' <option value="" selected="selected">Choose option</option>' & @CRLF & _
			' <option value="1">Sun</option>' & @CRLF & _
			' <option value="2">Earth</option>' & @CRLF & _
			' <option value="3">Moon</option>' & @CRLF & _
			'</select>' & @CRLF & _
			''
	Local $aOuter = StringRegExp($sHTML, '(?is)(<option value="(.*?)"( selected="selected"|.*?)>(.*?)</option>)', $STR_REGEXPARRAYGLOBALFULLMATCH)
	_ArrayDisplay($aOuter, '$aOuter')

	Local $aInner
	For $IDX_Out = 0 To UBound($aOuter) - 1
		$aInner = $aOuter[$IDX_Out]
		_ArrayDisplay($aInner, '$aInner = $aOuter[$IDX_Out] ... $IDX_Out = ' & $IDX_Out)
	Next
EndFunc   ;==>_Example