RegExp - has anyone seen this library before?

martijn · September 27, 2006

I really don't recommend that until we actually decide we want to go this route. Testing is okay but don't write mission-critical applications with any test executables because nothing is final yet.

No problemo

sohfeyr · September 28, 2006

It would be very difficult but it basically requires dropping some supported operating systems or writing a ton of code that Windows already implements for us if we do want to support them... Then there is porting the existing code to use WCHAR instead of CHAR. That is probably about as much effort as writing all the wrappers.

As I said, not something I expect to see any time soon. Nice to have the description of the process involved, though.

trids · September 28, 2006

Just stumbled on this thread .. and wanted to add my vote of support

Also, following some links that thomasl included with his PCRE wrapper (in another thread), I came across the following pages, which offer an excellent introduction to regexps. For those who need a quick introduction:

They also include some examples that might prove useful for testing the AU3 implementation, as they spell out the results and subtleties for various expressions and features.

HTH

Jon · October 1, 2006

Test AutoIt Exe: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

///////////////////////////////////////////////////////////////////////////////
//
// $val = StringRegExp("string", "pattern", [flag, [offset]])
//
// Perform regular expression matching on the given string.
//
// flags:
//      0(default) - returns 1 (matched) or 0 (no match)
//      1 - return array of matches
//
// When flag = 1:
//      Returns an array.
//      @Error = 0.  Array is valid.  Check @Extended for next offset
//      @Error = 1.  Array is invalid.  No matches.
//      @Error = 2.  Bad pattern, array is invalid.  @Extended = offset of error in pattern.
//
///////////////////////////////////////////////////////////////////////////////

Based on PHP's preg_match function (which seems to return the entire match followed by the matching substrings). I haven't done a global version yet because I don't know whether this is working correctly (half the patterns I try don't work, but I don't know if they should work or if it is broken...), and I also have no idea how to return a global selection of data in a meaningful way. It's very hard to implement regexp code when you barely understand regexps, so help with testing would be great.

Here is code using the offset parameter to perform a manual global match.

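$nOffset = 1
While 1
    ; Flag 1 returns an array of matches; the 4th argument is the offset to start from.
    $array = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '<(?i)test>(.*?)</(?i)test>', 1, $nOffset)
    If @error = 0 Then
        $nOffset = @extended ; next offset, used to resume the search
    Else
        ExitLoop ; no more matches, or a bad pattern
    EndIf
    For $i = 0 To UBound($array) - 1
        MsgBox(0, $i, $array[$i])
    Next
WEnd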

steve8tch · October 1, 2006

Checked out a few of my regexs (including some that I used to have issues with) - most of them are quite simple - but it seems to be behaving fine.

Valik · October 1, 2006

I just tested one of my patterns and didn't even have to change it (That was unexpected). It worked mostly but the returned array contained data I didn't expect.

Take this simple script:

Main()

Func Main()
    Local $s = "abcdef"
    Local $p = "(ab)(cd)"
    Local $a = StringRegExp($s, $p, 1)
    ConsoleWrite('@@ (48) :(' & @min & ':' & @sec & ') UBound($a) = ' & UBound($a) & @CR);### Debug Console
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc; Main()

The output is:

@@ (48) :(59:13) UBound($a) = 3
abcd
ab
cd

I expected:

@@ (48) :(59:13) UBound($a) = 2
ab
cd

Edit: Fixed the post up a bit.

Edited October 1, 2006 by Valik

Valik · October 1, 2006

Just tried another expression. It looks like you have to escape $ when it's not being used as an anchor. For example, I had the pattern "$\((.*?)\)" which would match things like "Foo" in the string "$(Foo)". In order to make that pattern compatible with PCRE, I had to make it "\$\((.*?)\)".

So far I'm optimistic that our patterns won't be too broken by using PCRE. Just need to get the damn "too-much-data" problem fixed. StringRegExp() would return this using the pattern and string mentioned above:

$(Foo)
Foo

Again, the first line should not be there. The group only specified that "Foo" should be captured.
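The escaping rule described in this post can be checked outside AutoIt. The sketch below uses Python's re module purely as a stand-in for a Perl-style engine; it is not the AutoIt test build, and the pattern is an assumed reconstruction of the kind of expression discussed (a literal dollar sign, a parenthesized name, one capturing group).

```python
import re

# Assumed reconstruction of the pattern under discussion:
# literal "$", literal "(", a lazily captured name, literal ")".
pattern = re.compile(r"\$\((.*?)\)")

match = pattern.search("$(Foo)")
full_match = match.group(0)   # text matched by the whole pattern
captured = match.group(1)     # text matched by the single capturing group

# Left unescaped, "$" is an end-of-subject anchor, so the same pattern
# can never match the literal dollar sign that starts the string.
unescaped = re.search(r"$\((.*?)\)", "$(Foo)")
```

Here full_match holds the whole "$(Foo)" text and captured holds only "Foo", which is the same split between the first and second array entries complained about above.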

Jon · October 1, 2006

The first array entry seems to be something to do with a full match, the one in php does the same (and also, the implementation that tylo did a while ago has this too). So I thought I'd keep it the same.

Edit: The comment from php's preg_match:

$matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

Whether it's useful or not I have no idea whatsoever.
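For comparison, here is a minimal sketch of the same convention in Python's re module, used only as an illustration of the "index 0 is the whole match" layout that preg_match documents. The variable names are made up for this example and nothing here is AutoIt code.

```python
import re

match = re.search(r"(ab)(cd)", "abcdef")

# Index 0 is the whole match; indexes 1..n are the capturing groups.
with_full_match = [match.group(0), *match.groups()]

# Dropping index 0 leaves only the captured groups.
groups_only = list(match.groups())
```

The first list has three entries (abcd, ab, cd) and the second has two (ab, cd), so the two layouts differ only by that leading element.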

I had a go at the replace stuff and decided I'd had enough for one day.

Valik · October 1, 2006

The problem is, the existing implementation did not use it and that will break scripts. So far I'm surprised at how compatible the expressions are. I guess David used Perl as a guide so a lot of patterns are going to work with PCRE out of the box. However, if the returned data is different than the "native" implementation, things are just as broken. I'll have to go through and adjust all my loops to start indexing at 1 instead of 0 even if the pattern itself works perfectly. That seems a shame to me since the patterns are what I thought would make the implementation incompatible.

Valik · October 1, 2006

Jon, here is my proposal. It's a combination of maintaining backwards compatibility and supporting what PCRE does by default. Here are the flags I propose:

0 - Current behavior, returns True or False if the pattern matches.

1 - Old behavior. Only return data that matches a group and only return the first matches. Example:

Main()

Func Main()
    Local $s = "abcdefabcdef"
    Local $p = "(ab)(cd)"
    Local $a = StringRegExp($s, $p, 1)
    ConsoleWrite("Matches: " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc

Output:

Matches: 6
abcd
ab
cd
abcd
ab
cd

This will work because the flags in the old StringRegExp() were not bit-flags. This provides maximum compatibility so that any breakages will require very minor tweaks to the pattern. It also adds in the new functionality which I admit could be useful.

Edit: For flag 4, I'm assuming that PCRE behaves the same with a global match that it does with a single match. If PCRE behaves exactly like flag 3, then flag 4 can be skipped. If the behavior of PCRE does not match flag 3 but does seem useful, then it can be put onto flag 4.

Edited October 1, 2006 by Valik

Jon · October 1, 2006

It has no concept of a global match, which is what I'm struggling with atm. You basically have to manually re-call it (like AutoIt example above) but we would implement it internally. If the interface doesn't output something mentioned above then I don't understand enough about it to make it do so.

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.

PS. I've got the simple version of StringRegExp replace working (no dollar substitutions etc) so I'll post that in a while.

Edit: At least your post gives me some examples to play with. I was really struggling to find some.

spyrorocks · October 1, 2006

If there was some way to make this exactly like the PHP function, I could really use it.

Valik · October 1, 2006

I think it may be useful but I'm trying to keep as much backwards compatibility as possible. Like I said before, the patterns are pretty close and a lot of them are going to work out of the box with PCRE so it's a shame the output is not the same, otherwise this transition would be very smooth requiring only minor changes to patterns.

From what you posted earlier (in private maybe), it sounded like the function with "all" in its name did a global search. I don't know what its output would be, though I suspect it should be similar to flag 3 of David's implementation.

Jon · October 1, 2006

Yeah, preg_match_all is the php function. But the underlying pcre api doesn't have a global option so it seems we have to do the global cleverness manually. There's no way to predict how many matches will be done so it seems like we'll have to keep calling the single match function and adding the matches to some sort of linked list and then when there are no more matches decide how to turn that into something useful for AutoIt.

I'm leaving global until last, I think doing StringRegExpReplace looks easier.

Valik · October 1, 2006

It'd be nice to use std::vector for that. Wonder how much STL would increase size by? I wonder if we've gotten to the point we can use STL without too much size bloat? We could port a lot of stuff to STL...

sohfeyr · October 1, 2006

I think the value in position 0 is very useful when parsing long documents. You can examine both your capturing groups and their context and relation to each other. (.Net's implementation is similar: RegEx.Matches(n).Groups(0) returns the text that matched the whole expression.)

If backward compatibility is really an issue though, people like me could always just enclose the whole expression as a group. As long as nested groups are supported, that shouldn't be too big a problem. Personally, I like the flags idea. It would be easier for people to add a flag to their regexp calls than to go through and be sure of every 0-based loop that needs to become 1-based.
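The "enclose the whole expression as a group" workaround can be sketched like this. Python's re module is used as a stand-in engine, and the variable names are invented for the example; the assumption is only that groups are numbered by the position of their opening parenthesis, as in Perl-style engines.

```python
import re

subject = "abcdef"
inner = r"(ab)(cd)"
wrapped = "(" + inner + ")"   # the outer group now captures the whole match

# Groups only, as the older behavior returned them.
groups_inner = list(re.search(inner, subject).groups())

# With the wrapper, the whole match arrives as group 1 and the
# original groups shift up by one position.
groups_wrapped = list(re.search(wrapped, subject).groups())
```

One side effect worth keeping in mind: any backreference inside the original pattern would need its number increased by one after wrapping.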

Edited October 1, 2006 by sohfeyr

Jon · October 2, 2006

I need a regexp that will match the $n or ${n} parts of a string.

I currently have "\\$([0-9]+)", which matches $1 and $2 OK, but I also need to cope with situations that have {}, like ${1}.

It's for the replacement parameter code in StringRegExpReplace - I was going to use a regexp to parse itself Oo

This almost works: "\\${*([0-9]+)}*", but it allows ${{{1}}}, which is wrong. Is there some way to match zero or one { but no more?

SmOke_N · October 2, 2006

I'm going to assume you're speaking of the current project you're working on and not the current release version?

Jon · October 2, 2006

Yes.

thomasl · October 2, 2006

This looks pretty good, Jon. I have thrown some simple and quite a few of my more convoluted patterns at it and they work out okay. I did compare the output of AU3 to what the same pattern produces in Perl and with the exception of element[0] (whole match) they agree. Good job.

FWIW, I agree about keeping backwards compatibility if at all possible. If someone really wants the whole match, another pair of parentheses does the trick, as sohfeyr pointed out.

As to ${...}: try this: \$\{{0,1}\d+\}{0,1}

EDIT: Sorry, forgot the () around \d+: \$\{{0,1}(\d+)\}{0,1}
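A quick way to see what the {0,1} quantifier buys is to run the pattern against the cases from Jon's question. The sketch below does that in Python's re module as a stand-in engine; group_number is a hypothetical helper written for this illustration, not part of any AutoIt or PCRE API.

```python
import re

# thomasl's pattern: "$", at most one "{", captured digits, at most one "}".
placeholder = re.compile(r"\$\{{0,1}(\d+)\}{0,1}")

def group_number(text):
    """Return the captured digits, or None when the pattern does not match."""
    match = placeholder.search(text)
    return match.group(1) if match else None

plain = group_number("$1")           # bare form
braced = group_number("${1}")        # braced form
nested = group_number("${{{1}}}")    # extra braces are rejected
unbalanced = group_number("${1")     # still accepted: the braces are independent
```

The first two forms capture the digit and the nested form is rejected, which is what was asked for. The last case shows a limit of this approach: because each brace is optional on its own, an opening brace without a closing one still matches. If that matters, an alternation such as \$(?:\{(\d+)\}|(\d+)) would require the braces to come as a pair, at the cost of two capturing groups.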

Edited October 2, 2006 by thomasl
