StringSplit multiple whole words autoit


Go to solution Solved by jguinch,

Posted

Hello!

I didn't find this on the forums so I would appreciate any help.

I was wondering if there was a way to StringSplit using multiple whole words as delimiters.

For example,

$rtext = "We advised you to clear your cache and cookies."

$asentence = StringSplit($rtext,"We ", 1) ; <- How should this be written to split $rtext on "We " and "you " so that $asentence will be:

;asentence[0] = 3
;asentence[1] = ""
;asentence[2] = "advised "
;asentence[3] = "to clear your cache and cookies."

Let me know if this is possible or if I have to use other means of achieving this. Thanks in advance!

Posted

#Include <Array.au3>
$rtext = "We advised you to clear your cache and cookies."

$asentence = StringRegExp($rtext, 'We\s*(\w+)\s*you\s*(.+)', 3)

_ArrayDisplay($asentence)

?

 

Well, that satisfies one example. But I want the split to happen for any sentence pattern.

For example $rtext could be:

 

$rtext = "You explained that we need to call you back at a later time." ; <- So asentence should be:

;asentence[0] = 3
;asentence[1] = ""
;asentence[2] = "explained that "
;asentence[3] = "need to call you back at a later time."
  • Solution
Posted (edited)

In your 2nd example, it should be (case-insensitive, of course)...

$rtext = "You explained that we need to call you back at a later time." ; <- So asentence should be:

;asentence[0] = 4
;asentence[1] = ""
;asentence[2] = "explained that "
;asentence[3] = "need to call"
;asentence[4] = "back at a later time."

No ?

 

It could be something like this:

#Include <Array.au3>

Local $rtext = "We advised you to clear your cache and cookies."
Local $aDelimiters[] = [ "you", "we" ]


$asentense = _StringSplitMultiple($rtext, $aDelimiters)
_ArrayDisplay($asentense)

; $iFlag = 0 : case-sensitive
; $iFlag = 1 : case-insensitive
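Func _StringSplitMultiple($sString, $aDelims, $iFlag = 1)

    ; Lazily capture everything up to the next delimiter or the end of the string
    Local $sPattern = "(.*?)(?:"
    If $iFlag Then $sPattern = "(?i)" & $sPattern

    ; \b on both sides so that "we" does not match inside "owe"
    For $i = 0 To UBound($aDelims) - 1
        $sPattern &= "\b" & $aDelims[$i] & "\b|"
    Next
    $sPattern &= "$)"

    Local $aResult = StringRegExp($sString, $sPattern, 3)

    If IsArray($aResult) Then
        ; Shift the captures up by one to free [0] for the count. This relies on
        ; the last capture being the empty one matched at the end of the string.
        For $i = UBound($aResult) - 1 To 1 Step -1
            $aResult[$i] = $aResult[$i - 1]
        Next
    Else
        Return SetError(1, 0, -1)
    EndIf

    $aResult[0] = UBound($aResult) - 1
    Return $aResult
EndFunc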
Edited by jguinch
Posted (edited)

As often in general and with regular expressions in particular, the devil hides in the detail.

If a "sentence" is assumed to be what I would call "well formed", i.e. without parasitic whitespaces or made-up pitfalls, then the job is already non-trivial for a single regexp.

But if the beef is supposed to cook for any input string, then a more precise definition of "word" is needed.

Sample inputs (whitespaces matter):

"we you we"

" no magic word here "

"we love you"

" we love you, do we "

"loving you shall we"

"you-tube"

and so on.

Also, is the solution allowed to produce inelegant empty strings?

EDIT in bold!

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Posted

Watch the effect of this (overly simple) pattern on some "sentences":

#Include <Array.au3>
Local $rtext[] = [ _
    "We Remember you we advised you to clear your cache and cookies, now we are not guilty: you are.", _
    "we you we", _
    " no magic word here ", _
    "we love you", _
    " we love you, do we ", _
    "loving you shall we", _
    "you-tube" _
]
For $s In $rtext
    $asentence = StringRegExp($s, '(?i)\b((?:(?!\bwe\b|\byou\b).)+)', 3)
    _ArrayDisplay($asentence)
Next

Still parasitic empty captures.

Posted

I apologize for following up to myself this way but my findings deserve a separate post so that readers are aware of progress.

I've hit a bug (or is that a mis-feature?) of our PCRE implementation, which caused me some headache.

Here is the one-liner able to split any sentence on a list of taboo words properly, while removing whitespace around the taboo words. Run it and see for yourself whether it fits the bill. Also watch the second loop and observe that we get a parasitic empty capture corresponding to the pseudo-group named "taboo". The PCRE interface distinguishes named definitions, like the DEFINE for "taboo", from numbered groups, but StringRegExp treats the group as an effective part of the result.

#Include <Array.au3>
Local $rtext[] = [ _
    "We Remember you we advised you to clear your cache and cookies, now we are not guilty: you are.", _
    "we you wE", _
    "you owe us $100", _
    " no magic word here ", _
    "we love you", _
    " we love you, do we ", _
    "loving you shall we", _
    "you_tube", _
    "you-tube" _
]
For $s In $rtext
    $asentence = StringRegExp($s, '(?ix)  (?: \h* (?<!\pL) (?:we|you) (?!\pL) \h* )* ( (?: (?! \h* (?<!\pL) (?:we|you) (?!\pL) \h* ) \N)* ) (?: \h* (?<!\pL) (?:we|you) (?!\pL) \h* )* \K', 3)
    _ArrayDisplay($asentence)
Next

; results should be identical using a DEFINE special condition, but our PCRE implementation returns an empty ghost capture for the DEFINE, which IMHO it shouldn't do.
For $s In $rtext
    $asentence = StringRegExp($s, '(?ix) (?(DEFINE) (?<taboo> \h* (?<!\pL) (?:we|you) (?!\pL) \h*) )  (?: (?&taboo) )* ( (?: (?! (?&taboo) ) \N)* ) (?: (?&taboo) )* \K', 3)
    _ArrayDisplay($asentence)
Next

\pL is a character property which is true for letters, the same as [[:alpha:]] in non-Unicode mode. \w (and therefore \b) is not applicable since it regards the underscore as a word character.
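
To make the difference between the two boundary definitions concrete, here is a minimal sketch (the sample strings are only illustrative). It tests the word "you" once with \b and once with the letter-based lookarounds used in the pattern above; StringRegExp in its default mode returns 1 for a match and 0 otherwise.

```autoit
Local $aSamples[] = ["you_tube", "you-tube", "thank you"]

For $s In $aSamples
    ; \b relies on \w, which counts "_" as a word character,
    ; so there is no boundary between "you" and "_tube"
    Local $iWordB = StringRegExp($s, "(?i)\byou\b")

    ; "_" is not a letter, so the letter-based lookarounds accept "you_tube"
    Local $iLetter = StringRegExp($s, "(?i)(?<!\pL)you(?!\pL)")

    ConsoleWrite($s & " -> \b: " & $iWordB & ", \pL: " & $iLetter & @CRLF)
Next

; you_tube  -> \b: 0, \pL: 1
; you-tube  -> \b: 1, \pL: 1
; thank you -> \b: 1, \pL: 1
```

Which of the two behaviours is "right" for "you_tube" depends on the definition of "word" you need.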

Posted (edited)

Exactly. To be fair, regex101.com offers the PHP flavor of PCRE, which might behave somewhat differently from the genuine PCRE library. Perl also has a number of differing behaviors, all of them pointed out in the PCRE reference documents.

RegexBuddy v4 correctly displays more details than regex101.com: in full detail mode it lists group "taboo" and then the actual capture but points out that Group "taboo" did not participate in the match. In normal mode, it only shows the actual capturing group for each match.

I'll post a detailed bug ticket as soon as I have more information about the reason for this (minor but annoying) issue.

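Simple code to demonstrate the issue, without even needing to invoke the DEFINEd subroutine:

#Include <Array.au3>
; optional group (a)? does not participate in the match
_ArrayDisplay(StringRegExp("bbb", "(?x) (a)? (b+)", 3))
; the DEFINEd group "head" is never invoked
_ArrayDisplay(StringRegExp("bbb", "(?x) (?(DEFINE) (?<head> a)) (b+)", 3))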

Edit: this is now Trac ticket #2696.

Edited by jchd

Posted

 


 

Jguinch, this seems to work well enough for my situation. Thank you!

(You misspelled "sentence" in your example though, but don't worry about it haha =P)

 


I don't really mind the empty strings in my case since I can always just filter them out, but other people may prefer to avoid them. You have done a great job with that in your solution, Jchd. Your regex is very detailed, it's amazing.
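
For anyone who wants to filter the empty strings afterwards, something along these lines should do. It is only a sketch: `_DropEmptyParts` is a made-up helper name (not a standard UDF), and it assumes a StringSplit-style array where element [0] holds the count.

```autoit
; Sample input in StringSplit layout: [0] is the count
Local $aDemo[] = [3, "", "advised ", "to clear your cache and cookies."]
Local $aClean = _DropEmptyParts($aDemo)

For $i = 1 To $aClean[0]
    ConsoleWrite($i & ": [" & $aClean[$i] & "]" & @CRLF)
Next

; Hypothetical helper: returns a copy of $aParts without zero-length elements.
; Whitespace-only strings such as " " are kept.
Func _DropEmptyParts($aParts)
    Local $aOut[UBound($aParts)]
    Local $iCount = 0

    For $i = 1 To UBound($aParts) - 1
        If StringLen($aParts[$i]) > 0 Then
            $iCount += 1
            $aOut[$iCount] = $aParts[$i]
        EndIf
    Next

    ; Shrink to the kept elements plus the count slot
    ReDim $aOut[$iCount + 1]
    $aOut[0] = $iCount
    Return $aOut
EndFunc
```

With the sample above, the result holds two strings, "advised " and "to clear your cache and cookies.", with the count 2 in element [0].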

Posted

PCRE is amazing, I'm only a janitor.

AFAICT you only get an empty string as a result when the "sentence" contains only taboo words or is itself empty. Of course it's trivial to use a longer list of splitting words.
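
As a rough illustration of the longer list, the pattern from the earlier post can be assembled from an array of words. This is only a sketch under stated assumptions: `_SplitOnTabooWords` is a hypothetical name, the words are assumed to be plain letters (no regex metacharacters, no escaping is done), and the pattern is the same one as before written without the (?x) whitespace.

```autoit
Local $aTaboo[] = ["we", "you", "they"]
Local $aParts = _SplitOnTabooWords("We think they told you twice", $aTaboo)

If Not @error Then
    For $sPart In $aParts
        ConsoleWrite("[" & $sPart & "]" & @CRLF)
    Next
EndIf

; Hypothetical wrapper: splits $sText on any of the whole words in $aWords,
; dropping horizontal whitespace around them. Returns the array of pieces,
; or sets @error to 1 if StringRegExp reports an error.
Func _SplitOnTabooWords($sText, $aWords)
    ; Build the alternation, e.g. "we|you|they" (words assumed regex-safe)
    Local $sAlt = ""
    For $i = 0 To UBound($aWords) - 1
        If $i > 0 Then $sAlt &= "|"
        $sAlt &= $aWords[$i]
    Next

    ; One taboo word, not adjacent to a letter, with surrounding blanks
    Local $sTaboo = "\h*(?<!\pL)(?:" & $sAlt & ")(?!\pL)\h*"

    ; Skip leading taboo words, capture up to the next one, skip trailing ones
    Local $sPattern = "(?i)(?:" & $sTaboo & ")*((?:(?!" & $sTaboo & ")\N)*)(?:" & $sTaboo & ")*\K"

    Local $aResult = StringRegExp($sText, $sPattern, 3)
    If @error Then Return SetError(1, 0, 0)
    Return $aResult
EndFunc
```

The taboo sub-pattern is repeated three times in the final string because the DEFINE variant currently yields the ghost capture mentioned earlier.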
