
Another RegEx question (challenge?)


Go to solution Solved by mikell,


Let's say I have two strings ($s1 and $s2) of indeterminate length. Neither string contains any @CR, @LF, or @CRLF characters.

For $s1, I want to insert \n periodically throughout the string, breaking the string up into chunks of no more than 80 chars between \n codes and inserting \n only immediately after a comma. I don't want to replace the comma, just insert the \n right after it.

For $s2, I want to do the exact same thing except in this case I want to insert the \n codes immediately after a space instead of a comma; again, not replacing the space but just inserting the \n right after it.

Sample strings:

$s1: Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker

$s2: There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length.

The output desired is this:

$s1: Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley,\n Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter,\n Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker

$s2: There are usually about 200 words in a paragraph, but this can vary widely. \nMost paragraphs focus on a single idea that's expressed with an introductory \nsentence, then followed by two or more supporting sentences about the idea. A \nshort paragraph may not reach even 50 words while long paragraphs can be over \n400 words long, but generally speaking they tend to be approximately 200 words \nin length.
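
The spec above (chunks of at most 80 characters, each break placed right after the chosen separator) can be checked mechanically. Here is a minimal sketch of such a check; the helper name _IsWrappedOk and its parameters are hypothetical and not part of the original post, and it assumes the chunk length is counted including the trailing separator.

```autoit
; Hypothetical helper, not from the thread: returns True when $sWrapped meets the spec.
Func _IsWrappedOk($sWrapped, $sSep, $iMaxLen = 80)
    ; 1 + 2 = $STR_ENTIRESPLIT + $STR_NOCOUNT: split on the whole two-character "\n" marker
    Local $aChunks = StringSplit($sWrapped, "\n", 1 + 2)
    Local $iLast = UBound($aChunks) - 1
    For $i = 0 To $iLast
        ; No chunk may be longer than $iMaxLen characters
        If StringLen($aChunks[$i]) > $iMaxLen Then Return False
        ; Every chunk except the last must end with the separator
        If $i < $iLast And StringRight($aChunks[$i], StringLen($sSep)) <> $sSep Then Return False
    Next
    Return True
EndFunc   ;==>_IsWrappedOk

ConsoleWrite(_IsWrappedOk("one, two,\n three, four", ",") & @CRLF) ; break follows a comma: True
ConsoleWrite(_IsWrappedOk("one, two\n, three, four", ",") & @CRLF) ; break does not follow a comma: False
```

Running any of the wrapping functions in this thread through a check like this is a quick way to compare them against the spec.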

I've accomplished this using the following code:

Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"
Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."

_WrapText($s1, ",")
_WrapText($s2, " ")

ConsoleWrite($s1 & @CRLF)
ConsoleWrite($s2 & @CRLF)

Exit

Func _WrapText(ByRef $sTxt, $sChar)
    Local Const $iMaxLen = 80 ; maximum characters per chunk, including the trailing $sChar
    Local $iLen = StringLen($sTxt)
    Local $sWrapped = ""
    Local $iStartPos = 1
    Local $iWrapPos
    While $iLen - $iStartPos + 1 > $iMaxLen
        ; Position of the last $sChar within the next $iMaxLen characters
        $iWrapPos = StringInStr(StringMid($sTxt, $iStartPos, $iMaxLen), $sChar, 0, -1)
        If $iWrapPos = 0 Then ExitLoop ; no break character in this window, so stop instead of looping forever
        $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & "\n"
        $iStartPos += $iWrapPos
    WEnd
    ; Append whatever is left (this also keeps strings that never needed wrapping intact)
    $sTxt = $sWrapped & StringMid($sTxt, $iStartPos)
EndFunc

But having seen the remarkable things regex is capable of, I wonder if there is a slick regex method of adding the \n codes to the strings in the manner described?


This is a bit more difficult and it's something that I've briefly looked into for some things in my job. In the end I didn't get an actual character count break working, just an occurrence thing. It was set to match a pattern 25 times, then replace it with the match and \n. I'm not quite sure that RegEx can reliably do what you're looking for. 

That being said, I like trying things and I know more now than I did then, so for commas: https://regex101.com/r/ywFNPy/1

([^\v]{0,80})(?:,|$)

And then for spaces/whitespace it's just a simple modification: https://regex101.com/r/HP6gZy/1

([^\v]{0,80})(?:\s|$)

Putting it into AutoIt to test:

Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"
Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."
Local $s3 = $s1 & ' ' & $s2

ConsoleWrite(_WrapText($s1, ",") & @CRLF & @CRLF)
ConsoleWrite(_WrapText($s2, " ") & @CRLF & @CRLF)
ConsoleWrite('---------------------------------' & @CRLF & @CRLF)
ConsoleWrite('RegEx output,       Auto:' & @CRLF & _WrapText_RegEx($s3, 0, 80) & @CRLF & @CRLF)
ConsoleWrite('RegEx output,      comma:' & @CRLF & _WrapText_RegEx($s1, 1, 80) & @CRLF & @CRLF)
ConsoleWrite('RegEx output, whitespace:' & @CRLF & _WrapText_RegEx($s2, 2, 80) & @CRLF & @CRLF)

Exit

Func _WrapText($sTxt, $sChar)
    Local $iMaxLen = 80
    Local $iLen = StringLen($sTxt)
    Local $sWrapped = ""
    Local $iStartPos = 1
    Local $iWrapPos
    While $iLen - $iStartPos + 1 > $iMaxLen
        ; Position of the last $sChar within the next $iMaxLen characters
        $iWrapPos = StringInStr(StringMid($sTxt, $iStartPos, $iMaxLen), $sChar, 0, -1)
        If $iWrapPos = 0 Then ExitLoop ; no break character in this window, so stop instead of looping forever
;~         $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & "\n"
        $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & @CRLF
        $iStartPos += $iWrapPos
    WEnd

    ; Append whatever is left (this also keeps strings that never needed wrapping intact)
    Return $sWrapped & StringMid($sTxt, $iStartPos)
EndFunc   ;==>_WrapText


Func _WrapText_RegEx($sTxt, $iMode = 0, $iLineMaxLength = 80)
    ; iMode, 0 = Auto, 1 = comma, 2 = whitespace
    If $iMode = Default Then $iMode = 0
    If $iMode > 2 Or $iMode < 0 Then $iMode = 0
    If $iLineMaxLength = Default Then $iLineMaxLength = 80

    Local $sPatternComma = '([^\v]{0,' & $iLineMaxLength & '})(,|$)'
    Local $sPatternWhitespace = '([^\v]{0,' & $iLineMaxLength & '})(\s|$)'
    Local $sPatternAuto = '([^\v]{0,' & $iLineMaxLength & '})(,|\s|$)'

    Local $sOutput1, $sOutput2, $sReturn
    Local $aLines1, $aLines2
    Local $iLines1, $iLines2

    Switch $iMode
        Case 0 ; Auto: break on either a comma or whitespace (the commented-out block below was the earlier approach: run both patterns and keep whichever yields fewer lines)
            #cs
            $sOutput1 = StringRegExpReplace($sTxt, $sPatternComma, '$1' & @CRLF)
            ; Get a count of how many lines there are. Alternatively and likely better is StringReplace for both @LF and @CR and add the @extended
            $aLines1 = StringSplit($sOutput1, @CRLF, 2)
            $iLines1 = UBound($aLines1)

            $sOutput2 = StringRegExpReplace($sTxt, $sPatternWhitespace, '$1' & @CRLF)
            $aLines2 = StringSplit($sOutput2, @CRLF, 2)
            $iLines2 = UBound($aLines2)

            If $iLines1 <= $iLines2 Then
                $sReturn = $sOutput1
            Else
                $sReturn = $sOutput2
            EndIf
            #ce
            $sReturn = StringRegExpReplace($sTxt, $sPatternAuto, '$1$2' & @CRLF)
        Case 1 ; Comma
            $sReturn = StringRegExpReplace($sTxt, $sPatternComma, '$1$2' & @CRLF)
        Case 2 ; Whitespace
            $sReturn = StringRegExpReplace($sTxt, $sPatternWhitespace, '$1$2' & @CRLF)
    EndSwitch

    Return StringStripWS($sReturn, 1 + 2)  ; $STR_STRIPLEADING + $STR_STRIPTRAILING
EndFunc   ;==>_WrapText_RegEx

Seems to work for me; the only thing is that I'm not keeping the trailing comma or whitespace. It's probably easiest to simply add that into the replacement with $1,\n, but if you do that, make sure you reduce your line/character length by 1 to account for the comma (the whitespace one can probably just be dropped).

 

Edit: I just realized that for my 'Auto' mode I could just combine looking for either a whitespace or a comma, duh. Updated code. Also updated to keep commas, though I don't have a check to make sure that adding the comma doesn't push a line from 80 characters to 81. A simple fix for $iMode = 1 is to use $iLineMaxLength - 1 in the pattern.

 

Edit 2: I also compared the speed of both, and the RegEx is faster:

Runs: 100000
1) "_WrapText" ('_WrapText($s2, " ")') time elapsed: 5609.60 ms
2) "_WrapText_RegEx" ('_WrapText_RegEx($s2)') time elapsed: 3541.02 ms
#Fastest Function: "_WrapText_RegEx"
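
For anyone who wants to repeat a comparison like this, a rough timing loop can be sketched with TimerInit and TimerDiff. Everything below is an assumption for illustration (the sample text, the run count, and the stand-in function name _WrapCandidate); it is not the benchmark tool that produced the numbers above.

```autoit
; Hypothetical timing harness; sample text and run count are arbitrary.
Local $sText = "alpha, beta, gamma, delta, epsilon, zeta, eta, theta, iota, kappa, lambda, mu, nu, xi, omicron, pi, rho, sigma, tau, upsilon"
Local $iRuns = 10000

Local $hTimer = TimerInit()
For $i = 1 To $iRuns
    _WrapCandidate($sText)
Next
; TimerDiff returns the elapsed time in milliseconds
ConsoleWrite("Elapsed: " & Round(TimerDiff($hTimer), 2) & " ms for " & $iRuns & " runs" & @CRLF)

; Stand-in for whichever wrapping function is being measured;
; the body reuses the comma pattern posted by mikell in this thread.
Func _WrapCandidate($sTxt)
    Return StringTrimRight(StringRegExpReplace($sTxt, ".{1,79}(,|$)\K", "\\n"), 2)
EndFunc   ;==>_WrapCandidate
```

Swap the body of _WrapCandidate for each function in turn and keep the same input and run count so the timings are comparable.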

 

Edited by mistersquirrle



  • Solution

Funny regex challenge :)
My 2 cents

Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"

$res1 = StringTrimRight(StringRegExpReplace($s1, ".{1,79}(,|$)\K", "\\n"), 2)
Msgbox(0,"", $res1)


Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."

$res2 = StringTrimRight(StringRegExpReplace($s2, ".{1,79}(\h|$)\K", "\\n"), 2)
Msgbox(0,"", $res2)

Edit
...and the func

Msgbox(0,"", _WrapText($s1, ",", 80) )
Msgbox(0,"", _WrapText($s2, "\h", 80) )

Func _WrapText($txt, $char, $n)
    Return StringTrimRight(StringRegExpReplace($txt, '.{1,' & $n-1 & '}(' & $char & '|$)\K', "\\n"), 2)
EndFunc

Please note that you can write either _WrapText($s2, "\h", 80) or _WrapText($s2, " ", 80) as both work
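
To make it easier to see why this pattern works, here is a commented restatement of the same idea. The function name _WrapAfter and its parameters are hypothetical, and the trailing-marker check is a small defensive addition of mine rather than part of the code above; treat it as a sketch.

```autoit
; Hypothetical restatement of the pattern above, built up piece by piece.
Func _WrapAfter($sTxt, $sSepPattern, $iMaxLen = 80)
    ; Greedy run of 1 to $iMaxLen - 1 characters ("." does not match line breaks by default)
    Local $sPattern = ".{1," & ($iMaxLen - 1) & "}"
    ; ...which must be followed by the separator (at most the $iMaxLen-th character) or the end of the string
    $sPattern &= "(?:" & $sSepPattern & "|$)"
    ; \K discards the text matched so far, so the replacement is inserted rather than substituted
    $sPattern &= "\K"
    Local $sOut = StringRegExpReplace($sTxt, $sPattern, "\\n")
    ; The final chunk matches via "$", which leaves one trailing "\n" (two characters) to remove
    If StringRight($sOut, 2) = "\n" Then $sOut = StringTrimRight($sOut, 2)
    Return $sOut
EndFunc   ;==>_WrapAfter

ConsoleWrite(_WrapAfter("aa,bb,cc,dd", ",", 6) & @CRLF) ; prints aa,bb,\ncc,dd
```

Because the quantifier requires at least one character, the pattern cannot match an empty string at the very end, so there is only ever a single trailing marker to trim.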

Edit 2
Ooops I didn't see the regex from mistersquirrle... nearly the same :rolleyes:

 

Edited by mikell

Well done guys
Before someone asks how to get the separate strings back into an array, using RegEx, from a subject that contains several literal '\n' markers: this seems to do the job

[Image: 1206981188_TimRudeschallenge1.png]

Edit1: there is no hidden @CRLF or space at the very end of the subject... @mikell ...and the last group is empty again, even when I changed '$' to '\z' or '\Z' in the pattern, grr...

Edit2: I wish we could use PCRE_NOTEMPTY in AutoIt's PCRE, to get rid of empty groups when needed, but I don't think it's possible (?)

Edit3:

13 hours ago, TimRude said:

For $s1, I want to insert \n periodically throughout the string, breaking the string up into chunks of no more than 80 chars between \n codes and inserting \n only immediately after a comma. I don't want to replace the comma, just insert the \n right after it.

Just re-read the OP's post: the pic above corresponds to a subject where literal '\n' markers are already present and the subject has not yet been split into chunks.
So a comma should be added to the pattern in the pic above to make it safer, even without adding the space, in case a subject like 'Salena Haley,\nCade Batson' turns up. I don't feel like redoing the pic, but you get the idea :)

(.*?)(?:,\\n|$)

Edit4: As written in Edit1, we see in the pic a 4th "zero-width" match returned with this kind of pattern :

(.*?)(?:\\n |$)
(.*?)(?:,\\n|$)

A solution to avoid the 4th "zero-width" group in this example is to use the + quantifier (1 or more) instead of the * quantifier (0 or more)

(.+?)(?:,\\n|$)

It's not the first time (and certainly not the last) that choosing + over * eliminates empty groups in the returned array. Check carefully beforehand that + instead of * won't misbehave on your subject, since + can no longer match an empty chunk.

If I'm not mistaken, there is also the branch reset group (?| ... ), a non-capturing group that resets capture numbering in each alternative. Used with alternation (|) and capturing groups placed inside the (?| ... ), it can avoid blank groups being returned, but enough for today :)
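
The * versus + difference described in Edit4 can be tried directly with StringRegExp in global-match mode. This is a minimal sketch with a made-up three-chunk subject; it prints the element counts instead of assuming them, and note that with these patterns the captured chunks do not include the trailing comma.

```autoit
; Illustrative subject: three chunks joined by a comma plus a literal "\n" marker
Local $sSubject = "Salena Haley,\nCade Batson,\nEdith Walker"

; Flag 3 ($STR_REGEXPARRAYGLOBALMATCH) returns every capture as an array element
Local $aStar = StringRegExp($sSubject, "(.*?)(?:,\\n|$)", 3)
Local $aPlus = StringRegExp($sSubject, "(.+?)(?:,\\n|$)", 3)

; As reported above, the * version is expected to carry one extra, empty element
ConsoleWrite("With *: " & UBound($aStar) & " elements" & @CRLF)
ConsoleWrite("With +: " & UBound($aPlus) & " elements" & @CRLF)

For $i = 0 To UBound($aPlus) - 1
    ConsoleWrite("[" & $i & "] " & $aPlus[$i] & @CRLF)
Next
```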
 

Edited by pixelsearch
Edit'sss

@mistersquirrle Your method inserts @CRLF's into the string. However, I need to insert '\n' characters into the string. The '\n' characters will eventually translate into @CRLF's at some later point, but the strings have to stay single-line at this point because they're part of a file where each line is a separate item (like in an ini file). So points deducted for not following the specs. :lol:

FWIW, I tried replacing the @CRLF in your StringRegExpReplace function with '\\n' (i.e. $sReturn = StringRegExpReplace($sTxt, $sPatternComma, '$1$2' & '\\n')), but that ended up with a couple of trailing '\n' sets at the end of each processed string.
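
A likely explanation, offered as an assumption rather than something confirmed in the thread: the {0,80} quantifier also allows an empty match at the very end of the string, so the end gets a marker twice. One way to tidy that up afterwards is to strip any run of trailing markers; the sample string below is made up for illustration.

```autoit
; Made-up example of a processed string that ends with two stray "\n" markers
Local $sProcessed = "one, two,\n three, four\n\n"

; (?:\\n)+$ matches one or more literal backslash-n pairs anchored at the end
Local $sClean = StringRegExpReplace($sProcessed, "(?:\\n)+$", "")

ConsoleWrite($sClean & @CRLF) ; prints one, two,\n three, four
```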

@mikell Your method, while similar to mistersquirrle's, produced the correct output that exactly matched my specifications and the output of my cruder method. I even tested with some different strings that were carefully crafted so that the space or comma was exactly the 80th, 160th, 240th, etc. character and it worked perfectly. You win the solution. :thumbsup:

@pixelsearch Bonus points for going the extra mile and providing some additional education. Thanks! :geek:

---

As mistersquirrle did, I benchmarked the difference between my crude method and the regex method as presented by mikell. I ran the replacements 100000 times as well and found that on my machine the regex method was consistently about 3 times faster than my original method. Very impressive!

Thanks to all 3 of you for the input! 
