TimRude, posted March 20, 2023

Let's say I have 2 strings ($s1 and $s2) of indeterminate length. Neither string has any @CR, @LF, or @CRLF characters within it.

For $s1, I want to insert \n periodically throughout the string, breaking the string up into chunks of no more than 80 chars between \n codes and inserting \n only immediately after a comma. I don't want to replace the comma, just insert the \n right after it.

For $s2, I want to do the exact same thing, except in this case I want to insert the \n codes immediately after a space instead of a comma; again, not replacing the space but just inserting the \n right after it.

Sample strings:

$s1: Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker

$s2: There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length.

The output desired is this:

$s1: Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley,\n Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter,\n Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker

$s2: There are usually about 200 words in a paragraph, but this can vary widely. \nMost paragraphs focus on a single idea that's expressed with an introductory \nsentence, then followed by two or more supporting sentences about the idea. A \nshort paragraph may not reach even 50 words while long paragraphs can be over \n400 words long, but generally speaking they tend to be approximately 200 words \nin length.

I've accomplished this using the following code:

    Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"
    Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."

    _WrapText($s1, ",")
    _WrapText($s2, " ")
    ConsoleWrite($s1 & @CRLF)
    ConsoleWrite($s2 & @CRLF)
    Exit

    Func _WrapText(ByRef $sTxt, $sChar)
        Local $iMaxLen = 80
        Local $iLen = StringLen($sTxt)
        Local $sWrapped = ""
        Local $iStartPos = 1
        Local $iRemaining = $iLen
        While $iRemaining > $iMaxLen
            Local $iWrapPos = StringInStr(StringMid($sTxt, $iStartPos, $iMaxLen), $sChar, 0, -1)
            $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & "\n"
            $iStartPos += ($iWrapPos)
            $iRemaining = $iLen - $iStartPos + 1
            If $iRemaining <= $iMaxLen Then $sWrapped &= StringRight($sTxt, $iRemaining)
        WEnd
        $sTxt = $sWrapped
    EndFunc

But having seen the remarkable things regex is capable of, I wonder if there is a slick regex method of adding the \n codes to the strings in the manner described?
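
A note on the loop in the post above: if the input is 80 characters or shorter, the While body never runs and $sTxt comes back empty; and if an 80-character window contains no delimiter, StringInStr returns 0, so $iStartPos never advances. A minimal sketch of a guarded variant follows. The name _WrapTextSafe, the optional $iMaxLen parameter, and the choice to hard-break at the limit when no delimiter is found are assumptions for illustration, not part of the original post.

```autoit
; Sketch only: same approach as the posted loop, with two guards added.
; _WrapTextSafe is a hypothetical name, not from the thread.
Func _WrapTextSafe($sTxt, $sChar, $iMaxLen = 80)
    Local $iLen = StringLen($sTxt)
    Local $sWrapped = ""
    Local $iStartPos = 1
    While ($iLen - $iStartPos + 1) > $iMaxLen
        ; Last occurrence of the delimiter inside the next $iMaxLen characters
        Local $iWrapPos = StringInStr(StringMid($sTxt, $iStartPos, $iMaxLen), $sChar, 0, -1)
        ; Assumption: with no delimiter in the window, break at the hard limit
        ; instead of looping forever
        If $iWrapPos = 0 Then $iWrapPos = $iMaxLen
        $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & "\n"
        $iStartPos += $iWrapPos
    WEnd
    ; Append the remainder (at most $iMaxLen characters, possibly the whole input)
    Return $sWrapped & StringMid($sTxt, $iStartPos)
EndFunc

Local $sShort = "Only one short line, nothing to wrap"
ConsoleWrite(_WrapTextSafe($sShort, ",") & @CRLF) ; short input is returned unchanged
```

Unlike the posted version, this sketch returns the result rather than modifying a ByRef argument, which makes it easier to compare against the regex answers.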
mistersquirrle, posted March 20, 2023 (edited)

This is a bit more difficult, and it's something that I've briefly looked into for some things in my job. In the end I didn't get an actual character count break working, just an occurrence thing. It was set to match a pattern 25 times, then replace it with the match and \n. I'm not quite sure that RegEx can reliably do what you're looking for.

That being said, I like trying things and I know more now than I did then, so for commas: https://regex101.com/r/ywFNPy/1

    ([^\v]{0,80})(?:,|$)

And then for spaces/whitespace it's just a simple modification: https://regex101.com/r/HP6gZy/1

    ([^\v]{0,80})(?:\s|$)

Putting it into AutoIt to test:

    Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"
    Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."
    Local $s3 = $s1 & ' ' & $s2

    ConsoleWrite(_WrapText($s1, ",") & @CRLF & @CRLF)
    ConsoleWrite(_WrapText($s2, " ") & @CRLF & @CRLF)
    ConsoleWrite('---------------------------------' & @CRLF & @CRLF)
    ConsoleWrite('RegEx output, Auto:' & @CRLF & _WrapText_RegEx($s3, 0, 80) & @CRLF & @CRLF)
    ConsoleWrite('RegEx output, comma:' & @CRLF & _WrapText_RegEx($s1, 1, 80) & @CRLF & @CRLF)
    ConsoleWrite('RegEx output, whitespace:' & @CRLF & _WrapText_RegEx($s2, 2, 80) & @CRLF & @CRLF)
    Exit

    Func _WrapText($sTxt, $sChar)
        Local $iMaxLen = 80
        Local $iLen = StringLen($sTxt)
        Local $sWrapped = ""
        Local $iStartPos = 1
        Local $iRemaining = $iLen
        While $iRemaining > $iMaxLen
            Local $iWrapPos = StringInStr(StringMid($sTxt, $iStartPos, $iMaxLen), $sChar, 0, -1)
            ;~ $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & "\n"
            $sWrapped &= StringMid($sTxt, $iStartPos, $iWrapPos) & @CRLF
            $iStartPos += ($iWrapPos)
            $iRemaining = $iLen - $iStartPos + 1
            If $iRemaining <= $iMaxLen Then $sWrapped &= StringRight($sTxt, $iRemaining)
        WEnd
        $sTxt = $sWrapped
        Return $sTxt
    EndFunc   ;==>_WrapText

    Func _WrapText_RegEx($sTxt, $iMode = 0, $iLineMaxLength = 80)
        ; iMode, 0 = Auto, 1 = comma, 2 = whitespace
        If $iMode = Default Then $iMode = 0
        If $iMode > 2 Or $iMode < 0 Then $iMode = 0
        If $iLineMaxLength = Default Then $iLineMaxLength = 80

        Local $sPatternComma = '([^\v]{0,' & $iLineMaxLength & '})(,|$)'
        Local $sPatternWhitespace = '([^\v]{0,' & $iLineMaxLength & '})(\s|$)'
        Local $sPatternAuto = '([^\v]{0,' & $iLineMaxLength & '})(,|\s|$)'
        Local $sOutput1, $sOutput2, $sReturn
        Local $aLines1, $aLines2
        Local $iLines1, $iLines2

        Switch $iMode
            Case 0 ; Auto, choose whichever produces the least amount of lines, though it may cause lines over 80 characters (when there's not a comma to break on)
                #cs
                $sOutput1 = StringRegExpReplace($sTxt, $sPatternComma, '$1' & @CRLF)
                ; Get a count of how many lines there are. Alternatively and likely better is StringReplace for both @LF and @CR and add the @extended
                $aLines1 = StringSplit($sOutput1, @CRLF, 2)
                $iLines1 = UBound($aLines1)
                $sOutput2 = StringRegExpReplace($sTxt, $sPatternWhitespace, '$1' & @CRLF)
                $aLines2 = StringSplit($sOutput2, @CRLF, 2)
                $iLines2 = UBound($aLines2)
                If $iLines1 <= $iLines2 Then
                    $sReturn = $sOutput1
                Else
                    $sReturn = $sOutput2
                EndIf
                #ce
                $sReturn = StringRegExpReplace($sTxt, $sPatternAuto, '$1$2' & @CRLF)
            Case 1 ; Comma
                $sReturn = StringRegExpReplace($sTxt, $sPatternComma, '$1$2' & @CRLF)
            Case 2 ; Whitespace
                $sReturn = StringRegExpReplace($sTxt, $sPatternWhitespace, '$1$2' & @CRLF)
        EndSwitch

        Return StringStripWS($sReturn, 1 + 2) ; $STR_STRIPLEADING + $STR_STRIPTRAILING
    EndFunc   ;==>_WrapText_RegEx

Seems to work to me; the only thing is that I'm not keeping the trailing comma or whitespace. It's probably easier to simply add that into the replace with $1,\n, but if you do that, make sure that you're adjusting your line/character length by -1 for the comma (the whitespace one can probably just be dropped).

Edit: I just realized that for my 'Auto', I could just combine looking for either a whitespace or a comma, duh. Updated code. Also updated to keep commas, though I don't have a check to make sure that with the comma it doesn't go over 80 characters to 81. A simple way for $iMode = 1 is to use $iLineMaxLength - 1.

Edit 2: I also compared the speed of both, and the RegEx is faster:

    Runs: 100000
    1) "_WrapText" ('_WrapText($s2, " ")') time elapsed: 5609.60 ms
    2) "_WrapText_RegEx" ('_WrapText_RegEx($s2)') time elapsed: 3541.02 ms
    #Fastest Function: "_WrapText_RegEx"
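
The concern in the post above, that keeping the comma might push a line from 80 to 81 characters, can be checked mechanically instead of by eye. The following is a rough sketch under stated assumptions: _MaxChunkLen is a hypothetical helper name, and it measures the longest piece between separators using StringSplit with the entire-delimiter and no-count flags.

```autoit
; Hypothetical helper (not from the thread): length of the longest chunk
; between occurrences of $sBreak in $sWrapped.
Func _MaxChunkLen($sWrapped, $sBreak)
    ; 1 = $STR_ENTIRESPLIT (treat $sBreak as one delimiter), 2 = $STR_NOCOUNT
    Local $aChunks = StringSplit($sWrapped, $sBreak, 1 + 2)
    Local $iMax = 0
    For $i = 0 To UBound($aChunks) - 1
        If StringLen($aChunks[$i]) > $iMax Then $iMax = StringLen($aChunks[$i])
    Next
    Return $iMax
EndFunc

Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"

; Comma pattern from the post, with the comma kept via $2
Local $sOut = StringRegExpReplace($s1, '([^\v]{0,80})(,|$)', '$1$2' & @CRLF)
ConsoleWrite("Longest line: " & _MaxChunkLen($sOut, @CRLF) & @CRLF)
```

Passing "\n" as $sBreak should let the same helper check output that uses the literal two-character separator the original question asked for.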
mikell, posted March 20, 2023 (edited), marked as Solution

Funny regex challenge. My 2 cents:

    Local $s1 = "Marco Scarborough, Chaim Stephenson, Clark Casey, Phoebe Moser, Salena Haley, Cade Batson, Carl Lindsey, Roy Mckenzie, Lillie Peek, Priya Harter, Finn Stratton, Sharon Saxton, Todd Poole, Ariella Findley, Edith Walker"
    $res1 = StringTrimRight(StringRegExpReplace($s1, ".{1,79}(,|$)\K", "\\n"), 2)
    Msgbox(0,"", $res1)

    Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."
    $res2 = StringTrimRight(StringRegExpReplace($s2, ".{1,79}(\h|$)\K", "\\n"), 2)
    Msgbox(0,"", $res2)

Edit: ...and the func

    Msgbox(0,"", _WrapText($s1, ",", 80) )
    Msgbox(0,"", _WrapText($s2, "\h", 80) )

    Func _WrapText($txt, $char, $n)
        Return StringTrimRight(StringRegExpReplace($txt, '.{1,' & $n-1 & '}(' & $char & '|$)\K', "\\n"), 2)
    EndFunc

Please note that you can write either _WrapText($s2, "\h", 80) or _WrapText($s2, " ", 80), as both work.

Edit 2: Oops, I didn't see the regex from mistersquirrle... nearly the same.
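
Because the function in the post above splices $char straight into the pattern, a delimiter such as "." or "|" would be read as regex syntax rather than as a literal character. Below is a hedged sketch of a variant that quotes the delimiter with PCRE's \Q...\E. The name _WrapTextLiteral is hypothetical, and the sketch assumes the delimiter does not itself contain the sequence \E. The trade-off is that regex classes like "\h" can no longer be passed in.

```autoit
; Sketch, not mikell's code: same pattern shape, but the delimiter is quoted.
Func _WrapTextLiteral($sTxt, $sDelim, $iMax = 80)
    ; .{1,N-1}  : up to N-1 characters, greedy, at least one
    ; \Q...\E   : the delimiter, taken literally (assumes it contains no \E)
    ; |$        : or the end of the string, for the final chunk
    ; \K        : discard what was matched so far, so only "\n" is inserted
    Local $sPattern = '.{1,' & ($iMax - 1) & '}(?:\Q' & $sDelim & '\E|$)\K'
    Local $sOut = StringRegExpReplace($sTxt, $sPattern, '\\n')
    ; The final chunk matches through $, which appends one surplus "\n"
    Return StringTrimRight($sOut, 2)
EndFunc

Local $sDemo = "alpha. beta. gamma. delta."
ConsoleWrite(_WrapTextLiteral($sDemo, ".", 15) & @CRLF)
```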
pixelsearch, posted March 20, 2023 (edited)

Well done guys!

Before someone asks: how do you retrieve the separate strings in an Array, using RegEx, from a subject that includes several literal '\n'? This seems to do the job:

(screenshot not preserved)

Edit1: there is no hidden @CRLF or any space at the very end of the subject... @mikell ...and the last group is empty again, no matter whether I change '$' to '\z' or '\Z' in the pattern, grr...

Edit2: I wish we could use PCRE_NOTEMPTY in AutoIt's PCRE, to get rid of empty groups when needed, but I don't think it's possible (?)

Edit3:

13 hours ago, TimRude said:

    "For $s1, I want to insert \n periodically throughout the string, breaking the string up into chunks of no more than 80 chars between \n codes and inserting \n only immediately after a comma. I don't want to replace the comma, just insert the \n right after it."

I just re-read OP's post: the pic above corresponds to a subject where literal '\n' are found and the subject has not been broken into chunks. So a comma should be added to the pattern in the pic above to make it safer, even without adding the space, in case this kind of subject can be found: 'Salena Haley,\nCade Batson'. I don't feel like modifying the pic, but you get the idea:

    (.*?)(?:,\\n|$)

Edit4: As written in Edit1, we see in the pic a 4th "zero-width" match returned with this kind of pattern:

    (.*?)(?:\\n |$)
    (.*?)(?:,\\n|$)

A solution to avoid the 4th "zero-width" group in this example is to use the + quantifier (1 or more) instead of the * quantifier (0 or more):

    (.+?)(?:,\\n|$)

It's not the 1st time (and certainly not the last) that the choice between * and + eliminates empty groups in the array returned. One should carefully check beforehand that + instead of * won't behave badly when applied to the subject.

If I'm not mistaken, there is also the "non-capturing group with reset" (?| ... ), which avoids returning blank groups when used with alternation (e.g. | ) and capturing groups placed inside the (?| ... ), but enough for today.
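
Since the screenshot did not survive, here is a small sketch of how the last pattern in the post above could be run with StringRegExp in global-match mode. The sample subject is shortened from the thread's $s1 and is an assumption for illustration; the original screenshot may have used a different subject and pattern.

```autoit
; Assumed sample: a subject that already contains literal \n after some commas
Local $sSubject = "Salena Haley,\n Cade Batson, Carl Lindsey,\n Finn Stratton"

; 3 = $STR_REGEXPARRAYGLOBALMATCH: return every capture as an array element.
; In the single-quoted pattern, \\n is the regex for a literal backslash + n.
Local $aParts = StringRegExp($sSubject, '(.+?)(?:,\\n|$)', 3)

If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    For $i = 0 To UBound($aParts) - 1
        ConsoleWrite($i & ": [" & $aParts[$i] & "]" & @CRLF)
    Next
EndIf
```

Note that with this pattern the comma before each \n is consumed by the non-capturing group, so it does not appear in the returned chunks. Swapping (.+?) for (.*?) is a quick way to see the extra empty element discussed in Edit4.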
TimRude (Author), posted March 20, 2023

Too bad I have to wait until I get off work today to examine these!
TimRude (Author), posted March 21, 2023

@mistersquirrle Your method inserts @CRLFs into the string. However, I need to insert '\n' characters into the string. The '\n' characters will eventually translate into @CRLFs at some later point, but the strings have to stay single-line at this point because they're part of a file where each line is a separate item (like in an ini file). So points deducted for not following the specs. FWIW, I tried replacing the @CRLF in your StringRegExpReplace function with '\\n' (i.e. $sReturn = StringRegExpReplace($sTxt, $sPatternComma, '$1$2' & '\\n')), but that ended up with a couple of trailing '\n' sets at the end of each processed string.

@mikell Your method, while similar to mistersquirrle's, produced the correct output that exactly matched my specifications and the output of my cruder method. I even tested with some different strings that were carefully crafted so that the space or comma was exactly the 80th, 160th, 240th, etc. character, and it worked perfectly. You win the solution.

@pixelsearch Bonus points for going the extra mile and providing some additional education. Thanks!

---

As mistersquirrle did, I benchmarked the difference between my crude method and the regex method as presented by mikell. I ran the replacements 100000 times as well and found that on my machine the regex method was consistently about 3 times faster than my original method. Very impressive!

Thanks to all 3 of you for the input!
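
Neither benchmark script was posted in the thread, so the following is only a guess at what such a timing loop could look like, using TimerInit and TimerDiff. The function body is the regex function from mikell's post; $iRuns and the output format are assumptions, and absolute timings will differ from machine to machine.

```autoit
; Timing sketch (assumed harness, not the one used in the thread)
Local $s2 = "There are usually about 200 words in a paragraph, but this can vary widely. Most paragraphs focus on a single idea that's expressed with an introductory sentence, then followed by two or more supporting sentences about the idea. A short paragraph may not reach even 50 words while long paragraphs can be over 400 words long, but generally speaking they tend to be approximately 200 words in length."
Local $iRuns = 100000

Local $hTimer = TimerInit()
For $i = 1 To $iRuns
    _WrapText($s2, " ", 80) ; result discarded, only the elapsed time matters
Next
ConsoleWrite("_WrapText x " & $iRuns & ": " & Round(TimerDiff($hTimer), 2) & " ms" & @CRLF)

; Regex wrap function as posted by mikell
Func _WrapText($txt, $char, $n)
    Return StringTrimRight(StringRegExpReplace($txt, '.{1,' & $n-1 & '}(' & $char & '|$)\K', "\\n"), 2)
EndFunc
```

To compare against the loop-based version, the same For loop can be repeated around that function with its own TimerInit call.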