Guy_ Posted August 27, 2015 (edited)

1) I was testing some code that adds a full stop after a sentence when it thinks one is appropriate. My simplified example has just a small range of characters I want to exclude. However, although my pattern does what I want in the tester, my AutoIt version doesn't respect my exclusions and adds full stops after the other characters too.

    #include <Array.au3>

    $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'
    MsgBox(0,"original text", $text)
    $_aFull_Stop_missing = StringRegExp( $text, "(.[^.…!\""]\s{2,}.{3})", 3 )
    If Not @error Then
        _ArrayDisplay($_aFull_Stop_missing)
        For $i = 0 To UBound($_aFull_Stop_missing)-1
            $found = $_aFull_Stop_missing[$i]
            $fix = StringLeft( $found, 1) & "." & StringRight( $found, StringLen($found) -1 )
            $text = StringReplace( $text, $found, $fix, 1)
        Next
    EndIf
    MsgBox(0,"processed text", $text)

2) I'm pretty sure I've seen (maybe older) info on how StringRegExpReplace can uppercase/lowercase a result, but I can't get any of what I saw to work. Neither of these works here:

    $text = "Http://site.com, Www.domain.org"
    $result = StringRegExpReplace($text, "(?i)(https?|www)", StringLower("$1") )
    ToolTip($result)
    Sleep(2500)
    $result = StringRegExpReplace($text, "(?i)(https?|www)", "\L$1" )
    ToolTip($result)
    Sleep(2500)

3) I've had the impression the https://regex101.com tester can give different results depending on the browser. If so, is there a preferred browser?

4) General question: I see both "If Not @error" and "If @error = 0" being used. "If Not @error" reads best to me. Is there a reason not to always use that variation?

5) If I am mostly processing text from the web or PDF, do I have to use a UTF setting in my regex everywhere/anywhere? So far I have not had that impression at all, but the first example made me start wondering whether character encoding could be involved (although adding UTF instructions in the regex didn't help).

TIA!!

Edited August 27, 2015 by Guy_
jguinch Posted August 27, 2015

1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

2)

    $text = "Http://site.com, Www.domain.org"
    $result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
    ConsoleWrite($result)

Network configuration UDF, _DirGetSizeByExtension, _UninstallList, Firefox Configuration, Array multi-dimensions, Printer Management UDF
iamtheky Posted August 27, 2015

I'm not good at regexp, but I like pretending.

    #include <Array.au3>

    $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'
    MsgBox(0,"original text", $text)
    $_aFull_Stop_missing = StringRegExp($text, '([^\!\\"\s]\w\s{2,})', 3 )
    If Not @error Then
        _ArrayDisplay($_aFull_Stop_missing)
        For $i = 0 To UBound($_aFull_Stop_missing)-1
            $found = $_aFull_Stop_missing[$i]
            $fix = StringStripWS($found, 8) & "." & @CRLF & @CRLF
            $text = StringReplace( $text, $found, $fix, 1)
        Next
    EndIf
    MsgBox(0,"processed text", $text)

And for URLs, just StringLower the whole thing.
Guy_ (Author) Posted August 27, 2015 (edited)

> I'm not good at regexp, but I like pretending.

Wow, thanks! You fooled me good! I will study that till I "get it."

But it seems basically a cool workaround, so I'm still wondering why the tester does what I want and my AutoIt regex selects more than that...

Edited August 27, 2015 by Guy_
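One possible explanation for the difference, not confirmed anywhere in the thread: @CRLF is two characters, and a negated class such as [^.…!"] happily matches the CR. If the tester input used bare LF line breaks (an assumption), the same pattern would find nothing after "sentence!" there, while in AutoIt the "!" can be taken by the leading dot and the CR by the negated class. The sketch below uses a shortened class without the ellipsis, to stay independent of the script file's encoding, and dumps the match as bytes so the CR (0D) is visible right after the punctuation mark.

```autoit
; Assumption: the tester text used LF-only line breaks; here we use @CRLF as in the post.
Local $sText = 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'

; @CRLF counts as two whitespace characters for \s
ConsoleWrite('Length of @CRLF: ' & StringLen(@CRLF) & @LF)

; Same shape as the original pattern, without the trailing .{3}
Local $aMatch = StringRegExp($sText, '(.[^.!"]\s{2,})', 3)
If Not @error Then
    ; Byte dump of the match: the character after the "!" (21) is the CR (0D)
    ConsoleWrite(StringToBinary($aMatch[0]) & @LF)
EndIf
```

If that is indeed the cause, adding \r and \n to the excluded characters, or using a lookbehind as in the replies that follow, keeps the punctuation from slipping through.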
jchd Posted August 27, 2015 (edited)

Is that what you want?

    $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'
    MsgBox(0,"original text", $text)
    $fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
    MsgBox(0,"processed text", $fixed)

2/ AutoIt != Perl. To change the case of results, you need Execute:

    $text = "Http://site.com, Www.domain.org"
    $result = Execute('"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"')
    ConsoleWrite($result & @LF)

3/ I'm not aware of such a dependency.

4/ If Not @error is always fine.

5/ PCRE as compiled into AutoIt is UTF-aware (fortunately, since AutoIt strings are UTF-16!). What you may need is (*UCP), if you expect to benefit from the wider range of \w, \d, \b, ...

Edited August 27, 2015 by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must-have in your bookmarks. Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here. RegExp tutorial: enough to get started. PCRE v8.33 regexp documentation, latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try. SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager. An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work. SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well). A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages! SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt).
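To make point 5 concrete, here is a minimal sketch of what (*UCP) changes. It tests a single non-ASCII letter against \w with and without the option. The letter is built with ChrW so the example does not depend on how the script file is saved, and the variable names are only illustrative.

```autoit
; Cyrillic small letter zhe (U+0436), built with ChrW to avoid file-encoding issues
Local $sLetter = ChrW(0x0436)

; Without (*UCP), \w is expected to cover only the ASCII word characters
ConsoleWrite('plain \w:  ' & StringRegExp($sLetter, '^\w$') & @LF)

; With (*UCP) at the head of the pattern, \w uses Unicode properties
ConsoleWrite('(*UCP) \w: ' & StringRegExp($sLetter, '(*UCP)^\w$') & @LF)
```

With the default flag, StringRegExp returns 1 for a match and 0 otherwise, so the two lines should differ if the option behaves as described in the PCRE documentation.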
Guy_ (Author) Posted August 27, 2015 (edited)

> 1) I don't understand what the expected result of the regex is. What do you want to do? Extract sentences? Can you give us an example of the result?

The code from both of them demonstrates it well.

My own thinking was to select a hopefully unique part of the text, including the last character of a sentence, *if* that sentence ending excludes certain characters that may indicate it does *not* need a full stop (and there have to be at least 2 returns).

I need to understand why AutoIt behaves differently from the tester here...

>     $text = "Http://site.com, Www.domain.org"
>     $result = Execute('"' & StringRegExpReplace(StringReplace($text, '"', '""'), "(?i)(https?|www)", '" & StringLower(''$1'') & "' ) & '"' )
>     ConsoleWrite($result)

Thanks, that does work but is very confusing to me.

Is this demonstrating a needed workaround, and could we do this more easily in earlier times? (At least, I Googled simpler examples like the ones I provided, which I guess used to work once...)

Or is this demonstrating "the ultimate failsafe pro way", and can it be done more simply? (OK, I just saw jchd's answer too.)

I will take note of the principle of course, but if this is what I actually needed, the code seems over the top and I'd better use the following (I realize it's not exactly the same):

    $text = StringReplace($text, "Http", "http")
    $text = StringReplace($text, "Www.", "www.")

Edited August 27, 2015 by Guy_
jchd Posted August 27, 2015

Guy, perhaps the first ConsoleWrite below will help you see what's needed to achieve the result.

    $text = "Http://site.com, Www.domain.org"
    $result = '"' & StringRegExpReplace($text, '(?i)(https?|www)', '" & StringLower("$1") & "') & '"'
    ConsoleWrite($result & @LF)
    $result = Execute($result)
    ConsoleWrite($result & @LF)
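For comparison, here is a sketch of an alternative that avoids Execute altogether. This is not what the replies above recommend; it simply collects the matches first and then lowercases each one with a case-sensitive StringReplace. One trade-off to be aware of: it replaces every occurrence of a matched spelling anywhere in the text, not only at the positions the regex matched.

```autoit
Local $sText = "Http://site.com, Www.domain.org"

; Flag 3 returns an array with every capture, here each scheme/prefix as it was spelled
Local $aHits = StringRegExp($sText, "(?i)(https?|www)", 3)
If Not @error Then
    For $i = 0 To UBound($aHits) - 1
        ; occurrence = 0 replaces all occurrences, casesense = 1 makes the search case-sensitive
        $sText = StringReplace($sText, $aHits[$i], StringLower($aHits[$i]), 0, 1)
    Next
EndIf
ConsoleWrite($sText & @LF)
```

Because no expression string is built and evaluated, quotes or ampersands in the input need no special treatment here.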
jguinch Posted August 27, 2015

For the first question, using jchd's approach, here is a way to store each end-of-sentence character in an array:

    ; Possible end of sentences
    Local $aEndChars[] = ['.', '!', '?', '...', '…', '"']

    $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'
    MsgBox(0,"original text", $text)

    $sExpr = "(?<!"
    For $i = 0 To UBound($aEndChars) - 1
        $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
    Next
    $sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

    $fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
    MsgBox(0,"processed text", $fixed)
Guy_ (Author) Posted August 27, 2015

Thanks a lot everyone! I'm busy studying all of this further now.
jchd Posted August 27, 2015

Fine, come back if you still have questions.
Guy_ (Author) Posted August 27, 2015

> Is that what you want?
>
>     $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'Last sentence.'
>     MsgBox(0,"original text", $text)
>     $fixed = StringRegExpReplace($text, '(?<![.…!"])(\r\n\r\n)', '.$1')
>     MsgBox(0,"processed text", $fixed)

That was a breakthrough in helping me understand lookbehind. Beautiful.

However, neither in this one nor in jguinch's variation can I get it to work for a comma in the lookbehind. The usual \ escaping doesn't seem to work and I can't find anything about it. (Maybe it doesn't work with some other characters either; I haven't checked all of them yet. But the comma stood out.)
jguinch Posted August 27, 2015

Well, with my code you would have to add ',' to the $aEndChars array:

    Local $aEndChars[] = ['.', '!', '?', '...', '…', '"', ',']

    $text = 'Possible end of a "sentence."' & @CRLF & @CRLF & 'Possible end of a sentence…' & @CRLF & @CRLF & 'Possible end of a sentence' & @CRLF & @CRLF & 'Possible end of a sentence!' & @CRLF & @CRLF & 'With comma sentence,' & @CRLF & @CRLF & "Last sentence."
    MsgBox(0,"original text", $text)

    $sExpr = "(?<!"
    For $i = 0 To UBound($aEndChars) - 1
        $sExpr &= "\Q" & $aEndChars[$i] & "\E|"
    Next
    $sExpr = StringRegExpReplace($sExpr, "\|$", "") & ")"

    $fixed = StringRegExpReplace($text, $sExpr & '(\R{2})', '.$1')
    MsgBox(0,"processed text", $fixed)

Doesn't that work?

With jchd's code, [.…!",] should work.
jchd Posted August 27, 2015

I don't see [.…!",] failing, nor why on Earth it would fail either.
mikell Posted August 27, 2015

Except if there is a remaining space after the comma.
jchd Posted August 27, 2015

Of course, but that is independent of the "stop-char" being a dot, a comma, an ellipsis, whatever. It's trivial to get rid of extra whitespace between the stop-char (or its absence) and the two line terminations.
Guy_ (Author) Posted August 28, 2015 (edited)

It does indeed work in jchd's example, but for some reason I'm still not sure of, I managed to make it fail in a very simple example of my own. For a moment it seemed '\r\n\r\n' was asking for 4 returns if my source was the clipboard, and I got it working by making that '\r\r'. Then I got it working normally after all, but only after hours of fiddling to discover which characters I had to double up or escape.

A space after the comma was never the issue, and I made sure of that.

Lastly, I remain stumped by how to make it work when there are more than 2 returns. Both of these can lead to bad results:

    $fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2,})', '.$1')
    $fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*\"\“\”\„''\‘\’\>])(\R{2}\R*)', '.$1')

Edited August 28, 2015 by Guy_
jguinch Posted August 28, 2015

[.?…!;,:+=&*"“”„''‘’>] should be sufficient.
jchd Posted August 28, 2015

Guy,

PCRE \R by default translates into this atomic group:

    (?>\r\n|\n|\x0b|\f|\r|\x85|\x{2028}|\x{2029})

Hence, when \R meets \r\n, it matches the pair first, as a single unit. But it was decided to compile AutoIt's PCRE with the option PCRE_BSR_ANYCRLF, which changes \R into the equivalent of:

    (?>\r\n|\n|\r)

The default behavior (matching 0x0B, \f, 0x85 and the two other codepoints) can be restored in a pattern by placing (*BSR_UNICODE) at its head.

But anyway, \R{2,} will definitely match all combinations of two or more line terminations using CR and/or LF. Note that "abc" & @CR & @LF counts as only one line termination (this is @CRLF).

    $text = "a" & @CRLF & @CRLF & @CRLF & "b" & @CR & @CR & @CR & "c" & @LF & @LF & @LF & "d" & @CRLF & @LF & @CRLF & "e" & @LF & @CRLF & @CRLF & "f" & @LF & @CR & "g" & @CR & @CRLF
    $result = StringRegExp($text, '(.*)(\R{2,})', 3)
    For $i = 0 To UBound($result) - 1 Step 2
        ConsoleWrite($result[$i] & ' -> ' & Binary($result[$i + 1]) & @LF)
    Next
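A possible reason for the "bad results" with three or more returns, offered as a sketch rather than a confirmed diagnosis: the lookbehind only rejects punctuation, so after failing right behind a "!" the engine can retry one line break further on, where the preceding character is an LF. That LF is not in the exclusion list, so the remaining returns match and a full stop lands between the line breaks. Adding \r and \n to the excluded characters closes that gap. The exclusion list below is shortened for illustration.

```autoit
; Three returns after each line; only the second line lacks a closing mark
Local $sText = 'Ends with a mark!' & @CRLF & @CRLF & @CRLF & _
        'Needs a full stop' & @CRLF & @CRLF & @CRLF & _
        'Last sentence.'

; \r and \n in the lookbehind: a match may only start right after a visible character
Local $sFixed = StringRegExpReplace($sText, '(?<![.?!\r\n])(\R{2,})', '.$1')
ConsoleWrite($sFixed & @LF)
```

With this change the full stop should only be added after "Needs a full stop", and the run of returns after "mark!" should stay untouched.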
Guy_ (Author) Posted August 28, 2015

> [.?…!;,:+=&*"“”„''‘’>] should be sufficient

Indeed, it now is... *sigh* Thank you. I guess I was trying double quotes around the whole thing at the time...

-

I wanted to apply my newly learned knowledge to add a full stop in similar circumstances, but only when the line immediately after does not start with a capital. This again does not work after something like 30 variations, and it adds full stops when there is more than one return:

    $fixed = StringRegExpReplace($text, '(?<![.?…!;,:+=&*"“”„''‘’>])(\R)(?=[A-Z])', '.$1')

I'm ready for a good cry... Giving up? Maybe tomorrow.
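This question goes unanswered in the thread, so the following is only a hedged sketch. It keeps the (?=[A-Z]) lookahead exactly as posted, even though the description says "does not start with a capital", and changes one thing: CR and LF are added to the lookbehind. Without them, the position between two consecutive returns passes the lookbehind (the previous character is an LF), the second return is followed by a capital, and a full stop is inserted in the middle of the blank line. The sample text and shortened exclusion list are made up for illustration.

```autoit
; One single return (should get a full stop) and one double return (should be left alone)
Local $sText = 'single return, capital next' & @CRLF & 'Next line' & @CRLF & @CRLF & 'New paragraph'

; \r and \n in the lookbehind stop the match from starting between two returns
Local $sFixed = StringRegExpReplace($sText, '(?<![.?!,\r\n])(\R)(?=[A-Z])', '.$1')
ConsoleWrite($sFixed & @LF)
```

If the intent really is the opposite condition, the lookahead would need to be negated; that variant is not shown here because the post does not make the intended behaviour unambiguous.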