Opened 2 years ago
Closed 21 months ago
#3945 closed Bug (Fixed)
StringRegExp help about \s misses VT
Reported by: | jchd18 | Owned by: | Jpm |
---|---|---|---|
Milestone: | 3.3.17.0 | Component: | Documentation |
Version: | 3.3.14.0 | Severity: | None |
Keywords: | Cc: |
Description
As of PCRE v8.44 (the change itself dates from PCRE 8.34), the character class \s also matches VT, following the same change in Perl.
So \s is now equivalent to [[:space:]].
The help should be updated to reflect the actual behavior.
Attachments (0)
Change History (21)
comment:1 Changed 23 months ago by mLipok
comment:2 Changed 23 months ago by pixelsearch
Some info on how the handling of Chr(11), i.e. Vertical Tab (VT), has evolved.
Excerpts from the AutoIt history and the PCRE changelog, listed by date, most recent first:
https://www.pcre.org/original/changelog.txt
A) AutoIt 3.3.16.1 (19th September, 2022) (Release)
B) AutoIt 3.3.16.0 (6th March, 2022) (Release)
Changed: PCRE regular expression engine updated to 8.44
C) PCRE Version 8.36 26-September-2014
- When a pattern starting with \s was studied, VT was not included in the list of possible starting characters; this should have been part of the 8.34/18 patch.
D) PCRE Version 8.34 15-December-2013
- The character VT has been added to the default ("C" locale) set of characters that match \s and are generally treated as white space, following this same change in Perl 5.18. There is now no difference between "Perl space" and "POSIX space". Whether VT is treated as white space in other locales depends on the locale.
E) AutoIt 3.3.8.1 (29th January, 2012) (Release)
F) PCRE Version 8.11 10-Dec-2010
- If \s appeared in a character class, it removed the VT character from the class, even if it had been included by some previous item, for example in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part of \s, but is part of the POSIX "space" class.)
G) AutoIt 3.3.6.1 (16th April, 2010) (Release)
Test 1 : the bug in F)
======
Local $sSubject = Chr(11) ; Vertical Tab VT
Local $sPattern = '[\x00-\xff\s]'
No match using AutoIt 2010 (because of the PCRE bug)
Match Chr(11) using AutoIt 2012+ (PCRE fixed the bug)

Test 2 : \s
======
Local $sSubject = a string of 256 characters, from Chr(0) to Chr(255)
Local $sPattern = '\s' ; now same as '[[:space:]]'
AutoIt 2022 => matches Chr(9) Chr(10) Chr(11) Chr(12) Chr(13) Chr(32) <= 6 whitespace characters

Test 3 : \S (mLipok's question)
======
Local $sSubject = a string of 256 characters, from Chr(0) to Chr(255)
Local $sPattern = '\S' ; now same as '[[:^space:]]'
AutoIt 2022 => matches 250 characters (i.e. 256 - the 6 whitespace characters above)
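For anyone who wants to rerun the three tests, here is a minimal sketch of them as a script. It is not the script originally used for the results above. As a simplification of this sketch, the subject starts at Chr(1) rather than Chr(0), so the \S count will differ by one from Test 3.

```autoit
#include <StringConstants.au3>

; Test 1: VT against a class that also contains \s
ConsoleWrite("Test 1 matched: " & StringRegExp(Chr(11), '[\x00-\xff\s]', $STR_REGEXPMATCH) & @CRLF)

; Subject for Tests 2 and 3. Chr(0) is left out (a choice of this sketch,
; the tests described above start at Chr(0)).
Local $sSubject = ""
For $i = 1 To 255
    $sSubject &= Chr($i)
Next

; Tests 2 and 3: collect every single-character match of \s and of \S
Local $aSpace = StringRegExp($sSubject, '\s', $STR_REGEXPARRAYGLOBALMATCH)
Local $aNonSpace = StringRegExp($sSubject, '\S', $STR_REGEXPARRAYGLOBALMATCH)
ConsoleWrite("\s matches: " & UBound($aSpace) & @CRLF)
ConsoleWrite("\S matches: " & UBound($aNonSpace) & @CRLF)

; List the ANSI code of each character matched by \s
For $i = 0 To UBound($aSpace) - 1
    ConsoleWrite("Chr(" & Asc($aSpace[$i]) & ") ")
Next
ConsoleWrite(@CRLF)
```

If Chr(11) shows up in the last line of output, the installed AutoIt version treats VT as part of \s.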
From the PCRE ChangeLog: "Note that the PCRE 8.xx series (PCRE1) is now at end of life. All development is happening in the PCRE2 10.xx series. Version 8.45 15-June-2021 ..."
So PCRE 8.45 could be the last version to integrate into the next AutoIt release, because the PCRE2 10.xx series may require a lot of rework to integrate?
Good luck Jon
comment:3 Changed 23 months ago by pixelsearch
If the help file is going to be reworked (StringRegExp topic), I have a small issue with \xA0, i.e. Chr(160), the non-breaking space.
Here is a script where the subject is a string of 255 characters, from Chr(1) to Chr(255).
You can comment out / uncomment any pattern line to display in the SciTE console the ASCII codes that matched.
The goal is to make the help file even more accurate concerning \s, [[:blank:]], etc.
This rework is a real nightmare. I wonder how many days or weeks jchd spent preparing this help file topic (especially as he also listed all the character codes > 255). Hats off!
#include <Array.au3>

Local $sSubject = ""
For $i = 1 To 255
    $sSubject &= Chr($i)
Next

Local $sPattern = '\h' ; e.g. '[\x09\x20\xA0]' e.g. chr(9) chr(32) chr(160) : ok
; Local $sPattern = '(*UCP)\h' ; same result (tested)

; Actual help-file : the following line found in help file isn't accurate :
; \s is equivalent to "[\h\x0A\x0C\x0D]" (excluding \xA0 from \h when UCP is enabled)
; Should be :
; Local $sPattern = '\s' ; e.g. '[\x09-\x0D\x20]' e.g. chr(9) chr(10) chr(11) chr(12) chr(13) chr(32)
; Local $sPattern = '(*UCP)\s' ; (including \xA0 when UCP is enabled)
; Local $sPattern = '[[:space:]]' ; same as '\s'
; Local $sPattern = '(*UCP)[[:space:]]' ; same as '(*UCP)\s'

; Actual help-file : 2 following lines found in help file could be more accurate :
; [:blank:] Space or Tab (@TAB) (same as \h or [\x09\x20]) : no it is not same as \h
; When UCP is enabled: Unicode horizontal whitespaces (same as \h).
; Should be :
; Local $sPattern = '[[:blank:]]' ; e.g. '[\x09\x20]' e.g. chr(9) chr(32)
; Local $sPattern = '(*UCP)[[:blank:]]' ; (including \xA0 when UCP is enabled)

Local $aArray = StringRegExp($sSubject, $sPattern, 3)
If Not @error Then
    _CW($aArray)
    _ArrayDisplay($aArray)
Else
    MsgBox(0, 'StringRegExp', 'error = ' & @error & (@error = 1 ? ' (no matches)' : ' (bad pattern)'))
EndIf

;======================================
Func _CW(ByRef $aArray, $sTestABC = "") ; _CW($aArray) can be placed just before _ArrayDisplay($aArray) if useful
    Local $iAscW
    For $i = 0 To Ubound($aArray) - 1
        ConsoleWrite(($sTestABC ? "Test " & $sTestABC & " - " : "") & "Row " & $i & " : ")
        For $j = 1 To StringLen($aArray[$i])
            $iAscW = AscW(StringMid($aArray[$i], $j, 1))
            ; ConsoleWrite("Chr(" & Asc(StringMid($aArray[$i], $j, 1)) & ")")
            ConsoleWrite(($iAscW < 256 ? "Chr" : "ChrW") & "(" & $iAscW & ") ")
        Next
        ConsoleWrite(@crlf)
    Next
    ConsoleWrite(@crlf)
EndFunc
comment:4 Changed 22 months ago by Jpm
- Milestone set to 3.3.17.0
- Owner set to Jpm
- Resolution set to Fixed
- Status changed from new to closed
Fixed by revision [12974] in version: 3.3.17.0
comment:5 Changed 22 months ago by jchd18
Good catch @pixelsearch!
There is also an issue with Unicode codepoint 0x85 (Unicode "Next line", acronym NEL), which is matched by the following patterns:
(*UCP)\s
(*UCP)[[:space:]]
Code to demonstrate which is which in both directions:
Local $sSubject, $aPattern = ['\h', '(*UCP)\h', '\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']

; Character to pattern
For $i = 0 To 65535
    $sSubject = ChrW($i)
    For $sPattern In $aPattern
        If StringRegExp($sSubject, $sPattern) Then
            CW("ChrW(0x" & Hex(AscW($sSubject), 4) & ") matched by pattern " & $sPattern)
        EndIf
    Next
Next
CW()

; Pattern to character
For $sPattern In $aPattern
    For $i = 0 To 65535
        $sSubject = ChrW($i)
        If StringRegExp($sSubject, $sPattern) Then
            CW("Pattern " & $sPattern & " matches ChrW(0x" & Hex(AscW($sSubject), 4) & ")")
        EndIf
    Next
Next
Here, CW() is just a ConsoleWrite with a @LF appended.
Note that there are many other Unicode codepoints matching the various "spacing" patterns in UCP mode!
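To run the script above without jchd's own CW.au3, a stand-in along these lines should be enough. It is a guess based only on the description "a ConsoleWrite with a @LF appended"; the real CW() may well do more.

```autoit
; Hypothetical stand-in for jchd's CW(): writes the text to the console, followed by @LF.
; Called without arguments, it just writes an empty line.
Func CW($sText = "")
    ConsoleWrite($sText & @LF)
EndFunc   ;==>CW
```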
comment:6 Changed 22 months ago by jchd18
- Resolution Fixed deleted
- Status changed from closed to reopened
comment:7 Changed 22 months ago by TicketCleanup
- Milestone 3.3.17.0 deleted
Automatic ticket cleanup.
comment:8 Changed 22 months ago by Jpm
Hi @jchd, @pixelsearch
I'm not sure what should be fixed.
Can you post the change needed?
Thanks
comment:9 Changed 22 months ago by jchd18
I'm fairly busy these days but I'll try to come up with a correct description of what subset of Unicode the various patterns cover.
For now here's a table showing which is which:
Codepoint  CharacterName              \h  (*UCP)\h  \s  (*UCP)\s  [[:space:]]  (*UCP)[[:space:]]
0x0009     HT                         ✔   ✔         ✔   ✔         ✔            ✔
0x000A     LF                                       ✔   ✔         ✔            ✔
0x000B     VT                                       ✔   ✔         ✔            ✔
0x000C     FF                                       ✔   ✔         ✔            ✔
0x000D     CR                                       ✔   ✔         ✔            ✔
0x0020     SPACE                      ✔   ✔         ✔   ✔         ✔            ✔
0x0085     NEL                                          ✔                      ✔
0x00A0     NO-BREAK SPACE             ✔   ✔             ✔                      ✔
0x1680     OGHAM SPACE MARK           ✔   ✔             ✔                      ✔
0x180E     MONGOLIAN VOWEL SEPARATOR  ✔   ✔             ✔                      ✔
0x2000     EN QUAD                    ✔   ✔             ✔                      ✔
0x2001     EM QUAD                    ✔   ✔             ✔                      ✔
0x2002     EN SPACE                   ✔   ✔             ✔                      ✔
0x2003     EM SPACE                   ✔   ✔             ✔                      ✔
0x2004     THREE-PER-EM SPACE         ✔   ✔             ✔                      ✔
0x2005     FOUR-PER-EM SPACE          ✔   ✔             ✔                      ✔
0x2006     SIX-PER-EM SPACE           ✔   ✔             ✔                      ✔
0x2007     FIGURE SPACE               ✔   ✔             ✔                      ✔
0x2008     PUNCTUATION SPACE          ✔   ✔             ✔                      ✔
0x2009     THIN SPACE                 ✔   ✔             ✔                      ✔
0x200A     HAIR SPACE                 ✔   ✔             ✔                      ✔
0x2028     LINE SEPARATOR                               ✔                      ✔
0x2029     PARAGRAPH SEPARATOR                          ✔                      ✔
0x202F     NARROW NO-BREAK SPACE      ✔   ✔             ✔                      ✔
0x205F     MEDIUM MATHEMATICAL SPACE  ✔   ✔             ✔                      ✔
0x3000     IDEOGRAPHIC SPACE          ✔   ✔             ✔                      ✔
comment:10 Changed 22 months ago by Jpm
So you want this table integrated into the help?
comment:11 Changed 22 months ago by jchd18
I didn't come up with a succinct textual description and in the end I think it's clearer to insert a simplified version of the table. Since there are in fact 3 pairs of different patterns producing the same result, grouping them by pair leads to this shorter table:
Codepoint  CharacterName              \h        \s           (*UCP)\s           ⎫ equivalent
                                      (*UCP)\h  [[:space:]]  (*UCP)[[:space:]]  ⎭ patterns
0x0009     HT                         *         *            *
0x000A     LF                                   *            *
0x000B     VT                                   *            *
0x000C     FF                                   *            *
0x000D     CR                                   *            *
0x0020     SPACE                      *         *            *
0x0085     NEL                                               *
0x00A0     NO-BREAK SPACE             *                      *
0x1680     OGHAM SPACE MARK           *                      *
0x180E     MONGOLIAN VOWEL SEPARATOR  *                      *
0x2000     EN QUAD                    *                      *
0x2001     EM QUAD                    *                      *
0x2002     EN SPACE                   *                      *
0x2003     EM SPACE                   *                      *
0x2004     THREE-PER-EM SPACE         *                      *
0x2005     FOUR-PER-EM SPACE          *                      *
0x2006     SIX-PER-EM SPACE           *                      *
0x2007     FIGURE SPACE               *                      *
0x2008     PUNCTUATION SPACE          *                      *
0x2009     THIN SPACE                 *                      *
0x200A     HAIR SPACE                 *                      *
0x2028     LINE SEPARATOR                                    *
0x2029     PARAGRAPH SEPARATOR                               *
0x202F     NARROW NO-BREAK SPACE      *                      *
0x205F     MEDIUM MATHEMATICAL SPACE  *                      *
0x3000     IDEOGRAPHIC SPACE          *                      *
If someone finds a better way to describe the same thing, feel free to proceed.
comment:12 Changed 22 months ago by Jpm
Unless somebody disagrees, I will integrate the last proposal.
comment:13 Changed 22 months ago by Jpm
I rechecked with the following script and found that:
- \s equals [[:space:]] and both are *UCP sensitive
- \h and \v are independent of *UCP
- [[:blank:]] is *UCP sensitive, as opposed to \h
Do you agree?
Legend for the script output:
x only without *UCP
X only with *UCP
xX with or without *UCP
#include <StringConstants.au3>
#include <Debug.au3>
#include <AutoItConstants.au3>

Local $aPatterns[] = ["\h", "\v", "\s", "[[:space:]]", "[[:blank:]]"]
Local $sUCP = "(*UCP)"
Local $aChrW[][7] = [ _
        ["Unicode", "CharacterName", "\h", "\v", "\s", "[[:space:]]", "[[:blank:]]"], _
        [0x0009, "HT"], _
        [0x000A, "LF"], _
        [0x000B, "VT"], _
        [0x000C, "FF"], _
        [0x000D, "CR"], _
        [0x0020, "SPACE"], _
        [0x0085, "NEL"], _
        [0x00A0, "NO-BREAK SPACE"], _
        [0x1680, "OGHAM SPACE MARK"], _
        [0x180E, "MONGOLIAN VOWEL SEPARATOR"], _
        [0x2000, "EN QUAD"], _
        [0x2001, "EM QUAD"], _
        [0x2002, "EN SPACE"], _
        [0x2003, "EM SPACE"], _
        [0x2004, "THREE-PER-EM SPACE"], _
        [0x2005, "FOUR-PER-EM SPACE"], _
        [0x2006, "SIX-PER-EM SPACE"], _
        [0x2007, "FIGURE SPACE"], _
        [0x2008, "PUNCTUATION SPACE"], _
        [0x2009, "THIN SPACE"], _
        [0x200A, "HAIR SPACE"], _
        [0x2028, "LINE SEPARATOR"], _
        [0x2029, "PARAGRAPH SEPARATOR"], _
        [0x202F, "NARROW NO-BREAK SPACE"], _
        [0x205F, "MEDIUM MATHEMATICAL SPACE"], _
        [0x3000, "IDEOGRAPHIC SPACE"] _
        ]

Local $sStr, $bResult ;= StringRegExp("- -", $sUCP & $aPatterns[$j], $STR_REGEXPMATCH)
For $k = 0 To 1 ; test *UCP on second loop
    For $i = 1 To UBound($aChrW) - 1
        For $j = 0 To UBound($aPatterns) - 1
            $sStr = "-" & ChrW($aChrW[$i][0]) & "-"
            If $k Then
                $bResult = StringRegExp($sStr, $sUCP & $aPatterns[$j], $STR_REGEXPMATCH)
                If $bResult Then
                    If ($aChrW[$i][$j + 2] <> "x") Then
                        $aChrW[$i][$j + 2] = "X"
                    Else
                        $aChrW[$i][$j + 2] = "xX"
                    EndIf
                Elseif $k And ($aChrW[$i][$j + 2] = "x") Then
                    $aChrW[$i][$j + 2] = "?X"
                EndIf
            Else
                $bResult = StringRegExp($sStr, $aPatterns[$j], $STR_REGEXPMATCH)
                If $bResult Then $aChrW[$i][$j + 2] = "x"
            EndIf
        Next
    Next
Next

For $i = 1 To UBound($aChrW) - 1
    $aChrW[$i][0] = "0x" & Hex($aChrW[$i][0], 4)
Next

Local $sHeader = ""
For $i = 0 To UBound($aChrW, $UBOUND_COLUMNS) - 1
    $sHeader &= $aChrW[0][$i] & "|"
Next
$sHeader = StringReplace($sHeader, "CharacterName","CharacterName ")

_DebugArrayDisplay($aChrW, @ScriptName, "1:", 0, Default, $sHeader)
comment:14 Changed 22 months ago by pixelsearch
Hi jpm & jchd
Not sure I'll be of much help with the updates, as I only quickly checked chars from 0 to 255. For example, with jchd's script (which uses his personal "CW.au3" and "dump.au3" found in the Forum), it's easy to show the following:
#include "CW.au3"

; Local $sSubject, $aPattern = ['\h', '(*UCP)\h', '\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']
Local $sSubject, $aPattern = ['\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']
Local $bFound

; Character to pattern
; For $i = 0 To 65535
For $i = 0 To 255
    $bFound = False
    $sSubject = ChrW($i)
    For $sPattern In $aPattern
        If StringRegExp($sSubject, $sPattern) Then
            $bFound = True
            CW("ChrW(0x" & Hex(AscW($sSubject), 4) & ") matched by pattern " & $sPattern)
        EndIf
    Next
    If $bFound Then ConsoleWrite(@crlf)
Next
Console display
ChrW(0x0009) matched by pattern \s
ChrW(0x0009) matched by pattern (*UCP)\s
ChrW(0x0009) matched by pattern [[:space:]]
ChrW(0x0009) matched by pattern (*UCP)[[:space:]]

ChrW(0x000A) matched by pattern \s
ChrW(0x000A) matched by pattern (*UCP)\s
ChrW(0x000A) matched by pattern [[:space:]]
ChrW(0x000A) matched by pattern (*UCP)[[:space:]]

ChrW(0x000B) matched by pattern \s
ChrW(0x000B) matched by pattern (*UCP)\s
ChrW(0x000B) matched by pattern [[:space:]]
ChrW(0x000B) matched by pattern (*UCP)[[:space:]]

ChrW(0x000C) matched by pattern \s
ChrW(0x000C) matched by pattern (*UCP)\s
ChrW(0x000C) matched by pattern [[:space:]]
ChrW(0x000C) matched by pattern (*UCP)[[:space:]]

ChrW(0x000D) matched by pattern \s
ChrW(0x000D) matched by pattern (*UCP)\s
ChrW(0x000D) matched by pattern [[:space:]]
ChrW(0x000D) matched by pattern (*UCP)[[:space:]]

ChrW(0x0020) matched by pattern \s
ChrW(0x0020) matched by pattern (*UCP)\s
ChrW(0x0020) matched by pattern [[:space:]]
ChrW(0x0020) matched by pattern (*UCP)[[:space:]]

ChrW(0x0085) matched by pattern (*UCP)\s
ChrW(0x0085) matched by pattern (*UCP)[[:space:]]

ChrW(0x00A0) matched by pattern (*UCP)\s
ChrW(0x00A0) matched by pattern (*UCP)[[:space:]]
So yes, we see that ChrW(0x0085) and ChrW(0x00A0) are treated differently. With jchd's code, it's easy to check which pattern matches what.
Good luck to both of you
comment:15 Changed 22 months ago by Jpm
@pixelsearch, did you use the script I posted above?
Does your post conflict with what I said?
- \s equals [[:space:]] and both are *UCP sensitive
- \h and \v are independent of *UCP
- [[:blank:]] is *UCP sensitive, as opposed to \h
Thanks for the help
comment:16 Changed 22 months ago by pixelsearch
@jpm After testing each and every pattern, I confirm everything you wrote in your very last post :
Compare jpm's script results with RegExpQuickTester results, based on 65535 chars :

\h                  19 results
(*UCP)\h            19 results   same as \h => independent of *UCP
\v                   7 results
(*UCP)\v             7 results   same as \v => independent of *UCP
\s                   6 results
(*UCP)\s            26 results   => *UCP sensitive
[[:space:]]          6 results
(*UCP)[[:space:]]   26 results   => *UCP sensitive
[[:blank:]]          2 results
(*UCP)[[:blank:]]   19 results   => *UCP sensitive

Notes :
\s and [[:space:]] return the very same 6 results
(*UCP)\s and (*UCP)[[:space:]] return the very same 26 results
(*UCP)[[:blank:]] returns the very same 19 results as \h or (*UCP)\h
The output of your script is great (with the xX's in ArrayDisplay). I wish I could upload it here as an image for everyone to see it, but I don't know how.
Bravo to both of you.
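The per-pattern counts above can be rechecked with a short loop. This is only a sketch, not the script or the RegExpQuickTester session used for the figures above. It tests every code unit from 0 to 65535 one at a time, as jchd's script does, and that includes lone surrogates.

```autoit
#include <StringConstants.au3>

; Count, for each class, how many single characters match with and without (*UCP).
Local $aClasses[] = ['\h', '\v', '\s', '[[:space:]]', '[[:blank:]]']
Local $iPlain, $iUCP, $sChar

For $sClass In $aClasses
    $iPlain = 0
    $iUCP = 0
    For $i = 0 To 65535
        $sChar = ChrW($i)
        If StringRegExp($sChar, $sClass, $STR_REGEXPMATCH) Then $iPlain += 1
        If StringRegExp($sChar, '(*UCP)' & $sClass, $STR_REGEXPMATCH) Then $iUCP += 1
    Next
    ConsoleWrite($sClass & " : " & $iPlain & " without (*UCP), " & $iUCP & " with (*UCP)" & @CRLF)
Next
```

A class is *UCP sensitive when the two numbers on its line differ.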
comment:17 Changed 22 months ago by Jpm
Hi Pixelsearch,
Thanks for the feedback.
I don't know how jchd posted the output of his test.
comment:18 Changed 22 months ago by pixelsearch
@jpm, as your script uses _DebugArrayDisplay, then I clicked the button "copy data & header/row", saved clipboard content as jpm.csv, imported in Excel (pipe delimited), centered columns in Excel, exported from Excel as "Formatted text (space delimited *.prn)" as jpm.prn, opened the .prn file with NotePad, copy-paste its content below, without any further modification (except the horizontal line I added below the headers)
It was quick & easy to do it, much longer to explain ! The display below looks great. Let's try to remember the different steps, in case we need one day other _DebugArrayDisplay results formatted as below.
So the steps are found in Trac Ticket 39-45 ?
Then a mnemonic WW2 will do.
Row    Unicode  CharacterName              \h   \v   \s   [[:space:]]  [[:blank:]]
----------------------------------------------------------------------------------
# 1    0x0009   HT                         xX        xX   xX           xX
# 2    0x000A   LF                              xX   xX   xX
# 3    0x000B   VT                              xX   xX   xX
# 4    0x000C   FF                              xX   xX   xX
# 5    0x000D   CR                              xX   xX   xX
# 6    0x0020   SPACE                      xX        xX   xX           xX
# 7    0x0085   NEL                             xX   X    X
# 8    0x00A0   NO-BREAK SPACE             xX        X    X            X
# 9    0x1680   OGHAM SPACE MARK           xX        X    X            X
# 10   0x180E   MONGOLIAN VOWEL SEPARATOR  xX        X    X            X
# 11   0x2000   EN QUAD                    xX        X    X            X
# 12   0x2001   EM QUAD                    xX        X    X            X
# 13   0x2002   EN SPACE                   xX        X    X            X
# 14   0x2003   EM SPACE                   xX        X    X            X
# 15   0x2004   THREE-PER-EM SPACE         xX        X    X            X
# 16   0x2005   FOUR-PER-EM SPACE          xX        X    X            X
# 17   0x2006   SIX-PER-EM SPACE           xX        X    X            X
# 18   0x2007   FIGURE SPACE               xX        X    X            X
# 19   0x2008   PUNCTUATION SPACE          xX        X    X            X
# 20   0x2009   THIN SPACE                 xX        X    X            X
# 21   0x200A   HAIR SPACE                 xX        X    X            X
# 22   0x2028   LINE SEPARATOR                  xX   X    X
# 23   0x2029   PARAGRAPH SEPARATOR             xX   X    X
# 24   0x202F   NARROW NO-BREAK SPACE      xX        X    X            X
# 25   0x205F   MEDIUM MATHEMATICAL SPACE  xX        X    X            X
# 26   0x3000   IDEOGRAPHIC SPACE          xX        X    X            X
comment:19 Changed 22 months ago by Jpm
Whoa, you need to write a small script to automate this process.
You could have added at the end the meaning of x, X and xX.
Thanks a lot
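A rough sketch of what such a small script could look like is below. It assumes the text copied from _DebugArrayDisplay is pipe-delimited with one row per line, as described in comment 18. _AlignPipeText is a made-up helper name, not an existing UDF, and the sample input is illustrative only.

```autoit
#include <StringConstants.au3>

; Hypothetical helper (not part of AutoIt or of this ticket): pads the cells of
; pipe-delimited lines so that the columns line up in a fixed-width font.
Func _AlignPipeText($sText)
    Local $aLines = StringSplit(StringStripCR($sText), @LF, $STR_NOCOUNT)
    ; The first line (the header) decides how many columns are kept.
    Local $iCols = UBound(StringSplit($aLines[0], "|", $STR_NOCOUNT))
    Local $aWidths[$iCols], $aCells, $sRow, $sOut = ""

    ; First pass: find the widest cell of each column.
    For $i = 0 To $iCols - 1
        $aWidths[$i] = 0
    Next
    For $sLine In $aLines
        $aCells = StringSplit($sLine, "|", $STR_NOCOUNT)
        For $i = 0 To UBound($aCells) - 1
            If $i >= $iCols Then ExitLoop
            If StringLen($aCells[$i]) > $aWidths[$i] Then $aWidths[$i] = StringLen($aCells[$i])
        Next
    Next

    ; Second pass: left-align every cell to its column width plus two spaces.
    For $sLine In $aLines
        $aCells = StringSplit($sLine, "|", $STR_NOCOUNT)
        $sRow = ""
        For $i = 0 To UBound($aCells) - 1
            If $i >= $iCols Then ExitLoop
            $sRow &= StringFormat("%-" & ($aWidths[$i] + 2) & "s", $aCells[$i])
        Next
        $sOut &= StringStripWS($sRow, $STR_STRIPTRAILING) & @CRLF
    Next
    Return $sOut
EndFunc   ;==>_AlignPipeText

; Illustrative input only
ConsoleWrite(_AlignPipeText("Unicode|CharacterName|\h|\v" & @LF & "0x0009|HT|xX|" & @LF & "0x000A|LF||xX"))
```

Feeding it ClipGet() instead of the sample string would skip the Excel and Notepad steps, provided the clipboard content really has that shape.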
comment:20 Changed 22 months ago by jchd18
Just did (hand editing) the same thing with (obviously) the exact same result.
Great.
comment:21 Changed 21 months ago by Jpm
- Milestone set to 3.3.17.0
- Resolution set to Fixed
- Status changed from reopened to closed
Fixed by revision [12981] in version: 3.3.17.0
Is there also a corresponding change for \S?
So far, the help has described it as "Matches any non-whitespace character."