
Opened 2 years ago

Closed 21 months ago

#3945 closed Bug (Fixed)

StringRegExp help about \s misses VT

Reported by: jchd18 Owned by: Jpm
Milestone: 3.3.17.0 Component: Documentation
Version: 3.3.14.0 Severity: None
Keywords: Cc:

Description

PCRE changed the meaning of the character class \s to include VT as well, following the same change in Perl. AutoIt picked this up with its update to PCRE v8.44.

So \s is now equivalent to [[:space:]]

Help should be updated to reflect the actual behavior.

Attachments (0)

Change History (21)

comment:1 Changed 23 months ago by mLipok

Is there a corresponding change for \S?

So far it has been documented as "Matches any non-whitespace character."


comment:2 Changed 23 months ago by pixelsearch

Some information about how the handling of Chr(11), i.e. Vertical Tab (VT), has evolved.

Excerpts from the AutoIt history and the PCRE changelog, in descending date order:
https://www.pcre.org/original/changelog.txt

A) AutoIt 3.3.16.1 (19th September, 2022) (Release)

B) AutoIt 3.3.16.0 (6th March, 2022) (Release)
Changed: PCRE regular expression engine updated to 8.44

C) PCRE Version 8.36 26-September-2014

  1. When a pattern starting with \s was studied, VT was not included in the list of possible starting characters; this should have been part of the 8.34/18 patch.

D) PCRE Version 8.34 15-December-2013

  1. The character VT has been added to the default ("C" locale) set of characters that match \s and are generally treated as white space, following this same change in Perl 5.18. There is now no difference between "Perl space" and "POSIX space". Whether VT is treated as white space in other locales depends on the locale.

E) AutoIt 3.3.8.1 (29th January, 2012) (Release)

F) PCRE Version 8.11 10-Dec-2010

  1. If \s appeared in a character class, it removed the VT character from the class, even if it had been included by some previous item, for example in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part of \s, but is part of the POSIX "space" class.)

G) AutoIt 3.3.6.1 (16th April, 2010) (Release)

Test 1 : the bug in F)
======
Local $sSubject = Chr(11) ; Vertical Tab VT
Local $sPattern = '[\x00-\xff\s]'
; AutoIt 2010  => no match (because of the PCRE bug)
; AutoIt 2012+ => matches Chr(11) (PCRE fixed the bug)

Test 2 : \s
======
; $sSubject = a string of 256 characters, from Chr(0) to Chr(255)
Local $sPattern = '\s' ; now the same as '[[:space:]]'
; AutoIt 2022 => matches Chr(9) Chr(10) Chr(11) Chr(12) Chr(13) Chr(32), i.e. 6 whitespace characters

Test 3 : \S (mLipok's question)
======
; $sSubject = a string of 256 characters, from Chr(0) to Chr(255)
Local $sPattern = '\S' ; now the same as '[[:^space:]]'
; AutoIt 2022 => matches 250 characters (i.e. 256 minus the 6 whitespace characters above)
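
The three tests above could be run with something like the following minimal sketch. It assumes a subject built from Chr(1) to Chr(255) (Chr(0) is left out here, as in the script of comment:3), and it only prints match counts; what it prints depends on the AutoIt/PCRE version in use.

```autoit
#include <StringConstants.au3>

; Test 1: VT against a class that also contains \s
Local $bTest1 = StringRegExp(Chr(11), '[\x00-\xff\s]', $STR_REGEXPMATCH)
ConsoleWrite("Test 1 : Chr(11) matched = " & ($bTest1 ? "yes" : "no") & @CRLF)

; Subject for tests 2 and 3 (assumption: Chr(1)..Chr(255), Chr(0) left out)
Local $sSubject = ""
For $i = 1 To 255
    $sSubject &= Chr($i)
Next

; Tests 2 and 3: count every single-character match of \s and \S
Local $aPatterns[] = ['\s', '\S']
Local $aMatches
For $sPattern In $aPatterns
    $aMatches = StringRegExp($sSubject, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
    If @error Then
        ConsoleWrite($sPattern & " : no match (or bad pattern)" & @CRLF)
    Else
        ConsoleWrite($sPattern & " : " & UBound($aMatches) & " matches" & @CRLF)
    EndIf
Next
```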

ChangeLog for PCRE
Note that the PCRE 8.xx series (PCRE1) is now at end of life. All development
is happening in the PCRE2 10.xx series.

Version 8.45 15-June-2021
...

So PCRE 8.45 could be the last PCRE1 version to be integrated into the next AutoIt release, since the PCRE2 10.xx series may require a lot of rework to integrate?
Good luck, Jon.

comment:3 Changed 23 months ago by pixelsearch

If the help file is going to be reworked (topic StringRegExp), then I have a small issue with \xA0, i.e. Chr(160), the non-breaking space.

Here is a script where the subject is a string of 255 characters, from Chr(1) to Chr(255).
You can comment out or uncomment any pattern line to display in the SciTE console the ASCII codes that matched.

The goal is to make the help file even more accurate concerning \s, [[:blank:]], etc.
This rework is a real nightmare. I wonder how many days or weeks jchd spent preparing this help file topic (especially as he also listed all character codes > 255). Hats off!

#include <Array.au3>

Local $sSubject = ""
For $i = 1 To 255
    $sSubject &= Chr($i)
Next

Local $sPattern = '\h' ; i.e. '[\x09\x20\xA0]', i.e. Chr(9) Chr(32) Chr(160) : ok
; Local $sPattern = '(*UCP)\h' ; same result (tested)

; Current help file: the following line found in the help file isn't accurate:
; \s is equivalent to "[\h\x0A\x0C\x0D]" (excluding \xA0 from \h when UCP is enabled)

; Should be:
; Local $sPattern = '\s' ; i.e. '[\x09-\x0D\x20]', i.e. Chr(9) Chr(10) Chr(11) Chr(12) Chr(13) Chr(32)
; Local $sPattern = '(*UCP)\s' ; (including \xA0 when UCP is enabled)

; Local $sPattern = '[[:space:]]' ; same as '\s'
; Local $sPattern = '(*UCP)[[:space:]]' ; same as '(*UCP)\s'

; Current help file: the 2 following lines found in the help file could be more accurate:
; [:blank:] Space or Tab (@TAB) (same as \h or [\x09\x20]) : no, it is not the same as \h
; When UCP is enabled: Unicode horizontal whitespaces (same as \h).

; Should be:
; Local $sPattern = '[[:blank:]]' ; i.e. '[\x09\x20]', i.e. Chr(9) Chr(32)
; Local $sPattern = '(*UCP)[[:blank:]]' ; (including \xA0 when UCP is enabled)

Local $aArray = StringRegExp($sSubject, $sPattern, 3)
If Not @error Then
    _CW($aArray)
    _ArrayDisplay($aArray)
Else
    MsgBox(0, 'StringRegExp', 'error = ' & @error & (@error = 1 ? ' (no matches)' : ' (bad pattern)'))
EndIf

;======================================
Func _CW(ByRef $aArray, $sTestABC = "") ; _CW($aArray) can be placed just before _ArrayDisplay($aArray) if useful
    Local $iAscW
    For $i = 0 To Ubound($aArray) - 1
        ConsoleWrite(($sTestABC ? "Test " & $sTestABC  & "  -  " : "") & "Row " & $i & " : ")
        For $j = 1 To StringLen($aArray[$i])
            $iAscW = AscW(StringMid($aArray[$i], $j, 1))
            ; ConsoleWrite("Chr(" & Asc(StringMid($aArray[$i], $j, 1)) & ")")
            ConsoleWrite(($iAscW < 256 ? "Chr" : "ChrW") & "(" & $iAscW & ") ")
        Next
        ConsoleWrite(@crlf)
    Next
    ConsoleWrite(@crlf)
EndFunc

comment:4 Changed 22 months ago by Jpm

  • Milestone set to 3.3.17.0
  • Owner set to Jpm
  • Resolution set to Fixed
  • Status changed from new to closed

Fixed by revision [12974] in version: 3.3.17.0

comment:5 Changed 22 months ago by jchd18

Good catch @pixelsearch!

There is also an issue with Unicode codepoint 0x85 (Unicode "Next line", acronym NEL), which is matched by the following patterns:

    (*UCP)\s
    (*UCP)[[:space:]]

Code to demonstrate which is which in both directions:

Local $sSubject, $aPattern = ['\h', '(*UCP)\h', '\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']
; Character to pattern
For $i = 0 To 65535
    $sSubject = ChrW($i)
    For $sPattern In $aPattern
        If StringRegExp($sSubject, $sPattern) Then
            CW("ChrW(0x" & Hex(AscW($sSubject), 4) & ") matched by pattern " & $sPattern)
        EndIf
    Next
Next
CW()
; Pattern to character
For $sPattern In $aPattern
    For $i = 0 To 65535
        $sSubject = ChrW($i)
        If StringRegExp($sSubject, $sPattern) Then
            CW("Pattern " & $sPattern & " matches ChrW(0x" & Hex(AscW($sSubject), 4) & ")")
        EndIf
    Next
Next

; CW() is just a ConsoleWrite with a @LF appended
Func CW($sText = "")
    ConsoleWrite($sText & @LF)
EndFunc

Here, CW() is just a ConsoleWrite with a @LF appended.

Note that many other Unicode codepoints match the various "spacing" patterns in UCP mode!


comment:6 Changed 22 months ago by jchd18

  • Resolution Fixed deleted
  • Status changed from closed to reopened

comment:7 Changed 22 months ago by TicketCleanup

  • Milestone 3.3.17.0 deleted

Automatic ticket cleanup.

comment:8 Changed 22 months ago by Jpm

Hi @jchd, @pixelsearch,
I'm not sure what should be fixed.
Can you post the change that is needed?
Thanks

comment:9 Changed 22 months ago by jchd18

I'm fairly busy these days but I'll try to come up with a correct description of what subset of Unicode the various patterns cover.

For now here's a table showing which is which:

Codepoint CharacterName             \h  (*UCP)\h  \s  (*UCP)\s  [[:space:]]  (*UCP)[[:space:]]
0x0009    HT                         ✔     ✔     ✔      ✔          ✔              ✔
0x000A    LF                                      ✔      ✔          ✔              ✔
0x000B    VT                                      ✔      ✔          ✔              ✔
0x000C    FF                                      ✔      ✔          ✔              ✔
0x000D    CR                                      ✔      ✔          ✔              ✔
0x0020    SPACE                      ✔     ✔     ✔      ✔          ✔              ✔
0x0085    NEL                                            ✔                          ✔
0x00A0    NO-BREAK SPACE             ✔     ✔            ✔                          ✔
0x1680    OGHAM SPACE MARK           ✔     ✔            ✔                          ✔
0x180E    MONGOLIAN VOWEL SEPARATOR  ✔     ✔            ✔                          ✔
0x2000    EN QUAD                    ✔     ✔            ✔                          ✔
0x2001    EM QUAD                    ✔     ✔            ✔                          ✔
0x2002    EN SPACE                   ✔     ✔            ✔                          ✔
0x2003    EM SPACE                   ✔     ✔            ✔                          ✔
0x2004    THREE-PER-EM SPACE         ✔     ✔            ✔                          ✔
0x2005    FOUR-PER-EM SPACE          ✔     ✔            ✔                          ✔
0x2006    SIX-PER-EM SPACE           ✔     ✔            ✔                          ✔
0x2007    FIGURE SPACE               ✔     ✔            ✔                          ✔
0x2008    PUNCTUATION SPACE          ✔     ✔            ✔                          ✔
0x2009    THIN SPACE                 ✔     ✔            ✔                          ✔
0x200A    HAIR SPACE                 ✔     ✔            ✔                          ✔
0x2028    LINE SEPARATOR                                 ✔                          ✔
0x2029    PARAGRAPH SEPARATOR                            ✔                          ✔
0x202F    NARROW NO-BREAK SPACE      ✔     ✔            ✔                          ✔
0x205F    MEDIUM MATHEMATICAL SPACE  ✔     ✔            ✔                          ✔
0x3000    IDEOGRAPHIC SPACE          ✔     ✔            ✔                          ✔

comment:10 Changed 22 months ago by Jpm

So you want this table integrated into the help?

comment:11 Changed 22 months ago by jchd18

I didn't come up with a succinct textual description, and in the end I think it's clearer to insert a simplified version of the table. Since there are in fact 3 pairs of different patterns producing the same result, grouping them by pair leads to this shorter table:

Codepoint CharacterName             \h         \s           (*UCP)\s       ⎫ equivalent
                                 (*UCP)\h  [[:space:]]  (*UCP)[[:space:]]  ⎭ patterns
                                                                 
0x0009    HT                         *          *               *
0x000A    LF                                    *               *
0x000B    VT                                    *               *
0x000C    FF                                    *               *
0x000D    CR                                    *               *
0x0020    SPACE                      *          *               *
0x0085    NEL                                                   *
0x00A0    NO-BREAK SPACE             *                          *
0x1680    OGHAM SPACE MARK           *                          *
0x180E    MONGOLIAN VOWEL SEPARATOR  *                          *
0x2000    EN QUAD                    *                          *
0x2001    EM QUAD                    *                          *
0x2002    EN SPACE                   *                          *
0x2003    EM SPACE                   *                          *
0x2004    THREE-PER-EM SPACE         *                          *
0x2005    FOUR-PER-EM SPACE          *                          *
0x2006    SIX-PER-EM SPACE           *                          *
0x2007    FIGURE SPACE               *                          *
0x2008    PUNCTUATION SPACE          *                          *
0x2009    THIN SPACE                 *                          *
0x200A    HAIR SPACE                 *                          *
0x2028    LINE SEPARATOR                                        *
0x2029    PARAGRAPH SEPARATOR                                   *
0x202F    NARROW NO-BREAK SPACE      *                          *
0x205F    MEDIUM MATHEMATICAL SPACE  *                          *
0x3000    IDEOGRAPHIC SPACE          *                          *
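
The claim that the six patterns collapse into three equivalent pairs could be checked with a sketch along these lines. It assumes the pairing shown in the table header and simply counts, over ChrW(1) to ChrW(65535), the codepoints on which the two patterns of a pair disagree. A count of 0 for each pair would be consistent with the table.

```autoit
#include <StringConstants.au3>

; Pairs taken from the table header above
Local $aPairs[][2] = [ _
        ['\h', '(*UCP)\h'], _
        ['\s', '[[:space:]]'], _
        ['(*UCP)\s', '(*UCP)[[:space:]]'] _
        ]

Local $sChar, $iDiff
For $p = 0 To UBound($aPairs) - 1
    $iDiff = 0
    For $i = 1 To 65535
        $sChar = ChrW($i)
        ; Count codepoints matched by one pattern of the pair but not by the other
        If StringRegExp($sChar, $aPairs[$p][0], $STR_REGEXPMATCH) <> _
                StringRegExp($sChar, $aPairs[$p][1], $STR_REGEXPMATCH) Then $iDiff += 1
    Next
    ConsoleWrite($aPairs[$p][0] & " vs " & $aPairs[$p][1] & " : " & $iDiff & " differences" & @CRLF)
Next
```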

If someone finds a better way to describe the same thing, feel free to proceed.

comment:12 Changed 22 months ago by Jpm

Unless somebody disagrees, I will integrate the last proposal.

comment:13 Changed 22 months ago by Jpm

I rechecked with the following script and found that:

\s equals [[:space:]], and both are *UCP sensitive
\h and \v are independent of *UCP
[[:blank:]] is *UCP sensitive, as opposed to \h

Do you agree?
x  : only without *UCP
X  : only with *UCP
xX : with or without *UCP

#include <StringConstants.au3>
#include <Debug.au3>
#include <AutoItConstants.au3>

Local $aPatterns[] = ["\h", "\v", "\s", "[[:space:]]", "[[:blank:]]"]
Local $sUCP = "(*UCP)"

Local $aChrW[][7] = [ _
		["Unicode", "CharacterName", "\h", "\v", "\s", "[[:space:]]", "[[:blank:]]"], _
		[0x0009, "HT"], _
		[0x000A, "LF"], _
		[0x000B, "VT"], _
		[0x000C, "FF"], _
		[0x000D, "CR"], _
		[0x0020, "SPACE"], _
		[0x0085, "NEL"], _
		[0x00A0, "NO-BREAK SPACE"], _
		[0x1680, "OGHAM SPACE MARK"], _
		[0x180E, "MONGOLIAN VOWEL SEPARATOR"], _
		[0x2000, "EN QUAD"], _
		[0x2001, "EM QUAD"], _
		[0x2002, "EN SPACE"], _
		[0x2003, "EM SPACE"], _
		[0x2004, "THREE-PER-EM SPACE"], _
		[0x2005, "FOUR-PER-EM SPACE"], _
		[0x2006, "SIX-PER-EM SPACE"], _
		[0x2007, "FIGURE SPACE"], _
		[0x2008, "PUNCTUATION SPACE"], _
		[0x2009, "THIN SPACE"], _
		[0x200A, "HAIR SPACE"], _
		[0x2028, "LINE SEPARATOR"], _
		[0x2029, "PARAGRAPH SEPARATOR"], _
		[0x202F, "NARROW NO-BREAK SPACE"], _
		[0x205F, "MEDIUM MATHEMATICAL SPACE"], _
		[0x3000, "IDEOGRAPHIC SPACE"] _
		]

Local $sStr, $bResult ;= StringRegExp("- -", $sUCP & $aPatterns[$j], $STR_REGEXPMATCH)
For $k = 0 To 1 ; test *UCP on second loop
	For $i = 1 To UBound($aChrW) - 1
		For $j = 0 To UBound($aPatterns) - 1
			$sStr = "-" & ChrW($aChrW[$i][0]) & "-"
			If $k Then
				$bResult = StringRegExp($sStr, $sUCP & $aPatterns[$j], $STR_REGEXPMATCH)
				If $bResult Then
					If ($aChrW[$i][$j + 2] <> "x") Then
						$aChrW[$i][$j + 2] = "X"
					Else
						$aChrW[$i][$j + 2] = "xX"
					EndIf
				Elseif $k And ($aChrW[$i][$j + 2] = "x") Then
						$aChrW[$i][$j + 2] = "?X"
				EndIf
			Else
				$bResult = StringRegExp($sStr, $aPatterns[$j], $STR_REGEXPMATCH)

				If $bResult Then $aChrW[$i][$j + 2] = "x"
			EndIf
		Next
	Next
Next

For $i = 1 To  UBound($aChrW) - 1
	$aChrW[$i][0] = "0x" & Hex($aChrW[$i][0], 4)
Next

Local $sHeader =  ""
For $i = 0 To UBound($aChrW, $UBOUND_COLUMNS) - 1
	$sHeader &= $aChrW[0][$i] & "|"
Next
$sHeader = StringReplace($sHeader, "CharacterName","CharacterName                                  ")
_DebugArrayDisplay($aChrW, @ScriptName, "1:", 0, Default, $sHeader)


comment:14 Changed 22 months ago by pixelsearch

Hi jpm & jchd
I'm not sure I'll be of much help with the updates, as I only quickly checked characters from 0 to 255. For example, with jchd's script (which uses his personal "CW.au3" and "dump.au3", found in the forum), it's easy to show the following:

#include "CW.au3"

; Local $sSubject, $aPattern = ['\h', '(*UCP)\h', '\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']
Local $sSubject, $aPattern = ['\s', '(*UCP)\s', '[[:space:]]', '(*UCP)[[:space:]]']

Local $bFound
; Character to pattern
; For $i = 0 To 65535
For $i = 0 To 255
    $bFound = False
    $sSubject = ChrW($i)
    For $sPattern In $aPattern
        If StringRegExp($sSubject, $sPattern) Then
            $bFound = True
            CW("ChrW(0x" & Hex(AscW($sSubject), 4) & ") matched by pattern " & $sPattern)
        EndIf
    Next
    If $bFound Then ConsoleWrite(@crlf)
Next

Console display

ChrW(0x0009) matched by pattern \s
ChrW(0x0009) matched by pattern (*UCP)\s
ChrW(0x0009) matched by pattern [[:space:]]
ChrW(0x0009) matched by pattern (*UCP)[[:space:]]

ChrW(0x000A) matched by pattern \s
ChrW(0x000A) matched by pattern (*UCP)\s
ChrW(0x000A) matched by pattern [[:space:]]
ChrW(0x000A) matched by pattern (*UCP)[[:space:]]

ChrW(0x000B) matched by pattern \s
ChrW(0x000B) matched by pattern (*UCP)\s
ChrW(0x000B) matched by pattern [[:space:]]
ChrW(0x000B) matched by pattern (*UCP)[[:space:]]

ChrW(0x000C) matched by pattern \s
ChrW(0x000C) matched by pattern (*UCP)\s
ChrW(0x000C) matched by pattern [[:space:]]
ChrW(0x000C) matched by pattern (*UCP)[[:space:]]

ChrW(0x000D) matched by pattern \s
ChrW(0x000D) matched by pattern (*UCP)\s
ChrW(0x000D) matched by pattern [[:space:]]
ChrW(0x000D) matched by pattern (*UCP)[[:space:]]

ChrW(0x0020) matched by pattern \s
ChrW(0x0020) matched by pattern (*UCP)\s
ChrW(0x0020) matched by pattern [[:space:]]
ChrW(0x0020) matched by pattern (*UCP)[[:space:]]

ChrW(0x0085) matched by pattern (*UCP)\s
ChrW(0x0085) matched by pattern (*UCP)[[:space:]]

ChrW(0x00A0) matched by pattern (*UCP)\s
ChrW(0x00A0) matched by pattern (*UCP)[[:space:]]

So yes, we see that ChrW(0x0085) and ChrW(0x00A0) are treated differently. With jchd's code, it's easy to check which pattern matches what.

Good luck to both of you

comment:15 Changed 22 months ago by Jpm

@pixelsearch, did you use the script I posted above?
Does your post conflict with what I said?

\s equals [[:space:]], and both are *UCP sensitive
\h and \v are independent of *UCP
[[:blank:]] is *UCP sensitive, as opposed to \h

Thanks for the help

comment:16 Changed 22 months ago by pixelsearch

@jpm After testing each and every pattern, I confirm everything you wrote in your very last post :

Comparison of jpm's script results with RegExpQuickTester results, based on 65535 chars:

\h                 19 results
(*UCP)\h           19 results same as \h => independent of *UCP

\v                 7 results
(*UCP)\v           7 results same as \v => independent of *UCP

\s                  6 results
(*UCP)\s           26 results => *UCP sensitive

[[:space:]]         6 results
(*UCP)[[:space:]]  26 results => *UCP sensitive 

[[:blank:]]         2 results
(*UCP)[[:blank:]]  19 results => *UCP sensitive 

Notes :
\s and [[:space:]] return the very same 6 results
(*UCP)\s and (*UCP)[[:space:]] return the very same 26 results
(*UCP)[[:blank:]] returns the very same 19 results as \h or (*UCP)\h
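
For anyone without RegExpQuickTester at hand, counts of this kind could be reproduced with a minimal sketch like the one below. It assumes the same five patterns as jpm's script and scans ChrW(1) to ChrW(65535) one character at a time. The figures it prints depend on the AutoIt/PCRE version, so none are hard-coded here.

```autoit
#include <StringConstants.au3>

Local $aPatterns[] = ['\h', '\v', '\s', '[[:space:]]', '[[:blank:]]']
Local $sPattern, $iCount

For $k = 0 To 1 ; second pass prepends (*UCP)
    For $j = 0 To UBound($aPatterns) - 1
        $sPattern = ($k ? '(*UCP)' : '') & $aPatterns[$j]
        $iCount = 0
        For $i = 1 To 65535
            ; One single-character subject per codepoint
            If StringRegExp(ChrW($i), $sPattern, $STR_REGEXPMATCH) Then $iCount += 1
        Next
        ConsoleWrite(StringFormat("%-20s %d results", $sPattern, $iCount) & @CRLF)
    Next
Next
```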

The output of your script is great (with the xX's in ArrayDisplay). I wish I could upload it here as an image for everyone to see it, but I don't know how.
Bravo to both of you.

comment:17 Changed 22 months ago by Jpm

Hi Pixelsearch,
Thanks for the feedback.
I don't know how jchd posted the output of his test.

comment:18 Changed 22 months ago by pixelsearch

@jpm, as your script uses _DebugArrayDisplay, I clicked the "copy data & header/row" button, saved the clipboard content as jpm.csv, imported it into Excel (pipe delimited), centered the columns in Excel, exported it from Excel as "Formatted text (space delimited *.prn)" to jpm.prn, opened the .prn file with Notepad, and pasted its content below without any further modification (except for the horizontal line I added below the headers).

It was quick and easy to do, but much longer to explain! The display below looks great. Let's try to remember the different steps, in case we one day need other _DebugArrayDisplay results formatted as below.

So the steps are found in Trac Ticket 39-45 ?
Then a mnemonic WW2 will do.

Row   Unicode CharacterName                \h   \v   \s  [[:space:]] [[:blank:]]
--------------------------------------------------------------------------------
# 1   0x0009  HT                           xX        xX      xX          xX
# 2   0x000A  LF                                xX   xX      xX
# 3   0x000B  VT                                xX   xX      xX
# 4   0x000C  FF                                xX   xX      xX
# 5   0x000D  CR                                xX   xX      xX
# 6   0x0020  SPACE                        xX        xX      xX          xX
# 7   0x0085  NEL                               xX    X       X
# 8   0x00A0  NO-BREAK SPACE               xX         X       X           X
# 9   0x1680  OGHAM SPACE MARK             xX         X       X           X
# 10  0x180E  MONGOLIAN VOWEL SEPARATOR    xX         X       X           X
# 11  0x2000  EN QUAD                      xX         X       X           X
# 12  0x2001  EM QUAD                      xX         X       X           X
# 13  0x2002  EN SPACE                     xX         X       X           X
# 14  0x2003  EM SPACE                     xX         X       X           X
# 15  0x2004  THREE-PER-EM SPACE           xX         X       X           X
# 16  0x2005  FOUR-PER-EM SPACE            xX         X       X           X
# 17  0x2006  SIX-PER-EM SPACE             xX         X       X           X
# 18  0x2007  FIGURE SPACE                 xX         X       X           X
# 19  0x2008  PUNCTUATION SPACE            xX         X       X           X
# 20  0x2009  THIN SPACE                   xX         X       X           X
# 21  0x200A  HAIR SPACE                   xX         X       X           X
# 22  0x2028  LINE SEPARATOR                    xX    X       X
# 23  0x2029  PARAGRAPH SEPARATOR               xX    X       X
# 24  0x202F  NARROW NO-BREAK SPACE        xX         X       X           X
# 25  0x205F  MEDIUM MATHEMATICAL SPACE    xX         X       X           X
# 26  0x3000  IDEOGRAPHIC SPACE            xX         X       X           X
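
The manual steps described above could be approximated by a small script. The following is only a sketch: _PrintFixedWidth is a hypothetical helper (not part of any UDF), it left-aligns the columns rather than centering them as Excel did, and the demo array holds just a few rows copied from the table above.

```autoit
; Hypothetical helper: writes a 2D array to the console as space-padded, fixed-width columns
Func _PrintFixedWidth(Const ByRef $aData)
    Local $iRows = UBound($aData, 1), $iCols = UBound($aData, 2)
    Local $aWidth[$iCols], $sLine

    ; First pass: find the widest cell of each column
    For $c = 0 To $iCols - 1
        $aWidth[$c] = 0
        For $r = 0 To $iRows - 1
            If StringLen($aData[$r][$c]) > $aWidth[$c] Then $aWidth[$c] = StringLen($aData[$r][$c])
        Next
    Next

    ; Second pass: pad every cell to its column width plus 2 spaces
    For $r = 0 To $iRows - 1
        $sLine = ""
        For $c = 0 To $iCols - 1
            $sLine &= StringFormat("%-" & ($aWidth[$c] + 2) & "s", $aData[$r][$c])
        Next
        ConsoleWrite(StringStripWS($sLine, 2) & @CRLF) ; 2 = strip trailing whitespace
    Next
EndFunc

; Demo data: a few rows copied from the table above
Local $aDemo[][4] = [ _
        ["Unicode", "CharacterName", "\h", "\s"], _
        ["0x0009", "HT", "xX", "xX"], _
        ["0x000B", "VT", "", "xX"], _
        ["0x00A0", "NO-BREAK SPACE", "xX", "X"] _
        ]
_PrintFixedWidth($aDemo)
```

In jpm's script, the same call could in principle replace or follow the _DebugArrayDisplay line, passing $aChrW instead of the demo array.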

comment:19 Changed 22 months ago by Jpm

Whoa, you need to write a small script to automate this process.
You could have added at the end the meaning of x, X and xX.
Thanks a lot

comment:20 Changed 22 months ago by jchd18

Just did (hand editing) the same thing with (obviously) the exact same result.
Great.

comment:21 Changed 21 months ago by Jpm

  • Milestone set to 3.3.17.0
  • Resolution set to Fixed
  • Status changed from reopened to closed

Fixed by revision [12981] in version: 3.3.17.0
