leuce (posted November 23, 2020, edited)

Hello everyone

I'm trying to perform a regex find/replace on a piece of text, but I encounter two problems (possibly related to each other). The first problem is that my regular expression may be incorrect. The second problem is that during AutoIt's processing of the text, some characters are changed that should not be changed.

Since no-break spaces and zero width non joiners can't be shown when pasted in the forum, I've added an attachment with the text that I copy (to the clipboard), as well as what the result should look like after the regex replacement. The text contains, among others, one or more series consisting of a no-break space, a number, a no-break space, and a zero width non joiner. If you view the attached file in Word with "non-printing characters" enabled (i.e. so that you can see spaces and line breaks), you should see the zero width non joiner as a little box.

However, when I run this script and paste the text that is added to the clipboard, it appears that the no-break spaces were converted to normal spaces by AutoIt, which may (or may not) explain why the regex replacement does not work.

    $zerowidthnonjoiner = BinaryToString ("0x0C20", 2)
    $nobreakspace = BinaryToString ("0xA000", 2)
    $grabbedtext = ClipGet ()
    Sleep ("1000")
    $grabbedtext2 = StringRegExpReplace ($grabbedtext, '(' & $nobreakspace & ')([0-9]+?)(' & $nobreakspace & $zerowidthnonjoiner & ')', '{$2}')
    MsgBox (0, "", $grabbedtext & @CRLF & @CRLF & $grabbedtext2, 0)
    $toput = $grabbedtext & @CRLF & @CRLF & $grabbedtext2
    ClipPut ($toput)

I originally tried:

    $grabbedtext2 = StringRegExpReplace ($grabbedtext, $nobreakspace & '([0-9]+?)' & $nobreakspace & $zerowidthnonjoiner, '{$1}')

I have tried splitting up this problem into two separate problems, but I could not. Firstly, can you tell me if my regex syntax is correct? And secondly, do you know where the problem occurs with the no-break spaces being converted to normal spaces, and how I can avoid that?

Thanks
Samuel

Attachment: document with text.doc

jchd (posted November 23, 2020, edited)

The code below shows that there is no emasculation of Unicode strings:

    #include <Array.au3> ; needed for _ArrayDisplay

    $zerowidthnonjoiner = ChrW(0x200C)
    $nonbreakspace = ChrW(0xA0)
    $grabbedtext = _
        $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
        $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
        $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
        $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
        $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

    ; Wrap each number in braces; the lookarounds leave the NBSPs and the ZWNJ in place
    $grabbedtext2 = StringRegExpReplace($grabbedtext, '(?<=' & $nonbreakspace & ')([0-9]+?)(?=' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$1}')

    Local $aChrW = StringToASCIIArray($grabbedtext)
    _NameIt($aChrW)
    _ArrayDisplay($aChrW, "Before")

    Local $aChrW2 = StringToASCIIArray($grabbedtext2)
    _NameIt($aChrW2)
    _ArrayDisplay($aChrW2, "After")

    ; Replace each code point by a readable name (invisible characters) or by the character itself
    Func _NameIt(ByRef $a)
        For $i = 0 To UBound($a) - 1
            Switch $a[$i]
                Case 0xA0
                    $a[$i] = "NBS"
                Case 0x200B
                    $a[$i] = "ZWS"
                Case 0x200C
                    $a[$i] = "ZWNJ"
                Case 0x200D
                    $a[$i] = "ZWJ"
                Case 0xFEFF
                    $a[$i] = "ZWNBS"
                Case Else
                    $a[$i] = ChrW($a[$i])
            EndSwitch
        Next
    EndFunc

Your regex was indeed not suited to the job. Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset. You should be happier with _ClipBoard_{Get|Set}Data using $CF_UNICODETEXT explicitly.

EDIT: after checking, it appears that ClipGet & ClipPut don't change the offending codepoints, so your other issue is elsewhere.

Signature:
This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

JockoDundee (posted November 23, 2020)

Quoting jchd (2 hours earlier):

    The code below shows that there is no emasculation of Unicode strings:... Then basic Clip* functions might convert (= emasculate) Unicode codepoints to Windows charset.

Isn’t Unicode Unisex by nature? Therefore is emasculation thru conversion therapy even possible?

Signature: Code hard, but don’t hard code...

leuce (author, posted November 23, 2020)

Quoting jchd (5 hours earlier):

    Your regex was indeed not suited to the job.

Thanks very much for the code snippet. However, my explanation was not up to scratch either (-: because your regex retains the no-break space and the zero width non joiner, and I want them removed as well. Oh, well, I could just StringReplace to remove them 🙂 that's good enough for me.

For the record, I wanted this:

    [some text] [no break space] [number] [no break space] [zero width non joiner] [some more text]

...to be replaced with this:

    [some text] [left curly bracket] [number] [right curly bracket] [some more text]
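
The StringReplace cleanup mentioned in this post could look roughly like the sketch below. It is only an illustration, not code from the thread: the variable names and the sample string are made up, and the sample stands in for the output of a lookaround regex that has wrapped the number in braces but left the no-break spaces and the zero width non joiner in place.

```autoit
; Hypothetical names; the sample string is an assumption, not data from the thread
Local $sNBSP = ChrW(0xA0)    ; no-break space
Local $sZWNJ = ChrW(0x200C)  ; zero width non joiner

; Stand-in for text where the number is already wrapped in braces
Local $sAfterRegex = "some text" & $sNBSP & "{111}" & $sNBSP & $sZWNJ & "some more text"

; Remove the NBSP directly in front of the opening brace
Local $sClean = StringReplace($sAfterRegex, $sNBSP & "{", "{")

; Remove the NBSP + ZWNJ pair directly behind the closing brace
$sClean = StringReplace($sClean, "}" & $sNBSP & $sZWNJ, "}")

ConsoleWrite($sClean & @CRLF)
```

Anchoring each replacement on the adjacent brace keeps other no-break spaces in the text untouched, which a blanket StringReplace on the bare character would not.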

jchd (posted November 23, 2020)

Ah, then:

    #include <Array.au3> ; needed for _ArrayDisplay

    $zerowidthnonjoiner = ChrW(0x200C)
    $nonbreakspace = ChrW(0xA0)
    $grabbedtext = _
        $nonbreakspace & "111" & $nonbreakspace & $zerowidthnonjoiner & "blah1..." & _
        $nonbreakspace & "222" & $nonbreakspace & $zerowidthnonjoiner & "blah2..." & _
        $nonbreakspace & "333" & $nonbreakspace & $zerowidthnonjoiner & "blah3..." & _
        $nonbreakspace & "444" & $nonbreakspace & $zerowidthnonjoiner & "blah4..." & _
        $nonbreakspace & "555" & $nonbreakspace & $zerowidthnonjoiner & "blah5..."

    ;~ $grabbedtext2 = StringRegExpReplace($grabbedtext, '(' & $nonbreakspace & ')(\d+)(' & $nonbreakspace & $zerowidthnonjoiner & ')', '{$2}')
    ; less verbose
    $grabbedtext2 = StringRegExpReplace($grabbedtext, '(\xA0)(\d+)(\xA0\x{200C})', '{$2}')

    Local $aChrW = StringToASCIIArray($grabbedtext)
    _NameIt($aChrW)
    _ArrayDisplay($aChrW, "Before")

    Local $aChrW2 = StringToASCIIArray($grabbedtext2)
    _NameIt($aChrW2)
    _ArrayDisplay($aChrW2, "After")

    ; Replace each code point by a readable name (invisible characters) or by the character itself
    Func _NameIt(ByRef $a)
        For $i = 0 To UBound($a) - 1
            Switch $a[$i]
                Case 0xA0
                    $a[$i] = "NBS"
                Case 0x200B
                    $a[$i] = "ZWS"
                Case 0x200C
                    $a[$i] = "ZWNJ"
                Case 0x200D
                    $a[$i] = "ZWJ"
                Case 0xFEFF
                    $a[$i] = "ZWNBS"
                Case Else
                    $a[$i] = ChrW($a[$i])
            EndSwitch
        Next
    EndFunc

In fact I misunderstood your requirements and your initial pattern was on par AFAICT. It remains that the content of your grabbed text may not contain what you expect. My snippet demonstrates that:

1) both NBSs and ZWNJ are correctly detected in an input string;
2) the regex correctly matches them as requested.

leuce (author, posted November 23, 2020)

Quoting jchd (2 hours earlier):

    It remains that the content of your grabbed text may not contain what you expect.

I'm beginning to suspect that you're right. By the way, I'm using this script to process text that is copied from a form on a web site. I'm very fortunate in that the web site developer chose to put tags around these characters (on the HTML clipboard), so I'm going to rewrite my script to read the HTML clipboard instead and do the regex find/replace using the tags. Hopefully it then won't matter whether there are NBSP and ZWNJ characters in between.
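
One way to test the suspicion that the grabbed text does not contain what is expected is to list the code of every character that actually arrives from the clipboard. The sketch below is an illustration in the spirit of jchd's snippets rather than code from the thread; the variable names are made up. A no-break space should show up as U+00A0 and a zero width non joiner as U+200C, while an ordinary space shows up as U+0020.

```autoit
#include <Array.au3> ; for _ArrayDisplay

; Hypothetical diagnostic: list the UTF-16 code units of the clipboard text
Local $sGrabbed = ClipGet()
If @error Then
    ConsoleWrite("Clipboard is empty or does not contain text" & @CRLF)
    Exit
EndIf

Local $aCodes = StringToASCIIArray($sGrabbed)
For $i = 0 To UBound($aCodes) - 1
    ; Show each code unit as four hex digits, e.g. U+00A0
    $aCodes[$i] = "U+" & Hex($aCodes[$i], 4)
Next

_ArrayDisplay($aCodes, "Clipboard code units")
```

Copy the text from the web form, run the script, and compare the listed values with what the regex pattern expects at the positions around each number.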