photonbuddy
Posted July 23, 2022

Hi All,

I am writing a script that I use to save an image from a Reddit post. As most of these save as a random string of letters, my script takes the post title (from the window title of the browser) and uses that as the filename.

The problem is that, while Windows will save and display the emojis in File Explorer, my image viewer (ACDSee, a very old pre-bloat version) can't display the file.

How do I process the string to remove all emojis?

Thanks for any help.

Luke94
Posted July 23, 2022 (edited)

Looks like AutoIt replaces the emojis with ??.

    Local $a = 'Test Title 😭'
    ConsoleWrite($a)

Output:

    Test Title ??

Maybe try to StringReplace the question marks with nothing? I guess one of the downsides is that it would remove legitimate question marks, should it work at all.

jchd
Posted July 23, 2022

ConsoleWrite() silently "converts" Unicode text to ANSI, replacing almost all non-ANSI characters by question marks. This doesn't work well with non-Latin scripts. Below, CW() is a homebrew Unicode-aware ConsoleWrite():

    ; Mixed language strings
    $s = "Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة"
    CW($s)
    ConsoleWrite($s & @LF)
    ; A family with different Fitzpatrick settings = only one glyph
    $s = "Our family " & ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
    CW($s)
    ConsoleWrite($s & @LF)

Result:

    Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة
    ??????? ???????? ??? ???? ?????? ????? ?????
    Our family 👨🏻👩🏿👦🏽
    Our family ??????????????

I don't know which charset this legacy version of ACDSee handles for filenames.

You can remove emojis, or a wider range of Unicode characters, explicitly using a regexp. BUT there is a pitfall: AutoIt's native charset is UCS-2, a restriction of UTF-16 to the BMP (Unicode plane 0) using 16-bit code units. And there is more: Unicode codepoints in planes 1..16 are represented by pairs of surrogate values. For instance, 😭 is represented in an AutoIt string as ChrW(0xD83D) & ChrW(0xDE2D).

You might think: pretty easy, just use a regexp pattern to match and replace these values, using

    StringRegExpReplace($s, "[\x{D800}-\x{DFFF}]", "-")

NO! That fails because AutoIt's invocation of PCRE (the regexp engine it uses) internally merges the two surrogates into the actual 😭 codepoint 0x1F62D (LOUDLY CRYING FACE), so the pattern never sees the individual surrogates.

This will replace every run of plane 1 codepoints (U+10000 to U+1FFFF, which is where emojis live) by an underscore:

    $s = "Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة Test Title 😭" & @LF
    $s &= "Our family " & ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
    CW($s)
    $t = StringRegExpReplace($s, "[\x{10000}-\x{1FFFF}]+", "_")
    CW($t)

Result:

    Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة Test Title 😭
    Our family 👨🏻👩🏿👦🏽
    Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة Test Title _
    Our family ___

Note that in the last line there are 3 "people" joined with ChrW(0x200D) [Zero Width Joiner]. The joiner is a BMP character, so it is not matched and splits the text into three separate runs, hence three underscores.

Yet I suspect that your image viewer will bark at codepoints outside the default 8-bit codepage of your system. If you still get question marks in the last example above, then your only bet is to correctly convert characters into their 8-bit codepage counterpart, or into a usable substitution character when that is impossible.

    Func _StringToCodepage($sStr, $iCodepage = Default)
        If $iCodepage = Default Then $iCodepage = 65001 ; or Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
        ; First call: no output buffer, returns the number of bytes needed
        Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
                "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
        Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
        ; Second call: perform the conversion into the buffer
        $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
                "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
        Return DllStructGetData($tCP, 1)
    EndFunc   ;==>_StringToCodepage

Invoke this conversion function with the codepage ID which suits your needs.
See https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

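CW() is called in the examples above, but its definition does not appear in the thread. A minimal sketch of what such a helper could look like, assuming the script runs from an editor whose output pane is set to interpret bytes as UTF-8 (for example SciTE configured that way). The body is an assumption, not necessarily jchd's own code:

```autoit
; Hypothetical Unicode-aware stand-in for ConsoleWrite().
; Flag 4 encodes the string as UTF-8 bytes; flag 1 reads those bytes back
; one byte per character, so ConsoleWrite() passes the UTF-8 bytes through.
Func CW($s = "")
    ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc   ;==>CW

; Sample call with the surrogate pair for U+1F62D
CW("Test Title " & ChrW(0xD83D) & ChrW(0xDE2D))
```

If the console is not in UTF-8 mode, the same call will show multi-byte garbage instead of the emoji, which is a quick way to check the setting.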
photonbuddy (Author)
Posted July 23, 2022
Solution

2 hours ago, Luke94 said: "Maybe try and StringReplace the question marks with nothing? I guess one of the downsides is it would remove legitimate question marks, should it work at all."

I tried this after seeing the 2 question marks, but AutoIt sees the emoji, not the question marks. Ironically, I can actually use StringReplace and pass in the copied emoji character, and it will work fine, but then I have to do a StringReplace for every emoji.

The really annoying thing is that if I use StringIsASCII, AutoIt happily tells me the string is ASCII, probably because internally it's converting the emojis to "??", which are ASCII.

58 minutes ago, jchd said: "ConsoleWrite() silently "converts" Unicode text to ANSI, replacing almost all non-ANSI characters by question marks."

While most of what you wrote went a little over my head, this little bit took me down a path which looks to have solved my issue.

Using StringToBinary converts emojis (the couple of test ones, anyway) to the aforementioned "??", and then BinaryToString gives me a string I can use.

Thanks to all who replied. Much appreciated.

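The post above describes the round trip without showing it. A minimal sketch, assuming the default ANSI flag (1) for both calls; $sTitle and $sFileName are hypothetical names, and the final clean-up step is an addition, since "?" is not allowed in Windows filenames:

```autoit
; Hypothetical sample title ending with the surrogate pair for U+1F62D
Local $sTitle = "Test Title " & ChrW(0xD83D) & ChrW(0xDE2D) & " picture"

; Flag 1 = ANSI: characters with no ANSI counterpart are expected to become "?"
Local $sAnsi = BinaryToString(StringToBinary($sTitle, 1), 1)

; Drop the question marks (this also removes genuine ones) and tidy the
; whitespace: flag 7 strips leading, trailing and doubled spaces
Local $sFileName = StringStripWS(StringReplace($sAnsi, "?", ""), 7)

MsgBox(0, "", $sFileName)
```

Unlike a plain ASCII filter, this keeps any character that exists in the system's ANSI codepage, which is the behaviour jchd argues for later in the thread.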
Deye
Posted July 23, 2022 (edited)

This gets the result necessary to do it equally well.

    $s = "Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة* (Test Title) ?😭 (Our family) "
    MsgBox(0, "", StringStripWS(StringRegExpReplace($s, "[\x00-\x7F]\K|\W", ""), 7))
    ; Or
    MsgBox(0, "", StringStripWS(StringRegExpReplace($s, "[^ -~]", ""), 7))

jchd
Posted July 23, 2022 (edited)

@Deye not true. There are significant differences between Unicode and the upper 8-bit ANSI range. Which characters get badly mapped depends on which locale is in effect. Your code removes all characters beyond 0x7F, whereas I find it better to have mappable Unicode converted to the corresponding 0x80-0xFF locale counterpart.

Deye
Posted July 23, 2022

54 minutes ago, jchd said: "characters depend on which locale is in effect."

Yes, it really depends on usability and manipulation. In this case, the code should be changed to make it usable.