mike1950r Posted July 6, 2021

Hi,

If I use FileGetEncoding() to get the encoding of a file, I always get 256 instead of 512 for an ANSI file. I checked the format with Notepad++, and it is definitely an ANSI file.

Thanks for your assistance.

Cheers, mike

jchd Posted July 6, 2021

Attach an example to stop guesswork.

mike1950r Posted July 6, 2021 (edited)

OK, thanks for the reply. I have attached just a normal txt file.

    #include <MsgBoxConstants.au3>

    ; Ask AutoIt which encoding it detects for the attached file
    Local $iEncoding = FileGetEncoding("test.txt")
    MsgBox($MB_TOPMOST, "", $iEncoding, 0)

Cheers, mike

Attachment: test.txt

TheXman Posted July 6, 2021 (edited)

From the Help File under the "Unicode Support" topic:

File operations on text files not opened with FileOpen() and explicit unicode flags auto-detect encoding similar to most modern editors. This includes all file functions that are used with a filename, for example FileRead("filename.txt"). Specifically:

- Files containing a BOM will be opened in the relevant mode as per that BOM. UTF-8 and UTF-16 BOMs are checked.
- UTF-8 and UTF-16 files without a BOM will be automatically detected and opened in the relevant mode.
- Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
- Files containing only characters 1-127 are opened in UTF-8 with no BOM ($FO_UTF8_NOBOM) mode by default. Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
- Files containing only characters 1-255 are opened in ANSI ($FO_ANSI) mode by default.

Due to the above FileGetEncoding() now returns 512 ($FO_ANSI) or 256 ($FO_UTF8_NOBOM) instead of 0 which was undocumented but indicated ANSI.

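Editor's sketch: the numeric return values are easier to read when translated back into the constant names from FileConstants.au3. The helper name _EncodingName is hypothetical (it is not part of AutoIt), and the file name is simply the one used earlier in the thread.

```autoit
#include <FileConstants.au3>

; "test.txt" is the file name from the earlier post; adjust the path as needed.
ConsoleWrite(_EncodingName(FileGetEncoding("test.txt")) & @CRLF)

; Hypothetical helper for this sketch: translate a FileGetEncoding() result
; into the name of the matching constant from FileConstants.au3.
Func _EncodingName($iEncoding)
    Switch $iEncoding
        Case $FO_ANSI
            Return "$FO_ANSI (512)"
        Case $FO_UTF8_NOBOM
            Return "$FO_UTF8_NOBOM (256)"
        Case $FO_UTF8
            Return "$FO_UTF8 (128)"
        Case $FO_UTF16_LE
            Return "$FO_UTF16_LE (32)"
        Case $FO_UTF16_BE
            Return "$FO_UTF16_BE (64)"
        Case Else
            ; Anything not listed above, including a failure value
            Return "other or error (" & $iEncoding & ")"
    EndSwitch
EndFunc
```

Only the encodings discussed in this thread are listed; every other value falls through to the Case Else branch.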
mike1950r Posted July 6, 2021

Thanks xman,

If I understand you right: $iEncoding = $FO_UTF8_NOBOM would mean ANSI, as well as $iEncoding = $FO_ANSI?

Cheers, mike

TheXman Posted July 6, 2021 (edited)

Not quite. FileGetEncoding() with a file name, as opposed to a handle that opened the file with an explicit encoding flag, will open an ANSI file that contains only characters 1-127 as UTF8 no BOM. So the return value is $FO_UTF8_NOBOM.

That was a "code breaking" change that was documented in a previous version of AutoIt. You can look up which version if you feel so inclined.

mike1950r Posted July 6, 2021

Hi again,

This is quite confusing. Why is an ANSI file opened as a UTF8 no-BOM file? But this would only be for checking the encoding.

In my case I would check the encoding with FileGetEncoding(). If I get ANSI or UTF8 no BOM, I would then open the file with the ANSI flag for overwriting or appending, and save it like this. Is this OK then?

Cheers, mike

Nine Posted July 6, 2021

Not exactly.

If you look at the example provided with FileGetEncoding, it says:

    ; The value returned for this example should be 0 or $FO_ANSI.

But it is not. It returns $FO_UTF8_NOBOM (256). However, if you add a character above 127 (as a comment or whatever), it will now return 512. Like this:

    ; ¢

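Editor's sketch: the behaviour Nine describes can be reproduced with two small files written as raw bytes, one holding only bytes 1-127 and one holding a single byte above 127. The temp file names and the helper _WriteBytes are made up for this illustration. Going by the help-file rules quoted earlier, the first call should report 256 and the second 512, but this has not been run against every AutoIt version.

```autoit
#include <FileConstants.au3>

; Hypothetical temp file names, chosen for this sketch only.
Local $sAscii = @TempDir & "\enc_ascii.txt"
Local $sHigh = @TempDir & "\enc_high.txt"

; Write raw bytes so that no encoding conversion happens on output.
_WriteBytes($sAscii, Binary("0x457370616E61")) ; "Espana", bytes 1-127 only
_WriteBytes($sHigh, Binary("0x45737061F161"))  ; "España" in codepage 1252

ConsoleWrite("Bytes 1-127 only: " & FileGetEncoding($sAscii) & @CRLF)
ConsoleWrite("With byte 0xF1:   " & FileGetEncoding($sHigh) & @CRLF)

FileDelete($sAscii)
FileDelete($sHigh)

; Hypothetical helper: overwrite $sPath with the given binary data.
Func _WriteBytes($sPath, $dBytes)
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE + $FO_BINARY)
    If $hFile = -1 Then Return SetError(1, 0, False)
    FileWrite($hFile, $dBytes)
    Return FileClose($hFile)
EndFunc
```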
mike1950r Posted July 6, 2021

Nine, thanks a lot for your assistance. I fear I'm just too stupid to understand. This really confuses me.

Cheers, mike

TheXman Posted July 6, 2021 (edited)

I don't understand your issue. FileGetEncoding() tells you what encoding was used when the file was opened. If FileGetEncoding() was given a file name, or a handle obtained from a FileOpen() call without an encoding flag, then the encoding was determined using a set of predefined rules. Keep in mind that FileGetEncoding(), when supplied with a file name, still opens the file.

1 hour ago, mike1950r said:
    why is an ansi file opened as an utf8 nobom file.

Because UTF8 no BOM can read/write an ANSI encoded file as long as it contains only characters 1-127; those bytes are identical in both encodings.

Why don't you discuss the problem you are trying to solve instead of the solution that you've come up with? Maybe there's a better way to do whatever it is you are trying to do.

mike1950r Posted July 6, 2021

Thanks xman,

I understood that I should overwrite the UTF8 no-BOM file as ANSI, right? If so, in my case I treat a detected UTF8 no BOM the same as ANSI and overwrite as ANSI. This is all right for me.

(Strange, though, that other editors like Notepad, Notepad++ etc. are able to detect this kind of file as ANSI. Maybe they have another method for detecting the encoding.)

Thanks a lot for your help, and excuse my slow understanding. Fortunately I'm much faster with other topics. 🙂

Cheers, mike

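Editor's sketch of the approach mike describes: detect the encoding first, then force ANSI with an explicit flag when appending. The path and the appended text are placeholders, not taken from the thread.

```autoit
#include <FileConstants.au3>

; Hypothetical path used only for this sketch.
Local $sPath = @ScriptDir & "\test.txt"

Local $iEncoding = FileGetEncoding($sPath)
If $iEncoding = $FO_ANSI Or $iEncoding = $FO_UTF8_NOBOM Then
    ; Force ANSI for writing, regardless of what auto-detection reported.
    Local $hFile = FileOpen($sPath, $FO_APPEND + $FO_ANSI)
    If $hFile = -1 Then
        ConsoleWrite("Could not open " & $sPath & @CRLF)
    Else
        FileWriteLine($hFile, "appended line")
        FileClose($hFile)
    EndIf
Else
    ConsoleWrite("Unexpected encoding: " & $iEncoding & @CRLF)
EndIf
```

One caveat with this design: $FO_UTF8_NOBOM is also what auto-detection reports for a genuine UTF-8 file with multi-byte sequences, so treating every 256 as ANSI is only safe when the files are known to come from an ANSI source.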
jchd Posted July 7, 2021 (edited)

Here's the underlying issue with text files.

Extended ANSI uses one byte per character and has 128 "upper" character codes [0x80, 0xFF], which are assigned to a set of characters defined by the codepage in use. The codepage is not explicit, and this is a problem for information interchange.

Unicode has a very large character set encompassing all glyphs ever used by humans. The range of Unicode characters is [0x000000, 0x10FFFF], which is 1 114 112 possible characters! Obviously a Unicode character (a codepoint) needs something larger than one byte to represent it, contrary to previous codepages. This is where encoding enters the scene.

A useful encoding is UTF8, which uses sequences of 1 to 4 bytes to represent a character. See UTF8 to understand how this encoding works. The lower part of ANSI is mapped verbatim to the first 128 Unicode codepoints. In UTF8, a byte > 0x7F introduces a sequence of more than one byte, and this sequence has to conform to the UTF8 encoding. This is what FileGetEncoding tries to determine.

The word "España" has different representations in the Windows Occidental codepage and in UTF8:

    ANSI Occidental codepage 1252
     E  s  p  a  ñ  a
    45 73 70 61 F1 61

    UTF8 (NoBOM)
     E  s  p  a   ñ    a
                ┌─┴─┐
    45 73 70 61 C3 B1 61

    UTF8 (BOM)
              E  s  p  a   ñ    a
                         ┌─┴─┐
    EF BB BF 45 73 70 61 C3 B1 61
    └──┬───┘
      BOM

The optional BOM (Byte Order Mark) serves as a special marker to help distinguish UTF8 from byte codepages.

If you FileOpen a file with the first content without specifying a mode, AutoIt will look for invalid UTF8 sequences in the first 64k bytes. If any are found, the file will be opened as ANSI, else as UTF8. The sequence 0xF1 0x61 is an invalid UTF8 sequence, hence the file is treated as ANSI.

If a file with the second example is mistakenly opened as ANSI, it would display as "EspaÃ±a", which is probably not what users want.

If a UTF8 BOM is found, it is ignored but the file is treated as UTF8 without further examination.

EDIT: The file you provided is empty, hence will by default be considered as UTF8 w/o BOM.

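Editor's sketch: the byte sequences for "España" can be inspected directly with StringToBinary(). The ñ is built with ChrW() so the result does not depend on how the script file itself is saved. The ANSI line assumes the system codepage is 1252; on a machine with a different codepage that line may show other bytes.

```autoit
#include <StringConstants.au3>

; Build the string with ChrW() so the result does not depend on the
; encoding of the script file itself.
Local $sWord = "Espa" & ChrW(0xF1) & "a" ; "España"

; ANSI conversion uses the system codepage; 0xF1 for ñ assumes codepage 1252.
ConsoleWrite("ANSI : " & StringToBinary($sWord, $SB_ANSI) & @CRLF)

; UTF-8 conversion is independent of the system codepage.
ConsoleWrite("UTF-8: " & StringToBinary($sWord, $SB_UTF8) & @CRLF)
```

The binary values are printed in AutoIt's 0x... hex notation, so they can be compared byte by byte with the table in the post.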
JockoDundee Posted July 7, 2021

Well said @jchd!

Since you're so smart, why don't you explain why you sometimes see all those ???? when opening a file?

jchd Posted July 7, 2021 (edited)

Most of the time it's because the font used to display the file content doesn't have a representation for the Unicode codepoints found. But there may be other reasons. If you have an example I'll be happy to help.

For instance, I use the latest DejaVu Sans Mono font for all fixed-size uses, including my SciTE UTF8 console. This allows the following code

    ; Mixed language strings
    $s = "Μεγάλο πρόβλημα Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة"
    CW($s)

    ; A family with different Fitzpatrick settings = only one glyph
    $s = ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & _
         ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & _
         ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
    CW($s)

to display this (CW() is a Unicode-aware ConsoleWrite):

    Μεγάλο πρόβλημα Большая проблема 大问题 बड़ी समस्या مشكلة كبيرة
    👨🏻👩🏿👦🏽

You can also open cmd.exe, then use chcp 65001 and try to paste the content of the result above. Several codepoints show as blank rectangular placeholders, others as unknown (a question mark on a black hexagonal background).

If you use an incomplete Unicode font (no font covers all of Unicode), you're most likely going to see some garbage or rather many question marks, depending on how the font is coded to represent codepoints it has no representation for.

EDIT: Forgot to mention that there are codepoints reserved for surrogates [0xD800, 0xDFFF] which, if found standalone, cause an invalid codepoint detection by font rendering engines. There are also private use ranges where Unicode doesn't define a representation, and currently unassigned codepoints which may get assigned in a future version of the character set.

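Editor's sketch: the ChrW() pairs in the emoji example are UTF-16 surrogate pairs. The standard UTF-16 arithmetic splits a codepoint above 0xFFFF into a high and a low surrogate; the helper name _SurrogatePair is hypothetical and exists only for this illustration.

```autoit
; U+1F468 (MAN) is the first codepoint of the emoji sequence in the post.
ConsoleWrite(_SurrogatePair(0x1F468) & @CRLF)

; Hypothetical helper: split a codepoint above 0xFFFF into its
; UTF-16 surrogate pair and return both halves as hex text.
Func _SurrogatePair($iCodepoint)
    Local $iOffset = $iCodepoint - 0x10000
    Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; top 10 bits (positive shift = right)
    Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF) ; bottom 10 bits
    Return "0x" & Hex($iHigh, 4) & " 0x" & Hex($iLow, 4)
EndFunc
```

For U+1F468 the arithmetic gives 0xD83D and 0xDC68, which are the first two ChrW() values in the post.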
mike1950r Posted July 7, 2021

6 hours ago, jchd said:

    ANSI Occidental codepage 1252
     E  s  p  a  ñ  a
    45 73 70 61 F1 61

    UTF8 (NoBOM)
     E  s  p  a   ñ    a
                ┌─┴─┐
    45 73 70 61 C3 B1 61

    UTF8 (BOM)
              E  s  p  a   ñ    a
                         ┌─┴─┐
    EF BB BF 45 73 70 61 C3 B1 61
    └──┬───┘
      BOM

jchd, this was very helpful, thanks a lot.

Cheers, mike

JockoDundee Posted August 11, 2021

On 7/7/2021 at 2:12 AM, jchd said:
    But there may be other reasons. If you have an example I'll be happy to help.

As it turns out, your expertise is needed at bogus cybersymposium, as no one knows what to make of:

more information here: