pixelsearch Posted March 25, 2022

1) Short story:

Help file, FileOpen topic:

    ...When reading without an explicit unicode mode flag, the content of the file is examined and a guess is made whether the file is UTF8, UTF16 or ANSI.

My question is: how is this "guess" made? In the script below, opening the file "product.dbf" in read mode doesn't detect that it's an ANSI file, so the results are incorrect (file "product.dbf" attached at the end of this post).

    #include <FileConstants.au3>
    #include <MsgBoxConstants.au3>

    Opt("MustDeclareVars", 1)

    Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
    If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

    Local $sFileRead = FileRead($hFileOpen)
    Local $iKeepError = @error, $iKeepExtended = @extended
    If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")

    ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)
    ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & " " & _
            Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (should be 3, 121)

    FileClose($hFileOpen)

2) Longer story:

This test file was created today with a shareware program, after I encountered the same issue yesterday in this post. So here is how I created "product.dbf" today:

[Screenshots: creation of the file and a hex dump of its contents]

I could explain the values found in the memory dump above, but it would be off-topic. Anyway, accurate explanations of the values can be found in this link and/or that link. Forget the 16 green marked bytes above (they correspond to the 1st record) and let's focus on byte 0 (0x03) and byte 1 (0x79, i.e. 121 in decimal).

3) Back to the initial script:

The values returned by ConsoleWrite are wrong: 48 120
The correct values are 3 121, and you will get them only if you add $FO_ANSI (512) yourself when opening the file.

That's why I asked: how is the FileOpen() guess made?

Thanks

Attachment: Product.dbf
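The workaround mentioned at the end of the post can be sketched as follows. This is only a minimal sketch: it assumes Product.dbf is in the working directory, and the values quoted in the last comment are the ones reported in the post, not independently verified.

```autoit
#include <FileConstants.au3>

Opt("MustDeclareVars", 1)

; Force ANSI mode instead of relying on FileOpen's automatic detection.
Local $hFile = FileOpen("Product.dbf", BitOR($FO_READ, $FO_ANSI))
If $hFile = -1 Then Exit ConsoleWrite("Open error" & @CRLF)

Local $sData = FileRead($hFile)
Local $iError = @error
FileClose($hFile)
If $iError <> 0 Then Exit ConsoleWrite("Read error" & @CRLF)

; With $FO_ANSI the result is a string, so StringMid/Asc look at file content.
; The post reports 3 and 121 for the first two bytes in this mode.
ConsoleWrite(Asc(StringMid($sData, 1, 1)) & " " & Asc(StringMid($sData, 2, 1)) & @CRLF)
```

Combining the flags with BitOR rather than writing 512 directly keeps the intent readable; $FO_READ is 0, so the result is numerically the same.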
jchd Posted March 25, 2022 (edited)

From what I recall, the leading part of the file is scanned for conformance to one of the UTF8 or UTF16-LE (w/ or w/o BOM) encodings. If a byte > 0x7F that does not introduce a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI (improper term here).

Although it is not mentioned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary. And your example shows exactly this. In the script below, the function vd() is a variable dump (not provided here to keep things short).

    #include <FileConstants.au3>
    #include <MsgBoxConstants.au3>

    Opt("MustDeclareVars", 1)

    Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
    If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

    Local $sFileRead = FileRead($hFileOpen)
    Local $iKeepError = @error, $iKeepExtended = @extended
    If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")
    FileClose($hFileOpen)

    ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)
    vd($sFileRead, 0, 0, 0)
    vd(String($sFileRead), 0, 0)
    vd(BinaryMid($sFileRead, 1, 1))
    vd(BinaryMid($sFileRead, 2, 1))
    ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & " " & _
            Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (correct!)

The console output I get is:

    @extended = 146
    Binary (146) 0x03790113030000006100100000000000 ... 6D6E6F7020202020203131312E31311A
    String (294) '0x037901130300000061001000000000 ... 6E6F7020202020203131312E31311A'
    Binary (1) 0x03
    Binary (1) 0x79
    String (3) '121'
    48 120

Here you see that the output of FileRead is a binary variant. StringMid forces this binary to be converted to a string, which is its hexadecimal text representation. The first character of this string is '0', whose ASCII code is decimal 48. The next character in the string is 'x', whose ASCII code is decimal 120.
Edited March 25, 2022 by jchd
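The conversion described above can be reproduced without the file. The sketch below hard-codes the first five bytes shown in the console output (03 79 01 13 03) and contrasts the implicit string conversion with byte-wise access; converting a one-byte slice through Hex() and Dec() is just one possible way to get a number and is an illustrative choice, not something taken from the thread.

```autoit
Opt("MustDeclareVars", 1)

; First five bytes of Product.dbf, copied from the console output above.
Local $dData = Binary("0x0379011303")

; Implicit conversion: the string is the hex text "0x0379011303",
; so its first two characters are "0" and "x" (codes 48 and 120).
Local $sText = String($dData)
ConsoleWrite(Asc(StringMid($sText, 1, 1)) & " " & Asc(StringMid($sText, 2, 1)) & @CRLF)

; Byte-wise access: slice one byte with BinaryMid, turn it into hex text
; with Hex(), then into a number with Dec().
Local $iByte1 = Dec(Hex(BinaryMid($dData, 1, 1))) ; byte 0x03
Local $iByte2 = Dec(Hex(BinaryMid($dData, 2, 1))) ; byte 0x79
ConsoleWrite($iByte1 & " " & $iByte2 & @CRLF)
```

The point is that string functions applied to a binary variant operate on its textual "0x..." form, whereas the Binary* functions operate on the bytes themselves.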
pixelsearch Posted March 25, 2022 (Author)

Bravo jchd

After reading your post, I tested the following: I replaced many bytes with 0x20 (from the 1st 0x00, found at position 5, to the end of the file), then:

1) If you leave only one byte = 0x00, then ConsoleWrite shows: 48 120

2) If you overwrite that 0x00 byte with 0x20 (so that not a single 0x00 byte remains), then ConsoleWrite shows: 3 121

Shouldn't the help file be amended with your sentence, stating that "if a null byte (0x00) is encountered, then the file is read as binary", instead of "a guess is made"?
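One way to observe which kind of variant FileRead hands back in each case is VarGetType(). The following is a hedged sketch, assuming Product.dbf is in the working directory; according to the thread, the auto-detected read of a file containing nulls should come back as binary, while the $FO_ANSI read should come back as a string.

```autoit
#include <FileConstants.au3>

Opt("MustDeclareVars", 1)

; Compare automatic detection with an explicit ANSI open.
Local $aModes[2] = [$FO_READ, BitOR($FO_READ, $FO_ANSI)]

For $i = 0 To UBound($aModes) - 1
    Local $hFile = FileOpen("Product.dbf", $aModes[$i])
    If $hFile = -1 Then ContinueLoop ; skip this mode if the file cannot be opened

    Local $vData = FileRead($hFile)
    FileClose($hFile)

    ; VarGetType reports the internal type of the variant FileRead returned.
    ConsoleWrite("mode " & $aModes[$i] & ": " & VarGetType($vData) & @CRLF)
Next
```

Checking the type before applying string functions avoids the silent conversion to "0x..." text discussed earlier in the thread.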
jchd Posted March 26, 2022

This is just guesswork on my part, nothing close to a specification. Only @jpm & @Jon can tell: maybe other control characters, or even 0x7F, also trigger the switch to binary.
TheDcoder Posted March 26, 2022

@jchd The code is actually open-source and published as a library:
https://github.com/AutoItConsulting/text-encoding-detect

A detailed write-up of how it works is on the AutoIt Consulting website:
https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/
jchd Posted March 26, 2022

I wasn't aware. From a quick look at the top of this library's C++ code, the presence of null byte(s) denotes binary once UTF16 is ruled out. But we have no evidence that this is exactly what current AutoIt implements in full gory detail, even if both pieces of code must be quite similar.
pixelsearch Posted March 26, 2022 (Author, edited)

6 hours ago, jchd said:

    Not mentionned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary.

Luckily, I just found it written in the help file, not in the FileOpen topic, but in the Unicode Support topic:

    Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.

This Unicode topic is reached in our .chm help file by clicking this line in the FileOpen topic (I discovered this 1 hour ago!):

    See "Unicode Support" for a detailed description.

Now I just tried, without FileOpen, the FileGetEncoding("product.dbf") function. It returned 16 (which means binary) for the "product.dbf" file. This value is not listed among the "Success" return values of the function in the help file (the success list goes from 32 to 512).

Also, the help file example of FileGetEncoding() is a bit strange: it checks for @error, but @error will always be 0. A test with FileGetEncoding("sdfgsdfgggfghhsfg.txt") will bypass the @error test in the help file example, and the function returns -1.

6 hours ago, jchd said:

    If ever a byte > 0x7F not introducing a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI

Very true (after testing). I just made this test on a 10-byte file:

    C2 70 C2 80 C2 80 C2 80 C2 80

It opens as $FO_ANSI (512) because there is no BOM and C2 70 is not a valid UTF-8 sequence [UTF-8 encodes character 127 as 0x7F, then jumps to C2 80 for character 128, says Wiki]. ConsoleWrite would return 194 112 if this was our "product.dbf" in the script above:

    0xC2 = 194
    0x70 = 112

Now, the complementary test: the following 9-byte file would open as $FO_UTF8_NOBOM (256)

    C3 A9 E2 82 AC C3 A9 C3 A9

because there are 4 valid UTF-8 sequences in it: C3 A9, E2 82 AC, C3 A9, C3 A9.

Edit: thx @TheDcoder for the 2 links, it looks very interesting.

Edited March 26, 2022 by pixelsearch
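The two byte-sequence tests above can be reproduced with a short script. This is a sketch under stated assumptions: _WriteBytes and the two temporary file names are hypothetical and exist only for this illustration, and the values quoted in the comments (512 and 256) are the ones reported in the post, not guaranteed results. Since FileGetEncoding() signals failure through its return value, the helper output should be compared against -1 rather than tested via @error.

```autoit
#include <FileConstants.au3>

Opt("MustDeclareVars", 1)

; _WriteBytes is a hypothetical helper for this sketch, not part of AutoIt.
; It writes raw bytes to a file opened in binary overwrite mode.
Func _WriteBytes($sPath, $dBytes)
    Local $hFile = FileOpen($sPath, BitOR($FO_OVERWRITE, $FO_BINARY))
    If $hFile = -1 Then Return SetError(1, 0, False)
    FileWrite($hFile, $dBytes)
    FileClose($hFile)
    Return True
EndFunc

; Hypothetical test file names in the temp directory.
Local $sInvalid = @TempDir & "\enc_test_invalid.bin"
Local $sValid = @TempDir & "\enc_test_valid.bin"

_WriteBytes($sInvalid, Binary("0xC270C280C280C280C280")) ; C2 70 is not a valid UTF-8 sequence
_WriteBytes($sValid, Binary("0xC3A9E282ACC3A9C3A9"))     ; four valid UTF-8 sequences

; The post reports 512 ($FO_ANSI) for the first file and
; 256 ($FO_UTF8_NOBOM) for the second; -1 would mean the call failed.
ConsoleWrite("invalid UTF-8 file: " & FileGetEncoding($sInvalid) & @CRLF)
ConsoleWrite("valid UTF-8 file:   " & FileGetEncoding($sValid) & @CRLF)

FileDelete($sInvalid)
FileDelete($sValid)
```

Writing the files in binary mode matters here: it ensures the bytes on disk are exactly the ones under test, with no BOM or encoding conversion added by FileWrite.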