The FileOpen() guess

pixelsearch · March 25, 2022

1) Short story :

Help file, FileOpen topic :

...When reading without an explicit unicode mode flag, the content of the file is examined and a guess is made whether the file is UTF8, UTF16 or ANSI.

My question is : how is this "guess" made ?
Because in the script below, opening the file "product.dbf" in read mode doesn't detect that it's an ANSI file, so the results are incorrect (the file "product.dbf" is attached at the end of this post)

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Opt("MustDeclareVars", 1)

Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

Local $sFileRead = FileRead($hFileOpen)
Local $iKeepError = @error, $iKeepExtended = @extended
If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")

ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)

ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & "   " & _
             Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (should be 3, 121)

FileClose($hFileOpen)

2) Longer story :

This test file was created today with a shareware program, after I encountered the same issue yesterday in this post.

So here is how I created "product.dbf" today :

[Image: product_1.png]

[Image: product_2.png]

I could explain the values found in the memory dump above, but it would be off-topic. Anyway, accurate explanations of the values can be found in this link and/or that link.

Forget the 16 green-marked bytes above (they correspond to the 1st record) and let's focus on byte 0 (0x03) and byte 1 (0x79, i.e. 121 in decimal).

3) Back to the initial script :
The values returned by ConsoleWrite are wrong : 48 120
The correct values are 3 121, and you get them only if you add $FO_ANSI (512) yourself when opening the file.
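A minimal sketch of that workaround, assuming the same "Product.dbf" sits next to the script. The only change from the script above is the explicit $FO_ANSI flag, so no encoding guess is made:

```autoit
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

; Force ANSI mode so that FileOpen() does not try to guess the encoding.
Local $hFileOpen = FileOpen("Product.dbf", $FO_READ + $FO_ANSI)
If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

Local $sFileRead = FileRead($hFileOpen)
If @error <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")
FileClose($hFileOpen)

; With the explicit flag, this post reports 3 and 121 for the first two bytes.
ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & "   " & _
             Asc(StringMid($sFileRead, 2, 1)) & @CRLF)
```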

That's why I asked : how is the FileOpen() guess made ?
Thanks

Product.dbf

jchd · March 25, 2022

From what I recall, the leading part of the file is scanned for conformance to one of the UTF8 or UTF16-LE (w/ or w/o BOM) encodings. If a byte > 0x7F that does not introduce a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI (improper term here). Although it is not mentioned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary.

And your example shows exactly this. In the script below, the function vd() is a variable dump (not provided here to keep things short).

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Opt("MustDeclareVars", 1)

Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

Local $sFileRead = FileRead($hFileOpen)
Local $iKeepError = @error, $iKeepExtended = @extended
If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")
FileClose($hFileOpen)
ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)

vd($sFileRead, 0, 0, 0)
vd(String($sFileRead), 0, 0)
vd(BinaryMid($sFileRead, 1, 1))
vd(BinaryMid($sFileRead, 2, 1))

ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & "   " & _
             Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (correct!)

The console output I get is:

@extended = 146
Binary (146)             0x03790113030000006100100000000000 ... 6D6E6F7020202020203131312E31311A

String (294)             '0x037901130300000061001000000000 ... 6E6F7020202020203131312E31311A'

Binary (1)               0x03

Binary (1)               0x79

String (3)               '121'

48   120

Here you see that the output of FileRead is a binary variant. StringMid forces this binary to be converted to a string. The first character of this string is '0' whose ASCII code is decimal 48. The next character in the string is 'x' whose ASCII code is decimal 120.
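To make the script robust against this, one option is to check the variant type before treating it as text. The following is only a sketch under the same assumptions as the script above (same file, no explicit mode flag). It extracts single bytes with BinaryMid() and converts each one with BinaryToString(), whose default flag is ANSI:

```autoit
#include <FileConstants.au3>

Local $hFile = FileOpen("Product.dbf", $FO_READ)
If $hFile = -1 Then Exit 1

Local $vData = FileRead($hFile)
FileClose($hFile)

If IsBinary($vData) Then
    ; FileRead() returned a binary variant: take one byte at a time and
    ; convert that single byte to a character before calling Asc().
    Local $iByte1 = Asc(BinaryToString(BinaryMid($vData, 1, 1)))
    Local $iByte2 = Asc(BinaryToString(BinaryMid($vData, 2, 1)))
    ConsoleWrite("Binary variant: " & $iByte1 & "   " & $iByte2 & @CRLF)
Else
    ; FileRead() returned a string: the original approach works as-is.
    ConsoleWrite("String variant: " & Asc(StringMid($vData, 1, 1)) & "   " & _
                 Asc(StringMid($vData, 2, 1)) & @CRLF)
EndIf
```

Converting byte by byte avoids calling StringMid() on the whole binary, which is what produced the '0' and 'x' characters.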

Edited March 25, 2022 by jchd

pixelsearch · March 25, 2022

Bravo jchd

I tested what follows, after reading your post, by replacing many bytes with 0x20 (starting from the 1st 0x00 found at position 5, to the end), then :

1) If you leave only 1 byte = 0x00 (pic below) then ConsoleWrite shows :
48 120

[Image: product_3.png]

2) If you overwrite that 0x00 byte with 0x20 (so not a single 0x00 byte exists anymore) then ConsoleWrite shows :
3 121

Shouldn't the help file be amended with your sentence, stipulating that "if a null byte (0x00) is encountered, then the file is read as binary." instead of "a guess is made" ?

jchd · March 26, 2022

This is just guesswork on my part, nothing close to a specification. Only @jpm & @Jon can tell: maybe other control characters trigger the switch to binary, or even 0x7F.

TheDcoder · March 26, 2022

@jchd The code is actually open-source and published as a library:

https://github.com/AutoItConsulting/text-encoding-detect

A detailed write-up of how it works on the AutoIt Consulting website: https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/

jchd · March 26, 2022

I wasn't aware.

First, just from a quick look at the top of this library's C++ code, the presence of NULL(s) denotes binary if UTF16 is ruled out. But we have no evidence that this is what current AutoIt implements in full gory detail, even if both pieces of code must be quite similar.

pixelsearch · March 26, 2022

6 hours ago, jchd said:

Not mentioned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary.

Luckily, I just found it written in the help file, not in the FileOpen topic, but... in the Unicode Support topic :

Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.

This Unicode topic is found in our .chm help file, when we click this line in FileOpen topic (I discovered this 1 hour ago !)

See "Unicode Support" for a detailed description.

Now I just tried, without FileOpen, the FileGetEncoding("product.dbf") function :
It returned... 16 (which means binary) for the "product.dbf" file. This value is not indicated in the "Success" return values of the function in the help file (the success list goes from 32 to 512)

Also, the help file example of FileGetEncoding() is a bit strange : it checks for @error but @error will always be = 0 . A test with FileGetEncoding("sdfgsdfgggfghhsfg.txt") will bypass the @error test and return -1 in the help file example.
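A sketch of how the return value could be checked instead of @error, based only on the behaviour observed above (-1 for a missing file, 16 for "product.dbf"). The constants come from FileConstants.au3, and the messages are just illustrative:

```autoit
#include <FileConstants.au3>

Local $sFile = "product.dbf"
Local $iEncoding = FileGetEncoding($sFile)

; Test the return value itself, since @error stayed at 0 in the test above.
Switch $iEncoding
    Case -1
        ConsoleWrite("Could not determine the encoding of " & $sFile & @CRLF)
    Case $FO_BINARY ; 16, not listed among the documented success values
        ConsoleWrite($sFile & " is treated as binary" & @CRLF)
    Case $FO_ANSI ; 512
        ConsoleWrite($sFile & " is treated as ANSI" & @CRLF)
    Case $FO_UTF8_NOBOM ; 256
        ConsoleWrite($sFile & " is treated as UTF-8 without BOM" & @CRLF)
    Case Else
        ConsoleWrite("Other encoding value: " & $iEncoding & @CRLF)
EndSwitch
```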

6 hours ago, jchd said:

If ever a byte > 0x7F not introducing a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI

Very true (after testing). I just made this test on a 10-byte file :

C2 70 C2 80 C2 80 C2 80 C2 80

It opens as $FO_ANSI (512) because there is no BOM and C2 70 is not a valid UTF-8 sequence [UTF-8 encodes character 127 as 0x7F, then jumps to C2 80 for character 128, says Wiki]

ConsoleWrite would return 194 112 if this was our "product.dbf" in the script above :
0xC2 = 194
0x70 = 112

Now, the complementary test : the following 9-byte file would open as $FO_UTF8_NOBOM (256)
C3 A9 E2 82 AC C3 A9 C3 A9

Because there are 4 valid UTF-8 sequences in it :

[Image: 4 valid UTF-8 codes]
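Both tests can be reproduced with a short script. This is only a sketch: _WriteAndDetect() is a hypothetical helper written for this post, the file names in @TempDir are arbitrary, and it assumes FileGetEncoding() reports the same guess that FileOpen() makes in read mode.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: writes the given hex string as raw bytes, then
; returns whatever FileGetEncoding() reports for the resulting file.
Func _WriteAndDetect($sPath, $sHex)
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE + $FO_BINARY)
    If $hFile = -1 Then Return SetError(1, 0, -1)
    FileWrite($hFile, Binary($sHex))
    FileClose($hFile)
    Return FileGetEncoding($sPath)
EndFunc   ;==>_WriteAndDetect

; 10 bytes starting with C2 70, which is not a valid UTF-8 sequence.
ConsoleWrite(_WriteAndDetect(@TempDir & "\invalid_utf8.bin", "0xC270C280C280C280C280") & @CRLF)

; 9 bytes made only of valid UTF-8 sequences.
ConsoleWrite(_WriteAndDetect(@TempDir & "\valid_utf8.bin", "0xC3A9E282ACC3A9C3A9") & @CRLF)
```

If the observations above hold, the first line printed should be 512 and the second 256.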

Edit: thx @TheDcoder for the 2 links, it looks very interesting.

Edited March 26, 2022 by pixelsearch
