Pardalito (posted January 20, 2010; edited January 20, 2010, typo):

Hello,

Is there a way to detect a file's encoding/charset? I need to detect whether my files are encoded in UTF-8 (without BOM). I tried reading the first 3 bytes as hex, but the value changes from file to file:

0x3C3F70
0x3C3439
0x3C6D65
0x3C6874
0x3C7461
0x3C2144
0x093C64 (UTF-8 without BOM)
0xEFBBBF
...

Does anyone know a good way to tell whether a file is UTF-8 encoded when it has no BOM?

Best regards, Pardalito.
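Editorial illustration of why the first three bytes only help when a BOM is present: values such as 3C 3F 70 are simply the first characters of the content ("<?p"), while EF BB BF is the UTF-8 byte-order mark. The sketch below is Python rather than AutoIt, and `sniff_bom` is a hypothetical helper name, not an AutoIt function.

```python
import codecs

# Known byte-order marks, longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# The prefixes quoted in the post: only the last one is a BOM.
for prefix in ("3C3F70", "3C3439", "093C64", "EFBBBF"):
    print(prefix, sniff_bom(bytes.fromhex(prefix)))
```

A result of None says nothing about the encoding; it only means the content has to be inspected.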
jchd (posted January 20, 2010):

Quote (Pardalito): "Anyone knows a good solution to see if the file have UTF-8 encoding/charset without BOM?"

Obviously, finding a BOM, as in your last example line, is the easiest case. The ambiguity between a codepage (whatever it is) and UTF-8 w/o BOM is more difficult. There was a recent thread by Jon here whose result made its way into the latest release, so current AutoIt does this automagically, but I haven't seen the detected encoding being exposed by the read functions.

If you really need this information, what you should probably do is read the whole file, check each successive character for valid UTF-8 encoding, and exit at the first invalid one. This will be very slow if coded in AutoIt, but it can be done with the help of a small DLL if you need to do it routinely.

May I ask what the purpose of your request is?
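Editorial illustration of the "validate until the first invalid sequence" idea described above. This is a minimal Python sketch, not AutoIt and not the DLL jchd mentions; `first_invalid_utf8_offset` is a hypothetical name. It relies on Python's strict UTF-8 decoder, which stops at the first malformed sequence.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start      # offset where decoding failed
    return None


print(first_invalid_utf8_offset("caf\u00e9".encode("utf-8")))   # valid UTF-8
print(first_invalid_utf8_offset("caf\u00e9".encode("cp1252")))  # lone byte E9
```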
Pardalito (posted January 20, 2010):

Hello jchd,

I need an app that searches my web files (.php) and logs the files that are not UTF-8 encoded. For example: I have 1000 files and 2 of them are in ANSI. When I run this future app, the log must list those 2 ANSI files.

In the thread you mention I don't see any function that detects UTF-8 files, and in the latest release of AutoIt I don't see any procedure to detect the file charset.

Any help? Thanks for your reply.

Best regards, Pardalito.
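Editorial illustration of the requested scan: walk a folder, test every .php file, and collect the ones that are not valid UTF-8. This is a Python sketch under stated assumptions, not an AutoIt script; `find_non_utf8` is a hypothetical name, and each file is read whole, which is fine for typical source files. Note that pure-ASCII files pass this test, since ASCII is valid UTF-8.

```python
import os


def find_non_utf8(root: str, suffix: str = ".php"):
    """Return sorted paths under root ending in suffix that are not valid UTF-8."""
    offenders = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:   # raw bytes, no decoding yet
                data = handle.read()
            try:
                data.decode("utf-8")           # strict validation
            except UnicodeDecodeError:
                offenders.append(path)
    return sorted(offenders)
```

The returned list can then be written to a log file, one path per line.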
Pardalito (posted January 20, 2010):

Hello again,

In the release notes of v3.3.4.0:

"Added: Ability to read and write UTF-8 files with no BOM including automatic detection during reading."

Sorry... but I don't see any function/procedure for this in the includes folder...

Best regards, Pardalito.
jchd (posted January 20, 2010):

Am I misrepresenting your need if I say it's a one-time conversion? If that's the case, I believe you can brute-force the conversion very easily:

1. Read every file (FileRead will switch the read to ANSI or UTF-8 w/o BOM transparently).
2. Rewrite it, forcing UTF-8 with BOM.

It will be as fast as possible and as fail-proof as the auto-detection routine that Jon has introduced. Keep a list of the files already converted so that the procedure is applied to new files only, to save time (if that's important).

Would it work?
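Editorial illustration of the brute-force conversion outlined above, as a Python sketch rather than AutoIt. Assumptions: `convert_to_utf8_bom` is a hypothetical name, and files that are not valid UTF-8 are taken to be Windows-1252 (the "ANSI" code page varies by system, so the fallback is a parameter).

```python
def convert_to_utf8_bom(path: str, fallback: str = "cp1252") -> str:
    """Rewrite path as UTF-8 with BOM; return the encoding the input was read as."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        text = data.decode("utf-8-sig")   # accepts UTF-8 with or without BOM
        source = "utf-8"
    except UnicodeDecodeError:
        text = data.decode(fallback)      # assumed ANSI code page
        source = fallback
    with open(path, "wb") as handle:      # binary mode: line endings untouched
        handle.write(text.encode("utf-8-sig"))  # utf-8-sig prepends EF BB BF
    return source
```

Running it twice on the same file leaves the bytes unchanged, because the second pass strips and re-adds the BOM. Python's cp1252 codec rejects the few byte values that code page leaves undefined, so such files would raise instead of being converted silently.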
jchd (posted January 20, 2010):

Quote (Pardalito): "Sorry... But I don't see any function/procedure in the includes folder..."

You won't find a UDF for that. The feature is built into the FileRead* functions, which are part of the AutoIt core.
Pardalito (posted January 20, 2010):

Hello jchd,

This is a nice idea... Slower, but a nice idea. If you pick out only the ANSI files and then convert just those to UTF-8, the process is fast.

For future versions, a function to determine a file's charset/encoding would be welcome. Something like this:

FileEncoding('ansi.php') = 1
FileEncoding('utf-8.php') = 2
FileEncoding('utf-16.php') = 3
...

1 = ANSI charset
2 = UTF-8 charset
3 = UTF-16 charset
...

Thanks again jchd.

P.S.: If anybody has more ideas on how to do this, please post.
jchd (posted January 20, 2010):

Quote (Pardalito): "This is a nice idea... Slower, but a nice idea. If you get only the ANSI files and then convert these files to UTF-8, then you have a fast process. For future versions a function to determine a charset/encoding from file would be welcome. Something like that: FileEncoding('ansi.php') = 1, FileEncoding('utf-8.php') = 2, FileEncoding('utf-16.php') = 3"

I guess there can't be such magic without reading the file until an invalid UTF-8 sequence is found.

For files already in UTF-8, it won't be better than reading the file, writing it (possibly to /dev/null) and comparing the lengths read and written (it must be the _byte_ length). For files in ANSI, it depends on whether they include 8-bit chars and how far into the file the first invalid sequence is. So if only a few files (e.g. 2 among 1000) are ANSI, it will be "slow" anyway.

Note also that this procedure can give wrong results. An ANSI file is absolutely entitled to contain a sequence of 8-bit chars identical to a single UTF-8 char. It will then be valid UTF-8 and hence be classified as UTF-8. Fortunately, since typical UTF-8 sequences make little or no sense when interpreted as series of ANSI chars, the odds of this happening in "human-readable" files, .php files or other common types are small.
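Editorial illustration of the two points above: a round-trip check, and an ANSI byte pair that is also valid UTF-8. This is a Python sketch, not AutoIt; `roundtrips_as_utf8` is a hypothetical name. It compares the bytes themselves rather than only their length, which is a slightly stricter variant of the length comparison jchd describes, and it uses Windows-1252 as the example "ANSI" code page.

```python
def roundtrips_as_utf8(data: bytes) -> bool:
    """True if decoding as UTF-8 and re-encoding reproduces the same bytes."""
    # Invalid input is decoded with U+FFFD replacement characters, so the
    # re-encoded bytes can no longer match the original.
    return data.decode("utf-8", errors="replace").encode("utf-8") == data


# The two Windows-1252 characters A-tilde and copyright sign are stored as
# bytes C3 A9, which is exactly the UTF-8 encoding of e-acute.
ambiguous = "\u00c3\u00a9".encode("cp1252")
print(ambiguous.hex())                  # c3a9
print(roundtrips_as_utf8(ambiguous))    # True: indistinguishable from UTF-8
print(roundtrips_as_utf8("caf\u00e9".encode("cp1252")))  # False: lone E9
```

No byte-level test can separate the first case from genuine UTF-8; only the plausibility of the resulting text can.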
Pardalito (posted January 20, 2010):

Hello jchd,

Yes... You are right. I understand now. Thanks.
Jon (Administrators; posted January 20, 2010):

Quote (Pardalito): "In the latest release of AutoIt, I don't see any procedure to detect the file charset."

Hopefully we'll have a function (or an @extended code from FileOpen()) that does this in the next beta. Soon.

FWIW the AutoIt internal procedure is:

    Read first 64KB of file
    While chars
        If char = valid UTF8 sequence Then
            Skip sequence (1, 2, 3 or 4 bytes)
        Else
            Return NOT_UTF8
        EndIf
    WEnd

Also, at the end, if NO chars read were >127 then also return NOT_UTF8, because we can't tell if it's really UTF8 or standard ANSI.
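Editorial illustration of the procedure Jon outlines above, written as a Python sketch. This is not AutoIt's actual code. Assumptions: the names `sequence_length` and `looks_like_utf8` are hypothetical; the lead/continuation byte ranges follow the Unicode standard's table of well-formed UTF-8, which may be stricter than AutoIt's check; and a sequence cut off at the end of the 64 KB sample is counted as invalid here, since the post does not say how AutoIt treats that case.

```python
SAMPLE_SIZE = 64 * 1024  # "Read first 64KB of file"


def sequence_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at data[i], or 0."""
    b0 = data[i]
    if b0 < 0x80:
        return 1                              # plain 7-bit character
    lo, hi = 0x80, 0xBF                       # allowed range of the 2nd byte
    if 0xC2 <= b0 <= 0xDF:
        need = 1
    elif 0xE0 <= b0 <= 0xEF:
        need = 2
        if b0 == 0xE0:
            lo = 0xA0                         # reject overlong forms
        elif b0 == 0xED:
            hi = 0x9F                         # reject UTF-16 surrogates
    elif 0xF0 <= b0 <= 0xF4:
        need = 3
        if b0 == 0xF0:
            lo = 0x90                         # reject overlong forms
        elif b0 == 0xF4:
            hi = 0x8F                         # stay at or below U+10FFFF
    else:
        return 0                              # not a valid lead byte
    tail = data[i + 1 : i + 1 + need]
    if len(tail) < need:
        return 0                              # truncated sequence
    if not lo <= tail[0] <= hi:
        return 0
    if any(not 0x80 <= b <= 0xBF for b in tail[1:]):
        return 0
    return need + 1


def looks_like_utf8(data: bytes) -> bool:
    """Valid UTF-8 in the sample AND at least one character above 127."""
    sample = data[:SAMPLE_SIZE]
    i = 0
    saw_high = False
    while i < len(sample):
        n = sequence_length(sample, i)
        if n == 0:
            return False                      # "Return NOT_UTF8"
        if n > 1:
            saw_high = True
        i += n                                # "Skip sequence"
    return saw_high                           # all-ASCII also means NOT_UTF8
```

Returning False for all-ASCII input mirrors the last sentence of the post: the answer means "needs UTF-8", not merely "is compatible with UTF-8".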
jchd (posted January 20, 2010; edited January 20, 2010):

Quote (Jon): "Hopefully we'll have a function (or a @extended code from FileOpen()) that does this in the next beta. Soon."

That's nice, thanks.

Quote (Jon): "FWIW the AutoIt internal procedure is: Read first 64KB of file"

Only 64K and not the whole file? Isn't this a bit of a gamble? Being picky, don't you also check the bytes n+1 and beyond, when needed?

Quote (Jon): "Also, at the end if NO chars read were >127 then also return NOT_UTF8 because we can't tell if it's really UTF8 or standard ANSI."

:ahem: in this case, you can rightfully return UTF8 as well!

EDIT: the above remark is because I interpret(ed) the semantics of your return value differently. I thought it meant "is compatible with" while you seem to mean "is requiring". You get the idea.
Jon (Administrators; posted January 20, 2010):

Quote (jchd): ":ahem: in this case, you can rightfully return UTF8 as well! EDIT: the above remark is because I interpret(ed) the semantics of your return value differently. I thought it meant "is compatible with" while you seem to mean "is requiring". You get the idea."

Not really. Open a text file in something like Notepad++, enter normal letters like "abcdefghijklmnopqrstuvwxyz" and save it as "UTF-8 with no BOM". Close the file and then open it again. It will say it's encoded as ANSI. If all chars are <127 then it can't assume anything else.
Jon (Administrators; posted January 20, 2010; edited January 20, 2010):

Quote (jchd): "Only 64K and not the whole file? Isn't this a bit of a gamble?"

Maybe. If it doesn't work out well then I can increase it to read the whole file, at the cost of performance. But it's statistically unlikely that valid UTF8 sequences would occur by chance, and if there is no character > 127 in the first 64KB then how likely is one to be in the rest of the file? You can also force the issue with a flag in FileOpen() if required.

Quote (jchd): "Being picky, don't you also check the bytes n+1 and more, when needed?"

Of course.
jchd (posted January 20, 2010):

Quote (Jon): "Not really. Open a text file in something like Notepad++ enter normal letters like "abcdefghijklmnopqrstuvwxyz" and save it as "UTF-8 with no BOM". Close the file and then open it again. It will say it's encoded as ANSI. If all chars are <127 then it can't assume anything else."

But 7-bit ASCII is UTF-8 compatible.

That's probably because I regard ANSI as outdated and UTF as the "untold default" (it should be with BOM, anyway). It's a shame that something as dumb as codepages is still the default for so many editors/programs. I wonder how many decades after the advent of Unicode we (or our children) will still have to wait before the ANSI crap is gone.
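Editorial illustration of the "is compatible with" versus "is requiring" distinction discussed above. One way to keep both readings is to report three states instead of two. This is a Python sketch with a hypothetical `classify` helper, not something AutoIt provides.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf8' or 'not-utf8' for a block of bytes."""
    if data.isascii():
        # Only bytes below 0x80: identical in ASCII, UTF-8 and ANSI code
        # pages, so the content itself cannot decide between them.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    return "utf8"          # contains valid multi-byte sequences


letters = b"abcdefghijklmnopqrstuvwxyz"
print(classify(letters))                          # ascii
print(classify("caf\u00e9".encode("utf-8")))      # utf8
print(classify("caf\u00e9".encode("cp1252")))     # not-utf8
```

Under this scheme Jon's NOT_UTF8 corresponds to "ascii" or "not-utf8", while jchd's "compatible" reading corresponds to "ascii" or "utf8".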
KaFu (posted January 21, 2010):

Maybe writing a wrapper function for this will do the job?

IsTextUnicode Function
http://msdn.microsoft.com/en-us/library/dd318672(VS.85).aspx
jchd (posted January 21, 2010):

From what I understand, here (and at other places in MSDN as well) Microsoft uses "Unicode" to mean UTF-16:

"IS_TEXT_UNICODE_ODD_LENGTH: The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text."

It doesn't seem that they consider UTF-8 as a possibility (or I missed it). It's still possible that another call would do it.

There are also several Unicode transformations that could be of interest, like normalizations. It's possible to have valid Unicode on a character-by-character basis but invalid sequences of characters. This should be of marginal use, and there is a risk of utter confusion for many if not most users. I bet that those with such demanding needs will either ask or know how to do that by themselves using Windows calls.
Lazycat (posted January 21, 2010):

Some time ago I wrote a UDF for Unicode type detection (first on the page): http://autoit.darkhost.ru/udfs.html

But surely internal methods should be faster.
Jon (Administrators; posted January 21, 2010):

Quote (jchd): "It doesn't seem that they consider UTF-8 as a possibility (or I missed it)."

It doesn't. And it's pretty crappy for a lot of text (search for "bush hid the facts" and IsTextUnicode). It only uses the first 256 bytes as well.
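Editorial illustration of the "bush hid the facts" case Jon refers to: the 18 ASCII bytes of that sentence happen to pair up into nine plausible-looking UTF-16 code units, which is why a statistical UTF-16 guess can misfire on short ANSI text. The Python sketch below only reinterprets the bytes; it does not call IsTextUnicode, and `misread_as_utf16` is a hypothetical name.

```python
def misread_as_utf16(data: bytes) -> str:
    """Interpret raw bytes as little-endian UTF-16, as a wrong guess would."""
    return data.decode("utf-16-le")


text = misread_as_utf16(b"bush hid the facts")
print(len(text))                          # 9 code units from 18 bytes
print([f"U+{ord(c):04X}" for c in text])  # all fall in the CJK ideograph block
```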
Pardalito (posted January 21, 2010):

Hello Lazycat,

Good UDF. Thanks.

Best regards, Pardalito.
jchd (posted January 21, 2010):

Quote (Lazycat): "http://autoit.darkhost.ru/udfs.html"

While not handling plane 1 of Unicode is probably harmless for most everyday use, I am told that plane 2 (supplementary CJK extensions) is becoming more common. That would mean that for a large number of people, UTF-8 encoding using 3 or 4 bytes will become routine.

I do have to deal with Asia a lot and I need to take care of this. Just to let you know.
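Editorial illustration of the byte counts mentioned above: characters in the Basic Multilingual Plane need at most 3 bytes in UTF-8, while anything in planes 1 and above (including the plane 2 CJK extensions) needs 4, so a detector limited to shorter sequences would reject such text. Python sketch; `utf8_length` is a hypothetical helper.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    ("U+0041", "A"),             # ASCII
    ("U+00E9", "\u00e9"),        # Latin-1 range
    ("U+4E2D", "\u4e2d"),        # CJK ideograph in the BMP (plane 0)
    ("U+20000", "\U00020000"),   # first code point of plane 2
]
for label, ch in samples:
    print(label, utf8_length(ch), ch.encode("utf-8").hex())
```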