face Posted April 28, 2014 (edited)

I have an AutoIt program that extracts text from all text files in a folder and saves the extracted words to a word-list text file. I need to add an ignore option, like a blacklist of words or single characters. Also, I'm not sure whether the program detects word fragments and spacing in Chinese text. It has to detect spacing in Chinese text so that it doesn't extract entire phrases. Here's the code:

    #include <File.au3>
    #include <Array.au3>
    #include <MsgBoxConstants.au3>

    Local $oDictionary = ObjCreate("Scripting.Dictionary")
    Local $mypath = @ScriptDir
    Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
    If @error Then
        MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
        Exit
    Else
        MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
    EndIf
    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3) ; change pattern to fit your definition of "word"
        Local $iError = @error
        If $iError = 0 Then
            For $Word In $aWords
                $oDictionary.ADD($Word, $Word)
            Next
        Else
            MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
        EndIf
    Next
    $aWords = $oDictionary.Items
    FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))
face Posted April 29, 2014

Any suggestions?
somdcomputerguy Posted April 29, 2014 (edited)

This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the three loops I started.

    Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
    Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
    Local $aWords
    For $i = 1 To $aFiles[0]
        For $j = 1 To $bFiles[0]
            If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)

edit: Actually, the If loop doesn't need to be closed as I wrote it above, as there is only one line to execute, but change it as you need to.
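The snippet above compares file paths, so as written it only affects which files get scanned. If the goal is to ignore individual words or characters, one possible approach is to load the blacklist into its own dictionary and check each extracted word against it. The following is a minimal, untested sketch under assumptions: the file name blacklist.txt, its one-entry-per-line layout, and the sample sentence are all hypothetical.

```autoit
#include <Array.au3>

; Sketch only: skip words that appear in a blacklist file.
; Assumption: blacklist.txt sits next to the script and holds one entry per line.
Local $oBlacklist = ObjCreate("Scripting.Dictionary")
Local $aBlack = FileReadToArray(@ScriptDir & "\blacklist.txt")
If Not @error Then
    For $sEntry In $aBlack
        Local $sTrim = StringStripWS($sEntry, 3) ; 3 = strip leading and trailing whitespace
        If $sTrim <> "" Then $oBlacklist.Item($sTrim) = 1
    Next
EndIf

Local $oWords = ObjCreate("Scripting.Dictionary")
; Hypothetical sample text; in the real script this would be FileRead($aFiles[$i])
Local $aWords = StringRegExp("the quick brown fox and the dog", "[^\s]+", 3)
If IsArray($aWords) Then
    For $sWord In $aWords
        ; Keep the word only if it is not blacklisted.
        ; Assigning through Item does not fail when the key already exists.
        If Not $oBlacklist.Exists($sWord) Then $oWords.Item($sWord) = 1
    Next
EndIf
If $oWords.Count > 0 Then ConsoleWrite(_ArrayToString($oWords.Keys, @CRLF) & @CRLF)
```

Scripting.Dictionary compares keys case-sensitively by default, so "The" and "the" would count as different entries unless the compare mode is changed or the words are normalized first.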
jchd Posted April 30, 2014 (edited)

Can you please post a significant sample of text including Chinese text? Remember that AutoIt's implementation of PCRE (the regexp engine) is Unicode-aware, but you need to use the (*UCP) option to correctly recognize non-ANSI codepoints. Also, \s is probably not the condition you need. Start by experimenting with this to grab "words" of length > 1 (I also allowed digits, but this may be something you don't want; remove \d in that case):

    #include <Array.au3>

    Local $sText = "A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically" & _
        " কম্পিউটাৰক অসমীয়াত পৰিকলন যন্ত্ৰ বুলিও কোৱা হয়৷ ইংৰাজী কম্পিউটাৰ শব্দটো আহিছে লেটিন ভাষাৰ 'কম্পিউটে' শব্দৰ পৰা যাৰ অৰ্থ হৈছে গণনা৷" & _
        " Сучасны камп'ютар складаецца з абсталявання, якое ўяўляе фізічныя часткі камп'ютара (працэсар, клавіятура, манітор і г.д.)" & _
        " კომპიუტერი (ინგლ. computer) ინგლისური ზიტყვა რე დო გჷშმაკოროცხალს შანენს. თენა რე ელექტრონული გჷშმაკოროცხალი მანქანა" & _
        " 電腦或計算機係一台揸得指令(程式)操作資料嗰機器。" & _
        " '太字'コンピュータ(英: computer)は、自動計算機、とくに計算開始後は人手を介さずに計算終了まで動作する電子式汎用計算機。" & _
        " محتویات این مقاله ممکن است غیر قابل اعتماد و نادرست یا جانبدارانه باشد یا قوانین حقوق پدیدآورندگان را نقض کرده باشد. "
    Local $res = StringRegExp($sText, "(*UCP)\b[\pL\d]{2,}", 3)
    _ArrayDisplay($res)

The \pL part means "any Unicode letter" (in any language). It is a Unicode Character Property. See the PCRE reference document (link in my signature) for more details about \p and friends.

I'm not knowledgeable about Asian languages and the spacing that has to be considered, so this naïve attempt is certainly far from the real thing. Also, you need to ensure that the input text is Unicode and not one of the many multi-byte encodings widely used in East Asia, like Big5 and countless others. Lastly, I need to remind you that AutoIt currently uses the UCS-2 subset of Unicode, which is limited to plane 0 (the so-called BMP). If your input contains codepoints from higher Unicode planes, then converting the input to UTF-16LE first might work, but I'm unsure of that. You need to try that possibility.

Signature:
This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)
face Posted May 1, 2014

Quote (somdcomputerguy):
This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the three loops I started.

    Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
    Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
    Local $aWords
    For $i = 1 To $aFiles[0]
        For $j = 1 To $bFiles[0]
            If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)

edit: Actually, the If loop doesn't need to be closed as I wrote it above, as there is only one line to execute, but change it as you need to.

I get this error message. The code looks like this:

    #include <File.au3>
    #include <Array.au3>
    #include <MsgBoxConstants.au3>

    Local $oDictionary = ObjCreate("Scripting.Dictionary")
    Local $mypath = @ScriptDir
    Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
    Local $bFiles = _FileListToArray($mypath, "blacklist.txt", 1, 1)
    Local $aWords
    For $i = 1 To $aFiles[0]
        For $j = 1 To $bFiles[0]
            If $bFiles[$j] <> $aFiles[$i] Then $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3)
    If @error Then
        MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
        Exit
    Else
        MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
    EndIf
    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3) ; change pattern to fit your definition of "word"
        Local $iError = @error
        If $iError = 0 Then
            For $Word In $aWords
                $oDictionary.ADD($Word, $Word)
            Next
        Else
            MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
        EndIf
    Next
    $aWords = $oDictionary.Items
    FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))
Palestinian Posted May 1, 2014

Press Ctrl + T in SciTE and it will tell you how many Next statements you are missing. The code you posted is missing 2.
mikell Posted May 1, 2014

Quote: "... I'm sure you'll notice that I didn't close any of the three loops I started."

There are 2 'Next' missing in your code.
somdcomputerguy Posted May 1, 2014

Quote: "There are 2 'Next' missing in your code"

This is just a snippet of your code, but I think you'll see where I'm coming from. Now, I haven't tested this, and I'm sure you'll notice that I didn't close any of the loops I started.
mikell Posted May 1, 2014

somdcomputerguy, obviously I meant "There are 2 'Next' missing in face's code".
somdcomputerguy Posted May 1, 2014

Ah. A misunderstanding then.
face Posted May 1, 2014

Now it works perfectly, but it doesn't search subfolders. How can I make it find all text files in all subfolders?

    #include <File.au3>
    #include <Array.au3>
    #include <MsgBoxConstants.au3>

    Local $oDictionary = ObjCreate("Scripting.Dictionary")
    Local $mypath = @ScriptDir
    Local $aFiles = _FileListToArray($mypath, "*.txt", 1, 1)
    Local $aWords
    If @error Then
        MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
        Exit
    Else
        MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
    EndIf
    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3) ; change pattern to fit your definition of "word"
        Local $iError = @error
        If $iError = 0 Then
            For $Word In $aWords
                $oDictionary.ADD($Word, $Word)
            Next
        Else
            MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
        EndIf
    Next
    $aWords = $oDictionary.Items
    FileWrite("saved/words.txt", _ArrayToString($aWords, @CRLF))
BrewManNH Posted May 1, 2014

_FileListToArrayRec
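A minimal sketch of how _FileListToArrayRec could replace the _FileListToArray call, assuming an AutoIt release that ships this function in File.au3. The variable name $sRoot is made up for illustration; the $FLTAR_* constants are the ones File.au3 pulls in. Requesting full paths is a choice made here so that FileRead works regardless of the working directory.

```autoit
#include <File.au3>
#include <MsgBoxConstants.au3>

; Sketch only: list every *.txt file under the script folder, including subfolders.
Local $sRoot = @ScriptDir ; assumption: same start folder as in the thread
Local $aFiles = _FileListToArrayRec($sRoot, "*.txt", $FLTAR_FILES, $FLTAR_RECUR, $FLTAR_NOSORT, $FLTAR_FULLPATH)
If @error Then
    MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
    Exit
EndIf

; $aFiles[0] holds the count; elements 1 to $aFiles[0] are full paths.
For $i = 1 To $aFiles[0]
    ConsoleWrite($aFiles[$i] & @CRLF)
Next
```

The rest of the word-extraction loop from the earlier posts should be able to consume $aFiles unchanged, since the array has the same count-in-element-0 layout.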
vinnyMS Posted December 10, 2020 (edited)

This script throws an error. How do I fix it?

    #include <File.au3>
    #include <Array.au3>
    #include <MsgBoxConstants.au3>

    Local $oDictionary = ObjCreate("Scripting.Dictionary")
    Local $mypath = @ScriptDir
    Local $aFiles = _FileListToArrayRec($mypath, "*.txt", 1, 1)
    Local $aWords
    If @error Then
        MsgBox($MB_SYSTEMMODAL, "Error", "No files found")
        Exit
    Else
        MsgBox($MB_SYSTEMMODAL, "Found", $aFiles[0] & " files")
    EndIf
    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3) ; change pattern to fit your definition of "word"
        Local $iError = @error
        If $iError = 0 Then
            For $Word In $aWords
                $oDictionary.ADD($Word, $Word)
            Next
        Else
            MsgBox($MB_SYSTEMMODAL, "Error", $aFiles[$i] & " - " & $i & @CRLF & "error: " & $iError)
        EndIf
    Next
    $aWords = $oDictionary.Items
    FileWrite("words.txt", _ArrayToString($aWords, @CRLF))

What is the regexp for Chinese characters?
mikell Posted December 10, 2020

The error probably occurs because the key to be added already exists. Example:

    $sd = ObjCreate("Scripting.Dictionary")
    $sd.add("test", "1")
    $sd.add("test", "2")
    msgbox(0,"", $sd.Item("test"))

So you might try:

    If not $oDictionary.Exists($Word) Then $oDictionary.ADD($Word, $Word)

The regex should work for Chinese chars, but you can add (*UCP) at the beginning of the pattern.
vinnyMS Posted December 10, 2020

Thank you. For extracting Chinese words, this regex works: "[^\x00-\x7F]+"
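Note that "[^\x00-\x7F]+" matches any run of non-ASCII characters, so accented Latin letters, Cyrillic, and so on are captured as well. If only Chinese characters are wanted, a narrower pattern might be built on the PCRE script property \p{Han}. The sketch below is untested and rests on assumptions: that AutoIt's PCRE build accepts \p{Han}, that the script file is saved in a Unicode encoding so the literals survive, and the sample string itself is made up.

```autoit
#include <Array.au3>

; Sketch only: compare "any non-ASCII run" with "runs of Han characters".
Local $sText = "computer 電腦 café 計算機" ; hypothetical sample text

; Pattern from the thread: every run of non-ASCII characters (also catches the é in café)
Local $aNonAscii = StringRegExp($sText, "[^\x00-\x7F]+", 3)

; Assumed alternative: \p{Han} is the PCRE script property for Han ideographs
Local $aHan = StringRegExp($sText, "(*UCP)\p{Han}+", 3)

_ArrayDisplay($aNonAscii, "Non-ASCII runs")
_ArrayDisplay($aHan, "Han runs")
```

Neither pattern splits a run of Chinese characters into individual words, because written Chinese normally has no spaces between them; that would need a dictionary-based segmenter, which is outside what a regex can do.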
vinnyMS Posted December 10, 2020

How do I extract every line that contains "#INCLUDE"?
Nine Posted December 10, 2020

    #include <Array.au3>

    Local $aIncl = StringRegExp(FileRead("YourFileGoesHere.au3"), "(?mi)^\s*(#include .*)$", 3)
    _ArrayDisplay($aIncl)

Try this.
vinnyMS Posted December 10, 2020

Thank you so much, ty.
AspirinJunkie Posted December 10, 2020

8 hours ago, mikell said:
The error probably occurs because the key to be added already exists. Example:

    $sd = ObjCreate("Scripting.Dictionary")
    $sd.add("test", "1")
    $sd.add("test", "2")
    msgbox(0,"", $sd.Item("test"))

So you might try:

    If not $oDictionary.Exists($Word) Then $oDictionary.ADD($Word, $Word)

The regex should work for Chinese chars, but you can add (*UCP) at the beginning of the pattern.

Alternatively, you can use the assignment operator. In this case an item is either added if it does not exist or overwritten if it already exists:

    $oDictionary("TheKey") = "TheValue"
Nine Posted December 10, 2020 (edited)

If you want faster results, you could use a Map (see the beta version):

    #include <File.au3> ; needed for _FileListToArray and $FLTA_FILES

    Const $mypath = @ScriptDir
    Local $aFiles = _FileListToArray($mypath, "*.txt", $FLTA_FILES)
    Local $mWord[] ; create map array
    Local $aWords
    For $i = 1 To $aFiles[0]
        $aWords = StringRegExp(FileRead($aFiles[$i]), "[^\s]+", 3) ; change pattern to fit your definition of "word"
        If Not IsArray($aWords) Then ContinueLoop
        For $Word In $aWords
            $mWord[$Word] = 1
        Next
    Next
    $aWords = MapKeys($mWord)
    ConsoleWrite(UBound($aWords) & @CRLF)