Biatu Posted January 18, 2017 Share Posted January 18, 2017 Hello I am trying to find the LCS in an array of strings. Goal: To traverse the array, comparing string against string, and extracting similar groups of words, then single words given that they are longer than the "~!SubStrID", being be a number starting with "~!" to point to the substitute. Problem: I keep getting too much junk, or parts of words, or numbers, and not full sub strings/words, etc. Here is the code: #Include <Array.au3> Local $aCommons[0], $vTest $vTest=StringSplit(FileRead("Strings.txt"),@CRLF,1);@LF Only from now on! For $i=1 To $vTest[0] For $j=1 To $vTest[0] If $i=$j Then ContinueLoop $sCommon=_LCS($vTest[$i],$vTest[$j]) If StringInStr($sCommon,"~!") Then ContinueLoop $iMax=UBound($aCommons,1) For $k=0 To $iMax-1 If $sCommon=$aCommons[$k] Then ContinueLoop 2 Next ReDim $aCommons[$iMax+1] $aCommons[$iMax]=$sCommon $vTest[$j]=StringReplace($vTest[$j],$sCommon,"~!SubStrID") Next Next _ArrayDisplay($aCommons) Func _LCS($sStringA,$sStringB) Local $sTmp,$sRet For $i=1 To StringLen($sStringA) $sTmp&=StringMid($sStringA,$i,1) If StringInStr($sStringB,$sTmp) Then If StringLen($sTmp)>StringLen($sRet) Then $sRet=$sTmp EndIf Else $sTmp="" EndIf Next Return $sRet EndFunc What is what? What is what. Link to comment Share on other sites More sharing options...
jchd Posted January 18, 2017 Share Posted January 18, 2017 Can you post a significant sample of the strings you have to process? This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Biatu Posted January 18, 2017 Author Share Posted January 18, 2017 Here are 2 samples, and there will be like hundreds of lists that will need processing. StringsA.txtStringsB.txt What is what? What is what. Link to comment Share on other sites More sharing options...
orbs Posted January 18, 2017 Share Posted January 18, 2017 @Biatu, are you trying to detect which string in list A also exists in string B (i.e. find the intersection)? this is by far easier than LCS - and in general, LCS is not the most efficient way when working with sorted lists. Signature - my forum contributions: Spoiler UDF: LFN - support for long file names (over 260 characters) InputImpose - impose valid characters in an input control TimeConvert - convert UTC to/from local time and/or reformat the string representation AMF - accept multiple files from Windows Explorer context menu DateDuration - literal description of the difference between given dates Apps: Touch - set the "modified" timestamp of a file to current time Show For Files - tray menu to show/hide files extensions, hidden & system files, and selection checkboxes SPDiff - Single-Pane Text Diff Link to comment Share on other sites More sharing options...
jchd Posted January 19, 2017 Share Posted January 19, 2017 The string pool will probably be fairly large, given the content of StringsB.txt for instance. Doing a comparison character by character, string after string in such a huge list of all notebook models (just that) will take forever. Instead and if what you're going to do is match one input string with one or more entries in the huge list of "reference strings", if I were you I'd use a 3-step approach. 0- Before anything else you need to build an SQLite database of all reference strings, possibly categorized to avoid mixing peripheral names and notebook/desktop models. I bet you aren't searching completely blindly. Be sure to add an FTS (Full Text Search) table of the text part. You might have to remove special characters to keep only alphanumeric, spaces and possibly hyphens and such. All this can be done automagically. 1- For every input string, put it in a separate table along its own FTS table. Using the suitable FTS tokenizer you can end up with input strings broken down into only significant words, just like the reference strings are. 2- Querying the input FTS table you get a list of words in the input string, which you can then use to build up a query in the reference FTS table, finally retrieving the candidate reference string(s) matching at least the same list of words as the input string. All this may seem a little bit complex, but in practice and once the relevant queries are set up and debugged, things will be extremely fast. 3- Once you have a list of reference strings matching the words in the input string, you can go the pedestrian way to decide which one is a/the best match of the input. You can certainly display a listview and let the user choose, or implement some best match algorithm suitable to your actual context. All in all I believe this will reveal the most efficient approach, at least with the kind of examples you've shown and the volume implied by the nature of the second example file (hundreds of such is no problem). The standard SQLite DLLs include FTS4 which offers a number of features (http://www.sqlite.org/fts3.html). The new FTS5 extension is very fast and even more powerful (http://www.sqlite.org/fts5.html#full_text_query_syntax). Skysnake 1 This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Biatu Posted January 19, 2017 Author Share Posted January 19, 2017 @orbs No, those are separate lists and I am looking to get the LCS between all the entries in the list. Basically I'm trying to reduce the file size of each list by replacing commons strings first, then common words with pointers to another list index. @jchd That's a good approach, however I cannot use SQLite, this script needs to have no dependencies, and I'm not too worried about it being slow. What is what? What is what. Link to comment Share on other sites More sharing options...
jchd Posted January 19, 2017 Share Posted January 19, 2017 You can always fileinstall the DLL, or inline it in binary like it was in old AutoIt versions and even run it direct from memory. Anyway I whish you best of luck. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
czardas Posted January 19, 2017 Share Posted January 19, 2017 5 hours ago, Biatu said: Basically I'm trying to reduce the file size of each list by replacing commons strings first, then common words with pointers to another list index. This is a very interesting problem. I have pondered it myself: not with words but with binary. The problem is that it takes forever. Biatu 1 operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
jchd Posted January 19, 2017 Share Posted January 19, 2017 Guys, you're rediscovering compression algorithms. Biatu and czardas 2 This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
czardas Posted January 19, 2017 Share Posted January 19, 2017 Everything I come up with turns out that way. Biatu 1 operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
Gianni Posted January 19, 2017 Share Posted January 19, 2017 here is a quick attempt in a brute force way. It will find all possible substrings of the passed words and will store results in an SQLite table. Then will extract with an sql query the most recurring substrings. This is just a quick draft. Could it be a way to go? note: to use this script you have to save the Sqlite3.dll in the same path of the script. expandcollapse popup#include <array.au3> #include <MsgBoxConstants.au3> #include <SQLite.au3> ; #include <SQLite.dll.au3> Local $aResult, $iRows, $iColumns, $iRval Local $sString[19] = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] ; Local $sString = FileReadToArray(@ScriptDir & "\StringsA.txt") ; <--- use this to load a list of words from a text file. ; _ArrayDisplay($sString, "Readed file") _SQLite_Startup(@ScriptDir & "\sqlite3.dll") If @error Then MsgBox($MB_SYSTEMMODAL, "SQLite Error", "SQLite3.dll Can't be Loaded!") Exit -1 EndIf ConsoleWrite("_SQLite_LibVersion=" & _SQLite_LibVersion() & @CRLF) Local $sDbName = @ScriptDir & "\SubStrings.sqlite" Local $hDskDb = _SQLite_Open() ; Open a temp disk database If @error Then MsgBox($MB_SYSTEMMODAL, "SQLite Error", "Can't open or create a permanent Database!") Exit -1 EndIf _SQLite_Exec(-1, "Create table tblSubstrings (substring, stringID int);") ; The following nested loops will generate any possible substring out of the passed string ; and results are stored into the sqlite archive for later analysis For $n = 0 To UBound($sString) - 1 ; save the whole string _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & $sString[$n] & "'," & $n & ");") If StringLen($sString[$n]) > 1 Then For $x = 2 To StringLen($sString[$n]) - 1 For $i = 1 To StringLen($sString[$n]) - $x + 1 ; ConsoleWrite(StringMid($sString[$n], $i, $x) & @CRLF) _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & StringMid($sString[$n], $i, $x) & "'," & $n & ");") Next Next EndIf Next ; Query ; this query extract all the substrings ordered from the most common to the least recurring $iRval = _SQLite_GetTable2d(-1, "SELECT substring, COUNT(substring) AS subs_sum FROM tblSubstrings GROUP BY substring HAVING subs_sum>1 ORDER BY subs_sum DESC;", $aResult, $iRows, $iColumns) If $iRval = $SQLITE_OK Then _ArrayDisplay($aResult) Else MsgBox($MB_SYSTEMMODAL, "SQLite Error: " & $iRval, _SQLite_ErrMsg()) EndIf ; close sqlite _SQLite_Close($hDskDb) _SQLite_Shutdown() Biatu 1 Chimp small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt.... Link to comment Share on other sites More sharing options...
Biatu Posted January 19, 2017 Author Share Posted January 19, 2017 1 hour ago, Chimp said: here is a quick attempt in a brute force way. It will find all possible substrings of the passed words and will store results in an SQLite table. Then will extract with an sql query the most recurring substrings. This is just a quick draft. Could it be a way to go? note: to use this script you have to save the Sqlite3.dll in the same path of the script. expandcollapse popup#include <array.au3> #include <MsgBoxConstants.au3> #include <SQLite.au3> ; #include <SQLite.dll.au3> Local $aResult, $iRows, $iColumns, $iRval Local $sString[19] = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] ; Local $sString = FileReadToArray(@ScriptDir & "\StringsA.txt") ; <--- use this to load a list of words from a text file. ; _ArrayDisplay($sString, "Readed file") _SQLite_Startup(@ScriptDir & "\sqlite3.dll") If @error Then MsgBox($MB_SYSTEMMODAL, "SQLite Error", "SQLite3.dll Can't be Loaded!") Exit -1 EndIf ConsoleWrite("_SQLite_LibVersion=" & _SQLite_LibVersion() & @CRLF) Local $sDbName = @ScriptDir & "\SubStrings.sqlite" Local $hDskDb = _SQLite_Open() ; Open a temp disk database If @error Then MsgBox($MB_SYSTEMMODAL, "SQLite Error", "Can't open or create a permanent Database!") Exit -1 EndIf _SQLite_Exec(-1, "Create table tblSubstrings (substring, stringID int);") ; The following nested loops will generate any possible substring out of the passed string ; and results are stored into the sqlite archive for later analysis For $n = 0 To UBound($sString) - 1 ; save the whole string _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & $sString[$n] & "'," & $n & ");") If StringLen($sString[$n]) > 1 Then For $x = 2 To StringLen($sString[$n]) - 1 For $i = 1 To StringLen($sString[$n]) - $x + 1 ; ConsoleWrite(StringMid($sString[$n], $i, $x) & @CRLF) _SQLite_Exec(-1, "Insert into tblSubstrings values ('" & StringMid($sString[$n], $i, $x) & "'," & $n & ");") Next Next EndIf Next ; Query ; this query extract all the substrings ordered from the most common to the least recurring $iRval = _SQLite_GetTable2d(-1, "SELECT substring, COUNT(substring) AS subs_sum FROM tblSubstrings GROUP BY substring HAVING subs_sum>1 ORDER BY subs_sum DESC;", $aResult, $iRows, $iColumns) If $iRval = $SQLITE_OK Then _ArrayDisplay($aResult) Else MsgBox($MB_SYSTEMMODAL, "SQLite Error: " & $iRval, _SQLite_ErrMsg()) EndIf ; close sqlite _SQLite_Close($hDskDb) _SQLite_Shutdown() I'll check it out, thanks guys. What is what? What is what. Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now