frank10 Posted September 26, 2014 Share Posted September 26, 2014 (edited) I have an unordered list of more than 30000 ID in a file.txt (separated by "|"), such as: msw-3k-031114-a029_c026_1110gt A009_C009_062709 jkh-4k-novsowia-028614-a094_c071_09206t 9170-22 UT_BCNP_Arils_25 WHtD_Fo_2592 102510-17 I need to check if a new ID is inside this list and if not there, update it. I tried loading the file in a variable and check with StringInStr() but it's slow (as I must check them at blocks of 80, so at 0.4s it results 32s waiting time...), so I would improve the search. StringInStr($myListFiles, "04-060112-17" ) ;0.4 s I thought subdividing the list in several arrays checking the first letter of each ID, so I will have 9 array for numeric ID and 21 array for chars. This StringInStr() should be faster. But it depends of how many ID starts with let's say "A": if they are the majority, the speed won't be a lot faster (their names aren't equal subdivided). I tried also using Scripting.Dictionary but it's a lot slower than StringInStr() (??) $oDict.Item("04-060112-17") ;7s (??) Other suggestions? Edited September 26, 2014 by frank10 Link to comment Share on other sites More sharing options...
kylomas Posted September 26, 2014 Share Posted September 26, 2014 What designates an ID? Forum Rules Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 They corresponds to the name of a unique file. Link to comment Share on other sites More sharing options...
kylomas Posted September 26, 2014 Share Posted September 26, 2014 I don't see 04-060112-17 in your test data. Forum Rules Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 Well, it's in my full list "file.txt". Here I posted only some example IDs to understand how they are made. In the test speed I searched one ID in the last positions to compare speed searches in the worst cases. My 0,4s refers of a search of more than 30000 IDs, not on only the 7 I posted. Link to comment Share on other sites More sharing options...
kylomas Posted September 26, 2014 Share Posted September 26, 2014 OK, so each line can be an ID? Forum Rules Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 Yes. Link to comment Share on other sites More sharing options...
kylomas Posted September 26, 2014 Share Posted September 26, 2014 Instead of an array you might try it like this... local $st = timerinit() local $fileid = stringregexp(fileread(@scriptdir & '\test.txt'),'UT_BCNP_Arils_25',3)[0] ConsoleWrite($fileid & ' found in ' & timerdiff($st)/1000 & ' seconds.' & @lf) kylomas Forum Rules Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 (edited) I don't know if this is of any use to you. I wrote it a while ago after some inspiration from kylomas. I think there may be better functions doing something similar. Edit You need to have at least two arrays (read from file I imagine) to pass to the function. It will test both arrays for duplicates and concatenate the unique entries - so it's only useful if you have a lot of new items to add in one batch process. There is an example of use in the first post in that thread (I should write a simpler example sometime). Because you only have approximately 30,000 IDs to test, it should be reasonably fast. Any problems getting it to work, then post the code you tried and I'll see if I can help. Edited September 26, 2014 by czardas operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 Instead of an array you might try it like this... local $st = timerinit() local $fileid = stringregexp(fileread(@scriptdir & '\test.txt'),'UT_BCNP_Arils_25',3)[0] ConsoleWrite($fileid & ' found in ' & timerdiff($st)/1000 & ' seconds.' & @lf) kylomas Thank you a lot kylomas! That stringregexp was really an improvement! I didn't suspect it was that faster vs StringInStr(). I previously used: $timer = TimerInit() $result = StringInStr($myListFiles, "04-060112-17") ConsoleWrite("StringSearch sec:" & TimerDiff($timer)/1000 & " result:" & $result & @CRLF) The stringregexp() makes it in 0,004s vs the 0,4s of StringInStr() !! That is a huge improvement of 100x !! Quite unbelieavable... Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 (edited) There is something strange. 1) i.e. if I search: "A011_C018_0804LF" I get regExp: 0.0029s vs 3.2s ( ! ) but if I search: "sloquid-082212-05" I get regExp: 0.003s vs 0.037s What's so different in the first string with StringInStr()? 2) And another test reducing the size of file to only 7 IDs: if I search: "A011_C018_0804LF" I get regExp: 5.377s ( ! ) vs 1.31s and "sloquid-082212-05" I get regExp: 6.11s ( ! ) vs 1.76s So with very short variables regExp is worse !? PS: I had to change your line, because if I search for an ID not in the list I get error: local $aFileid = stringregexp($myListFiles,"sid-082212-05s" ,3) if IsArray($aFileid) Then msgBox(0,''","ID found") Edited September 26, 2014 by frank10 Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 (edited) Regexp has various characters which are actually commands to the regular expression engine. These cannot be searched directly without an escape backslash being placed before the character. For example the string "sid-082212-05s" would need to be changed to "sid-082212-05s". I assume the odd runtime discrepancies will be associated with the complexity of parsing Regular Expressions. This is possibly the reason RegExp seems slower on your test strings. The code should give the wrong results in this instance. Also consider that the the expression (as it is written) will allow substring matches - which is probably not what you want. Edited September 26, 2014 by czardas operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 (edited) Regexp has various characters which are actually commands to the regular expression engine. These cannot be searched directly without an escape backslash being placed before the character. For example the string "sid-082212-05s" would need to be changed to "sid-082212-05s". I don't think that's the problem, because using the "A011_C018_0804LF" string that has no special RegExp char, inside a short variable, I get very slow speed, whilst in a huge var I get very fast speed. More, If I put "" before "-" I get no improved speed, but also if I don't put "" it gets the correct result... Edited September 26, 2014 by frank10 Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 (edited) If you get the correct results without escaping the minus sign, then it's plain luck. Having said that I don't exactly understand how that works. The regular expression engine may be doing various internal tests before searching - making it apparently slower on small strings. This is a guess. It's also a case sensitive search ATM. Edit Got it! - the minus sign should be escaped if used inside a set. Edit 2 Without a full description of all the possible ID variants (or rules to generate them) it is never possible to give a regular expression solution. Here the only rule seems to be that there is a delimiter string "|". Edited September 26, 2014 by czardas operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 (edited) I found out what it was happening! Sometimes the search was so fast it was less than 0.00001, so Consolewrite("StringRegExp search sec:" & Timerdiff($st)/1000 & "seconds.") printed this: StringRegExp search sec:7.92299668260434e-005 seconds. I read only the first numbers.... but there was also the e-005 at the end! So, ok, now it makes sense: with shorter var it is lightning fast! EDIT: The same was for StringInStr(): that string was at the first positions of the var, so it was a lot fast also! If the matching string is near the beginning of the var, StringInStr is even slightly faster than RegExp. How could I transform the search string to make it safe with all RegExp special chars? Edited September 26, 2014 by frank10 Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 (edited) How could I transform the search string to make it safe with all RegExp special chars? Code needs testing for bugs (it seems to be working though). Local $sDelimString = "msw-3k-031114-a029_c026_1110gt|A009_C009_062709|jkh-4k-novsowia-028614-a094_c071_09206t|9170-22|UT_BCNP_Arils_25|WHtD_Fo_2592|102510-17" Local $sSearchStr = "jkh-4k-novsowia-028614-a094_c071_09206t" Local $iMatch = StringRegExp("|" & $sDelimString & "|", "\|" & StringRegExpReplace($sSearchStr, "[\W]", "\\$0") & "\|") ConsoleWrite($iMatch & @LF) ; 1 = match found : This basically tests for anything between the delimiters "|" by including them in the search. This prevents substring matches. If you want case insensitive matches also then you can try this: Local $iMatch = StringRegExp("|" & $sDelimString & "|", "(?i)\|" & StringRegExpReplace($sSearchStr, "[\W]", "\\$0") & "\|") Edited September 26, 2014 by czardas operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
frank10 Posted September 26, 2014 Author Share Posted September 26, 2014 Thank you, it seems good. Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 You are welcome. It will be more reliable and still very fast. I enjoyed thinking about it. Perhaps someone will post another solution. operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
kylomas Posted September 26, 2014 Share Posted September 26, 2014 frank10, EDIT: The same was for StringInStr(): that string was at the first positions of the var, so it was a lot fast also! If the matching string is near the beginning of the var, StringInStr is even slightly faster than RegExp. This was bothering me when I got pulled away. If you are doing a char by char match stringinstr should be just as fast as SRE (I think...). Forum Rules Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill Link to comment Share on other sites More sharing options...
czardas Posted September 26, 2014 Share Posted September 26, 2014 I thought so too, but it appears that RegExp is much faster when the search space grows beyond a certain size. operator64 ArrayWorkshop Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now