I created a script that splits a text file into multiple files based on the first two characters of each line. Example:


about my brother and me.
About me?
Naturally, you can't know.
Nature must take her course!

The result of this example will be two files:


about my brother and me.
About me?


Naturally, you can't know.
Nature must take her course!

As you can see, the first two characters become the file name.

My script does the job.

So, what's the problem?

The problem is that my script is very slow with big files.

I tried it on a text file with 1,000,000 lines, and after about half an hour it had finished only 20%.


Here is my script:

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>
#include <scriptingdic.au3>
;Download from: https://www.autoitscript.com/forum/topic/182334-scripting-dictionary-modified/

Global $Lines    
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)  
Global $initArr = ["----"]

Global $dict = _InitDictionary()

$Total = UBound($Lines)
$LastRound = 0

For $i = 0 To UBound($Lines)-1  Step +1 
    ;Extract the first two characters of the current line
    $FirstTwoChar =  StringMid($Lines[$i], 1, 2)
    ;Replace symbols that are not valid for file names
    $FirstTwoChar = StringReplace($FirstTwoChar, " ", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "<", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ">", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "?", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, '"', "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "|", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, ":", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "\", "_")
    $FirstTwoChar = StringReplace($FirstTwoChar, "/", "_")

    ;Add the first two characters as a key in the dictionary with an array as its value.
    If Not _ItemExists($dict, $FirstTwoChar) Then
        $initArr[0] = $FirstTwoChar
        _AddItem($dict, $FirstTwoChar, $initArr)
    EndIf
    ;Add the current line to the array.
    $tmpArray = _Item($dict, $FirstTwoChar)
    _ArrayAdd($tmpArray ,$Lines[$i])
    _ChangeItem($dict, $FirstTwoChar, $tmpArray)

    ;Show progress on the screen
    $Percent = $i / $Total * 100
    If Round($Percent) <> $LastRound Then
        $LastRound = Round($Percent)
    EndIf
Next

;Save each array as text file
For $Key In $dict
    $FinalArray = _Item($dict, $Key)
    $FileName = $FinalArray[0]
    _ArrayDelete($FinalArray, 0)
    _FileWriteFromArray("result\" & $FileName & ".txt", $FinalArray)
Next

My limitations:
* Lines must stay in the same order. The order of lines cannot change while the file is processed.

Any ideas on how to make this script faster?


Example of ORIGINAL text file.rar

I expect your slowdown comes from _ArrayAdd. Each time you call it, AutoIt re-dimensions the array. Can you define the array size at the start and then assign values?

Edit: Or burn some memory and set up large empty arrays then assign to them.
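
The suggestion above could be sketched roughly as follows: start with a pre-sized array, assign by index, and only ReDim (by doubling) when the array is full. The names $aBucket and $iCount and the starting size of 1024 are assumptions made for illustration, not part of the original script.

```autoit
; Sketch: fill a pre-sized array by direct assignment instead of _ArrayAdd.
; $aBucket, $iCount and the capacity of 1024 are hypothetical choices.
Local $aBucket[1024] ; initial capacity
Local $iCount = 0    ; number of slots actually in use

For $i = 1 To 5000
    ; Grow by doubling only when the array is full, so ReDim runs a few times
    ; in total rather than once per added line
    If $iCount = UBound($aBucket) Then ReDim $aBucket[UBound($aBucket) * 2]
    $aBucket[$iCount] = "line " & $i ; direct assignment by index
    $iCount += 1
Next

ReDim $aBucket[$iCount] ; trim the unused slots before writing the array out
ConsoleWrite("Stored " & UBound($aBucket) & " lines" & @CRLF)
```

ReDim keeps the existing elements, so the lines stay in the order they were added.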

Problem solving step 1: Write a simple, self-contained, running, replicator of your problem.

Thanks for the reply, SlackerAl.

I changed the following lines:

Global $initArr = ["----"]
to:
Global $initArr[2500]

and:
_ArrayAdd($tmpArray ,$Lines[$i])
to:
_ArrayPush($tmpArray, $Lines[$i])

(I also added a new line after _ArrayPush, "$tmpArray[0] = $FirstTwoChar", to make sure the first item is always the file name.)
But no luck; the script is still slow.

I'm not sure of the cost of a push (that's still an index change to everything in the array). Can you not re-work your code to directly assign your values to the array(s)? How many possible 2 letter combos are you expecting? Is it the full 26^2 or just a small subset of that? Could you collect each combo in its own array with direct assignment?


Edit: OK I see you have a large number of possible pairs.... I'll think about it for a bit
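
One way to get direct assignment even with many possible pairs is a stable two-pass layout (essentially a counting sort by key): count the lines per key first, then write every line straight into its final slot in one big array. The sketch below is only an illustration under assumptions: all variable names are hypothetical, a Scripting.Dictionary is used just to number the keys, and the file-name sanitizing from the original script is left out. Note also that _ArrayPush keeps the array size fixed and shifts every element, and that fetching an array from the dictionary and storing it back for every line presumably copies it each time, which may be why the change above did not help.

```autoit
; Sketch only: all names are hypothetical, and key sanitizing is omitted.
; $aLines stands in for the array read from ORIGINAL.txt.
Local $aLines = ["about my brother and me.", "Naturally, you can't know.", _
        "About me?", "Nature must take her course!"]
Local $iTotal = UBound($aLines)

Local $oIds = ObjCreate("Scripting.Dictionary")
$oIds.CompareMode = 1 ; text compare, so "ab" and "Ab" share one bucket

Local $aId[$iTotal]    ; bucket number of each line
Local $aCount[$iTotal] ; lines per bucket (there are at most $iTotal buckets)
Local $sKey, $b

; Pass 1: number the buckets in order of first appearance and count their lines
For $i = 0 To $iTotal - 1
    $sKey = StringLeft($aLines[$i], 2)
    If Not $oIds.Exists($sKey) Then $oIds.Add($sKey, $oIds.Count)
    $aId[$i] = $oIds.Item($sKey)
    $aCount[$aId[$i]] += 1
Next

; Turn the counts into the start offset of each bucket inside one big array
Local $iBuckets = $oIds.Count
Local $aStart[$iBuckets], $aNext[$iBuckets]
Local $iPos = 0
For $b = 0 To $iBuckets - 1
    $aStart[$b] = $iPos
    $aNext[$b] = $iPos
    $iPos += $aCount[$b]
Next

; Pass 2: put each line into its bucket's next free slot; the original order
; within each bucket is preserved because lines are visited in file order
Local $aOut[$iTotal]
For $i = 0 To $iTotal - 1
    $aOut[$aNext[$aId[$i]]] = $aLines[$i]
    $aNext[$aId[$i]] += 1
Next

; Each bucket is now the contiguous range $aStart[$b] .. $aNext[$b] - 1 of $aOut
For $sKey In $oIds
    $b = $oIds.Item($sKey)
    ConsoleWrite($sKey & ": rows " & $aStart[$b] & " to " & ($aNext[$b] - 1) & @CRLF)
Next
```

Since _FileWriteFromArray accepts optional $iBase and $iUBound arguments, each range could then be written to its own file without copying it into a separate array.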


Try this baby :)

#include <Array.au3>
#include <AutoItConstants.au3>
#include <File.au3>

Global $Lines
_FileReadToArray("ORIGINAL.txt", $Lines, $FRTA_NOCOUNT)

Global $oDict = ObjCreate("Scripting.Dictionary")

Local $Total = UBound($Lines), $LastRound = 0, $FirstTwoChar

For $i = 0 To $Total - 1
  ;Extract the first two characters of the current line
  $FirstTwoChar = StringInStr(' <>?"|:\/', StringMid($Lines[$i], 1, 1)) ? "_" : StringMid($Lines[$i], 1, 1)
  $FirstTwoChar &= StringInStr(' <>?"|:\/', StringMid($Lines[$i], 2, 1)) ? "_" : StringMid($Lines[$i], 2, 1)

  ;Add the first two characters as a key in the dictionary with its value.
  If Not $oDict.Exists($FirstTwoChar) Then
    $oDict.Add($FirstTwoChar, $Lines[$i] & @CRLF)
  Else   ;Add the current line to the dict.
    $oDict.Item($FirstTwoChar) = $oDict.Item($FirstTwoChar) & $Lines[$i] & @CRLF
  EndIf

  ;Show progress on the screen
  If Not Mod($i, 100) Then
    $Percent = $i / $Total * 100
    If Round($Percent) <> $LastRound Then
      ToolTip('...' & Round($Percent) & "%", 0, 5)
      $LastRound = Round($Percent)
    EndIf
  EndIf
Next

;Save each item as text file
For $Key In $oDict
  FileWrite("result\" & $Key & ".txt", $oDict.Item($Key))
Next

Not fully tested, but I believe it is quite close to what you are looking for...
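
Two details may be worth checking when adapting the code above. A Scripting.Dictionary compares keys case-sensitively by default, so "ab" and "Ab" become separate keys; on a case-insensitive file system both would target the same file, and their lines could end up grouped by key rather than in the original order. Setting CompareMode to 1 (text compare) before adding any items avoids that. The sketch below also folds the character replacement into one regular expression. The helper name _KeyFromLine and the sample array are hypothetical, and adding "*" to the replaced characters is an assumption beyond the original list.

```autoit
; Sketch only: _KeyFromLine and $aSample are hypothetical names.
Local $oDict = ObjCreate("Scripting.Dictionary")
$oDict.CompareMode = 1 ; 1 = text compare; must be set while the dictionary is empty

Local $aSample = ["about my brother and me.", "About me?", _
        "Naturally, you can't know.", "Nature must take her course!"]
Local $sKey

For $i = 0 To UBound($aSample) - 1
    $sKey = _KeyFromLine($aSample[$i])
    If Not $oDict.Exists($sKey) Then
        $oDict.Add($sKey, $aSample[$i] & @CRLF)
    Else
        $oDict.Item($sKey) = $oDict.Item($sKey) & $aSample[$i] & @CRLF
    EndIf
Next

; Show what would be written to each file
For $sKey In $oDict
    ConsoleWrite($sKey & ".txt:" & @CRLF & $oDict.Item($sKey))
Next

; Build a file-name-safe key from the first two characters of a line
Func _KeyFromLine($sLine)
    ; Spaces and characters that are invalid in Windows file names become "_"
    Return StringRegExpReplace(StringLeft($sLine, 2), '[ <>?"|:\\/*]', "_")
EndFunc
```

With text compare, the key keeps the casing of its first occurrence, so the first sample line decides the file name for both "about..." and "About...". If the "result" folder might not exist yet, calling DirCreate("result") before the write loop would be a reasonable precaution.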

#include <File.au3>

$str = "about my brother and me." & @LF & _
"About me?" & @LF & _
"Babout my brother and me." & @LF & _
"BAbout me?" & @LF & _
"Naturally, you can't know." & @LF & _
"Nature must take her course!"

Do
    ; Grab every line that starts (case-insensitively) with the first two characters of the remaining text
    $a = StringRegExp($str, "(?:\A|\R)(?im:" & StringLeft($str, 2) & ".*)", 3)

    _FileWriteFromArray(StringLeft($str, 2) & ".txt", $a)

    ; Remove the matched lines, then strip leading whitespace from what is left
    $str = StringStripWS(StringRegExpReplace($str, "(?:\A|\R)(?im:" & StringLeft($str, 2) & ".*)", ""), 1)
Until $str = ""



@Nine @iamtheky @SlackerAl
Hello guys, and thanks for your contributions.
After testing, Nine's code was the fastest, and compared with my old code it was a huge improvement.
It took only one minute to process 100000 lines, which is great compared to the old one. 👍

I can work with this for now.
Thanks, everyone.



20 hours ago, Exit said:

So it's really time to introduce Maps in the production version as well.

They don't work correctly yet, so they won't be added. Scripting Dictionary can be put into a UDF to do the same, or almost the same, thing so it's not really that much of a rush to add broken implementations.

the unofficial udf is pretty sexy tho, and nobody is making a better one.  😎


3 minutes ago, Nine said:

There is already a UDF.  But it is useless.  Most functions replace a one liner by another one liner...


Besides Scripting.Dictionary objects being more verbose to use, they need care to generalize, since they don't handle Int64 values. See the notes in the _ArrayUnique() help, for instance.

1 hour ago, Nine said:

Most functions replace a one liner by another one liner...

A lot of the Misc.au3 functions are like that, look at RunDos, replaces a Run statement for the lazy. It's mainly a documentation issue.

