AspirinJunkie Posted January 4, 2021 Posted January 4, 2021 (edited) I've noticed that for the task of "reading a file row by row into an array" FileReadToArray is consistently slower than a FileRead + StringSplit. The following script: Spoiler expandcollapse popup#include <String.au3> #include <WinAPIFiles.au3> ; size steps: Global $aN[] = [1024, 10240, 102400, 1048576, 10485760, 104857600] Global Const $sFILEPATH = @ScriptDir & "\Testfile.txt" Global $nBytes = 0, $nLineLength, $iTimer, $nTimerDiff, $aLines For $nFileSize In $aN ; create test file Local $hFile = FileOpen($sFILEPATH, 2 + 512) Do $nLineLength = Random(1, 20, 1) ; random line length FileWriteLine($hFile, _StringRepeat("a", $nLineLength)) $nBytes += $nLineLength + 2 ; "+2" for @crlf Until $nBytes >= $nFileSize FileFlush($hFile) FileClose($hFile) ConsoleWrite("Filesize: " & _WinAPI_StrFormatByteSize($nFileSize) & @CRLF) ; todo: truncate file to exact size with _WinAPI_SetFilePointer and _WinAPI_SetEndOfFile ; test StringSplit() $iTimer = TimerInit() $aLines = StringSplit(FileRead($sFILEPATH), @CRLF, 3) $nTimerDiff = TimerDiff($iTimer) $aLines = "" ; clean up ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n", "StringSplit", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) )) ; test FileReadToArray() $iTimer = TimerInit() $aLines = FileReadToArray($sFILEPATH) $nTimerDiff = TimerDiff($iTimer) $aLines = "" ; clean up ConsoleWrite(StringFormat("\t% 26s: % 8.2f ms (%s PeakWorkingSetSize)\n\n", "FileReadToArray", $nTimerDiff, _WinAPI_StrFormatByteSize(ProcessGetStats(@AutoItPID, 0)[1]) )) ; clean up FileDelete($sFILEPATH) Next Results in the following output for me: Spoiler Filesize: 1,00 KB StringSplit: 0.16 ms (13,5 MB PeakWorkingSetSize) FileReadToArray: 0.15 ms (13,7 MB PeakWorkingSetSize) Filesize: 10,0 KB StringSplit: 0.45 ms (13,8 MB PeakWorkingSetSize) FileReadToArray: 0.58 ms (13,8 MB PeakWorkingSetSize) Filesize: 100 KB StringSplit: 3.16 ms (14,5 MB PeakWorkingSetSize) FileReadToArray: 4.90 ms (14,8 MB PeakWorkingSetSize) Filesize: 1,00 MB StringSplit: 32.73 ms (26,1 MB PeakWorkingSetSize) FileReadToArray: 47.17 ms (28,4 MB PeakWorkingSetSize) Filesize: 10,0 MB StringSplit: 315.15 ms (140 MB PeakWorkingSetSize) FileReadToArray: 485.55 ms (163 MB PeakWorkingSetSize) Filesize: 100 MB StringSplit: 3191.81 ms (1,25 GB PeakWorkingSetSize) FileReadToArray: 4635.66 ms (1,47 GB PeakWorkingSetSize) And the output with switched call sequence: Spoiler Filesize: 1,00 KB FileReadToArray: 0.15 ms (13,5 MB PeakWorkingSetSize) StringSplit: 0.26 ms (13,7 MB PeakWorkingSetSize) Filesize: 10,0 KB FileReadToArray: 0.58 ms (13,9 MB PeakWorkingSetSize) StringSplit: 0.37 ms (14,0 MB PeakWorkingSetSize) Filesize: 100 KB FileReadToArray: 4.99 ms (14,9 MB PeakWorkingSetSize) StringSplit: 3.16 ms (14,9 MB PeakWorkingSetSize) Filesize: 1,00 MB FileReadToArray: 49.30 ms (28,5 MB PeakWorkingSetSize) StringSplit: 31.74 ms (28,5 MB PeakWorkingSetSize) Filesize: 10,0 MB FileReadToArray: 480.50 ms (163 MB PeakWorkingSetSize) StringSplit: 323.10 ms (163 MB PeakWorkingSetSize) Filesize: 100 MB FileReadToArray: 5050.74 ms (1,47 GB PeakWorkingSetSize) StringSplit: 3202.90 ms (1,47 GB PeakWorkingSetSize) This surprises me, since I first assume that the native function offers more optimization possibilities than having to create an array in AutoIt via an intermediate string. So I thought that maybe FileReadToArray is more economical with memory. But again FileReadToArray seems to consume more memory than the StringSplit variant. For this I got the PeakWorkingSetSize after the call of the respective function. After FileReadToArray the Size is always higher than with StringSplit. Since this is not the case if one changes the order of the two functions (see above), one can assume that FileReadToArray actually has this difference. The difference is mostly the double file size. Which would make sense if the file is completely saved internally as a UTF-16 string. Can anyone explain to me why FileReadToArray performs so unexpectedly poor - or am I missing something? Edited January 4, 2021 by AspirinJunkie
Nine Posted January 4, 2021 Posted January 4, 2021 2 hours ago, AspirinJunkie said: Can anyone explain to me why FileReadToArray performs so unexpectedly poor - or am I missing something? Since I have no access to source code, I cannot give you a true explanation. But I can give you my first take on the difference : #include <Array.au3> #include <Constants.au3> $str = "abcAbcAbcaBc" $arr = StringSplit($str, "A") _ArrayDisplay($arr, "String Split") FileWrite ("Test.txt", "abc" & @CR & "bc" & @LF & "bcAbc" & @CRLF & "test") $arr = FileReadToArray("Test.txt") _ArrayDisplay($arr, "File Read") FileDelete("Test.txt") 1) StringSplit is case-sensitive. While FileReadToArray should also be, but I cannot prove it. This is one "possible" reason. 2) There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters. More logic implies longer execution. You cannot achieve this with StringSplit BTW. “They did not know it was impossible, so they did it” ― Mark Twain Spoiler Block all input without UAC Save/Retrieve Images to/from Text Monitor Management (VCP commands) Tool to search in text (au3) files Date Range Picker Virtual Desktop Manager Sudoku Game 2020 Overlapped Named Pipe IPC HotString 2.0 - Hot keys with string x64 Bitwise Operations Multi-keyboards HotKeySet Recursive Array Display Fast and simple WCD IPC Multiple Folders Selector Printer Manager GIF Animation (cached) Screen Scraping Multi-Threading Made Easy
jpm Posted January 4, 2021 Posted January 4, 2021 Hi, the difference comes from the fact that the FileReadToArray is reading line by line as opposed to FileRead which read the whole file. Later on the splitting will work on the whole read buffer and act on it allocating entries of the array more quicker as the delimiter is just the @CRLF. In the case of FileReadToArray the delimiter can be @CR Or @LF or @CRLF which done char by char. That certainly the main overhead. If we want the same speed FileReadToArray will need a new parameter defining the delimiter. Cheers argumentum and FrancescoDiMuro 1 1
AspirinJunkie Posted January 4, 2021 Author Posted January 4, 2021 1 hour ago, Nine said: There is a tad more logic in FileReadToArray, as it searches for different types (and combinations) of end-of-line characters. Shure - that might be one reason. 6 minutes ago, jpm said: the difference comes from the fact that the FileReadToArray is reading line by line as opposed to FileRead which read the whole file O.k. - the difference in performance. But the difference in memory consumption?
Bilgus Posted January 8, 2021 Posted January 8, 2021 My guess is that an array is instantiated by a power of 2 to limit resizing and you are probably seeing the slack in your ram usage
AspirinJunkie Posted January 10, 2021 Author Posted January 10, 2021 On 1/8/2021 at 4:06 PM, Bilgus said: My guess is that an array is instantiated by a power of 2 to limit resizing and you are probably seeing the slack in your ram usage Hm the difference in memory consumption is quite noticeably the same as the string size (internally represented as UTF-16). If it's really the difference because of the dynamic array resizing, then the difference should be in worst case [number of lines] * 8 Byte (or 4 Byte at 32 Bit). But the difference we see here is much greater.
Bilgus Posted January 10, 2021 Posted January 10, 2021 (edited) and then rounded to the next power of 2.. Typically in schemes like this you would start with some low number of indices (power of 2) 1024 for instance if that wasn't enough then you might double it to 2048 or again to 4096 at some point the work will fit or you will run out of memory. in this case lets say we had a need for 2049 indices and we now have 4096 provisioned thats 50% more than we needed! Generally resizing arrays take time and processor so you typically see the array get overprovisioned to cut down on the resizes needed. Not sure how Autoit cleans up the ram but my guess is that it is never cleaned up while the array is in scope.. Edited January 10, 2021 by Bilgus Splainin
AspirinJunkie Posted January 10, 2021 Author Posted January 10, 2021 (edited) Exact - in worst case this would be double of the number of lines in the file. If you have 1000 lines the array would be 1024 - difference of 24 elements. Worst case instead: If the file would have 1024 (more precise 1025) lines then the array would have 2048 elements - diffence of 1024 lines. Just as i wrote. Edited January 10, 2021 by AspirinJunkie
Bilgus Posted January 10, 2021 Posted January 10, 2021 you assume its lockstep to the size but maybe they quadruple each resize or double for the first 3 resizes and x10 for the next no clue but if you really want to know go fill the array and have a look at the memory I believe sysinternals has a virtual ram viewer
AspirinJunkie Posted January 10, 2021 Author Posted January 10, 2021 (edited) 49 minutes ago, Bilgus said: you assume its lockstep to the size but maybe they quadruple each resize or double for the first 3 resizes and x10 for the next no clue This would be very unlikely. The most common factors are 2 (C++ Arraylist, Python) and 1.5 (Java Arraylist). But well - we can check this empirically with a concrete example: FileSize (ANSI encoded): 104,857,600 Bytes Number of lines: 7,549,361 Difference in memory consumption: 239,345,664 Bytes Stringsize in memory (UTF-16 encoded): 209,715,200 Bytes Difference explainable through array resize factor 1.5 28,063,771 Bytes 2 6,713,976 Bytes 3 54,396,368 Bytes 4 73,822,840 Bytes 5 17,730,112 Bytes 6 20,226,680 Bytes 7 262,433,968 Bytes 8 73,822,840 Bytes 9 283,978,880 Bytes 10 19,605,112 Bytes I think we can safely disregard factors other than these. As can be seen, the differences that can be explained by the resize factor are significantly smaller than the actual difference. Or they are significantly larger than this and therefore cannot be used as a reason either. The array resize factor therefore does not seem to be the reason for the difference. Edited January 10, 2021 by AspirinJunkie
Bilgus Posted January 10, 2021 Posted January 10, 2021 I'm not sure what your numbers are trying to show there I'm gonna need a whole lot more explanation of your methodology before these numbers even start to make sense to safely disregard Are you saying you tested this in a different language then? Next are you taking in account the variant nature of the backend of autoit variables and that each array spot could be holding a string that is also power of 2 growing?
AspirinJunkie Posted January 10, 2021 Author Posted January 10, 2021 1 minute ago, Bilgus said: I'm not sure what your numbers are trying to show there I'm gonna need a whole lot more explanation of your methodology before these What to explain? I explained it already. Take the number of lines. Determine the next power of the factor. Calc the difference and multiply by 8 Bytes. (Arrays in AutoIt are pointer arrays because every element points to a variant structure. This could be easily checked) This would be the difference in memory consumption if these are due to the dynamic array enlargement. Why don't you explain why you suspect the dynamic array so much? That was your assumption - but so far this suspicion has not been backed up with anything concrete. My assumption is, that it has something to do with the file string in memory. I also have a concrete hint for this: The relation of the string size to the whole memory difference is stable independent of the size of the file: At 1 MB it is 87% of the total difference, at 10 MB it is also 87%, at 100 MB it is 88%. If the differences were really due to the array resizing, the differences would have to vary more significantly, since completely different factor-powers would be hit.
Bilgus Posted January 10, 2021 Posted January 10, 2021 I figure over provisioning is the most likely difference since the arrays need to have enough room to search for the line endings without any hints unlike the stringsplit which I figure is more optimized to the workload and has an upperbound from the start
AspirinJunkie Posted January 11, 2021 Author Posted January 11, 2021 This is an theory (a good one, by the way, in my opinion). But for now, this is just the first step. In the next step, a theory must be verified against the reality. In the practical examination of this, two aspects emerge that speak against this hypothesis: We can calculate the expected effect of overprovisioning for a concrete resize factor and the results showed that none of these factors is suitable to explain the memory difference because the actual difference is far too different from the expected once. The relation of the string size to the whole memory difference is remarkable stable even over completely different file sizes. If the reason is really overprovisioning then we should not be able to observe this effect. The ratio of file size/string size to memory consumption difference should result in a staircase function when overprovisioning is the reason. In fact, however, we observe a linear dependency. Both empirical observations argue against the theory that overprovisioning is the main reason for the differences. So far, I have not seen an empirical aspect that speaks in favor of the theory. However, my problem remains that this is a bit like reading from a crystal ball. The methodology for determining the memory difference is extremely shaky. However, I do not have a better approach so far and therefore try to make deductions about the inner mode of functioning on this basis. In the end, it is about giving the devs hints where something might be wrong so that they can look more specifically if there is really a potential for improvement.
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now