Posted

Hello,

I am trying to manipulate a large CSV file (over 5 million lines, with 50 to 300 characters per line).

I thought I would start with something like:

_FileReadToArray("C:\Folder\Data.csv", $My_Data_Array)

But I get a popup saying "Error allocating memory." as soon as I hit 2 GB of RAM usage.

Does anyone have a good way to approach such a task?

Posted

Could you give more context on what you are trying to do with the data? Do you need all of the data at once or can you process 1 line at a time?

Here is a modified example from the help file to loop through each line of the file using FileReadLine.

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>
#include <WinAPIFiles.au3>

Example()

Func Example()
    Local $timer = TimerInit()
    ; Create a constant variable in Local scope of the filepath that will be read/written to.
    Local Const $sFilePath = "testFile.csv"

;~  Local $sStr = ""
;~  For $i=0 to 5000000
;~      $sStr &= "00000000000000000000000000000000000000000000000" & @CRLF
;~  Next

;~  ; Create a temporary file to read data from.
;~  If Not FileWrite($sFilePath, $sStr) Then
;~      MsgBox($MB_SYSTEMMODAL, "", "An error occurred whilst writing the temporary file.")
;~      Return False
;~  EndIf
;~  ConsoleWrite(TimerDiff($timer) & @CRLF)
;~  $timer = TimerInit()

    ; Open the file for reading and store the handle to a variable.
    Local $hFileOpen = FileOpen($sFilePath, $FO_READ)
    If $hFileOpen = -1 Then
        MsgBox($MB_SYSTEMMODAL, "", "An error occurred when reading the file.")
        Return False
    EndIf

    ; loop through the file lines
    Local $sFileLine
    While 1
        ;read the next line
        $sFileLine = FileReadLine($hFileOpen)
        If @error Then ExitLoop

        ;process this line

    WEnd
    ConsoleWrite(TimerDiff($timer) & @CRLF)

    ; Close the handle returned by FileOpen.
    FileClose($hFileOpen)
EndFunc   ;==>Example

 

Posted
5 hours ago, baconaise said:

Title: Memory limits and massive arrays

But I get a popup saying "Error allocating memory." as soon as I hit 2 GB of RAM usage.

You can find the limits for AutoIt here, for example: Limits/Defaults

Excerpt: Maximum number of elements for an array = 16,777,216

Your problem is apparently the default limit of 2,147,483,647 bytes (2 GB) of RAM.

@kurtykurtyboy's suggestion to process the .csv file line by line is good, assuming that this is what you need. 

Another option would be to read the data into a SQLite database and process it with the respective SQL statements.
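A minimal sketch of that approach using the standard SQLite.au3 UDF might look like this (the table layout, column names, and file paths here are just assumptions to match the CSV described above):

```autoit
#include <SQLite.au3>

; Start the SQLite engine and open an on-disk database (use ":memory:" for RAM-only).
_SQLite_Startup()
Local $hDb = _SQLite_Open("C:\Folder\Sales.db")

; Import the CSV into a table, then let SQL do the aggregation.
_SQLite_Exec($hDb, "CREATE TABLE sales (wholesaler, retailer, category, model, sale_date, trans_date, amount);")
; ... insert the CSV rows here, ideally wrapped in BEGIN/COMMIT for speed ...

; Sum per retailer in a single statement.
Local $aResult, $iRows, $iCols
_SQLite_GetTable2d($hDb, "SELECT retailer, SUM(amount) FROM sales GROUP BY retailer;", $aResult, $iRows, $iCols)

_SQLite_Close($hDb)
_SQLite_Shutdown()
```

The nice part is that once the data is in SQLite, any further analysis is just another SELECT statement rather than another pass over a 5-million-line file.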

 


Posted
22 hours ago, kurtykurtyboy said:

Could you give more context on what you are trying to do with the data? Do you need all of the data at once or can you process 1 line at a time?

Great question. Each CSV line is formatted: Wholesaler,Retailer,Product_Category,Model_Number,Sale_Date,Transaction_Date

I am trying to build something that will tell me about each retailer's orders. I do not believe the sum of data for any individual retailer would encounter the memory cap, so there is probably some merit to the idea that I could read line by line to create an array for each retailer.

I set up a test last night to see how fast it might process, using something like this:

#include <File.au3>
#include <FileConstants.au3>

Global $Lines_Of_Data_In_The_CSV = _FileCountLines("C:\Folder\Data.csv")
Global $Sales_Data_Dump_CSV = FileOpen("C:\Folder\Data.csv", $FO_READ)
Global $New_Array[$Lines_Of_Data_In_The_CSV + 1]
Global $GUI_Reading = GUICtrlCreateLabel($Lines_Of_Data_In_The_CSV, 5, 5, 500, 30)

For $i = 1 To $Lines_Of_Data_In_The_CSV
    $New_Array[$i] = FileReadLine($Sales_Data_Dump_CSV, $i)
    GUICtrlSetData($GUI_Reading, $i & "     " & $New_Array[$i])
Next

In almost a full day, it's only about 10% of the way through the file. That is concerningly slow. Though, it is still better than doing it by hand.

Posted
18 hours ago, genius257 said:

Hi @baconaise.
Have you tried running your script as a 64-bit application?

Now there's an idea! I compiled it using the 64-bit Aut2Exe (because I don't know how to do that directly from SciTE), and it did not give me the error. It's sitting here occupying 1,805 MB of RAM. I'm a little curious why it's happy to hold the data using 1.8 GB in 64-bit when it gave me the 2 GB error in 32-bit.

Posted

For sure, the GUICtrlSetData call is slowing your loop to a crawl!

Here is one way to sum one of the CSV columns for each retailer, using a dictionary object to store and look up the sums. This took between 90 and 160 seconds on my humble laptop to process a test file with 5 million lines. You'll most likely need to adjust the columns for your data, but hopefully this gets you going in the right direction. If you want to do something different with your data, let us know and we'll see what we can come up with.

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>
#include <WinAPIFiles.au3>
#include <Array.au3>

Example()

Func Example()
    Local $timer = TimerInit()
    ; Create a constant variable in Local scope of the filepath that will be read/written to.
    Local Const $sFilePath = "testFile.csv"

;~  Local $sStr = "", $j = 1
;~  For $j = 1 To 5
;~      For $i = 0 To 1000000
;~          $sStr &= "wholesaler" & $j & ",retailer" & $j & ",cat1,model1,saledate,transdate," & $j & @CRLF
;~      Next
;~  Next

;~  ; Create a temporary file to read data from.
;~  If Not FileWrite($sFilePath, $sStr) Then
;~      MsgBox($MB_SYSTEMMODAL, "", "An error occurred whilst writing the temporary file.")
;~      Return False
;~  EndIf
;~  ConsoleWrite(TimerDiff($timer) & @CRLF)
;~  $timer = TimerInit()

    ; Open the file for reading and store the handle to a variable.
    Local $hFileOpen = FileOpen($sFilePath, $FO_READ)
    If $hFileOpen = -1 Then
        MsgBox($MB_SYSTEMMODAL, "", "An error occurred when reading the file.")
        Return False
    EndIf

    ;create dictionary object to hold the sums
    Local $oRetailerSums = ObjCreate("Scripting.Dictionary")

    Local $sFileLine, $aLineParts, $i = 0
    ; loop through the file lines
    While 1
        ;read the next line
        $sFileLine = FileReadLine($hFileOpen)
        If @error Then ExitLoop

        ;process this line
        $aLineParts = StringSplit($sFileLine, ",")
        If @error Then ContinueLoop

        ;add this value to the matching retailer sum
        $oRetailerSums.Item($aLineParts[2]) = $oRetailerSums.Item($aLineParts[2]) + $aLineParts[7]

        ;display line after every 100,000 lines to indicate progress
        $i += 1
        If Mod($i, 100000) = 0 Then
            ConsoleWrite("Line " & $i & "  " & $sFileLine & @CRLF)
        EndIf
    WEnd
    ConsoleWrite(TimerDiff($timer) & @CRLF)

    ; Close the handle returned by FileOpen.
    FileClose($hFileOpen)

    ;Create array of retailer/value pairs for display
    Local $aRetailersFound = $oRetailerSums.Keys()
    Local $aRetailerSums[UBound($aRetailersFound)][2]
    For $i = 0 To UBound($aRetailersFound) - 1
        $aRetailerSums[$i][0] = $aRetailersFound[$i]
        $aRetailerSums[$i][1] = $oRetailerSums.Item($aRetailersFound[$i])
    Next
    _ArrayDisplay($aRetailerSums)

EndFunc   ;==>Example

 

Posted
17 hours ago, baconaise said:

Now there's an idea! I compiled it using the 64-bit Aut2Exe (because I don't know how to do that directly from SciTE), and it did not give me the error. It's sitting here occupying 1,805 MB of RAM. I'm a little curious why it's happy to hold the data using 1.8 GB in 64-bit when it gave me the 2 GB error in 32-bit.

Well, there are a lot of deeper answers if you google it, but basically a 32-bit application has 2 GB of addressable memory available by default, while a 64-bit application has far more (terabytes of virtual address space, if I remember correctly).

So, is it a pretty solution? No.

It doesn't solve the underlying problem if you ever allocate even more memory. But if your application never exceeds the 64-bit limit and you don't want to waste time optimizing when it isn't needed, then 64-bit is the solution :)
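As a side note on compiling 64-bit directly from SciTE: if you have the full SciTE4AutoIt3 package installed, you can put an Au3Wrapper directive at the top of the script and F7 will then build a 64-bit executable. A small sketch:

```autoit
; Place at the top of the script; requires the full SciTE4AutoIt3 package.
#AutoIt3Wrapper_UseX64=y

; The script can verify which build it is running as via a built-in macro:
ConsoleWrite("Running 64-bit: " & @AutoItX64 & @CRLF)
```

@AutoItX64 returns 1 when the running interpreter or compiled exe is 64-bit, 0 otherwise.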

Posted (edited)

And if you want to look further into the dictionary object in particular.. See here.

I used the dictionary in my example because it is a fast and easy way to get a sort of associative array. Fast as in both development and execution.

You could accomplish the same thing by building and searching through regular arrays but it would be much more complicated and likely much slower, depending on how many keys your data ends up with.

Edit: I feel obligated to mention that there are several alternatives for associative arrays floating around. The native AutoIt Map datatype is probably the way to go; I just have an old habit of using the Scripting.Dictionary object.
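For comparison, the per-retailer summing with a native Map (available in recent AutoIt versions) would look roughly like this; the sample lines below just stand in for the CSV read loop, and the column positions follow the earlier StringSplit layout:

```autoit
; Native AutoIt Map instead of Scripting.Dictionary
Local $mRetailerSums[]

; Sample lines standing in for the CSV read loop
Local $aSample[2] = ["w1,retailer1,cat1,model1,sd,td,5", "w1,retailer2,cat1,model1,sd,td,3"]
For $sLine In $aSample
    Local $aLineParts = StringSplit($sLine, ",")
    Local $sRetailer = $aLineParts[2]
    If Not MapExists($mRetailerSums, $sRetailer) Then $mRetailerSums[$sRetailer] = 0
    $mRetailerSums[$sRetailer] += $aLineParts[7]
Next

; MapKeys() returns an array of the retailers found
Local $aKeys = MapKeys($mRetailerSums)
For $sKey In $aKeys
    ConsoleWrite($sKey & " = " & $mRetailerSums[$sKey] & @CRLF)
Next
```

The usage is nearly identical to the dictionary version, minus the COM object creation.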

Edited by kurtykurtyboy
