CSV_Split & Listview Array

mikell · November 27, 2016

Hobbyist,
The suggestion by czardas - find the reason why the file is corrupted and manage to get a correct file - is by far the very best way
Anyway eventually you might want to try to "repair" the corrupted file.
It's an extremely hazardous way. It assumes a lot of things :
- the first line is the only reliable line - showing the correct fields
- you know exactly what the corrupted file should be if not corrupted - to fit the first line
- the next corrupted files you could get will be corrupted in exactly the same way - so the "repair" will also work
- etc
The code below is such a try using the posted file. There is absolutly no evidence that it would work with another file, it's nothing but a concept

#include <Array.au3>

$f = FileRead("February2016_205MY LF.CSV")

$f = StringReplace($f, '""', '"')
$f = StringReplace($f, ',",', ',"",')
$f = StringRegExpReplace($f, '(\R",)', "")
$f = StringRegExpReplace($f, ',(?=,|\R)', "")

;Msgbox(0,"", $f)

$array = _CSVSplit($f)
_ArrayDisplay($array," split")


Func _CSVSplit($string) ; Parses csv string input and returns a one or two dimensional array
;     Local Const $sMessage = "Select a single file .CSV file."
;  Local $sFileOpenDialog = FileOpenDialog($sMessage, "" & "c:\", "Text & Commas(*.csv)", $FD_FILEMUSTEXIST)
; $statementfile = $sFileOpenDialog

$sDelim = ","
$csv = $string  ;FileRead ($statementfile)

    If Not IsString($csv) Or $csv = "" Then Return SetError(1, 0, 0) ; Invalid string
    If Not IsString($sDelim) Or $sDelim = "" Then Return SetError(2, 0, 0) ; Invalid string

    $csv = StringRegExpReplace($csv, "[\r\n]+\z", "") ; [Line Added] Remove training breaks
    ConsoleWrite($csv)
    Local $iOverride = 63743, $asDelim[3] ; $asDelim => replacements for comma, new line and double quote
    For $i = 0 To 2
        $asDelim[$i] = __GetSubstitute($csv, $iOverride) ; Choose a suitable substitution character
        If @error Then Return SetError(3, 0, 0) ; String contains too many unsuitable characters
    Next
    $iOverride = 0

    Local $aArray = StringRegExp($csv, '\A[^"]+|("+[^"]+)|"+\z', 3) ; Split string using double quotes delim - largest match

    $csv = ""

    Local $iBound = UBound($aArray)
    For $i = 0 To $iBound -1
        $iOverride += StringInStr($aArray[$i], '"', 0, -1) ; Increment by the number of adjacent double quotes per element
        If Mod ($iOverride +2, 2) = 0 Then ; Acts as an on/off switch
            $aArray[$i] = StringReplace($aArray[$i], $sDelim, $asDelim[0]) ; Replace comma delimeters
            $aArray[$i] = StringRegExpReplace($aArray[$i], "(\r\n)|[\r\n]", $asDelim[1]) ; Replace new line delimeters
        EndIf
        $aArray[$i] = StringReplace($aArray[$i], '""', $asDelim[2]) ; Replace double quote pairs
        $aArray[$i] = StringReplace($aArray[$i], '"', '') ; Delete enclosing double quotes - not paired
        $aArray[$i] = StringReplace($aArray[$i], $asDelim[2], '"') ; Reintroduce double quote pairs as single characters
        $csv &= $aArray[$i] ; Rebuild the string, which includes two different delimiters
    Next
    $iOverride = 0

    $aArray = StringSplit($csv, $asDelim[1], 2) ; Split to get rows
    $iBound = UBound($aArray)
    Local $aCSV[$iBound][2], $aTemp
    For $i = 0 To $iBound -1
        $aTemp = StringSplit($aArray[$i], $asDelim[0]) ; Split to get row items
        If Not @error Then
            If $aTemp[0] > $iOverride Then
                $iOverride = $aTemp[0]
                ReDim $aCSV[$iBound][$iOverride] ; Add columns to accomodate more items
            EndIf
        EndIf
        For $j = 1 To $aTemp[0]
            If StringLen($aTemp[$j]) Then
                If Not StringRegExp($aTemp[$j], '[^"]') Then ; Field only contains double quotes
                    $aTemp[$j] = StringTrimLeft($aTemp[$j], 1) ; Delete enclosing double quote single char
                EndIf
                $aCSV[$i][$j -1] = $aTemp[$j] ; Populate each row
            EndIf
        Next
    Next

    If $iOverride > 1 Then
        Return $aCSV ; Multiple Columns
    Else
        For $i = 0 To $iBound -1
            If StringLen($aArray[$i]) And (Not StringRegExp($aArray[$i], '[^"]')) Then ; Only contains double quotes
                $aArray[$i] = StringTrimLeft($aArray[$i], 1) ; Delete enclosing double quote single char
            EndIf
        Next
        Return $aArray ; Single column
    EndIf
EndFunc ;==> _CSVSplit

Func __GetSubstitute($string, ByRef $iCountdown)
    If $iCountdown < 57344 Then Return SetError(1, 0, "") ; Out of options
    Local $sTestChar
    For $i = $iCountdown To 57344 Step -1
        $sTestChar = ChrW($i)
        $iCountdown -= 1
        If Not StringInStr($string, $sTestChar) Then
            Return $sTestChar
        EndIf
    Next
    Return SetError(1, 0, "") ; Out of options
EndFunc ;==> __GetSubstitute

Hobbyist · November 27, 2016

Mikell & Czardas

Thanks again for so much help.

First, I receive the file as is and just save it to my computer. The chances of me exercising control over the sender to conform to some format is next to none. Thus my quest.

After the several posts, I came to the same conclusion the only reliable line of format was the first line which indicated (to me) some column organization. So with my last post I thought the only conceiveable thing would be to "gut" all the CR as supplied as they are obviously the problem(to me) and then to reconstruct each line by using line one as the template and add my CR. In this case get to the 4 lines of readable data instead of the 7 misinformation lines.

So with this new input I will take another look and definitely get back. Both of you have been helpful and enlightened me along the way.

Hobbyist

Hobbyist · November 28, 2016

Mikell

I ran the script and it performs great. Fully understand that works for this format and definitely gives me more knowledge for possible other "bad" formats.

I went to the hlp files to learn about the StringRegExpReplace; found some but not all. Would you clarify something for me?

StringRegExpReplace($f, '(\R",)', "")

(\R",):
I think I understand that \R is for all CR's
Didn't understand or could find what the ", is for right
after the \R and inside the ().

StringRegExpReplace($f, ',(?=,|\R)', "")

(?=,|\R):
Think the ?=, matches started at the , for current position.
\R is same as above or does the | change it? Couldn't
find |

So I'm part of the way there, but lacking completing understanding for future.

Thanks again for your help

Hobbyist

mikell · November 28, 2016

Regular expressions are great tools but a little hard to learn
Sorry, I didn't put comments in my previous code - I didn't even know if this code could be really useful for your needs

So here it is, including comments. You can insert some msgboxes/Consolewrite's to have a better look on how it works
Feel free to ask for more explanations if needed

$f = FileRead("February2016_205MY LF.CSV")

; replace double double quotes by single double quotes
$f = StringReplace($f, '""', '"')    
; replace single double quotes by double double quotes when alone and enclosed in commas
$f = StringReplace($f, ',",', ',"",')
; remove groups composed exactly of : newline + double quote + comma
; \R matches any newline sequence @crlf, @cr or @lf
$f = StringRegExpReplace($f, '(\R",)', "")
; remove commas when followed by another comma OR by newline
; (?=...) means 'followed by , | means OR
$f = StringRegExpReplace($f, ',(?=,|\R)', "")

Msgbox(0,"", $f)

czardas · November 29, 2016

It's highly risky and unsound: especially since you can't control the corruption. This patch will fix this one file and corrupt many others. @mikell knows this, but @Hobbyist you need to be careful.

Hobbyist · November 29, 2016

Yes. I can see the danger and with eyes wide open and you are very right about the being careful. When starting to get involved with CSV files, I very much thought they were all standardized. Wow, wake up call.

So this one does help in this case and is a good experience builder - thanks to both of you.

My second (newbie effort) mistake was thinking it possible to read a CSV into a preview screen(listview in this case) and move columns around to "my standard" sequence/template.

So its wax on, wax off. Arghhhh.

Hobbyist

kylomas · November 29, 2016

Hobbyist,

4 hours ago, Hobbyist said:

My second (newbie effort) mistake was thinking it possible to read a CSV into a preview screen(listview in this case) and move columns around to "my standard" sequence/template.

You can move columns around like this...

#include <GUIConstantsEx.au3>
#include <GuiListBox.au3>
#include <WindowsConstants.au3>
#include <GuiListView.au3>

Local $gui010 = GUICreate("Example", 420, 420)
Local $lv0010 = GUICtrlCreateListView('', 05, 05, 410, 350, _
        BitOR($LVS_SHOWSELALWAYS, $LVS_SINGLESEL), _
        BitOR($LVS_EX_HEADERDRAGDROP, $LVS_EX_FULLROWSELECT, $LVS_EX_GRIDLINES))

Local $hlv = GUICtrlGetHandle($lv0010)

_GUICtrlListView_AddColumn($hlv, "A", 100)
_GUICtrlListView_AddColumn($hlv, "B", 100)
_GUICtrlListView_AddColumn($hlv, "C", 100)
_GUICtrlListView_AddColumn($hlv, "D", 100)

For $i = 0 To 9
    _GUICtrlListView_AddItem($hlv, $i & '-0')
    _GUICtrlListView_AddSubItem($hlv, $i, $i & '-1', 1)
    _GUICtrlListView_AddSubItem($hlv, $i, $i & '-2', 2)
    _GUICtrlListView_AddSubItem($hlv, $i, $i & '-3', 3)
Next

GUISetState(@SW_SHOW)

Local $msg

While 1
    $msg = GUIGetMsg()
    Switch $msg
        Case $GUI_EVENT_CLOSE
            Exit
    EndSwitch
WEnd

if that helps at all.

kylomas

Edited November 29, 2016 by kylomas

Hobbyist · November 29, 2016

kylomas

Yes this is good and helps and i didn't have a particular problem with the column movement part of the GUI.

My real problem came with the loaded data not having equal fields due to the CSV format. Of course I don't know this when receiving a file, so I wanted to preview it before running it in my other sript. I have no control over the sender. In this case the data as received shows the following when loaded into a listview or array:

"Status","Date","Description","Debit","Credit"
"Cleared","1/16/2016","CHINA STARFIRE SAINT Elsewhere XX ",,
",""19.92"",",,,,
"Cleared","1/27/2016","THE HOME DEPOT 3028 SAINT Elsewhere XX ",,
",""9.09"",",,,,
"Cleared","2/5/2016","PAYMENT THANK YOU",,
","",""25.00",,,,

My goal was to find such errors in the preview, get them to the proper position and then save it to a new correct file.

Sort of - isolate, move, repair, save process. I am beginning to think this is not a possibility given the character of CSV formatting or lack thereof.

Hobbyist

Hobbyist · December 3, 2016

Czardas & Mikell

I have been continuing to look at CSV manipulations and have a question if you would help with.

If I have a CSV file with multiple lines, as in the above example would it be misguiding to do the following as a simple process:

Use FileReadLine.

Read first line to establish the number of columns.

FileReadLine the entire file, which if I understand the help file, will remove all carriage controls at the end of each line and also then concatenate each line so as to result in one continuous line.

Take resulting line and "insert" the carriage control after each nth set of fields(for columns).

What are the pitfalls of doing it this way?

Thanks

Hobbyist

mikell · December 3, 2016

Hobbyist,

"My real problem came with the loaded data not having equal fields"
So the answer is in my post #21.
My little code worked because I could compare the initial file with the probable expected result, and I had to guess what this result should be
But as I warned and as czardas pointed out, it's very risky because such a repair would work for several files only if these files are corrupted in exactly the same way
If you try your "split way" on the previous file, you will probably get a wrong result. In corrupted files the error fields can be in various places... in this case I'm not sure that a common and reliable method can be defined to treat them

kylomas · December 4, 2016

Hobbyist,

I'm sure you've checked this but thought I'd mention it. In the corrupt file you posted, the bad entries are always the only fields with data on a line and always belong to the line above.

Can you post several examples of corruption?

edit: Are all standard files always the same format (fields and type of content)?

kylomas

Edited December 4, 2016 by kylomas

czardas · December 4, 2016

It is impossible to predict and fix all the types of corruption that might occur. The first task is to identify the rows. FileReadLine() cannot differentiate between line breaks within fields, so this function won't work by itself. This is exactly the kind of process that might have corrupted the file in the first place: that could easily produce an inconsistent number of items within each row.

If such a corrupt file is then passed to _CSVSplit(), the inconsistent row entries are ignored and an array is created anyway (this was perhaps a questionable design choice). When such an array is converted back to CSV, guess what - all fields end up separated by commas, including those that were missing from the original. The reverse process does not know that the array contains fields that were missing from the original CSV.

Now going back to the basic specs, it is stated that there should be a consistent number of fields in each row. If there is not, then there are two choices: either append the missing fields to each row or return an error. I decided to add the missing fields to account for one of the most typical human errors (a missing comma); or more importantly to comply with strict implementation as and according to the precise wording of the specs. Perhaps the author meant to say ''... records must have the same number of fields ...".

Quote

All records should have the same number of fields, in the same order.

Bear in mind that CSV is not currently a standardized format (not universally). _CSVSplit() was designed to work in conjunction with _ArrayToCSV() or any software that generates CSV files without corruption. It was not designed to catch all human errors.

One of the most likely causes for the corruption in this case; is that someone opened the file in a text editor, saw three double quotes in a row and deleted a few thinking this to be an error. After manual corruption, some automated process misinterpreted the number of rows and added commas to compensate for inconsistent number of fields. Then we get a file which looks very much like the one you have. The sequence of events might have been slightly different, but you get the general idea.

Edit: I should add a warning about inconsistency in row item count, but that wouldn't make any difference to the file in its current state, because all the fields are actually present. My conclusion is that the file was probably generated from an array, which in turn was created (perhaps using my UDF) from a previously corrupted CSV. I still think you corrupted the file accidentally without realising it: this might have something to do with modifications you made to my code at an earlier stage.

Edited December 4, 2016 by czardas

czardas · December 4, 2016

I knew I'd seen this file before. Here is the original:

"Status","Date","Description","Debit","Credit"
"Cleared","01/16/2016","CHINA STAR Fire            My Town AA 
","19.92",""
"Cleared","01/27/2016","THE HOME DEPOT 8029    Sometown  XX 
","9.09",""
"Cleared","02/05/2016","PAYMENT THANK YOU
","","25.00"

When I posted this, the forum software converted LF to CRLF (3 replacements producing lines 3, 5 & 7), but that shouldn't matter at all because all those characters appear between double quotes. ~~Still testing.~~ _CSVSplit() converts this content correctly: producing only 4 records (copy the contents and try for yourself).

Other than it being a bit strange, the original file source is not the problem. I do wonder what those line feed characters are actually doing there. Even so, consider how easily corruption occurred (just by posting the contents here).

Edit: After comparison with the corrupted file you posted, this appears to be the result of consistently automated corruption by a process not suitable for parsing CSV (100% certain) = human error.

Edited December 4, 2016 by czardas

mikell · December 4, 2016

Hobbyist,
You could remove the unwanted newlines (meaning 1 out of 2, i.e. the 2nd, 4th, 6th, etc) as asked in post #12

$f = FileRead("February2016_205MY LF.CSV")
$f = StringRegExpReplace($f, '(\N+\R\N+)\K\R', "")
Msgbox(0,"", $f)

So you get this

"Status","Date","Description","Debit","Credit"
"Cleared","1/16/2016","CHINA STARFIRE             SAINT Elsewhere XX ",,",""19.92"",",,,,
"Cleared","1/27/2016","THE HOME DEPOT 3028    SAINT Elsewhere  XX ",,",""9.09"",",,,,
"Cleared","2/5/2016","PAYMENT THANK YOU",,","",""25.00",,,,

But, what will you do then ?
Eventually, remove the error fields - the ones with one quote only inside - by adding this

$f = StringReplace($f, ',",', "")

so _CSVSplit can return now an almost correct result

czardas · December 4, 2016

8 minutes ago, mikell said:

Eventually, remove the error fields - the ones with one quote only inside

Very brilliant and very funny! :lmao:

What if you come across this ' ",",",",",",",","," '? :think:

Edited December 4, 2016 by czardas

mikell · December 4, 2016

21 minutes ago, czardas said:

What if you come across this ' ",",",",",",",","," '?

If so then I'll say to myself : this one is immedicable. Straight to the recycle bin

Seriously. Hobbyist's problem is clearly here : the file I got was corrupted, no way I can change this, so what can I do ?
Discussing about how the file was corrupted and how it should be is interesting but finally not really helpful IMHO
I only try - with a lot of warnings - to suggest some things which may work (maybe) in this particular case

czardas · December 4, 2016

2 minutes ago, mikell said:

I only try - with a lot of warnings - to suggest some things which may work (maybe) in this particular case

I know. We've given Hobbyist a few things to think about.

kylomas · December 4, 2016

Quote

It is impossible to predict and fix all the types of corruption that might occur.

I'm not so sure. We don't have enough info yet. So far we've seen one example of what the OP is calling a corrupted file. The reason I keep asking generic questions is because it is pre-mature to offer a solution.

Two possibilities seem likely;

1 - The file is generate from some data entry system

2 - The file is produced from a data source

While this seems obvious, it may also argue that there is some pattern to the corruption.

The OP needs to supply more details.

kylomas

mikell · December 4, 2016

2 minutes ago, kylomas said:

It is impossible to predict and fix all the types of corruption that might occur.

I agree, and this means that a generic solution is not possible, period.

3 minutes ago, kylomas said:

We don't have enough info yet (...) The OP needs to supply more details.

I also agree. Because a little chance may exist that all the corrupted files show exactly the same type of corruption

czardas · December 4, 2016

I agree with what you say: we need more information. Although the method of corruption does not appear to be resolvable by any general method. Modification to field content, which involves adjusting double quote positions, according to wrongly interpreted record separators is pretty much a one way corruption process as far as I can tell.

Edited December 4, 2016 by czardas

Sign In

CSV_Split & Listview Array

Recommended Posts

mikell

Top Posters In This Topic

Top Posters In This Topic

Popular Posts

kylomas

mikell

mikell

Hobbyist

Hobbyist

mikell

czardas

Hobbyist

kylomas

Hobbyist

Hobbyist

mikell

kylomas

czardas

czardas

mikell

czardas

mikell

czardas

kylomas

mikell

czardas

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Browse

AutoIt Resources

Release

Beta