mikell Posted November 27, 2016 Posted November 27, 2016 Hobbyist, The suggestion by czardas - find the reason why the file is corrupted and manage to get a correct file - is by far the very best way Anyway eventually you might want to try to "repair" the corrupted file. It's an extremely hazardous way. It assumes a lot of things : - the first line is the only reliable line - showing the correct fields - you know exactly what the corrupted file should be if not corrupted - to fit the first line - the next corrupted files you could get will be corrupted in exactly the same way - so the "repair" will also work - etc The code below is such a try using the posted file. There is absolutly no evidence that it would work with another file, it's nothing but a concept expandcollapse popup#include <Array.au3> $f = FileRead("February2016_205MY LF.CSV") $f = StringReplace($f, '""', '"') $f = StringReplace($f, ',",', ',"",') $f = StringRegExpReplace($f, '(\R",)', "") $f = StringRegExpReplace($f, ',(?=,|\R)', "") ;Msgbox(0,"", $f) $array = _CSVSplit($f) _ArrayDisplay($array," split") Func _CSVSplit($string) ; Parses csv string input and returns a one or two dimensional array ; Local Const $sMessage = "Select a single file .CSV file." ; Local $sFileOpenDialog = FileOpenDialog($sMessage, "" & "c:\", "Text & Commas(*.csv)", $FD_FILEMUSTEXIST) ; $statementfile = $sFileOpenDialog $sDelim = "," $csv = $string ;FileRead ($statementfile) If Not IsString($csv) Or $csv = "" Then Return SetError(1, 0, 0) ; Invalid string If Not IsString($sDelim) Or $sDelim = "" Then Return SetError(2, 0, 0) ; Invalid string $csv = StringRegExpReplace($csv, "[\r\n]+\z", "") ; [Line Added] Remove training breaks ConsoleWrite($csv) Local $iOverride = 63743, $asDelim[3] ; $asDelim => replacements for comma, new line and double quote For $i = 0 To 2 $asDelim[$i] = __GetSubstitute($csv, $iOverride) ; Choose a suitable substitution character If @error Then Return SetError(3, 0, 0) ; String contains too many unsuitable characters Next $iOverride = 0 Local $aArray = StringRegExp($csv, '\A[^"]+|("+[^"]+)|"+\z', 3) ; Split string using double quotes delim - largest match $csv = "" Local $iBound = UBound($aArray) For $i = 0 To $iBound -1 $iOverride += StringInStr($aArray[$i], '"', 0, -1) ; Increment by the number of adjacent double quotes per element If Mod ($iOverride +2, 2) = 0 Then ; Acts as an on/off switch $aArray[$i] = StringReplace($aArray[$i], $sDelim, $asDelim[0]) ; Replace comma delimeters $aArray[$i] = StringRegExpReplace($aArray[$i], "(\r\n)|[\r\n]", $asDelim[1]) ; Replace new line delimeters EndIf $aArray[$i] = StringReplace($aArray[$i], '""', $asDelim[2]) ; Replace double quote pairs $aArray[$i] = StringReplace($aArray[$i], '"', '') ; Delete enclosing double quotes - not paired $aArray[$i] = StringReplace($aArray[$i], $asDelim[2], '"') ; Reintroduce double quote pairs as single characters $csv &= $aArray[$i] ; Rebuild the string, which includes two different delimiters Next $iOverride = 0 $aArray = StringSplit($csv, $asDelim[1], 2) ; Split to get rows $iBound = UBound($aArray) Local $aCSV[$iBound][2], $aTemp For $i = 0 To $iBound -1 $aTemp = StringSplit($aArray[$i], $asDelim[0]) ; Split to get row items If Not @error Then If $aTemp[0] > $iOverride Then $iOverride = $aTemp[0] ReDim $aCSV[$iBound][$iOverride] ; Add columns to accomodate more items EndIf EndIf For $j = 1 To $aTemp[0] If StringLen($aTemp[$j]) Then If Not StringRegExp($aTemp[$j], '[^"]') Then ; Field only contains double quotes $aTemp[$j] = StringTrimLeft($aTemp[$j], 1) ; Delete enclosing double quote single char EndIf $aCSV[$i][$j -1] = $aTemp[$j] ; Populate each row EndIf Next Next If $iOverride > 1 Then Return $aCSV ; Multiple Columns Else For $i = 0 To $iBound -1 If StringLen($aArray[$i]) And (Not StringRegExp($aArray[$i], '[^"]')) Then ; Only contains double quotes $aArray[$i] = StringTrimLeft($aArray[$i], 1) ; Delete enclosing double quote single char EndIf Next Return $aArray ; Single column EndIf EndFunc ;==> _CSVSplit Func __GetSubstitute($string, ByRef $iCountdown) If $iCountdown < 57344 Then Return SetError(1, 0, "") ; Out of options Local $sTestChar For $i = $iCountdown To 57344 Step -1 $sTestChar = ChrW($i) $iCountdown -= 1 If Not StringInStr($string, $sTestChar) Then Return $sTestChar EndIf Next Return SetError(1, 0, "") ; Out of options EndFunc ;==> __GetSubstitute Â
Hobbyist Posted November 27, 2016 Author Posted November 27, 2016 Mikell & Czardas Thanks again for so much help. First, I receive the file as is and just save it to my computer. Â The chances of me exercising control over the sender to conform to some format is next to none. Â Thus my quest. After the several posts, I came to the same conclusion the only reliable line of format was the first line which indicated (to me) some column organization. Â So with my last post I thought the only conceiveable thing would be to "gut" all the CR as supplied as they are obviously the problem(to me) and then to reconstruct each line by using line one as the template and add my CR. Â In this case get to the 4 lines of readable data instead of the 7 misinformation lines. So with this new input I will take another look and definitely get back. Both of you have been helpful and enlightened me along the way. Hobbyist
Hobbyist Posted November 28, 2016 Author Posted November 28, 2016 Mikell I ran the script and it performs great.  Fully understand that works for this format and definitely gives me more knowledge for possible other "bad" formats. I went to the hlp files to learn about the StringRegExpReplace; found some but not all. Would you clarify something for me? StringRegExpReplace($f, '(\R",)', "") (\R",): I think I understand that \R is for all CR's Didn't understand or could find what the ", is for right after the \R and inside the (). StringRegExpReplace($f, ',(?=,|\R)', "") (?=,|\R): Think the ?=, matches started at the , for current position. \R is same as above or does the | change it? Couldn't find | So I'm part of the way there, but lacking completing understanding for future. Thanks again for your help Hobbyist
mikell Posted November 28, 2016 Posted November 28, 2016 Regular expressions are great tools but a little hard to learn Sorry, I didn't put comments in my previous code - I didn't even know if this code could be really useful for your needs So here it is, including comments. You can insert some msgboxes/Consolewrite's to have a better look on how it works Feel free to ask for more explanations if needed $f = FileRead("February2016_205MY LF.CSV") ; replace double double quotes by single double quotes $f = StringReplace($f, '""', '"') ; replace single double quotes by double double quotes when alone and enclosed in commas $f = StringReplace($f, ',",', ',"",') ; remove groups composed exactly of : newline + double quote + comma ; \R matches any newline sequence @crlf, @cr or @lf $f = StringRegExpReplace($f, '(\R",)', "") ; remove commas when followed by another comma OR by newline ; (?=...) means 'followed by , | means OR $f = StringRegExpReplace($f, ',(?=,|\R)', "") Msgbox(0,"", $f) Â
czardas Posted November 29, 2016 Posted November 29, 2016 It's highly risky and unsound: especially since you can't control the corruption. This patch will fix this one file and corrupt many others. @mikell knows this, but @Hobbyist you need to be careful. operator64  ArrayWorkshop
Hobbyist Posted November 29, 2016 Author Posted November 29, 2016 Yes. Â I can see the danger and with eyes wide open and you are very right about the being careful. Â When starting to get involved with CSV files, I very much thought they were all standardized. Wow, wake up call. So this one does help in this case and is a good experience builder - thanks to both of you. My second (newbie effort) mistake was thinking it possible to read a CSV into a preview screen(listview in this case) and move columns around to "my standard" sequence/template. So its wax on, wax off. Â Arghhhh. Hobbyist
kylomas Posted November 29, 2016 Posted November 29, 2016 (edited) Hobbyist, 4 hours ago, Hobbyist said: My second (newbie effort) mistake was thinking it possible to read a CSV into a preview screen(listview in this case) and move columns around to "my standard" sequence/template. You can move columns around like this... #include <GUIConstantsEx.au3> #include <GuiListBox.au3> #include <WindowsConstants.au3> #include <GuiListView.au3> Local $gui010 = GUICreate("Example", 420, 420) Local $lv0010 = GUICtrlCreateListView('', 05, 05, 410, 350, _ BitOR($LVS_SHOWSELALWAYS, $LVS_SINGLESEL), _ BitOR($LVS_EX_HEADERDRAGDROP, $LVS_EX_FULLROWSELECT, $LVS_EX_GRIDLINES)) Local $hlv = GUICtrlGetHandle($lv0010) _GUICtrlListView_AddColumn($hlv, "A", 100) _GUICtrlListView_AddColumn($hlv, "B", 100) _GUICtrlListView_AddColumn($hlv, "C", 100) _GUICtrlListView_AddColumn($hlv, "D", 100) For $i = 0 To 9 _GUICtrlListView_AddItem($hlv, $i & '-0') _GUICtrlListView_AddSubItem($hlv, $i, $i & '-1', 1) _GUICtrlListView_AddSubItem($hlv, $i, $i & '-2', 2) _GUICtrlListView_AddSubItem($hlv, $i, $i & '-3', 3) Next GUISetState(@SW_SHOW) Local $msg While 1 $msg = GUIGetMsg() Switch $msg Case $GUI_EVENT_CLOSE Exit EndSwitch WEnd if that helps at all. kylomas Edited November 29, 2016 by kylomas Forum Rules        Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill
Hobbyist Posted November 29, 2016 Author Posted November 29, 2016 kylomas Yes this is good and helps and i didn't have a particular problem with the column movement part of the GUI. My real problem came with the loaded data not having equal fields due to the CSV format. Of course I don't know this when receiving a file, so I wanted to preview it before running it in my other sript.  I have no control over the sender. In this case the data as received shows the following when loaded into a listview or array: "Status","Date","Description","Debit","Credit""Cleared","1/16/2016","CHINA STARFIRE       SAINT Elsewhere XX ",,",""19.92"",",,,,"Cleared","1/27/2016","THE HOME DEPOT 3028   SAINT Elsewhere  XX ",,",""9.09"",",,,,"Cleared","2/5/2016","PAYMENT THANK YOU",,","",""25.00",,,, My goal was to find such errors in the preview, get them to the proper position and then save it to a new correct file. Sort of  - isolate, move, repair, save process.  I am beginning to think this is not a possibility given the character of CSV formatting or lack thereof.  Hobbyist Â
Hobbyist Posted December 3, 2016 Author Posted December 3, 2016 Czardas & Mikell I have been continuing to look at CSV manipulations and have a question if you would help with. If I have a CSV file with multiple lines, as in the above example would it be misguiding to do the following as a simple process: Use FileReadLine. Read first line to establish the number of columns. FileReadLine the entire file, which if I understand the help file, will remove all carriage controls at the end of each line and also then concatenate each line so as to result in one continuous line.  Take resulting line and "insert" the carriage control after each nth set of fields(for columns). What are the pitfalls of doing it this way?  Thanks Hobbyist
mikell Posted December 3, 2016 Posted December 3, 2016 Hobbyist, "My real problem came with the loaded data not having equal fields" So the answer is in my post #21. My little code worked because I could compare the initial file with the probable expected result, and I had to guess what this result should be But as I warned and as czardas pointed out, it's very risky because such a repair would work for several files only if these files are corrupted in exactly the same way If you try your "split way" on the previous file, you will probably get a wrong result. In corrupted files the error fields can be in various places... in this case I'm not sure that a common and reliable method can be defined to treat them kylomas 1
kylomas Posted December 4, 2016 Posted December 4, 2016 (edited) Hobbyist, I'm sure you've checked this but thought I'd mention it. In the corrupt file you posted, the bad entries are always the only fields with data on a line and always belong to the line above. Can you post several examples of corruption? edit: Are all standard files always the same format (fields and type of content)? kylomas Edited December 4, 2016 by kylomas Forum Rules        Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill
czardas Posted December 4, 2016 Posted December 4, 2016 (edited) It is impossible to predict and fix all the types of corruption that might occur. The first task is to identify the rows. FileReadLine() cannot differentiate between line breaks within fields, so this function won't work by itself. This is exactly the kind of process that might have corrupted the file in the first place: that could easily produce an inconsistent number of items within each row. If such a corrupt file is then passed to _CSVSplit(), the inconsistent row entries are ignored and an array is created anyway (this was perhaps a questionable design choice). When such an array is converted back to CSV, guess what - all fields end up separated by commas, including those that were missing from the original. The reverse process does not know that the array contains fields that were missing from the original CSV. Now going back to the basic specs, it is stated that there should be a consistent number of fields in each row. If there is not, then there are two choices: either append the missing fields to each row or return an error. I decided to add the missing fields to account for one of the most typical human errors (a missing comma); or more importantly to comply with strict implementation as and according to the precise wording of the specs. Perhaps the author meant to say ''... records must have the same number of fields ...". Quote All records should have the same number of fields, in the same order. Bear in mind that CSV is not currently a standardized format (not universally). _CSVSplit() was designed to work in conjunction with _ArrayToCSV() or any software that generates CSV files without corruption. It was not designed to catch all human errors. One of the most likely causes for the corruption in this case; is that someone opened the file in a text editor, saw three double quotes in a row and deleted a few thinking this to be an error. After manual corruption, some automated process misinterpreted the number of rows and added commas to compensate for inconsistent number of fields. Then we get a file which looks very much like the one you have. The sequence of events might have been slightly different, but you get the general idea. Edit: I should add a warning about inconsistency in row item count, but that wouldn't make any difference to the file in its current state, because all the fields are actually present. My conclusion is that the file was probably generated from an array, which in turn was created (perhaps using my UDF) from a previously corrupted CSV. I still think you corrupted the file accidentally without realising it: this might have something to do with modifications you made to my code at an earlier stage. Edited December 4, 2016 by czardas operator64  ArrayWorkshop
czardas Posted December 4, 2016 Posted December 4, 2016 (edited) I knew I'd seen this file before. Here is the original: "Status","Date","Description","Debit","Credit" "Cleared","01/16/2016","CHINA STAR Fire My Town AA ","19.92","" "Cleared","01/27/2016","THE HOME DEPOT 8029 Sometown XX ","9.09","" "Cleared","02/05/2016","PAYMENT THANK YOU ","","25.00" When I posted this, the forum software converted LF to CRLF (3 replacements producing lines 3, 5 & 7), but that shouldn't matter at all because all those characters appear between double quotes. Still testing. _CSVSplit() converts this content correctly: producing only 4 records (copy the contents and try for yourself). Other than it being a bit strange, the original file source is not the problem. I do wonder what those line feed characters are actually doing there. Even so, consider how easily corruption occurred (just by posting the contents here). Edit: After comparison with the corrupted file you posted, this appears to be the result of consistently automated corruption by a process not suitable for parsing CSV (100% certain) = human error. Edited December 4, 2016 by czardas operator64  ArrayWorkshop
mikell Posted December 4, 2016 Posted December 4, 2016 Hobbyist, You could remove the unwanted newlines (meaning 1 out of 2, i.e. the 2nd, 4th, 6th, etc) as asked in post #12 $f = FileRead("February2016_205MY LF.CSV") $f = StringRegExpReplace($f, '(\N+\R\N+)\K\R', "") Msgbox(0,"", $f) So you get this "Status","Date","Description","Debit","Credit" "Cleared","1/16/2016","CHINA STARFIRE SAINT Elsewhere XX ",,",""19.92"",",,,, "Cleared","1/27/2016","THE HOME DEPOT 3028 SAINT Elsewhere XX ",,",""9.09"",",,,, "Cleared","2/5/2016","PAYMENT THANK YOU",,","",""25.00",,,, But, what will you do then ? Eventually, remove the error fields - the ones with one quote only inside - by adding this $f = StringReplace($f, ',",', "") so _CSVSplit can return now an almost correct resultÂ
czardas Posted December 4, 2016 Posted December 4, 2016 (edited) 8 minutes ago, mikell said: Eventually, remove the error fields - the ones with one quote only inside Very brilliant and very funny! What if you come across this ' ",",",",",",",","," '? Edited December 4, 2016 by czardas operator64  ArrayWorkshop
mikell Posted December 4, 2016 Posted December 4, 2016 21 minutes ago, czardas said: What if you come across this ' ",",",",",",",","," '? If so then I'll say to myself : this one is immedicable. Straight to the recycle bin  Seriously. Hobbyist's problem is clearly here : the file I got was corrupted, no way I can change this, so what can I do ? Discussing about how the file was corrupted and how it should be is interesting but finally not really helpful IMHO I only try - with a lot of warnings - to suggest some things which may work (maybe) in this particular case czardas 1
czardas Posted December 4, 2016 Posted December 4, 2016 2 minutes ago, mikell said: I only try - with a lot of warnings - to suggest some things which may work (maybe) in this particular case I know. We've given Hobbyist a few things to think about. operator64  ArrayWorkshop
kylomas Posted December 4, 2016 Posted December 4, 2016 Quote It is impossible to predict and fix all the types of corruption that might occur. I'm not so sure. We don't have enough info yet. So far we've seen one example of what the OP is calling a corrupted file. The reason I keep asking generic questions is because it is pre-mature to offer a solution. Two possibilities seem likely; 1 - The file is generate from some data entry system 2 - The file is produced from a data source While this seems obvious, it may also argue that there is some pattern to the corruption. The OP needs to supply more details. kylomas Forum Rules        Procedure for posting code "I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill
mikell Posted December 4, 2016 Posted December 4, 2016 2 minutes ago, kylomas said: It is impossible to predict and fix all the types of corruption that might occur. I agree, and this means that a generic solution is not possible, period.  3 minutes ago, kylomas said: We don't have enough info yet (...) The OP needs to supply more details. I also agree. Because a little chance may exist that all the corrupted files show exactly the same type of corruption
czardas Posted December 4, 2016 Posted December 4, 2016 (edited) I agree with what you say: we need more information. Although the method of corruption does not appear to be resolvable by any general method. Modification to field content, which involves adjusting double quote positions, according to wrongly interpreted record separators is pretty much a one way corruption process as far as I can tell. Edited December 4, 2016 by czardas operator64  ArrayWorkshop
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now