
Recommended Posts

Posted
1 hour ago, czardas said:

We've given Hobbyist a few things to think about.

 

1 hour ago, czardas said:

we need more information

Let's now patiently hope for feedback   :)

Posted

Good Morning!

Thank you for your continued expertise. I know you are way out of my league since I am new to this, so everything you offer is of huge value to my learning.

Ok the "file" - the corrected file, as I posted, is generated by - I really don't know, but suspect machine.  I say this because of what i see: (I downloaded the file and did NOTHING to it, it is as follows)

"Status","Date","Description","Debit","Credit"
"Cleared","1/16/2016","CHINA STARFIRE             SAINT Elsewhere XX ",,
",""19.92"",",,,,
"Cleared","1/27/2016","THE HOME DEPOT 3028    SAINT Elsewhere  XX ",,
",""9.09"",",,,,
"Cleared","2/5/2016","PAYMENT THANK YOU",,
","",""25.00",,,,

In this case each line ends with a @CRLF, which I think is an error - an error in manual input or in the machine (coding). Either way it causes a new line when it should NOT. I guess this fits your definition of corrupt. Given that it is consistent through the file and involves a misplaced @CRLF, I thought it could easily be caught and rectified to match the first line indicating the fields expected (with data or blanks). In my newbie thoughts, corrupt would be sporadic characters of no use to the data set and with no uniformity. I have learned here that is not the industry definition.
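Since the split here is consistent (each record lands on exactly two physical lines), a blunt pairwise join can serve as a first repair. A minimal sketch of that idea - the file name is a placeholder, and it assumes every record really is split exactly once:

#include <StringConstants.au3>

; Join every pair of physical data lines back into one record.
; Assumes a header on line 1 and each record split across exactly two lines.
Local $aLines = StringSplit(FileRead("February2016.csv"), @CRLF, $STR_ENTIRESPLIT + $STR_NOCOUNT)
Local $sFixed = $aLines[0] ; the header line stays as it is
For $i = 1 To UBound($aLines) - 1
    ; odd lines start a new record, even lines are the spilled-over remainder
    $sFixed &= (Mod($i, 2) = 1 ? @CRLF : "") & $aLines[$i]
Next

If the corruption ever splits a record more than once, a dumb pairwise join like this silently produces garbage, so it only fits files that follow this exact pattern.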

So my thoughts were: a misplaced @CRLF is perhaps most probable, as well as columns out of order. The out-of-order possibility seems correctable (as I attempted in the original posted script), but I had not anticipated the out-of-place @CRLF. I was able to see this by opening the file in SciTE and using the "end of line" view, as suggested in one of the responses.

I thought using czardas' _CSVSplit was my answer. I misunderstood. _CSVSplit would still leave the misplaced @CRLF in place and I would end up with results like the above (_CSVSplit does what it's supposed to).

I understand a "one off" fix doesn't necessarily fit "all" files after reading your responses.  But I think even out place columns seems more correctable than out of place rows. Thus if I plan on 5 columns and get something more or less, it seems manageable to pick out the fields(in the case of more columns) or add columns in the case of less. Data flowing from one line to the next seems the biggest headache in how to catch it.

So the plan is: download a file, do nothing to it, run it through my main script and get results. IF it blows up, find out why (thus my posted script), massage the data, save and rerun. Is it too naive to say carriage control is a bigger concern than the number of columns provided versus the number expected? I'm trying to anticipate both.
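For the column side, a rough sketch of the reordering idea (not the posted script's actual code; _ReorderColumns and the template are made-up names for illustration):

; Hypothetical helper: rearrange a 2D array's columns to match an
; expected header template (row 0 must hold the column labels).
Func _ReorderColumns(ByRef $aData, $aExpected)
    Local $aOut[UBound($aData)][UBound($aExpected)], $iSrc
    For $c = 0 To UBound($aExpected) - 1
        $iSrc = -1
        For $j = 0 To UBound($aData, 2) - 1 ; locate the column by its label
            If $aData[0][$j] = $aExpected[$c] Then $iSrc = $j
        Next
        If $iSrc = -1 Then Return SetError(1, $c, 0) ; an expected column is missing
        For $r = 0 To UBound($aData) - 1
            $aOut[$r][$c] = $aData[$r][$iSrc]
        Next
    Next
    $aData = $aOut
    Return 1
EndFunc

; Example template matching the sample header
Local $aWanted[5] = ["Status", "Date", "Description", "Debit", "Credit"]

Extra columns are simply never copied, and a missing one sets @error instead of guessing - which matches the idea that misplaced columns are recoverable while misplaced rows are not.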

Thanks again for so much support and advice.

Hobbyist

Posted

IMHO corruption caused both problems.
All you can do then is try tricks... The simplest and easiest-to-manage code I could build is the one below. Hoping it can help  :sweating:

#include <Array.au3>
#include "CSV.au3" ; czardas' CSV UDF, which provides _CSVSplit (assumed file name)

; read the nasty file
$f = FileRead("February2016_205MY LF.CSV")
; first step: remove ALL quotes
$f = StringReplace($f, '"', "")
; 2nd, remove the unwanted @CRLF on even lines -
; (\N+\R\N+)\K\R keeps the first line break of each pair and deletes the second
$f = StringRegExpReplace($f, '(\N+\R\N+)\K\R', "")
; 3rd, remove unwanted empty fields by firing double commas ONLY IF followed by another comma or EOL
$f = StringRegExpReplace($f, ',,(?=,|\R)', "")
; and you're done
$array = _CSVSplit($f)
_ArrayDisplay($array, " split")

 

Posted

kylomas, czardas, mikell

Thank you so much for your help, insights and expertise.

I'm going to delve back into this from the start - that is, at file download time - and go from there.

Actually I'm hoping that it is MY mistake in downloading, although that is a fairly basic task.

I might be back.  If not, Happy New Year. You three rock.

Hobbyist

Posted

kylomas, czardas, mikell

I have gone through a sample batch of files looking at the data supplied and found the following:

The file in post #10, which was not changed or altered by me, had CR LF at the end of each line as viewed in the SciTE editor.

Out of 10 files there were 2 that fit the above format. Both performed the same in the script - "badly".

The other 8 had CR LF at the end of line 1 (the column labels). Line 2 (the first row of data) ended with LF, row 3 with CR LF, row 4 with LF, and so on, alternating all the way through. Those 8 ran successfully in the script; the 2 bad ones did not. None of the 10 would open properly in a spreadsheet - the data was split across alternating rows.

Eliminating the LF allowed the files to open in a spreadsheet with the data in the expected rows. So in the script design in my first post I was focused on moving columns around, thinking misplaced columns would be easily rectified as long as the minimum required columns were part of the file, as can be seen in the ListView. I honestly never gave a thought to the possibility that rows would be a problem, but after this experience I see it has to be reckoned with and somehow caught before it blows things up.
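That "eliminating the LF" step is easy to script. A sketch of one way to do it - the file names are placeholders - which deletes any bare @LF that is not part of a @CRLF pair, joining the spilled half-rows back together:

#include <FileConstants.au3>

; Remove lone LF line endings while leaving genuine CRLF pairs intact.
Local $sRaw = FileRead("February2016.CSV") ; placeholder name
Local $sFixed = StringRegExpReplace($sRaw, '(?<!\r)\n', "")
Local $hOut = FileOpen("February2016_fixed.CSV", $FO_OVERWRITE)
FileWrite($hOut, $sFixed)
FileClose($hOut)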

So, running through the script works when the format is NOT CR LF at the end of each line. Opening in a spreadsheet becomes the secondary issue, but the cause of the misplaced rows is now, I guess, known - the @LF at the end of rows. Being able to use a spreadsheet matters for the following reason: I want to concatenate multiple files (all having the right columns) into one big file at some point. I did this successfully when the files were in the right format, but when the row issue came up the results were not pretty.
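For that concatenation step, a sketch of one way to do it once the files are clean - the folder and output names are examples, and it assumes every file shares the same header and ends with a line break:

#include <File.au3>

; Append every CSV in a folder into one file, keeping only the first header.
Local $aFiles = _FileListToArray(@ScriptDir & "\GOODFILE", "*.CSV", $FLTA_FILES, True)
Local $sAll = "", $sText
For $i = 1 To $aFiles[0]
    $sText = FileRead($aFiles[$i])
    ; drop the repeated header line from every file after the first
    If $i > 1 Then $sText = StringTrimLeft($sText, StringInStr($sText, @CRLF) + 1)
    $sAll &= $sText
Next
Local $hOut = FileOpen(@ScriptDir & "\combined.CSV", $FO_OVERWRITE)
FileWrite($hOut, $sAll)
FileClose($hOut)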

That's what I know and have found.

Hobbyist

Posted

@mikell

If I have done this right, the zip file is attached with the 8 good files I described above. These should run in the script I originally posted. But if you attempt to open them in a spreadsheet, you should get broken rows, which is useless if one is attempting to do spreadsheet work as well. My last post mentioned the LF culprit.

Ideally I should attempt to make "any" file run both in the script I posted and in a spreadsheet. The posted script is mainly there to move columns into the correct order if they don't match the template in the lower ListView. As I said, I thought that was the main concern to be aware of (boy, was I wrong). So if I cannot get to a point where a file works in both, I guess I could put a button on the GUI to prep the file for spreadsheet purposes.

The original corrected file runs in my posted script, but with broken rows. Upon opening it in the editor, the CR LF is at the end of every line. That is the difference. So does it make sense to first look at every EOL and determine what is there before proceeding with ANY script? In this case I don't think extra commas or anything else were the problem, just the EOLs. So perhaps it makes sense for me to both run any file through czardas' _CSVSplit AND do the EOL check?
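That EOL check can be a quick census before any parsing. A small sketch - the file name is a placeholder - counting @CRLF pairs, lone @LF and lone @CR:

; Count the kinds of line endings in a file before deciding how to treat it.
Local $sRaw = FileRead("anyfile.CSV") ; placeholder name
Local $aCrLf = StringRegExp($sRaw, '\r\n', 3) ; flag 3 = return all matches
Local $aLoneLf = StringRegExp($sRaw, '(?<!\r)\n', 3)
Local $aLoneCr = StringRegExp($sRaw, '\r(?!\n)', 3)
ConsoleWrite("CRLF: " & (IsArray($aCrLf) ? UBound($aCrLf) : 0) & _
        "  lone LF: " & (IsArray($aLoneLf) ? UBound($aLoneLf) : 0) & _
        "  lone CR: " & (IsArray($aLoneCr) ? UBound($aLoneCr) : 0) & @CRLF)

A file reporting only CRLF (like the two bad ones) or a roughly even CRLF/LF mix (like the 8 good ones) can then be routed to the matching repair.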

Thanks so much for your help.

Hobbyist

 

 

GOODFILE.zip

Posted (edited)

The main problem is that any number of line breaks can appear in any column within any row, any number of times. So unless all corrupt files follow the exact same format, reconstruction is theoretically impossible. When there are multiple line breaks within each row, there is no way to tell if subsequent line breaks are within the same field or not. If trailing commas were not added during corruption, then reconstruction would be a bit easier.

Edited by czardas
Posted (edited)

Hobbyist,
This works with the 8 posted files and the previously posted corrupted one. Why didn't you post all 10 files?

#include <Array.au3>
#include <File.au3>
#include "CSV.au3" ; czardas' CSV UDF, which provides _CSVSplit (assumed file name)

$a = _FileListToArray(@ScriptDir & "\GOODFILE", "*.CSV", $FLTA_FILES, True)
_ArrayDisplay($a, "files")

For $i = 1 To $a[0]
    $f = FileRead($a[$i])
    $f = StringReplace($f, '"', "") ; remove ALL quotes
    $f = StringRegExpReplace($f, '(\N+\R\N+)\K\R', "") ; drop every second line break
    $f = StringRegExpReplace($f, ',,(?=,|\R)', "") ; drop empty fields before a comma or EOL
    $array = _CSVSplit($f)
    _ArrayDisplay($array, $i & " - " & $a[$i])
Next

 

Edited by mikell
Posted

@mikell

Thanks, I will give it a try.

Oops, I got so focused on the 8 that I didn't do the 9th one. The one in the original post (submitted a second time) would have been the 10th. Anyway, the 9th and 10th are formatted just like the original one, so posting them would have been repetitive.

Again, thanks. You, kylomas, and czardas have greatly helped me, AND I see I still have much to learn (and experience) with CSV files. All forewarnings about them are duly noted.

Hobbyist
