revonatu Posted April 14, 2015

Hi, I have a big CSV file with 1,864,653 lines and the following variables:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m), ...
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397, ...
...

I want to find the line whose UTC Time is closest to a given one (stored in a variable) in the same format (hh:mm:ss.sss). Once that line is found, I want to extract the Lat, Lon and Altitude values. How can I do this while avoiding reading the whole file, which is very large and would cost a lot of processing time? Thanks for hints and ideas.
UEZ Posted April 14, 2015

You can read the file line by line and compare the time. If the time is near or equal to the searched time, you can save that line. You have to define what "near" means. Bear in mind that reading a file with 1,864,653 lines will take some time. By the way, what is the file size of this CSV?
argumentum Posted April 14, 2015

How can I do this and at the same time avoid reading the whole file, which is too big and would cost much processing time? Thanks for hints and ideas.

Well, I don't know how to guess a file position for data that hasn't been found yet. You'll have to read the file somehow, a line at a time or in chunks of x bytes, but how can you find it without searching? Sorry about the file size. Why don't you split the file by date, if that is the search criterion? That would make subsequent searches faster.
Zedna Posted April 14, 2015 (edited)

You could try SQLite with an in-memory database and test the speed of that approach.

Edited April 14, 2015 by Zedna
jchd Posted April 14, 2015

If this is a routine operation, I strongly suggest SQLite, as Zedna just said. You didn't give enough information, so this is not a ready-to-use road map, but you can try the following.

Download the latest sqlite3.exe utility, e.g. from the official SQLite site http://www.sqlite.org/download.html (select the precompiled command-line shell for Windows).

I tried with the following dummy file, which I named testimport.csv:

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),Other data
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:37:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2015,01:38:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2013,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/01/2001,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/02/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/03/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
28/04/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl
26/01/2014,01:36:44.010,0.000,-36.78730154,158.63387961,19.397,abc def ghi jkl

using this in a command prompt:

C:\Users\jc\Documents\AutoMAT\tmp>sqlite3 data.sq3
SQLite version 3.8.8.3 2015-02-25 13:29:11
Enter ".help" for usage hints.
sqlite> .mode csv
sqlite> .import testimport.csv
Usage: .import FILE TABLE
sqlite> .import testimport.csv TestData
sqlite> .quit

Now the file data.sq3 is created with the data in table TestData. Download SQLite Expert Personal Edition (free) here http://www.sqliteexpert.com/, open database data.sq3 and hit the Data tab: you can see the sample data.
Because the Date and Time columns are separate and not in a standard format, I advise you to issue the following commands in a new SQL tab:

alter table testdata add column Timestamp;

update testdata set timestamp = substr("Date (UTC)", 7, 4) || '-' || substr("Date (UTC)", 4, 2) || '-' || substr("Date (UTC)", 1, 2) || ' ' || "Time (UTC)";

CREATE INDEX idxTS ON TestData (Timestamp);

Now you're ready in minutes to query the database efficiently. If you have a time window like '2014-01-27 10:28:00' +/- 23 hours, you can get the result by querying:

select timestamp, "poslat (deg)" Lat, "poslon (deg)" Lon, "posalt (m)" Alt from testdata where timestamp between datetime('2014-01-27 10:28:00', '-23 hours') and datetime('2014-01-27 10:28:00', '+23 hours');

giving you:

Timestamp                  Lat           Lon           Alt
2014-01-28 01:36:44.010    -36.78730154  158.63387961  19.397

(I was too lazy to change the coordinates!) If you actually want the row where the timestamp is closest to a given point, the query is a bit more involved, but it's definitely possible. There are MANY optimizations left open here; this is just a sketch.
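The "closest timestamp" query mentioned above can be done by ordering on the absolute time difference. A minimal sketch using Python's sqlite3 module for illustration (the thread's language is AutoIt, and the table contents here are dummy data; only the Timestamp idea follows the post):

```python
import sqlite3

# Build a tiny in-memory table shaped like the TestData table above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TestData (Timestamp TEXT, Lat REAL, Lon REAL, Alt REAL)")
rows = [
    ("2015-01-28 01:36:44.010", -36.78730154, 158.63387961, 19.397),
    ("2015-01-28 01:37:44.010", -36.78730154, 158.63387961, 20.101),
    ("2015-01-28 01:38:44.010", -36.78730154, 158.63387961, 21.555),
]
con.executemany("INSERT INTO TestData VALUES (?, ?, ?, ?)", rows)

def closest(con, target):
    # julianday() turns the ISO timestamp into a fractional day number,
    # so abs(difference) orders rows by their distance from the target.
    return con.execute(
        """SELECT Timestamp, Lat, Lon, Alt FROM TestData
           ORDER BY abs(julianday(Timestamp) - julianday(?)) LIMIT 1""",
        (target,),
    ).fetchone()

print(closest(con, "2015-01-28 01:37:50.000"))
```

Note that ordering on an expression scans every row; on a large table it is faster to take the last row before the target and the first row after it (two indexed queries) and compare just those two candidates.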
Spandex Posted April 14, 2015

A few basic questions:

1. Is this information read-only, or read-mostly? If so, it may be reasonable to pre-process the data into a more easily searchable format whenever the file is updated.
2. Are the CSV records in chronological order (that is, monotonic with respect to UTC)? If so, you can perform a binary search on the UTC fields.

If the answer to #1 is Yes, then the pre-processing can include sorting the records (or producing a separate sorted index file). If the answer to #1 is No but #2 is Yes, you can perform a binary search based on file offset. Each iteration of the search will likely land in the middle of a record, so some logic to sync up to the next end-of-line will be needed. Hope this helps.
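The offset-based binary search described above can be sketched as follows; this is illustrative Python rather than AutoIt, and it assumes the records are sorted by time and the header row has been stripped. After each seek it discards the partial line to sync to the next record, exactly as described:

```python
import io

def find_closest(f, target, size):
    """Binary-search a seekable text file of time-sorted CSV records.
    target is 'hh:mm:ss.sss'; size is the file size in bytes/chars."""
    def to_sec(t):
        h, m, s = t.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    def record_at(pos):
        f.seek(pos)
        if pos:
            f.readline()          # discard partial line, sync to next record
        return f.readline()

    lo, hi, best = 0, size, None
    goal = to_sec(target)
    while lo < hi:
        mid = (lo + hi) // 2
        line = record_at(mid)
        if not line:              # seeked past the last record
            hi = mid
            continue
        t = to_sec(line.split(",")[1])
        if best is None or abs(t - goal) < abs(best[0] - goal):
            best = (t, line)      # remember the nearest record seen so far
        if t < goal:
            lo = mid + 1
        else:
            hi = mid
    return best[1] if best else None

# Tiny demo on an in-memory "file" (no header row).
data = ("28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397\n"
        "28/01/2015,01:36:45.010,0.004,-36.78730154,158.63387961,19.397\n"
        "28/01/2015,01:36:46.010,0.008,-36.78730154,158.63387961,19.397\n")
hit = find_closest(io.StringIO(data), "01:36:45.200", len(data))
print(hit.strip())
```

Because the search converges to byte precision around the boundary, both neighbours of the target time get examined, so tracking the nearest record seen is enough to return the closest one.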
revonatu Posted April 15, 2015 (edited)

Thanks for the suggestions so far. I had hoped it would be easier. Here are some more details on my data: the 167 MB table contains the coordinates of a whole flight day. Within this day I need the mean value of a single flight line. I calculated the mean time from the start and end times of that stripe (saved in another file), and now I need the coordinates at that time. The data are read-only.

Date (UTC),Time (UTC),TimeFromStart (s),PosLat (deg),PosLon (deg),PosAlt (m),
28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.014,0.004,-36.78730154,158.63387961,19.397
28/01/2015,01:36:44.018,0.008,-36.78730154,158.63387961,19.397
...

As you can see, the time increases slightly and continuously, and the date remains the same. As an example, the given time would be 03:03:23.071. Since this task is only one preparatory piece of the puzzle within a long AutoIt process, I would prefer to solve it within AutoIt.

Edited April 15, 2015 by revonatu
iamtheky Posted April 15, 2015 (edited)

So, roughly 25 entries per second? You could read the first line of the file, see how many seconds it is away from your desired time, and multiply that value by 25; that is roughly where your target is. Grab the 50 entries before that point and the 50 entries after, then find your time in that small chunk of 100. You could probably expand that window to many thousands of entries and still finish in under a minute.

Edited April 15, 2015 by boththose
revonatu Posted April 15, 2015

@boththose Nice idea! But to select the 50 lines before and after the desired time, I would need to know their line numbers, and then we come full circle. Using AutoIt: so far I have opened the CSV file in Notepad, used Find (Ctrl+F) on the desired time, and then selected and copied that line. Now I try to pick my variables out of this line via StringRegExp. Not elegant, but at least it more or less works and doesn't take too long. If someone has more elegant ideas, please share them with me.
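For pulling the values out of a single line in this format, a regular expression is not strictly needed, since the line splits cleanly on commas. A sketch in Python for illustration (the same approach works with AutoIt's StringSplit; the field order follows the header in the first post):

```python
line = "28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397"

# Field order from the header: Date, Time, TimeFromStart, PosLat, PosLon, PosAlt
fields = line.split(",")
lat, lon, alt = float(fields[3]), float(fields[4]), float(fields[5])
print(lat, lon, alt)
```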
iamtheky Posted April 15, 2015 (edited)

You do know the line number, or at least a rough guess: it is 25 x $n, where $n is the number of seconds between the time on line 1 and your target time. With that result in $x, just read lines $x - 50 to $x + 50 and search for your target. And if you have to expand that window because some seconds have more or fewer entries, I don't think you will see an unbearable performance hit until you get into the thousands.

Edited April 15, 2015 by boththose
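The estimate-then-scan idea above can be sketched like this; Python for illustration rather than AutoIt, with the per-second rate and the window size as parameters so they can be adjusted to the real file:

```python
def to_seconds(t):
    """Convert 'hh:mm:ss.sss' to seconds since midnight."""
    h, m, s = t.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def closest_in_window(lines, target, rate=25, window=50):
    """Jump to the estimated line number, then search a small window there."""
    first_time = to_seconds(lines[0].split(",")[1])
    goal = to_seconds(target)
    guess = int((goal - first_time) * rate)        # estimated line number
    lo = max(0, guess - window)
    hi = min(len(lines), guess + window)
    # Pick the line whose time is nearest the target inside the window.
    return min(lines[lo:hi],
               key=lambda ln: abs(to_seconds(ln.split(",")[1]) - goal))

# Demo: six records one second apart, one record per second (rate=1).
demo_lines = ["28/01/2015,01:36:%02d.000,0.0,-36.8,158.6,19.4" % s
              for s in range(44, 50)]
hit = closest_in_window(demo_lines, "01:36:46.200", rate=1, window=3)
print(hit)
```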
revonatu Posted April 15, 2015 (edited)

@boththose By the way, it has to be 250 x $n, as one second consists of about 250 lines. So far so good, but to get the difference between the two times I thought I would have to reformat them to hh-mm-ss, which costs time again; otherwise only the digits in front of the first ':' are taken into account in the subtraction, which doesn't help. Oh, I was wrong: no need to reformat. Fine. Next question: I only know and have found the functions FileRead and FileReadLine, and neither works for several lines in the middle of a file. How do I read my $X +/- 1000 lines?

Edited April 15, 2015 by revonatu
kylomas Posted April 15, 2015

revonatu, let me suggest that you revisit jchd's comments in post #5. What he has outlined is an example of what SQLite can do, AND it can all be done inline in your program relatively simply. If this data is going to be read once, then the use of SQLite might be debatable. However, if you are going to access this data multiple times, then SQLite is the only storage mechanism that makes sense.

kylomas
revonatu Posted April 15, 2015

I only need to read it once in this AutoIt session.
kylomas Posted April 15, 2015

I see. If you can read the file into a string and you know the date/time you are looking for, then an SRE solution might be appropriate. (I know, a lot of "if"s.)
kylomas Posted April 15, 2015

revonatu, just for grins I put together a test using an SRE. The commented-out code generates a test file according to the sample data that you posted, with each log entry incremented by 1 millisecond. I generated 1.8 million entries (114.25 MB). The code at the end uses a regexp to find the last entry in the test file. This is the run log:

>"C:\Program Files (x86)\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.exe" /run /prod /ErrorStdOut /in "C:\Users\ADMIN010\Documents\help large csv.au3" /UserParams
+>02:54:49 Starting AutoIt3Wrapper v.14.801.2025.0 SciTE v.3.4.4.0 Keyboard:00000409 OS:WIN_7/Service Pack 1 CPU:X64 OS:X64 Environment(Language:0409)
+> SciTEDir => C:\Program Files (x86)\AutoIt3\SciTE UserDir => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE\AutoIt3Wrapper SCITE_USERHOME => C:\Users\ADMIN010\AppData\Local\AutoIt v3\SciTE
>Running AU3Check (3.3.13.19) from:C:\Program Files (x86)\AutoIt3 input:C:\Users\ADMIN010\Documents\help large csv.au3
+>02:54:49 AU3Check ended.rc:0
>Running:(3.3.12.0):C:\Program Files (x86)\AutoIt3\autoit3.exe "C:\Users\ADMIN010\Documents\help large csv.au3"
--> Press Ctrl+Alt+Break to Restart or Ctrl+Break to Stop
Time to find entry = 4.77974345543552
Entry = 15/04/2015, 03:15:02.000,0.000,-36.78730154,158.63387961,19.397
+>02:54:54 AutoIt3.exe ended.rc:0
+>02:54:54 AutoIt3Wrapper Finished.
>Exit code: 0 Time: 5.505

The time is in seconds. The code I used:

; 28/01/2015,01:36:44.010,0.000,-36.78730154,158.63387961,19.397
#cs - code used to generate test file of incrementing milliseconds
#include <array.au3>
#include <date.au3>
#include <file.au3>
local $st = timerinit()
local $a20[1800001]
local $MS = 000, $tail = '0.000,-36.78730154,158.63387961,19.397', $dte = _NowCalc()
for $1 = 0 to 1800000
    if mod($1,1000) = 0 then _add_second()
    $a20[$1] = stringregexpreplace($dte,'(\d\d\d\d)/(\d\d)/(\d\d)(.*)','$3/$2/$1,$4') & '.' & stringformat('%03i',$MS) & ',' & $tail
    $MS += 1
Next
ConsoleWrite('Test array generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)
$st = timerinit()
_FileWriteFromArray(@scriptdir & '\testcsv.csv',$a20)
ConsoleWrite('Test file generated in ' & timerdiff($st)/1000 & ' seconds' & @CRLF)
func _add_second()
    $MS = 000
    $dte = _dateadd('s',1,$dte)
endfunc
#ce

local $TimeToFind = '15/04/2015, 03:15:02.000'
local $st = timerinit()
local $aRET = stringregexp(fileread(@scriptdir & '\testcsv.csv'),$TimeToFind & '.*',3)
ConsoleWrite('Time to find entry = ' & timerdiff($st)/1000 & @LF & 'Entry = ' & $aRET[0] & @CRLF)

kylomas
Zedna Posted April 15, 2015 (edited)

With such a big CSV file (167 MB) I recommend reading only part of the file using FileOpen + FileSetPos + FileRead, as in this example (the example reads characters from a given position in the file). In your case, determine the approximate position of your time in the file based on the approximate length of each line and the number of expected lines. Then read a few KB before and after that point into a variable. After that you can do a simple StringSplit() by @CRLF and go through each line in a For/Next loop; because there will be only a "few" lines, speed will no longer be a big problem. So the idea is to avoid reading (and parsing) the WHOLE file into a variable.

EDIT: here is an example of my idea:

MsgBox(0, '10KB from position 3MB', _FileReadMyPos('test.csv', 3000000))

Func _FileReadMyPos($FileName, $pos, $count = 10000) ; read 10 KB by default
    $hFile = FileOpen($FileName, 0) ; open for reading
    FileSetPos($hFile, $pos, 0) ; offset from beginning of file
    $sOut = FileRead($hFile, $count)
    FileClose($hFile)
    Return $sOut
EndFunc

Edited April 15, 2015 by Zedna
Zedna Posted April 15, 2015 (edited)

So you can read a few lines from the beginning of the file and a few lines from the end to analyze the starting/ending date/time and the average line length. After this you can read (and parse) a small part of the file from the desired position, as I wrote in my previous post.

Edited April 15, 2015 by Zedna
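The two posts above combine into one recipe: read the first and last records, interpolate a byte offset for the target time, then read a small chunk there. A sketch in Python for illustration (assuming sorted times, no header row, and ASCII data; the 10 KB chunk size is a guess that may need widening):

```python
import os
import tempfile

def to_seconds(t):
    h, m, s = t.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def read_chunk_near(path, target, chunk=10000):
    """Interpolate a byte offset from the first/last record times,
    then return the lines found in a chunk around that offset."""
    with open(path, "rb") as f:
        first = f.readline().decode()
        size = f.seek(0, 2)                      # seek to end = file size
        f.seek(max(0, size - 2000))              # tail holds the last record
        last = f.read().decode().splitlines()[-1]

        t0 = to_seconds(first.split(",")[1])
        t1 = to_seconds(last.split(",")[1])
        frac = (to_seconds(target) - t0) / (t1 - t0)
        pos = max(0, int(size * frac) - chunk // 2)

        f.seek(pos)
        if pos:
            f.readline()                         # sync to the next full line
        # Note: the final element may be a partial line and should be dropped
        # by the caller if it does not parse.
        return f.read(chunk).decode().splitlines()

# Demo: ten one-second records, then fetch a chunk near 01:00:05.
demo = tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv")
demo.write("".join("28/01/2015,01:00:0%d.000,0.0,-36.8,158.6,19.4\n" % i
                   for i in range(10)))
demo.close()
chunk = read_chunk_near(demo.name, "01:00:05.000")
os.unlink(demo.name)
```

Interpolation only gives an estimate, so if the target time is not in the returned chunk, the caller should widen the chunk and try again.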
revonatu Posted April 16, 2015 (edited)

I reduced the resolution to one line per second, which is enough for my needs, and now I can read the file into an array and pick my variables. Here is my working code (bloated with error checks):

If Not _FileReadToArray($posPath, $Pos, 4, " ") Then
    MsgBox(16, "Error", "Failed to read file. Error: " & @error)
    Exit
EndIf

Local $LineX = _ArrayBinarySearch($Pos, Round($MeanTime, 0), 0)
If Not @error Then
    ;MsgBox(64, "Found", "LineX= " & $LineX, 2) ; e.g. row 5200
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

Local $ArrayX = _ArrayExtract($Pos, $LineX, $LineX) ; extract the desired row to a new array
If Not @error Then
    ;_ArrayDisplay($ArrayX, "ArrayX", "", 96, Default, "TimeOfDay (UTC)|PosLat (deg)|PosLon (deg)|PosHeight (m)|AngleRoll (deg)|AnglePitch (deg)|Heading (deg)", 2)
Else
    MsgBox(16, "Error", "Error: " & @error)
    Exit
EndIf

$Lat = _ArrayUnique($ArrayX, 1) ; $ArrayX[1]
If Not @error Then
    ;_ArrayDisplay($Lat)
Else
    MsgBox(16, "Error", "Failed to extract. Error: " & @error)
EndIf
$Lon = _ArrayUnique($ArrayX, 2) ; $ArrayX[2]
;$Elev = _ArrayUnique($ArrayX, x) ; $ArrayX[x]
$Alt = _ArrayUnique($ArrayX, 3) ; $ArrayX[3]
$Head = _ArrayUnique($ArrayX, 6) ; $ArrayX[6]

Thanks for your help.

Edited April 16, 2015 by revonatu
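For comparison, the same read-to-array-and-binary-search approach can be sketched in Python using the standard bisect module on the parsed times (one record per second, as in the reduced file; the names and demo data are illustrative):

```python
import bisect

def to_seconds(t):
    h, m, s = t.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def pick_coordinates(lines, target):
    """lines: CSV records sorted by time; returns (lat, lon, alt) of the
    record whose time is closest to target ('hh:mm:ss.sss')."""
    times = [to_seconds(ln.split(",")[1]) for ln in lines]
    goal = to_seconds(target)
    i = bisect.bisect_left(times, goal)          # first index with time >= goal
    # The closest record is at i or i-1; compare the two candidates.
    if i > 0 and (i == len(times) or goal - times[i - 1] <= times[i] - goal):
        i -= 1
    fields = lines[i].split(",")
    return float(fields[3]), float(fields[4]), float(fields[5])

# Demo: three one-second records with distinct altitudes.
demo = ["28/01/2015,01:00:0%d.000,%d.0,-36.8,158.6,%d.0" % (i, i, 10 + i)
        for i in range(3)]
coords = pick_coordinates(demo, "01:00:01.400")
print(coords)
```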