dejhost Posted January 15, 2021

Hello! I am dealing with image annotations for machine-learning purposes. My input file is a .csv file which I need to edit. The output should still be .csv, but the little content that is actually required from the original file should be presented in a different structure. That means a lot of content from the original file can be deleted.

My challenge is that the input file is quite large. Right now I am just working with a snippet consisting of 8 rows (that's 1 image per row) and 822 columns. The final file will consist of more than 20,000 lines and possibly 2,000-3,000 columns. Even a file with 500 rows takes a lot of time, and I fear that the final file will simply fail. I checked the time consumption after each manipulation and wrote it as a comment behind the MsgBoxes (e.g. line 21).

Could you please take a look at my code below and the screenshots, as well as the small input file attached? In the original file, one row lists one image, followed by all of its annotations. In lines 64-78 of the code below I change the format: after this loop, each row contains only one annotation.

- Do you have an idea how to speed this up?
- Deleting columns (lines 125-131) fails when $i = 514. I do not understand why.
- For some reason, the array seems to remain quite large even after I deleted many columns. See screenshot 03 attached. Why is that?

I'd appreciate any help to mature the code. Thank you very much.

#include <Array.au3>
#include <File.au3>
#include <MsgBoxConstants.au3>
#include <String.au3>

;~ Start timer
Local $hTimer = TimerInit() ; Begin the timer and store the handle in a variable.
Local $sFilePath = "C:\Users\Jost\Documents\RnD\rePho\Readcsv\result_short.csv"
Local $aCSV
Local $i, $j = 0, $cols, $rows
Global $sSingleFill

;Read the csv-file to Array and check timer
_FileReadToArray($sFilePath, $aCSV, Default, ',')
$rows = $aCSV[0][0]
$cols = $aCSV[0][1]
ConsoleWrite("$rows = " & $rows & " $cols = " & $cols & @CRLF)

; Timer
Local $fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "21 Time Difference ", $fDiff) ;14 ms
$hTimer = TimerInit()

;~ Remove the first 2 columns of the array since they do not contain any valuable information
_ArrayColDelete($aCSV, 0)
_ArrayColDelete($aCSV, 0)
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "30 Time Difference", $fDiff) ;28 ms
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = $aCSV[0][0] - 2
$rows = $aCSV[0][0]
ConsoleWrite("$rows = " & $rows & " $cols = " & $cols & @CRLF)

;~ Remove columns with unwanted information (repetitive)
While $j < $cols - 6
    _ArrayColDelete($aCSV, $j + 1)
    _ArrayColDelete($aCSV, $j + 1)
    _ArrayColDelete($aCSV, $j + 1)
    _ArrayColDelete($aCSV, $j + 2)
    _ArrayColDelete($aCSV, $j + 6)
    $cols = $cols - 5
    $j = $j + 5
WEnd
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "50 Time Difference", $fDiff) ;2.8 s
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = UBound($aCSV, 1)
$aCSV[0][1] = UBound($aCSV, 2)
$rows = UBound($aCSV, 1)
$cols = UBound($aCSV, 2)
ConsoleWrite("$rows = " & $rows & " $cols = " & $cols & @CRLF)

;Each row is supposed to describe only 1 annotation. But right now it contains all annotations, which leads to many columns. So:
;duplicate the field containing the image name ($aCSV[$i][0]), and create a new row for each annotation (5 columns).
For $i = 2 To $aCSV[0][0] - 1 ; circle through all rows in array
    $j = 6 ; in column 6, the second annotation starts
    For $j = 6 To $aCSV[0][1] - 1 Step 5 ; circle through all cols...
        If $aCSV[$i][$j] <> "" Then ; as long as there are annotations.
            $sSingleFill = $aCSV[$i][0] ; pick the image name
            For $k = 0 To 4 ; add the next 4 columns
                $sSingleFill = $sSingleFill & "|" & $aCSV[$i][$j + $k]
                $aCSV[$i][$j + $k] = ""
            Next
        EndIf
        _ArrayAdd($aCSV, $sSingleFill) ;and shift them to a new row.
        $sSingleFill = ""
        $aCSV[0][0] = $aCSV[0][0] + 1 ;update field containing number of rows
    Next
Next
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "81 Time Difference", $fDiff) ;20.3 s

; update array dimensions
$aCSV[0][0] = UBound($aCSV, 1)
$aCSV[0][1] = UBound($aCSV, 2)
$rows = UBound($aCSV, 1)
$cols = UBound($aCSV, 2)
ConsoleWrite("$rows = " & $rows & " $cols = " & $cols & @CRLF)
_ArrayDisplay($aCSV, "finished transformation line 91", "5")
$hTimer = TimerInit()

;~ Clean up the array
_ArrayDelete($aCSV, 1) ;Delete the row containing the header
$aCSV[0][0] = $aCSV[0][0] - 1 ;Update field containing number of rows
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "98 Time Difference", $fDiff) ;0.4 s
_ArrayDisplay($aCSV, "deleted header line 100", "5")

; update array dimensions
$aCSV[0][0] = UBound($aCSV, 1)
$aCSV[0][1] = UBound($aCSV, 2)
$hTimer = TimerInit()

;Delete empty columns
For $i = 7 To $aCSV[0][1]
    _ArrayColDelete($aCSV, $i)
    $aCSV[0][1] = $aCSV[0][1] - 1
Next
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "114 Time Difference", $fDiff) ; 45 s
_ArrayDisplay($aCSV, "finished Delete empty columns line 117")
$hTimer = TimerInit()

; update array dimensions
$aCSV[0][0] = UBound($aCSV, 1)
$aCSV[0][1] = UBound($aCSV, 2)

;~ _DeleteEmptyRows($aCSV)
For $i = 1 To UBound($aCSV, 1) - 1
    ConsoleWrite("$i = " & $i & @CRLF)
    If $aCSV[$i][1] = "" Then
        _ArrayDelete($aCSV, $i)
        $aCSV[0][0] = $aCSV[0][0] - 1
    EndIf
Next
$aCSV[0][0] = UBound($aCSV, 1)
$aCSV[0][1] = UBound($aCSV, 2)
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "137 Time Difference", $fDiff) ;0.3s
_ArrayDisplay($aCSV, "Array cleaned up line 136")
$hTimer = TimerInit()

;~ Extract image-name from field $aCSV[$i][0]
For $i = 1 To $aCSV[0][0] - 1 ;Circle through all rows
    $string = $aCSV[$i][0] ;Pick field containing image name
    Local $rpos = StringInStr($string, "?") ;Search for delimiter
    $string = StringLeft($string, $rpos - 1)
    Local $lpos = StringInStr($string, "-", 0, -1) ;Search for delimiter
    $string = StringTrimLeft($string, $lpos)
    ConsoleWrite("$string = " & $string & @CRLF)
    $aCSV[$i][0] = $string ;Insert image-name into $aCSV[$i][0] (overwrite)
Next
$fDiff = TimerDiff($hTimer)
MsgBox($MB_SYSTEMMODAL, "154 Time Difference", $fDiff) ; 10 s
_ArrayDisplay($aCSV, "All done line 156")
Exit

Func _DeleteEmptyRows($aCSV)
    Local $Rows = UBound($aCSV, 1)
    Local $Cols = UBound($aCSV, 2)
    Local $aTemp[$Rows][$Cols]
    Local $not_empty
    Local $Count = 0
    ;Loop through rows
    For $Y = 0 To $Rows - 1
        $not_empty = 0
        ;Loop through columns
        For $X = 0 To $Cols - 1
            ;Copy all columns to temp array even if they are all empty
            $aTemp[$Count][$X] = $aCSV[$Y][$X]
            ;If even one column contains data, make sure it doesn't get deleted
            If $aCSV[$Y][$X] <> "" Then $not_empty = BitOR($not_empty, 1)
        Next
        ;If the row has any data, increment, else keep overwriting last row until it contains something
        If $not_empty Then $Count += 1
    Next
    ReDim $aTemp[$Count][$Cols]
    Return $aTemp
EndFunc

result_short.csv
FrancescoDiMuro Posted January 15, 2021

@dejhost I think it would be easier for us to know what your goals are instead of trying to understand your script. Just make a list of "to do's" and we'll try our best to help you.
Nine Posted January 15, 2021

A few general comments:

1. Using _Array* functions repeatedly to add or delete 1 row/col each time is terribly slow with a large array; consider using ReDim instead.
2. Multiple steps (like you did) simplify the programmer's conceptual effort but generate extra machine execution work; consider reducing it to a single step (maybe 2 at most).
3. On Win10, array manipulation is quite slow with the current AutoIt version; consider using the beta version.
orbs Posted January 15, 2021

I fully agree with the two posts above, except that ReDim is also a poor choice for many successive operations. I'd go about it this way: read the input file line by line, and for each line in turn:

1) split the columns into a 1-D array of values.
2) build the target row, as a string, from the desired values.
3) write the new line into the target file.

This way you avoid loading the RAM too much, and you do not need to perform array manipulation at all.
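[Editor's note: a minimal sketch of the three steps above, in AutoIt. It is untested; the file names and the kept column indices (2, 10, 11) are placeholders, not taken from the real file layout.]

```autoit
#include <FileConstants.au3>
#include <StringConstants.au3>

; Sketch of the streaming approach: never load the whole file.
; The kept column indices below are hypothetical examples only.
Local $hIn = FileOpen("result_short.csv", $FO_READ)
Local $hOut = FileOpen("transformed.csv", $FO_OVERWRITE)
Local $sLine, $aFields
While 1
    $sLine = FileReadLine($hIn)                       ; read one input line
    If @error Then ExitLoop                           ; stop at end of file
    $aFields = StringSplit($sLine, ",", $STR_NOCOUNT) ; 1) split into a 1-D array
    If UBound($aFields) < 12 Then ContinueLoop        ; skip short/broken lines
    ; 2) build the target row as a string, 3) write it to the target file
    FileWriteLine($hOut, $aFields[2] & "," & $aFields[10] & "," & $aFields[11])
WEnd
FileClose($hIn)
FileClose($hOut)
```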
Nine Posted January 15, 2021

1 minute ago, orbs said:
"except that ReDim is also a poor choice for many successive operations"

I was not clear enough: instead of deleting the last X columns with _ArrayColDelete one at a time and deleting the last Y rows with _ArrayDelete one at a time, you can ReDim the array in a single statement (which is quite fast). For example:

#include <Constants.au3>
#include <Array.au3>

Local $arr[200][1000]
For $i = 0 To UBound($arr, 1) - 1
    For $j = 0 To UBound($arr, 2) - 1
        $arr[$i][$j] = $i & "-" & $j
    Next
Next
; _ArrayDisplay($arr)
Local $hTime = TimerInit()
ReDim $arr[10][10]
MsgBox($MB_SYSTEMMODAL, "", TimerDiff($hTime))
_ArrayDisplay($arr)
dejhost Posted January 15, 2021

7 hours ago, FrancescoDiMuro said:
"I think it would be easier for us to know what your goals are instead of trying to understand your script. Just make a list of 'to do's' and we'll try our best to help you."

Good point, thank you. Since I have received quite valuable comments already, let me improve my script and come back to you afterwards with a to-do list. However, this is the goal: a .csv looking like this:

---

Thanks everybody: I had no idea that array operations are relatively slow. Reading and processing one line at a time makes sense to me.
FrancescoDiMuro Posted January 16, 2021

@dejhost As I was saying, knowing the final result is a must in order to build a script that works for your needs, but this may help you to continue your development:

#include <Array.au3>
#include <File.au3>
#include <StringConstants.au3>

Test()

Func Test()
    Local $strFileName = @ScriptDir & "\result_short.csv", _
          $arrFileContent, _
          $arrImages[1][8] = [["Filename", "Width", "Height", "X_Min", "Y_Min", "X_Max", "Y_Max"]], _
          $arrImageFileName, _
          $hTimer

    ; Start the timer
    $hTimer = TimerInit()

    _FileReadToArray($strFileName, $arrFileContent, $FRTA_COUNT, ",")
    If @error Then Return ConsoleWrite("_FileReadToArray ERR: " & @error & @CRLF)

    ; Loop through the array content
    For $i = 1 To $arrFileContent[0][0] Step 1

        ; If the image has no width or height, then just continue the loop
        If ($arrFileContent[$i][10] = "") Or ($arrFileContent[$i][11] = "") Then ContinueLoop

        ; Obtaining the filename of the image
        $arrImageFileName = StringRegExp($arrFileContent[$i][2], '(?<=%)([^.]+\.jpg)', $STR_REGEXPARRAYMATCH)

        ; If the name of the image has been obtained correctly
        If IsArray($arrImageFileName) Then
            ; Adding the elements to the Images array
            _ArrayAdd($arrImages, $arrImageFileName[0] & "|" & _
                                  $arrFileContent[$i][11] & "|" & _
                                  $arrFileContent[$i][10] & "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "|" & _
                                  "")
            If @error Then Return ConsoleWrite("_ArrayAdd ERR: " & @error & @CRLF)
        EndIf
    Next

    ; Time for the script execution
    ConsoleWrite("Script executed in " & TimerDiff($hTimer)/1000 & " sec." & @CRLF)

    ; Display the Images array
    _ArrayDisplay($arrImages)
EndFunc

And here's the execution time of the script:

Script executed in 0.0331623 sec.
junkew Posted January 18, 2021

Maybe first read it into an SQLite database and then query it into an easier-to-understand 13-column table. Here is another example, first transforming it in PowerShell (you could do the same with an SQLite database). PowerShell is powerful at dealing with csv files and querying them directly with named columns instead of numbers. Your test file misses 1 column at the end. You have to create the calculations yourself. The script below writes its output into the file transformed.csv, which you can then deal with from AutoIt.

clear-host
<#
ID                          : ckj9snx1r00003a699vfsn650
DataRow ID                  : ckj9odjg47bie0rh60mq7edin
Labeled Data                : https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2Fa7fdb985-6331-2931-da04-ad7fa0e148c1-n101_0058.jpg?Expires=1611474714883&KeyName=labelbox-assets-key-1&Signature=uviaZ-AvHVLSuDCabq3Fw9CU13E
Label/objects/0/featureId   : ckj9qsu9707br0yaf04i5bzxd
Label/objects/0/schemaId    : ckj9q1xtm00ko0yae812f3asg
Label/objects/0/title       : Ocluded Ucoded Target
Label/objects/0/value       : ocluded_ucoded_target
Label/objects/0/color       : #008941
Label/objects/0/bbox/top    : 803
Label/objects/0/bbox/left   : 0
Label/objects/0/bbox/height : 24
Label/objects/0/bbox/width  : 20
Label/objects/0/instanceURI : https://api.labelbox.com/masks/feature/ckj9qsu9707br0yaf04i5bzxd?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOiJja2o5cTNjeTg3anRvMDcxNmU2Zzd0ejEzIiwib3JnYW5pemF0aW9uSWQiOiJja2o5bzFudmY2cHFrMDcxNnZuMjRub3o3IiwiaWF0IjoxNjEwMjY1MTEzLCJleHAiOjE2MTI4NTcxMTN9.nIOv1e49Ci1etM4YxN57hujTKB0kQpueX26YULxFrsI
...
Label/objects/81/bbox/width
#>

$rawData = Import-Csv -Path "result_short.csv"
$rowcount = $rawdata.count
$columnCount = $rawdata[0].psobject.properties.name.count

#10 is default number of columns
$repeatNNTimes = ($columncount - 2) / 10

#0-81 repeats for all blocks
for ($i = 0; $i -le $repeatNNTimes - 1; $i++) {
    $prefix = "Label/objects/$i"
    $global:rowId = 0
    $transformedData = $rawdata | select-object @{name="rowid"; expression={$global:rowId = $global:rowId + 1; return $global:rowId}},
        @{name="colid"; expression={return $i}},
        @{name="imgname"; expression={$_."Labeled Data" -match "[^-]*.jpg" | out-null; return $matches[0]}},
        @{name="width"; expression={$_."$prefix/bbox/width"}},
        @{name="height"; expression={$_."$prefix/bbox/height"}},
        @{name="class"; expression={"tag"}},
        @{name="xmin"; expression={"xmin"}},
        @{name="ymin"; expression={"ymin"}},
        @{name="xmax"; expression={"xmax"}},
        @{name="ymax"; expression={"ymax"}},
        @{name="instanceURI"; expression={$_."$prefix/instanceURI"}} |
        where {$_.instanceURI -ne ""}

    if ($i -eq 0) {
        ($transformedData | select -First 1 | convertto-csv -NoTypeInformation)[0] | out-file transformed.csv -encoding utf8 -force
    }
    $transformedData | convertto-csv -NoTypeInformation | Select-Object -Skip 1 | out-file transformed.csv -Append -encoding utf8
    #$transformedData | format-table -AutoSize -HideTableHeaders
}
dejhost Posted January 20, 2021

Thanks everybody for your input. I re-wrote my script: it reads one line at a time, builds the output line in the form of an array, and finally appends the line to an output file. It's not super fast (I'm not exactly what you would call an expert on regexp or SQL), but it's acceptable.

Processing the input file, the script runs into trouble after 120 annotations of success. Reason: the csv input file breaks its pattern (in column 1130 or thereabouts). Fixing this manually just leads to more trouble a little later. So either the csv input file was inconsistent when I received it, or the conversion I performed (prior to reading it with AutoIt) screwed it up.

However, it seems to me that working with the original files is the better way. The disadvantage is that they are pretty hard for me to read. Please take a look at them (identical content, different format). Could you please advise me what would be a good approach? Should I work with the csv or the json? Thank you very much for giving me a hand.

csv.zip json.zip
dejhost Posted January 20, 2021

Below I add the structure of the output file as it is supposed to be (the first few annotations). The values in the second and third columns are retrieved from another source.

Filename       Width  Height  Class                  X_Min  Y_Min  X_Max  Y_Max
n101_0058.jpg  2848   4288    ocluded_ucoded_target  803    0      24     20
n101_0058.jpg  2848   4288    ocluded_coded_target   845    0      57     53
n101_0058.jpg  2848   4288    uncoded_target         1007   66     32     38
n101_0058.jpg  2848   4288    coded_target           1051   72     58     65
n101_0058.jpg  2848   4288    uncoded_target         1213   145    30     38
n101_0058.jpg  2848   4288    coded_target           1257   155    56     62
n101_0058.jpg  2848   4288    uncoded_target         1417   225    32     37
n101_0058.jpg  2848   4288    ocluded_coded_target   645    0      48     26
n101_0058.jpg  2848   4288    uncoded_target         795    48     31     33
n101_0058.jpg  2848   4288    coded_target           844    63     48     46
n101_0058.jpg  2848   4288    uncoded_target         999    130    32     32
n101_0058.jpg  2848   4288    uncoded_target         1205   212    30     29
n101_0058.jpg  2848   4288    coded_target           1251   226    52     44
Nine Posted January 20, 2021

I am supposing that the json file is the original and the csv is the result of the conversion of the json, right? Is each line in the json file an annotation, as you are calling it?
dejhost Posted January 20, 2021

Both attached files are exports from a software called Labelbox: https://labelbox.com

Quote: "Each line in the json file is an annotation as you are calling it?"

If you look at the json file in Notepad++ (see screenshot below), each line contains many annotations. Starting with column 3 ("Labeled Data"), every 10th column starts a new annotation. So columns 3, 13, 23, 33... are called "Labeled Data" and contain the image name (the image name is hidden in the link). But using this numbering failed in column 11200 (or thereabouts). However, each line describes one image (so about 50 images):

---

If you look at the json in a web browser (see screenshot below)... well: the first "0" relates to the first image. The second "0" (below "objects") is the first annotation. The values I am looking for are circled in red, and are part of the line in the table posted an hour ago:

---
Nine Posted January 20, 2021

Could you test this? It seems to be working:

#include <Constants.au3>
#include <Array.au3>
#include <File.au3>

Local $aFile = FileReadToArray("export-2021-01-10T07_51_53.836Z.json")
Local $hFile = FileOpen("Output.txt", $FO_OVERWRITE)
Local $sContent, $sLine, $aContent
Local $hTime = TimerInit()
For $i = 1 To UBound($aFile) - 1
    $sContent = StringRegExp(StringRegExp($aFile[$i], '"Labeled Data":"([^"]+)', 1)[0], '(?i)([^-]*?\.jpg)', 1)[0]
    ConsoleWrite($sContent & @CRLF)
    $aContent = StringRegExp($aFile[$i], '"value":"([^"]+).+?:\{"top":([^,]+),"left":([^,]+),"height":([^,]+),"width":([^}]+)', 3)
    $sLine = ""
    For $j = 0 To UBound($aContent) - 1 Step 5
        $sLine &= $sContent
        For $k = $j To $j + 4
            $sLine &= @TAB & $aContent[$k]
        Next
        $sLine &= @CRLF
    Next
    FileWrite($hFile, $sLine)
Next
FileClose($hFile)
MsgBox($MB_SYSTEMMODAL, "", Round(TimerDiff($hTime)/1000, 2))

I optimized the code. It went from ~5 secs to ~0.6 sec.
junkew Posted January 20, 2021

As said by @FrancescoDiMuro, it helps to get the overall target picture clear first. If it starts with JSON, you should manipulate that: the earlier in the process you can structure your data, the faster your full process will be. Install Notepad++ with the JSON Viewer plugin. A browser is fine, but more specific JSON viewer tools will help sometimes. I do not know a good JSON library in AutoIt that can easily flatten your file, but maybe others can give input on that.

See below how you can easily get your array from the JSON in PowerShell, directly as csv. You have to do the calculations yourself.

$jsonFile = "c:\temp\export-2021-01-10T07_51_53.836Z.json"
$rawData = Get-Content -Raw -Path $jsonFile | ConvertFrom-Json

$result = foreach ($row in $rawdata) {
    foreach ($o in $row.label.objects) {
        # write-host $row.'Labeled Data' $o.featureid $o.schemaid $o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
        # write-host $row."External ID" $o.featureid $o.schemaid $o.title $o.value $o.color $o.bbox.top $o.bbox.left $o.bbox.height $o.bbox.width
        [pscustomobject]@{
            id       = $row."External ID"
            featurid = $o.featureid
            schemaid = $o.schemaid
            title    = $o.title
            value    = $o.value
            color    = $o.color
            top      = $o.bbox.top
            left     = $o.bbox.left
            height   = $o.bbox.height
            width    = $o.bbox.width
        }
    }
}
$result | convertto-csv -NoTypeInformation | out-file result.csv -encoding utf8
seadoggie01 Posted January 20, 2021

I'd second junkew's suggestion to get the data from the JSON. This UDF is used by Danp2's WebDriver UDF, so I think it's pretty stable. Here's a brief example. (Note: your JSON file is huge; it may take a long time to load into AutoIt. I gave up after 20 seconds, but I use a relic.)

#include <json.au3>

Func Main()
    ; ...
    Local $sFileContents = FileRead("export-2021-01-10T07_51_53.836Z.json")

    ; Load the JSON into AutoIt
    Local $oJson = Json_Decode($sFileContents)
    If @error Then Exit ConsoleWrite("Json_Decode Error: " & @error & @CRLF)

    Local $sValue
    For $i = 0 To 5
        ; Get the value of the ID at item $i
        $sValue = Json_Get($oJson, "[" & $i & "][ID]")
        If @error Then Exit ConsoleWrite("Json_Get[" & $i & "] Error: " & @error & @CRLF)
        ConsoleWrite("ID: " & $sValue & @CRLF)
    Next
    ; ...
EndFunc
TheXman Posted January 21, 2021

Here's an example that uses the jq UDF to generate CSV data from the JSON file attached in one of the previous posts. It generates the 52,405 CSV records in less than 2 seconds on my laptop. The CSV contains the data items that were circled in red in the post above. If you want TSV data instead of CSV data, just change the @csv to @tsv in $JQ_FILTER.

#cs
    This example uses the jq UDF.
    https://www.autoitscript.com/forum/files/file/502-jq-udf-a-powerful-flexible-json-processor/
#ce

#include <Constants.au3>
#include "jq.au3"       ;<== Modify as needed

_jqInit("jq-win64.exe") ;<== Modify as needed
If @error Then Exit MsgBox($MB_ICONERROR + $MB_TOPMOST, "ERROR", StringFormat("ERROR: Unable to initialize jq - @error = %s", @error))

example()

Func example()
    Const $JSON_FILE = "export-2021-01-10T07_51_53.836Z.json" ;<== Modify as needed
    Const $JQ_FILTER = '.[] | ' & _
                       '{file: ."External ID", type: .Label.objects[]?} | ' & _
                       '{file, value: .type.value, top: .type.bbox.top, left: .type.bbox.left, height: .type.bbox.height, width: .type.bbox.width} | ' & _
                       '[.[]] | @csv'

    Local $hTimer = TimerInit()
    Local $sCmdOutput = _jqExecFile($JSON_FILE, $JQ_FILTER)
    Local $iTimerDiff = TimerDiff($hTimer)

    write_log_line(@CRLF & "================================")
    write_log_line($sCmdOutput)
    write_log_line(@CRLF & "Duration: " & StringFormat("%.3f seconds", $iTimerDiff / 1000))
EndFunc

Func write_log_line($sMsg = "")
    Const $TITLE_NOTEPAD = "[RegExpTitle:(?i)untitled - notepad]"
    Static $hWndNotepad = -1

    ;If we don't have a handle to notepad yet
    If $hWndNotepad = -1 Then
        ;If there isn't an existing instance of notepad running, launch one
        If Not WinExists($TITLE_NOTEPAD) Then Run("Notepad.exe")

        ;Get handle to notepad window
        $hWndNotepad = WinWait($TITLE_NOTEPAD, "", 3)
        If Not $hWndNotepad Then Exit MsgBox($MB_ICONERROR, "ERROR", "Unable to find Notepad window.")
    EndIf

    ;Paste msg to notepad text
    If WinExists($hWndNotepad) Then ControlCommand($hWndNotepad, "", "Edit1", "EditPaste", $sMsg & @CRLF)
EndFunc

Snippet of the output:

"n101_0058.jpg","ocluded_ucoded_target",803,0,24,20
"n101_0058.jpg","ocluded_coded_target",845,0,57,53
"n101_0058.jpg","uncoded_target",1007,66,32,38
"n101_0058.jpg","coded_target",1051,72,58,65
"n101_0058.jpg","uncoded_target",1213,145,30,38
.
.
.
"n101_0252.jpg","uncoded_target",2377,4194,40,54
"n101_0252.jpg","uncoded_target",2651,3126,40,48
"n101_0252.jpg","uncoded_target",2013,3421,31,47
"n101_0252.jpg","coded_target",2054,3224,54,75
"n101_0252.jpg","uncoded_target",2084,3170,33,42

Duration: 1.279 seconds

For anyone reading this later who's interested in understanding the jq filter that was used, here's the filter with comments. FYI, this filter can also be fed to the jqExec* functions as a file, instead of as a string literal, using the -f option. The "|" symbol basically tells the processor to feed the result of the previous part to the next part.

# For each top-level array entry
.[] |

# Reduce to only interesting json objects/data,
# creating an object for each .Label.objects' array object
{
    file: ."External ID",
    label: .Label.objects[]?
} |

# Reduce result to final set of objects with data of interest
{
    file,
    value: .label.value,
    top: .label.bbox.top,
    left: .label.bbox.left,
    height: .label.bbox.height,
    width: .label.bbox.width
} |

# Convert result to an array for input to @csv
[.[]] |

# Convert array to CSV
@csv
dejhost Posted January 21, 2021

I feel a bit overwhelmed by your enthusiasm! Thanks everybody!

@Nine: Works like a charm! Damn fast as well. I realised by now that the image name is also listed as "External ID" (easy to see, once you know about the json plugin for Notepad++ ;-)). So I retrieve the image name directly and skip the line

$sContent = StringRegExp(StringRegExp($aFile[$i],'"Labeled Data":"([^"]+)',1)[0],'(?i)([^-]*?\.jpg)',1)[0]

This brings it down to 0.5 s. 👍 One comment/question remains: as mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20,000 images, so 40 times larger. So I might go back to reading one line at a time, using "FileReadLine". Does that make sense?

@junkew: I agree with your "target picture" comment. But it wasn't so easy to understand it myself: only 1-2 days ago I realised that one can look at json files in a structured way (web browser), giving a much better understanding of the file structure. The Notepad++ plugin sure comes in handy. I haven't used PowerShell yet, but from reading about it on the web... possibly an even better tool than AutoIt (for this task). I also found https://pandas.pydata.org . But I will stick to AutoIt for now, since the task is basically done.

@seadoggie01: I too started using json.au3 yesterday evening, but it seemed rather slow, yes.

@TheXman: Nice! Very fast, too!

Once again: thank you everybody! Your scripts make it look so easy... 😉
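[Editor's note: the simplification described above might look something like the one-liner below. The exact pattern is an assumption, based on the "External ID" field shown in the JSON samples earlier in the thread; it is untested.]

```autoit
; Hypothetical replacement: pull the image name straight from "External ID"
$sContent = StringRegExp($aFile[$i], '"External ID":"([^"]+)', 1)[0]
```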
Nine Posted January 21, 2021

4 hours ago, dejhost said:
"@Nine: Works like a charm! Damn fast as well."

Thank you. Writing once per line of input makes it way faster...

4 hours ago, dejhost said:
"As mentioned, the input file is only a trial, dealing with 500 images. The final file will deal with more than 20,000 images, so 40 times larger."

At the moment 500 images take about 27.5 MB, so 40 times brings it to 1.1 GB. A tad too big, in my opinion, to use FileReadToArray; it may slow down the whole process. I would tend to agree with you that reading the json file line by line may be a better option. But nothing stops you from trying both methods... Let us know how it goes.
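[Editor's note: a rough, untested sketch of that line-by-line variant, assuming one JSON record per physical line as in the thread; the export file name is a placeholder and the per-line extraction is left as a stub.]

```autoit
#include <FileConstants.au3>

; Sketch: stream the json one line at a time, so the full 1.1 GB
; never has to sit in memory at once.
Local $hIn = FileOpen("export.json", $FO_READ)      ; hypothetical file name
Local $hOut = FileOpen("Output.txt", $FO_OVERWRITE)
Local $sJsonLine
While 1
    $sJsonLine = FileReadLine($hIn)
    If @error Then ExitLoop ; end of file
    ; ...extract the wanted fields from $sJsonLine (e.g. with StringRegExp)
    ;    and FileWrite the resulting row to $hOut...
WEnd
FileClose($hIn)
FileClose($hOut)
```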
TheXman Posted January 21, 2021 (edited)

9 hours ago, dejhost said:

@TheXman: Nice! Very fast, too!

Thanks! I updated my previous post with a commented description of what the jq filter does, just in case anyone else who comes across this topic may be interested in jq and how it can be used to process JSON.

Edited January 21, 2021 by TheXman
TheXman Posted January 22, 2021 (edited)

@dejhost FYI, because you're dealing with such large datasets, it is easy to miss some of the details and data if you aren't careful. For example, in the sample JSON file that you provided, there are 2 images that don't have any .Label object information. Therefore, those 2 records did not show up in the CSV data. If that's what was expected, then all is good. If you expected to see all images, whether they had object information or not, then those types of records would need to be taken into account. So if you noticed and were wondering why you were missing some of the input images in the CSV data, that's why.

The 2 JSON objects without label object information:

{
  "ID": "ckjayp5lc00003a69wakzxsxg",
  "DataRow ID": "ckj9odjg37bi20rh6ei442ug7",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F0bacdc01-bee2-3b16-627e-671b8f6c2a1a-DSC04905.jpg?Expires=1611474714892&KeyName=labelbox-assets-key-1&Signature=dvsZdXT3dbUeRBvORvVxtMhhyMA",
  "Label": {},
  "Created By": "victor@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2020-12-30T05:11:30.000Z",
  "Updated At": "2020-12-30T05:23:05.000Z",
  "Seconds to Label": 32.094,
  "External ID": "DSC04905.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Trial",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjc8j0k60gk50yaw07n69wiu",
      "createdAt": "2020-12-31T02:34:25.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjayp5lc00003a69wakzxsxg"
}

{
  "ID": "ckjmlilco000039686bfxpf8k",
  "DataRow ID": "ckjm8xorhfbe80rj53hi0bg36",
  "Labeled Data": "https://storage.labelbox.com/ckj9o1nvf6pqk0716vn24noz7%2F4644e43b-5e58-ff51-d1e3-4551ed722d6f-n101_0408.jpg?Expires=1611474715610&KeyName=labelbox-assets-key-1&Signature=erTXaMYyBO56kurONJv6AVGt8zU",
  "Label": {},
  "Created By": "evans@oasisoutsourcing.co.ke",
  "Project Name": "SubSeaScanning",
  "Created At": "2021-01-07T08:31:15.000Z",
  "Updated At": "2021-01-07T08:31:16.000Z",
  "Seconds to Label": 219.925,
  "External ID": "n101_0408.jpg",
  "Agreement": -1,
  "Benchmark Agreement": -1,
  "Benchmark ID": null,
  "Dataset Name": "Aassgard Spool - Batch 1",
  "Reviews": [
    {
      "score": 1,
      "id": "ckjn49ska0p010yd17ws9eyyh",
      "createdAt": "2021-01-07T17:20:44.000Z",
      "createdBy": "victor@oasisoutsourcing.co.ke"
    }
  ],
  "View Label": "https://editor.labelbox.com?project=ckj9obfp954gq0718tasdrinc&label=ckjmlilco000039686bfxpf8k"
}

Edited January 22, 2021 by TheXman
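If the goal is to keep every image in the CSV, records with an empty "Label" object like the two above need an explicit fallback row. A minimal Python sketch of that idea (the field names follow the sample JSON; the "objects" key inside "Label" is an assumption about the export layout, not something shown in this thread):

```python
def rows_for_record(rec):
    """Return one (image, annotation) tuple per label object, or a single
    placeholder row when "Label" is empty, so images such as DSC04905.jpg
    above are not silently dropped from the CSV output."""
    image = rec.get("External ID")
    objects = (rec.get("Label") or {}).get("objects", [])
    if not objects:
        return [(image, None)]  # keep the image visible even without annotations
    return [(image, obj.get("title")) for obj in objects]
```

The same "emit a default when the object is empty" logic can be expressed in a jq filter or in the AutoIt scripts posted earlier; this is only an illustration of the branch that was missing.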