footswitch Posted July 23, 2010 (edited)

Hello there,

I'm successfully importing XML files into a SQLite database, parsing them with the UDFs from _XMLDomWrapper.au3. However, my script is taking too long to parse the XML files (about 3 times longer than opening the same XML file directly in Excel). This is mainly for two reasons:

1. I'm using too many If statements, because I need to check the "integrity" of the data that I'm importing (imagine that somewhere along the file one of the fields isn't present: it would ruin the import process);
2. There are probably better and safer ways of accomplishing this goal.

The script sequence is essentially as follows:

1. Open the XML file.
2. Get the node count for the parent nodes (populate the parent nodes).
3. For each parent node, get the node count for its child nodes (populate the child nodes).
4. For each child node, get its field names.
5. Compare every field name with the intended field names, to make sure the import of that child node will succeed.
6. For each child node, get the values from every field.
7. Transform the array into a formatted string (obeying another 3 or 4 rules so that the final string is compatible with the SQL INSERT command).
8. Repeat from step 4 until there are no more child nodes for that parent node, then start over from step 3 until there are no more parent nodes.

Initially the script took twice as long, because I was retrieving field values individually rather than grouped by child node.

So, do you have any suggestions? Alternative methods, perhaps? What about ADO? I tried some things, but with no success:

http://support.microsoft.com/?scid=kb%3Ben-us%3B263247&x=11&y=9
http://msdn.microsoft.com/en-us/library/aa468547.aspx

I couldn't write an equivalent for those lines in AutoIt without getting COM errors ("XML is invalid"?)...

EDIT (2): These errors happen because ADO can only handle certain XML types. The XML files you most regularly find don't comply with that format.
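For reference, here is a minimal sketch of what that ADO attempt looks like in AutoIt. The file name is invented, and the Open() arguments are the documented adOpenForwardOnly / adLockReadOnly / adCmdFile values; against a plain XML file such as the catalog sample below, this is exactly where the "XML is invalid" COM error appears, since the MSPersist provider only accepts ADO's own persisted-rowset XML format:

Global $oComErr = ObjEvent("AutoIt.Error", "_ComErrHandler") ; trap COM errors instead of crashing

Local $oRS = ObjCreate("ADODB.Recordset")
$oRS.Open("persisted_recordset.xml", "Provider=MSPersist;", 0, 1, 256) ; adOpenForwardOnly, adLockReadOnly, adCmdFile
If Not @error Then
    While Not $oRS.EOF
        ConsoleWrite($oRS.Fields(0).Value & @CRLF)
        $oRS.MoveNext()
    WEnd
    $oRS.Close()
EndIf

Func _ComErrHandler()
    ConsoleWrite("COM error: " & $oComErr.description & @CRLF)
    SetError(1) ; flag the error so the caller can test @error
EndFunc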
Your thoughts would be most appreciated.

footswitch

EDIT (1): Here's an example of what "my" XML parsing looks like (also check out the attached file xml_parsing.zip):

#include <Array.au3>
#include <_XMLDomWrapper.au3>

#region Load XML
$xml_file = "cd_catalog.xml"
$xml_string = FileRead($xml_file)
_XMLLoadXML($xml_string)
If @error Then
    MsgBox(16, "Error", _XMLError(""))
    Exit
EndIf
#endregion Load XML

$path = "/MAIN_PARENT/CATALOG"
$kids = _XMLGetChildNodes($path & "/*")
;_ArrayDisplay($kids)
$rows_total = _XMLGetNodeCount($path & "/CD")
ConsoleWrite("Total rows: " & $rows_total & @CRLF)
If $rows_total <= 0 Then
    MsgBox(16, "Error", "The selected XML file doesn't contain any useful data.")
    __ExitError()
EndIf
;Exit

__ReadXMLtoSQLite()

Func __ExitError()
    ;_SQLite_Close()
    ;_SQLite_Shutdown()
    Exit
EndFunc

Func __ReadXMLtoSQLite()
    $rows_progress = 0 ; tracks how far along the file processing is
    ;_SQLite_Exec(-1, "BEGIN IMMEDIATE;") ; starts the SQL transaction - data is written to memory
    $progress = 0
    $progress_lastupdate = 0
    $timer = TimerInit() ; start the timer
    For $h = 1 To _XMLGetNodeCount($path) ; for each CATALOG
        ConsoleWrite("-> CATALOG " & $h & @CRLF)
        For $i = 1 To _XMLGetNodeCount($path & "[" & $h & "]/CD") ; for each CD in each CATALOG
            $rows_progress += 1
            $progress = Floor(($rows_progress / $rows_total) * 100)
            If @error Then $progress = 0
            #region show progress
            If $progress > $progress_lastupdate Then
                ConsoleWrite("Progress: " & $progress & " %" & @CRLF)
                $progress_lastupdate = $progress
            EndIf
            #endregion show progress
            #region Count Checking
            $aValues = ""
            $sValue = ""
            $aValues = _XMLGetValue($path & "[" & $h & "]/CD[" & $i & "]/" & $kids[1])
            If IsArray($aValues) Then $sValue = $aValues[1]
            If $sValue <> 1 Then
                ; this doesn't apply to this XML file, but it's something I have to check with my XML reports.
            EndIf
            #endregion Count Checking
            $aValues = ""
            $sValues = ""
            ;#cs
            #region field checking
            ; the following verification is essential, because there's the possibility
            ; of having a node which doesn't contain a certain field, or the field isn't in the same order as expected
            ; This requires around 50% more processing time (e.g. instead of 20 seconds, it'll take 30)
            $aValues = _XMLGetChildNodes($path & "[" & $h & "]/*[" & $i & "]")
            ;_ArrayDisplay($aValues)
            If IsArray($aValues) Then
                If $aValues[0] <> 6 Then
                    MsgBox(16, "Error", "Record no. " & $rows_progress & " contains " & $aValues[0] & " fields when it should contain 6." & @CRLF & _
                            "The XML file is not valid. The program will exit.")
                    __ExitError()
                ElseIf $aValues[1] <> "TITLE" Or $aValues[2] <> "ARTIST" Or $aValues[3] <> "COUNTRY" Or $aValues[4] <> "COMPANY" Or _
                        $aValues[5] <> "PRICE" Or $aValues[6] <> "YEAR" Then
                    MsgBox(16, "Error", "The headers don't match. We're in trouble. I'm gonna run now.")
                    __ExitError()
                EndIf
            EndIf
            #endregion field checking
            ;#ce
            #region get values
            $aValues = _XMLGetValue($path & "[" & $h & "]/*[" & $i & "]/*")
            ;_ArrayDisplay($aValues)
            If Not IsArray($aValues) Then
                MsgBox(16, "Error", "Record no. " & $rows_progress & " was not read correctly. The program will exit.")
                __ExitError()
            EndIf
            If $aValues[3] == "" Then ; if this is blank (it should always be 1 or 2), then the whole row would be blank. avoid writing blank lines
                ConsoleWrite("-> Line " & $rows_progress & " is blank." & @CRLF)
                $write = False ; invalid line / no content
                ContinueLoop
            Else
                $sValues &= $aValues[3] & ","
            EndIf
            $sValues &= '"' & StringReplace($aValues[4], '"', "'") & '"' & ","
            If $aValues[5] == "" Then
                $sValues &= "NULL,"
            Else
                $sValues &= $aValues[5] & ","
            EndIf
            $sValues &= $aValues[6] & ","
            #endregion get values
            #region insert in SQL
            ;_SQLite_Exec(-1, "INSERT INTO tblPicagens VALUES (" & $sValues & ");") ; remove the trailing comma and replace ';' with ','
            ;If $iRval <> $SQLITE_OK Then ConsoleWrite("SQLite Error: " & _SQLite_ErrMsg() & @CRLF)
            #endregion insert in SQL
        Next
    Next
    ConsoleWrite("-> It took " & TimerDiff($timer) & " milliseconds." & @CRLF)
    Return
EndFunc

Edited July 27, 2010 by footswitch
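For comparison, a hypothetical sketch of the same traversal done directly through MSXML's COM interface rather than the wrapper UDFs (node paths assume the cd_catalog sample above; the snippet only prints rows instead of inserting them). selectNodes() fetches all CD nodes with a single XPath query, so no path has to be re-evaluated for every record:

Local $oXML = ObjCreate("Microsoft.XMLDOM")
$oXML.async = False
If $oXML.load("cd_catalog.xml") Then
    Local $oCDs = $oXML.selectNodes("/MAIN_PARENT/CATALOG/CD") ; all CD nodes in one query
    For $oCD In $oCDs
        Local $sRow = ""
        For $oField In $oCD.childNodes
            $sRow &= $oField.text & "," ; same comma-separated row format as the script above
        Next
        ConsoleWrite($sRow & @CRLF)
    Next
EndIf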
ChrisL Posted July 23, 2010

One thing that might help speed things up is to get SQLite to use a buffer while you do all your inserts. Quite a bit of time may be lost updating the table on each insert, because the database file is saved every time.

_SQLite_Exec(-1, "BEGIN;")

Process all of your XML files (steps 1-8 above).

_SQLite_Exec(-1, "COMMIT;")
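A self-contained sketch of that pattern with the stock SQLite UDFs (the database file, table name, and row count here are invented for illustration):

#include <SQLite.au3>

_SQLite_Startup()
Local $hDB = _SQLite_Open("catalog.db")
_SQLite_Exec($hDB, "CREATE TABLE IF NOT EXISTS tblCD (title, artist, price);")
_SQLite_Exec($hDB, "BEGIN;") ; leave autocommit mode: one disk write for the whole batch...
For $i = 1 To 1000
    _SQLite_Exec($hDB, 'INSERT INTO tblCD VALUES ("title' & $i & '","artist' & $i & '",' & $i & ');')
Next
_SQLite_Exec($hDB, "COMMIT;") ; ...instead of one per INSERT
_SQLite_Close($hDB)
_SQLite_Shutdown()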
ChrisL Posted July 23, 2010

Oh, one other thing you can do which may speed things up: rather than opening the XML file with the XML DOM directly, you can read the file into memory and parse the string with the DOM.

$XMLStr = FileRead("C:\File.xml")
_XMLLoadXML($XMLStr)
jchd Posted July 23, 2010

@ChrisL While I globally agree with you, bracketing a group of RMW SQL statements inside a transaction has nothing to do with "buffers". The helicopter-view effect of a transaction is to get out of autocommit mode for the duration of the transaction, which keeps the pager free to minimize disk writes. It doesn't, however, stop the writing of a journal file, so even large transactions never get to zero disk I/O. It's correct that bulk insert/update statements see a dramatic speed improvement when grouped in chunks of a few thousand, with the best figure depending on each situation. BTW, the OP is at times using several concurrent processes, hence using a long enough SQLite timeout and BEGIN IMMEDIATE/COMMIT is _much_ more robust in his case.
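Adapting the earlier transaction sketch to this advice would look roughly like the following (the 30-second timeout is an arbitrary example value):

_SQLite_SetTimeout($hDB, 30000) ; wait up to 30 s on a locked database instead of failing at once
If _SQLite_Exec($hDB, "BEGIN IMMEDIATE;") = $SQLITE_OK Then ; take the write lock up front
    ; ... bulk INSERTs here, in chunks of a few thousand rows ...
    _SQLite_Exec($hDB, "COMMIT;")
EndIf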
footswitch Posted July 24, 2010 (Author)

Thank you for your observations, gentlemen. The SQL transaction is already performed with BEGIN / COMMIT; the time SQLite consumes in this scenario is negligible compared to the XML parsing. Loading the XML file into memory is an interesting suggestion, though. Surprisingly enough, I hadn't even thought of it. It's a 10% improvement in speed. Thank you. Now I'm still wondering where the other 60% are...
footswitch Posted July 25, 2010 (Author)

I added an example script to the first post. Maybe it can be of use to someone someday.

That same script with the attached XML file: 8 seconds
Excel importing the XML file: 1.2 seconds

footswitch