rudi, posted March 31, 2023 (edited)

Hi,

for quite a bunch of PDF files with mixed and somewhat complex content, I've tried several of the tools out there that can convert PDF to text. Those that are command line capable, like PDFtoTEXT.EXE, are unfortunately 1.) slow and 2.) produce TXT files that don't present the content properly.

The best results so far came from Foxit Reader: it is fast and reliably creates correct content in the resulting TXT files, but it is not command line capable in a way that would allow batch processing.

So I wrote this script (ugly: it uses a lot of Send() commands, as ControlClick() on the appropriate control IDs doesn't seem to work). I would like to ask here whether someone has already done a better automation for Foxit Reader, or whether there is a good alternative approach to convert PDF to TXT files.

```autoit
#include <File.au3>
#include <Debug.au3>

$foxit = "C:\Program Files (x86)\Foxit Software\Foxit Reader\FoxitReader.exe"
$DirInputPDF = "c:\temp\folder-with-PDF-files" ; put some PDF files here with text content. Image-based PDFs cannot be processed that way!
$DirOutputTXT = "C:\temp\PDF2TXT-result"
DirCreate($DirOutputTXT)
$aPDF = _FileListToArray($DirInputPDF, "*.pdf", 1, 0)
; _DebugArrayDisplay($aPDF)

For $i = 1 To $aPDF[0]
    $NxPDF = $aPDF[$i]
    ShellExecute($foxit, '"' & $DirInputPDF & "\" & $NxPDF & '"') ; quoted so that paths with spaces work
    WinWait($NxPDF)
    ToolTip("document " & $NxPDF & " opened")
    WinActivate($NxPDF)
    ToolTip("window activated")
    Sleep(1000)
    Send("{alt}")
    ToolTip("ALT sent")
    Sleep(2000)
    Send("f") ; f for open file menu
    ToolTip("f sent")
    Sleep(2000)
    Send("a") ; a for save using alternative name
    ToolTip("a sent")
    Sleep(2000)
    ControlClick($NxPDF, "", "Button11") ; button to invoke a "search locations"
    ToolTip("alternative save name dialog button clicked")
    Sleep(1000)
    ; the file name input is active
    $TxtName = $DirOutputTXT & "\" & StringTrimRight($NxPDF, 4) ; cut off ".PDF"
    If FileExists($TxtName & ".TXT") Then FileDelete($TxtName & ".TXT")
    Send($TxtName)
    Sleep(2000)
    Send("!t") ; Combobox2 cannot be clicked, use ALT+t for file T-ype
    ToolTip("ALT+t sent to change file type")
    Sleep(1000)
    Send("{down}") ; drop down file type combo box
    ToolTip("{DOWN} sent 1st time")
    Sleep(1000)
    Send("{down}") ; select "TXT" file type
    ToolTip("{DOWN} sent 2nd time")
    Sleep(1000)
    Send("{enter}") ; end file type selection
    ToolTip("{ENTER} sent to end file type selection")
    Sleep(1000)
    ; doesn't work: ControlClick($NxPDF, "", "Button3") ; click SAVE button
    Send("!s") ; hotkey ALT+s for SAVE button
    Sleep(3000)
    $CtrlF4Count = 0
    While WinExists($NxPDF)
        $CtrlF4Count += 1
        WinActivate($NxPDF)
        Send("^{f4}") ; close current document
        ToolTip("CTRL+F4 sent " & $CtrlF4Count & " times")
        Sleep(1000)
    WEnd
    ToolTip($NxPDF & " done.")
Next
```

Any suggestions appreciated!

Earth is flat, pigs can fly, and Nuclear Power is SAFE!

mistersquirrle, posted March 31, 2023 (edited)

Just a thought of something else that you can try, if you haven't already: what about converting the PDF to a simple image (png/bmp/jpg), then using a more specific OCR program to read the images? It gives you the flexibility of not having to look for a PDF-specific OCR program, and you can try something like or any other CLI OCR.

An added benefit of converting to an image is that you can then also modify the image to improve the clarity of the text if the PDF isn't computer generated (for example, if it was created from a scanned document).

We ought not to misbehave, but we should look as though we could.

Musashi, posted March 31, 2023

45 minutes ago, mistersquirrle said: "What about converting the PDF to a simple image (png/bmp/jpg), then using a more specific OCR program to read the images?"

This is an interesting approach!

@rudi: You have already mentioned the command line tool pdftotext.exe. I therefore assume that you know the other tools, such as pdftopng.exe. In case not, you can find them here: https://www.xpdfreader.com/download.html
In particular, look for the Xpdf command line tools -> Windows 32/64-bit (Win 7 and newer).

These command line tools work stand-alone, which means that no installation is required.

Excerpt from the terms of use of the "open source stand-alone executables":
"If you want to use the stand-alone executables (pdftopng for example) with your application, you're free to do so. (To comply with the GPL, you'll need to distribute the Xpdf documentation along with the pdftopng executable - see the Xpdf README file for details.)"

```
; Infos:
; -r 300 = 300 DPI (150 = 150 DPI etc.)
; -q = quiet
; target: no need to set the extension .png
; For a multi-page PDF, one graphic is generated per page
pdftopng.exe -r 300 -q "source.pdf" "target"
```

"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."
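
For batch use, the pdftopng call above could be wrapped in a loop over a folder. This is only a minimal sketch: the location of pdftopng.exe and both folder paths are placeholders that are not from the thread, and the OCR step on the resulting images is left out.

```autoit
#include <File.au3>

; Assumptions (not from the thread): pdftopng.exe sits next to this script,
; and the two folder paths below are placeholders.
Local $sPdfToPng = @ScriptDir & "\pdftopng.exe"
Local $sDirIn = "C:\temp\folder-with-PDF-files"
Local $sDirOut = "C:\temp\PDF2PNG-result"

DirCreate($sDirOut)
Local $aPDF = _FileListToArray($sDirIn, "*.pdf", 1) ; 1 = files only
If @error Then Exit ConsoleWrite("No PDF files found in " & $sDirIn & @CRLF)

For $i = 1 To $aPDF[0]
    ; Output root without extension; pdftopng creates one PNG per page from it
    Local $sRoot = $sDirOut & "\" & StringTrimRight($aPDF[$i], 4)
    Local $sCmd = '"' & $sPdfToPng & '" -r 300 -q "' & $sDirIn & "\" & $aPDF[$i] & '" "' & $sRoot & '"'
    Local $iRc = RunWait($sCmd, "", @SW_HIDE)
    ConsoleWrite($aPDF[$i] & " -> exit code " & $iRc & @CRLF)
Next
```

RunWait blocks until each conversion has finished, so no Sleep() calls are needed between files.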

TimRude, posted April 1, 2023

With your Foxit Reader method, after loading the desired PDF, have you tried simply sending a Ctrl-a to select all text, then a Ctrl-c to copy it to the clipboard? Then your script can retrieve the copied text with ClipGet and do whatever you want with it (i.e. save it to a file, display it, manipulate it, etc.).

This method doesn't require as much faffing about with the menus and controls in Foxit, and it seems to produce cleaner text since it doesn't include all of the extraneous blank lines that the 'Save As' method generates.

For a very lengthy document, it might take a few seconds to select all of the text, and then another few seconds to copy it to the clipboard. So you'd have to figure out how to know when each step is done.

On the old version of Foxit Reader I use (the last one that lets you choose the classic toolbar instead of that horrible Ribbon Mode), a progress window appears while selecting the text after pressing Ctrl-a. When that window disappears, you can go on to copy the text to the clipboard using Ctrl-c, and another progress window appears while that takes place. By watching for these progress windows you could know when to take the next step. For a short document, the selecting and copying progress windows may appear only very briefly, but it seems that they are momentarily visible even then.

I don't know if the same thing happens with the later versions of Foxit Reader. You'd have to test to see.

Just another idea.
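
A rough sketch of this clipboard idea could look like the following. It does not watch the progress windows described above, since their titles are version-specific; instead it empties the clipboard first and polls until text shows up. The function name, window title, output path and timings are all assumptions for illustration.

```autoit
; Assumptions (not from the thread): the PDF is already open in Foxit Reader,
; "example.pdf" matches its window title, and the output path is a placeholder.
Local $sText = _CopyAllText("example.pdf")
If Not @error Then
    Local $hFile = FileOpen("C:\temp\example.txt", 2) ; 2 = overwrite mode
    FileWrite($hFile, $sText)
    FileClose($hFile)
EndIf

; Hypothetical helper: select all, copy, and wait for the clipboard to fill.
Func _CopyAllText($sTitle, $iTimeoutMs = 30000)
    ClipPut("") ; empty the clipboard so that new content can be detected
    WinActivate($sTitle)
    If Not WinWaitActive($sTitle, "", 10) Then Return SetError(1, 0, "")

    Send("^a") ; select all text
    Sleep(1000) ; crude pause; long documents may need more
    Send("^c") ; copy the selection

    ; Poll until the clipboard holds text or the timeout expires
    Local $hTimer = TimerInit()
    While TimerDiff($hTimer) < $iTimeoutMs
        Local $sClip = ClipGet()
        If Not @error And $sClip <> "" Then Return $sClip
        Sleep(250)
    WEnd
    Return SetError(2, 0, "")
EndFunc   ;==>_CopyAllText
```

The fixed one-second pause after Ctrl-a is the weak spot; waiting for the progress window to disappear, as described above, would be more robust once its title is known.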

Jfish, posted April 1, 2023 (edited)

I have not tested this (I don't have Foxit), but I was curious what ChatGPT (GPT-4) would say when asked about possible solutions to best automate the reader. I asked it specifically about using a COM API (Foxit does have an API). It barfed up the following tester, which may or may not work. Either way, the API reference docs can be found here. I would think the API would be more reliable than sending commands to the GUI.

```autoit
; Create a FoxitReader COM object
$foxit = ObjCreate("FoxitReader.Application")

; Open a PDF file
$doc = $foxit.Open("C:\example.pdf")

; Set the zoom level to 100%
$viewer = $doc.GetViewer()
$viewer.Zoom = 100

; Save the document
$doc.Save()

; Close the document and quit Foxit
$doc.Close()
$foxit.Quit()
```

Build your own poker game with AutoIt: pokerlogic.au3 | Learn To Program Using FREE Tools with AutoIt

pixelsearch, posted April 1, 2023

12 hours ago, TimRude said: "On the old version of Foxit Reader I use (the last one that lets you choose the classic toolbar instead of that horrible Ribbon Mode) [...]"

Hi Tim,
Could you please indicate the version of the old Foxit Reader you're using, the one with the classic toolbar?
Thanks

TimRude, posted April 1, 2023

4 hours ago, pixelsearch said: "Could you please indicate the version of the old FoxIt Reader you're using, the one with the classic toolbar ?"

It's version 7.2.8.1124 from 2015. Direct download link from the source: http://cdn01.foxitsoftware.com/pub/foxit/reader/desktop/win/7.x/7.2/en_us/FoxitReader728.1124_enu_Setup.exe

That was the last version where you could opt to use the classic toolbar. With anything after that you're stuck with the Ribbon. Here's the discussion about it on Foxit's forum: https://forums.foxitsoftware.com/forum/portable-document-format-pdf-tools/foxit-reader/152744-need-previous-version-of-foxit-reader-7-2#post152744

rudi (Author), posted April 4, 2023

Thanks to all repliers, interesting approaches, I'll go through all of them.

Rudi.

rudi (Author), posted April 4, 2023

@Jfish this COM approach is interesting; unfortunately I'm not familiar with it at all.

```autoit
Global $oGlobalCOMErrorHandler = ObjEvent("AutoIt.Error", "_ErrFuncGlobal") ; Global COM error handler

; Try to create the Foxit COM object
$foxit = ObjCreate("FoxitReader.Application")
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $foxit = ' & $foxit & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
ConsoleWrite(VarGetType($foxit) & @CRLF)

#cs
$pdf="C:\temp\1529683_1529684_1.pdf"
$doc=$foxit.open($pdf)
ConsoleWrite(VarGetType($doc) & @CRLF)
#ce

Func _ErrFuncGlobal($oError)
    ; Do anything here.
    ; taken from post by @water here: https://www.autoitscript.com/forum/topic/191401-com-error-handling-in-a-udf-best-practice/?do=findComment&comment=1373102
    ConsoleWrite(@ScriptName & " (" & $oError.scriptline & ") : ==> Global COM error handler - COM Error intercepted !" & @CRLF & _
            @TAB & "err.number is: " & @TAB & @TAB & "0x" & Hex($oError.number) & @CRLF & _
            @TAB & "err.windescription:" & @TAB & $oError.windescription & @CRLF & _
            @TAB & "err.description is: " & @TAB & $oError.description & @CRLF & _
            @TAB & "err.source is: " & @TAB & @TAB & $oError.source & @CRLF & _
            @TAB & "err.helpfile is: " & @TAB & $oError.helpfile & @CRLF & _
            @TAB & "err.helpcontext is: " & @TAB & $oError.helpcontext & @CRLF & _
            @TAB & "err.lastdllerror is: " & @TAB & $oError.lastdllerror & @CRLF & _
            @TAB & "err.scriptline is: " & @TAB & $oError.scriptline & @CRLF & _
            @TAB & "err.retcode is: " & @TAB & "0x" & Hex($oError.retcode) & @CRLF & @CRLF)
EndFunc   ;==>_ErrFuncGlobal
```

```
>"C:\Program Files (x86)\AutoIt3\SciTE\..\AutoIt3.exe" "C:\Program Files (x86)\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.au3" /run /prod /ErrorStdOut /in "C:\temp\foxit-automation.au3" /UserParams
+>16:01:26 Starting AutoIt3Wrapper (21.316.1639.1) from:SciTE.exe (4.4.6.0) Keyboard:00000407 OS:WIN_10/2009 CPU:X64 OS:X64 Environment(Language:0407) CodePage:0 utf8.auto.check:4
+> SciTEDir => C:\Program Files (x86)\AutoIt3\SciTE UserDir => C:\Users\admin.AD\AppData\Local\AutoIt v3\SciTE\AutoIt3Wrapper SCITE_USERHOME => C:\Users\admin.AD\AppData\Local\AutoIt v3\SciTE
>Running AU3Check (3.3.16.1) from:C:\Program Files (x86)\AutoIt3 input:C:\temp\foxit-automation.au3
+>16:01:26 AU3Check ended.rc:0
>Running:(3.3.16.1):C:\Program Files (x86)\AutoIt3\autoit3.exe "C:\temp\foxit-automation.au3"
+>Setting Hotkeys...--> Press Ctrl+Alt+Break to Restart or Ctrl+BREAK to Stop.
foxit-automation.au3 (4) : ==> Global COM error handler - COM Error intercepted !
    err.number is:          0x800401F3
    err.windescription:     Ungültige Klassenzeichenfolge -> invalid class string
    err.description is:
    err.source is:
    err.helpfile is:
    err.helpcontext is:
    err.lastdllerror is:    0
    err.scriptline is:      4
    err.retcode is:         0x00000000

@@ Debug(5) : $foxit = 0
>Error code: -2147221005
Int32
+>16:01:26 AutoIt3.exe ended.rc:0
+>16:01:27 AutoIt3Wrapper Finished.
>Exit code: 0 Time: 1.478
```
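
The error 0x800401F3 ("invalid class string") generally means that the ProgID passed to ObjCreate is not registered on the machine. A minimal sketch for checking this before calling ObjCreate follows; the helper name is hypothetical, and the ProgID is simply the unverified one from the post above.

```autoit
Local $sProgId = "FoxitReader.Application" ; name taken from the post above, unverified

If _IsProgIdRegistered($sProgId) Then
    Local $oFoxit = ObjCreate($sProgId)
    ConsoleWrite("ObjCreate returned an object: " & IsObj($oFoxit) & @CRLF)
Else
    ConsoleWrite($sProgId & " is not registered, so ObjCreate would most likely fail." & @CRLF)
EndIf

; Hypothetical helper: True if the ProgID has a CLSID entry under HKEY_CLASSES_ROOT.
Func _IsProgIdRegistered($sProgId)
    RegRead("HKEY_CLASSES_ROOT\" & $sProgId & "\CLSID", "") ; "" reads the (Default) value
    Return @error = 0
EndFunc   ;==>_IsProgIdRegistered
```

Note that a 32-bit and a 64-bit AutoIt process see different registry views, so the result can depend on which interpreter runs the script.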

Jfish, posted April 5, 2023

@rudi - I was digging around a bit, and I think it may have a dependency on the paid Foxit SDK. It appears you can get a free trial on their site, but then you need to pay after the trial. So not sure if this is a work thing ... if so, it may still be worth investigating. Then (untested) something like this:

```vb
Dim foxitApp As Object
Set foxitApp = CreateObject("FoxitReader.SDK.CommonUIAutomation")
```

ioa747, posted April 7, 2023

I prefer sumatrapdfreader, because it's fast, it's open source, and it supports many file types (especially .chm).

```autoit
Local $pdf = "D:\Documents\_pdf\MX Linux Users Manual.pdf"

FuncSpeedTest('_PdfToTxt($pdf)')

; #FUNCTION# ----------------------------------------------------------------------------
; Name...........: _PdfToTxt()
; Description ...: Save pdf as text with SumatraPDF
; Syntax.........: _PdfToTxt($pdf [, $Dest])
; Parameters ....: $pdf  - The path of the source pdf file
;                  $Dest - The destination dir where the txt file is saved
; Notes .........: if no $Dest parameter is given, destination = source dir
;                  https://www.sumatrapdfreader.org/download-free-pdf-viewer
;----------------------------------------------------------------------------------------
Func _PdfToTxt($pdf, $Dest = "")
    Local $Sumatra, $hWnd, $hSvWnd, $txt, $aTmp
    $Sumatra = "D:\i\Pro\SumatraPDF-3.4.6\SumatraPDF-3.4.6-32.exe"

    If $Dest = "" Then
        $txt = StringTrimRight($pdf, 4) & ".txt"
    Else
        If StringRight($Dest, 1) = "\" Then
            $Dest = StringTrimRight($Dest, 1)
        EndIf
        $aTmp = StringSplit($pdf, "\", 1)
        $txt = $aTmp[$aTmp[0]]
        $txt = $Dest & "\" & StringTrimRight($txt, 4) & ".txt"
    EndIf

    Run('"' & $Sumatra & '" "' & $pdf & '"', "", @SW_MINIMIZE)
    $hWnd = WinWait("[CLASS:SUMATRA_PDF_FRAME]")
    ControlSend($hWnd, "", "SUMATRA_PDF_CANVAS1", "^s")
    $hSvWnd = WinWait("Save As")
    ControlSetText($hSvWnd, "", "ComboBox1", $txt)
    ControlCommand($hSvWnd, "", "ComboBox2", "SelectString", 'Text documents')
    ControlSend($hSvWnd, "", "Button2", "{ENTER}")
    WinClose($hWnd)
EndFunc   ;==>_PdfToTxt

;----------------------------------------------------------------------------------------
Func FuncSpeedTest($sExecute)
    Local $hTimer = TimerInit()
    Execute($sExecute)
    ConsoleWrite($sExecute & " processed in: " & Round(TimerDiff($hTimer) / 1000, 3) & " seconds " & @LF)
EndFunc   ;==>FuncSpeedTest
```

I know that I know nothing

bdr529, posted April 7, 2023 (Solution)

ghostscript 9.56.1

```
gswin32c.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=txtwrite -dTextFormat=3 -sOutputFile=- -q input.pdf > output.txt 2>error.txt
```

To community goes all my regards and thanks
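
To run this over a whole folder from AutoIt, a loop along the following lines might do. It is a sketch under assumptions: the Ghostscript console executable is taken to be on the PATH under the name gswin64c.exe, the folder paths are placeholders, and -sOutputFile points at a file instead of stdout so that no shell redirection is needed.

```autoit
#include <File.au3>

; Assumptions (not from the thread): gswin64c.exe is on the PATH,
; and the two folder paths below are placeholders.
Local $sGs = "gswin64c.exe"
Local $sDirIn = "C:\temp\folder-with-PDF-files"
Local $sDirOut = "C:\temp\PDF2TXT-result"

DirCreate($sDirOut)
Local $aPDF = _FileListToArray($sDirIn, "*.pdf", 1) ; 1 = files only
If @error Then Exit ConsoleWrite("No PDF files found in " & $sDirIn & @CRLF)

For $i = 1 To $aPDF[0]
    Local $sTxt = $sDirOut & "\" & StringTrimRight($aPDF[$i], 4) & ".txt"
    ; Same switches as in the command above, but writing straight to $sTxt
    Local $sCmd = $sGs & ' -dNOPAUSE -dBATCH -dSAFER -sDEVICE=txtwrite -dTextFormat=3 -q' & _
            ' "-sOutputFile=' & $sTxt & '" "' & $sDirIn & "\" & $aPDF[$i] & '"'
    Local $iRc = RunWait($sCmd, "", @SW_HIDE)
    If @error Or $iRc <> 0 Then ConsoleWrite("Ghostscript failed for " & $aPDF[$i] & @CRLF)
Next
```

File names containing a percent sign may need extra care, because Ghostscript treats % in the output file name as a page-number format.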

rudi (Author), posted April 11, 2023

@bdr529 thanks for pointing out Ghostscript. I use GS quite a lot for other tasks, and so far I wasn't aware of the txtwrite output device.

At first I was disappointed by the results: gswin64c.exe v9.16 doesn't produce the expected output. It gives just a few lines for a ~350 page document, from which Foxit correctly extracts ~30000 lines of text (3/4 of these are just whitespace padding lines, but these are easy to ignore in the final content processing).

But after upgrading to the currently latest release, v10.01.1, the results look quite promising. The remaining constraint is that quite a lot of lines that Foxit saves as two lines (separate table rows in the original PDF file) are now saved as one line by gs. But that can be handled by the data processing done later on.
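
Ignoring the whitespace padding lines mentioned above could be done with a small filter like the one below. It is only a sketch; the function name and the file paths are made up for illustration.

```autoit
; Placeholder paths, not from the thread
Local $iKept = _DropBlankLines("C:\temp\PDF2TXT-result\input.txt", "C:\temp\PDF2TXT-result\input-clean.txt")
If Not @error Then ConsoleWrite($iKept & " non-blank lines kept" & @CRLF)

; Hypothetical helper: copy a text file, skipping empty and whitespace-only lines.
; Returns the number of lines written.
Func _DropBlankLines($sIn, $sOut)
    Local $aLines = FileReadToArray($sIn)
    If @error Then Return SetError(1, 0, 0)

    Local $hOut = FileOpen($sOut, 2) ; 2 = overwrite mode
    If $hOut = -1 Then Return SetError(2, 0, 0)

    Local $iKept = 0
    For $sLine In $aLines
        ; Flag 8 strips all whitespace, so padding lines end up with length 0
        If StringLen(StringStripWS($sLine, 8)) > 0 Then
            FileWriteLine($hOut, $sLine)
            $iKept += 1
        EndIf
    Next
    FileClose($hOut)
    Return $iKept
EndFunc   ;==>_DropBlankLines
```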

Cerno_b, posted January 5

Your initial script was perfect. I had to make some minor changes, probably because I use a newer version of Foxit Reader, but it did exactly what I needed it to do. I was able to replace most sleeps with 500 ms versions to make it faster.

In case you're interested in the changes I made:

- Save As uses the D shortcut in my version.
- The "search locations" button is Button5 in my version.
- Cutting off .pdf and replacing it with .txt is not necessary; this is done automatically once you change the file type in the dropdown.

Thanks a lot for your script, it resolved a lot of headaches for me as Foxit seems to be the only option I found viable when trying to read that script.

Cerno_b, posted January 5

@TimRude My initial hunch was to use copy/paste manually before I turned to scripting, but the export results are different, and the Save As version gave me better results (YMMV).