PDF to TEXT conversion -- foxit reader automation


Go to solution Solved by bdr529,

Posted (edited)

Hi,

For a large batch of PDF files with mixed and fairly complex content, I've tried several of the tools out there that can convert PDF to text. The command-line-capable ones, like pdftotext.exe, are unfortunately 1.) slow and 2.) produce TXT files whose content is not correct.

I got the best results so far with Foxit Reader: it is fast and reliably produces correct content in the resulting TXT files. However, Foxit Reader has no command line interface suitable for batch processing.

 

So I wrote this script (ugly: it uses a lot of Send() commands, as ControlClick() on the appropriate control IDs doesn't seem to work). I would like to ask whether someone has already done a better automation for Foxit Reader, or whether there is a good alternative approach to converting PDF to TXT files?

 

#include <File.au3>
#include <Debug.au3>



$foxit="C:\Program Files (x86)\Foxit Software\Foxit Reader\FoxitReader.exe"

$DirInputPDF="c:\temp\folder-with-PDF-files" ; put some PDF files here with Text content. Image based PDF cannot be processed that way!
$DirOutputTXT="C:\temp\PDF2TXT-result"
DirCreate($DirOutputTXT)

$aPDF=_FileListToArray($DirInputPDF,"*.pdf",1,0)

; _DebugArrayDisplay($aPDF)

for $i = 1 to $aPDF[0]
    $NxPDF=$aPDF[$i]
    ShellExecute($foxit,$DirInputPDF & "\" & $NxPDF)
    WinWait($NxPDF)
    ToolTip("document " & $NxPDF & " opened")
    WinActivate($NxPDF)
    ToolTip("window activated")
    Sleep(1000)
    send("{alt}")
    ToolTip("ALT sent")
    Sleep(2000)
    send("f") ; f for open file menu
    ToolTip("f sent")
    Sleep(2000)
    send("a") ; a for save using alternative name
    ToolTip("a sent")
    Sleep(2000)
    ControlClick($NxPDF,"","Button11") ; button to invoke a "search locations"
    ToolTip("alternative save name dialog button clicked")
    Sleep(1000)
    ; the file name input is active
    $TxtName=$DirOutputTXT & "\" & StringTrimRight($NxPDF,4) ; cut off ".PDF"
    if FileExists($TxtName & ".TXT") Then FileDelete($TxtName & ".TXT")
    send($TxtName) 
    Sleep(2000)
    send("!t") ; Combobox2 cannot be clicked, use ALT+t for file T-ype
    ToolTip("ALT+t sent to change file type")
    Sleep(1000)
    send ("{down}") ; drop down file type combo box
    ToolTip("{DOWN} send 1st time")
    Sleep(1000)
    send ("{down}") ; select "TXT" file type
    ToolTip("{DOWN} send 2nd time")
    Sleep(1000)
    send ("{enter}") ; end file type selection
    ToolTip("{ENTER} send to end file type selection")
    Sleep(1000)
    ; doesnt work ControlClick($NxPDF,"","Button3") ; click SAVE button
    send("!s") ; hotkey ALT+s for SAVE button
    Sleep(3000)
    $CtrlF4Count=0
    while WinExists($NxPDF)
        $CtrlF4Count+=1
        WinActivate($NxPDF)
        send("^{f4}") ; close current document
        ToolTip("CTRL+F4 sent " & $CtrlF4Count & " times")
        Sleep(1000)
    WEnd
    ToolTip($NxPDF & " done.")
Next

Any suggestions appreciated!

 

<edit> typos </edit>

Edited by rudi

Earth is flat, pigs can fly, and Nuclear Power is SAFE!

Posted (edited)

Just a thought of something else that you can try, if you haven't already:

What about converting the PDF to a simple image (png/bmp/jpg), then using a more specific OCR program to read the images? It gives you the flexibility of not looking for a PDF specific OCR program, and you can try something like 

or any other CLI OCR. An added benefit of converting to an image is that you can then also do some modifications to the image, to improve the potential clarity of the text if the PDF isn't computer generated (such as it was created from a scanned document).

 

Edited by mistersquirrle

We ought not to misbehave, but we should look as though we could.

Posted
45 minutes ago, mistersquirrle said:

What about converting the PDF to a simple image (png/bmp/jpg), then using a more specific OCR program to read the images?

This is an interesting approach!

@rudi: You have already mentioned the command line tool pdftotext.exe, so I assume that you know the other Xpdf tools, such as pdftopng.exe.

In case not, then you can find it here : https://www.xpdfreader.com/download.html

In particular the Xpdf command line tools -> Windows 32/64-bit (Win 7 and newer) : Download

These command line tools work 'stand-alone' , which means that no installation is required :).

Excerpt from the terms of use of the "open source stand-alone executables" :

If you want to use the stand-alone executables (pdftopng for example) with your application, you're free to do so. (To comply with the GPL, you'll need to distribute the Xpdf documentation along with the pdftopng executable - see the Xpdf README file for details.)

 

; Infos :
; -r 300 = 300 DPI (150 = 150 DPI etc)
; -q = quiet
; target : no need to set the extension .png 
; For a multi-page PDF, one graphic is generated per page

pdftopng.exe -r 300 -q "source.pdf" "target"

 


"In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move."

Posted

With your Foxit Reader method, after loading the PDF desired, have you tried simply sending a Ctrl-a to select all text, then a Ctrl-c to copy it to the clipboard? Then your script can retrieve the copied text with ClipGet and do whatever you want with it (i.e. save it to a file, display it, manipulate it, etc.). This method doesn't require as much faffing about with the menus and controls in Foxit, and it seems to produce cleaner text since it doesn't include all of the extraneous blank lines that the 'Save As' method generates.

For a very lengthy document, it might take a few seconds to select all of the text, and then another few seconds to copy it to the clipboard. So you'd have to figure out how to know when each step is done. On the old version of Foxit Reader I use (the last one that lets you choose the classic toolbar instead of that horrible Ribbon Mode), a progress window appears while the text is being selected after pressing Ctrl-a. When that window disappears, you can go on to copy the text to the clipboard using Ctrl-c, and another progress window appears while that takes place. By watching for these progress windows you can tell when to take the next step.

For a short document, the selecting and copying progress windows may appear only very briefly, but it seems that they are momentarily visible even then.

I don't know if the same thing happens with the later versions of Foxit Reader. You'd have to test to see.

Just another idea.
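
A minimal, untested sketch of this clipboard idea in AutoIt might look like the following. The file paths, the assumption that the PDF's file name shows up in the window title, and the fixed 2-second pause after Ctrl-a are all placeholders. Instead of watching for the progress windows (whose titles depend on the Foxit version), this sketch empties the clipboard first and then polls ClipGet() until text arrives.

```autoit
; Sketch only: paths, window title match and timings are assumptions, not tested against Foxit.
#include <FileConstants.au3>

Local $sFoxit = "C:\Program Files (x86)\Foxit Software\Foxit Reader\FoxitReader.exe"
Local $sPdf = "C:\temp\folder-with-PDF-files\example.pdf" ; hypothetical input file
Local $sTxt = "C:\temp\PDF2TXT-result\example.txt"        ; hypothetical output file

ShellExecute($sFoxit, '"' & $sPdf & '"')
Local $hWnd = WinWait("example.pdf", "", 30) ; assumes the file name appears in the window title
If $hWnd = 0 Then Exit 1                     ; window did not show up within 30 seconds
WinActivate($hWnd)
WinWaitActive($hWnd, "", 10)

ClipPut("")  ; empty the clipboard so old content is not mistaken for the new text
Send("^a")   ; select all text
Sleep(2000)  ; crude wait for the selection to finish
Send("^c")   ; copy the selection

; Poll the clipboard until text arrives or 30 seconds have passed
Local $hTimer = TimerInit(), $sText = ""
Do
    Sleep(250)
    $sText = ClipGet()
Until $sText <> "" Or TimerDiff($hTimer) > 30000
If $sText = "" Then Exit 2

; Write the copied text to the output file, creating the folder if needed
Local $hFile = FileOpen($sTxt, $FO_OVERWRITE + $FO_CREATEPATH)
FileWrite($hFile, $sText)
FileClose($hFile)
```

For very long documents the fixed Sleep after Ctrl-a may be too short, so waiting for the progress window to close, as described above, would be the more robust choice once its title is known.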

Posted (edited)

I have not tested this (I don't have Foxit), but I was curious what Chat GPT-4 would say when asked about possible solutions to best automate the reader. I asked it specifically about using a COM API (Foxit does have an API). It barfed up the following tester, which may or may not work. Either way, the API reference docs can be found here. I would think the API would be more reliable than sending commands to the GUI.

; Create a FoxitReader COM object
$foxit = ObjCreate("FoxitReader.Application")

; Open a PDF file
$doc = $foxit.Open("C:\example.pdf")

; Set the zoom level to 100%
$viewer = $doc.GetViewer()
$viewer.Zoom = 100

; Save the document
$doc.Save()

; Close the document and quit Foxit
$doc.Close()
$foxit.Quit()

 

Edited by Jfish

Build your own poker game with AutoIt: pokerlogic.au3 | Learn To Program Using FREE Tools with AutoIt

Posted
12 hours ago, TimRude said:

On the old version of Foxit Reader I use (the last one that lets you choose the classic toolbar instead of that horrible Ribbon Mode) [...]

Hi Tim,
Could you please indicate the version of the old FoxIt Reader you're using, the one with the classic toolbar ?
Thanks :)

Posted
4 hours ago, pixelsearch said:

Could you please indicate the version of the old FoxIt Reader you're using, the one with the classic toolbar ?

It's version 7.2.8.1124 from 2015. Direct download link from the source:

http://cdn01.foxitsoftware.com/pub/foxit/reader/desktop/win/7.x/7.2/en_us/FoxitReader728.1124_enu_Setup.exe

That was the last version you could opt to use the classic toolbar. Anything after that you're stuck with the Ribbon, :x

Here's the discussion about it on Foxit's forum:

https://forums.foxitsoftware.com/forum/portable-document-format-pdf-tools/foxit-reader/152744-need-previous-version-of-foxit-reader-7-2#post152744

Posted

Thanks to all repliers,

 

interesting approaches, I'll go through all of them.

 

Rudi.


Posted

@Jfish this COM approach is interesting, unfortunately I'm not familiar with it at all.

 

 

Global $oGlobalCOMErrorHandler = ObjEvent("AutoIt.Error", "_ErrFuncGlobal") ; Global COM error handler

$foxit = ObjCreate("FoxitReader.Application")
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $foxit = ' & $foxit & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
ConsoleWrite(VarGetType($foxit) & @CRLF)


#cs
$pdf="C:\temp\1529683_1529684_1.pdf"
$doc=$foxit.open($pdf)
ConsoleWrite(VarGetType($doc) & @CRLF)
#ce



Func _ErrFuncGlobal($oError)
    ; Do anything here.
    ;taken from post by @water here: https://www.autoitscript.com/forum/topic/191401-com-error-handling-in-a-udf-best-practice/?do=findComment&comment=1373102
    ConsoleWrite(@ScriptName & " (" & $oError.scriptline & ") : ==> Global COM error handler - COM Error intercepted !" & @CRLF & _
            @TAB & "err.number is: " & @TAB & @TAB & "0x" & Hex($oError.number) & @CRLF & _
            @TAB & "err.windescription:" & @TAB & $oError.windescription & @CRLF & _
            @TAB & "err.description is: " & @TAB & $oError.description & @CRLF & _
            @TAB & "err.source is: " & @TAB & @TAB & $oError.source & @CRLF & _
            @TAB & "err.helpfile is: " & @TAB & $oError.helpfile & @CRLF & _
            @TAB & "err.helpcontext is: " & @TAB & $oError.helpcontext & @CRLF & _
            @TAB & "err.lastdllerror is: " & @TAB & $oError.lastdllerror & @CRLF & _
            @TAB & "err.scriptline is: " & @TAB & $oError.scriptline & @CRLF & _
            @TAB & "err.retcode is: " & @TAB & "0x" & Hex($oError.retcode) & @CRLF & @CRLF)
EndFunc   ;==>_ErrFuncGlobal

>"C:\Program Files (x86)\AutoIt3\SciTE\..\AutoIt3.exe" "C:\Program Files (x86)\AutoIt3\SciTE\AutoIt3Wrapper\AutoIt3Wrapper.au3" /run /prod /ErrorStdOut /in "C:\temp\foxit-automation.au3" /UserParams    
+>16:01:26 Starting AutoIt3Wrapper (21.316.1639.1) from:SciTE.exe (4.4.6.0)  Keyboard:00000407  OS:WIN_10/2009  CPU:X64 OS:X64  Environment(Language:0407)  CodePage:0  utf8.auto.check:4
+>         SciTEDir => C:\Program Files (x86)\AutoIt3\SciTE   UserDir => C:\Users\admin.AD\AppData\Local\AutoIt v3\SciTE\AutoIt3Wrapper   SCITE_USERHOME => C:\Users\admin.AD\AppData\Local\AutoIt v3\SciTE 
>Running AU3Check (3.3.16.1)  from:C:\Program Files (x86)\AutoIt3  input:C:\temp\foxit-automation.au3
+>16:01:26 AU3Check ended.rc:0
>Running:(3.3.16.1):C:\Program Files (x86)\AutoIt3\autoit3.exe "C:\temp\foxit-automation.au3"    
+>Setting Hotkeys...--> Press Ctrl+Alt+Break to Restart or Ctrl+BREAK to Stop.
foxit-automation.au3 (4) : ==> Global COM error handler - COM Error intercepted !
    err.number is:      0x800401F3
    err.windescription: Ungültige Klassenzeichenfolge -> invalid class string

    err.description is:     
    err.source is:      
    err.helpfile is:    
    err.helpcontext is:     
    err.lastdllerror is:    0
    err.scriptline is:  4
    err.retcode is:     0x00000000

@@ Debug(5) : $foxit = 0
>Error code: -2147221005
Int32
+>16:01:26 AutoIt3.exe ended.rc:0
+>16:01:27 AutoIt3Wrapper Finished.
>Exit code: 0    Time: 1.478

 


Posted

@rudi - I was digging around a bit, I think it may have a dependency on the paid Foxit SDK. It appears you can get a free trial on their site but then you need to pay after the trial. So not sure if this is a work thing ... if so may still be worth investigating.

Then (untested) something like this: 

Dim foxitApp As Object
Set foxitApp = CreateObject("FoxitReader.SDK.CommonUIAutomation")

 


Posted

I prefer SumatraPDF because it's fast, it's open source, and it supports many file types (especially .chm).

Local $pdf = "D:\Documents\_pdf\MX Linux Users Manual.pdf"

FuncSpeedTest('_PdfToTxt($pdf)')


; #FUNCTION# ----------------------------------------------------------------------------
; Name...........: _PdfToTxt()
; Description ...: Save pdf as text with SumatraPDF
; Syntax.........: _PdfToTxt($pdf [, $Dest])
; Parameters ....: $pdf     - The path of the source pdf file
;                  $Dest    - The destination directory where the txt file is saved
; Notes .........: if no $Dest parameter is given, destination = source dir
; https://www.sumatrapdfreader.org/download-free-pdf-viewer
;----------------------------------------------------------------------------------------
Func _PdfToTxt($pdf, $Dest = "")
    Local $Sumatra, $hWnd, $hSvWnd, $txt, $aTmp

    $Sumatra = "D:\i\Pro\SumatraPDF-3.4.6\SumatraPDF-3.4.6-32.exe"

    If $Dest = "" Then
        $txt = StringTrimRight($pdf, 4) & ".txt"
    Else
        If StringRight($Dest, 1) = "\" Then
            $Dest = StringTrimRight($Dest, 1)
        EndIf
        $aTmp = StringSplit($pdf, "\", 1)
        $txt = $aTmp[$aTmp[0]]
        $txt = $Dest & "\" & StringTrimRight($txt, 4) & ".txt"
    EndIf

    Run('"' & $Sumatra & '" "' & $pdf & '"', "", @SW_MINIMIZE)

    $hWnd = WinWait("[CLASS:SUMATRA_PDF_FRAME]")
    ControlSend($hWnd, "", "SUMATRA_PDF_CANVAS1", "^s")

    $hSvWnd = WinWait("Save As")
    ControlSetText($hSvWnd, "", "ComboBox1", $txt)
    ControlCommand($hSvWnd, "", "ComboBox2", "SelectString", 'Text documents')
    ControlSend($hSvWnd, "", "Button2", "{ENTER}")

    WinClose($hWnd)

EndFunc   ;==>_PdfToTxt
;----------------------------------------------------------------------------------------
Func FuncSpeedTest($sExecute)
    Local $hTimer = TimerInit()
    Execute($sExecute)
    ConsoleWrite($sExecute & " processed in: " & Round(TimerDiff($hTimer) / 1000, 3) & " seconds " & @LF)
EndFunc   ;==>FuncSpeedTest

 

I know that I know nothing

  • Solution
Posted

ghostscript 9.56.1
gswin32c.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=txtwrite -dTextFormat=3 -sOutputFile=- -q input.pdf > output.txt 2>error.txt
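
For batch use from AutoIt, the command above could be wrapped in a loop like this untested sketch. The Ghostscript install path and the two folders are assumptions. Unlike the command line above, it lets Ghostscript write directly to a file via -sOutputFile instead of redirecting stdout, which avoids needing a shell for the redirection.

```autoit
; Sketch only: install path and folders are assumptions, adjust to your system.
#include <File.au3>

Local $sGs = "C:\Program Files\gs\gs10.01.1\bin\gswin64c.exe" ; assumed install location
Local $sDirIn = "C:\temp\folder-with-PDF-files"
Local $sDirOut = "C:\temp\PDF2TXT-result"
DirCreate($sDirOut)

; List the PDF files (files only, names without path)
Local $aPdf = _FileListToArray($sDirIn, "*.pdf", $FLTA_FILES)
If @error Then Exit 1 ; folder missing or no PDF files found

For $i = 1 To $aPdf[0]
    Local $sTxt = $sDirOut & "\" & StringTrimRight($aPdf[$i], 4) & ".txt"
    ; Same switches as in the command above, but writing straight to the TXT file
    Local $sCmd = '"' & $sGs & '" -dNOPAUSE -dBATCH -dSAFER -q -sDEVICE=txtwrite -dTextFormat=3' & _
            ' -sOutputFile="' & $sTxt & '" "' & $sDirIn & "\" & $aPdf[$i] & '"'
    Local $iRc = RunWait($sCmd, "", @SW_HIDE) ; wait for Ghostscript to finish this file
    ConsoleWrite($aPdf[$i] & " -> exit code " & $iRc & @CRLF)
Next
```

Note that Ghostscript treats a % in the output file name as a page-number format specifier, so PDF names containing % would need special handling.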

To community goes all my regards and thanks

Posted

@bdr529 thanks for pointing out Ghostscript. I use GS quite a lot for other tasks, and so far I wasn't aware of the txtwrite output device.

 

At first I was disappointed by the results: gswin64c.exe v9.16 doesn't produce the expected output, just a few lines for a ~350-page document from which Foxit correctly extracts ~30000 lines of text (3/4 of these are just whitespace padding lines, but these are easy to ignore in the final content processing).

 

But after upgrading to the currently latest release, v10.01.1, the results look quite promising. The remaining constraint is, that quite a lot of lines, that are saved as two lines by foxit (separate table rows in the original PDF file) are now saved as one line by gs. But that can be handled by the data processing done later on.
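
As an illustration of how the whitespace padding lines mentioned above could be dropped during later processing, here is a small untested AutoIt sketch. The input and output file names are hypothetical. It only removes lines that are empty or contain nothing but whitespace; it does not try to split the table rows that gs merges into one line.

```autoit
; Sketch only: file names are placeholders.
#include <FileConstants.au3>
#include <StringConstants.au3>

Local $sIn = "C:\temp\PDF2TXT-result\example.txt"        ; hypothetical extracted text
Local $sOut = "C:\temp\PDF2TXT-result\example-clean.txt" ; hypothetical cleaned copy

Local $aLines = FileReadToArray($sIn) ; 0-based array, one element per line
If @error Then Exit 1                 ; file missing or empty

Local $hOut = FileOpen($sOut, $FO_OVERWRITE)
If $hOut = -1 Then Exit 2

For $i = 0 To UBound($aLines) - 1
    ; Keep only lines that still contain something after removing all whitespace
    If StringStripWS($aLines[$i], $STR_STRIPALL) <> "" Then FileWriteLine($hOut, $aLines[$i])
Next
FileClose($hOut)
```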

 



  • 8 months later...
Posted

Your initial script was perfect. I had to make some minor changes, probably because I use a newer version of Foxit Reader, but it did exactly what I needed it to do. I was able to replace most of the sleeps with 500 ms ones to make it faster.

In case you're interested in the changes I made:

  • Save As uses the D shortcut in my version
  • The Search locations button is Button5 in my version
  • Cutting off .pdf and replacing it with .txt is not necessary, this is done automatically once you change the file type in the dropdown.

Thanks a lot for your script, it resolved a lot of headaches for me as Foxit seems to be the only option I found viable when trying to read that script.

Posted

@TimRude My initial hunch was to use copy/paste manually before I turned to scripting, but the export results are different and the Save As version gave me better results (YMMV).
