PDF Search

yclee99 · March 13, 2019

Dear All,

I have a large number of PDF files, and I need to check whether each of them contains certain strings (equipment tag names). For example, I have 500 PDF files and 100 equipment tag names in total, and I want to know which PDF files contain which tag names. In the past, we did this manually by opening each PDF file and searching for the tag names. We don't care which page a tag name appears on; what matters to us is which tag names each PDF file contains. The manual process is very time-consuming, so I wonder if there is a way to do it automatically.

My original idea is to convert each PDF to a text file (Xpdf's pdftotext) and search the text for the equipment tag names. Is there a better way to do this?

P.S. I just found out that pdftotext is not free for commercial use, which is my case, so I am trying to avoid it.

Skeletor · March 13, 2019

See if this topic is applicable to you:

Skeletor · March 13, 2019

As you pointed out, you can take your manual process and AutoIt!

Use Arrays and FileOpen, and Send("{^s}")

BigDaddyO · March 13, 2019

From my understanding, the Xpdf command-line tools are open source and free to use. It's XpdfReader that isn't free, and you don't need that.

Edit: I read more closely, and if you're going to be selling or distributing your app, then yes, you need a license.

;====================================================================================================================
;   Get the text out of a PDF file and return it as a String value
;   On failure returns 0 and sets @error to 1 (pdftotext.exe is missing, or it wrote something to StdErr)
;   The StdErr text is written to the console, because the @extended value of SetError can only hold an integer
;   $bMaintainLayout:       True = (Default) This will try to keep the spacing as it shows in the PDF file
;                           False = This will just display the text without any layout
;====================================================================================================================
Func _XPDF_GetText($sPDFFile, $bMaintainLayout = True)
    Local $sXpdftotext = @ScriptDir & "\pdftotext.exe"
    If Not FileExists($sXpdftotext) Then Return SetError(1, 0, 0)

    Local $sLayout = " "                                    ;Declared Local so it does not leak into the global scope
    If $bMaintainLayout Then $sLayout = " -layout "

    ;Run the converter and capture StdOut (2) and StdErr (4); the "-" output name makes pdftotext write to StdOut
    Local $iPid = Run('"' & $sXpdftotext & '"' & $sLayout & '"' & $sPDFFile & '" "-"', "", @SW_HIDE, 2 + 4)
    ProcessWaitClose($iPid)                                 ;Wait for it to finish before reading StdOut and StdErr

    Local $sResult = ""
    While 1                                                 ;Collect all the text pdftotext wrote to StdOut
        $sResult &= StdoutRead($iPid)
        If @error Then ExitLoop                             ;@error is set once there is nothing left to read
    WEnd

    Local $sErrOutput = ""
    While 1                                                 ;Collect anything pdftotext wrote to StdErr
        $sErrOutput &= StderrRead($iPid)
        If @error Then ExitLoop
    WEnd

    If $sErrOutput <> "" Then                               ;Anything on StdErr is treated as a problem reading the PDF
        ConsoleWrite("pdftotext: " & $sErrOutput & @CRLF)
        Return SetError(1, 0, 0)
    EndIf

    Return $sResult                                         ;Return the contents of the PDF as a string

EndFunc
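
For the original question (which PDF contains which tag), a minimal sketch of how the function above might be called is shown here. The folder path and tag names are made-up placeholders, and it assumes _XPDF_GetText from this post is in the same script with pdftotext.exe next to it.

```autoit
#include <File.au3>

; Hypothetical inputs, replace with your own folder and tag list
Local $sDir = "C:\Datasheets\"
Local $aTags[2] = ["P-101", "V-205"]

; List only the PDF files in the folder; element [0] holds the count
Local $aFiles = _FileListToArray($sDir, "*.pdf", $FLTA_FILES)
If @error Then
    ConsoleWrite("No PDF files found in " & $sDir & @CRLF)
    Exit
EndIf

For $i = 1 To $aFiles[0]
    ; Layout is not needed for a plain text search, so pass False
    Local $sText = _XPDF_GetText($sDir & $aFiles[$i], False)
    If @error Then ContinueLoop ; Skip files pdftotext could not read

    For $sTag In $aTags
        ; StringInStr is case-insensitive by default
        If StringInStr($sText, $sTag) Then ConsoleWrite($aFiles[$i] & " contains " & $sTag & @CRLF)
    Next
Next
```

Each PDF is converted only once and then searched for every tag, so the cost of running pdftotext.exe is paid per file rather than per file and tag.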

Edited March 13, 2019 by BigDaddyO

jguinch · March 14, 2019

if needed :

yclee99 · March 29, 2019

Thanks for your reply.

orbs · March 30, 2019

another approach would be to utilize the filtdump.exe utility (from Windows SDK) in conjunction with PDF IFilter. both free, and very easy to use.

IFilter allows Windows Search (formerly Windows Indexing Service) to parse text from non-textual files for indexing and searching purposes. FYI, IFilter for MS-Office files (and OpenOffice files) is installed by default with MS-Office installation (and is also provided as a standalone installer) - and that's why you can search for text inside Office files. the IFilter for PDF is a free 3rd-party component provided by Adobe (other vendors, like Foxit, also provide PDF IFilter, but that is paid).

once the PDF IFilter is installed, download the Windows SDK and get the command-line utility filtdump.exe (i use the Windows 7 SDK, but i see it exists in the Windows 10 SDK as well). filtdump.exe accepts an input file as a parameter, calls the appropriate IFilter to parse the text from that file, and then outputs the text to a new txt file, which you can then search.

yclee99 · April 12, 2019

Thanks guys.

I managed to develop the function I was looking for. Special thanks to BigDaddyO, as I am using the _XPDF_GetText function to achieve the functionality.

My next task is to improve the tool's performance (speed). Personally, I think Line 3 should be taken out of the loop, with a single _Excel_RangeWrite for the whole range instead, but I am struggling to find a way to do it. Any advice?

For $i = 1 To UBound($aDatasheet) - 1
   $readPDF = _XPDF_GetText($sDatasheetDir & $aDatasheet[$i])
   _Excel_RangeWrite($oNewWorkbook, "Sheet1", $sDatasheetDir & $aDatasheet[$i], "A" & $i + 1)
   For $j = 1 To UBound($aEquipment) - 1
       $readPDFoutput = StringReplace($readPDF, $aEquipment[$j], $aEquipment[$j])
       $iReplacedCount = @Extended
       If $iReplacedCount Then
           $iCol = $j + 1
           $sLetter = _Excel_ColumnToLetter($iCol)
           _Excel_RangeWrite($oNewWorkbook, "Sheet1", $iReplacedCount, $sLetter & $i + 1)
       EndIf
   Next
   _Excel_BookSave($oNewWorkbook)
Next

BigDaddyO · April 12, 2019

Do your initial _Excel_RangeRead and pull in the entire spreadsheet into a 2D array.

Then as you go through, you update the array with your new data.

Once all of the checking is finished, you can write out the entire 2D array to a new spreadsheet in a single _Excel_RangeWrite.

$aDataSheet[0][0] = A1, $aDataSheet[99][9] = J100, etc. (the array indexes are zero-based)
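
A rough sketch of the "update the array as you go" step, with no Excel calls inside the loops. The file names, tag names and PDF text below are invented sample data standing in for the real lists and for the output of _XPDF_GetText; element [0] of each list is left unused to match the 1-based loops in the question.

```autoit
#include <Array.au3>

; Hypothetical sample data
Local $aDatasheet[3] = ["", "pump.pdf", "valve.pdf"]
Local $aEquipment[3] = ["", "P-101", "V-205"]
; Stand-in for the text _XPDF_GetText would return for each PDF
Local $aText[3] = ["", "P-101 feeds V-205. P-101 is spared.", "No tags in this one"]

; One row per PDF, one column per tag; row 0 and column 0 hold the headers
Local $aResult[UBound($aDatasheet)][UBound($aEquipment)]
$aResult[0][0] = "File"
For $j = 1 To UBound($aEquipment) - 1
    $aResult[0][$j] = $aEquipment[$j]
Next

Local $iCount
For $i = 1 To UBound($aDatasheet) - 1
    $aResult[$i][0] = $aDatasheet[$i]
    For $j = 1 To UBound($aEquipment) - 1
        ; Replacing a tag with itself leaves the text unchanged; @extended holds the number of hits
        StringReplace($aText[$i], $aEquipment[$j], $aEquipment[$j])
        $iCount = @extended
        If $iCount Then $aResult[$i][$j] = $iCount
    Next
Next

; Inspect the finished table; this is the array that would go to Excel in one write
_ArrayDisplay($aResult, "Match counts")
```

Cells with no match simply stay empty, which mirrors the current behaviour of skipping the write when nothing is found.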

yclee99 · April 15, 2019

On 4/12/2019 at 8:16 PM, BigDaddyO said:

Do your initial _Excel_RangeRead and pull in the entire spreadsheet into a 2D array.

Then as you go through, you update the array with your new data.

Once all of the checking is finished, you can write out the entire 2D array to a new spreadsheet in a single _Excel_RangeWrite.

$aDataSheet[0][0] = A1, $aDataSheet[99][9] = J100, etc. (the array indexes are zero-based)

My current code calls _Excel_RangeWrite when a match is found and skips it when there is no match.

I will try to modify my code as you suggest above. The problem is that I don't know how to write a 2D array starting from a certain cell (e.g. from B2). I was not able to find an example in the forum.
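
One possible way to do this, sketched under the assumption that the standard Excel UDF (Excel.au3) is used: when _Excel_RangeWrite is given a single cell as the range and an array as the value, it should expand the range from that cell to fit the array. The array contents here are placeholder numbers, and Default is used for the worksheet so the sketch does not depend on the sheet being named "Sheet1".

```autoit
#include <Excel.au3>

; Hypothetical 2 x 3 block of match counts
Local $aCounts[2][3] = [[2, 0, 1], [0, 4, 0]]

Local $oExcel = _Excel_Open()
If @error Then
    ConsoleWrite("Could not start Excel, @error = " & @error & @CRLF)
    Exit
EndIf

Local $oWorkbook = _Excel_BookNew($oExcel)
If @error Then
    ConsoleWrite("Could not create a workbook, @error = " & @error & @CRLF)
    Exit
EndIf

; "B2" is the top-left cell of the target block:
; $aCounts[0][0] lands in B2 and $aCounts[1][2] lands in D3
_Excel_RangeWrite($oWorkbook, Default, $aCounts, "B2")
If @error Then ConsoleWrite("Write failed, @error = " & @error & @CRLF)
```

With this approach the file names can go into column A and the tag names into row 1 as separate writes, while the counts are written once starting at B2.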
