yclee99 Posted March 13, 2019

Dear All,

I have a large number of PDF files, and I need to check each of them for certain strings (equipment tag names).

Example: I have 500 PDF files and I want to know which of them contain specific equipment tag names. There are 100 equipment tag names in total. In the past, we did this manually by opening each PDF file and searching for the equipment tag name. We don't really care which page a tag name is on. What is important to us is which equipment tag names each PDF file contains.

This process is really time-consuming. I wonder if there is a way to do it automatically. My original idea is to convert the PDF to a text file (XPDF - pdftotext) and search it for the equipment tag names. Is there a better way to deal with this?

P.S. I just found out that pdftotext is not free, as I am using it for commercial purposes. I am trying to avoid using pdftotext.
Skeletor Posted March 13, 2019

See if this topic is applicable to you:
Skeletor Posted March 13, 2019

As you pointed out, you can take your manual process and automate it with AutoIt! Use arrays, FileOpen, and Send("^s").
BigDaddyO Posted March 13, 2019 (edited)

From my understanding, the Xpdf command-line tools are open source and free to use. It's the XpdfReader that isn't free, and you don't need it.

Edit: I read more closely, and if you're going to be selling or distributing your app then yes, you need a license.

;====================================================================================================================
; Get the text out of a PDF file and return it as a String value
; Requires pdftotext.exe in @ScriptDir; if it is missing, @error is set to 1 and 0 is returned
; If pdftotext.exe writes anything to StdErr, @error is set to 2, 0 is returned and the message is
;   written to the console (@extended can only hold an integer, so it cannot carry the message text)
; $bMaintainLayout: True  = (Default) This will try to keep the spacing as it shows in the PDF file
;                   False = This will just display the text without any layout
;====================================================================================================================
Func _XPDF_GetText($sPDFFile, $bMaintainLayout = True)
    Local $sXpdftotext = @ScriptDir & "\pdftotext.exe"
    If Not FileExists($sXpdftotext) Then Return SetError(1, 0, 0)

    Local $sLayout = " "
    If $bMaintainLayout Then $sLayout = " -layout "

    ; Run the converter and capture StdOut (2) and StdErr (4); the trailing "-" sends the text to StdOut
    Local $iPid = Run('"' & $sXpdftotext & '"' & $sLayout & '"' & $sPDFFile & '" "-"', "", @SW_HIDE, 2 + 4)
    ProcessWaitClose($iPid) ; Wait for it to finish before reading the StdOut and StdErr values

    Local $sResult = ""
    While 1 ; Loop through StdoutRead, collecting all the available text from the PDF file
        $sResult &= StdoutRead($iPid)
        If @error Then ExitLoop ; Once we reach the end of the output, exit the loop
    WEnd

    Local $sErrOutput = ""
    While 1 ; Loop through StderrRead in case there were any problems reading the PDF
        $sErrOutput = StderrRead($iPid)
        If @error Then ExitLoop ; Exit the loop once there is nothing more to read
        If $sErrOutput <> "" Then
            ; Anything on StdErr is treated as a failure
            ConsoleWrite("pdftotext: " & $sErrOutput & @CRLF)
            Return SetError(2, 0, 0)
        EndIf
    WEnd

    Return $sResult ; Return the contents of the PDF as a string
EndFunc
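Once the PDF text is available as a string, checking it against the list of equipment tags is a plain substring search. The following is a minimal sketch, not part of the posted function: the helper name `_FindTagsInText`, the sample text, and the tag names are all hypothetical, and the sample string only stands in for what `_XPDF_GetText` would return.

```autoit
; Hypothetical helper (not from the thread): returns the tags from $aTags
; that occur in $sText, joined with "|" (an empty string if none are found).
Func _FindTagsInText($sText, $aTags)
    Local $sFound = ""
    For $i = 0 To UBound($aTags) - 1
        ; StringInStr returns 0 when the substring is absent (case-insensitive by default)
        If StringInStr($sText, $aTags[$i]) Then $sFound &= $aTags[$i] & "|"
    Next
    Return StringTrimRight($sFound, 1) ; Drop the trailing "|"
EndFunc

; Sample text standing in for the return value of _XPDF_GetText($sPDFFile)
Local $sText = "Pump P-101 discharges to vessel V-205."
Local $aTags[3] = ["P-101", "V-205", "T-300"] ; Made-up tag names
ConsoleWrite(_FindTagsInText($sText, $aTags) & @CRLF) ; Prints P-101|V-205
```

Note that this is a plain substring match, so a tag such as "P-10" would also be reported for a document that only contains "P-101". If tags can be prefixes of one another, a stricter match (for example a regular expression with word boundaries) may be needed.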
jguinch Posted March 14, 2019

if needed :
orbs Posted March 30, 2019

Another approach would be to use the filtdump.exe utility (from the Windows SDK) in conjunction with a PDF IFilter. Both are free and very easy to use.

IFilter allows Windows Search (formerly Windows Indexing Service) to parse text from non-textual files for indexing and searching purposes. FYI, the IFilter for MS-Office files (and OpenOffice files) is installed by default with the MS-Office installation (and is also provided as a standalone installer), and that's why you can search for text inside Office files.

The IFilter for PDF is a free 3rd-party component provided by Adobe (other vendors, like Foxit, also provide a PDF IFilter, but that one is paid).

Once the PDF IFilter is installed, download the Windows SDK and get the command-line utility filtdump.exe (I use the Windows 7 SDK, but I see it exists in the Windows 10 SDK as well). filtdump.exe accepts an input file as a parameter, calls the appropriate IFilter to parse the text from that file, and then outputs the text to a new txt file. You can then search in that file.
yclee99 Posted April 12, 2019 Author

Thanks guys. I managed to develop the function I was looking for. Special thanks to BigDaddyO, as I am using the _XPDF_GetText function to achieve the functionality.

My next task is to improve the tool's performance (speed). Personally, I think line 3 should be taken out of the loop so that _Excel_RangeWrite is called just once for the whole range, but I am struggling to find a way to do it. Any advice?

For $i = 1 To UBound($aDatasheet) - 1
    $readPDF = _XPDF_GetText($sDatasheetDir & $aDatasheet[$i])
    _Excel_RangeWrite($oNewWorkbook, "Sheet1", $sDatasheetDir & $aDatasheet[$i], "A" & $i + 1)
    For $j = 1 To UBound($aEquipment) - 1
        $readPDFoutput = StringReplace($readPDF, $aEquipment[$j], $aEquipment[$j])
        $iReplacedCount = @Extended
        If $iReplacedCount Then
            $iCol = $j + 1
            $sLetter = _Excel_ColumnToLetter($iCol)
            _Excel_RangeWrite($oNewWorkbook, "Sheet1", $iReplacedCount, $sLetter & $i + 1)
        EndIf
    Next
    _Excel_BookSave($oNewWorkbook)
Next
BigDaddyO Posted April 12, 2019

Do your initial _Excel_RangeRead and pull the entire spreadsheet into a 2D array. Then, as you go through, update the array with your new data. Once all of the checking is finished, you can write the entire 2D array out to a new spreadsheet in a single _Excel_RangeWrite. The array is zero-based, so $aDataSheet[0][0] = A1, $aDataSheet[99][9] = J100, etc.
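The suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the poster's code: the file names, tag names, extracted text, and the output file name results.xlsx are all hypothetical sample data, and the array is built from scratch here instead of being read from an existing sheet. It also assumes Excel is installed.

```autoit
#include <Excel.au3>

; Hypothetical sample data standing in for the real file list, tag list and extracted text
Local $aFiles[3] = ["pump.pdf", "valve.pdf", "tank.pdf"]
Local $aTags[2] = ["P-101", "V-205"]
Local $aText[3] = ["P-101 feeds V-205. Spare: P-101", "V-205 only", "no tags here"]

; Row 0 holds the headers, column 0 holds the file names
Local $aOut[UBound($aFiles) + 1][UBound($aTags) + 1]
$aOut[0][0] = "File"
For $j = 0 To UBound($aTags) - 1
    $aOut[0][$j + 1] = $aTags[$j]
Next

; Fill the array in memory; no Excel calls inside the loops
Local $iCount
For $i = 0 To UBound($aFiles) - 1
    $aOut[$i + 1][0] = $aFiles[$i]
    For $j = 0 To UBound($aTags) - 1
        StringReplace($aText[$i], $aTags[$j], $aTags[$j]) ; @extended = number of occurrences
        $iCount = @extended
        If $iCount Then $aOut[$i + 1][$j + 1] = $iCount
    Next
Next

; One write and one save for the whole result
Local $oExcel = _Excel_Open()
If @error Then Exit 1
Local $oWorkbook = _Excel_BookNew($oExcel)
_Excel_RangeWrite($oWorkbook, Default, $aOut, "A1")
_Excel_BookSaveAs($oWorkbook, @ScriptDir & "\results.xlsx", $xlWorkbookDefault, True)
```

The point of the design is that each call into Excel crosses the COM boundary, so moving both the write and the save out of the loops replaces hundreds of slow calls with two.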
yclee99 Posted April 15, 2019 Author

My current code calls _Excel_RangeWrite when a match is found and skips it when there is no match. I will try to modify my code as per your suggestion above. The problem is that I don't know how to write a 2D array starting from a certain cell (e.g. from B2). I was not able to find an example in the forum.
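On writing a 2D array starting at a given cell: as far as I can tell from the Excel UDF, passing a single cell as the range of `_Excel_RangeWrite` makes the function extend the range to fit the array, so only the top-left cell needs to be named. A minimal sketch, assuming Excel is installed and using a made-up 2x3 array of counts:

```autoit
#include <Excel.au3>

; Hypothetical 2x3 block of match counts (rows = files, columns = tags)
Local $aCounts[2][3] = [[2, 0, 1], [0, 4, 0]]

Local $oExcel = _Excel_Open()
If @error Then Exit 1
Local $oWorkbook = _Excel_BookNew($oExcel)

; Only the top-left cell is given; the array's size decides how far the write extends (here B2:D3)
_Excel_RangeWrite($oWorkbook, Default, $aCounts, "B2")
If @error Then ConsoleWrite("_Excel_RangeWrite failed, @error = " & @error & @CRLF)
```

With this, row 1 and column A stay free for headers and file names, which can be written separately or included in the same array and written from A1 instead.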