zfisherdrums
Posted July 21, 2007 (edited)

At work, I needed a way to compare PDFs. I stumbled across the XPDF toolset some time ago. What I wanted was a utility that I could pass two PDFs into and have them rendered as text for a text-based comparison in WinMerge. Now, WinMerge does have a plugin that provides PDF comparison, but I am not thrilled with what it does to the layout. At least pdftotext.exe maintains a semblance of the layout (provided you use the appropriate args).

That said, here is the script that I'm using. It will consume two PDF files passed in via the command line or the Config.ini. It will then render each one as a text file using a PDF-to-text utility of your choosing, apply regex masking, and create a new, transformed file suitable for text-based comparison.

The Config.ini also holds the command lines for the PDF conversion tool (pdftotext.exe in my case) and the comparison tool (WinMerge in my case). Because the command lines for these tools are contained in the config file, you can change them out to suit your needs. I just like compiling once and letting the config file handle the user preference stuff.

Finally, regular expressions provide a level of masking to prevent false alarms. The key used in the Config file is the text that replaces whatever the pattern defined in the value matches. Placing a star (*) before the key will remove the matched text altogether. For example, any long-format US date would be replaced with <<<LongDate>>> by this entry:

<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d

So, download the attached zip, then the XPDF toolset from the link shown earlier, and drop pdftotext.exe in the folder to run the example.
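To make the masking rules concrete, here is a sketch of what a [RegEx] section could look like. Only the LongDate line comes from the Config.ini in this post; the PageFooter and Time rules are hypothetical, made up purely for illustration. The # prefix is not described above, but the script skips any key that starts with it. The ; comment lines are there for explanation only, on the assumption that they are ignored when the section is read, as is usual for INI files; strip them out if in doubt.

```ini
[RegEx]
; Plain key: every match is replaced by the key text itself.
<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
; Hypothetical rule: the leading * removes every match outright.
*PageFooter=Page \d+ of \d+
; Hypothetical rule: the leading # switches the rule off without deleting it.
#<<<Time>>>=\d\d?:\d\d (AM|PM)
```

With these rules, a line such as "Invoice dated July 21, 2007 Page 3 of 12" should come out with the date replaced by <<<LongDate>>> and the page footer removed, while times are left alone because that rule is switched off. The rules appear to run in the order they are listed, so it seems safest to put the more specific patterns first, the way Money sits above NumericValue in the Config.ini.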
I won't distribute pdftotext.exe here for obvious reasons.

I'm attaching the code I wrote here: PDFComparisonHelper.zip

For all you copy-and-pasters out there:

```autoit
#include <File.au3>

Global $RegExpressions, $PDF_A, $PDF_B, $PDFCommandLine, $CompareCommandLine
Global $ConfigPath = @ScriptDir & "\Config.INI"

; Convert one PDF to text, mask it, and return the path of the masked text file.
Func DoAllConversions( $file )
    If Not FileExists( $file ) Then Die( "File cannot be found: " & $file )

    Local $TextConvertedFile = ConvertPDFtoText( $file )
    If $TextConvertedFile = "" Then Die( "File could not be converted: " & $file )

    Local $RegExConvertedFile = ApplyRegExTranformations( $TextConvertedFile )
    Return $RegExConvertedFile
EndFunc

; Run the configured PDF-to-text tool. Returns the .txt path, or "" on failure.
Func ConvertPDFtoText( $file )
    Local $convertCommand = StringFormat( $PDFCommandLine, $file )
    ConsoleWrite( "Converting PDF to Text ---> " & $convertCommand & @CRLF )
    Local $exitCode = RunWait( $convertCommand, @ScriptDir )

    If $exitCode = 0 Then
        Return StringReplace( $file, ".pdf", ".txt" )
    Else
        Return ""
    EndIf
EndFunc

; Apply every rule from the [RegEx] section and write the result to *_RegEx.txt.
Func ApplyRegExTranformations( $filename )
    Local $FileRaw = FileRead( $filename )
    For $i = 1 To $RegExpressions[0][0]
        Switch StringMid( $RegExpressions[$i][0], 1, 1 )
            Case "*"
                ; Key starts with *: remove the matched text altogether
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1] )
            Case "#"
                ; SKIP THIS ONE
            Case Else
                ; Replace the matched text with the key itself
                ReplaceRegEx( $FileRaw, $RegExpressions[$i][1], $RegExpressions[$i][0] )
        EndSwitch
        ConsoleWrite( $RegExpressions[$i][0] & @CRLF )
    Next
    Local $NewFile = StringReplace( $filename, ".txt", "_RegEx.txt" )
    FileDelete( $NewFile )
    FileWrite( $NewFile, $FileRaw )
;~  ConsoleWrite( $FileRaw & @CRLF )
    Return $NewFile
EndFunc

Func ReplaceRegEx( ByRef $text, $pattern, $replace = "" )
    $text = StringRegExpReplace( $text, $pattern, $replace )
EndFunc

; Launch the configured comparison tool on the two masked text files.
Func DoComparisons( $fileA, $fileB )
    Local $commandLine = StringFormat( $CompareCommandLine, $fileA, $fileB )
    ConsoleWrite( $commandLine & @CRLF )
    RunWait( $commandLine, @ScriptDir )
EndFunc

Func Die( $Message )
    MsgBox( 0, @ScriptName, $Message )
    Exit
EndFunc

; //////////////////////////////////////////////////
; START HERE
; //////////////////////////////////////////////////

; Determine if Config file exists
If Not FileExists( $ConfigPath ) Then Die( "Config file cannot be found" )

; Define Regular Expressions
$RegExpressions = IniReadSection( $ConfigPath, "RegEx" )
If @error Then Die( "RegEx section not found in Config file" )

; Define PDF Command Line
$PDFCommandLine = IniRead( $ConfigPath, "Paths", "PDFCommandLine", "" )
If $PDFCommandLine = "" Then Die( "PDF Command Line not found in Config file" )

; Define Compare Tool Command Line
$CompareCommandLine = IniRead( $ConfigPath, "Paths", "CompareCommandLine", "" )
If $CompareCommandLine = "" Then Die( "Compare Command Line not found in Config file" )

; Read in File A and File B
If $CmdLine[0] >= 2 Then
    $PDF_A = $CmdLine[1]
    $PDF_B = $CmdLine[2]
Else
    $PDF_A = IniRead( $ConfigPath, "Paths", "LeftPath", "" )
    If $PDF_A = "" Then Die( "No File A path provided" )

    $PDF_B = IniRead( $ConfigPath, "Paths", "RightPath", "" )
    If $PDF_B = "" Then Die( "No File B path provided" )
EndIf

; Compare the two text files
DoComparisons( DoAllConversions( $PDF_A ), DoAllConversions( $PDF_B ) )
```

Config.ini

```ini
[Paths]
LeftPath=A.pdf
RightPath=B.pdf
PDFCommandLine=pdftotext.exe -layout "%s"
CompareCommandLine=""C:\\Program Files\\WinMerge\\WinMerge.exe" "%s" "%s"

[RegEx]
<<<LongDate>>>=(January|February|March|April|May|June|July|August|September|October|November|December) \d\d?, 20\d\d
<<<Money>>>=\$\d\d?\d?,?\d?\d?\d?,?\d?\d?\d?\.?\d?\d?\*?\*?
<<<NumericValue>>>=\d\d?\d?,?\d?\d?\d?,?\d?\d?\d?\.?\d?\d?\*?\*?

[Sandbox]
#<<<ExtraBlankLine>>>=(\r\n){2,}
```

Let me know if you have any questions,
Zach...

Edited July 21, 2007 by zfisherdrums