DeltaRocked Posted July 25, 2011 (edited)

Hello,

I have been trying to parse a PDF and decode the JavaScript that is embedded within it. The problem I am facing is with decoding the FlateDecode objects. I have been using zlib_udf.au3 by w00ter and also the Zlib function provided by Ward, without any success. While using w00ter's UDF I am getting -3 as the error, i.e. $Z_DATA_ERROR.

Any inputs?

Regards,
DeltaRocked

The SOLVED code. Huge thanks to ProgAndy and Ward.

Post no. 7 by ProgAndy: use the modifications done by ProgAndy if you are using zlib1.dll.

Code which needs to be modified in zlib_udf.au3:

    ; Decompresses data, you need to know how large the decompressed data will be.
    Func _Zlib_Uncompress($CompressedPtr, ByRef $CompressedSize, $UncompressedPtr, $UncompressedSize) ; modified by ProgAndy
        $call = DllCall($Zlib_Dll, "int:cdecl", "uncompress", "ptr", $UncompressedPtr, "long*", $UncompressedSize, "ptr", $CompressedPtr, "long", $CompressedSize)
        If @error Then Return SetError(1,0,-7)
        ; On return, the ByRef parameter carries the uncompressed length reported by zlib
        $CompressedSize = $call[2]
        Return $call[0]
    EndFunc ;==>_Zlib_Uncompress

    Func _ZLib_UncompressBinary($bBinary, $iLength = 0) ; ProgAndy
        Local $i=1, $tBuf, $iSize, $iRes
        Local $tBin = DllStructCreate("byte[" & BinaryLen($bBinary) & "]")
        DllStructSetData($tBin, 1, $bBinary)
        If $iLength < 1 Then $iLength = DllStructGetSize($tBin) * 2
        $bBinary = 0
        ; Retry with a larger output buffer for as long as zlib reports -5 (Z_BUF_ERROR)
        Do
            $tBuf = DllStructCreate("byte[" & $iLength * $i & "]")
            $iSize = DllStructGetSize($tBin)
            $iRes = _Zlib_Uncompress(DllStructGetPtr($tBin), $iSize, DllStructGetPtr($tBuf), DllStructGetSize($tBuf))
            $i += 1
        Until $iRes <> -5
        If $iRes <> 0 Then Return SetError($iRes, 0, "")
        $tBin = 0
        Return DllStructGetData(DllStructCreate("byte[" & $iSize & "]", DllStructGetPtr($tBuf)), 1)
    EndFunc

The script:

    #include <string.au3>
    #include <array.au3>
    #include "zlib_udf.au3"
    ;~ #include "zlib.au3"

    ;~ IOS
    ;~ $file = 'C:\pdf\ios_poc\'
    ;~ $file &= 'iPad1,1_3.2.1.pdf'

    ;~ Contagio
    ;~ $file = 'C:\pdf\contagio\'
    ;~ $file &= 'invitation.pdf'
    ;~ $file &= 'RB.pdf'
    ;~ $file &= 'SB.pdf'

    ;~ INFECTED
    ;~ $file = 'C:\pdf\infected\'
    ;~ $file &= '116d92f036f68d325068f3c7bbf1d535.pdf'
    ;~ $file &= '0_infect_invitation.pdf'

    ;~ POC
    ;~ $file = 'c:\pdf\poc\'
    ;~ $file &= 'eicar.pdf'
    ;~ $file &= 'goodness.pdf'
    ;~ $file &= 'hello-world-reverse-uri8.pdf'
    ;~ $file &= 'launch-action-cmd.pdf'
    ;~ $file &= 'testx.pdf'

    ;~ ADOBE INFECTED
    ;~ $file = 'C:\pdf\infected\adobe-0day\'
    ;~ $file &= '721601bdbec57cb103a9717eeef0bfca'

    ;~ Normal PDFs
    $file = 'C:\pdf\'
    ;~ $file = 'a4_1008-Form23AC.PDF'
    ;~ $file &= 'a3_R-intro.pdf'
    $file &= 'a2.pdf'
    ;~ $file &= 'ab.txt'

    $start_pt = '(?i) obj'
    $start_obj = '(?i)\d* \d* obj'
    $end_pt = '(?i)endobj'

    _Zlib_Startup()
    _CountPDFObj($file, $start_pt, $end_pt)
    ;~ _Zlib_Shutdown()
    Exit

    Func _CountPDFObj($fullfilename, $start_pt, $end_pt)
        Local $strpos = 0, $length = 10, $count_loop, $ex_data, $Decompressed, $header, $binlen
        Local $start_ex_pt = '(?i)>>\s*stream' ; & '\r\n' ;at the end of the string. this will include @CR @LF in the search.
        Local $end_ex_pt = '(?i)endstream' ;'\r\n' & ; at the start of the string
        If Not FileExists($fullfilename) Then Return SetError(1, 0, 0)
        $sData = FileRead($fullfilename)
        $start_array = StringRegExp($sData, $start_pt, 3)
        $start_obj_array = StringRegExp($sData, $start_obj, 3)
        $end_array = StringRegExp($sData, $end_pt, 3)
        FileDelete('c:\pdf\test.log')
        FileWrite('c:\pdf\test.log', 'Analyzing ' & $fullfilename & @CRLF)
        If IsArray($start_array) And IsArray($end_array) Then
            If UBound($start_array) == UBound($end_array) Then
                $count_loop = 1
                While $count_loop <= UBound($start_array)
                    $start_pos = StringInStr($sData, $start_array[$count_loop - 1], 2, $count_loop) + StringLen($start_array[$count_loop - 1])
                    $end_pos = StringInStr($sData, $end_array[$count_loop - 1], 2, $count_loop)
                    $ex_data = StringMid($sData, $start_pos, $end_pos - $start_pos)
                    If StringInStr($ex_data, 'stream', 2) And StringInStr($ex_data, '/flatedecode', 2) And StringInStr($ex_data, '/predictor', 2) == 0 And StringInStr($ex_data, '/BBox', 2) == 0 And StringInStr($ex_data, '/ASCIIHexDecode', 2) == 0 Then
                        $start_extract_array = StringRegExp($ex_data, $start_ex_pt, 3)
                        $end_extract_array = StringRegExp($ex_data, $end_ex_pt, 3)
                        If IsArray($start_extract_array) And IsArray($end_extract_array) Then
                            $start_ex_pos = StringInStr($ex_data, $start_extract_array[0], 2, 1) + StringLen($start_extract_array[0])
                            $end_ex_pos = StringInStr($ex_data, $end_extract_array[0], 2, 1)
                            ; NOTE: ProgAndy advises later in this thread against using StringStripWS on compressed data
                            $ex_ex_data = StringStripWS(StringMid($ex_data, $start_ex_pos, $end_ex_pos - $start_ex_pos), 3)
                            $binlen = BinaryLen($ex_ex_data)
                            $header = StringStripWS(StringLeft($ex_data, $start_ex_pos), 7) ; used for writing logs and has nothing to do with the extracted stream
                            $Decompressed = zlib($ex_ex_data, $binlen)
    ;~                      If StringInStr($Decompressed, '/javascript', 2) <> 0 Or StringInStr($Decompressed, 'else if', 2) <> 0 Then
    ;~                          MsgBox(0,'Info','JS Decrypted')
                            FileWrite('c:\pdf\test.log', $start_obj_array[$count_loop - 1] & @CRLF & $header & @CRLF & _
                                    'BinaryLen of the extracted compressed stream = ' & $binlen & @CRLF & '-------------------------' & _
                                    @CRLF & StringReplace($Decompressed, '>><<', '>>' & @CRLF & '<<') & @CRLF & '-------------------------' & @CRLF)
    ;~                      EndIf
                        EndIf
                    EndIf
                    $count_loop += 1
                WEnd
                Return UBound($start_array)
            Else
                Return SetError(1, 0, 0)
            EndIf
        Else
            Return SetError(2, 0, 0)
        EndIf
    EndFunc ;==>_CountPDFObj

    Func zlib($ex_ex_data, $binlen) ; requires zlib1.dll, zlib_udf.au3 and modification by progandy
        $Decompressed = _Zlib_UncompressBinary($ex_ex_data, $binlen)
        ; Both branches currently do the same thing
        If StringLeft($Decompressed, 2) == '0x' Then
            $Decompressed = _HexToString($Decompressed)
            Return $Decompressed
        Else
            $Decompressed = _HexToString($Decompressed)
            Return $Decompressed
        EndIf
    EndFunc ;==>zlib

    ;~ Func zlib($ex_ex_data, $binlen) ; requires zlib.au3 by WARD
    ;~     Dim $Decompressed = _ZLIB_Uncompress($ex_ex_data)
    ;~     If StringLeft($Decompressed, 2) == '0x' Then
    ;~         $Decompressed = _HexToString($Decompressed)
    ;~         Return $Decompressed
    ;~     Else
    ;~         $Decompressed = _HexToString($Decompressed)
    ;~         Return $Decompressed
    ;~     EndIf
    ;~ EndFunc ;==>zlib

PS: This is part of a tool I have been working on which will be used for PDF analysis. Extraction of the FlateDecode stream is complete, but I am unable to decode it.

Edited July 29, 2011 by deltarocked

ProgAndy Posted July 25, 2011

Hello,

If you want an answer, you should add an example script and a PDF file to test it.

*GERMAN* [note: you are not allowed to remove author / modified info from my UDFs]
My UDFs: [_SetImageBinaryToCtrl] [_TaskDialog] [AutoItObject] [Animated GIF (GDI+)] [ClipPut for Image] [FreeImage] [GDI32 UDFs] [GDIPlus Progressbar] [Hotkey-Selector] [Multiline Inputbox] [MySQL without ODBC] [RichEdit UDFs] [SpeechAPI Example] [WinHTTP]
UDFs included in AutoIt: FTP_Ex (as FTPEx), _WinAPI_SetLayeredWindowAttributes

DeltaRocked Posted July 25, 2011 (edited)

Hi ProgAndy,

My problem is that these PDFs are infected, so would it be all right if I just upload the code and a link to the PDFs? I do not want to end up getting banned for uploading something malicious.

Regards,
DeltaRocked

With this I am getting -5, i.e. Z_BUF_ERROR, or sometimes -3, i.e. Z_DATA_ERROR, and only very rarely does it decode. The string which it is able to decode is as follows and has been extracted from an infected PDF file.

I have uploaded the extracted PDF log. Searching for 116d92f036f68d325068f3c7bbf1d535.pdf in Google will provide you with the link.

Edited July 29, 2011 by deltarocked

ProgAndy Posted July 25, 2011 (edited)

You must not use StringStripWS on the compressed data. The format is stream@LF{{DATA}}@LFendstream. You need the unmodified data between stream@LF and @LFendstream, and then decompress it:

    ; $posOfStream is the position of the "stream" keyword; + 7 skips "stream" and the @LF after it
    $data = StringMid($sFile, $posOfStream + 7, StringInStr($sFile, @LF & "endstream", 1, 1, $posOfStream) - ($posOfStream + 7))
    $data = StringToBinary($data, 1)
    $UncompressedLength = ; I think this is the value of /Length1 in the obj-descriptor. ( /Length1 {{Length}} )
    $decompress = _Zlib_UncompressBinary($data, $UncompressedLength)
    MsgBox(0, "", BinaryToString($decompress))

If /Length1 is not available or incorrect, try this: in the beginning, use BinaryLen($data)*2, and each time Z_BUF_ERROR occurs, double the uncompressed size.

Edited July 25, 2011 by ProgAndy

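The extraction rule described in this post can be sketched in a few lines. This is a minimal illustration in Python rather than AutoIt, chosen only because the standard `zlib` module makes it easy to check; the function names `extract_stream` and `inflate_stream` are hypothetical and not part of any UDF in this thread. It assumes the stream keyword is followed by LF or CRLF and that the stream is not encrypted.

```python
import zlib


def extract_stream(pdf: bytes, start: int = 0) -> bytes:
    """Return the untouched bytes between 'stream<EOL>' and 'endstream'.

    Hypothetical helper: nothing is stripped from the data itself,
    only the single end-of-line that follows the 'stream' keyword is skipped.
    """
    begin = pdf.index(b"stream", start) + len(b"stream")
    if pdf[begin:begin + 2] == b"\r\n":
        begin += 2
    elif pdf[begin:begin + 1] == b"\n":
        begin += 1
    end = pdf.index(b"endstream", begin)
    return pdf[begin:end]


def inflate_stream(raw: bytes) -> bytes:
    """Inflate zlib data, ignoring any end-of-line left before 'endstream'."""
    d = zlib.decompressobj()
    out = d.decompress(raw)  # bytes after the end of the zlib stream land in d.unused_data
    return out + d.flush()


if __name__ == "__main__":
    payload = b"app.alert('hello');"
    comp = zlib.compress(payload)
    sample = (b"1 0 obj\n<< /Filter /FlateDecode /Length "
              + str(len(comp)).encode() + b" >>\nstream\n"
              + comp + b"\nendstream\nendobj\n")
    print(inflate_stream(extract_stream(sample)))
```

Using a decompression object instead of trimming the tail means a data byte that happens to equal a whitespace character is never removed; the trailing end-of-line is simply left over after the zlib stream ends.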
DeltaRocked Posted July 25, 2011

Hi ProgAndy,

Thanks for the input.

Regards,
DeltaRocked

DeltaRocked Posted July 26, 2011 (edited)

Hi,

Something is really wrong with this code. ab.txt contains the execution code. The problem is that obj 53 is decoded while all others give the -5 error (Z_BUF_ERROR).

Decoded objects with this code: 53 0 obj
Decoded text within quotes: "37 0 <</IDS 20 0 R/Javascript 50 0 R/URLS 21 0 R>>"

After decoding we learn that the next object to be read and executed is 50 0 obj. From this object we are pointed to 51 0 obj. This object reveals that Adobe Reader needs to execute a /JS, i.e. JavaScript, which is available in object 52 0 obj.

Why does only one object get decoded?

[EDIT UPDATE] I have been analysing the PDF using Python tools, and even those are not able to decode some of the sections, so I might be wrong.

Edited July 29, 2011 by deltarocked

ProgAndy Posted July 26, 2011 (edited)

$binlen must be the size of the uncompressed data. You use the size of the compressed data. That causes problems: compression reduces the size, so a buffer sized for the compressed data is too small to hold the decompressed result.

Edit: I modified the functions to automatically adjust the size of the buffer if it is too small:

    ; Decompresses data, you need to know how large the decompressed data will be.
    Func _Zlib_Uncompress($CompressedPtr, ByRef $CompressedSize, $UncompressedPtr, $UncompressedSize) ; modified by ProgAndy
        $call = DllCall($Zlib_Dll, "int:cdecl", "uncompress", "ptr", $UncompressedPtr, "long*", $UncompressedSize, "ptr", $CompressedPtr, "long", $CompressedSize)
        If @error Then Return SetError(1,0,-7)
        ; On return, the ByRef parameter carries the uncompressed length reported by zlib
        $CompressedSize = $call[2]
        Return $call[0]
    EndFunc ;==>_Zlib_Uncompress

    Func _ZLib_UncompressBinary($bBinary, $iLength = 0) ; ProgAndy
        Local $i=1, $tBuf, $iSize, $iRes
        Local $tBin = DllStructCreate("byte[" & BinaryLen($bBinary) & "]")
        DllStructSetData($tBin, 1, $bBinary)
        If $iLength < 1 Then $iLength = DllStructGetSize($tBin) * 2
        $bBinary = 0
        ; Retry with a larger output buffer for as long as zlib reports -5 (Z_BUF_ERROR)
        Do
            $tBuf = DllStructCreate("byte[" & $iLength * $i & "]")
            $iSize = DllStructGetSize($tBin)
            $iRes = _Zlib_Uncompress(DllStructGetPtr($tBin), $iSize, DllStructGetPtr($tBuf), DllStructGetSize($tBuf))
            $i += 1
        Until $iRes <> -5
        If $iRes <> 0 Then Return SetError($iRes, 0, "")
        $tBin = 0
        Return DllStructGetData(DllStructCreate("byte[" & $iSize & "]", DllStructGetPtr($tBuf)), 1)
    EndFunc

Edited July 26, 2011 by ProgAndy

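The retry-on-Z_BUF_ERROR idea can be illustrated outside AutoIt as well. The sketch below is Python, not AutoIt, and is only an analogue: Python's `zlib` module grows its own buffers, so the output cap passed to `decompress()` stands in for the fixed-size buffer handed to `uncompress()`. The function name `inflate_with_growing_buffer` is hypothetical, and it doubles the size on each retry, as suggested earlier in the thread.

```python
import zlib


def inflate_with_growing_buffer(data: bytes, first_guess: int) -> bytes:
    """Retry decompression with a doubled output limit until the stream fits.

    Hypothetical helper: 'size' plays the role of the output buffer size, and
    'stream did not finish within size bytes' plays the role of Z_BUF_ERROR.
    """
    size = max(first_guess, 1)
    while True:
        d = zlib.decompressobj()
        out = d.decompress(data, size)  # returns at most 'size' bytes
        if d.eof:
            return out  # the whole stream fit into the current limit
        if len(out) < size:
            # Input ran out before the stream ended: more room will not help.
            raise ValueError("truncated zlib stream")
        size *= 2  # too small, double and start over


if __name__ == "__main__":
    payload = b"A" * 10000
    comp = zlib.compress(payload)
    # Start from twice the compressed size, as proposed in the thread.
    result = inflate_with_growing_buffer(comp, len(comp) * 2)
    print(len(result))
```

Corrupt input still raises `zlib.error` (the counterpart of Z_DATA_ERROR), so the loop only repeats for the "buffer too small" case.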
DeltaRocked Posted July 26, 2011 (edited)

Quote: "$binlen must be the size of the uncompressed data. You use the size of the compressed data. That causes problems, since compression reduces the size and as a result, your buffer is too small."

Hi ProgAndy,

It's done. I will be posting the complete code for analyzing PDFs very soon. Thanks for your patience. Thanks a million.

Regards,
DeltaRocked

[UPDATE] Ran into a small problem with /BBOX. Anyway, it is not a concern, as nothing can be hidden inside the TextInput Box structure construct. This took me by surprise; I was wondering why I was getting 0 as the return value.

[UPDATE] Tested both ZLIB UDFs, A: (monoceres, edited by ProgAndy) and B: Ward. Same results. Wondering where I am going wrong. I will be posting about the Python result shortly.

Edited July 27, 2011 by deltarocked

ReFran Posted July 27, 2011

Mmmh, I really wonder whether Zlib can be used to decompress and/or decrypt a PDF. Is that really the right tool for that?

However, to compress/encrypt and decompress/decrypt a PDF you can use PDFTK.exe, a command-line tool for handling PDFs.

Best regards,
Reinhard

DeltaRocked Posted July 29, 2011 (edited)

Quote: "Mmmh, I really wonder whether Zlib can be used to decompress and/or decrypt a PDF. Is that really the right tool for that? However, to compress/encrypt and decompress/decrypt a PDF you can use PDFTK.exe, a command-line tool for handling PDFs. Best regards, Reinhard"

Hi,

Yes, it is used. Very soon I will be uploading the code for analysing the PDF. Once this is over I will go ahead with the ASCII85 decode routine. PDFTK is good, but it requires manual intervention, and there are loads of Python scripts available, but this is AutoIt and I need an analyzer.

Regards,
DeltaRocked

Edited July 29, 2011 by deltarocked

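On the question of whether zlib is the right tool: FlateDecode data is a zlib (deflate) stream, so a quick sanity check on the first two bytes can help tell "wrong bytes extracted" apart from "buffer too small". The sketch below is Python, not AutoIt, and the function name `looks_like_zlib` is hypothetical; it only tests the zlib header rules (compression method 8, window size field at most 7, header checksum divisible by 31). It says nothing about encryption: in an encrypted PDF the stream bytes have to be decrypted before they will pass this check.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Check the two-byte zlib header without decompressing anything."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    method_ok = (cmf & 0x0F) == 8          # CM = 8 means deflate
    window_ok = (cmf >> 4) <= 7            # CINFO above 7 is not allowed
    check_ok = ((cmf << 8) | flg) % 31 == 0  # FCHECK makes the header a multiple of 31
    return method_ok and window_ok and check_ok


if __name__ == "__main__":
    print(looks_like_zlib(zlib.compress(b"some stream content")))
    print(looks_like_zlib(b"\r\nleftover end-of-line before the data"))
```

A stream that fails this test was most likely cut at the wrong offset, altered on the way (for example by whitespace stripping), or uses a different filter.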
Avee Posted November 27, 2013 (edited)

Hi,

I am trying to use this code to get some text out of a PDF. Unfortunately, my script never executes

    $end_array = StringRegExp($sData, $end_pt, 3)

correctly. It stops evaluating the $sData input to StringRegExp as soon as it hits a 00h value. As soon as I point it to a spot further on in $sData, like this:

    $end_array = StringRegExp( StringMid($sData,3936,1000), $end_pt, 3)

it finds the $end_pt expression. So the expression is in the $sData string, but StringRegExp refuses to evaluate the whole string.

How can I fix this? Surely the OP must have hit some 00h values in his PDFs as well?

Edited November 27, 2013 by Avee

Avee Posted December 6, 2013 (edited)

It is still a mystery to me how the OP got the regex to find the end of a stream. I completely rewrote the code.

I have a slightly different requirement: I want to get text data out of a stream. I tried to cut down on the regex functions. Since the regex self-destructs on hitting a null byte in most streams, and I move the pointer after each found stream, I think the function is not that expensive. On my four-year-old laptop it chews through 8 megabytes in 5-6 seconds.

This code doesn't drill down into the tagging. It just searches for a typical start of a stream with a length declaration, and then trusts that length declaration to find the end of the stream. The regex then repeats on the rest of the data past the found stream.

This is probably a dirty way to do it, but it seems to work for me with a variety of PDFs.

    ; Find Streams within a pdf and return contents assuming text string
    #include <Array.au3>
    #include "zlib_udf.au3"

    Dim $stream_Array[1]
    Dim $streamtext
    Dim $x

    _Zlib_Startup("zlib1.dll")
    $stream_Array = GetPdfStreamContent(FileOpenDialog("select pdf File", @MyDocumentsDir, "Adobe PDF Files (*.pdf)")) ; calls the function that extracts the streams
    For $x = 1 To UBound($stream_Array) - 1
        $Streaminput = BinaryToString(_Zlib_UncompressBinary($stream_Array[$x], 0))
        If StringIsASCII($Streaminput) Then $streamtext &= $Streaminput
    Next
    FileWrite("result.txt", $streamtext)

    Func GetPdfStreamContent($fullfilename) ;Finds streams within a PDF. Returns an Array with the streams starting at $strm_Array[1]
        Local $start_obj = '(?i)(?s)\d* \d* obj[\n|\r]<<.*/Length (\d+).*>>\n*stream\r*\n' ;this is regex, (\d+) will return length, @extended the position of the stream
        Local $offset = 0
        Local $strm_Array[1]
        $sData = FileRead($fullfilename)
        While 1
            ;Find the stream via regex. We get the length declaration since the regex will die at null bytes, making it impossible to find endstream tags
            $result_array = StringRegExp(StringTrimLeft($sData, $offset), $start_obj, 1)
            If @error = 0 Then ;A stream with length declaration was found
                $offset += @extended
                ;store stream in array
                _ArrayAdd($strm_Array, StringMid($sData, $offset, $result_array[0]))
                ;sometimes the stream length is wrong due to missing @cr at start, this will leave a @LF at the end. We delete it here
                $strm_Array[UBound($strm_Array) - 1] = StringStripWS($strm_Array[UBound($strm_Array) - 1], 2)
                ;Advance offset past the end of the stream, so that the regex won't run into a null byte prior to the next stream
                $offset += StringLen($strm_Array[UBound($strm_Array) - 1])
            Else
                ExitLoop
            EndIf
        WEnd
        Return $strm_Array
    EndFunc ;==>GetPdfStreamContent

I guess one could also use it to extract image content, depending on how you handle the returned data.

Edited December 7, 2013 by Avee

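The "trust the /Length declaration" approach from this post can be sketched compactly. This is a Python illustration, not AutoIt, and not a port of the script above: the names `STREAM_HEADER` and `iter_flate_streams` are hypothetical, the pattern only handles a direct numeric /Length (an indirect reference such as `/Length 5 0 R` is deliberately skipped), and streams that are not plain FlateDecode are silently ignored. Working on bytes means null bytes inside streams do not interfere with the search.

```python
import re
import zlib

# Direct numeric /Length, the rest of the dictionary, then the 'stream' keyword and its EOL.
STREAM_HEADER = re.compile(
    rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)[^>]*>>\s*stream\r?\n", re.S
)


def iter_flate_streams(pdf: bytes):
    """Yield the inflated content of every stream that decompresses cleanly."""
    pos = 0
    while True:
        match = STREAM_HEADER.search(pdf, pos)
        if match is None:
            return
        length = int(match.group(1))
        raw = pdf[match.end():match.end() + length]  # slice by the declared length
        pos = match.end() + length                   # continue after this stream
        try:
            yield zlib.decompress(raw)
        except zlib.error:
            continue  # wrong length or a different filter: skip it


def make_obj(number: int, payload: bytes) -> bytes:
    """Build a tiny synthetic PDF object for the demonstration below."""
    comp = zlib.compress(payload)
    header = b"%d 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % (number, len(comp))
    return header + comp + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    sample = make_obj(1, b"BT (Hello) Tj ET") + make_obj(2, b"\x00\x01binary\x00data")
    for content in iter_flate_streams(sample):
        print(content)
```

Slicing by the declared length avoids searching for `endstream` inside binary data, at the cost of depending on the producer having written a correct /Length.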