Analyzing Malicious Documents
PDF files can possess powerful capabilities that adversaries misuse to infect systems
The structure and contents of a PDF file are defined using objects, which issue directives using ASCII-based keywords
Some risky keywords include
Execute embedded JavaScript --> /JS /JavaScript /AcroForm /XFA
Launch external or embedded programs --> /Launch /EmbeddedFiles
Take action automatically when the PDF file is opened --> /AA /OpenAction
Interact with websites --> /URI /SubmitForm
A PDF file is a collection of elements
header --> %PDF-1.6
object --> object delimited with:
X Y obj
endobj
...
xref --> Table with offsets of objects in the file
trailer --> Lists the number of objects and the offset of xref
PDF objects can reference each other and specify actions
Indirect object 1 0 references 43 0
1 0 obj
Type: /Page
<<
/AA /O 43 0 R
>>
endobj
Streams can encode various data
44 0 obj
<<
/Filter
[/FlateDecode]
/Length 463
>>
stream
encoded contents
endstream
endobj
Always start by opening the sample in VS Code
unzip steel1.zip
code steel1.pdf
Use pdfid.py for an initial perspective to check for risky keywords
pdfid.py scans for suspicious keywords without formally parsing the PDF file
It's useful for an initial review to inform the next steps
The /URI keyword indicates clickable URLs, which can be used in PDFs as phishing bait
We use "keyword" in a generic sense, though PDF specs use other terms
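The keyword triage described above can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how pdfid.py is implemented: it counts whole-name matches in the raw bytes and, as an assumption made for brevity, ignores names obfuscated with #xx hex escapes. The function name count_risky_keywords and the sample bytes are made up for illustration.

```python
import re

# Keywords taken from the list of risky keywords in these notes.
RISKY_KEYWORDS = [
    b"/JS", b"/JavaScript", b"/AcroForm", b"/XFA",
    b"/Launch", b"/EmbeddedFiles", b"/AA", b"/OpenAction",
    b"/URI", b"/SubmitForm",
]


def count_risky_keywords(data: bytes) -> dict:
    """Count each keyword as a whole PDF name in the raw bytes (no parsing)."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # Reject matches that continue with a letter or digit, so that
        # /AA is not counted inside a longer name such as /AAbc.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    # Hypothetical miniature PDF body, for illustration only.
    sample = (
        b"1 0 obj\n<< /OpenAction 2 0 R /AA << /O 3 0 R >> >>\nendobj\n"
        b"2 0 obj\n<< /S /URI /URI (http://example.test/) >>\nendobj\n"
    )
    for name, hits in count_risky_keywords(sample).items():
        print(f"{name:15} {hits}")
```

A non-zero count is only a hint about where to look next; the object still has to be examined with a real parser.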
pdfid.py steel1.pdf
Use pdf-parser.py for a more detailed look at the PDF file
The -a parameter to pdf-parser.py shows statistics
Because pdf-parser.py properly parses PDF syntax, its output is more accurate than that of pdfid.py
pdf-parser.py steel1.pdf -a
The -k parameter shows just the values for the given key
pdf-parser.py steel1.pdf -k /URI
Images in PDF Documents
The attacker tries to persuade the victim to click on the picture
To locate images in the PDF file, look for objects of type /XObject
Examine an Object
Use the -o parameter to pdf-parser.py to examine object 6, which contains /XObject
pdf-parser.py steel1.pdf -o 6
obj 6 0
Type: /XObject
Referencing 7 0 R
Contains Stream <-- Object includes encoded data
<<
/Type /XObject
/Subtype /Image
/Width 625 <-- Image size is 625 x 155 pixels
/Height 155
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Length 7 0 R
/Filter /DCTDecode <-- This decoding is used for JPEG images
>>
Extract and view the image object
pdf-parser.py steel1.pdf -o 6 -d object6.jpg
Follow the trail of references that leads to object 6 to see if the trail starts with a link
The -r parameter finds references to the specified object
Object 6, which was of type /XObject, is referenced by object 13
obj 13 0
Type:
Referencing: 4 0 R, 3 0 R, 8 0 R, 9 0 R, 6 0 R
<<
/ColorSpace
<<
/PCSp 4 0 R
/CSp /DeviceRGB
/CSpg /DeviceGray
>>
/ExtGState
Note: /Annots offers a way to associate a link with an object
Continue to follow the trail of references
If you see /Annots 14 0 R --> Look at object 14 now
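Following a trail of references amounts to asking "which objects mention N 0 R?". The sketch below illustrates that idea with a regular expression over raw bytes. It is not how pdf-parser.py works, and it assumes uncompressed objects (nothing hidden inside object streams). The function name find_referrers and the sample objects are hypothetical.

```python
import re

# Matches "<num> <gen> obj ... endobj" and captures the number and the body.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_referrers(data: bytes, target: int, generation: int = 0) -> list:
    """Return numbers of objects whose body contains '<target> <generation> R'."""
    # The lookbehind stops "16 0 R" from matching a search for object 6.
    ref_re = re.compile(rb"(?<!\d)%d\s+%d\s+R\b" % (target, generation))
    referrers = []
    for match in OBJ_RE.finditer(data):
        obj_num, body = int(match.group(1)), match.group(3)
        if ref_re.search(body):
            referrers.append(obj_num)
    return referrers


if __name__ == "__main__":
    # Hypothetical objects, loosely modeled on the listing in these notes.
    sample = (
        b"6 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
        b"13 0 obj\n<< /XObject << /Im6 6 0 R >> >>\nendobj\n"
        b"14 0 obj\n<< /Subtype /Link /A 16 0 R >>\nendobj\n"
    )
    print(find_referrers(sample, 6))
```

Repeating the call with each returned object number walks the trail one step at a time.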
Dealing with Malicious Websites / Retrieving malicious 2nd stages
One-by-one requests using wget or curl
Recommend spoofing HTTP headers to make these requests look more like a normal web browser, especially the User-Agent strings for wget and curl
Can also tweak the config files of wget and curl
~/.wgetrc, ~/.curlrc
Specialized tools such as Pinpoint or Scout
Honeyclient software such as Thug
Real browser on a purposefully vulnerable Windows system, enabling the website to infect the lab machine
Activate behavioral monitoring tools to observe the infection
Capture network traffic
If using a sniffer such as Fiddler, configure it to save SSL keys
Visit the website from several different IPs to see if its behavior changes
View PDF Object Streams
pdf-parser.py steel2.pdf -O -a
If you see /ObjStm in the output of the pdf-parser.py steel2.pdf -a command, then you need to view the object streams
pdf-parser.py does not examine object streams by default
Find all objects that refer to object 10
pdf-parser.py steel2.pdf -O -r 10
Additional Considerations with PDFs
Look for risky objects, examine them, and follow the trail of referenced or otherwise related objects
If you see a suspicious object with a stream, you can dump that stream to a file using the parameters -f -w -d
Malicious PDFs can include JavaScript --> look for /JS /JavaScript /AcroForm /XFA
PDF files could be password protected
The structure will be visible but you'll need to decrypt streams to examine them
You'll need to determine the password, then decrypt with tools such as qpdf and pdftk
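To show what dumping a filtered stream involves, here is a minimal sketch that inflates a /FlateDecode stream by hand with zlib. It assumes a single filter, an unencrypted stream, and a /Length given as a direct integer (an indirect /Length such as 7 0 R is rejected). The helper name inflate_stream and the sample object are invented for the example.

```python
import re
import zlib


def inflate_stream(obj: bytes) -> bytes:
    """Decompress the FlateDecode stream of one raw PDF object."""
    # Accept only a direct integer length, not a reference like "7 0 R".
    length_match = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", obj)
    start_match = re.search(rb">>\s*stream\r?\n", obj)
    if length_match is None or start_match is None:
        raise ValueError("no stream with a direct /Length found")
    length = int(length_match.group(1))
    start = start_match.end()
    return zlib.decompress(obj[start:start + length])


if __name__ == "__main__":
    # Build a hypothetical object so the example is self-contained.
    payload = b"app.alert('hello from a stream');"
    compressed = zlib.compress(payload)
    obj = (
        b"44 0 obj\n<< /Filter /FlateDecode /Length %d >>\nstream\n" % len(compressed)
        + compressed
        + b"\nendstream\nendobj\n"
    )
    print(inflate_stream(obj).decode())
```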
VBA Macros in Microsoft Office Documents
Note: Even if the document's VBA project is password protected, the macros are not stored in an encrypted way
Office documents can follow two different formats
The "legacy" binary format is OLE2 (a.k.a. structured storage, etc.)
OLE2 mimics capabilities of a file system using the concepts of storages (like folders) and streams (like files)
The more modern XML-based format, OOXML, incorporates multiple files that include the document's contents in a ZIP file
Both formats can carry macros
Macros in an OOXML file are inside a binary OLE2 file, which is inside the ZIP archive
Normally VBA macro code is embedded inside streams as compiled code (p-code) and compressed source code
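The two container formats can usually be told apart from the first few bytes of the file, regardless of the extension. The sketch below checks for the OLE2 signature and the ZIP local-file signature. The function name sniff_office_container is hypothetical, and a ZIP signature alone only suggests OOXML; it does not prove it.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 / structured storage
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_office_container(header: bytes) -> str:
    """Classify a file by its leading bytes."""
    if header.startswith(OLE2_MAGIC):
        return "OLE2"
    if header.startswith(ZIP_MAGIC):
        return "ZIP (possibly OOXML)"
    return "unknown"


if __name__ == "__main__":
    print(sniff_office_container(OLE2_MAGIC + b"\x00" * 8))
    print(sniff_office_container(b"PK\x03\x04\x14\x00"))
```

With a real sample, pass in the first eight bytes read from the file opened in binary mode.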
Initial Triage
file particulars.doc
trid particulars.doc
trid reporting "Open XML Format" --> means it's an OOXML file
Examine the files that comprise the OOXML document using unzip or zipdump.py
zipdump.py particulars.doc
unzip particulars.doc -d particulars-files
Can extract individual files as well with zipdump.py
-s --> specify the file
-d --> extract or dump it
zipdump.py particulars.doc -s 5 -d > image1.jpeg
Use the feh image viewer to view the image
feh image1.jpeg &
olevba to extract VBA Macros
olevba particulars.doc > particulars.olevba #extract
code particulars.olevba #view
The olevba utility can locate, decode, and extract VBA macros from Office files. The tool also shows a summary of the risky keywords it located in the macro
Any line that starts with ' is a comment in VBA
When Office sees AutoOpen, it automatically executes that function as soon as the function is allowed to run
Example:
Sub AutoOpen()
g
End Sub
-----------------------------------------
Sub g()
' useless comment
' another useless comment for obfuscation
y
' blah
' blah blah
B
End Sub
Can see that AutoOpen() calls Sub g(), which then calls function y and function B, which are defined later
For deeper visibility into VBA macros and related artifacts, examine streams
Use oledump.py
oledump.py particulars.doc -i
M means there is a macro present
2823+809 --> the first number is the size of the compiled code, the second number is the size of the compressed source code
Example:
A3: M 3632 2823+809 'VBA/Pj
Use the -s a parameter to oledump.py to extract VBA macros from all streams in particulars.doc
oledump.py particulars.doc -s a -v | more
Pass the oledump.py output through grep to eliminate the comments
oledump.py particulars.doc -s a -v | grep -v "^'" | more
Sometimes minor aspects of the document can offer additional context for your investigation
They can sometimes reveal artifacts used in its previous version
Use oledump.py to extract them
Macros via LOLBin
Be on the lookout for obfuscated strings that are backwards
Public Const O As String =
" 23rvsger"
...
Function U5(qe)
Dim bT As New WshShell
bT.exec StrReverse(O) & " " & DU(1)
End Function
When this is executed it will use the LOLBin regsvr32
Be aware of the LOLBin mshta as well
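The reversed constant is easy to undo outside of Office. This small Python sketch mirrors what VBA's StrReverse does to the string from the snippet above; the helper name str_reverse is made up, and the value returned by DU(1) is not shown in these notes, so it is left out.

```python
def str_reverse(value: str) -> str:
    """Python equivalent of VBA StrReverse for plain strings."""
    return value[::-1]


if __name__ == "__main__":
    O = " 23rvsger"  # constant copied from the macro excerpt
    print(repr(str_reverse(O)))  # 'regsvr32 '
```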
Viewing MetaData
exiftool filename.doc
XML source code files sometimes include details such as:
Hidden comments such as URLs from which images were pasted
The language code of the system where the document was created
Analyzing OOXML
You can unzip its contents and examine individual XML files
Start with zipdump.py with no command line arguments
zipdump.py particulars.doc
Once you have identified the index of the file you'd like to examine, you can call zipdump.py again, specifying the desired file's index using -s
The -d parameter will direct the tool to dump the file to STDOUT
Can then pipe to xmldump.py with the parameter pretty to reformat the file
zipdump.py particulars.doc -s 9 -d | xmldump.py pretty | more
Vipermonkey can emulate VBA macros
vmonkey particulars.doc > particulars.vmonkey
code particulars.vmonkey
Tool will auto decode the VBA macros
Numbers to Strings
After performing analysis you notice a macro in A3
When extracting it with oledump.py
oledump.py mydoc.docm -s A3 -v | more
You see a lot of these lines
exec = exec & ChrW(112) & ChrW(111)...
You can use numbers-to-string.py
oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | more
Make sure to add new lines and examine the output
oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | sed "s/;/;\n/g" > mydoc.oledump
Password protected VBA Macros
Can see the VBA macro using oledump.py even though MSFT Office refuses to show you the code due to the password being set
oledump.py invoice.doc -i
oledump.py invoice.doc -s 7 -v | more
Remove the distracting junk code, then examine the macro
oledump.py invoice.doc -s 7 -v | grep -v "^GoTo" | grep -v ":$" > invoice.oledump
xor-kpa.py
The tool xor-kpa.py is designed to derive an XOR key from the supplied plaintext and ciphertext
It can also XOR a string with its multi-byte key, which mimics the algorithm employed by our malicious macro
-x tells the tool to XOR the data with the key
Start each param with #h# to designate it as a hex-encoded string and enclose it in ''
xor-kpa.py -x '#h#89789FD89AF897AKJHF43HK23' '#h#66546F'
Auto deobfuscation with oledump.py
The oledump.py plugin plugin_http_heuristics will automatically decode embedded URLs if they are encoded using a common obfuscation method
oledump.py invoice.doc -p plugin_http_heuristics
Sometimes a faster approach to deobfuscate macros involves the VBA debugger built into MSFT Office
evilclippy -uu invoice.doc
Will remove the macro password with the -uu flag
Then open MSFT Word and click View tab --> Macros --> View Macros --> Edit
Bring up the Locals window so you can see the variables
Add the following at the beginning of the macro (e.g. at the start of the AutoOpen function) so the macro starts the debugger
Sub AutoOpen() <-- Line already there
Debug.Assert False <-- Line you add
GoTo jlskdffjieoajioehjfueahfekjanufiw <-- Start of obfuscated mess
Save the macro so the line you added doesn't get lost
Switch to the MSFT Word main view and enable macros
Once you enable the macros, it will run and pause in the AutoOpen function on the line you added
Set the breakpoint on the line that interests you
Then click Run > Continue
Once it hits your breakpoint, examine the Locals window; it will show the current variables in the bottom window, and you should see what you are looking for
VBA Stomping
When a macro is added to an Office document, MSFT Office compiles it into a bytecode form known as p-code
This is the code that is actually executed when the macro is run (most of the time: https://github.com/bontchev/pcodedmp)
Malware authors could modify or fully delete the source code version of the macro while keeping the p-code version intact
Our analysis tools focus on the source code of the macro and won't recognize the true nature of the file
Extract the file as always
olevba order.docm
Now extract the file structure info
oledump.py order.docm -i
Will see a ! which will indicate an unusual start of source code
Another sign of VBA stomping is the size of the compressed source code being 0
oledump.py can extract the p-code but it cannot decode it
oledump.py order.docm -s A3 -v <-- Will get an error "Cannot decompress"
oledump.py order.docm -s A3s -A <-- -A will show the contents the way a hex editor might show them
oledump.py order.docm -s A3c | more <-- the c suffix will show the compiled code (what c stands for)
Use pcodedmp.py to disassemble VBA p-code
pcodedmp order.docm > order.pcodedmp
code order.pcodedmp
Use pcode2code to decompile VBA p-code
pcode2code order.docm | more
Note:
MSFT Office automatically decompiles the p-code, generating the VBA source code, however:
Macros without the source code will only run in the specific version of Office for which the p-code was created
If you want to debug the macros, you can decompile the p-code using pcode2code, then embed the macro in a document
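The "compressed source size is 0" heuristic can be checked mechanically once the -i listing has been saved. The sketch below assumes the size column has the compiled+source form shown earlier in these notes (for example 2823+809). The function name and the second sample line are made up, and a hit is only a hint that still needs manual confirmation.

```python
import re

# Indicator (M, m or !), total size, then "<compiled>+<source>".
SIZE_RE = re.compile(r"(?:^|\s)[Mm!]\s+\d+\s+(\d+)\+(\d+)")


def source_looks_stomped(listing_line: str):
    """Return True/False for a macro stream line, or None if no sizes are found."""
    match = SIZE_RE.search(listing_line)
    if match is None:
        return None
    compiled, source = int(match.group(1)), int(match.group(2))
    # Compiled code present but no compressed source left behind.
    return compiled > 0 and source == 0


if __name__ == "__main__":
    print(source_looks_stomped("A3: M 3632 2823+809 'VBA/Pj"))   # line from the notes
    print(source_looks_stomped("A3: ! 2823 2823+0 'VBA/Example'"))  # invented line
```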
Base64 PowerShell
If you identify some base64 encoded PowerShell, be sure to use base64dump.py to convert it
oledump.py checkbox.doc -s 7 -d | base64dump.py -s 1 -t utf16 > checkbox.ps1
more checkbox.ps1
However, when you view the dump, you can see that it also contains gzip encoded data
Extract the gzip data
base64dump.py checkbox.ps1 -s 3 -d | gunzip - > checkbox2.ps1
code checkbox2.ps1
Shellcode
Shellcode is machine code that the CPU can understand
It is represented as a series of bytes stored in a memory region
base64dump.py checkbox2.ps1 -n 10
The -n parameter directs base64dump.py to only consider strings that, when decoded, are at least 10 bytes long
You should now see the long shellcode string and see that it is the second stream; use -s 2 to extract that stream
base64dump.py checkbox2.ps1 -n 10 -s 2 -d | translate.py "byte ^ 35" > checkbox.bin
Use scdbgc to emulate the execution of shellcode to understand its capabilities
scdbgc /f checkbox.bin /s -1
Can now use yara-rules to identify known malware patterns in the file
yara-rules checkbox.bin
1768.py checkbox.bin
The yara-rules command will scan the file to see if it hits on any rules
1768.py is designed for parsing Cobalt Strike artifacts and is installed on REMnux
In CS files, the License ID is stored as a 32-bit integer in the last 4 bytes of the shellcode
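The gzip and XOR layers peeled off above with gunzip and translate.py can be reproduced with the Python standard library. This is a sketch on synthetic data, not the sample from these notes: the three "shellcode" bytes are placeholders, the function names are invented, and the only value taken from the notes is the XOR key 35.

```python
import base64
import gzip

XOR_KEY = 35  # same value as in translate.py "byte ^ 35"


def decode_gzip_layer(b64_text: str) -> bytes:
    """base64-decode, then gunzip (the base64dump.py -d | gunzip step)."""
    return gzip.decompress(base64.b64decode(b64_text))


def decode_xor_layer(b64_text: str, key: int = XOR_KEY) -> bytes:
    """base64-decode, then XOR every byte with a single-byte key."""
    return bytes(b ^ key for b in base64.b64decode(b64_text))


if __name__ == "__main__":
    # Build a fake two-layer blob so the example runs without a sample.
    fake_shellcode = bytes([0x90, 0x90, 0xCC])
    inner = base64.b64encode(bytes(b ^ XOR_KEY for b in fake_shellcode)).decode()
    outer = base64.b64encode(gzip.compress(inner.encode())).decode()

    stage2 = decode_gzip_layer(outer).decode()
    print(decode_xor_layer(stage2).hex())  # 9090cc
```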
Examining Malicious RTF Documents
RTF documents are supported by MSFT Word and many non-MSFT applications
RTF does not support macros but it allows attackers to embed other dangerous files as OLE objects and other binary contents
Users can be persuaded to open and execute the embedded file
RTF files can also directly target a vulnerability using an exploit to execute the embedded shellcode payload
When examining RTF documents, focus on the objects or other embedded artifacts
RTF format
Usually formatted as ASCII plaintext and includes control words and groups
Control words start with \ and specify how the RTF rendering application should format and display the characters
A group encloses other elements in {} delimiters and specifies the text affected by the group and its formatting
Groups can be nested
Objects and other binary content are embedded as serialized strings that represent hex values
You will see the \objdata control word followed by a string encoded in hex
Use rtfdump.py and | more to get an overview of the RTF file's groups and to spot embedded objects
-O will allow you to examine the objects
rtfdump.py new-order.doc -O
The -s parameter specifies the index of the object
-d tells the tool to dump the object in its raw form
rtfdump.py new-order.doc -O -s 1 -d > new-order.object
Use oledump.py to examine the extracted object
oledump.py new-order.object -i
If you now want to examine a specific stream, use the -A parameter
oledump.py new-order.object -s 4 -A
When analyzing malicious documents that might have exploits, look for shellcode to understand the payload of the attack
Use the -S parameter to examine the strings
oledump.py new-order.object -s 4 -S
For parsing Equation Editor 3.0 data, format-bytes.py has the option -f name=eqn1
oledump.py new-order.object -s 4 -d | format-bytes.py -f name=eqn1
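The idea of objects being serialized as hex after the \objdata control word can be illustrated with a short Python sketch. It is far simpler than rtfdump.py: it assumes the hex sits directly after \objdata in the same group, ignores nested control words, and drops whitespace between digits. The function name extract_objdata and the sample RTF are hypothetical.

```python
import re

# Capture everything after \objdata up to the next brace or backslash.
OBJDATA_RE = re.compile(rb"\\objdata(?![a-z])([^{}\\]*)")


def extract_objdata(rtf: bytes) -> list:
    """Return the decoded bytes of each \\objdata hex blob found."""
    blobs = []
    for match in OBJDATA_RE.finditer(rtf):
        # Keep only hex digits; RTF allows whitespace between them.
        hex_text = re.sub(rb"[^0-9A-Fa-f]", b"", match.group(1))
        if len(hex_text) % 2:
            hex_text = hex_text[:-1]  # drop a dangling half byte
        blobs.append(bytes.fromhex(hex_text.decode("ascii")))
    return blobs


if __name__ == "__main__":
    sample = b"{\\rtf1{\\object\\objemb{\\*\\objdata 0105 0000\r\n02000000}}}"
    for blob in extract_objdata(sample):
        print(blob.hex())
```

Malicious RTF files often abuse parser quirks that this sketch does not handle, so treat its output as a first look only.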
Shellcode searching in Binary files
When looking for shellcode, look out for a long run of 0x90 bytes, also known as a NOP sled
Use xorsearch to spot shellcode patterns in binary files
xorsearch -W -d 3 qa.bin
EIP points to the current instruction but assembly code cannot read it directly, so malware authors do it indirectly
A CALL followed by a POP allows code to get its EIP contents
CALL 00401024
POP EAX
Shellcode developers attempt to evade detection by using other instructions to perform GetEIP
00401027 JMP SHORT 0040102C #Happens first and moves down to the CALL
00401029 POP ESI
0040102A JMP SHORT 00401031
0040102C CALL 00401029 #Call is made and it moves back up to the POP
00401031 ADD ESI, 9
This code succeeds at making the CALL and then POP in an indirect manner
Shellcode Requirements
Shellcode needs to do some work before it can make API calls
To load DLLs and resolve API function names, shellcode often seeks kernel32.dll for LoadLibrary and GetProcAddress
Shellcode looks for the Process Environment Block (PEB) to locate kernel32.dll in memory of the exploited process
For every process, the Windows OS creates a structure called the PEB
This data structure contains information about the process, including the list of modules (DLLs) that have been loaded or mapped into the process's memory
The FS register contains the address of the data structure called the Thread Information Block (TIB), which contains information about the currently running thread
A pointer to the PEB resides within the TIB at offset 0x30 with respect to the beginning of the TIB
Therefore, in a 32-bit process, a pointer to the PEB is always located at FS:[0x30]
This syntax directs the processor to look for the address stored 0x30 bytes away from the beginning of the TIB structure
Two methods to retrieve the PEB
MOV EAX, DWORD PTR FS:[30h]
PUSH 30h
POP EBX
MOV EAX, FS:[EBX]
scdbgc
Use scdbgc to analyze shellcode by emulating its execution
The /foff parameter specifies the hex offset within the file where the shellcode starts
This can be determined by xorsearch
Press CTRL+C three times if scdbgc gets stuck
scdbgc /f qa.bin /s -1 /foff 3B
The /s -1 parameter indicates to continue the emulation without restricting the max number of instructions
Direct scdbgc to open a handle to the malicious file so the shellcode can find the overlay where it likely stores additional contents
Hit CTRL+C three times after it starts to avoid too many repeating instructions from filling your screen
Can hide the numerous READ/WRITE events with /norw
scdbgc /f qa.bin /s -1 /foff 3B /fopen qa.doc /norw
If you see shellcode attempting to drop another file such as an exe, you can allow the shellcode to execute in order to capture the file
Use runsc
runsc32 runsc64
Can use it also on REMnux due to wine being installed
To execute shellcode:
runsc32 -f qa.bin -o 0x3B -d qa.doc -n
find ~/.wine -name WINWORD.EXE -exec cp "{}" . \;
XLM Macros
Microsoft Excel 4 (XLM) macros are legacy technology that can offer attackers an alternative to VBA macros
Were built in 1992 before the introduction of VBA in 1993
Are being retired by MSFT but work in recent versions of Excel
Are defined as formulas in cells of sheets
Sheets are often hidden
The formulas are often in white text on white background
To see where the XLM macro execution starts, use zipdump.py with the -s parameter to examine xl/workbook.xml
zipdump.py koti.xlsm -s "xl/workbook.xml" -d | xmldump.py pretty
To see where execution starts look for:
<definedNames>
<definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>
</definedNames>
Execution starts in cell A154 in sheet Lodet
Look above at the <sheet name=> element to figure out which rId number is assigned to our sheet Lodet and whether it is hidden or not
To see which XML files represent the sheets Lodet and kOTI, look at the xl/_rels/workbook.xml.rels file
zipdump.py koti.xlsm -s "xl/_rels/workbook.xml.rels" -d | xmldump.py pretty
It will show you:
<Relationship Id="rId3"...Target="worksheets/sheet2.xml"/>
Now examine the worksheets/sheet2.xml
Now extract the contents of Lodet, which is macrosheets/sheet1.xml, using zipdump.py
zipdump.py koti.xlsm
zipdump.py koti.xlsm -s 6 -d | xmldump.py pretty | more
For easier analysis, direct xmldump.py to display just the cell text
zipdump.py koti.xlsm -s 6 -d | xmldump.py celltext > koti.csv
XLM Macro obfuscation techniques include the following:
Use formulas to compute sensitive values such as strings during the runtime of the macro
Compute some values randomly during runtime i.e. the URL
Static analysis to compute possible values can be complex and time consuming
Cached value saves time but displays only one possible outcome
Instead of including a string in the formula, include a reference to a string that is stored in a shared table elsewhere in the document
The shared strings are always in xl/sharedStrings.xml
Shared strings can reveal IOCs
You can direct xmldump.py to look up the strings for you by using the -j parameter and pointing to a stream that has the macros
zipdump.py koti.xlsm -j | xmldump.py -j 6 celltext
MSFT Office is very helpful for decoding XLM macros
Use the built-in debugger to examine and deobfuscate code
Convert file format from OOXML to OLE2 and the other way
Execute the macro the way a victim would to observe effects on the system from a behavioral perspective
Use Windows AMSI functionality to observe which script commands end up executing
logman start AMSITrace -p Microsoft-Antimalware-Scan-Interface Event1 -o AMSITrace.etl -ets
Run the suspicious script or macro you wish to examine
Stop AMSI monitoring
logman stop AMSITrace -ets
Examine the AMSI data saved to the file
AMSIScriptContentRetrieval
Additional tools and considerations for XLM macro analysis
oledump.py can examine XLM macros in OLE2 files
oledump.py file.xls -p plugin_biff --pluginoptions "-x"
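For OOXML workbooks, the Auto_Open lookup described earlier (finding the definedName entry in xl/workbook.xml) can also be scripted. The sketch below parses an already-extracted workbook.xml string with the standard library. It assumes the usual (transitional) SpreadsheetML namespace, and the function name find_auto_open and the trimmed sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by typical .xlsx/.xlsm workbooks.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def find_auto_open(workbook_xml: str) -> list:
    """Return the cell references of all _xlnm.Auto_Open defined names."""
    root = ET.fromstring(workbook_xml)
    targets = []
    for node in root.findall("m:definedNames/m:definedName", NS):
        if node.get("name", "").lower().startswith("_xlnm.auto_open"):
            targets.append(node.text)
    return targets


if __name__ == "__main__":
    sample = (
        '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
        "<definedNames>"
        '<definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>'
        "</definedNames>"
        "</workbook>"
    )
    print(find_auto_open(sample))  # ['Lodet!$A$154']
```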