# Analyzing Malicious Documents

### **PDF files can possess powerful capabilities that adversaries misuse to infect systems**

* The structure and contents of a PDF file are defined using objects, which issue directives using ASCII-based keywords
* Some risky keywords include

```
Execute embedded JavaScript --> /JS /JavaScript /AcroForm /XFA
Try launching external or embedded programs --> /Launch /EmbeddedFiles
Take action automatically when the PDF file is opened --> /AA /OpenAction
Interact with websites --> /URI /SubmitForm
```

#### **A PDF file is a collection of elements**

```
header --> %PDF-1.6
object --> object delimited with: 
X Y obj 
endobj
...
xref --> Table with offsets of objects in the file
trailer --> Lists the number of objects and the offset of xref
```

#### **PDF objects can reference each other and specify actions**

* Indirect object 1 0 references 43 0

```
1 0 obj 
Type: /Page
<<
  /AA /O 43 0 R
>>
endobj
```

#### **Streams can encode various data**

```
44 0 obj 
<<
  /Filter 
    [/FlateDecode]
  /Length 463
>>
stream
    encoded contents
endstream 
endobj
```
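As a minimal sketch of what decoding a `/FlateDecode` stream involves, the Python below inflates the stream body of a single raw object with `zlib`. The function name `decode_flate_stream` and the toy object are made up for illustration, and the code assumes `/Length` is a direct integer (not an indirect reference such as `7 0 R`) and that no other filters are chained.

```python
import re
import zlib


def decode_flate_stream(obj_bytes: bytes) -> bytes:
    """Inflate the stream body of one raw PDF object (direct /Length only)."""
    length_match = re.search(rb"/Length\s+(\d+)", obj_bytes)
    start_match = re.search(rb">>\s*stream\r?\n", obj_bytes)
    if length_match is None or start_match is None:
        raise ValueError("object has no stream or no direct /Length")
    start = start_match.end()
    # Read exactly /Length bytes so trailing EOL bytes are not fed to zlib.
    body = obj_bytes[start:start + int(length_match.group(1))]
    return zlib.decompress(body)


# Toy object built in memory so the example is self-contained.
payload = b"app.alert('hello');"
compressed = zlib.compress(payload)
toy_object = (
    b"44 0 obj\n<<\n  /Filter [/FlateDecode]\n  /Length "
    + str(len(compressed)).encode()
    + b"\n>>\nstream\n" + compressed + b"\nendstream\nendobj\n"
)
print(decode_flate_stream(toy_object))
```

Real samples often chain several filters or hide the length behind a reference, which is why a parser such as `pdf-parser.py` is the safer choice in practice.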

### **Always start by opening the sample in VS Code**

```
unzip steel1.zip
code steel1.pdf
```

* **Use `pdfid.py` for an initial perspective to check for risky keywords**
* `pdfid.py` scans for suspicious keywords without formally parsing the PDF file
* It's useful for an initial review to inform the next steps
* The `/URI` keyword indicates clickable URLs, which can be used in PDFs as phishing bait
* We use "keyword" in a generic sense, though the PDF specs use other terms

```
pdfid.py steel1.pdf
```
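To illustrate the idea of scanning for keywords without parsing the file, here is a rough Python sketch that counts risky names in raw bytes. It is not how `pdfid.py` is implemented; the function name `count_risky_keywords` and the sample bytes are invented for this example, and it ignores name obfuscation such as `#xx` hex escapes.

```python
import re

RISKY_KEYWORDS = [
    b"/JS", b"/JavaScript", b"/AcroForm", b"/XFA", b"/Launch",
    b"/EmbeddedFiles", b"/AA", b"/OpenAction", b"/URI", b"/SubmitForm",
]


def count_risky_keywords(data: bytes) -> dict:
    """Count occurrences of each risky name in the raw file bytes."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # The lookahead stops /JS from matching a longer name such as /JSFoo.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, data))
    return counts


# Made-up fragment standing in for a real file.
sample = (
    b"1 0 obj\n<< /OpenAction 2 0 R /AA << /O 3 0 R >> >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /S /URI /URI (http://example.com) >>\nendobj\n"
)
for name, count in count_risky_keywords(sample).items():
    if count:
        print(name, count)
```

Because nothing is parsed, keywords inside compressed streams or object streams are invisible to this kind of scan.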

* **Use `pdf-parser.py` for a more detailed look at the PDF file**
* The `-a` parameter to `pdf-parser.py` shows statistics
* Because `pdf-parser.py` properly parses PDF syntax, its output is more accurate than that of `pdfid.py`

```
pdf-parser.py steel1.pdf -a 
```

* **The `-k` parameter shows just the values for the given key**

```
pdf-parser.py steel1.pdf -k /URI
```

### Images in PDF Documents

* The attacker tries to persuade the victim to click on the picture
* To locate images in the PDF file, look for objects of type `/XObject`

### Examine an Object

* Use the `-o` parameter to `pdf-parser.py` to examine object 6 which contains `/XObject`

```
pdf-parser.py steel1.pdf -o 6
```

```
obj 6 0
  Type: /XObject
  Referencing 7 0 R
  Contains Stream     <-- Object includes encoded data
  
  <<
    /Type /XObject
    /Subtype /Image
    /Width 625      <-- Image size is 625 x 155 pixels
    /Height 155
    /BitsPerComponent 8
    /ColorSpace /DeviceRGB
    /Length 7 0 R
    /Filter /DCTDecode      <-- This decoding is used for JPEG images
  >>
```

### Extract and view the image object

```
pdf-parser.py steel1.pdf -o 6 -d object6.jpg
```

* **Follow the trail of references that leads to object 6 to see if the trail starts with a link**
* The `-r` parameter finds references to the specified object
* Object 6, which was of type `/XObject`, is referenced by object 13

```
obj 13 0 
  Type:
  Referencing: 4 0 R, 3 0 R, 8 0 R, 9 0 R, 6 0 R
  
  <<
    /ColorSpace
      <<
        /PCSp 4 0 R
        /CSp /DeviceRGB
        /CSpg /DeviceGray
      >>
    /ExtGState
```

* Note: `/Annots` offers a way to associate a link with an object
* Continue to follow the trail of references
* If you see `/Annots 14 0 R` --> Look at object 14 now
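Following a reference trail by hand amounts to repeatedly asking "which objects mention `N G R`?". The Python sketch below shows that lookup on a made-up PDF fragment; `parse_objects` and `find_referrers` are hypothetical helper names, and the regex-based splitting is only an approximation of real PDF parsing (it ignores object streams and stream bodies that happen to contain `endobj`).

```python
import re


def parse_objects(data: bytes) -> dict:
    """Map object number -> raw body for every 'N G obj ... endobj' block."""
    objects = {}
    for m in re.finditer(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", data, re.DOTALL):
        objects[int(m.group(1))] = m.group(3)
    return objects


def find_referrers(objects: dict, target: int) -> list:
    """Return the numbers of objects whose body contains 'target G R'."""
    ref = re.compile(rb"(?<!\d)" + str(target).encode() + rb"\s+\d+\s+R\b")
    return sorted(num for num, body in objects.items() if ref.search(body))


# Made-up fragment: page 2 -> resources 13 -> image 6, and page 2 -> annotations 14.
toy_pdf = (
    b"6 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
    b"13 0 obj\n<< /XObject << /Im0 6 0 R >> >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Resources 13 0 R /Annots 14 0 R >>\nendobj\n"
    b"14 0 obj\n[ << /Subtype /Link /A << /S /URI /URI (http://example.com/) >> >> ]\nendobj\n"
)
objs = parse_objects(toy_pdf)
print(find_referrers(objs, 6))
print(find_referrers(objs, 13))
```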

### Dealing with Malicious Websites / Retrieving malicious 2nd stages

* One-by-one requests using `wget` or `curl`
* Recommend spoofing HTTP headers to make these requests look more like a normal web browser, especially the default User-Agent strings of `wget` and `curl`
* Can also set these headers in the config files of `wget` and `curl`

```
~/.wgetrc, ~/.curlrc
```

* Specialized tools such as `Pinpoint` or `Scout`
* Honeyclient software such as `Thug`
* Real browser on a purposefully vulnerable Windows system, enabling the website to infect the lab machine

```
Activate behavioral monitoring tools to observe the infection
Capture network traffic 
If using a sniffer such as Fiddler configure it to save SSL keys 
Visit the website from several different IPs to see if its behavior changes 
```
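The header spoofing mentioned above for `wget` and `curl` can be sketched in Python with the standard library. The header values, the `build_request` helper, and the `example.com` URLs are placeholders chosen for illustration; the actual fetch is left commented out because it should only ever run from an isolated lab network.

```python
import urllib.request

# Placeholder values that resemble a desktop browser; adjust to the lure's target audience.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str, referer: str = "") -> urllib.request.Request:
    """Build a request that carries browser-like headers instead of the library default."""
    request = urllib.request.Request(url, headers=BROWSER_HEADERS)
    if referer:
        request.add_header("Referer", referer)
    return request


req = build_request("http://example.com/stage2", referer="http://example.com/")
print(req.header_items())

# Only fetch from an isolated analysis network:
# with urllib.request.urlopen(req, timeout=10) as response:
#     second_stage = response.read()
```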

### View PDF Object Streams

```
pdf-parser.py steel2.pdf -O -a
```

* If the output of `pdf-parser.py steel2.pdf -a` lists `/ObjStm`, the file contains object streams and you need to look inside them with `-O`
* `pdf-parser.py` does not examine object streams by default

#### Find all objects that refer to object 10

```
pdf-parser.py steel2.pdf -O -r 10
```

**Additional Considerations with PDFs**

* Look for risky objects, examine them, follow the trail of referenced or otherwise related objects
* If you see a suspicious object with a stream you can dump that stream to a file using parameters `-f -w -d`
* Malicious PDFs can include JS --> look for `/JS /JavaScript /AcroForm /XFA`
* PDF files could be password protected
* The structure will be visible but you'll need to decrypt streams to examine them
* You'll need to determine the password, then decrypt with tools such as `qpdf` and `pdftk`

### VBA Macros in Microsoft Office Documents

* Note: Even if the document's VBA project is password protected, the macros are not stored in an encrypted way
* **Office documents can follow two different formats**
* The "legacy" binary format is OLE2 (a.k.a. structured storage, etc.)
* OLE2 mimics capabilities of a file system using the concepts of storages (like folders) and streams (like files)
* The more modern XML-based format, OOXML, stores the document's contents as multiple files inside a ZIP archive
* Both formats can carry macros
* Macros in an OOXML file are inside a binary OLE2 file, which is inside the ZIP archive
* Normally VBA macro code is embedded inside streams as compiled code (p-code) and compressed source code

### Initial Triage

```
file particulars.doc
trid particulars.doc
```
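Much of what `file` and `trid` report at this stage comes down to leading magic bytes. As a rough sketch (the function name `identify_container` is made up, and real identification tools check far more than the first few bytes), the container format can be guessed like this:

```python
OLE2_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
ZIP_MAGIC = b"PK\x03\x04"


def identify_container(data: bytes) -> str:
    """Guess the document container from its leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "OLE2"
    if data.startswith(ZIP_MAGIC):
        return "ZIP (possibly OOXML)"
    if data.lstrip().startswith(b"{\\rt"):
        return "RTF"
    if b"%PDF-" in data[:1024]:
        return "PDF"
    return "unknown"


print(identify_container(OLE2_MAGIC + b"\x00" * 8))
print(identify_container(b"PK\x03\x04rest-of-archive"))
print(identify_container(b"{\\rtf1\\ansi}"))
print(identify_container(b"%PDF-1.6\n"))
```

The file extension plays no part here, which matters because a `.doc` name can hide an OOXML or RTF file.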

### trid

* Open XML Format --> means it's an OOXML file

#### Examine the files that comprise the OOXML document using `unzip` or `zipdump.py`

```
zipdump.py particulars.doc 
unzip particulars.doc -d particulars-files
```

* Can extract individual files as well with `zipdump.py`
* `-s` --> specify the file
* `-d` --> extract or dump it

```
zipdump.py particulars.doc -s 5 -d > image1.jpeg
```
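A comparable listing can be sketched with Python's `zipfile` module. The example below enumerates the members of an OOXML archive and flags any member that starts with the OLE2 magic bytes, which is where macros would live. The helper name `list_members`, the 1-based index, and the in-memory toy archive are assumptions made for illustration; they are not the output format of `zipdump.py`.

```python
import io
import zipfile

OLE2_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"


def list_members(doc_bytes: bytes) -> list:
    """Return (index, name, size, is_ole2) for every member of the archive."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        for index, info in enumerate(archive.infolist(), start=1):
            head = archive.read(info)[:8]
            rows.append((index, info.filename, info.file_size, head == OLE2_MAGIC))
    return rows


# Toy archive built in memory; a real document has many more parts.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/vbaProject.bin", OLE2_MAGIC + b"\x00" * 16)
toy_doc = buffer.getvalue()

rows = list_members(toy_doc)
for row in rows:
    print(row)
```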

* Use `feh` image viewer to view the image

```
feh image1.jpeg &
```

### `olevba` to extract VBA Macros

```
olevba particulars.doc > particulars.olevba #extract
code particulars.olevba #view
```

* The `olevba` utility can locate, decode, and extract VBA macros from Office files. The tool also shows a summary of the risky keywords it located in the macro
* Any line that starts with `'` is a comment in VBA
* When Office sees `AutoOpen` it automatically executes that function as soon as macros are allowed to run
* Example:

```
Sub AutoOpen()
g
End Sub 
-----------------------------------------
Sub g()
' useless comment 
' another useless comment for obfuscation 
y
' blah
' blah blah 
B
End Sub
```

* Can see that `AutoOpen()` calls `Sub g()`, which then calls `y` and `B`, which are defined later
* **For deeper visibility into VBA macros and related artifacts examine streams**
* Use `oledump.py`

```
oledump.py particulars.doc -i
```

* `M` means there is a macro present
* `2823+809` Size of the compiled code is the first number, second number is the size of the compressed source code
* Example:

```
A3: M 3632 2823+809 'VBA/Pj
```

* Use `-s a` parameter to oledump.py to extract VBA macros from all streams in `particulars.doc`

```
oledump.py particulars.doc -s a -v | more
```

* **Pass the `oledump.py` output through `grep` to eliminate the comments**

```
oledump.py particulars.doc -s a -v | grep -v "^'" | more
```

* **Sometimes minor aspects of the document can offer additional context for your investigation**
* They can sometimes reveal artifacts used in its previous version
* Use `oledump.py` to extract them

### Macros via LOLBin

* Be on the lookout for obfuscated strings that are backwards

```
Public Const O As String = 
" 23rvsger"
...
Function U5(qe)
Dim bT As New WshShell
bT.exec StrReverse(O) & " " & DU(1)
End Function
```

* When this is executed it will use the LOLBin `regsvr32`
* Be aware of LOLBin `mshta` as well
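Reversed strings like the constant above are quick to check by hand. A small Python equivalent of VBA's `StrReverse` (the helper name `str_reverse` is just for illustration) recovers the command name:

```python
def str_reverse(value: str) -> str:
    """Python equivalent of VBA's StrReverse."""
    return value[::-1]


O = " 23rvsger"  # the constant from the macro above
print(repr(str_reverse(O)))
```

The leading space in the constant becomes a trailing space after reversal, which separates the command from the argument the macro appends.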

#### Viewing MetaData

```
exiftool filename.doc
```

* XML source code files sometimes include details such as:
* Hidden comments such as URLs from which images were pasted
* The language code of the system where the document was created

### Analyzing OOXML

* You can unzip its contents and examine individual XML files
* Start with `zipdump.py` with no command line arguments

```
zipdump.py particulars.doc 
```

* Once you have identified the index of the file you'd like to examine, call `zipdump.py` again, specifying the desired file's index using `-s`
* The `-d` parameter directs the tool to dump the file to STDOUT
* Can then pipe to `xmldump.py` with the `pretty` command to reformat the file

```
zipdump.py particulars.doc -s 9 -d | xmldump.py pretty | more
```

### ViperMonkey can emulate VBA macros

```
vmonkey particulars.doc > particulars.vmonkey
code particulars.vmonkey
```

* Tool will auto decode the VBA macros

### Numbers to Strings

* After performing analysis you notice a macro in `A3`
* When extracting it with `oledump.py`

```
oledump.py mydoc.docm -s A3 -v | more
```

* You see a lot of these lines

```
exec = exec & ChrW(112) & ChrW(111)...
```

* You can use `numbers-to-string.py`

```
oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | more
```

* Make sure to add new lines and examine the output

```
oledump.py mydoc.docm -s A3 -v | numbers-to-string.py -j | sed "s/;/;\n/g" > mydoc.oledump
```
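The transformation itself is simple enough to sketch in Python: find each chain of `ChrW(n)` calls joined with `&` and replace it with the string it builds. This is an illustration of the idea rather than a reimplementation of the tool; the name `decode_chr_chains` is made up, and only `Chr`/`ChrW` with decimal literals are handled.

```python
import re

CHR_CALL = r"ChrW?\(\s*(\d+)\s*\)"
CHR_CHAIN = re.compile(rf"{CHR_CALL}(?:\s*&\s*{CHR_CALL})*", re.IGNORECASE)


def decode_chr_chains(line: str) -> str:
    """Replace each chain of Chr/ChrW calls with the quoted string it produces."""
    def replace_chain(match: re.Match) -> str:
        codes = re.findall(CHR_CALL, match.group(0), re.IGNORECASE)
        return '"' + "".join(chr(int(code)) for code in codes) + '"'

    return CHR_CHAIN.sub(replace_chain, line)


line = "exec = exec & ChrW(112) & ChrW(111) & ChrW(119)"
print(decode_chr_chains(line))
```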

### Password protected VBA Macros

* Can see the VBA macro using `oledump.py` even though MSFT Office refuses to show you the code because a password is set

```
oledump.py invoice.doc -i
oledump.py invoice.doc -s 7 -v | more
```

#### Remove the distracting junk code, then examine the macro

```
oledump.py invoice.doc -s 7 -v | grep -v "^GoTo" | grep -v ":$" > invoice.oledump
```

### xor-kpa.py

* The tool `xor-kpa.py` is designed to derive an XOR key from the supplied plaintext and cipher text
* It can also XOR a string with its multi-byte key which mimics the algorithm employed by our malicious macro
* `-x` tells the tool to XOR the data with the key
* Start each param with `#h#` to designate it as a hex-encoded string and enclose in `''`

```
xor-kpa.py -x '#h#89789FD89AF897AKJHF43HK23' '#h#66546F'
```
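The multi-byte XOR that the macro is described as using can be sketched in a few lines of Python. This only illustrates the repeating-key algorithm; the key and plaintext below are invented, and `xor_with_key` is a hypothetical helper name rather than part of `xor-kpa.py`.

```python
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


key = bytes.fromhex("66546F")            # made-up three-byte key
plaintext = b"http://example.com/a.exe"  # made-up plaintext
ciphertext = xor_with_key(plaintext, key)

print(ciphertext.hex())
print(xor_with_key(ciphertext, key))     # XOR is its own inverse
```

Because XOR with the same key undoes itself, knowing even a short stretch of plaintext (for example `http://`) is enough to recover the corresponding key bytes, which is the known-plaintext idea behind the tool.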

### Auto deobfuscation with `oledump.py`

* `plugin_http_heuristics` --> will automatically decode embedded URLs if they are encoded using a common obfuscation method

```
oledump.py invoice.doc -p plugin_http_heuristics
```

* Sometimes a faster approach to deobfuscate macros involves the VBA debugger built into MSFT Office

```
evilclippy -uu invoice.doc
```

* The `-uu` flag removes the macro password protection
* Then open MSFT Word and click `View tab --> Macros --> View Macros --> edit`
* Bring up the Locals window so you can see the variables
* Add the following at the beginning of the macro (e.g. at the start of the AutoOpen function) so the macro starts the debugger

```
Sub AutoOpen() <-- Line already there
Debug.Assert False <-- Line you add
GoTo jlskdffjieoajioehjfueahfekjanufiw <-- Start of obfuscated mess
```

* Save the macro so the line you added doesn't get lost
* Switch to the MSFT Word main view and enable macros
* Once you enable the macros, the code will run and pause in the AutoOpen function on the line you added
* Set a breakpoint on the line that interests you
* Then click `Run > Continue`
* Once it hits your breakpoint, examine the Locals window in the bottom pane; it shows the current variables, where you should see what you are looking for

### VBA Stomping

* When a macro is added to an Office Document MSFT Office compiles it into a bytecode form known as `p-code`
* This is the code that is actually executed when the macro is run (most of the time: <https://github.com/bontchev/pcodedmp>)
* Malware authors could modify or fully delete the source code version of the macro while keeping the `p-code` version intact
* Our analysis tools focus on the source code of the macro and won't recognize the true nature of the file

**Extract the file as always**

```
olevba order.docm
```

* Now extract the file structure info

```
oledump.py order.docm -i
```

* A `!` in the output indicates an unusual start of the source code
* Another sign of VBA stomping is a compressed source code size of `0`
* `oledump.py` can extract the `p-code` but it cannot decode it

```
oledump.py order.docm -s A3 -v <-- Will get an error "Cannot decompress"
oledump.py order.docm -s A3s -A <-- s selects the source code; -A shows the contents the way a hex editor might
oledump.py order.docm -s A3c | more <-- c selects the compiled code (what c stands for)
```
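The two stomping signs described above (the `!` indicator and a source size of `0`) can be checked mechanically. The Python sketch below parses a single line in the layout of the `A3: M 3632 2823+809 'VBA/Pj` example shown earlier in these notes; that layout is an assumption taken from the example, the second test line is invented, and `stomping_hints` is a hypothetical helper.

```python
import re

# index, indicator, total size, compiled size + compressed source size, stream name
LINE = re.compile(
    r"^\s*(?P<index>\S+):\s+(?P<flag>\S)\s+(?P<total>\d+)\s+"
    r"(?P<compiled>\d+)\+(?P<source>\d+)\s+(?P<name>.*)$"
)


def stomping_hints(line: str):
    """Return parsed sizes plus any VBA stomping hints, or None if the line does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    hints = []
    if match["flag"] == "!":
        hints.append("unusual start of source code")
    if int(match["source"]) == 0:
        hints.append("compressed source size is 0")
    return {
        "index": match["index"],
        "compiled": int(match["compiled"]),
        "source": int(match["source"]),
        "hints": hints,
    }


print(stomping_hints("A3: M 3632 2823+809 'VBA/Pj"))
print(stomping_hints("A3: ! 3632 3632+0 'VBA/Module1'"))  # invented line for illustration
```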

* Use `pcodedmp.py` to disassemble VBA `p-code`

```
pcodedmp order.docm > order.pcodedmp
code order.pcodedmp
```

* Use `pcode2code` to decompile VBA `p-code`

```
pcode2code order.docm | more
```

* Note:
* MSFT Office automatically decompiles the `p-code` to generate the VBA source code, however:
* Macros without the source code will only run in the specific version of Office for which the `p-code` was created
* If you want to debug the macros, decompile the `p-code` using `pcode2code`, then embed the resulting macro in a document

### Base64 PowerShell

* If you identify some base64 encoded PowerShell, use `base64dump.py` to convert it

```
oledump.py checkbox.doc -s 7 -d | base64dump.py -s 1 -t utf16 > checkbox.ps1
more checkbox.ps1
```

* However, when you view the dump you can see that it also contains `gzip` compressed data
* Extract the gzip data

```
base64dump.py checkbox.ps1 -s 3 -d | gunzip - > checkbox2.ps1
code checkbox2.ps1
```
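The same two-layer decoding can be sketched in Python: Base64 over UTF-16 text for the outer PowerShell command, then Base64 over gzip data for the inner payload. The sample is generated in memory so the example is self-contained; the helper names are made up, the outer layer is assumed to be UTF-16LE, and picking the longest Base64 run is only a heuristic standing in for choosing a stream with `-s`.

```python
import base64
import gzip
import re

BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def decode_utf16_base64(blob: str) -> str:
    """Decode Base64 whose content is UTF-16LE text (assumed outer encoding)."""
    return base64.b64decode(blob).decode("utf-16-le")


def gunzip_longest_base64(script: str) -> bytes:
    """Decode the longest Base64 run in the script and decompress it as gzip."""
    candidates = BASE64_RUN.findall(script)
    if not candidates:
        raise ValueError("no Base64 run found")
    return gzip.decompress(base64.b64decode(max(candidates, key=len)))


# Build a toy two-layer sample.
stage2 = b"Write-Output 'second stage'"
inner_b64 = base64.b64encode(gzip.compress(stage2)).decode()
stage1 = f"$d=[Convert]::FromBase64String('{inner_b64}')"
outer_b64 = base64.b64encode(stage1.encode("utf-16-le")).decode()

recovered_stage1 = decode_utf16_base64(outer_b64)
recovered_stage2 = gunzip_longest_base64(recovered_stage1)
print(recovered_stage2)
```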

### Shellcode

* Shellcode is machine code that the CPU can understand
* It is represented as a series of bytes stored in a memory region

```
base64dump.py checkbox2.ps1 -n 10 
```

* `-n` parameter directs `base64dump.py` to only consider strings that when decoded are at least 10 bytes long
* You should now see the long shellcode string, and see that it is the second stream, use `-s 2` to extract that stream

```
base64dump.py checkbox2.ps1 -n 10 -s 2 -d | translate.py "byte ^ 35" > checkbox.bin
```

* Use `scdbgc` to emulate the execution of shellcode to understand its capabilities

```
scdbgc /f checkbox.bin /s -1
```

* Can now use `yara-rules` to identify known malware patterns in the file

```
yara-rules checkbox.bin
1768.py checkbox.bin
```

* The `yara-rules` command scans the file to see if it matches any rules
* `1768.py` is designed for parsing Cobalt Strike artifacts and is installed on REMnux
* In Cobalt Strike files the License ID is stored as a 32-bit integer in the last 4 bytes of the shellcode
* See more: <https://isc.sans.edu/forums/diary/Finding+Metasploit+Cobalt+Strike+URLs/27204/>

### Examining Malicious RTF Documents

* RTF documents are supported by MSFT Word and many non-MSFT applications
* RTF does not support macros but it allows attackers to embed other dangerous files as `OLE` objects and other binary contents
* Users can be persuaded to open and execute the embedded file
* RTF files can also directly target a vulnerability using an exploit to execute the embedded shellcode payload
* When examining RTF documents, focus on the objects or other embedded artifacts
* <https://cofense.com/rtf-malware-delivery/>

### RTF format

* Usually formatted as ASCII plaintext and includes control words and groups
* Control words start with `\` and specify how the RTF rendering application should format and display the characters
* A group encloses other elements in `{}` delimiters and specifies the text affected by the group and its formatting
* Groups can be nested
* Objects and other binary content are embedded as serialized strings that represent hex values
* You will see the `\objdata` control word followed by a string encoded in hex
* Use `rtfdump.py` and `| more` to get an overview of the RTF file's groups and to spot embedded objects
* `-O` will allow you to examine the objects

```
rtfdump.py new-order.doc -O
```

* `-s` parameter specifies the index of the object
* `-d` tells the tool to dump the object in its raw form

```
rtfdump.py new-order.doc -O -s 1 -d > new-order.object
```

* Use `oledump.py` to examine the extracted object
* `oledump.py new-order.object -i`
* If you now want to examine a specific stream use the `-A` parameter
* `oledump.py new-order.object -s 4 -A`
* **When analyzing malicious documents that might have exploits look for shellcode to understand the payload of the attack**
* Use the `-S` parameter to examine the strings
* `oledump.py new-order.object -s 4 -S`
* For parsing `Equation Editor 3.0` data we have an option `-f name=eqn1`
* `oledump.py new-order.object -s 4 -d | format-bytes.py -f name=eqn1`
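To make the hex-encoded embedding concrete, the Python sketch below pulls the hex string that follows an `\objdata` control word and converts it to bytes. It is a simplified illustration, not what `rtfdump.py` does internally: `extract_objdata` is a made-up name, the sample RTF is invented, and obfuscation tricks such as control words or nested groups inside the hex run are not handled.

```python
import re


def extract_objdata(rtf_text: str) -> list:
    """Return the decoded bytes of every \\objdata hex run found in the text."""
    blobs = []
    for match in re.finditer(r"\\objdata\b([0-9A-Fa-f\s]+)", rtf_text):
        hex_digits = re.sub(r"\s+", "", match.group(1))
        if len(hex_digits) % 2:
            hex_digits = hex_digits[:-1]  # drop a dangling nibble
        blobs.append(bytes.fromhex(hex_digits))
    return blobs


# Invented sample: whitespace and line breaks inside the hex run are ignored.
toy_rtf = r"{\rtf1{\object\objemb{\*\objdata 0105 0000" + "\n" + "02000000}}}"
blobs = extract_objdata(toy_rtf)
print(blobs[0].hex())
```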

### Shellcode searching in Binary files

* When looking for shell code look out for a lot of `0x90` also known as a NOP sled
* Use `xorsearch` to spot shellcode patterns in binary files
* `xorsearch -W -d 3 qa.bin`
* `EIP` points to the current instruction but assembly code cannot read it directly, so malware authors do it indirectly

```
Call followed by a POP allows code to get its EIP contents
CALL 00401024
POP EAX
Shellcode developers attempt to evade detection by using other instructions to perform GetEIP
00401027 JMP SHORT 0040102C #Happens first and moves down to the CALL
00401029 POP ESI
0040102A JMP SHORT 00401031
0040102C CALL 00401029 #Call is made and it moves back up to the POP
00401031 ADD ESI, 9 
This code succeeds at making the CALL and then POP in an indirect manner
```
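Two of the patterns mentioned above, a NOP sled and the direct CALL-then-POP form of GetEIP, are easy to search for in raw bytes. The Python sketch below assumes 32-bit x86 encodings (`E8 00 00 00 00` for a CALL to the next instruction, `58` to `5F` for POP into a general register) and an arbitrary threshold of eight NOPs; it does not cover the JMP/CALL/POP variant or XOR-encoded shellcode, which is where `xorsearch` goes further. The blob is made up.

```python
import re

NOP_SLED = re.compile(rb"\x90{8,}")                         # 8 or more consecutive NOPs
CALL_POP = re.compile(rb"\xE8\x00\x00\x00\x00[\x58-\x5F]")  # CALL next instruction, then POP r32


def find_shellcode_hints(data: bytes) -> list:
    """Return sorted (offset, label) pairs for each pattern hit."""
    hits = [(m.start(), "NOP sled") for m in NOP_SLED.finditer(data)]
    hits += [(m.start(), "CALL/POP GetEIP") for m in CALL_POP.finditer(data)]
    return sorted(hits)


# Made-up blob: padding, a 16-byte sled, then CALL $+5 / POP ESI.
blob = b"\x00" * 4 + b"\x90" * 16 + b"\xE8\x00\x00\x00\x00\x5E" + b"\xCC"
for offset, label in find_shellcode_hints(blob):
    print(hex(offset), label)
```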

#### Shellcode Requirements

* Shellcode needs to do some work before it can make API calls
* To load DLLs and resolve API function names, shellcode often seeks `kernel32.dll` for `LoadLibrary` and `GetProcAddress`
* Shellcode looks for the `Process Environment Block (PEB)` to locate `kernel32.dll` in the memory of the exploited process
* For every process the Windows OS creates a structure called the `PEB`
* This data structure contains information about the process including the list of modules (DLLs) that have been loaded or mapped into the process's memory
* The `FS` register contains the address of the data structure called the `Thread Information Block (TIB)`, which contains information about the currently running thread
* A pointer to the `PEB` resides within the `TIB` at offset `0x30` with respect to the beginning of the `TIB`
* Therefore a pointer to `PEB` is always located at `FS:[0x30]`
* This syntax directs the processor to look for the address stored `0x30` bytes away from the beginning of the `TIB` structure
* Two methods to retrieve the `PEB`

```
MOV EAX, DWORD PTR FS:[30h]

PUSH 30h
POP EBX
MOV EAX, FS:[EBX]
```
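Both instruction sequences have fixed byte encodings on 32-bit x86, so they can serve as simple search signatures. The sketch below assumes the standard encodings `64 A1 30 00 00 00` for `MOV EAX, FS:[30h]` and `6A 30 5B 64 8B 03` for the PUSH/POP/MOV variant; other registers or instruction orderings would need additional signatures, and the helper names and blob are invented.

```python
PEB_SIGNATURES = {
    "MOV EAX, FS:[30h]": bytes.fromhex("64A130000000"),
    "PUSH 30h / POP EBX / MOV EAX, FS:[EBX]": bytes.fromhex("6A305B648B03"),
}


def find_all(data: bytes, needle: bytes) -> list:
    """Return every offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def find_peb_access(data: bytes) -> dict:
    """Map each signature label to the offsets where it appears."""
    return {label: find_all(data, sig) for label, sig in PEB_SIGNATURES.items()}


# Made-up blob containing one instance of each sequence.
blob = (
    b"\x90\x90"
    + PEB_SIGNATURES["MOV EAX, FS:[30h]"]
    + b"\xC3"
    + PEB_SIGNATURES["PUSH 30h / POP EBX / MOV EAX, FS:[EBX]"]
)
for label, offsets in find_peb_access(blob).items():
    print(label, [hex(o) for o in offsets])
```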

### scdbgc

* Use `scdbgc` to analyze shellcode by emulating its execution
* The `/foff` parameter specifies the hex offset within the file where the shellcode starts
* This can be determined by `xorsearch`
* Press CTRL+C three times if `scdbgc` gets stuck

```
scdbgc /f qa.bin /s -1 /foff 3B
```

* `/s -1` parameter indicates to continue the emulation without restricting the max number of instructions
* Direct `scdbgc` to open a handle to the malicious file so the shellcode can find the overlay to where it likely stores additional contents
* Hit CTRL+C three times after it starts to avoid too many repeating instructions from filling your screen
* Can hide the numerous `READ/WRITE` events with `/norw`

```
scdbgc /f qa.bin /s -1 /foff 3B /fopen qa.doc /norw
```

* If you see shellcode attempting to drop another file such as an exe, you can allow the shellcode to execute in order to capture the file
* Use `runsc32` / `runsc64`

### runsc32 runsc64

* Can also be used on REMnux because Wine is installed
* To execute shellcode:

```
runsc32 -f qa.bin -o 0x3B -d qa.doc -n 
find ~/.wine -name WINWORD.EXE -exec cp "{}" . \;
```

### XLM Macros

* Microsoft Excel 4.0 (XLM) macros are legacy technology that can offer attackers an alternative to VBA macros
* Were introduced in 1992, before the introduction of VBA in 1993
* Are being retired by MSFT but still work in recent versions of Excel
* Are defined as formulas in cells of sheets
* Sheets are often hidden
* The formulas are often in white text on white background
* To see where the XLM macro execution starts use `zipdump.py` with `-s` parameter to examine the `xl/workbook.xml`

```
zipdump.py koti.xlsm -s "xl/workbook.xml" -d | xmldump.py pretty
```

* To see where execution starts look for:

```
<definedNames>
  <definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>
 </definedNames>
```

* Execution starts in cell A154 in sheet Lodet
* Look above at the `<sheet name=>` element to figure out which `rId` number is assigned to our sheet `Lodet` and whether it is hidden or not
* To see which XML files represent the sheets Lodet and kOTI, look at the `xl/_rels/workbook.xml.rels` file

```
zipdump.py koti.xlsm -s "xl/_rels/workbook.xml.rels" -d | xmldump.py pretty
```

* It will show you:
* `<Relationship Id="rId3"...Target="worksheets/sheet2.xml"/>`
* Now examine the `worksheets/sheet2.xml`
* Now extract the contents of Lodet which is `macrosheets/sheet1.xml` using `zipdump.py`
* `zipdump.py koti.xlsm`
* `zipdump.py koti.xlsm -s 6 -d | xmldump.py pretty | more`
* For easier analysis, direct `xmldump.py` to display just the cell text

```
zipdump.py koti.xlsm -s 6 -d | xmldump.py celltext > koti.csv
```
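The `xl/workbook.xml` lookup described earlier (find the `_xlnm.Auto_Open` defined name, then check the matching sheet's `rId` and visibility) can be sketched with Python's standard XML parser. The XML below is a made-up miniature that reuses the sheet names from these notes, and `find_auto_open` is a hypothetical helper; the namespaces are the standard SpreadsheetML ones.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
R_ID = "{" + REL_NS + "}id"


def find_auto_open(workbook_xml: str):
    """Return ((sheet, cell) or None, {sheet name: {"rId", "state"}})."""
    root = ET.fromstring(workbook_xml.strip())
    sheets = {
        sheet.get("name"): {"rId": sheet.get(R_ID), "state": sheet.get("state", "visible")}
        for sheet in root.findall("m:sheets/m:sheet", NS)
    }
    entry = None
    for defined in root.findall("m:definedNames/m:definedName", NS):
        if defined.get("name", "").lower().startswith("_xlnm.auto_open"):
            sheet_name, _, cell = (defined.text or "").partition("!")
            entry = (sheet_name.strip("'"), cell.replace("$", ""))
    return entry, sheets


toy_workbook = """
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
          xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <sheets>
    <sheet name="kOTI" sheetId="1" r:id="rId1"/>
    <sheet name="Lodet" sheetId="2" state="hidden" r:id="rId3"/>
  </sheets>
  <definedNames>
    <definedName name="_xlnm.Auto_Open">Lodet!$A$154</definedName>
  </definedNames>
</workbook>
"""
entry, sheets = find_auto_open(toy_workbook)
print(entry, sheets[entry[0]])
```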

* **XLM macro obfuscation techniques include the following:**
* Use formulas to compute sensitive values such as strings during the runtime of the macro
* Compute some values randomly during runtime i.e. the URL

```
Static analysis to compute possible values can be complex and time consuming 
Cached value saves time but displays only one possible outcome
```

* Instead of including a string in the formula include a reference to a string that is stored in a shared table elsewhere in the document
* The shared strings are always in `xl/sharedStrings.xml`
* Shared strings can reveal IOCs
* You can direct `xmldump.py` to look up the strings for you by using the `-j` parameter and pointing to a stream that has the macros

```
zipdump.py koti.xlsm -j | xmldump.py -j 6 celltext
```

* MSFT office is very helpful for decoding XLM macros
* Use the built-in debugger to examine and deobfuscate code
* Convert file format from OOXML to OLE2 and the other way
* Execute the macro the way a victim would to observe effects on the system from a behavioral perspective
* Use Windows AMSI functionality to observe which script commands end up executing

```
logman start AMSITrace -p Microsoft-Antimalware-Scan-Interface Event1 -o AMSITrace.etl -ets
```

* Run the suspicious script or macro you wish to examine
* Stop AMSI Monitoring

```
logman stop AMSITrace -ets
```

* Examine the AMSI data saved to the file
* `AMSIScriptContentRetrieval`
* Additional tools and considerations for XLM macro analysis
* `oledump.py` can examine XLM macros in OLE2 files
* `oledump.py file.xls -p plugin_biff --pluginoptions "-x"`
