Reversing Malicious Code

Goal is to understand common malware characteristics at a code level
Code analysis may also reveal potential branches of execution
Overview of the code lifecycle
Source code is translated into object code by a compiler
Object code is then combined with libraries and an executable file is created
To run the file, the operating system reads various information from the executable file, allocates memory, and loads required libraries into memory
Control is transferred to the code to execute
This final stage is where we examine the code with a debugger
Note: Libraries may also be loaded during the program's execution

Ghidra

Developed by NSA
Its decompiler produces a C representation of the code to speed up analysis
Includes support for writing Java and Python scripts to automate analysis
Help is accessed via F1 key
Ghidra v10 includes a debugger
https://ghidra-sre.org/

Create a new project

File --> New Project
Choose the project type
Click Finish
Drag and drop the specimen into the project window
Accept the defaults in the Import window and click OK

Launch the code browser and begin the auto-analysis

Make sure to enable WindowsPE x86 Propagate External Parameters option
Finally click the Analyze button and wait for Ghidra to finish
Once auto analysis is completed an Auto Analysis summary will show any warnings or issues encountered during the process
A common warning is that the file does not contain debug information
This is expected and not an issue

Before proceeding, save the project and take a snapshot

Ghidra Overview

The main window is the Listing view, which presents the target program's code and data
Ghidra initially brings you to the beginning of the file in the Listing view --> notice the MZ string
If you scroll down from there you can examine the program's headers
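
The MZ string marks the DOS header at the start of a PE file, and the dword at offset 0x3C of that header holds the file offset of the PE signature. A minimal Python sketch of the same check follows; the `looks_like_pe` helper and the synthetic buffer are illustrative assumptions, not part of Ghidra.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Return True if data starts with a DOS header that points at a PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian dword at offset 0x3C holding the PE header offset
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"

# Synthetic 0x80-byte buffer standing in for a real specimen (hypothetical data)
sample = bytearray(0x80)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x40)
sample[0x40:0x44] = b"PE\x00\x00"

print(looks_like_pe(bytes(sample)))  # True
```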

Program Tree

Window is in the top left and shows the different sections and headers
Section names are typically:

.text - Contains executable code
.rdata - Contains read-only data
.data - Contains initialized, writable data
.reloc - Contains relocation data to fix up addresses in the file if it is not loaded at the preferred address

To jump to the .text section double click the .text node
https://docs.microsoft.com/en-us/windows/win32/debug/pe-format

FUN in Ghidra

In Ghidra the FUN_ prefix generically refers to a function while the numeric value refers to the address where the function is loaded into memory
Original name of the function is normally lost during compilation
Execution occurs linearly one instruction after the next
On the far left you will see a 32-bit address such as 00401007 (hex)
This address represents the location of the code in memory after the program is loaded, not an offset on disk (i.e. what a hex editor would show for the file)
On the right are the x86 assembly instructions
Note: This is the beginning of the .text section, not the beginning of the program; execution begins at the entry point
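
To relate an address from the Listing view to a position in the file on disk, the section's load address and its raw file offset are needed. The sketch below assumes hypothetical values (image base 0x400000, .text loaded at RVA 0x1000 and stored at file offset 0x400); real values come from the specimen's PE headers.

```python
def va_to_file_offset(va, image_base, section_rva, section_raw_offset):
    """Map a loaded virtual address to an offset in the file on disk."""
    rva = va - image_base  # address relative to the load base
    return rva - section_rva + section_raw_offset

# Hypothetical layout for illustration only
offset = va_to_file_offset(0x00401007, image_base=0x400000,
                           section_rva=0x1000, section_raw_offset=0x400)
print(hex(offset))  # 0x407
```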

Function graph view provides a visual perspective on code

Click on the function you want i.e. FUN_00401007
Browse to Window --> Function Graph menu item
Helpful for visualizing loops and complex conditionals within a function, but the Listing view is more compact and easier for some people to navigate
The color of the arrows indicates code flow
If the code block ends in a conditional jump, a green arrow indicates the path where execution continues if the condition is met
If the condition is not met, a red arrow shows where execution continues
If the arrow is blue, the code block ends in an unconditional jump
View imports to review a program's external dependencies
The import address table (IAT) helps direct code analysis
You can view imports in the Symbol Tree window but we will access this information via Window --> Symbol References
Filter symbols by "Imported" to focus on dependencies

Look for API call patterns associated with malware behavior

We can examine imports to identify potential functionality associated with common malware characteristics
Learn more about an API call at microsoft.com
Types of API Calls:

A --> (ANSI)
W --> (Wide)
Ex --> (Extended)

A means the function takes ANSI (8-bit character) strings
W means wide, a two-byte character representation (UTF-16)
Ex (extended) is used when MSFT updates a function and the new function is not compatible with the old one
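
The difference between the A and W variants shows up in how string arguments look in memory. As a small illustration (the sample string is an arbitrary assumption), the same text encoded both ways:

```python
text = "cmd.exe"  # arbitrary sample string

# ANSI: one byte per character plus a one-byte terminator
ansi = text.encode("ascii") + b"\x00"
# Wide: two bytes per character (UTF-16, little-endian) plus a two-byte terminator
wide = text.encode("utf-16-le") + b"\x00\x00"

print(len(ansi), ansi)
print(len(wide), wide)
```

When a string in the Listing view shows a zero byte after every character, it is most likely a wide string passed to a W function.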
Instructions reference registers, immediate values and memory
Instructions have two components: an operation and its operands
Instructions can have 0-3 operands
An Operand can be:

A register
A memory location 
An immediate value (e.g. 0x6453)

Consider MOV EAX, 0x6453
EAX is the destination (first)
0x6453 is the source (second)
You are setting EAX to the value 0x6453
Operands may be implied

Intel processors use registers to track the state of computation as instructions are executed

Registers are on chip memory locations
Instructions act on registers and memory locations
A CPU has a series of registers

Some registers are general purpose
Some have a particular use
Some are both

We monitor registers to track arguments, variables, and function return values
The x86 architecture uses the following general purpose registers to hold code and data

EAX --> Used for addition, multiplication, and return values
ECX --> Used as a counter 
EBP --> Used to reference arguments and local variables
ESP --> Points to the top of the stack (the last item pushed)
ESI/EDI --> Used by memory transfer instructions

Special use registers hold flags and track program execution

EIP points to the next instruction to execute
EFLAGS bits represent the outcome of computations and control CPU operations

Segment registers include:

CS - Code segment
DS - Data segment 
SS - Stack segment

32 bit registers can also be accessed as 16 and 8 bit registers
On a 32-bit architecture, registers are accessed at their default dword (32-bit) size
To access a register's lower 16 bits, the leading E is omitted from the name, e.g. EAX becomes AX
The naming scheme for EAX, EBX, ECX, and EDX is as follows:
E<letter>X --> dword, the full 32-bit value of the register
<letter>X --> lower word, the low 16 bits of the register
<letter>H --> high byte, the upper 8 bits of <letter>X
<letter>L --> low byte, the lower 8 bits of <letter>X

EAX means the full 32 bits
AX means the low 16 bits of EAX
AH means the high 8 bits of AX
AL means the low 8 bits of AX
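
The relationship between a register and its smaller views can be expressed with masks and shifts. A minimal sketch, where `sub_registers` is an illustrative helper name:

```python
def sub_registers(eax: int) -> dict:
    """Split a 32-bit EAX value into its AX, AH, and AL views."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF              # low 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,    # high 8 bits of AX
        "AL": ax & 0xFF,           # low 8 bits of AX
    }

regs = sub_registers(0x12345678)
print({name: hex(value) for name, value in regs.items()})
# {'EAX': '0x12345678', 'AX': '0x5678', 'AH': '0x56', 'AL': '0x78'}
```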

A word, dword, and qword are 16, 32, and 64 bits long, respectively
Strictly speaking, a word is the natural size of a unit of data for a processor
A 16-bit processor has 16-bit words
Many tools consider a word to be 16 bits regardless of processor size
Additional common data sizes:

8 bits --> 1 byte 
32 bits --> dword 
64 bits --> qword

The operand for one push instruction is a pointer to a string
A pointer is a variable that holds a memory address (it points to a memory location)
Accessing the data at the address a pointer holds is called dereferencing
Pointers are efficient: rather than copying a data structure around in memory, code copies the value of a pointer (4 bytes on 32-bit systems)
A PUSH instruction before a CALL often represents arguments passed to the function specified by the CALL

Memory can be accessed directly by many assembly instructions

Example:

MOV EAX, [0x410230]

Brackets mean fetch data at the specified address (dereference)
This is direct addressing because we are dereferencing an immediate value
The result is that 4 bytes of data at 0x410230 will be moved to EAX
Some tools like IDA omit brackets for direct addresses (IDA: dword_410230)
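
What MOV EAX, [0x410230] does can be modeled as reading four bytes from memory and interpreting them as a little-endian dword, which is how x86 stores multi-byte values. The memory contents below are made up for illustration.

```python
import struct

# Hypothetical bytes located at address 0x410230
MEMORY_BASE = 0x410230
MEMORY = bytes([0x53, 0x64, 0x00, 0x00, 0xAA, 0xBB, 0xCC, 0xDD])

def read_dword(address: int) -> int:
    """Dereference address: fetch 4 bytes and decode them as little-endian."""
    (value,) = struct.unpack_from("<I", MEMORY, address - MEMORY_BASE)
    return value

eax = read_dword(0x410230)  # MOV EAX, [0x410230]
print(hex(eax))             # 0x6453
```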
Memory may also be addressed by reference indirectly
The address may be calculated or in a register
This is called an Effective Address and it enables us to work efficiently with data structures
Format: Base + (Index * Scale) + Displacement

Base          Index         Scale     Displacement
EAX, EBX      EAX, EBX      1         None
ECX, EDX   +  ECX, EDX   *  2      +  8 bit value
ESP, EBP      EBP, ESI      4         16 bit value
ESI, EDI      EDI           8         32 bit value

Indirect referencing: the address of the destination is calculated or resides in a register. The calculated address is called the effective address (EA)
Even when the address sits in a register, this differs from using the register itself as the destination
In indirect memory addressing the register holds the address of the destination
A major advantage of indirect memory addressing is the ability to work efficiently with data structures
You can increment the value of a single register to step through the fields of a data structure or the same field across an array of data structures
If a scale is used, an index register must also be used
Examples of indirectly addressing memory
[EAX] : Access dynamically allocated memory (base)
[EBP + 0x10] : Access data on the stack (base + displacement)
[EAX + EBX * 8] : Access an array of 8-byte structures (base + index * scale)
[EAX + EBX + 0xC] : Access fields of a two dimensional array of structures (base + index + displacement)
Indirect memory addressing may pose challenges for static code analysis because registers are not populated until runtime
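
The effective address format can be checked by hand once register values are known, for example from a debugger. A short sketch of the calculation, with made-up register values:

```python
def effective_address(base=0, index=0, scale=1, displacement=0):
    """Compute Base + (Index * Scale) + Displacement as a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF

# [EBP + 0x10] with a hypothetical EBP of 0x0012FF40
print(hex(effective_address(base=0x0012FF40, displacement=0x10)))  # 0x12ff50

# [EAX + EBX * 8] with hypothetical EAX = 0x00500000 and EBX = 3
print(hex(effective_address(base=0x00500000, index=3, scale=8)))   # 0x500018
```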
Strings are an example of a data structure
Data structures group simple variables into more complex types
Examples of data structures include: strings, linked lists, sockets, and file handles
When reversing, determine the type of a data structure by how it is used
Data structures enable us to group bytes and advance our understanding of the code

Code vs Data

Context determines the answer
RegOpenKeyExA Example
The source code passes a symbolic constant to the API call, which appears in the disassembly as a raw value e.g. PUSH 0x80000001
During compilation the symbolic constant is replaced with its numeric (hex) value
Right click the hex value, choose Set Equate, and then choose HKEY_CURRENT_USER to change it back to the symbolic constant
This brings clarity to the code
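
An equate is essentially a reverse lookup from a number to its symbolic name. The predefined registry root keys are a common case; the lookup helper below is an illustrative sketch, not a Ghidra API.

```python
# Predefined registry root key handles as defined in the Windows headers
HKEY_NAMES = {
    0x80000000: "HKEY_CLASSES_ROOT",
    0x80000001: "HKEY_CURRENT_USER",
    0x80000002: "HKEY_LOCAL_MACHINE",
    0x80000003: "HKEY_USERS",
}

def equate_for(value: int) -> str:
    """Return the symbolic name for a pushed constant, or the hex value if unknown."""
    return HKEY_NAMES.get(value, hex(value))

print(equate_for(0x80000001))  # HKEY_CURRENT_USER
```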

Branch instructions direct code execution to another location

The flow of execution i.e. control flow is sequential until a branching instruction is reached
Then the EIP is updated and execution is transferred to another location in memory
The code under review contains two types of jumps
Jumps are an example of a branching instruction
Unconditional branches always transfer control: JMP, CALL, RET
Conditional branches only transfer control if a condition is met: Jcc, LOOP
A conditional jump represents a decision point
Conditional jumps require that we review multiple instructions
To evaluate whether a condition is true, arithmetic and Boolean instructions are used
sub ecx, 8 leaves a result of zero if ECX is equal to 8
and eax, eax leaves a result of zero if EAX is equal to zero
If the result is zero, the ZF bit is set in the flags register
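
The effect of these two instructions on ZF can be modeled in a few lines. This is a simplified sketch that tracks only the result and the zero flag; the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sub_sets_zf(dest: int, src: int):
    """Model SUB dest, src: return (result, ZF)."""
    result = (dest - src) & MASK32
    return result, result == 0

def and_sets_zf(dest: int, src: int):
    """Model AND dest, src: return (result, ZF)."""
    result = dest & src & MASK32
    return result, result == 0

print(sub_sets_zf(8, 8))  # (0, True)   ECX was 8, so ZF is set
print(sub_sets_zf(9, 8))  # (1, False)
print(and_sets_zf(0, 0))  # (0, True)   EAX was 0, so ZF is set
```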

Jumps

A Jcc instruction will be performed if a jump condition is met
Form: Jcc

A --> jump if above (unsigned comparison)
B --> jump if below (unsigned comparison)
E --> jump if equal
G --> jump if greater (signed comparison)
L --> jump if less than (signed comparison)
Z --> jump if zero
N --> jump if not, inverting the condition e.g. JNZ jumps if not zero
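
Each condition code corresponds to a test on the flags register. The table below is a sketch covering only the mnemonics listed above; the flag names follow the Intel convention (ZF zero, CF carry, SF sign, OF overflow), and `jump_taken` is an illustrative helper.

```python
JCC_CONDITIONS = {
    "JZ":  lambda f: f["ZF"] == 1,
    "JE":  lambda f: f["ZF"] == 1,
    "JNZ": lambda f: f["ZF"] == 0,
    "JNE": lambda f: f["ZF"] == 0,
    "JA":  lambda f: f["CF"] == 0 and f["ZF"] == 0,        # unsigned above
    "JB":  lambda f: f["CF"] == 1,                         # unsigned below
    "JG":  lambda f: f["ZF"] == 0 and f["SF"] == f["OF"],  # signed greater
    "JL":  lambda f: f["SF"] != f["OF"],                   # signed less
}

def jump_taken(mnemonic: str, flags: dict) -> bool:
    """Return True if the conditional jump would be taken for these flags."""
    return JCC_CONDITIONS[mnemonic.upper()](flags)

# Flags after comparing two equal values
equal = {"ZF": 1, "CF": 0, "SF": 0, "OF": 0}
# Flags after comparing 5 with 8 (5 - 8 borrows and is negative)
smaller = {"ZF": 0, "CF": 1, "SF": 1, "OF": 0}

print(jump_taken("JE", equal), jump_taken("JNZ", equal))    # True False
print(jump_taken("JB", smaller), jump_taken("JG", smaller)) # True False
```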

Comments

Use the ; key to add a comment
Can add EOL comments, Pre, Post or other types of comments

HTTP Command and Control

These APIs enable HTTP C2

InternetOpen, InternetConnect --> Create an HTTP connection
HttpOpenRequest, HttpAddRequestHeaders (Optional) --> Build an HTTP request
HttpSendRequest --> Send an HTTP request
InternetReadFile --> Read a response
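
Once the imports have been exported from the specimen, a quick triage pass can flag the HTTP-related ones. The sketch below works on a plain list of names; the import list is invented, and stripping a trailing A or W is a simplification that does not handle Ex variants.

```python
HTTP_C2_APIS = {
    "InternetOpen": "create an HTTP connection",
    "InternetConnect": "create an HTTP connection",
    "HttpOpenRequest": "build an HTTP request",
    "HttpAddRequestHeaders": "build an HTTP request",
    "HttpSendRequest": "send an HTTP request",
    "InternetReadFile": "read a response",
}

def base_name(api: str) -> str:
    """Strip a trailing A or W so InternetOpenA and InternetOpenW both match."""
    return api[:-1] if api[-1:] in ("A", "W") else api

def http_c2_hits(imports):
    """Map each HTTP C2 related import to the step it performs."""
    hits = {}
    for name in imports:
        for candidate in (name, base_name(name)):
            if candidate in HTTP_C2_APIS:
                hits[name] = HTTP_C2_APIS[candidate]
                break
    return hits

# Hypothetical import list for illustration
imports = ["InternetOpenA", "InternetConnectA", "HttpOpenRequestA",
           "HttpSendRequestA", "InternetReadFile", "CreateFileW"]
for name, purpose in http_c2_hits(imports).items():
    print(f"{name}: {purpose}")
```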

To view the API calls

Window --> Symbol References --> Locate APIs of interest in the Symbol Table

The code references variables, which hold data not known at compile time
Local variables are relevant only to the current function and are not preserved after it returns
Local variables are stored on the stack and referenced relative to ESP or EBP
Global variables are accessible from all functions e.g. DAT_00403374
Static variables can only be used within the function that declares them, but unlike local variables their storage is not marked for reuse when the function exits

Viewing Function Call Trees

Window --> Function Call Tree
View the outgoing calls on the right side
View is ideal for determining which functions are called from the current function
Once you determine what the current function is used for, make sure to Right Click --> Edit Label and give it a meaningful name

GetTempFileNameW

Creates a file name for a temp file
Can explore other function references to find new IOCs
Look for a PUSH to lpPrefixString_XXXXXX
MSFT documentation states that up to the first three characters of this string are used as the temp file name prefix
To assist Ghidra:

Right click on the lpPrefixString --> Data --> TerminatedUnicode

Functions

A function is a group of instructions that performs a specific task (read, write files, send network data, log keystrokes)
Three Basic Components

Input: values passed in
Body: code to perform tasks
Return: value passed back

Calling a function involves a jump to another memory location
After the function is done execution continues at the instruction after the original function call
Calling a function involves two control transfers
Function format: return = function(arg0, arg1)
Specific events occur when calling a function

Pass in parameters (stack/register)
Save the return pointer 
Transfer control to the function

Specific events occur when returning from a function

Set up a return value (typically EAX)
Clean up the stack and restore registers 
Transfer control to the saved return pointer

Within a function, the prologue and epilogue perform setup and cleanup activities
Most functions contain a standard prologue and epilogue
The prologue occurs at the start of the function

Allocates space for variables
Saves registers that will be reused in the function body

Function epilogue occurs at the end of the function

It cleans up the stack e.g. POP allocated variables
It restores registers

The stack is a section in memory used to store saved registers, local variables and function parameters
The stack is LIFO Last in First out
PUSH adds an element and POP removes one
ESP points to the top of the stack and changes with instructions like PUSH, POP, CALL, LEAVE, and RET
EBP, a.k.a. the frame pointer, serves as an unchanging reference within a function
EBP - value = local variable (registers may also be used for locals)
EBP + value = parameter
When EBP is set up in the function prologue in this manner, a reference to EBP minus some value, e.g. [EBP - 8], is accessing a local variable
A reference to EBP plus some value, e.g. [EBP + 8], is accessing a parameter that was passed in
When cleaning up the stack compilers use some tricks
Compilers may POP off a value, e.g. POP EDX, which has the effect of adding four to ESP
It is also very common to see a value added to ESP, the use of RET (which can also pop values off the stack), and the LEAVE instruction
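
The way EBP-relative offsets map to parameters and locals can be traced with a toy model of the stack. The sketch below simulates a cdecl call with two arguments and one local variable; the starting ESP, the return address, and all values are made up.

```python
class Stack:
    """Toy 32-bit stack: grows toward lower addresses, 4 bytes per slot."""
    def __init__(self, esp=0x0012FF00):
        self.esp = esp
        self.ebp = 0
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem[self.esp]
        self.esp += 4
        return value

s = Stack()
start_esp = s.esp

s.push(0x22)              # PUSH arg1 (arguments go right to left)
s.push(0x11)              # PUSH arg0
s.push(0x00401050)        # CALL pushes the return address (hypothetical)

s.push(s.ebp)             # prologue: PUSH EBP
s.ebp = s.esp             #           MOV EBP, ESP
s.esp -= 4                #           SUB ESP, 4 (room for one local)
s.mem[s.ebp - 4] = 0x99   # body: MOV [EBP - 4], 0x99

first_param = s.mem[s.ebp + 8]     # [EBP + 8]   -> 0x11
second_param = s.mem[s.ebp + 0xC]  # [EBP + 0xC] -> 0x22
local_var = s.mem[s.ebp - 4]       # [EBP - 4]   -> 0x99

s.esp = s.ebp             # epilogue: MOV ESP, EBP
s.ebp = s.pop()           #           POP EBP (together these form LEAVE)
return_address = s.pop()  # RET pops the saved return address
s.esp += 8                # caller: ADD ESP, 8 removes the two arguments

print(hex(first_param), hex(second_param), hex(local_var))
print(s.esp == start_esp)  # True: the stack is balanced again
```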

Functions are called according to calling conventions

The convention describes how data is passed into and out of functions
The implementation of the convention may vary by compiler
The cdecl convention (most common) has these characteristics

The arguments are placed onto the stack right to left
The return value is placed into EAX
The caller cleans up the stack (removes the arguments)

The stdcall convention has the following characteristics

Similar to cdecl but the callee cleans up the stack 
This is the convention used by Win32 APIs

Additional calling conventions include fastcall and thiscall
fastcall
Arguments are stored in registers
Any extra arguments are placed on the stack
The callee cleans up arguments on the stack
thiscall
Used in C++ code (member functions)
This convention passes a reference to the object via the this pointer
For MSFT compilers, ECX holds the "this" pointer and the callee cleans up the arguments on the stack
For GNU compilers the "this" pointer is pushed onto the stack last and the caller cleans up
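
In a disassembly, cdecl and stdcall are usually told apart by where the stack cleanup happens. The sketch below encodes that rule of thumb for 4-byte stack arguments; the helper name is illustrative and real compilers may emit other cleanup sequences, such as POP instructions.

```python
def stack_cleanup(convention: str, arg_count: int):
    """Return (who cleans up, typical instruction) for 4-byte stack arguments."""
    size = arg_count * 4
    if convention == "cdecl":
        # The caller adjusts ESP right after the CALL returns
        return "caller", f"ADD ESP, 0x{size:X}"
    if convention == "stdcall":
        # The callee removes its arguments as part of the return
        return "callee", f"RET 0x{size:X}"
    raise ValueError(f"unsupported convention: {convention}")

print(stack_cleanup("cdecl", 2))    # ('caller', 'ADD ESP, 0x8')
print(stack_cleanup("stdcall", 3))  # ('callee', 'RET 0xC')
```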
Reviewing strings reveals filenames and directories of interest
To locate references to a string, right click on it and choose to show references

Loops in malware

Encrypt and decrypt network traffic --> loop over each character in the string to send
Attempt to connect to a C2 server --> loop over a list of servers
Perform a port scan --> try to connect to each port from 1 to 65535
Log keystrokes --> check the state of each key code 0...92
Similar to Jcc, the cc in LOOPcc represents the condition code that must be met for the loop instruction to branch to the address specified
The conditions are:

Z --> Loop if zero 
E --> Loop if equal
N --> Inverts the logic of the looping condition

Reviewing imports to direct our code analysis

The import table lists functions used to access the resource section

FindResourceW --> determine the location of a resource
SizeofResource --> obtain the size of a resource
LockResource --> obtain a pointer to a resource

The resource .rsrc section is often used to store information like icons, dialog boxes, and version information
However malware may hide executables here
Malware that drops files is called a dropper

CreateMutexA

CreateMutexA --> creates or opens a mutex object
Malware authors often use a mutex to avoid re-infecting a machine

Keylogging

GetKeyState and GetAsyncKeyState --> Determine if a particular key is pressed
GetAsyncKeyState determines if a key is currently up or down, or if it was pressed since the last call to the API
GetWindowText --> Retrieves the text of a window's title bar
Combined with the key state APIs, GetWindowText lets an attacker learn which keys are pressed and in what application context
OpenClipboard, GetClipboardData, and CloseClipboard --> Open the clipboard for access, gather its data, and then close the clipboard

64 Bit Malware

The vast majority of malware is 32 bit
We will see more 64 bit malware in the future as 64 bit systems become the standard
Two types of 64 bit malware have been common

Browser Helper Objects for 64 bit Internet Explorer
Device Drivers (rootkits) for Windows x64

Analyze 32-bit malware on 64-bit OS with caution

32 bit code running on a 64 bit operating system runs in the WOW64 subsystem
32 bit executables load 32 bit DLLs
32 bit DLLs are located in %SystemRoot%\Syswow64
32 bit processes reference Software hive registry values under Wow6432Node via registry redirection
Some executables run subtly differently under WOW64 than on a native 32 bit OS
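
When hunting for registry artifacts left by a 32 bit specimen on a 64 bit analysis machine, the redirected location is where to look. The sketch below is a deliberate simplification: it assumes every key under HKLM\Software is redirected, which is not true for all subkeys, and the key name used is hypothetical.

```python
def wow64_view(key_path: str) -> str:
    """Return where a 32 bit process typically lands for an HKLM\\Software key."""
    parts = key_path.split("\\")
    is_hklm = parts[0].upper() in ("HKLM", "HKEY_LOCAL_MACHINE")
    if is_hklm and len(parts) >= 2 and parts[1].lower() == "software":
        if len(parts) >= 3 and parts[2].lower() == "wow6432node":
            return key_path  # already the redirected view
        return "\\".join(parts[:2] + ["Wow6432Node"] + parts[2:])
    return key_path          # other locations are left alone in this sketch

print(wow64_view(r"HKLM\Software\Example\Settings"))
# HKLM\Software\Wow6432Node\Example\Settings
```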

64-Bit Assembly Differences

All general purpose registers are expanded to 64 bits
EAX --> RAX
There are eight new general purpose registers R8 --> R15
Special use registers are extended and renamed e.g. EIP --> RIP
RSP not RBP is often used to access parameters and variables
Calling convention resembles fastcall (parameters via registers)

First four parameters are passed in RCX RDX R8 R9
Additional parameters are stored on the stack

There is a new addressing mode (RIP + displacement)
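
The parameter-passing rule described above (first four parameters in RCX, RDX, R8, R9, the rest on the stack) can be sketched as a simple assignment. This covers integer and pointer arguments only, which is an assumption of the sketch; the helper name is illustrative.

```python
REGISTER_ORDER = ["RCX", "RDX", "R8", "R9"]

def place_arguments(args):
    """Split arguments into register-passed and stack-passed groups."""
    in_registers = dict(zip(REGISTER_ORDER, args))  # first four arguments
    on_stack = list(args[4:])                       # anything beyond four
    return in_registers, on_stack

registers, stack = place_arguments([1, 2, 3, 4, 5, 6])
print(registers)  # {'RCX': 1, 'RDX': 2, 'R8': 3, 'R9': 4}
print(stack)      # [5, 6]
```

When reading 64 bit code, this means the instructions just before a CALL tend to load RCX, RDX, R8, and R9 rather than PUSH values.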
