The problem

Ghidra’s decompiler produces syntactically valid C, but it’s noisy. Negative numbers appear as unsigned hex, modulo operations are expanded into division chains, and undefined types litter the output. This noise wastes LLM tokens and can confuse the model. Kong runs four normalization passes on every function’s decompilation before sending it to the LLM.

Normalization passes

1. Negative literal recovery

Ghidra often represents negative numbers as large unsigned values or with awkward + - syntax. Before:
x = y + -5;
offset = base + 0xfffffffc;  // This is actually -4
After:
x = y - 5;
offset = base - 4;
The transformation catches patterns like + -N and converts them to - N, making the code more readable.
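Both forms can be recovered with a couple of regular-expression substitutions. Here is a minimal sketch of the idea (not Kong’s actual code; the function name and the 32-bit default are assumptions for illustration):

```python
import re

def recover_negative_literals(code: str, int_bits: int = 32) -> str:
    """Rewrite Ghidra's awkward negative-number forms. Illustrative sketch only."""
    # "+ -N"  ->  "- N"
    code = re.sub(r'\+\s*-\s*(\d+)', r'- \1', code)

    # "+ 0xXXXXXXXX" -> "- N" when the hex constant is a plausible
    # two's-complement negative for the given word size (high bit set).
    def hex_to_signed(m: re.Match) -> str:
        value = int(m.group(1), 16)
        if value >= 1 << (int_bits - 1):          # high bit set: negative
            return f'- {(1 << int_bits) - value}'
        return m.group(0)                          # leave small constants alone

    return re.sub(r'\+\s*0x([0-9a-fA-F]{8})\b', hex_to_signed, code)
```

The word-size parameter matters: 0xfffffffc is -4 only if the literal is 32 bits wide, so a real implementation would consult the operand’s type rather than assume.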

2. Modulo operation recovery

Compilers optimize x % N into a division-multiply-subtract pattern. Ghidra decompiles the optimized form, which obscures the original modulo. Before:
remainder = x - (x / 5) * 5;
After:
remainder = x % 5;
Kong also handles the cast-wrapped variant that appears with certain compiler optimizations. Before:
r = (int)(x) + (int)((x) / 7) * -7;
After:
r = (int)((x) % 7);
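Both variants reduce to pattern matching on the expression tree; a rough text-level sketch looks like this (simplified for illustration, and only handling single-identifier operands and literal divisors, which Kong’s real matcher would not be limited to):

```python
import re

def recover_modulo(code: str) -> str:
    """Collapse division-multiply-subtract chains back into '%'. Sketch only."""
    # Plain form:  V - (V / N) * N   ->   V % N
    code = re.sub(
        r'(\w+)\s*-\s*\(\s*\1\s*/\s*(\d+)\s*\)\s*\*\s*\2',
        r'\1 % \2',
        code,
    )
    # Cast-wrapped form:  (int)(V) + (int)((V) / N) * -N  ->  (int)((V) % N)
    code = re.sub(
        r'\(int\)\((\w+)\)\s*\+\s*\(int\)\(\(\1\)\s*/\s*(\d+)\)\s*\*\s*-\2',
        r'(int)((\1) % \2)',
        code,
    )
    return code
```

The backreferences (\1, \2) are what make this safe: the rewrite fires only when the same variable and the same constant appear in all the expected positions, so unrelated subtractions are left untouched.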

3. Undefined type inference

Ghidra uses placeholder types like undefined4 (4-byte unknown) and undefined8 (8-byte unknown) when it can’t determine the actual type. Kong infers types from usage context. Before:
undefined4 i;
for (i = 0; i < count; i++) {
    // loop body
}
After:
int i;
for (i = 0; i < count; i++) {
    // loop body
}
The rules:
  • undefined4 → int when the variable is used as a loop counter or accumulator (initialized to 0, incremented)
  • undefined8 → long when the variable is compared to NULL, assigned from a DAT_ global, or cast-dereferenced as a pointer
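The rules above amount to a few usage-pattern checks per declaration. A simplified sketch of the heuristic (the function name is invented here, and a real pass would work on the parsed AST rather than on raw text):

```python
import re

def infer_undefined_types(code: str) -> str:
    """Replace Ghidra placeholder types using usage heuristics. Sketch only."""
    for decl in re.finditer(r'\b(undefined[48])\s+(\w+)\s*;', code):
        placeholder, var = decl.group(1), decl.group(2)
        # Loop counter: initialized to 0 inside a for-header.
        is_counter = re.search(rf'for\s*\(\s*{var}\s*=\s*0', code)
        # Pointer-like: compared to a cast null, or assigned from a DAT_ global.
        is_pointer = re.search(
            rf'{var}\s*==\s*\(\w+\s*\*\)0x0|{var}\s*=\s*DAT_', code)
        if placeholder == 'undefined4' and is_counter:
            code = code.replace(f'{placeholder} {var};', f'int {var};')
        elif placeholder == 'undefined8' and is_pointer:
            code = code.replace(f'{placeholder} {var};', f'long {var};')
    return code
```

When no rule matches, the placeholder is left alone: a wrong concrete type would mislead the LLM more than undefined4 does.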

4. Dead null-assignment removal

Inside null-check blocks, Ghidra sometimes emits redundant assignments that set the variable being checked to (type *)0x0. These are artifacts of decompilation, not real logic. Before:
if (ptr == (char *)0x0) {
    ptr = (char *)0x0;  // dead assignment — ptr is already null here
    return -1;
}
After:
if (ptr == (char *)0x0) {
    return -1;
}
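The pass only has to match a very specific shape: a null-check on a variable, immediately followed by an assignment of the same variable to a null of the same cast type. A minimal sketch (illustrative only; not Kong’s implementation):

```python
import re

def remove_dead_null_assignments(code: str) -> str:
    """Drop 'v = (T *)0x0;' directly inside 'if (v == (T *)0x0) {'. Sketch only."""
    pattern = re.compile(
        r'(if\s*\(\s*(\w+)\s*==\s*\(([^)]*)\)0x0\s*\)\s*\{\s*\n)'  # the null check
        r'\s*\2\s*=\s*\(\3\)0x0;[^\n]*\n'                          # redundant re-assignment
    )
    return pattern.sub(r'\1', code)  # keep the check, delete the assignment
```

Requiring the assignment to be the first statement of the block keeps the rewrite conservative: an assignment further down could sit after code that changed the variable, where it would no longer be dead.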

Impact

These transformations are small individually but compound across a binary. In a 500-function analysis, normalization typically reduces total token usage by 10-15% and improves LLM accuracy by removing confusing patterns that the model might misinterpret.

Last modified on March 20, 2026