The problem
Ghidra’s decompiler produces syntactically valid C, but the output is noisy. Negative numbers appear as large unsigned hex constants, modulo operations are expanded into division chains, and placeholder types like undefined4 litter the output. This noise wastes LLM tokens and can confuse the model.
Kong runs four normalization passes on every function’s decompilation before sending it to the LLM.
Normalization passes
1. Negative literal recovery
Ghidra often represents negative numbers as large unsigned hex values or with awkward + -N syntax.
Before:
x = y + -5;
offset = base + 0xfffffffc; // This is actually -4
After:
x = y - 5;
offset = base - 4;
The transformation catches patterns like + -N and converts them to - N, making the code more readable.
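As a rough illustration, this pass can be sketched with a couple of regular expressions. This is a simplified Python sketch, not Kong's actual implementation; the function name and the 32-bit two's-complement cutoff heuristic are assumptions.

```python
import re

def fold_negative_literals(code: str) -> str:
    """Hypothetical sketch of negative literal recovery.
    Rewrites '+ -N' as '- N' and folds large unsigned 8-digit hex
    constants back into small negative decimals, assuming 32-bit
    two's-complement values."""
    # x = y + -5;  ->  x = y - 5;
    code = re.sub(r"\+\s*-\s*(\w+)", r"- \1", code)

    # base + 0xfffffffc  ->  base - 4
    # Only fold values near the top of the 32-bit range, to avoid
    # mangling constants that really are large unsigned numbers.
    def fold_hex(m: re.Match) -> str:
        val = int(m.group(1), 16)
        if val >= 0xFFFFF000:  # assumed "small negative" cutoff
            return f"- {0x100000000 - val}"
        return m.group(0)

    code = re.sub(r"\+\s*(0x[0-9a-fA-F]{8})\b", fold_hex, code)
    return code
```

A real pass would also need to respect the operand's declared width (a 0xfffffffc added to a 64-bit value is not -4), which is why the cutoff here is only a heuristic.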
2. Modulo operation recovery
Compilers optimize x % N into a division-multiply-subtract pattern. Ghidra decompiles the optimized form, which obscures the original modulo:
Before:
remainder = x - (x / 5) * 5;
After:
remainder = x % 5;
Kong also handles the cast-wrapped variant that appears with certain compiler optimizations:
Before:
r = (int)(x) + (int)((x) / 7) * -7;
After:
r = (int)(x) % 7;
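Both rewrites follow directly from the identity x % N == x - (x / N) * N under C's truncating integer division. A minimal Python sketch of this pass, assuming the exact spacing shown above (a production pass would match on the decompiler's AST or tolerate arbitrary whitespace):

```python
import re

def recover_modulo(code: str) -> str:
    """Hypothetical sketch of modulo recovery. Collapses the
    division-multiply-subtract chain back into '%', handling both
    the plain and the cast-wrapped variant."""
    # x - (x / N) * N  ->  x % N
    code = re.sub(
        r"(\w+)\s*-\s*\(\s*\1\s*/\s*(\w+)\s*\)\s*\*\s*\2",
        r"\1 % \2", code)
    # (int)(x) + (int)((x) / N) * -N  ->  (int)(x) % N
    code = re.sub(
        r"\(int\)\((\w+)\)\s*\+\s*\(int\)\(\(\1\)\s*/\s*(\w+)\)\s*\*\s*-\2",
        r"(int)(\1) % \2", code)
    return code
```

The backreferences (\1, \2) are what make this safe: the rewrite only fires when the same variable and the same divisor appear in all three positions, so unrelated division-and-multiply code is left alone.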
3. Undefined type inference
Ghidra uses placeholder types like undefined4 (4-byte unknown) and undefined8 (8-byte unknown) when it can’t determine the actual type. Kong infers types from usage context:
Before:
undefined4 i;
for (i = 0; i < count; i++) {
// loop body
}
After:
int i;
for (i = 0; i < count; i++) {
// loop body
}
The rules:
undefined4 → int when the variable is used as a loop counter or accumulator (initialized to 0, incremented)
undefined8 → long when the variable is compared to NULL, assigned from a DAT_ global, or cast-dereferenced as a pointer
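The rules above can be approximated with simple pattern matching. The sketch below is an assumption about how such a pass might look in Python, showing only the loop-counter rule for undefined4 and the null-comparison rule for undefined8; Kong's actual heuristics are richer.

```python
import re

def infer_undefined_types(code: str) -> str:
    """Hypothetical sketch of placeholder-type inference.
    undefined4 -> int when the variable is a zero-initialized,
    incremented loop counter; undefined8 -> long when the variable
    is compared against a null pointer."""
    # undefined4 -> int for loop counters
    for m in re.finditer(r"undefined4 (\w+);", code):
        var = m.group(1)
        if re.search(rf"for \({var} = 0;.*{var}\+\+\)", code):
            code = code.replace(f"undefined4 {var};", f"int {var};")

    # undefined8 -> long for null-compared variables
    for m in re.finditer(r"undefined8 (\w+);", code):
        var = m.group(1)
        if re.search(rf"{var} [=!]= \(.*\*\)0x0", code):
            code = code.replace(f"undefined8 {var};", f"long {var};")
    return code
```

Because the evidence for a type can appear anywhere in the function body, each declaration is checked against the whole text rather than just the following line.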
4. Dead null-assignment removal
Inside null-check blocks, Ghidra sometimes emits redundant assignments that set the variable being checked to (type *)0x0. These are artifacts of decompilation, not real logic:
Before:
if (ptr == (char *)0x0) {
ptr = (char *)0x0; // dead assignment — ptr is already null here
return -1;
}
After:
if (ptr == (char *)0x0) {
return -1;
}
Impact
These transformations are small individually but compound across a binary. In a 500-function analysis, normalization typically reduces total token usage by 10-15% and improves LLM accuracy by removing confusing patterns that the model might misinterpret.
Last modified on March 20, 2026