
Ground Truth for Binary Disassembly is Not Easy

Paper Info

  • Paper Name: Ground Truth for Binary Disassembly is Not Easy
  • Conference: USENIX ‘22
  • Author List: Chengbin Pang, Tiantai Zhang, Ruotong Yu, Bing Mao, Jun Xu
  • Link to Paper: here
  • Food: Starvation

Paper Summary

In this paper, the authors perform a systematic study of prior work on creating ground truth data for disassembling binaries. Ground truth disassembly is important because it can be used to evaluate and rank disassembly tools fairly, and it is also the data needed to train machine learning models for various disassembly tasks. More importantly, the authors show that using incomplete or imperfect ground truth as training data has a demonstrable impact on model accuracy, and that it can also produce misleading results (for example, manually crafted ground truth can bias the output). The authors identified and compared 5 previous approaches for generating ground truth disassembly: manual crafting, reusing existing disassembly (e.g. borrowing from IDA), utilizing intermediate compiler outputs, leveraging compilation metadata (symbols/debug info), and tracing the compilation process. Each approach was evaluated on precision (correctness of the disassembly), recall (coverage of the disassembly), generality (completeness of functionality, such as identifying jump tables), and extendibility (coverage of architectures). A table from the authors is shown below comparing the 5 existing methods for generating ground truth disassembly on these metrics.
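
To make the precision and recall metrics concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a disassembler's recovered instruction addresses could be scored against a ground truth set:

```python
# Minimal sketch (our illustration, not the authors' code): scoring a
# disassembler's recovered instruction start addresses against ground truth.

def score(recovered: set[int], ground_truth: set[int]) -> tuple[float, float]:
    """Return (precision, recall) over instruction start addresses."""
    true_positives = recovered & ground_truth
    precision = len(true_positives) / len(recovered) if recovered else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Example: two spurious addresses and one missed instruction.
gt = {0x1000, 0x1004, 0x1008, 0x100C}
rec = {0x1000, 0x1004, 0x1008, 0x2000, 0x2004}
print(score(rec, gt))  # -> (0.6, 0.75)
```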

Prior work is discussed for each of these methods, and the observed pros and cons of each approach are quantified and their implications demonstrated. Based on this evaluation of existing solutions, the authors concluded that tracing the compilation process is the best way to generate ground truth data, and then identified the current gaps in that solution, known as OracleGT. OracleGT works by instrumenting the compiler to insert metadata into the outputs of various compilation stages. Specifically (using GCC rather than Clang), the first instrumentation happens after the final RTL pass, where OracleGT inserts extra metadata in the form of labels marking functions, basic blocks, and instructions. Then, when the emitted assembly is fed into the GAS assembler, OracleGT uses the previously inserted labels to compute additional metadata about functions, basic blocks, and jump tables, and inserts it into a new section of the output object files. At link time these sections are merged into one, and OracleGT can extract the information to compute the ground truth disassembly.
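
To give a feel for this workflow, here is a hedged sketch of how such compiler-inserted metadata might be read back out of the final linked binary. The section name (".gt_metadata") and the record layout are hypothetical placeholders of our own; OracleGT's actual encoding may differ.

```python
# Hedged sketch of consuming compiler-inserted ground-truth metadata from a
# linked ELF binary. The section name and record layout below are hypothetical
# illustrations, not OracleGT's actual format.
import struct
from elftools.elf.elffile import ELFFile  # pip install pyelftools

def read_metadata(path: str) -> list[dict]:
    with open(path, "rb") as f:
        elf = ELFFile(f)
        sec = elf.get_section_by_name(".gt_metadata")  # hypothetical name
        if sec is None:
            return []
        data = sec.data()
        records = []
        # Assume fixed-size little-endian records: (kind, address, size).
        # kind: 0 = function, 1 = basic block, 2 = jump table entry.
        rec_size = struct.calcsize("<BQQ")
        for off in range(0, len(data) - rec_size + 1, rec_size):
            kind, addr, size = struct.unpack_from("<BQQ", data, off)
            records.append({"kind": kind, "addr": addr, "size": size})
        return records
```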

The authors then built upon the tool to improve its performance on the four evaluation metrics mentioned earlier. This starts with a post-compilation analysis phase that identifies three sets of missing ground truth data. To identify instructions encoded as data, OracleGT reconstructs the CFG of each function to find control-flow transfers into regions considered data, and then recursively disassembles from those locations to gather the missing instructions (sketched below). The next fix identifies overlapping instructions and anything that was originally missed during compilation, since OracleGT cannot process things it did not label. This is done by again checking jump targets for locations that point into the middle of instructions, and by looking for indirect calls that do not point to known functions. Finally, because OracleGT had trouble identifying manually written jump tables, the authors implemented a taint analysis-based solution that combines knowledge of contiguous data-to-code references (the jump table targets) with code-to-data references (ideally to the base address of the jump table). The authors also extended OracleGT's support, on both GCC and Clang, to other architectures including ARM32, AArch64, MIPS32, and MIPS64. This required some architecture-specific tweaks, such as handling ARM/Thumb mode switch points, supporting implicit jump tables, and taint-tracking jump tables that use the stack instead of only registers.
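
As an illustration of the data-encoded-instruction recovery step, the sketch below is our own (assuming an x86-64 target and the Capstone disassembler, not OracleGT's actual implementation): it disassembles from a control-flow target that lands in a region currently marked as data.

```python
# Minimal sketch (ours, not OracleGT's code): when a control-flow transfer
# targets a region marked as data, disassemble from that target to recover
# the instructions hiding there, stopping at an unconditional flow change.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64  # pip install capstone

md = Cs(CS_ARCH_X86, CS_MODE_64)

def recover_from_target(raw: bytes, region_base: int, target: int) -> list:
    """Disassemble from `target` inside a byte region previously marked as data."""
    insns = []
    offset = target - region_base
    for insn in md.disasm(raw[offset:], target):
        insns.append((insn.address, insn.mnemonic, insn.op_str))
        # Stop once the straight-line run unconditionally ends.
        if insn.mnemonic in ("ret", "jmp", "int3", "hlt"):
            break
    return insns
```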

To evaluate the improvements, the authors measured the accuracy of six popular disassemblers (angr, IDA, Binary Ninja, Ghidra, DYNINST, Radare2) at recovering glibc jump tables, using OracleGT to provide the ground truth to compare against. They did this for both the original version of OracleGT and the authors' modified version. From the graph, the improved version of OracleGT shows that most tools are less accurate than originally thought, with the notable exceptions of IDA, which gained a large amount of accuracy, and Radare2, which remained consistent between the two versions. Given this, the added improvements have a significant impact on estimating the accuracy of tools. The authors also evaluated the same disassemblers on a test suite across ARM32, AArch64, MIPS32, MIPS64, and x86/x64. From these results, the authors drew three key observations: the performance of modern disassemblers varies across architectures, ARM32 is harder to disassemble than AArch64 due to Thumb mode, and the commercial disassemblers struggle with MIPS binaries.

Discussion

The discussion of this paper was very positive. As has been the trend for papers from Pang, the authors did a great job at both motivating and explaining their work. In addition, the tools developed for this paper's evaluation were open-sourced. With this in mind, we thought the paper was great; however, there were still places in this paper that could use improvement.

In section 4 of this paper, the authors attempt to motivate why having ground-truth disassembly is important. We felt the "Impacts on Tool Evaluation" portion was weak at times. The comparison of state-of-the-art tools that use CFGs with and without OracleGT had some logical holes. In this comparison, it showed that tools without OracleGT perform worse when compared against the human-verified CFG. It is hard for us to tell at a glance that OracleGT always produces better CFGs than human analysis, which makes it difficult to attribute the tools' worse performance to an inaccurate CFG used in their evaluation. We felt this section could have been tighter with an analysis of how their CFG outperforms human-made CFGs.

In section 6 of this paper, the authors explain the approach and methods behind OracleGT. This section was concise, but we felt some important information about the implementation was missing. OracleGT is more accurate than a normal CFG because it instruments GCC at the intermediate-representation level, known as RTL. Some of our more technical reviewers felt that instrumenting only the RTL level of GCC was insufficient and that IPA (another set of GCC passes) should be handled as well. In short, the argument was that IPA has a significant effect on the CFG, and the authors should have either implemented support for it or mentioned it in the drawbacks section.

Finally, throughout the paper a large amount of information was presented via specific examples or case studies (as in section 5). We felt that the large-scale evaluation was a little unclear. We were unsure how the authors selected the targets they did, or why in some cases OracleGT did worse. We felt a unifying story for these sections was missing.

In conclusion, we thought this paper presented a promising approach that will certainly help the field of binary analysis. All our reviewers accepted this paper. Even with its flaws, we still felt this paper pushed the field forward.

This post is licensed under CC BY 4.0 by the author.