Effective Representation Learning for Binary Code
Sponsored by WTD81 through APERITIF: Analysis Pipeline for Effective vulneRability IdenTIfication through Fuzzing and by the Royal Holloway Centre for Doctoral Training in Cyber Security.
Deep learning has revolutionized natural language processing and computer vision by stepping away from manual feature engineering and instead training deep neural networks on large quantities of data with minimal preprocessing. There have been a number of attempts to apply deep learning to tasks related to program understanding; while results have been encouraging, the real-world performance of these systems is still lacking. Existing systems mostly learn the syntax of programs and derive very little semantic insight.
The goal of this project is to develop a methodology for effective representation learning for binary code, by performing preprocessing that aims to generalize code away from specific syntax. To this end, we leverage program analysis methods for abstracting semantics. With this methodology, we will learn a large-scale general purpose model for code that can be deployed for different applications.
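As a rough illustration of what "generalizing code away from specific syntax" can mean at the simplest level, the sketch below replaces concrete registers and immediate values in disassembled x86-64 instructions with placeholder tokens. This is an assumption made for illustration, not the project's actual preprocessing, which relies on program analysis to abstract semantics; the function names, the `REG`/`IMM` tokens, and the deliberately incomplete register list are all hypothetical.

```python
import re

# Hypothetical, deliberately incomplete pattern for x86-64 general-purpose
# registers (e.g. rax, eax, ax, r8, r10d). Real tooling would need a full list.
REGISTER = re.compile(
    r"\b(?:[re]?(?:ax|bx|cx|dx|si|di|sp|bp)|r(?:[89]|1[0-5])[dwb]?)\b"
)

# Hexadecimal or decimal literals that stand alone as operands or offsets.
IMMEDIATE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def normalize_instruction(instruction: str) -> str:
    """Replace concrete registers and immediates with placeholder tokens."""
    text = instruction.strip().lower()
    text = REGISTER.sub("REG", text)   # which register is used is abstracted away
    text = IMMEDIATE.sub("IMM", text)  # concrete constants are abstracted away
    return text


def normalize_function(instructions: list[str]) -> list[str]:
    """Normalize every instruction of a (hypothetical) disassembled function."""
    return [normalize_instruction(ins) for ins in instructions]


if __name__ == "__main__":
    for line in normalize_function(["mov eax, 5", "add rsp, 0x20"]):
        print(line)
```

With this kind of normalization, two instruction sequences that differ only in register allocation or constant values map to the same token sequence, so a model trained on the tokens is less tied to compiler-specific surface details. It remains a purely syntactic rewrite and does not capture semantics.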
Applications we will investigate include (a) the reconstruction of metadata in binary code to aid reverse engineering and to help target fuzzing campaigns (as part of the APERITIF project on a general-purpose fuzzing pipeline); (b) code similarity analysis to discover derived code across compiler versions and settings; and (c) the detection of previously unseen malicious code.
Software
- XFL: Labeling unknown functions in binaries using Extreme Multilabel Learning [ GitHub ]
- Punstrip: Labeling unknown functions using Conditional Random Fields [ GitHub ]
Publications
Tristan Benoit, Yunru Wang, Moritz Dannehl, and Johannes Kinder. BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding. Tech. rep. arXiv:2409.07889, arXiv, 2024.
@techreport{blens-arxiv,
  author = {Tristan Benoit and Yunru Wang and Moritz Dannehl and Johannes Kinder},
  title = {BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding},
  institution = {arXiv},
  number = {arXiv:2409.07889},
  year = {2024},
  url = {https://arxiv.org/abs/2409.07889},
}
James Patrick-Evans, Moritz Dannehl, and Johannes Kinder. XFL: Naming Functions in Binaries with Extreme Multi-label Learning. In Proc. IEEE Symp. Security and Privacy (S&P), pp. 1677–1692, IEEE, 2023.
@inproceedings{oakland23-xfl,
  author = {James Patrick-Evans and Moritz Dannehl and Johannes Kinder},
  title = {{XFL}: Naming Functions in Binaries with Extreme Multi-label Learning},
  booktitle = {Proc. IEEE Symp. Security and Privacy (S\&P)},
  pages = {1677--1692},
  publisher = {IEEE},
  year = {2023},
  doi = {10.1109/SP46215.2023.00096},
}
James Patrick-Evans, Moritz Dannehl, and Johannes Kinder. XFL: eXtreme Function Labeling. Tech. rep. arXiv:2107.13404, arXiv, 2021.
@techreport{xfl-arxiv,
  author = {James Patrick-Evans and Moritz Dannehl and Johannes Kinder},
  title = {{XFL}: eXtreme Function Labeling},
  institution = {arXiv},
  number = {arXiv:2107.13404},
  year = {2021},
  url = {https://arxiv.org/abs/2107.13404},
}
James Patrick-Evans, Lorenzo Cavallaro, and Johannes Kinder. Probabilistic Naming of Functions in Stripped Binaries. In Proc. 35th Annu. Computer Security Applications Conference (ACSAC), pp. 373–385, ACM, 2020.
@inproceedings{acsac20-punstrip,
  author = {James Patrick-Evans and Lorenzo Cavallaro and Johannes Kinder},
  title = {Probabilistic Naming of Functions in Stripped Binaries},
  booktitle = {Proc. 35th Annu. Computer Security Applications Conference (ACSAC)},
  pages = {373--385},
  doi = {10.1145/3427228.3427265},
  year = {2020},
  publisher = {ACM},
}