An increasing number of modern antivirus solutions rely on machine learning (ML) techniques to protect users from malware. While ML-based approaches, like FireEye Endpoint Security’s MalwareGuard capability, have done a great job at detecting new threats, they also come with substantial development costs. Creating and curating a large set of useful features takes significant amounts of time and expertise from malware analysts and data scientists (note that in this context a feature refers to a property or characteristic of the executable that can be used to distinguish between goodware and malware). In recent years, however, deep learning approaches have shown impressive results in automatically learning feature representations for complex problem domains, like images, speech, and text. Can we take advantage of these advances in deep learning to automatically learn how to detect malware without costly feature engineering?
As it turns out, deep learning architectures, and in particular convolutional neural networks (CNNs), can do a good job of detecting malware simply by looking at the raw bytes of Windows Portable Executable (PE) files. Over the last two years, FireEye has been experimenting with deep learning architectures for malware classification, as well as methods to evade them. Our experiments have demonstrated surprising levels of accuracy that are competitive with traditional ML-based solutions, while avoiding the costs of manual feature engineering. Since the initial presentation of our findings, other researchers have published similarly impressive results, with accuracy upwards of 96%.