50 Thousand Needles in 5 Million Haystacks: Understanding Old Malware Tricks to Find New Malware Families

Abstract

The malware landscape is characterised by rapid and constant evolution. Defenders often find themselves one step behind, which at best results in monetary losses and in the most extreme cases endangers human lives. Corporations, given the unique challenges they face, must assume that sooner or later malware infections will get through their security perimeter. Efforts should therefore focus on early detection, so that threats can be contained and mitigated before they cause substantial damage. Even today's stealthiest malware, if it is controlled remotely, needs active network communication to report back to the attacker. This activity gives us a visibility advantage. We now have the computational power and mechanisms to process huge amounts of data, and machine learning gives us the algorithms to analyse network data for specific types of behaviour. The challenge is how to use this technology to detect what matters most: malicious behaviours that pose a high risk to companies. In this talk we address four key challenges related to automatic malware detection in network traffic: how to detect malware that changes its network behaviour over time (e.g. by changing different parts of the URL), how to mitigate potential mislabelling of the training data, and how to perform large-scale multi-class detection. We also introduce a training mechanism that automates the learning process and improves the precision of the classifiers. We present unique algorithms that help to solve different problems in each of the identified challenges. The results of our research form part of a working intrusion detection system that consumes real network traffic from more than 5 million users per day. We show how these methods can be used to learn from well-known malware samples, generalise their behaviour and consequently find novel threats.
We illustrate the detection performance of each algorithm with real examples of malware detected by the algorithms described in this work. We also explain how these infections would otherwise have been missed by traditional detection tools.
