Beyond the Blacklists: Detecting Malicious URL Through Machine Learning

Abstract

Many types of modern malware use HTTP-based communication. Network-level behavioral signatures and models have some advantages for malware detection over traditional AV signatures and system-level behavioral models. Here we present a novel malware detection method based on URL behavioral modeling. The method takes advantage of the code reuse that is common across many types of malware. From a large corpus of known malware samples, we distill concise feature models that capture the similarities among many different malware connection behaviors; the models can then be used to detect unknown malware variants that share the same network traits. We focus on HTTP connections because HTTP is the protocol that malicious software most often uses to phone home, fetch updates, and receive commands to start an attack. Examining traits at the HTTP connection level has proved to be an efficient way to detect malicious connections.

In our next-generation firewall appliance, we already had algorithms that examine the connection domain name, URL path, and user-agent, using static blacklists and signatures to identify malicious user-agents and URL paths. Combined with a machine learning algorithm for DGA (domain generation algorithm) domain detection, these achieved a good malicious URL detection rate. However, the most complex and challenging part is the dynamic content in the URL query string. Static signature rules become less effective there because those strings are so diverse that they can be virtually anything. Variance and evolution in connection parameters make signature generation time-consuming, and the signature library needs frequent updates to keep up with newly emerging connection features. The novel clustering algorithm we present in this talk is highly efficient: it detects not only known malicious URLs but also new variants that have yet to be exposed (0-day). The model was learned from 800,000 URLs taken from malware samples, with about 10k added in weekly updates.
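To make the query-string problem concrete, the sketch below shows one simple way to generalize dynamic query strings so that variants with different values still land in the same group. It is an illustration only, not the clustering algorithm presented in the talk: the character classes, the function names (`value_class`, `query_template`, `cluster_by_template`), and the example URLs are all assumptions introduced here.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def value_class(value: str) -> str:
    """Map a parameter value to a coarse character class (hypothetical scheme)."""
    if value == "":
        return "EMPTY"
    if re.fullmatch(r"[0-9]+", value):
        return "NUM"
    if re.fullmatch(r"[0-9a-fA-F]+", value):
        return "HEX"
    if re.fullmatch(r"[A-Za-z]+", value):
        return "ALPHA"
    if re.fullmatch(r"[A-Za-z0-9]+", value):
        return "ALNUM"
    return "OTHER"


def query_template(url: str) -> tuple:
    """Reduce a URL's query string to (parameter name, value class) pairs."""
    query = urlsplit(url).query
    pairs = parse_qsl(query, keep_blank_values=True)
    return tuple((name, value_class(value)) for name, value in pairs)


def cluster_by_template(urls):
    """Group URLs whose query strings share the same structural template."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[query_template(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up URLs standing in for connections seen in known samples.
    known = [
        "http://a.example/gate.php?id=1234&ver=10&data=deadbeef",
        "http://b.example/gate.php?id=98&ver=2&data=00ff17",
        "http://c.example/index.html?q=hello",
    ]
    clusters = cluster_by_template(known)
    unseen = "http://d.example/gate.php?id=777&ver=11&data=cafe01"
    # The unseen URL has new values but the same structure as the first two.
    print(query_template(unseen) in clusters)
```

Exact template matching is the crudest possible grouping; a real system would presumably use a similarity measure over richer features, but the example shows why structure can survive where literal signatures on the values do not.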