Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften
In the first post on edit distance, we looked at hunting for malicious executables with edit distance (i.e., how many character changes it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain features to pinpoint suspicious activity.
Here is the Background
What are bad actors doing with malicious domains? It may be as simple as using a close misspelling of a popular domain to trick careless users into viewing ads or picking up adware. Legitimate sites are slowly catching on to this technique, sometimes called typosquatting.
Other malicious domains are the product of domain generation algorithms, which can be used to do all kinds of nefarious things, such as evading countermeasures that block known compromised sites, or overwhelming domain servers in a distributed DoS attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.
Edit distance can help with both use cases: let’s see how. First, we’ll exclude common domain names, since these are generally safe. A list of common domain names also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domain names and avoid subdomains (e.g. ziften.com, not www.ziften.com).
After data cleaning, we compare each candidate domain (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, but now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domain names that are one step away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in the first post is useful here), we can also find anomalous domain names in the edit distance space.
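The nearest-neighbor search described above can be sketched in a few lines of Python. This is a minimal illustration, not the code Ziften runs: the helper names (`edit_distance`, `split_domain`, `nearest_neighbor`) are hypothetical, and the normalization used here (distance divided by the length of the longer name) is an assumption, since the exact definition from the first post is not repeated in this one.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def split_domain(domain: str):
    """Split 'ziften.com' into ('ziften', 'com'); assumes no subdomains."""
    name, _, tld = domain.lower().rpartition(".")
    return name, tld


def nearest_neighbor(candidate: str, known_domains):
    """Return (neighbor, distance, normalized_distance), or None if the
    candidate's top-level domain has no known domains to compare against."""
    name, tld = split_domain(candidate)
    best = None
    for known in known_domains:
        known_name, known_tld = split_domain(known)
        # Only compare within the same top-level domain, and skip exact matches.
        if known_tld != tld or known_name == name:
            continue
        distance = edit_distance(name, known_name)
        if best is None or distance < best[1]:
            best = (known.lower(), distance, known_name)
    if best is None:
        return None
    neighbor, distance, known_name = best
    # Assumed normalization: divide by the longer of the two names.
    normalized = distance / (max(len(name), len(known_name)) or 1)
    return neighbor, distance, normalized


common = ["google.com", "wikipedia.org", "ziften.com"]
print(nearest_neighbor("gooogle.com", common))
```

A distance of 1 flags a likely typo, while a normalized distance close to 1 marks a name that resembles nothing in the baseline list. The loop compares the candidate against every known domain in its top-level domain, which is fine for experimenting but would need indexing or blocking at scale.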
Exactly what were the Results?
Let’s look at how these results appear in practice. Be careful when navigating to these domain names, since they might contain malicious content!
Here are a few potential typos. Typo-squatters target popular domains because there are more chances that someone will visit them. Several of these are suspect according to our threat feed partners, but there are some false positives too, with cute names like “wikipedal”.
Here are some odd-looking domains far from their nearest neighbors.
So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of the nearest neighbor, distance from the nearest neighbor, and a flag for being at edit distance 1 from the nearest neighbor, which indicates a risk of typo tricks. Other features that could work well with these include other lexical features, such as word and n-gram distributions, entropy, and string length, and network features, such as the total count of failed DNS requests.
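To make the lexical features mentioned above more concrete, here is a small Python sketch that computes string length, character entropy, and character bigram counts for a domain name. The function names and the choice of Shannon entropy over characters are assumptions for illustration; the post does not specify how these features were computed.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, e.g. 'go', 'oo', 'og' for n=2."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def lexical_features(domain: str) -> dict:
    """Features for the name part of a domain (the TLD is dropped)."""
    lowered = domain.lower()
    # Fall back to the whole string if there is no dot to split on.
    name = lowered.rpartition(".")[0] or lowered
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "bigrams": char_ngrams(name, 2),
    }


print(lexical_features("ziften.com"))
```

Randomly generated names tend to have higher character entropy and rarer bigrams than names built from dictionary words, which is why these features may complement the edit distance features in a model.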
Streamlined Code that you can Experiment with
Here is a simplified version of the code to play with! It was written for HP Vertica, but the SQL should run on most modern databases. Note that the Vertica editDistance function may go by a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).