Deep Dive In K-Means Clustering

Jul 19, 2021

K-means is one of the most commonly used unsupervised learning algorithms for finding insights in data through exploratory analysis.

In supervised learning, many algorithms help us get a desired output from the data, because every record comes with an expected output to learn from. In unsupervised learning there is no expected output, so we have no way to compare our results against it. Instead, unsupervised learning helps us find insights in the data and understand its structure.

K-Means Clustering

K-means is one of the best-known algorithms in unsupervised learning. It separates the records (data points) into k predefined groups, called clusters, using an iterative approach. The most important fact is that the clusters are non-overlapping: each data point lies in exactly one cluster. The algorithm arranges the data points so that each one is as close as possible to the centroid (the mean) of its own cluster.
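One way to make "as close as possible to the centroid" precise is the within-cluster sum of squared distances, which is the quantity K-means is usually described as minimizing. The symbols below are notation chosen for this sketch and do not come from the original text: x_i is a data point, C_j is the set of points in cluster j, and \mu_j is the centroid of that cluster.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit more tightly around their centroids.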

It works in three steps as follows:

Define k, the number of clusters.
Shuffle the data points and select k of them as the initial centroids, one for each cluster.
Iterate over the data points, assigning each one to the cluster with the closest centroid, and then recompute each centroid as the mean of the points assigned to it.

The algorithm repeats the third step until the assignments stop changing. At that point, every data point belongs to the cluster whose centroid is closest to it.
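The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name kmeans, the max_iter and seed parameters, the use of Euclidean distance, the choice to leave an empty cluster's centroid where it was, and the toy data are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Steps 1 and 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return centroids, labels


# Toy data (made up for this example): two loose groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

When the loop stops, each entry in labels is the index of the centroid nearest to the matching point in data.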

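The approach K-means takes follows the pattern of expectation-maximization (EM): assigning each point to its closest centroid plays the role of the expectation step, and recomputing the centroids plays the role of the maximization step.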

Precautions:

K-means uses a distance-based approach to find the centroids and assign points to clusters, so features measured on larger scales dominate the result. Standardize the data before fitting the model.
The initial centroids are picked at random, so different initializations may lead to different results. It is advisable to run the algorithm with several initializations and compare the results.
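Both precautions can be sketched together in plain Python. The helper names standardize, kmeans_once and best_of, the n_init and seed parameters, and the age and income records are all hypothetical and chosen only for this illustration. Comparing runs by inertia (the total squared distance from each point to its centroid) is one common way to "check results", assumed here rather than taken from the original text.

```python
import math
import random
import statistics


def standardize(points):
    """Rescale every feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stdevs))
        for p in points
    ]


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (inertia, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    # Inertia: total squared distance from each point to its own centroid.
    inertia = sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )
    return inertia, labels


def best_of(points, k, n_init=10, seed=0):
    """Repeat k-means from different random starts; keep the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


# Hypothetical records: (age in years, income in dollars).
raw = [(25, 40_000), (27, 42_000), (26, 41_000),
       (55, 90_000), (58, 95_000), (56, 92_000)]
scaled = standardize(raw)
inertia, labels = best_of(scaled, k=2)
print(labels, inertia)
```

Without the standardize call, income differences in the thousands would swamp age differences in the tens when distances are computed, so age would have almost no say in the clustering.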

Common Applications of the K-Means:

It gives insight into the structure of data when combined with simple visualization techniques.
It is also used in cybersecurity tools to detect the kind of attack taking place.
It can also be used for image compression and for segmenting geyser eruptions.

SecBI Comprehensive Threat Detection Using Cluster Analysis:

SecBI can discover many advanced threats that can only be detected at the cluster level. In one recent example, SecBI was able to detect a fragmented exfiltration from several infected devices, in which the attacker used multiple servers under their control and sent small chunks of data to each server without crossing predefined server thresholds. During the attack, 5GB of data was extracted to multiple destinations. However, once deployed, SecBI was able to easily detect the attack due to the multiple indicators of compromise:

● Large total upload in a single cluster

● Multiple servers accessed by only a few machines, at a time when other machines didn’t access these servers at all

● Beaconing behavior to multiple servers

● Machine-like behavior to most destinations: similar upload size, similar response size, mostly direct connections, etc.

It’s only possible to detect these indicators by looking at the entire network using cluster analysis.

Conclusions

Cyberattacks are becoming more and more sophisticated, which makes them harder to detect. The cybersecurity industry must find new methods of detection to outsmart the attackers. By applying methods of data analysis such as machine learning and cluster-based analysis, it becomes possible to sift through huge volumes of data and identify where the new threats exist. SecBI has developed a solution that uses machine learning and cluster analysis to protect organizations from the next generation of cyberattacks.
