Should I click on a link? Machine Learning to Protect from Cyber Attacks on the Web

by František Střasák

The detection of unsafe websites poses a challenging task for the security community because their attack techniques are varied, advanced and dangerous. There are many types of unsafe websites that can infect users' devices or steal their sensitive data. The most prevalent type of unsafe website is the evil twin website, which uses phishing techniques to steal sensitive data and credentials from users. Evil twin websites are clones that imitate real websites to trick users into using them. Users who judge the authenticity of a website by its appearance can therefore be defrauded into entering sensitive information on the evil twin website. To detect these unsafe websites, previous studies have mainly used blacklists, but blacklists need constant updates whenever a new URL appears. As a result, this approach does not protect against new and current threats. Another common solution is to detect the website by analyzing the URL string, which may show satisfactory results under certain conditions. However, the complexity of domain names and URL parameters makes this approach error-prone as well. Since websites offer much more information than only a URL, this thesis proposes novel methods to detect unsafe and evil twin websites based on the analysis of the behavior, content, and structure of websites. The structure refers to the HTML structure; the content and the behavior refer to a large group of features extracted from the urlscan.io service, which provides a complex description of websites. To fulfill its goal of better detecting unsafe websites, this thesis is divided into two main parts. The first part focuses on the detection of unsafe websites in general by using different sets of features. The second part concentrates specifically on the detection of evil twin websites. For both problems we created and published our own datasets, which can be useful for the whole community.
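To make the idea of HTML structure features more concrete, the following is a minimal sketch of how a few structural descriptors could be computed from a page's HTML. It is an illustration only and not the feature set used in the thesis: the class name TagStructureParser, the function structure_features, and the four chosen features are assumptions made for this example. It relies only on the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStructureParser(HTMLParser):
    """Records opening tags, nesting depth, and password inputs.

    Hypothetical helper for illustration; not taken from the thesis.
    """

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.tags = []
        self.depth = 0
        self.max_depth = 0
        self.password_inputs = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        # Attribute values may be None (e.g. <input disabled>), hence the "or".
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.password_inputs += 1
        if tag not in self.VOID:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1


def structure_features(html):
    """Return a small dictionary of structural features for one HTML page."""
    parser = TagStructureParser()
    parser.feed(html)
    parser.close()
    counts = Counter(parser.tags)
    return {
        "num_tags": len(parser.tags),
        "max_depth": parser.max_depth,
        "num_forms": counts["form"],
        "num_password_inputs": parser.password_inputs,
    }


if __name__ == "__main__":
    page = ('<html><body><form><input type="password"><br>'
            '</form></body></html>')
    print(structure_features(page))
```

A feature dictionary of this kind could be computed for every page in a dataset and passed to a classifier. Which structural features are actually informative for separating unsafe or evil twin websites from legitimate ones is an empirical question that this sketch does not answer.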

This thesis presents evidence that features from the content, behavior and structure of websites play an essential role in detecting cyber attacks on websites. The results show that our models are able to distinguish between unsafe and legitimate websites with an accuracy of 92.69% and between evil twin websites and legitimate websites with an accuracy of 95.28%. Detecting unsafe websites is a hard problem because they keep evolving, but we believe that this thesis advances research on detecting this threat.
