Phishing Site Detection

Paper Link: https://arxiv.org/abs/2311.12372

Background

Phishing detection is an increasingly critical area in the realm of cybersecurity, addressing the pervasive threat that phishing attacks pose to users' privacy, data security, and trust in digital communications. Phishing, a form of social engineering attack, typically involves deceiving individuals into revealing sensitive information, clicking malicious links, or performing actions that compromise their security. The evolving sophistication of these attacks underscores the urgent need for robust detection mechanisms that can adapt to the changing tactics of adversaries.

The importance of phishing detection extends beyond protecting individual users; it is vital for maintaining the integrity and security of entire digital ecosystems. Effective detection tools help safeguard personal and financial information, preserve the reputation of businesses, and ensure the trustworthiness of online platforms. As we transition into the era of Web3.0, characterized by decentralized networks, blockchain technologies, and a greater emphasis on user sovereignty and data privacy, the landscape of phishing attacks and the strategies for their detection must evolve correspondingly.

Machine learning-based phishing detection technologies, with their robust data processing and learning capabilities, are increasingly supplanting traditional rule-based and signature-based detection methods. Conventional web features, such as page behavior, content, and HTML code, can be harnessed to construct efficient phishing detection models. However, phishing links often have a short lifespan, rendering a vast archive of phishing web page records inaccessible. This scenario limits researchers' ability to retrieve and utilize information related to web content, behavior, or HTML code. Consequently, utilizing URLs to train machine learning models has become a predominant method for phishing detection. Given that URLs serve as gateways to web pages and contain a wealth of information, machine learning models can effectively identify phishing sites by analyzing and learning from these details, even in the absence of additional supportive data.
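To make concrete how much information a bare URL carries, the following minimal sketch splits a URL into structural parts and a character sequence using Python's standard library. It is an illustration only: the function name `url_views`, the chosen fields, and the example URL are hypothetical, and the framework described here learns its representations with a pre-trained network rather than from hand-picked fields like these.

```python
from urllib.parse import urlsplit


def url_views(url: str) -> dict:
    """Return a few structural and character-level views of a URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "scheme": parts.scheme,
        "host_labels": host.split("."),                       # dot-separated host labels
        "path_segments": [s for s in parts.path.split("/") if s],
        "query": parts.query,
        "chars": list(url),                                   # raw character sequence
        "num_digits": sum(c.isdigit() for c in url),
        "num_hyphens": url.count("-"),
    }


# Hypothetical look-alike URL, used only as sample input.
example = url_views("http://paypa1-secure.example.com/login/verify?id=42")
```

Even without fetching the page, the host labels, path segments, and character sequence expose cues such as digit substitutions and hyphenated brand names that a model can learn from.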

Solution

We introduce a pre-trained model-guided phishing webpage detection framework utilizing a multi-layer attention mechanism. This framework starts by extracting subword and character-level URL information using a pre-trained network. It then incorporates three pivotal modules: hierarchical feature extraction, layer-aware attention, and spatial pyramid pooling. Hierarchical feature extraction leverages pyramid feature learning to derive multi-level URL embeddings from CharBERT's various Transformer layers. The layer-aware attention module discerns and weights interconnections across hierarchical feature levels. Spatial pyramid pooling further processes the weighted feature pyramid through multiscale downsampling, capturing both local and global feature nuances. Our approach achieves near-perfect detection accuracy in real-world tests.

Backbone Network

We utilize the pretrained CharBERT model as our backbone network, primarily for its ability to focus on both subword and character-level features simultaneously. CharBERT is an enhancement of the BERT model, incorporating the Transformer architecture with a novel dual-channel framework. This framework is specifically designed to capture information at both the subword and character levels. The key advancements in CharBERT consist of two main components: (1) the Character Embedding Module, which encodes character sequences derived from input tokens, and (2) the Heterogeneous Interaction Module, which facilitates the integration and encoding of these character sequences.

Hierarchical Feature Extraction

In deep pre-trained models, the output features of one layer serve as the input to the next, yet the intricate computations within each layer can degrade low- and mid-level features and impede comprehensive feature learning. This motivates integrating the output of every layer rather than only the last. In this module, we use the pre-trained model to aggregate the features that each layer produces while learning URL information through large-scale self-supervised training. In contrast to methods that rely solely on the final layer's classification features, our approach enhances detection performance by exploiting the distinct features learned at each layer.
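The idea of keeping every layer's output can be sketched as follows. This is a toy illustration, not the framework's implementation: `toy_encoder_layers` is a hypothetical stand-in for the pre-trained backbone built from random weights, and the sizes (12 layers, sequence length 16, hidden size 8) are assumptions chosen for readability.

```python
import numpy as np


def toy_encoder_layers(x, num_layers, seed=0):
    """Hypothetical stand-in for a pre-trained Transformer.

    Returns a list with the output of every layer, not just the last one.
    """
    rng = np.random.default_rng(seed)
    dim = x.shape[-1]
    hidden_states = []
    h = x
    for _ in range(num_layers):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        h = np.tanh(h @ w) + h          # toy residual layer
        hidden_states.append(h)
    return hidden_states


seq_len, hidden_dim, num_layers = 16, 8, 12   # assumed sizes
tokens = np.random.default_rng(1).standard_normal((seq_len, hidden_dim))

per_layer = toy_encoder_layers(tokens, num_layers)
pyramid = np.stack(per_layer, axis=0)   # (num_layers, seq_len, hidden_dim)
final_only = per_layer[-1]              # what a last-layer-only method would use
```

Stacking along a new leading axis gives later modules one tensor in which the layer index plays the role that channels play in an image feature map.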

Layer-Aware Attention

To effectively discern and highlight the importance of specific features across various layers, we develop a Layer-Aware Attention mechanism, drawing inspiration from channel attention principles. This mechanism empowers the model to independently discern and assign differentiated weights to feature maps at different layers, thus boosting both processing efficiency and precision. In particular, we consolidate spatial data from pyramid feature maps, extracted via the Hierarchical Feature Extraction Module, using both average and max pooling. This yields two unique spatial context descriptors.
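A minimal sketch of this kind of attention over the layer axis is given below. Only the average and max pooling step comes from the description above; passing both descriptors through a shared two-layer MLP and a sigmoid is an assumption borrowed from common channel-attention designs, and the weights `w1`, `w2`, the reduced size, and the random input are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def layer_aware_attention(pyramid, w1, w2):
    """Weight each layer's feature map by a learned-style scalar.

    pyramid: (L, T, D) stack of per-layer features.
    w1: (R, L) and w2: (L, R) are weights of an assumed shared MLP.
    Returns the weighted pyramid and the per-layer weights.
    """
    # Two spatial context descriptors, one value per layer.
    avg_desc = pyramid.mean(axis=(1, 2))   # (L,)
    max_desc = pyramid.max(axis=(1, 2))    # (L,)

    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)   # linear, ReLU, linear

    weights = sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))  # (L,)
    return pyramid * weights[:, None, None], weights


rng = np.random.default_rng(0)
L, T, D, R = 12, 16, 8, 4                  # assumed sizes
pyramid = rng.standard_normal((L, T, D))   # placeholder for real layer outputs
w1 = rng.standard_normal((R, L)) * 0.1
w2 = rng.standard_normal((L, R)) * 0.1

weighted, weights = layer_aware_attention(pyramid, w1, w2)
```

Each layer receives one scalar weight between 0 and 1, so the output keeps the shape of the input while emphasizing some layers over others.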

Spatial Pyramid Pooling

We apply Spatial Pyramid Pooling (SPP) to the weighted feature results. Originally used in computer vision with convolutional neural networks, SPP divides feature maps into local spatial bins from fine to coarse levels and aggregates the local features within each bin, which has made it a key component in classification and detection systems. We combine SPP with the Transformer-based backbone by applying it to the weighted features produced by our Layer-Aware Attention module. In the final stage of our network, we perform mean pooling along the concatenated feature map and fixed sequence length dimension. The result is passed through a standard dropout layer and a fully connected layer, which transform the URL features into a binary class representation for prediction. This design enhances the representational capability of the features and improves the model’s adaptability to features at different scales, thereby increasing overall predictive accuracy.
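The pooling and prediction stages can be sketched roughly as follows. This is a simplified illustration under stated assumptions: pooling is done only along the sequence axis with average pooling, the bin counts `(1, 2, 4)` are hypothetical, the weights are random, and dropout is omitted because it acts as the identity at inference time.

```python
import numpy as np


def spatial_pyramid_pool(features, bin_counts=(1, 2, 4)):
    """Average-pool the sequence axis at several scales and concatenate.

    features: (T, C). Returns an array of shape (sum(bin_counts), C).
    """
    pooled = []
    for bins in bin_counts:                       # coarse to fine
        for chunk in np.array_split(features, bins, axis=0):
            pooled.append(chunk.mean(axis=0))     # one vector per bin
    return np.stack(pooled, axis=0)


def classify(features, w, b):
    """Mean-pool the SPP output, then apply a fully connected layer and softmax."""
    vec = spatial_pyramid_pool(features).mean(axis=0)   # (C,)
    logits = vec @ w + b                                # (2,)
    exp = np.exp(logits - logits.max())                 # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
T, C = 16, 8                                   # assumed sizes
features = rng.standard_normal((T, C))         # placeholder for weighted features
w, b = rng.standard_normal((C, 2)) * 0.1, np.zeros(2)

spp_out = spatial_pyramid_pool(features)       # (7, C)
probs = classify(features, w, b)               # two class probabilities
```

The single-bin level summarizes the whole sequence while the finer levels keep local detail, which is how the pooled output captures both global and local patterns.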

Competitive Advantage

Our approach outperforms the current best methods across a range of challenging real-world scenarios, including class imbalance, few-shot learning, multi-class classification, non-independent and identically distributed (non-IID) settings, and adversarial attacks. It also achieves near-perfect detection accuracy in online tests.
