Supervised machine learning for classification, mining and ranking of illegal web contents = Aprendizaje automático supervisado para la clasificación, extracción de conocimiento y ordenación de contenidos web ilegales
In this thesis, we propose new algorithms, methods, and datasets for classifying, mining information from, and ranking web domains or similar resources containing text. Motivated by our joint work with INCIBE, we focus on detecting web resources whose content could indicate illegal activities. Most of these textual web pages are hosted in a darknet, so we centered our analysis on The Onion Router (Tor) Darknet, based on the common belief that this network hosts plenty of criminal activities. Additionally, we addressed the same problem in Online Notepad Services (ONS), in particular the Pastebin service. Several of the contributions presented here are already incorporated into tools developed by INCIBE that help Spanish Law Enforcement Agencies (LEAs) monitor the contents of the Tor Darknet. Our work relies on machine learning, both classical and deep, and most of the time on supervised learning. This approach required the creation of several datasets. The first, Darknet Usage Text Addresses (DUTA), contained 6,831 labeled samples distributed over 26 classes. We later extended it to 10,367 samples and named the result DUTA-10K. Using DUTA, we evaluated the combination of two text representation techniques with three well-known classifiers to categorize Tor domains. The combination of a TF-IDF word representation with Logistic Regression achieved a macro F1 score of 93.7% on a subset of DUTA in which eight categories of illegal activities were selected. To classify Pastebin contents, we use Active Learning to select and label only the most informative samples, which reduces the cost of building a labeled dataset. Our design requires three cascaded classifiers, the last of which determines whether a sample belongs to one of six categories related to criminal activities. The design obtains an average class recall of 95.24% in the binary setting and 80.33% in the multiclass setting.
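The two evaluation measures reported above, macro F1 and average class recall, can be illustrated with a minimal sketch. It assumes the usual definitions (unweighted mean of per-class F1, and unweighted mean of per-class recall over the classes present in the ground truth). The function names and the toy labels are hypothetical and are not taken from the thesis or from DUTA.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return {c: (tp[c], fp[c], fn[c]) for c in labels}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_class_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over classes that occur in y_true."""
    recalls = []
    for tp, _fp, fn in per_class_counts(y_true, y_pred).values():
        if tp + fn:  # skip labels that only appear in the predictions
            recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Toy labels, invented for illustration only.
y_true = ["drugs", "drugs", "weapons", "weapons", "hacking", "hacking"]
y_pred = ["drugs", "drugs", "weapons", "drugs", "hacking", "hacking"]
print(round(macro_f1(y_true, y_pred), 4))
print(round(average_class_recall(y_true, y_pred), 4))
```

Both measures average over classes rather than over samples, which matters for a dataset spread unevenly across many categories: a classifier cannot score well by being accurate only on the largest classes.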
To enrich the information that we provide to LEAs, we first develop a semi-automatic algorithm to identify emerging products in Tor marketplaces. Using Graph Theory, we build a Products Correlations Graph (PCG), in which the nodes are the markets' products and the edges reflect the simultaneous offering of two products in the same market. Our algorithm decomposes the PCG using the k-shell algorithm and analyzes the connectivity of the products in the core shell. We apply this method to drug Hidden Services (HS) in DUTA and find that MDMA and Ecstasy were the top emerging drug products during the analyzed period. Second, we use Named Entity Recognition (NER) to recognize rare and emerging named entities in noisy user-generated text. Rather than relying on gazetteers to incorporate external resources into neural network architectures, we present a novel feature, Local Distance Neighbor (LDN), which obtains state-of-the-art F1 scores on three categories of the W-NUT-2017 dataset: Group, Person, and Product. Furthermore, we present an application of NER to the domains of the Tor network by extending the W-NUT-2017 dataset with 851 manually annotated entries; we name this modified version Noisy User-generated Text on Tor (NUToT). Our model obtained entity and surface F1 scores of 52.96% and 50.57%, respectively, beating the current state of the art. Our third area of interest is detecting which onion domains are the most influential, both within their category and across the whole Tor Darknet. Our first solution is ToRank, a novel algorithm based on Graph Theory that outperforms three well-known alternatives used in the Surface Web when ranking Tor hidden services: PageRank, HITS, and Katz. ToRank creates a greater disruption to the connectivity of the Tor network than any of these alternatives, which indicates its superiority for this problem.
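The graph construction and k-shell decomposition described for the PCG can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the input format (a mapping from market to product list), the function names, and the toy listings are all hypothetical, and the subsequent connectivity analysis of the core shell is not reproduced because the abstract does not specify it.

```python
from itertools import combinations


def build_pcg(market_products):
    """Undirected graph: product -> set of products co-offered in at least one market."""
    graph = {}
    for products in market_products.values():
        unique = set(products)
        for product in unique:
            graph.setdefault(product, set())
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def k_shell_indices(graph):
    """Assign each node its shell index by repeatedly peeling the lowest-degree nodes."""
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    shell = {}
    k = 0
    while remaining:
        # The current shell index is the smallest degree left in the graph.
        k = max(k, min(degree[node] for node in remaining))
        peel = [node for node in remaining if degree[node] <= k]
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already peeled via a duplicate queue entry
            remaining.discard(node)
            shell[node] = k
            for nbr in graph[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
                    if degree[nbr] <= k:
                        peel.append(nbr)  # removal cascades within the same shell
    return shell


# Toy listings with placeholder product names, invented for illustration only.
markets = {
    "market_1": ["A", "B", "C"],
    "market_2": ["A", "B", "D"],
    "market_3": ["C", "D"],
    "market_4": ["D", "E"],
}
shells = k_shell_indices(build_pcg(markets))
core = max(shells.values())
print(sorted(p for p, s in shells.items() if s == core))
```

In this toy graph, products A to D are all mutually connected and form the innermost shell, while E, which is offered alongside only one other product, is peeled off first.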
After evaluating ToRank, we analyze DUTA-10K and find that 20% of the samples are related to suspicious activities and 48% to normal ones, while the remaining 32% are unknown because they were inaccessible. We also discover that domains related to suspicious activities usually have multiple clones under different addresses, which could serve as an additional feature for identifying them. Our last contribution is a proposal based on the analysis of domain content, considering textual, HTML, NER, link-related, and visual features. We represent each onion domain by 40 features and compare three supervised ranking strategies using Learning to Rank (LtR) to detect the most influential onion domains. We evaluate our proposal in a case study of drug-related onion domains, finding that, among the explored LtR schemes, the listwise approach outperforms the benchmarked methods with an NDCG@10 of 0.95. Our findings show that content-based ranking is superior to link-based algorithms and that features extracted from the user-visible textual content are the best choice when balancing extraction cost against the precision reached.
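For readers unfamiliar with the NDCG@10 figure quoted above, the metric can be sketched as below. The sketch assumes the linear-gain form of DCG with a base-2 logarithmic discount; the abstract does not state which gain variant was used, and the function names and relevance values are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, using linear gain (an assumption)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of domains in the order a ranker returned them (invented values).
ranked_relevances = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(ranked_relevances, 10), 4))
```

A value of 1.0 means the ranker ordered the domains exactly as the ground-truth relevance would, so a score close to 1 at cutoff 10 indicates that the ten top-ranked domains are nearly in ideal order.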
- Thesis