

Clustering digital forensic string search output
Institution:1. EPFL, Switzerland; 2. eXascale Infolab, University of Fribourg, Switzerland; 1. School of Computer Science, College of Engineering, University of Seoul, 163 Siripdae-ro, Dongdaemun-gu, Seoul, 02504, Republic of Korea; 2. Supreme Prosecutors' Office, 157 Banpo-daero, Seocho-gu, Seoul, 06590, Republic of Korea
Abstract:This research comparatively evaluates four competing clustering algorithms for thematically clustering digital forensic text string search output. It does so in a more realistic context than past research, respecting data size and heterogeneity. In this study, we used physical-level text string search output consisting of over two million search hits found in nearly 50,000 allocated files and unallocated blocks. Holding the data set constant, we comparatively evaluated k-Means, Kohonen SOM, Latent Dirichlet Allocation (LDA) followed by k-Means, and LDA followed by SOM. This enables true cross-algorithm evaluation, whereas past studies evaluated single algorithms using unique, non-reproducible datasets. Our research shows that LDA + k-Means with a linear, centroid-based user navigation procedure produces optimal results. The winning approach increased information retrieval effectiveness from the baseline random walk absolute precision rate of 0.04 to an average precision rate of 0.67. We also explored a variety of algorithms for user navigation of search hit results and found that the performance of k-Means clustering can be greatly improved with a non-linear, non-centroid-based cluster and document navigation procedure. This finding has potential implications for digital forensic tools and their use, particularly given the popularity and speed of k-Means clustering.
Keywords:Digital forensics  Text string search  Clustering  k-means  SOM  LDA
This article is indexed in ScienceDirect and other databases.
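
The abstract compares a random-walk baseline precision with the average precision obtained when hits are reviewed cluster by cluster. The abstract does not define these measures, so the following is only a minimal sketch that assumes the standard information-retrieval definition of average precision and a simple linear navigation in which clusters are visited one after another. All names (`baseline_precision`, `average_precision`, `linear_review_order`) and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Set


def baseline_precision(relevant: Set[str], all_hits: Sequence[str]) -> float:
    """Expected precision when hits are reviewed in random order:
    simply the fraction of all hits that are relevant."""
    return len(relevant) / len(all_hits)


def average_precision(ranked_hits: Sequence[str], relevant: Set[str]) -> float:
    """Standard IR average precision (an assumed definition):
    the mean of precision@k over the ranks k where a relevant hit appears."""
    found = 0
    total = 0.0
    for k, hit in enumerate(ranked_hits, start=1):
        if hit in relevant:
            found += 1
            total += found / k
    return total / found if found else 0.0


def linear_review_order(
    clusters: Dict[str, List[str]], cluster_order: Sequence[str]
) -> List[str]:
    """Flatten clusters into a single review list, visiting the clusters
    in the given order and the hits in each cluster in stored order."""
    return [hit for name in cluster_order for hit in clusters[name]]


# Toy data (hypothetical): ten search hits grouped into three clusters.
clusters = {
    "c0": ["h1", "h2", "h3"],
    "c1": ["h4", "h5", "h6", "h7"],
    "c2": ["h8", "h9", "h10"],
}
relevant = {"h1", "h2", "h8"}
all_hits = [h for hits in clusters.values() for h in hits]

# Assume the examiner visits the clusters judged most promising first.
order = linear_review_order(clusters, ["c0", "c2", "c1"])

print("baseline precision:", baseline_precision(relevant, all_hits))
print("average precision: ", average_precision(order, relevant))
```

In this toy case the relevant hits sit near the front of the review order, so the average precision is far above the baseline. That is the kind of improvement the abstract reports, although the numbers here are illustrative only.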