In the face of Internet attack threats, malware classification is one of the promising solutions in the
fields of intrusion detection and digital forensics. Previous work relied on dynamic analysis, or on static
analysis after reverse engineering. However, malware developers use anti-virtual machine (VM) and obfuscation
techniques to evade such classifiers. By deploying honeypots, malware source code can
be collected and analyzed. Source code analysis provides a better basis for classification, for understanding the
purpose of attackers, and for forensics. In this paper, a novel classification approach is proposed, based on content
similarity and directory structure similarity. Such classification avoids re-analyzing known malware and allocates
resources to new malware. Malware classification also lets network administrators know the purpose of attackers. The
experimental results demonstrate that the proposed system classifies malware efficiently with a small misclassification
ratio, and that its performance is better than that of VirusTotal.
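
As an illustration of combining content similarity with directory structure similarity, the following is a minimal sketch and not the method of the paper, whose details are not given in this abstract. It assumes that each sample is a mapping from relative file paths to source text, that content is compared with Jaccard similarity over token shingles, and that directory structure is compared with Jaccard similarity over the sets of directory paths. The function names, the shingle size `k` and the weighting parameter `weight` are all hypothetical.

```python
from pathlib import PurePosixPath


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def content_shingles(text: str, k: int = 3) -> set:
    """Set of k-token shingles from whitespace-split source text (assumed tokenizer)."""
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def directory_set(paths) -> set:
    """Set of all directories that contain a file, directly or indirectly."""
    dirs = set()
    for p in paths:
        for parent in PurePosixPath(p).parents:
            if str(parent) != ".":  # skip the implicit root of a relative path
                dirs.add(str(parent))
    return dirs


def sample_similarity(sample_a: dict, sample_b: dict, weight: float = 0.5) -> float:
    """Weighted mix of content and directory similarity (weight is an assumption)."""
    shingles_a = set().union(*(content_shingles(t) for t in sample_a.values()))
    shingles_b = set().union(*(content_shingles(t) for t in sample_b.values()))
    content_sim = jaccard(shingles_a, shingles_b)
    dir_sim = jaccard(directory_set(sample_a), directory_set(sample_b))
    return weight * content_sim + (1.0 - weight) * dir_sim


if __name__ == "__main__":
    # Toy samples, invented purely for illustration.
    bot = {
        "bot/src/main.c": "int main ( ) { return 0 ; }",
        "bot/conf/irc.cfg": "server irc.example.org port 6667",
    }
    scanner = {"scan/run.sh": "echo scanning hosts now please"}
    print(sample_similarity(bot, bot))
    print(sample_similarity(bot, scanner))
```

Under these assumptions, a newly collected sample whose score against a known sample exceeds some chosen threshold could be assigned to the same class and skipped for re-analysis; the threshold itself would have to be tuned on real data.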
CHEN Chia-mei (陈嘉玫), LAI Gu-hsin (赖谷鑫). Research on Classification of Malware Source Code[J]. Journal of Shanghai Jiaotong University (Science), 2014, 19(4): 425-430.
DOI: 10.1007/s12204-014-1519-1