DREAM Lab

The DREAM Lab is a research laboratory in UMBC’s Computer Science and Electrical Engineering department. We study problems at the intersection of machine learning and cyber security in order to tackle the ever-growing threat of malware. The amount of new malware (and often its sophistication) has been growing exponentially over time, while the supply of human analysts with the time to study and remediate it remains limited. Therefore, we want to develop new techniques that use machine learning to automate or augment as much of the malware analysis process as possible. This way we can hopefully reduce the human capital required to defend computer systems.

This intersection is particularly fun and interesting due to the breadth and depth of computer science skills involved. Malware often exploits low-level details and flaws in software, and understanding it often requires knowledge of computer architecture, assembly, networking, and software design. The machine learning tools we wish to apply in turn have their own breadth of mathematical foundations in linear algebra, calculus, and statistics. Finding all these skills in one person is rare, and so we enjoy an interdisciplinary lab working together on these research topics. This is especially true because many of the fundamental assumptions underlying modern deep learning and other machine learning methods are routinely violated to extreme degrees in this domain, necessitating new advancements in machine learning to create new capabilities in malware analysis. The lab is also home to UMBC’s cyber defense team, Cyber Dawgs.

PI: Charles Nicholas. Contact: nicholas@umbc.edu

news

Apr 27, 2024	Five abstracts accepted to the Malware Technical Exchange Meeting (MTEM 2024)!
Apr 13, 2022	Three abstracts accepted to the Malware Technical Exchange Meeting (MTEM 2022)!
Sep 25, 2021	Our paper “Searching for Selfie in TLS 1.3 with the Cryptographic Protocol Shapes Analyzer” has been accepted to GuttmanFest2021!
Sep 20, 2021	Our abstract “Incremental Malware Detection and Classification Using Hidden Markov Models” has been selected for poster presentation at ICCWS!
Sep 17, 2021	Two papers, “Adversarial Transfer Attacks With Unknown Data and Class Overlap” and “A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels” accepted to AISec!

selected publications

An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance

Raff, Edward, and Nicholas, Charles

In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), 2017

The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.
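
As a rough illustration of the idea behind LZJD, the Python sketch below builds a set of substrings with a Lempel-Ziv style scan and compares two byte sequences by the Jaccard distance of those sets. It is a simplified sketch, not the authors' implementation: the function names `lz_set` and `lzjd` are hypothetical, and it omits the speed optimizations that make the published method practical at scale.

```python
def lz_set(data: bytes) -> set:
    """Collect distinct substrings with a Lempel-Ziv style left-to-right scan.

    The current window grows one byte at a time; as soon as the window is a
    substring not seen before, it is recorded and a new window starts.
    """
    seen = set()
    start = 0
    end = 1
    while end <= len(data):
        chunk = data[start:end]
        if chunk not in seen:
            seen.add(chunk)
            start = end  # begin a fresh window after the new substring
        end += 1
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the Lempel-Ziv sets of two byte sequences."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty inputs are treated as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)
```

A distance of 0 means the two inputs produce the same substring set, and a distance of 1 means the sets share nothing. For example, `lzjd(b"abab", b"abcd")` compares the sets {a, b, ab} and {a, b, c, d}.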

Malware Detection by Eating a Whole EXE

Raff, Edward, Barker, Jon, Sylvester, Jared, Brandon, Robert, Catanzaro, Bryan, and Nicholas, Charles

In AAAI Workshop on Artificial Intelligence for Cyber Security, 2018

In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community. Building a neural network for such a problem presents a number of interesting challenges that have not occurred in tasks such as image processing or NLP. In particular, we note that detection from raw bytes presents a sequence problem with over two million time steps and a problem where batch normalization appears to hinder the learning process. We present our initial work in building a solution to tackle this problem, which has linear complexity dependence on the sequence length, and allows for interpretable sub-regions of the binary to be identified. In doing so we will discuss the many challenges in building a neural network to process data at this scale, and the methods we used to work around them.

Ranking Retrieval Systems without Relevance Judgments

Soboroff, Ian, Nicholas, Charles, and Cahan, Patrick

In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001

The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments. Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.