DREAM Lab | publications

2025

2024

Ransomware Evolution: Unveiling Patterns Using HDBSCAN

Bhandary, Prajna, Joyce, Robert J, and Nicholas, Charles

In 2024

This research presents an innovative approach to enhancing ransomware detection by leveraging Windows API calls and PE header information to develop precise signatures capable of identifying ransomware families. Our methodology introduces a novel application of hierarchical clustering using the HDBSCAN algorithm, in conjunction with the Jaccard similarity metric, to cluster ransomware into discrete families and generate corresponding signatures. This technique, to our knowledge, marks a pioneering effort in applying hierarchical density-based clustering to over 1.1 million malicious samples, specifically focusing on ransomware and using the clusters to automatically generate signatures.
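As a rough illustration of the Jaccard metric mentioned in the abstract above, the sketch below computes pairwise Jaccard distances between sets of API call names. The sample sets and function names are hypothetical and chosen only for illustration; the paper's actual feature extraction and HDBSCAN configuration are not shown here.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |A & B| / |A | B|, treating two empty sets as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance_matrix(samples):
    """Build a dense matrix of Jaccard distances (1 - similarity)."""
    n = len(samples)
    return [
        [1.0 - jaccard_similarity(samples[i], samples[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical API-call sets for three samples (illustrative only).
samples = [
    {"CreateFileW", "WriteFile", "CryptEncrypt"},
    {"CreateFileW", "WriteFile", "CryptEncrypt", "DeleteFileW"},
    {"RegOpenKeyExW", "InternetOpenW"},
]
distances = jaccard_distance_matrix(samples)
```

A precomputed matrix of this kind is one possible input to a density-based clusterer that accepts precomputed distances; whether the authors took that route is an assumption, not something the abstract states.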

2023

MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers

Joyce, Robert J., Raff, Edward, Nicholas, Charles, and Holt, James

2023

Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family. However, malware can be categorized according to many other types of attributes, and the ability to identify these attributes in newly-emerging malware using machine learning could provide significant value to analysts. In particular, we have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with. To obtain labels for training and evaluating ML classifiers on these tasks, we created an antivirus (AV) tagging tool called ClarAVy. ClarAVy’s sophisticated AV label parser distinguishes itself from prior AV-based taggers, with the ability to accurately parse 882 different AV label formats used by 90 different AV products. We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total. Our malware behavior dataset includes 75 distinct tags - nearly 7x more than the only prior benchmark dataset with behavioral tags. To our knowledge, we are the first to release datasets with malware platform and packer tags.

2022

FedSPLIT: One-Shot Federated Recommendation System Based on Non-negative Joint Matrix Factorization and Knowledge Distillation

Eren, Maksim E, Richards, Luke E, Bhattarai, Manish, Yus, Roberto, Nicholas, Charles, and Alexandrov, Boian S

arXiv preprint arXiv:2205.02359 2022

Non-negative matrix factorization (NMF) with missing-value completion is a well-known effective Collaborative Filtering (CF) method used to provide personalized user recommendations. However, traditional CF relies on the privacy-invasive collection of users’ explicit and implicit feedback to build a central recommender model. One-shot federated learning has recently emerged as a method to mitigate the privacy problem while addressing the traditional communication bottleneck of federated learning. In this paper, we present the first unsupervised one-shot federated CF implementation, named FedSPLIT, based on NMF joint factorization. In our solution, the clients first apply local CF in parallel to build distinct client-specific recommenders. Then, the privacy-preserving local item patterns and biases from each client are shared with the processor to perform joint factorization in order to extract the global item patterns. Extracted patterns are then aggregated to each client to build the local models via knowledge distillation. In our experiments, we demonstrate the feasibility of our approach with standard recommendation datasets. FedSPLIT can obtain results similar to the state of the art (and even outperform it in certain situations) with a substantial decrease in the number of communications.
General-Purpose Unsupervised Cyber Anomaly Detection via Non-Negative Tensor Factorization

Eren, Maksim Ekin, Moore, Juston, Skau, Erik, Bhattarai, Manish, Moore, Elisabeth, Chennupati, Gopinath, and Alexandrov, Boian

Digital Threats: Research and Practice 2022

Distinguishing malicious anomalous activities from unusual but benign activities is a fundamental challenge for cyber defenders. Prior studies have shown that statistical user behavior analysis yields accurate detections by learning behavior profiles from observed user activity. These unsupervised models are able to generalize to unseen types of attacks by detecting deviations from normal behavior, without knowledge of specific attack signatures. However, approaches proposed to date based on probabilistic matrix factorization are limited by the information conveyed in a two-dimensional space. Non-negative tensor factorization, on the other hand, is a powerful unsupervised machine learning method that naturally models multi-dimensional data, capturing complex and multi-faceted details of behavior profiles. Our new unsupervised statistical anomaly detection methodology matches or surpasses state-of-the-art supervised learning baselines across several challenging and diverse cyber application areas, including detection of compromised user credentials, botnets, spam e-mails, and fraudulent credit card transactions.

2021

MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

Joyce, Robert J., Amlani, Dev, Nicholas, Charles, and Raff, Edward

2021

Malware family classification is a significant issue with public safety and research implications that has been hindered by the high cost of expert labels. The vast majority of corpora use noisy labeling approaches that obstruct definitive quantification of results and study of deeper interactions. In order to provide the data needed to advance further, we have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset with ground truth family labels to date, nearly 3x larger than any prior expert-labeled corpus and 36x larger than the prior Windows malware corpus. MOTIF also comes with a mapping from malware samples to threat reports published by reputable industry sources, which both validates the labels and opens new research opportunities in connecting opaque malware samples to human-readable descriptions. This enables important evaluations that are normally infeasible due to non-standardized reporting in industry. For example, we provide aliases of the different names used to describe the same malware family, allowing us to benchmark for the first time accuracy of existing tools when names are obtained from differing sources. Evaluation results obtained using the MOTIF dataset indicate that existing tasks have significant room for improvement, with accuracy of antivirus majority voting measured at only 62.10% and the well-known AVClass tool having just 46.78% accuracy. Our findings indicate that malware family classification suffers a type of labeling noise unlike that studied in most ML literature, due to the large open set of classes that may not be known from the sample under consideration.
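The antivirus majority-voting baseline mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not MOTIF's evaluation protocol: the family names are hypothetical, ties are resolved by abstaining, and real AV labels would first need parsing and alias resolution.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common family label, or None on no labels or a tie."""
    counts = Counter(label.lower() for label in labels if label)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # abstain when the top two labels are tied
    return ranked[0][0]


def accuracy(predictions, truths):
    """Fraction of samples whose predicted family matches the ground truth."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# Hypothetical per-sample AV family labels (illustrative only).
av_labels = [
    ["emotet", "Emotet", "trickbot"],
    ["zeus", "zbot"],
    [],
]
predictions = [majority_vote(labels) for labels in av_labels]
```

The second sample shows why aliases matter: two engines naming the same family differently produce a tie, so the vote abstains.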
A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels

Joyce, Robert J., Raff, Edward, and Nicholas, Charles

Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security 2021

In some problem spaces, the high cost of obtaining ground truth labels necessitates use of lower quality reference datasets. It is difficult to benchmark model performance using these datasets, as evaluation results may be biased. We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds on specific metrics used to evaluate clustering algorithms and multi-class classifiers can be computed without reference labels. We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality. Creating an AGTR requires domain knowledge, and malware family classification is a task with robust domain knowledge approaches that support the construction of an AGTR. We demonstrate our AGTR evaluation framework by applying it to a popular malware labeling tool to diagnose over-fitting in prior testing and evaluate changes whose impact could not be meaningfully quantified under previous data.
Rank-1 Similarity Matrix Decomposition For Modeling Changes in Antivirus Consensus Through Time

Joyce, Robert J., Raff, Edward, and Nicholas, Charles

2021

Although groups of strongly correlated antivirus engines are known to exist, at present there is limited understanding of how or why these correlations came to be. Using a corpus of 25 million VirusTotal reports representing over a decade of antivirus scan data, we challenge prevailing wisdom that these correlations primarily originate from "first-order" interactions such as antivirus vendors copying the labels of leading vendors. We introduce the Temporal Rank-1 Similarity Matrix decomposition (R1SM-T) in order to investigate the origins of these correlations and to model how consensus amongst antivirus engines changes over time. We reveal that first-order interactions do not explain as much behavior in antivirus correlation as previously thought, and that the relationships between antivirus engines are highly volatile. We make recommendations on items in need of future study and consideration based on our findings.
COVID-19 Multidimensional Kaggle Literature Organization

Eren, Maksim Ekin, Solovyev, Nick, Hamer, Chris, McDonald, Renee, Alexandrov, Boian, and Nicholas, Charles

In Proceedings of the ACM Symposium on Document Engineering 2021 2021

The unprecedented outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, continues to be a significant worldwide problem. As a result, a surge of new COVID-19 related research has followed suit. The growing number of publications requires document organization methods to identify relevant information. In this paper, we expand upon our previous work with clustering the CORD-19 dataset by applying multi-dimensional analysis methods. Tensor factorization is a powerful unsupervised learning method capable of discovering hidden patterns in a document corpus. We show that a higher-order representation of the corpus allows for the simultaneous grouping of similar articles, relevant journals, authors with similar research interests, and topic keywords. These groupings are identified within and among the latent components extracted via tensor decomposition. We further demonstrate the application of this method with a publicly available interactive visualization of the dataset.
Random Forest of Tensors (RFoT)

Eren, Maksim Ekin, Nicholas, Charles, McDonald, Renee, and Hamer, Chris

Presented at the 12th Annual Malware Technical Exchange Meeting, Online. 2021

Machine learning has become an invaluable tool in the fight against malware. Traditional supervised and unsupervised methods are not designed to capture the multi-dimensional details that are often present in cyber data. In contrast, tensor factorization is a powerful unsupervised data analysis method for extracting the latent patterns that are hidden in a multi-dimensional corpus. In this poster we explore the application of tensors to classification, and we describe a hybrid model that leverages the strength of multi-dimensional analysis combined with clustering. We introduce a novel semi-supervised ensemble classifier named Random Forest of Tensors (RFoT) that is based on generating a forest of tensors in parallel, which share the same first dimension, and randomly selecting the remainder of the dimensions and entries of each tensor from the feature set.
Evading Malware Classifiers via Monte Carlo Mutant Feature Discovery

Boutsikas, John, Eren, Maksim Ekin, Varga, Charles, Raff, Edward, Matuszek, Cynthia, and Nicholas, Charles

Presented at the 12th Annual Malware Technical Exchange Meeting, Online. 2021

The use of Machine Learning has become a significant part of malware detection efforts due to the influx of new malware, an ever changing threat landscape, and the ability of Machine Learning methods to discover meaningful distinctions between malicious and benign software. Antivirus vendors have also begun to widely utilize malware classifiers based on dynamic and static malware analysis features. Therefore, a malware author might make evasive binary modifications against Machine Learning models as part of the malware development life cycle to execute an attack successfully. This makes the study of possible classifier evasion strategies an essential part of cyber defense against malice. To this end, we stage a grey box setup to analyze a scenario where the malware author does not know the target classifier algorithm, and does not have access to decisions made by the classifier, but knows the features used in training. In this experiment, a malicious actor trains a surrogate model using the EMBER-2018 dataset to discover binary mutations that cause an instance to be misclassified via a Monte Carlo tree search. Then, mutated malware is sent to the victim model that takes the place of an antivirus API to test whether it can evade detection.
Bringing UMAP Closer to the Speed of Light with GPU Acceleration

Nolet, Corey J., Lafargue, Victor, Raff, Edward, Nanditale, Thejaswi, Oates, Tim, Zedlewski, John, and Patterson, Joshua

In The Thirty-Fifth AAAI Conference on Artificial Intelligence 2021

The Uniform Manifold Approximation and Projection (UMAP) algorithm has become widely popular for its ease of use, quality of results, and support for exploratory, unsupervised, supervised, and semi-supervised learning. While many algorithms can be ported to a GPU in a simple and direct fashion, such efforts have resulted in inefficient and inaccurate versions of UMAP. We show a number of techniques that can be used to make a faster and more faithful GPU version of UMAP, and obtain speedups of up to 100x in practice. Many of these design choices/lessons are general purpose and may inform the conversion of other graph and manifold learning algorithms to use GPUs. Our implementation has been made publicly available as part of the open source RAPIDS cuML library (https://github.com/rapidsai/cuml).
Research Reproducibility as a Survival Analysis

Raff, Edward

In The Thirty-Fifth AAAI Conference on Artificial Intelligence 2021

There has been increasing concern within the machine learning community that we are in a reproducibility crisis. As many have begun to work on this problem, all work we are aware of treats the issue of reproducibility as an intrinsic binary property: a paper is or is not reproducible. Instead, we consider modeling the reproducibility of a paper as a survival analysis problem. We argue that this perspective represents a more accurate model of the underlying meta-science question of reproducible research, and we show how a survival analysis allows us to draw new insights that better explain prior longitudinal data. The data and code can be found at https://github.com/EdwardRaff/Research-Reproducibility-Survival-Analysis
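To make the survival-analysis framing above concrete, here is a sketch of the standard Kaplan-Meier estimator, a basic survival-analysis tool. It is a generic illustration and not necessarily the model used in the paper; the durations and the interpretation (time spent until a paper is reproduced, with abandoned attempts treated as censored) are assumptions made for the example.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time t where at least one event occurred.

    durations: time until the event or until observation stopped.
    observed:  True if the event happened, False if the record is censored.
    """
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0
        leaving = 0
        # Group all records that share the same time t.
        while i < len(records) and records[i][0] == t:
            events += 1 if records[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical data: effort (in days) and whether reproduction succeeded.
curve = kaplan_meier([1, 2, 2, 3, 5], [True, True, False, True, False])
```

Here `S(t)` is read as the estimated fraction of papers still not reproduced after `t` days of effort; censored attempts reduce the at-risk count without counting as failures.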
Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

Raff, Edward, Fleshman, William, Zak, Richard, Anderson, Hyrum S., Filar, Bobby, and McLean, Mark

In The Thirty-Fifth AAAI Conference on Artificial Intelligence 2021

Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, inputs may exceed $100$ MB, which corresponds to a time series with $T=100,000,000$ steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing up to $T=2,000,000$ steps. The $\mathcal{O}(T)$ memory of CNNs has prevented further application of CNNs to malware. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length $T$. This makes MalConv $116\times$ more memory efficient, and up to $25.8\times$ faster to train on its original dataset, while removing the input length restrictions to MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability lacked by the original MalConv CNN. Our implementation can be found at https://github.com/NeuromorphicComputationResearchProgram/MalConv2
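The core observation behind length-invariant memory for global max pooling can be sketched in a few lines: the per-channel maximum over time can be accumulated chunk by chunk, so only one chunk plus a running maximum is held at once. This is a simplified forward-pass illustration with made-up activations; how the paper handles gradient computation is not reproduced here.

```python
from typing import Iterable, Iterator, List, Optional


def chunked(steps: List[List[float]], chunk_size: int) -> Iterator[List[List[float]]]:
    """Yield consecutive slices of the time axis."""
    for start in range(0, len(steps), chunk_size):
        yield steps[start:start + chunk_size]


def streaming_max_pool(chunks: Iterable[List[List[float]]]) -> Optional[List[float]]:
    """Per-channel max over all time steps, holding only a running maximum."""
    running = None
    for chunk in chunks:
        for step in chunk:  # step = activations of every channel at one time
            if running is None:
                running = list(step)
            else:
                running = [max(r, v) for r, v in zip(running, step)]
    return running


# Made-up activations: 20 time steps, 2 channels.
activations = [[(t * 7) % 11, (t * 3) % 5] for t in range(20)]
full = [max(channel) for channel in zip(*activations)]
streamed = streaming_max_pool(chunked(activations, chunk_size=6))
```

Because `max` is associative, `streamed` matches `full` for any chunk size, which is what lets memory stay independent of the sequence length.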
Accounting for Variance in Machine Learning Benchmarks

Bouthillier, Xavier, Delaunay, Pierre, Bronzi, Mirko, Trofimov, Assya, Nichyporuk, Brennan, Szeto, Justin, Sepah, Naz, Raff, Edward, Madan, Kanika, Voleti, Vikram, Kahou, Samira Ebrahimi, Michalski, Vincent, Serdyuk, Dmitriy, Arbel, Tal, Pal, Chris, Varoquaux, Gaël, and Vincent, Pascal

In Machine Learning and Systems (MLSys) 2021

Strong empirical evidence that one machine-learning algorithm A outperforms another one B ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameter choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process, revealing that variance due to data sampling, parameter initialization and hyperparameter choice markedly impacts the results. We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that adding more sources of variation to an imperfect estimator better approximates the ideal estimator at a 51 times reduction in compute cost. Building on these results, we study the error rate of detecting improvements, on five different deep-learning tasks/architectures. This study leads us to propose recommendations for performance comparisons.
Exact Acceleration of K-Means ++ and K-Means

Raff, Edward

In 30th International Joint Conference on Artificial Intelligence (IJCAI-21) 2021

K-Means++ and its distributed variant K-Means‖ have become de facto tools for selecting the initial seeds of K-means. While alternatives have been developed, the effectiveness, ease of implementation, and theoretical grounding of the K-means++ and ‖ methods have made them difficult to “best” from a holistic perspective. By considering the limited opportunities within seed selection to perform pruning, we develop specialized triangle inequality pruning strategies and a dynamic priority queue to show the first acceleration of K-Means++ and K-Means‖ that is faster in run-time while being algorithmically equivalent. For both algorithms we are able to reduce distance computations by over 500×. For K-means++ this results in up to a 17× speedup in run-time and a 551× speedup for K-means‖. We achieve this with simple, but carefully chosen, modifications to known techniques which makes it easy to integrate our approach into existing implementations of these algorithms.
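The idea of exact pruning during seed selection can be illustrated with one classic triangle-inequality bound: if a point's current nearest seed `c` is at distance `d`, and a new seed `c'` satisfies `dist(c, c') >= 2 * d`, then `c'` cannot be closer than `c`, so the distance to `c'` need not be computed. The sketch below applies only that rule to plain K-means++ seeding; the paper's specialized strategies and dynamic priority queue are not reproduced, and the data are synthetic.

```python
import math
import random


def kmeanspp_seeds(points, k, rng, prune=True):
    """K-means++ seeding; returns (seed indices, distance computations)."""
    first = rng.randrange(len(points))
    seeds = [first]
    nearest = [first] * len(points)  # index of each point's nearest seed
    d = [math.dist(p, points[first]) for p in points]
    cost = len(points)
    while len(seeds) < k:
        # Sample the next seed with probability proportional to d^2.
        new = rng.choices(range(len(points)), weights=[x * x for x in d])[0]
        gap = {}
        if prune:
            # Distances between the new seed and every existing seed.
            gap = {s: math.dist(points[new], points[s]) for s in seeds}
            cost += len(seeds)
        seeds.append(new)
        for i, p in enumerate(points):
            if prune and gap[nearest[i]] >= 2.0 * d[i]:
                continue  # the new seed cannot be closer; skip the computation
            cost += 1
            di = math.dist(p, points[new])
            if di < d[i]:
                d[i], nearest[i] = di, new
    return seeds, cost


# Synthetic data: three well-separated blobs of 50 points each.
data_rng = random.Random(1)
centers = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
points = [
    (cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1))
    for cx, cy in centers
    for _ in range(50)
]
pruned_seeds, pruned_cost = kmeanspp_seeds(points, 3, random.Random(0), prune=True)
naive_seeds, naive_cost = kmeanspp_seeds(points, 3, random.Random(0), prune=False)
```

With the same random stream, both variants select the same seeds, since pruning only skips computations whose result could not change a point's nearest seed.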
Generating Thermal Human Faces for Physiological Assessment Using Thermal Sensor Auxiliary Labels

Ordun, Catherine, Raff, Edward, and Purushotham, Sanjay

In ICIP 2021

Thermal images reveal medically important physiological information about human stress, signs of inflammation, and emotional mood that cannot be seen on visible images. Providing a method to generate thermal faces from visible images would be highly valuable for the telemedicine community in order to show this medical information. To the best of our knowledge, there are limited works on visible-to-thermal (VT) face translation, and many current works go the opposite direction to generate visible faces from thermal surveillance images (TV) for law enforcement applications. As a result, we introduce favtGAN, a VT GAN which uses the pix2pix image translation model with an auxiliary sensor label prediction network for generating thermal faces from visible images. Since most TV methods are trained on only one data source drawn from one thermal sensor, we combine datasets from faces and cityscapes. These combined data are captured from similar sensors in order to bootstrap the training and transfer learning task, especially valuable because visible-thermal face datasets are limited. Experiments on these combined datasets show that favtGAN demonstrates an increase in SSIM and PSNR scores of generated thermal faces, compared to training on a single face dataset alone.
Leveraging Uncertainty for Improved Static Malware Detection Under Extreme False Positive Constraints

Nguyen, Andre T., Raff, Edward, Nicholas, Charles, and Holt, James

In IJCAI-21 1st International Workshop on Adaptive Cyber Defense 2021

The detection of malware is a critical task for the protection of computing environments. This task often requires extremely low false positive rates (FPR) of 0.01% or even lower, for which modern machine learning has no readily available tools. We introduce the first broad investigation of the use of uncertainty for malware detection across multiple datasets, models, and feature types. We show how ensembling and Bayesian treatments of machine learning methods for static malware detection allow for improved identification of model errors, uncovering of new malware families, and predictive performance under extreme false positive constraints. In particular, we improve the true positive rate (TPR) at an actual realized FPR of 1e-5 from an expected 0.69 for previous methods to 0.80 on the best performing model class on the Sophos industry scale dataset. We additionally demonstrate how previous works have used an evaluation protocol that can lead to misleading results.
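The metric discussed above, true positive rate at a fixed false positive budget, can be computed as in the following sketch. The scores are made up, and the function name is hypothetical; which data split the threshold is chosen on is a protocol decision that this example does not address.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """Return (threshold, TPR) for the loosest threshold with FPR <= max_fpr.

    labels: 1 for malicious, 0 for benign. A file is flagged when its score
    is strictly greater than the threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(max_fpr * len(benign))  # benign files we may flag
    # Using the (allowed + 1)-th highest benign score as a strict threshold
    # flags at most `allowed` benign files, even when scores are tied.
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    tpr = sum(s > threshold for s in malicious) / len(malicious)
    return threshold, tpr


# Made-up detector scores: four benign files, then four malicious files.
scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.8, 0.4, 0.25]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
relaxed = tpr_at_fpr(scores, labels, max_fpr=0.25)
```

With only four benign files the smallest non-zero FPR step is 25%, which hints at why measuring an FPR near 1e-5 requires very large benign test sets.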

2020

Multi-Dimensional Anomalous Entity Detection via Poisson Tensor Factorization

Eren, Maksim Ekin, Moore, Juston, and Alexandrov, Boian

In 2020 IEEE International Conference on Intelligence and Security Informatics (ISI) 2020

As the attack surfaces of large enterprise networks grow, anomaly detection systems based on statistical user behavior analysis play a crucial role in identifying malicious activities. Previous work has shown that link prediction algorithms based on non-negative matrix factorization learn highly accurate predictive models of user actions. However, most statistical link prediction models have been constructed on bipartite graphs, and fail to capture the nuanced, multi-faceted details of a user’s activity profile. This paper establishes a new benchmark for red team event detection on the Los Alamos National Laboratory Unified Host and Network Dataset by applying a tensor factorization model that exploits the multi-dimensional and sparse structure of user authentication logs. We show that learning patterns of normal activity across multiple dimensions in one unified statistical framework yields improved detection of penetration testing events. We further show operational value by developing fusion methods that can identify anomalous users, source devices, and destination devices in the network.
Flexible and Adaptive Fairness-aware Learning in Non-stationary Data Streams

Zhang, Wenbin, Zhang, Mingli, Zhang, Ji, Liu, Zhen, Chen, Zhiyuan, Wang, Jianwu, Raff, Edward, and Messina, Enza

In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) 2020

Artificial intelligence (AI)-based decision-making systems are employed nowadays in an ever growing number of online as well as offline services, some of great importance. Depending on sophisticated learning algorithms and available data, these systems are increasingly becoming automated and data-driven. However, these systems can impact individuals and communities with ethical or legal consequences. Numerous approaches have therefore been proposed to develop decision-making systems that are discrimination-conscious by design. However, these methods assume the underlying data distribution is stationary without drift, which is counterfactual in many real world applications. In addition, their focus has been largely on minimizing discrimination while maximizing prediction performance without necessary flexibility in customizing the tradeoff according to different applications. To this end, we propose a learning algorithm for fair classification that also adapts to evolving data streams and further allows for a flexible control on the degree of accuracy and fairness. The positive results on a set of discriminated and non-stationary data streams demonstrate the effectiveness and flexibility of this approach.
Sampling Approach Matters: Active Learning for Robotic Language Acquisition

Pillai, Nisha, Raff, Edward, Ferraro, Francis, and Matuszek, Cynthia

In 2020 IEEE International Conference on Big Data (Big Data) 2020

Ordering the selection of training data using active learning can lead to improvements in learning efficiently from smaller corpora. We present an exploration of active learning approaches applied to three grounded language problems of varying complexity in order to analyze what methods are suitable for improving data efficiency in learning. We present a method for analyzing the complexity of data in this joint problem space, and report on how characteristics of the underlying task, along with design decisions such as feature selection and classification model, drive the results. We observe that representativeness, along with diversity, is crucial in selecting data samples.
COVID-19 Kaggle Literature Organization

Eren, Maksim Ekin, Solovyev, Nick, Raff, Edward, Nicholas, Charles, and Johnson, Ben

In Proceedings of the ACM Symposium on Document Engineering 2020 2020

The world has faced the devastating outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, in 2020. Research in the subject matter was fast-tracked to such a point that scientists were struggling to keep up with new findings. With this increase in the scientific literature, there arose a need for organizing those documents. We describe an approach to organize and visualize the scientific literature on or related to COVID-19 using machine learning techniques so that papers on similar topics are grouped together. By doing so, the navigation of topics and related papers is simplified. We implemented this approach using the widely recognized CORD-19 dataset to present a publicly available proof of concept.
A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

Raff, Edward, and Nicholas, Charles

In NeurIPS 2020 Workshop: ML Retrospectives, Surveys & Meta-Analyses (ML-RSA) 2020

Malware classification is a difficult problem, to which machine learning methods have been applied for decades. Yet progress has often been slow, in part due to a number of unique difficulties with the task that occur through all stages of developing a machine learning system: data collection, labeling, feature creation and selection, model selection, and evaluation. In this survey we will review a number of the current methods and challenges related to malware classification, including data collection, feature extraction, model construction, and evaluation. Our discussion will include thoughts on the constraints that must be considered for machine learning based solutions in this domain, and yet-to-be-tackled problems for which machine learning could also provide a solution. This survey aims to be useful both to cybersecurity practitioners who wish to learn more about how machine learning can be applied to the malware problem, and to give data scientists the necessary background into the challenges in this uniquely complicated space.
The Use of AI for Thermal Emotion Recognition: A Review of Problems and Limitations in Standard Design and Data

Ordun, Catherine, Raff, Edward, and Purushotham, Sanjay

In AAAI FSS-20: Artificial Intelligence in Government and Public Sector 2020

With the increased attention on thermal imagery for Covid-19 screening, the public sector may believe there are new opportunities to exploit thermal as a modality for computer vision and AI. Thermal physiology research has been ongoing since the late nineties. This research lies at the intersections of medicine, psychology, machine learning, optics, and affective computing. We will review the known factors of thermal vs. RGB imaging for facial emotion recognition. But we also propose that thermal imagery may provide a semi-anonymous modality for computer vision, over RGB, which has been plagued by misuse in facial recognition. However, the transition to adopting thermal imagery as a source for any human-centered AI task is not easy and relies on the availability of high fidelity data sources across multiple demographics and thorough validation. This paper takes the reader on a short review of machine learning in thermal FER and the limitations of collecting and developing thermal FER data for AI training. Our motivation is to provide an introductory overview into recent advances for thermal FER and stimulate conversation about the limitations in current datasets.
Robust Design of Deep Neural Networks against Adversarial Attacks based on Lyapunov Theory

Rahnama, Arash, Nguyen, Andre T., and Raff, Edward

In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Deep neural networks (DNNs) are vulnerable to subtle adversarial perturbations applied to the input. These adversarial perturbations, though imperceptible, can easily mislead the DNN. In this work, we take a control theoretic approach to the problem of robustness in DNNs. We treat each individual layer of the DNN as a nonlinear dynamical system and use Lyapunov theory to prove stability and robustness locally. We then proceed to prove stability and robustness globally for the entire DNN. We develop empirically tight bounds on the response of the output layer, or any hidden layer, to adversarial perturbations added to the input, or the input of hidden layers. Recent works have proposed spectral norm regularization as a solution for improving robustness against l2 adversarial attacks. Our results give new insights into how spectral norm regularization can mitigate the adversarial effects. Finally, we evaluate the power of our approach on a variety of data sets and network architectures and against some of the well-known adversarial attacks.
Automatic Yara Rule Generation Using Biclustering

Raff, Edward, Zak, Richard, Munoz, Gary Lopez, Fleming, William, Anderson, Hyrum S., Filar, Bobby, Nicholas, Charles, and Holt, James

In 13th ACM Workshop on Artificial Intelligence and Security (AISec’20) 2020

Yara rules are a ubiquitous tool among cybersecurity practitioners and analysts. Developing high-quality Yara rules to detect a malware family of interest can be labor- and time-intensive, even for expert users. Few tools exist and relatively little work has been done on how to automate the generation of Yara rules for specific families. In this paper, we leverage large n-grams ($n \geq 8$) combined with a new biclustering algorithm to construct simple Yara rules more effectively than currently available software. Our method, AutoYara, is fast, allowing for deployment on low-resource equipment for teams that deploy to remote networks. Our results demonstrate that AutoYara can help reduce analyst workload by producing rules with useful true-positive rates while maintaining low false-positive rates, sometimes matching or even outperforming human analysts. In addition, real-world testing by malware analysts indicates AutoYara could reduce analyst time spent constructing Yara rules by 44-86%, allowing them to spend their time on the more advanced malware that current tools can’t handle. Code will be made available at https://github.com/NeuromorphicComputationResearchProgram .
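As a small illustration of the large byte n-grams the abstract refers to, the sketch below extracts 8-byte n-grams and keeps those shared by every sample. It covers only a candidate-feature step under simplifying assumptions; the biclustering algorithm and rule construction used by AutoYara are not shown, and the byte strings are invented.

```python
def byte_ngrams(data: bytes, n: int = 8) -> set:
    """Return the set of all contiguous n-byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def common_ngrams(samples, n: int = 8) -> set:
    """Return the n-grams that occur in every sample."""
    sets = [byte_ngrams(sample, n) for sample in samples]
    return set.intersection(*sets) if sets else set()


# Invented byte strings standing in for two files of the same family.
samples = [b"xxMALWAREKEYyy", b"zzzMALWAREKEYq"]
shared = common_ngrams(samples, n=8)
```

Byte strings found in all samples of a family, and rare elsewhere, are natural candidates for the string conditions of a signature.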
A New Burrows Wheeler Transform Markov Distance

Raff, Edward, Nicholas, Charles, and McLean, Mark

In The Thirty-Fourth AAAI Conference on Artificial Intelligence 2020

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs

Ordun, Catherine, Purushotham, Sanjay, and Raff, Edward

In epiDAMIK 2020: 3rd epiDAMIK ACM SIGKDD International Workshop on Epidemiology meets Data Mining and Knowledge Discovery 2020

This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching and second, topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would start to uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), that identifies unique clustering-behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades, by visualizing the connections of users over time from fast to slow retweeting. As the time to retweet increases, the density of connections also increases; in our sample, we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.
Cluster Quality Analysis Using Silhouette Score

Shahapure, Ketan Rajshekhar, and Nicholas, Charles

In 7th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2020, Sydney, Australia, October 6-9, 2020 2020
A Quantum Algorithm To Locate Unknown Hashes For Known N-Grams Within A Large Malware Corpus

Allgood, Nicholas R., and Nicholas, Charles K.

CoRR 2020

2019

A Step Toward Quantifying Independently Reproducible Machine Learning Research

Raff, Edward

In Advances in Neural Information Processing Systems 2019

What makes a paper independently reproducible? Debates on reproducibility center around intuition or assumptions but lack empirical results. Our field focuses on releasing code, which is important, but is not sufficient for determining reproducibility. We take the first step toward a quantifiable answer by manually attempting to implement 255 papers published from 1984 until 2017, recording features of each paper, and performing statistical analysis of the results. For each paper, we did not look at the authors' code, if released, in order to prevent bias toward discrepancies between code and paper.
PyLZJD: An Easy to Use Tool for Machine Learning

Raff, Edward, Aurelio, Joe, and Nicholas, Charles

In Proceedings of the 18th Python in Science Conference 2019

KiloGrams: Very Large N-Grams for Malware Classification

Raff, Edward, Fleming, William, Zak, Richard, Anderson, Hyrum, Finlayson, Bill, Nicholas, Charles K., and Mclean, Mark

In Proceedings of KDD 2019 Workshop on Learning and Mining for Cybersecurity (LEMINCS’19) 2019

N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of n are tested, with n > 6 being exceedingly rare. Larger values of n are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-k most frequent n-grams that is 60× faster for small n, and can tackle large n ≥ 1024. Despite the unprecedented size of n considered, we show how these features still have predictive ability for malware classification tasks. More important, large n-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common n-grams in a file may be added as features to publicly available human-engineered features that rival efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
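
For orientation, the task described in this abstract (finding the top-k most frequent byte n-grams) can be illustrated with a brute-force baseline. This is only a minimal sketch of exact counting, not the KiloGrams method itself, which the abstract does not detail; the function name `top_k_ngrams` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k byte n-grams by brute-force counting (naive baseline)."""
    counts: Counter = Counter()
    for data in files:
        # Slide a window of width n over the bytes; inputs shorter than n add nothing.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


# Hypothetical toy corpus standing in for real executables.
corpus = [b"MZ\x90\x00MZ\x90\x00", b"\x00MZ\x90\x00\x01"]
print(top_k_ngrams(corpus, n=3, k=2))
```

Memory in this baseline grows with the number of distinct n-grams, which is the cost that makes large n impractical without a more careful approach.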
Barrage of random transforms for adversarially robust defense

Raff, E., Sylvester, J., Forsyth, S., and McLean, M.

In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019

Defenses against adversarial examples, when using the ImageNet dataset, are historically easy to defeat. The common understanding is that a combination of simple image transformations and other various defenses are insufficient to provide the necessary protection when the obfuscated gradient is taken into account. In this paper, we explore the idea of stochastically combining a large number of individually weak defenses into a single barrage of randomized transformations to build a strong defense against adversarial attacks. We show that, even after accounting for obfuscated gradients, the Barrage of Random Transforms (BaRT) is a resilient defense against even the most difficult attacks, such as PGD. BaRT achieves up to a 24x improvement in accuracy compared to previous work, and has even extended effectiveness out to a previously untested maximum adversarial perturbation of ϵ=32.
Non-Negative Networks Against Adversarial Attacks

Fleshman, William, Raff, Edward, Sylvester, Jared, Forsyth, Steven, and McLean, Mark

AAAI-2019 Workshop on Artificial Intelligence for Cyber Security 2019

Adversarial attacks against Neural Networks are a problem of considerable importance, for which effective defenses are not yet readily available. We make progress toward this problem by showing that non-negative weight constraints can be used to improve resistance in specific scenarios. In particular, we show that they can provide an effective defense for binary classification problems with asymmetric cost, such as malware or spam detection. We also show how non-negativity can be leveraged to reduce an attacker’s ability to perform targeted misclassification attacks in other domains such as image processing.
Would a File by Any Other Name Seem as Malicious?

Nguyen, Andre T, Raff, Edward, and Sant-Miller, Aaron

In 2019 IEEE International Conference on Big Data (Big Data) 2019

2018

Malware Detection by Eating a Whole EXE

Raff, Edward, Barker, Jon, Sylvester, Jared, Brandon, Robert, Catanzaro, Bryan, and Nicholas, Charles

In AAAI Workshop on Artificial Intelligence for Cyber Security 2018

In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community. Building a neural network for such a problem presents a number of interesting challenges that have not occurred in tasks such as image processing or NLP. In particular, we note that detection from raw bytes presents a sequence problem with over two million time steps and a problem where batch normalization appears to hinder the learning process. We present our initial work in building a solution to tackle this problem, which has linear complexity dependence on the sequence length, and allows for interpretable sub-regions of the binary to be identified. In doing so we will discuss the many challenges in building a neural network to process data at this scale, and the methods we used to work around them.
Static Malware Detection & Subterfuge: Quantifying the Robustness of Machine Learning and Current Anti-Virus

Fleshman, William, Raff, Edward, Zak, Richard, McLean, Mark, and Nicholas, Charles

In 2018 13th International Conference on Malicious and Unwanted Software (MALWARE) 2018

As machine-learning (ML) based systems for malware detection become more prevalent, it becomes necessary to quantify the benefits compared to the more traditional anti-virus (AV) systems widely used today. It is not practical to build an agreed upon test set to benchmark malware detection systems on pure classification performance. Instead we tackle the problem by creating a new testing methodology, where we evaluate the change in performance on a set of known benign & malicious files as adversarial modifications are performed. The change in performance combined with the evasion techniques then quantifies a system’s robustness against that approach. Through these experiments we are able to show in a quantifiable way how purely ML based systems can be more robust than AV products at detecting malware that attempts evasion through modification, but may be slower to adapt in the face of significantly novel attacks.
Engineering a Simplified 0-Bit Consistent Weighted Sampling

Raff, Edward, Sylvester, Jared, and Nicholas, Charles

In Proceedings of the 27th ACM International Conference on Information and Knowledge Management 2018

Gradient Reversal Against Discrimination : A Fair Neural Network Learning Approach

Raff, Edward, and Sylvester, Jared

In The 5th IEEE International Conference on Data Science and Advanced Analytics (DSAA) 2018

Lempel-Ziv Jaccard Distance, an effective alternative to ssdeep and sdhash

Raff, Edward, and Nicholas, Charles K.

Digital Investigation 2018

Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to measure the similarity between binary byte sequences for malware classification. We propose and test LZJD’s effectiveness as a similarity digest hash for digital forensics. To do so we develop a high performance Java implementation with the same command-line arguments as sdhash, making it easy to integrate into existing work-flows. Our testing shows that LZJD is effective for this task, and significantly outperforms sdhash and ssdeep in its ability to match related file fragments and is faster at comparison time.
Hash-Grams: Faster N-Gram Features for Classification and Malware Detection

Raff, Edward, and Nicholas, Charles

In Proceedings of the ACM Symposium on Document Engineering 2018 2018

N-grams have long been used as features for classification problems, and their distribution often allows selection of the top-k occurring n-grams as a reliable first-pass to feature selection. However, this top-k selection can be a performance bottleneck, especially when dealing with massive item sets and corpora. In this work we introduce Hash-Grams, an approach to perform top-k feature mining for classification problems. We show that the Hash-Gram approach can be up to three orders of magnitude faster than exact top-k selection algorithms. Using a malware corpus of over 2 TB in size, we show how Hash-Grams retain comparable classification accuracy, while dramatically reducing computational requirements.
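
The general idea of replacing exact n-gram counting with hashed counts can be sketched as follows. This is a hedged illustration of the hashing trick under stated assumptions, not the authors' Hash-Grams implementation: the choice of `zlib.crc32` as the hash, the bucket count, and the function names `hashed_top_k` and `featurize` are all hypothetical.

```python
import zlib
from typing import Iterable, List


def hashed_top_k(files: Iterable[bytes], n: int, num_buckets: int, k: int) -> List[int]:
    """Return the ids of the k most frequently hit hash buckets."""
    counts = [0] * num_buckets  # fixed-size table, independent of distinct n-grams
    for data in files:
        for i in range(len(data) - n + 1):
            counts[zlib.crc32(data[i:i + n]) % num_buckets] += 1
    ranked = sorted(range(num_buckets), key=lambda b: counts[b], reverse=True)
    return ranked[:k]


def featurize(data: bytes, n: int, num_buckets: int, selected: List[int]) -> List[int]:
    """Count how often each selected bucket is hit within a single file."""
    index = {bucket: j for j, bucket in enumerate(selected)}
    vec = [0] * len(selected)
    for i in range(len(data) - n + 1):
        j = index.get(zlib.crc32(data[i:i + n]) % num_buckets)
        if j is not None:
            vec[j] += 1
    return vec


selected = hashed_top_k([b"abcabcabc"], n=3, num_buckets=1 << 16, k=3)
print(featurize(b"abcabcabc", 3, 1 << 16, selected))
```

In this sketch, distinct n-grams that collide in the same bucket are counted together, so the selected features are approximate; the trade is a fixed memory footprint in exchange for exactness.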

2017

Learning the PE Header, Malware Detection with Minimal Domain Knowledge

Raff, Edward, Sylvester, Jared, and Nicholas, Charles

In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security 2017

Many efforts have been made to use various forms of domain knowledge in malware detection. Currently there exist two common approaches to malware detection without domain knowledge, namely byte n-grams and strings. In this work we explore the feasibility of applying neural networks to malware detection and feature learning. We do this by restricting ourselves to a minimal amount of domain knowledge in order to extract a portion of the Portable Executable (PE) header. By doing this we show that neural networks can learn from raw bytes without explicit feature construction, and perform even better than a domain knowledge approach that parses the PE header into explicit features.
What can N-grams learn for malware detection?

Zak, Richard, Raff, Edward, and Nicholas, Charles

In 2017 12th International Conference on Malicious and Unwanted Software (MALWARE) 2017

Recent work has shown that byte n-grams learn mostly low entropy features, such as function imports and strings, which has brought into question whether byte n-grams can learn information corresponding to higher entropy levels, such as binary code. We investigate that hypothesis in this work by performing byte n-gram analysis on only specific sub-sections of the binary file, and compare to results obtained by n-gram analysis on assembly code generated from disassembled binaries. We do this by leveraging the change in model performance and ensembles to glean insights about the data. In doing so we discover that byte n-grams can learn from the code regions, but do not necessarily learn any new information. We also discover that assembly n-grams may not be as effective as previously thought and that disambiguating instructions by their binary opcode, an approach not previously used for malware detection, is critical for model generalization.
An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance

Raff, Edward, and Nicholas, Charles

In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17 2017

The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.
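
As the name suggests, the distance combines a Lempel-Ziv style dictionary with the Jaccard similarity between two such dictionaries. The sketch below shows an exact-set version under that reading; it is a minimal illustration, and it omits any hashing or sampling that a faster implementation would likely use. The helper names `lz_set` and `lzjd` are chosen here for illustration.

```python
def lz_set(data: bytes) -> set:
    """Build the set of previously unseen substrings, Lempel-Ziv style."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New phrase: record it and start the next phrase after it.
            seen.add(chunk)
            start = end
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the two Lempel-Ziv sets (0 = same, 1 = disjoint)."""
    sa, sb = lz_set(a), lz_set(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(sa & sb) / len(union)


print(lzjd(b"abracadabra" * 4, b"abracadabra" * 3 + b"xyz"))
```

Because each input is reduced to a set in a single pass, comparing two sequences in this sketch needs only set intersection and union rather than a compression run over their concatenation.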
JSAT: Java Statistical Analysis Tool, a Library for Machine Learning

Raff, Edward

Journal of Machine Learning Research 2017

Java Statistical Analysis Tool (JSAT) is a Machine Learning library written in pure Java. It works to fill a void in the Java ecosystem for a general purpose library that is relatively high performance and flexible, which is not adequately fulfilled by Weka (Hall et al., 2009) and Java-ML (Abeel et al., 2009). Almost all of the algorithms are independently implemented using an Object-Oriented framework. JSAT is made available under the GNU GPL license here: github.com/EdwardRaff/JSAT.
Malware Classification and Class Imbalance via Stochastic Hashed LZJD

Raff, Edward, and Nicholas, Charles

In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security 2017

There are currently few methods that can be applied to malware classification problems which don’t require domain knowledge to apply. In this work, we develop our new SHWeL feature vector representation, by extending the recently proposed Lempel-Ziv Jaccard Distance. These SHWeL vectors improve upon LZJD’s accuracy, outperform byte n-grams, and allow us to build efficient algorithms for both training (a weakness of byte n-grams) and inference (a weakness of LZJD). Furthermore, our new SHWeL method also allows us to directly tackle the class imbalance problem, which is common for malware-related tasks. Compared to existing methods like SMOTE, SHWeL provides significantly improved accuracy while reducing algorithmic complexity to O(N). Because our approach is developed without the use of domain knowledge, it can be easily re-applied to any new domain where there is a need to classify byte sequences.
Document Engineering Issues in Malware Analysis

Nicholas, Charles K.

In Proceedings of the 2017 ACM Symposium on Document Engineering, DocEng 2017, Valletta, Malta, September 4-7, 2017 2017

2016

An investigation of byte n-gram features for malware classification

Raff, Edward, Zak, Richard, Cox, Russell, Sylvester, Jared, Yacci, Paul, Ward, Rebecca, Tracy, Anna, McLean, Mark, and Nicholas, Charles

Journal of Computer Virology and Hacking Techniques 2016

Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams previously have been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.
Document Engineering Issues in Malware Analysis

Nicholas, Charles K., and Brandon, Robert

In Proceedings of the 2016 ACM Symposium on Document Engineering, DocEng 2016, Vienna, Austria, September 13 - 16, 2016 2016

2015

Document Engineering Issues in Document Analysis

Nicholas, Charles K., and Brandon, Robert

In Proceedings of the 2015 ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, September 8-11, 2015 2015

2013

Document engineering education: workshop report

Nicholas, Charles K., and Munson, Ethan V.

SIGWEB Newsl. 2013
Change-link 2.0: a digital forensic tool for visualizing changes to shadow volume data

Leschke, Timothy R., and Nicholas, Charles K.

In 10th Workshop on Visualization for Cyber Security, VizSec 2013, Atlanta, GA, USA, October 14, 2013 2013

2009

Translation Corpus Source and Size in Bilingual Retrieval

McNamee, Paul, Mayfield, James, and Nicholas, Charles K.

In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA, Short Papers 2009
Addressing morphological variation in alphabetic languages

McNamee, Paul, Nicholas, Charles K., and Mayfield, James

In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009 2009

2008

Topological analysis of an online social network for older adults

Wilson, Marcella, and Nicholas, Charles K.

In Proceeding of the 2008 ACM Workshop on Search in Social Media, SSM 2008, Napa Valley, California, USA, October 30, 2008 2008
Don’t have a stemmer?: be un+concern+ed

McNamee, Paul, Nicholas, Charles K., and Mayfield, James

In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, Singapore, July 20-24, 2008 2008

2007

Building initial partitions through sampling techniques

Volkovich, Vladimir, Kogan, Jacob, and Nicholas, Charles K.

Eur. J. Oper. Res. 2007

2006

Sampling Methods for Building Initial Partitions

Volkovich, Zeev, Kogan, Jacob, and Nicholas, Charles K.

2006
Grouping Multidimensional Data - Recent Advances in Clustering

2006

2005

Data Driven Similarity Measures for k-Means Like Clustering Algorithms

Kogan, Jacob, Teboulle, Marc, and Nicholas, Charles K.

Inf. Retr. 2005

2004

Finding aliases on the web using latent semantic analysis

Bhat, Vinay, Oates, Tim, Shanbhag, Vishal, and Nicholas, Charles K.

Data Knowl. Eng. 2004

2003

Text mining with information-theoretic clustering

Kogan, Jacob, Nicholas, Charles K., and Volkovich, Vladimir

Comput. Sci. Eng. 2003
UMBC at TREC 12

Kallurkar, Srikanth, Shi, Yongmei, Cost, R. Scott, Nicholas, Charles K., Java, Akshay, James, Christopher, Rajavaram, Sowjanya, Shanbhag, Vishal, Bhatkar, Sachin, and Ogle, Drew

In Proceedings of The Twelfth Text REtrieval Conference, TREC 2003, Gaithersburg, Maryland, USA, November 18-21, 2003 2003

2002

ITtalks: A Case Study in the Semantic Web and DAML+OIL

Cost, R. Scott, Finin, Timothy W., Joshi, Anupam, Peng, Yun, Nicholas, Charles K., Soboroff, Ian, Chen, Harry, Kagal, Lalana, Perich, Filip, Zou, Youyong, and Tolia, Sovrin

IEEE Intell. Syst. 2002
Related, but not Relevant: Content-Based Collaborative Filtering in TREC-8

Soboroff, Ian, and Nicholas, Charles K.

Inf. Retr. 2002
Integrating Distributed Information Sources with CARROT II

Cost, R. Scott, Kallurkar, Srikanth, Majithia, Hemali, Nicholas, Charles K., and Shi, Yongmei

In Cooperative Information Agents VI, 6th International Workshop, CIA 2002, Madrid, Spain, September 18-20, 2002, Proceedings 2002
CARROT II and the TREC 11 Web Track

Cost, R. Scott, Kallurkar, Srikanth, Majithia, Hemali, Nicholas, Charles K., and Shi, Yongmei

In Proceedings of The Eleventh Text REtrieval Conference, TREC 2002, Gaithersburg, Maryland, USA, November 19-22, 2002 2002
Agents Making Sense of the Semantic Web

Kagal, Lalana, Perich, Filip, Chen, Harry, Tolia, Sovrin, Zou, Youyong, Finin, Timothy W., Joshi, Anupam, Peng, Yun, Cost, R. Scott, and Nicholas, Charles K.

In Innovative Concepts for Agent-Based Systems, First International Workshop on Radical Agent Concepts, WRAC 2002, McLean, VA, USA, January 16-18, 2002, Revised Papers 2002

2001

Ranking Retrieval Systems without Relevance Judgments

Soboroff, Ian, Nicholas, Charles, and Cahan, Patrick

In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2001

The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments. Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.
ITTALKS: An Application of Agents in the Semantic Web

Perich, Filip, Kagal, Lalana, Chen, Harry, Tolia, Sovrin, Zou, Youyong, Finin, Timothy W., Joshi, Anupam, Peng, Yun, Cost, R. Scott, and Nicholas, Charles K.

In Engineering Societies in the Agents World II, Second International Workshop, ESAW 2001, Prague, Czech Republic, July 7, 2001, Revised Papers 2001
ITTALKS: A Case Study in the Semantic Web and DAML

Cost, R. Scott, Finin, Timothy W., Joshi, Anupam, Peng, Yun, Nicholas, Charles K., Chen, Harry, Kagal, Lalana, Perich, Filip, Zou, Youyong, and Tolia, Sovrin

In Proceedings of SWWS’01, The first Semantic Web Working Symposium, Stanford University, California, USA, July 30 - August 1, 2001 2001
Ranking Retrieval Systems without Relevance Judgments

Soboroff, Ian, Nicholas, Charles K., and Cahan, Patrick

In SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA 2001
Case Study: Visualization and Information Retrieval Techniques for Network Intrusion Detection

Atkison, Travis, Pensy, Kathleen, Nicholas, Charles K., Ebert, David S., Atkison, Rebekah, and Morris, Chris

In 3rd Joint Eurographics - IEEE TCVG Symposium on Visualization, VisSym 2001, Ascona, Switzerland, May 28-30, 2001 2001

2000

Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System

Miller, Ethan, Shen, Dan, Liu, Junli, and Nicholas, Charles K.

J. Digit. Inf. 2000
Collaborative filtering and the generalized vector space model

Soboroff, Ian, and Nicholas, Charles K.

In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece 2000

1999

Interactive Volumetric Information Visualization for Document Corpus Management

Shaw, Christopher D., Kukla, James M., Soboroff, Ian, Ebert, David S., Nicholas, Charles K., Zwa, Amen, Miller, Ethan L., and Roberts, D. Aaron

Int. J. Digit. Libr. 1999
Workshop on Recommender Systems: Algorithms and Evaluation

Soboroff, Ian, Nicholas, Charles K., and Pazzani, Michael J.

SIGIR Forum 1999
Techniques for Gigabyte-Scale N-gram Based Information Retrieval on Personal Computers

Miller, Ethan L., Shen, Dan, Liu, Junli, Nicholas, Charles K., and Chen, Ting

In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 1999, June 28 - July 1, 1999, Las Vegas, Nevada, USA 1999

1998

Spotting Topics with the Singular Value Decomposition

Nicholas, Charles K., and Dahlberg, Randall

In Principles of Digital Document Processing, 4th International Workshop, PODDP’98, Saint Malo, France, March 29-30, 1998, Proceedings 1998

1997

TKQML: A Scripting Tool for Building Agents

Cost, R. Scott, Soboroff, Ian, Lakhani, Jeegar, Finin, Timothy W., Miller, Ethan L., and Nicholas, Charles K.

In Intelligent Agents IV, Agent Theories, Architectures, and Languages, 4th International Workshop, ATAL ’97, Providence, Rhode Island, USA, July 24-26, 1997, Proceedings 1997
Visualizing Document Authorship Using n-grams and Latent Semantic Indexing

Soboroff, Ian, Nicholas, Charles K., Kukla, James M., and Ebert, David S.

In Proceedings of the Workshop on New Paradigms in Information Visualization and Manipulation (NPIV ’97), in conjunction with CIKM ’97, November 10-14, 1997, Las Vegas, NV, USA 1997
Agent Development Support for Tcl

Cost, R. Scott, Soboroff, Ian, Lakhani, Jeegar, Finin, Tim, Miller, Ethan L., and Nicholas, Charles K.

In Proceedings of the Fifth Annual Tcl/Tk Workshop 1997, Boston, Massachusetts, USA, July 14-17, 1997 1997

1996

TELLTALE: Experiments in a Dynamic Hypertext Environment for Degraded and Multilingual Data

Pearce, Claudia, and Nicholas, Charles K.

J. Am. Soc. Inf. Sci. 1996

1995

Reliability of WWW Name Servers

Rowe, Kenneth E., and Nicholas, Charles K.

Comput. Networks ISDN Syst. 1995

1993

Canto: a Hypertext Data Model

Nicholas, Charles K., and Rosenberg, Linda H.

Electron. Publ. 1993
Information and Knowledge Management: Guest Editors’ Introduction

Nicholas, Charles K., and Yesha, Yelena

Int. J. Cooperative Inf. Syst. 1993
Snitch: Augmenting Hypertext Documents with a Semantic Net

Mayfield, James, and Nicholas, Charles K.

Int. J. Cooperative Inf. Syst. 1993
Generating a Dynamic Hypertext Environment with n-gram Analysis

Pearce, Claudia, and Nicholas, Charles K.

In CIKM 93, Proceedings of the Second International Conference on Information and Knowledge Management, Washington, DC, USA, November 1-5, 1993 1993

1992

On the Interchangeability of SGML and ODA

Nicholas, Charles K., and Welsch, Lawrence A.

Electron. Publ. 1992