Erik M. Ferragut

Deep Learning

Sequence Classification using Deep Learning

A Domain Generation Algorithm (DGA) is a pseudorandom algorithm for producing domains that are then used by malware for command and control. Previous methods have demonstrated an ability to use examples of DGA domains to determine if a new domain was created by a DGA. While useful, the more relevant question is whether existing DGA examples can be used to generalize to {\em previously unseen} DGA domains.

To test the ability to detect DGA domains, we created a new data set comprised of both Alexa domains, which are assumed to be benign, and 21 classes of DGA domains reported in the community. Generalizability is tested by measuring accuracy on a class left out during training. We compare baseline methods derived from the literature to a number of new methods that augment the data set in different ways. It is expected that larger data sets lead to improved accuracy, but we have also shown that transfer learning is improved. Data sets are enlarged by: 1. using reverse engineered DGAs to construct actual, possible domains, 2. perturbing a vector space embedding of the domain and reversing it into a new, similar domain, and/or 3. generating new domains from a language model trained on given class.

We found that more data from the first source helps, but its value is difficult to ascertain since often when there is a known generative algorithm, there is very little data from collection. The perturbation of embedding (or of internal state) appears to often be deleterious, though it does often help. However, further analysis of the improvements seem to show no discernible pattern of performance enhancement. Data created using language modeling of domains showed no real improvement. This may be bacause it is simply a byproduct of the original data, but we hypothesize that an important factor is the inherently non-linguistic nature of domains generated by the known algorithms.

We intend to revisit this work in the context of a one-class classifier, where the data augmentation of benign domains may be more appropriate. (In progress)

Multiscale Architectures for Deep Learning

One standard Deep Learning approach to classifying sequences of discrete symbols is to train an embedding of the symbols into vectors together with a Long Short-Term Memory (LSTM) classification model. This approach was applied to data regarding system call patterns in both malicious and benign data (on both Linux and Windows). Unlike natural languages where related concepts are proximal in the text, computer traces may have large and irrelevant segments between the related portions. These gaps place a greater demand on the memory within an LSTM. Alternative methods for improving long term effects were explored in the context of sequence classification.

The approaches considered can be described as multiscale architectures. These are architectures where the lowest layer steps at a faster rate than the next layer up, which is faster than the layer above that. A number of multiscale architectures have been proposed in the literature. We focused on using an interpolated layer that uses a secondary input to produce an output that is interpolated between the input and output of an original layer. This can be thought of as taking partial steps. The control that determines the stepping can come from (a) the same input as the layer uses, (b) one layer further down, or (c) the original input sequence. Method (a) is the one that appears most in the literature, but we found that the other two methods were more effective on our cyber data, with (b) being the marginally best solution. (Paper in progress)

Manifold Learning

One challenge with understanding unlabeled data is that any correlations or constraints between the features can effectively force the probability mass of the data distribution to be concentrated on a lower-dimensional locus of points that may be complex and non-linear. The attempt to describe this locus is called manifold learning and adopts (as motivation) the idea of a manifold from topology, where a manifold is a collection of locally linear (i.e., Euclidean) patches that are connected by transformations in their overlapping areas.

Our focus has been on improved manifold learning for anomaly detection. In particular, what are the probabilistic assumptiong inherent in a given manifold learning approach? A recent development called Variational AutoEncoding (VAE) attempts to force the representation of the data to follow a normal distribution. This approach has both theoretical and application shortcomings. Other approaches attempt to maintain the local density from the input space; these may be more promising.

In the end, the best distribution on the representation space will depend on the learning task. Development of an empirically effective and operationally deployable manifold-learning-based anomaly detector is underway.

Learning from Simulation Data

ORNL has extensive expertise in developing and validating complex simulations. Typically, these simulations are computaitonally intensive and depend on a large number of parameters. As a result, exhaustive sampling of parameter settings is intractable. Inverse problems, in which parameters are sought to produce a particular simulation result, are especially difficult to address. We propose to use Deep Learning to attempt to learn a models for the relationships between parameters, simulation states, and qualitative results of the simulations. Such models could be used to accelerate simulations, reduce the amount of expert time required to qualitatively analyze results, yield better simulation data compression algorithms, and contribute to solving inverse problems. Ideally, experts could then use these tools to accelerate their scientific process. This work is still in a proposal stage.

Cyber-Physical Attack Detection (CPAD)

Power Grid Analysis

In a physical system, such as a power grid, there are often many sensors with views into the same physical state. Inevitably, these sensors must then have some dependencies. In the case of power grid measurements, current and voltage are related by Kirchhoff’s Laws. In this research we attempt to learn those relationships using machine learning (neural networks) and then use that model to detect data that has been tampered with, either because of component failure or because of malicious intent. A demonstration of the detection was developed. The trained algorithm was able to improve detection of replay attacks by a factor of 30 or more on both simulated and real data.

PowerPoint Slides: Erik M. Ferragut, Jason Laska, “Cyber Physical Attack Detection,” Presentation, May 2017.

Validation Across Thousands of Simulated Systems

Work is ongoing to apply CPAD to a range of thousands of simulated control systems. This analysis will allow us to (1) validate the learned physical laws, and (2) identify properties of systems that make CPAD especially well suited or poorly suited.

Applications to Industrial Plants, Vehicles, or Other Systems

CPAD should apply to any system with multiple views into the same physical space. We are interested in pursuing pilots to explore these applications on real, operationally relevant data.

Data Wrangling

Field Matching (Medical Data)

Working with medical provider and claims data, we were presented with the problem of having multiple similar databases where the field data matches but the field names do not. For example, one database might have a field named “doctor” that matches the field named “provider” in another database. Some methods exist that attempt to use the field names themselves. While these methods are promising, they require extensive domain knowledge. Furthermore, we found situations where the data stored in a field did not match the name of the field, so using the field names would have led to incorrect matches.

Our approach was to create a generate non-parametric probabilistic model for each field. We can then compute the probability of the data in each field separately and in the two fields as one. (To do this correctly, it is important that no data is scored according to a model whose parameters it influenced.) It is then possible to apply Bayes’s Rule to compare the hypothesis that one model describes both fields to the hypothesis that the fields are described by two different models.

To match two databases, a number of types of models are constructed for each field of each database, and then the rule is applied to produce match scores between all possible matches. In a proof-of-concept analysis, matches were found that seemed incorrect because of the given field names but that were in fact correct due to the actual usage of the fields.

Paper: Erik Ferragut, Jason Laska, “Nonparametric Bayesian Modeling for Automated Database Schema Matching,” International Conference on Machine Learning and Applications, December 2015.

Data Pre-Processing (ORDEAL)

As a side project, I am working on a more principled way to pre-process data for machine learning efforts. The basic idea is that pre-processing can be viewed as having three parts:

Make. Given some hierarchical data, including directory paths and JSON/YAML/HDF5 files, create an iterator that maps individual records into named tuples.
Map. Construct a map that creates or replaces fields in each named tuple based on the other fields in the tuple.
Reduce. Create a process that combines individual tuples into new tuples, such as taking the maximum in one field across those with the same values in other fields.
View. Given the resulting data, construct a view into that data that looks like either a dictionary (i.e., key-value pairs) or a matrix, possibly with named rows and columns.

The hope is to have a simple syntax for the various steps to enable a wide range of typical operations without the full complexity of general regular expressions. This work is ongoing.

Anomaly Detection

Defining Anomaly Detection

Anomaly detection is the attempt to find outliers. On the one hand, you can think of it as finding points that are far from the concentrations of probability. On the other hand, you can think of it as finding points that have low probability density. I have adopted and argued for the second approach. In particular, given a probability measure \(p\), we define a tail probability \(T(x)\) of a point \(x\) to be:

\[ T(x) = p( \{ y \mid dp(y) \leq dp(x) \} ), \]

the probability of the measure \(p\) generating an event that is no more likely than \(x\) (where \(dp(x)\) is the derivative of \(p\) with respect to some background measure, in the Radon-Nikodym sense). This is not the only way to define a tail probability. To be more specific, we can call it meta-rarity because it is a number indicating the rarity of the rarity. This definition has a number of advantages, such as allowing for the regulation of alerts that are generated according to \(p\). Another advantage is that since the tail probabilities is a sort of p-value (as in statistical significance), tail probabilities derived from different distributions are directly comparable, even if the underlying data are vastly different.

Paper: EM Ferragut, J Laska, RA Bridges, “A new, principled approach to anomaly detection,” International Conference on Machine Learning and Applications (ICMLA), 2012.

Meta-Rarity Is Optimal

The definition of tail probabilities given above has another advantage. If it is the case that your data are being generated according to \(p\), then it is the best way to distinguish it from data that is generated uniformly over the same space. The mathematical details of this statement require a lot of explanation, which is being written up in a paper. One way to think of it is as a variation of the Neyman-Pearson Lemma where one of the probability distributions is replaced with a (not necessarily finite) measure. (Paper in progress)

Scalable Models

To make this work deployable and useful, a way of building and updating models from streaming data was required. To do this, we focused on discretization of data (i.e., binned data). In this application, the discrete distribution is being estimated from the event counts, and the events are being scored for anomalousness with respect to the estimate of the distribution. The tail probability definition applies in this case since we are looking at a discrete measure. There are two main questions to address to make this scale:

How do we deal with previously unseen events? This is the question of priors and smoothing. The typical solution, Cauchy Smoothing, is to add a “virtual” observation (or a fraction of an observation) to each value before computing probabilities. However, this method breaks down if the space of possible values is not known in advance. In that case, a typical prior to use would be a Dirichlet process prior (aka stick breaking). However, this approach is most appropriate when the prior is a collection of countably many point masses drawn from a continuous probability distribution. It could probably be made to work, but it would require assuming a “background” distribution on all possible values, which kind of brings us back to the original issue. Instead, we adopted a “hack” which is simple and mathematically unjustified, but solves the problem. In particular, we simply augment the count of the new observation first and then compute its tail probability. The result is a robust anomaly score that tempers the anomalousness of a never before seen event with some notion of how much overall data has been seen.
How do we efficiently compute the tail probability? The simplest approach to computing the tail probability is to first sort the counts, and then sum up over all values not bigger than the one observed. However, then with each new observation, it may become necessary to move a value through the list to its new position in order to resort the list. By grouping events that have been observed the same number of times, this can be made much faster. Additional speed-ups were also explored and implemented (by other researchers).

This has been implemented into the ORNL system called Situ, and has been deployed at multiple sites.

Patent: Erik M. Ferragut, John R. Goodall, Michael D. Iannacone, Jason A. Laska, Lane T. Harrison, “Real-time detection and classification of anomalous events in streaming data,” US Patent No. 9319421, April 19, 2016.

Patent: Erik M Ferragut, Jason A Laska, Robert A Bridges, “Detection of anomalous events,” US Patent No. 9361463, June 7, 2016

Web Site: ORNL, “Project: Situ”, October 2016

Importance Sampling

For discrete distributions, computing tail probabilities is a matter of counting events that are as rare or more rare. For continuous distributions, we lose the computational benefits of discretization. In particular, no two events can be expected to have the same probability. For most realistic (i.e., multivariate) distributions, even simple ones, the tail probabilities are hard to compute. The main exceptions are the multivariate normal distribution, whose tail probabilities are described using a Chi-squared distribution, and a vector of independent and identically distribution exponential variables, whose tail probabilities are described using a Gamma distribution. (We have yet to find a reference for this latter result, but it was likely already known.)

In more general distributions, computing tail probabilities has no closed form solution. Operationally, this presents a huge difficulty. Given an observation, how can we estimate its tail probability. By looking at methods that combine Monte Carlo and importance sampling with regressions, we found we were able to improve the computational cost of estimating tail probabilities by many orders of magnitude. (Paper in progress)

Change Point Detection

One interesting observation about tail probabilities is that they are approximately uniformly distributed, differing from uniform only when there are sets of positive measure that have constant probability density (i.e., level sets of positive probability). This observation enables a method to measure the extent to which the underlying distribution \(p\) actually matches how the data are generated. If the tail probabilities are uniformly distributed, then the assumption is supported. If the probabilities deviate from a uniform distribution, then the assumption must be rejected. We intend to integrate this into our anomaly detection system, Situ. (Paper in progress)

Game Theory

Cyber Game Strategy Analysis

Cyber security lends itself to game-theoretic analysis because it is, foundationally, a contest between attacker and defender, or else between two agressors. Previous work that attempted to use game-theoretic analysis of cyber conflicts were overly simplistic so as to lend themselves to better analysis. One common simplification is that it is an iterated game where the players move simultaneously. In our work, we developed a game that is interesting because it is (1) potentially multiplayer, (2) based on turn-taking, (3) has incomplete information, and (4) involves probabilistic outcomes. These properties allow the game to be of much greater fidelity to real life conflicts. We then modeled each players knowledge of the game using Bayesian inference, and considered different strategies for making decisions based on the given information. The results confirmed some well known rules of thumb in cyber security, such as the importance of patching systems. We also discovered that with great uncertainty, the best strategy basically ignores what the opponents are doing. (This paper won the best paper award at its venue. Also of note, my co-authors were middle school and high school students.)

Paper: Erik M Ferragut, Andrew C Brady, Ethan J Brady, Jacob M Ferragut, Nathan M Ferragut, Max C Wildgruber, “HackAttack: Game-Theoretic Analysis of Realistic Cyber Conflicts,” Proceedings of the 11th Annual Cyber and Information Security Research Conference, April 2016.

Strategies from Reinforcement Learning

The greatest difficulty in the cyber game strategy analysis was that the combination of uncertainties was enormous, making the Bayesian inference burdensome. The next step in this work is to drop the explicit tracking of probabilities and develop a strategy optimization process based on reinforcement learning. (Ongoing)

Quantum Computation

Monte Carlo Simulation

Modeling and simulation are essential for predicting and verifying the behavior of fabricated quantum circuits, but existing simulation methods are either impractically costly or require an unrealistic simplification of error processes. We present a method of simulating noisy Clifford circuits that is both accurate and practical in experimentally relevant regimes. In particular, the cost is weakly exponential in the size and the degree of non-Cliffordness of the circuit. Our approach is based on the construction of exact representations of quantum channels as quasiprobability distributions over stabilizer operations, which are then sampled, simulated, and weighted to yield unbiased statistical estimates of circuit outputs and other observables. As a demonstration of these techniques, we simulate a Steane [[7,1,3]]-encoded logical operation with non-Clifford errors and compute its fault tolerance error threshold. We expect that the method presented here will enable studies of much larger and more realistic quantum circuits than was previously possible.

Paper: Ryan S Bennink, Erik M Ferragut, Travis S Humble, Jason A Laska, James J Nutaro, Mark G Pleszkoch, Raphael C Pooser, “Unbiased simulation of near-Clifford quantum circuits,” Physical Review A, 95:6, June 28, 2017.

Decomposition of Quantum Processes into Mixtures of Cliffords

The simulations above depend on having a description of a quantum process (or each step in a multi-step quantum process) as a mixture of Cliffords. Going from an arbitrary process matrix to a mixture of Cliffords is in general still intractable. However, for problems in one or two qubits, it can be done using a standard linear programming solution. For three qubits, I developed an algorithm to decompose sparse process matrices. Although it will not work in general, it was able to decompose a control-control-not. The decompositions can either be optimized to minimize the negativity of the distribution (best for Monte Carlo simulations) or the number of terms used (best for exhaustive summations). The former is very quick since it is an L1 optimization, but the latter, being a discrete optimization problem, tends to be much slower. (Paper in progress)

Special Case Simulations

Clifford simulations are a special case simulation. I am investigating other special cases, based on Cliffords, that are well suited to simulation.