Organizations confronting malicious, negligent and unintentional threats from their trusted insiders must make important policy, structural and procedural decisions as they stand up programs to mitigate these burgeoning threats. On top of that, they must choose from a bewildering array of insider threat detection and prevention solutions.
To help demystify the latter, the Intelligence and National Security Alliance (INSA) recently published a framework for organizing and evaluating a broad range of data analytics techniques currently deployed in insider threat programs. INSA’s “An Assessment of Data Analytics Techniques for Insider Threat Programs” is intended to be used by insider threat program managers as they consider the optimal combination of techniques for their programs’ specific requirements, implementation timing and resource constraints. INSA’s assessment is deliberately agnostic regarding specific solutions; the nonprofit forum does not seek to endorse any particular product or tool developer.
INSA’s framework methodology consists of two primary dimensions: 1) Whether a technique is descriptive or predictive; and 2) Whether a technique is traceable, or a “black box.” Descriptive techniques, such as rules-based engines, answer the question of “what has happened,” whereas predictive techniques, such as machine learning, try to determine “what could happen.” Some techniques are traceable — meaning that an auditor can map exactly how a risk has been calculated — and some techniques are more opaque (aka black box), meaning that the calculation is not easily traceable.
Once INSA had organized the many data analytics techniques along these dimensions, it took the further step of checking that no techniques overlapped (e.g., as subsets or variations of one another) and that no major categories were missing. This step was the most challenging, because techniques are continuously evolving and mutating.
Analytic Approaches
INSA determined that there are six major categories of analytics techniques: rules-based systems, correlation and regression statistics, Bayesian inference networks, machine learning (supervised), machine learning (unsupervised), and cognitive/neural networks/deep learning.
Rules-Based Systems: These are commonly used in insider threat programs to flag persons who may have exceeded a threshold, as determined by policy or risk indicator. Rules-based systems are generally binary: you either exceed a threshold or you don’t. Once a person is flagged, that flag is then assessed (often using other rules). For example, a person past due on debt by over 90 days will be flagged and evaluated in combination with other flags such as, say, a DUI. Rules-based systems are used extensively to help inform insider threat analysts of what they should focus on.
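To make the two-stage flagging concrete, here is a minimal Python sketch of the debt-and-DUI example. The 90-day threshold comes from the example above; the `Person` record, the function names and the rule that both flags together trigger a review are assumptions made purely for illustration, not anything INSA prescribes.

```python
from dataclasses import dataclass

# Threshold taken from the article's example; real values come from policy.
DEBT_PAST_DUE_DAYS_LIMIT = 90


@dataclass
class Person:
    """Hypothetical record holding only the two indicators in the example."""
    name: str
    debt_days_past_due: int
    has_dui: bool


def raise_flags(person: Person) -> list[str]:
    """First stage: apply each binary rule and return the rules that fired."""
    flags = []
    if person.debt_days_past_due > DEBT_PAST_DUE_DAYS_LIMIT:
        flags.append("debt_past_due")
    if person.has_dui:
        flags.append("dui")
    return flags


def needs_review(flags: list[str]) -> bool:
    """Second stage (assumed rule): escalate only when both flags are present."""
    return {"debt_past_due", "dui"} <= set(flags)


people = [
    Person("A", debt_days_past_due=120, has_dui=True),
    Person("B", debt_days_past_due=120, has_dui=False),
    Person("C", debt_days_past_due=90, has_dui=True),
]
for p in people:
    flags = raise_flags(p)
    print(p.name, flags, needs_review(flags))
```

Person C sits exactly on the 90-day line and so does not trip the debt rule, which shows how binary a threshold is: one more day past due would change the outcome.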
As the INSA report states: “Rules-based systems are relatively easy to understand and defend, and can be built to represent expert judgment on simple or complicated subjects. Cause and effect triggers are transparent — there is no ‘black box.’ Even though the if-then reasoning can become complex, a domain expert can verify the rule base and make adjustments when necessary.”
However, these systems have obvious challenges. Rules reflect policy information, such as what constitutes a negative event, and thus must be changed frequently to reflect policy changes. As more rules are added, the system becomes more complex and less scalable. Thus, the most effective systems use aggregations of rules, but these will still not necessarily assess “risk”; instead, they just provide alerts based on the information they are given.
As with machine learning, data is of primary importance to rules-based systems. Such a system does not handle incomplete information very well, and data that does not relate to a rule will often be disregarded, or perhaps not even detected in the first place. This means that rules-based systems don’t detect things that aren’t already coded into them. As a result, these systems generate many false positives and miss many potential threats.
Correlation and Regression Statistics: These techniques can help address gaps in rules-based systems; they essentially look at how strongly variables relate to each other. For example, if a person is flagged for financial issues and then also has a DUI, can we find any correlation between these two events? Is there a strong or weak correlation between the two, and why? Simple statistical tools can help us find patterns and, with enough data, produce some interesting insights.
However, correlations often do not show “cause and effect.” There may be a strong relationship between a DUI and financial problems, but more analysis is needed to find out why that may be the case and whether the DUI is impacting financial issues, or the other way around.
Finally, correlations may leave out unknown variables that also have a strong impact. We can see that DUIs and financial problems are strongly correlated, but what if the most important driver of the DUI trend is something else? If we added that variable to the analysis and saw a very close correlation between it and DUIs, then financial issues might suddenly not be that important after all.
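The effect of a left-out variable can be sketched in a few lines of Python. The scores below are invented so the arithmetic comes out cleanly, and `hidden_driver` is a hypothetical stand-in for whatever the missing variable might be. The sketch computes the ordinary Pearson correlation and then the standard first-order partial correlation, which asks how strongly two variables relate once a third is held constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented scores for eight people; none of this is real data.
hidden_driver   = [0, 0, 0, 0, 2, 2, 2, 2]  # hypothetical third variable
financial_score = [2, 0, 2, 0, 4, 2, 4, 2]
dui_score       = [2, 2, 0, 0, 4, 4, 2, 2]

raw = pearson(financial_score, dui_score)
controlled = partial_correlation(financial_score, dui_score, hidden_driver)
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for the hidden driver: {abs(controlled):.2f}")
```

In this toy data the raw correlation between the financial and DUI scores is 0.5, but it falls to essentially zero once the hidden driver is accounted for. Real data will rarely be this tidy, and even a partial correlation still says nothing about cause and effect.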
Bayesian Inference: The INSA paper describes Bayesian inference networks quite thoroughly: “A Bayesian network is a statistical model for reasoning about complex problem domains. Bayesian networks provide a way to make inferences even when evidence is missing, incomplete or inconsistent. Note that the inference results will depend on the strength, completeness and consistency of the evidence; strong consistent evidence will yield a strong Bayesian network inference, giving a clear answer. If evidence is sparse, weak or inconsistent, the Bayesian network will reflect the uncertainty inherent in the knowledge base. For example, if a subject’s spouse’s income is unknown, a comparison of known household debt levels to total household income will yield uncertain results. A Bayesian model can identify and quantify the uncertainty present in the resulting analysis.”
The key phrase above is “if evidence is sparse, weak or inconsistent, the Bayesian network will reflect the uncertainty inherent in the knowledge base.” Unlike correlations, Bayesian inference allows us to show uncertainties around lack of knowledge about the problem domain. Financial issues and DUIs may have a strong correlation, but a Bayesian inference network (aka model) may show it as weak, given that we don’t have enough information about other aspects of the insider threat problem set — such as a person’s external behaviors that might contribute to either of these problems.
INSA found that building a Bayesian network around the insider threat problem domain is a good way of providing a holistic, baseline view of what is truly important to know for detecting insider threats. Such a model-first approach is not subject to the limitations of a “data-first” approach, in which the data accessed is the data used and the context that could have come from viewing the problem holistically is missing.
Furthermore, Bayesian networks are transparent; analysts and decision-makers can trace the logic to determine how a risk score was calculated.
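A toy example may help show both properties: handling missing evidence and leaving a traceable calculation. The Python sketch below is a deliberately tiny network with one hypothesis node and two indicator nodes that are assumed to be independent given the hypothesis. Every probability in it is invented, and a real insider threat model would be far larger and built with domain experts.

```python
# All probabilities below are invented for illustration only.
PRIOR_ELEVATED = 0.05

# indicator -> (P(present | elevated risk), P(present | not elevated))
LIKELIHOODS = {
    "financial_stress": (0.60, 0.20),
    "dui": (0.30, 0.10),
}


def posterior_elevated(evidence):
    """Return (posterior, trace) for P(elevated risk | evidence).

    `evidence` maps an indicator name to True, False or None (unknown).
    In this simple structure an unknown indicator is marginalised out,
    which means it leaves the running estimate unchanged.
    """
    p_yes, p_no = PRIOR_ELEVATED, 1.0 - PRIOR_ELEVATED
    trace = [("prior", p_yes)]
    for name, observed in evidence.items():
        if observed is None:
            trace.append((name + " (unknown)", p_yes / (p_yes + p_no)))
            continue
        like_yes, like_no = LIKELIHOODS[name]
        if not observed:
            like_yes, like_no = 1.0 - like_yes, 1.0 - like_no
        p_yes *= like_yes
        p_no *= like_no
        trace.append((name, p_yes / (p_yes + p_no)))
    return p_yes / (p_yes + p_no), trace


full, full_trace = posterior_elevated({"financial_stress": True, "dui": True})
partial, _ = posterior_elevated({"financial_stress": True, "dui": None})
for step, value in full_trace:
    print(f"{step}: {value:.3f}")
print(f"with DUI status unknown: {partial:.3f}")
```

With both indicators observed, the estimate rises from the 0.05 prior to roughly 0.32; with the DUI status unknown it only reaches about 0.14. The weaker answer reflects the missing evidence rather than hiding it, and the trace lists each step an auditor would need to reproduce the score.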
The biggest challenge with Bayesian inference is that it can be time-consuming to build a useful model. Bayesian networks also have limitations in detecting truly anomalous behaviors, which machine learning is able to do. Thus, a combined Bayesian/machine-learning approach is one of the most powerful solutions for a comprehensive insider threat program.
Machine Learning: Machine learning is essentially learning from data. There are two types of machine learning: “supervised” and “unsupervised.” With supervised machine learning, an analyst will “train” a model by providing “labeled” data so that the model knows what output to expect. For example, the analyst will “train” the system to know what is “spam” and what is “not spam.” The analyst will feed labeled examples of each into the model and test it on new or unseen data. In this way it “learns,” and is able to identify new examples of spam based on the patterns it has been given. The majority of machine learning-based models are trained via supervised learning.
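The spam example can be sketched as a very small supervised learner. The Python below uses a naive Bayes word-count classifier with add-one smoothing, chosen here only because it fits in a few lines; the four training messages are invented, and a production model would need far more labeled data.

```python
from collections import Counter
from math import log


def train(labeled_messages):
    """Count words per label from (text, label) pairs."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    label_counts = Counter()
    for text, label in labeled_messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(model, text):
    """Return the label with the higher log-probability (add-one smoothing)."""
    word_counts, label_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["not spam"])
    total_msgs = sum(label_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = log(label_counts[label] / total_msgs)
        for word in text.lower().split():
            count = word_counts[label][word]  # 0 for words never seen
            score += log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented labeled examples: the "training" data supplied by the analyst.
training_data = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday", "not spam"),
]
model = train(training_data)
print(predict(model, "free prize"))      # spam
print(predict(model, "monday meeting"))  # not spam
```

Neither test message appears in the training data; the model labels them from the word patterns it was given, which is the sense in which it has "learned."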
Training a model to draw inferences from data without any labeled input is called “unsupervised learning.” The most common kind of unsupervised machine learning is “clustering,” in which similar items are grouped together so that items that do not fit any group stand out as anomalies.
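As a rough illustration of clustering used for anomaly detection, the Python sketch below runs a plain k-means loop over invented two-feature activity points and then flags anything far from every cluster center. The starting centroids, the number of clusters and the distance threshold of 3.0 are all assumptions chosen for this toy data.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids


def find_anomalies(points, centroids, threshold):
    """Flag points whose distance to the nearest centroid exceeds the threshold."""
    return [p for p in points if min(dist(p, c) for c in centroids) > threshold]


# Invented, unlabeled activity data: two tight groups and one odd point.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (8, 8), (8, 9), (9, 8), (9, 9),
    (5, 14),
]
centroids = k_means(points, centroids=[points[0], points[4]])
anomalies = find_anomalies(points, centroids, threshold=3.0)
print(anomalies)  # [(5, 14)]
```

No labels were supplied at any point; the odd point is flagged only because it sits far from both groups that the algorithm found on its own.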
As the INSA paper concludes: “Machine learning’s ability to cope with massive quantities of data is offset by the fact that it is entirely dependent on data, and thus is unable to offer solutions in cases where data is scarce.”
Cognitive/Neural Networks/Deep Learning: Essentially, this technique is a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. It is generally what we consider artificial intelligence (AI) today: able to do some things very well, but not (yet) able to do many things well in a generalized way, as most humans can.
As INSA states in its paper: “Unlike traditional programmable systems that are deterministic and thrive in structured data, cognitive systems are probabilistic and thrive in unstructured data while also reasoning and offering hypothesis based on their behavioral models.”
This technique is commonly used in tasks like natural language processing, emotion detection, event detection and behavioral analysis. It is also heavily dependent upon data.
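For readers who have never seen one, the Python sketch below shows the forward pass of a very small neural network: two inputs, two hidden units and one output. The weights are set by hand purely for illustration, whereas real systems learn them from large amounts of data, and the two inputs are hypothetical binary indicators. The network scores the pattern "exactly one indicator present," which no single weighted threshold on the two inputs can capture.

```python
from math import exp


def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def forward(x1, x2):
    """Forward pass through a 2-2-1 network with hand-set weights."""
    h_any = sigmoid(20 * x1 + 20 * x2 - 10)   # near 1 if at least one input is on
    h_both = sigmoid(20 * x1 + 20 * x2 - 30)  # near 1 only if both inputs are on
    # Output is high when "any" fires but "both" does not.
    return sigmoid(20 * h_any - 20 * h_both - 10)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(forward(x1, x2)))
```

The output is a score between 0 and 1 rather than a hard yes or no. With learned weights spread across many layers, it also becomes much harder to trace why a given score was produced.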
Conclusion
INSA’s conclusion is that the best programs often use combinations of the above-described techniques. As an example, probabilistic models can be usefully enhanced with rules-based triggers and machine learning algorithms that detect anomalies.
INSA provides the following recommendations to insider threat program managers:
- Program managers (PMs) should integrate data analytics into the risk management methodology they use to rationalize decision-making. Without question, the methodical analysis of available data can help organizations better identify, weigh and assess the factors that raise the likelihood a trusted insider will act maliciously.
- PMs should consider the specific analytic techniques explored in the INSA paper and assess which techniques are likely to be most effective given the available data, their organizational culture and their levels of risk tolerance. They should assess different combinations of techniques, as the availability of data may make some combinations more effective than others at identifying and calculating risk.
- Once PMs have decided on an analytic approach, they should evaluate the many software tools available to determine which most effectively analyze data using the preferred approach.
- PMs should assess the human and financial resources needed to launch a data analytics program, including the expense of software tools, the training and time needed to structure data and apply tools, and a clear definition of the skills program staff need to develop, maintain and execute a data analytics initiative over time. As with any technology, changes to technical capability, data availability and legal constraints can occur rapidly and significantly change the usefulness of individual analytic methodologies and software tools.
The views expressed here are the writer’s and are not necessarily endorsed by Homeland Security Today, which welcomes a broad range of viewpoints in support of securing our homeland.