Cybersecurity and AI/ML Biases

Cyberattackers and cyberdefenders appear to be using AI (artificial intelligence) and ML (machine learning) to a rapidly increasing degree, if you are to believe the press, vendors’ claims and blogs. So, it makes sense for cybersecurity professionals and researchers to get a better understanding of the biases that affect the AI/ML pipeline. A recent article, “Biases in AI Systems,” by Ramya Srinivasan and Ajay Chander in the August 2021 issue of Communications of the ACM, does an excellent job of laying out the various biases and makes some suggestions as to how to mitigate their negative impact.

The article is too detailed to describe in a short column, but I will list the stated biases. The CACM article presents an AI/ML Pipeline, similar to the Software Development Lifecycle (SDLC) used in software development. The AI/ML Pipeline (AMP?) has the following sequential phases:

  • Data creation
  • Problem formulation
  • Data analysis
  • Validation and testing

I question whether the problem shouldn’t be formulated first, as with the requirements phase of the SDLC, with the creation and analysis of data following the definition of the problem and the specification of requirements. Otherwise, the exercise risks becoming a so-called “fishing expedition,” in which the nature of the problem depends on whatever data happen to be available, rather than the data being sought out after the problem has been defined. This is somewhat analogous to my claim, in the November 2008 ISACA Journal article “Accounting for Value and Uncertainty in Security Metrics,” that the most useful metrics are often based on data that are more difficult and expensive to obtain.

The biases within the AMP phases are identified in the article as follows:

Data Creation Biases

  • Sampling bias—due to the selection of particular types of instances more than others, rendering the dataset under-representative of the real world
  • Measurement bias—introduced by errors in human measurement or because of intrinsic habits of those capturing data
  • Label bias—associated with inconsistencies in the data labeling process due to labelers’ different styles and preferences
  • Negative set bias—introduced as a consequence of not having enough samples representative of “the rest of the world”
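
Sampling bias, the first item above, lends itself to a quick sanity check: compare the class proportions in your training sample against the proportions in the population it is supposed to represent. The following is a minimal sketch of that idea; the traffic labels, counts, and the 10%-malicious population are entirely hypothetical, not drawn from the CACM article.

```python
# Crude sampling-bias check: compare a sample's class distribution
# against the known (or assumed) population distribution.
from collections import Counter

def class_distribution(labels):
    """Return the fraction of each class label in a dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def max_distribution_gap(population_labels, sample_labels):
    """Largest absolute difference in class proportions between the
    population and the sample -- a rough sampling-bias signal."""
    pop = class_distribution(population_labels)
    samp = class_distribution(sample_labels)
    return max(abs(pop.get(c, 0.0) - samp.get(c, 0.0))
               for c in set(pop) | set(samp))

# Hypothetical example: malicious traffic is 10% of the population
# but only 2% of the training sample -- an under-representative sample.
population = ["benign"] * 90 + ["malicious"] * 10
sample = ["benign"] * 49 + ["malicious"] * 1

gap = max_distribution_gap(population, sample)
print(f"largest class-proportion gap: {gap:.2f}")  # 0.08
```

In practice one would use a formal test (e.g., a chi-square goodness-of-fit test) rather than an eyeballed gap, but the underlying comparison is the same.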

Problem Formulation Bias

  • Framing effect bias—based on how the problem is formulated and how information is presented

Algorithm/Data Analysis Biases

  • Sample selection bias—introduced by the selection of individuals, groups, or data for analysis in such a way that the samples are not representative of the population intended to be analyzed
  • Confounding bias—arises if the algorithm learns the wrong relations by not considering all the information in the data
  • Design-related bias—solely introduced or added by the algorithm
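
Confounding bias is perhaps the least intuitive of these. A toy illustration, with entirely synthetic data of my own invention: suppose “night-time” traffic looks predictive of maliciousness only because one scanning tool (the confounder) happens both to run at night and to be malicious.

```python
# Synthetic records: the "tool" field confounds the apparent
# relationship between "night" and "malicious".
records = (
    [{"night": True,  "tool": "scanner", "malicious": True}] * 40 +
    [{"night": True,  "tool": "browser", "malicious": False}] * 10 +
    [{"night": False, "tool": "browser", "malicious": False}] * 50
)

def rate(rows, key):
    """Fraction of rows for which the boolean field `key` is True."""
    rows = list(rows)
    return sum(r[key] for r in rows) / len(rows)

# Naive view: night-time traffic appears strongly malicious...
print(rate((r for r in records if r["night"]), "malicious"))  # 0.8

# ...but conditioned on the tool, "night" carries no signal at all.
for tool in ("scanner", "browser"):
    rows = [r for r in records if r["tool"] == tool]
    print(tool, rate(rows, "malicious"))  # scanner 1.0, browser 0.0
```

An algorithm trained on the raw data would happily learn the spurious night/malicious relation, which is exactly the “wrong relations” the article warns about.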

Evaluation/Validation Biases

  • Human evaluation biases—due to such phenomena as confirmation bias, peak-end effect, prior beliefs (e.g., culture), and how much information can be recalled (“recall bias”)
  • Sample treatment bias—introduced in the process of selectively subjecting some sets of people to a type of treatment
  • Validation and test dataset biases—introduced from sample selection or label biases in the test and validation datasets or can result from the selection of inappropriate benchmarks/datasets for testing

The article asserts that “it may not be possible to eliminate all sources of bias,” but offers some guidelines, as follows:

  • Incorporate domain-specific knowledge (This is similar to what I advocate in my book “Engineering Safe and Secure Software Systems,” where I suggest including both infosecurity and safety experts throughout the SDLC)
  • Understand which features of the data are deemed sensitive based on the application
  • Ensure that datasets are representative of the true population, as far as possible
  • Lay out appropriate standards for annotating (labeling) the data
  • Include all variables that have dependencies with the target feature
  • Eliminate sources of confounding biases by appropriate data conditioning and randomization strategies in selecting input
  • Take care not to introduce sample selection bias in choosing subsets of data for analysis
  • Guard against the introduction of sample treatment bias
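
One concrete way to act on the “representative datasets” and “randomization” guidelines is stratified sampling: draw from each class separately, in proportion to its share of the population, so that random selection cannot silently under-represent a minority class. A minimal sketch follows; the record format, seed, and 10%-malicious split are hypothetical choices for illustration.

```python
# Stratified sampling: preserve class proportions when drawing a
# subset, as a guard against sample selection bias.
import random

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw `fraction` of records from each class separately, so the
    sample keeps the population's class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 10 malicious and 90 benign packets.
records = [("pkt%d" % i, "malicious" if i < 10 else "benign")
           for i in range(100)]
sample = stratified_sample(records, label_of=lambda r: r[1], fraction=0.2)
labels = [r[1] for r in sample]
print(labels.count("malicious"), labels.count("benign"))  # 2 18
```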

These guidelines make sense for AI/ML and should be considered when applying these technologies to cybersecurity systems and services. However, it should be recognized that many of these guidelines are “easier said than done.” Furthermore, consideration of these biases and guidelines is also appropriate for security metrics and for investment in cybersecurity measures generally.
