Thursday, December 3, 2009

Evolutionary Processes and Natural Selection Reloaded

There are four basic evolutionary processes: Natural Selection, Genetic Drift, Undirected Mutation, and Gene Flow, all of which operate on populations of entities. The interplay between these processes can enhance or suppress the fitness of the individuals within a given population.

The process most commonly discussed in evolutionary biology is Natural Selection. In its basic formulation, natural selection requires only four conditions to operate on a population (based upon those found in Evolution, 3rd Edition by Mark Ridley):
  1. Reproduction - Entities must reproduce to form a new generation.
  2. Heredity - Entities produced via reproduction must tend to possess the characteristics (e.g. traits) from the previous generation.
  3. Individual Variation - The population of entities is not identical.
  4. Characteristic Fitness - Individual characteristics have varying degrees of fitness which allows them to propagate their traits to subsequent generations.
There are a number of issues when attempting to directly apply Natural Selection from evolutionary biology to information security using a strict interpretation of the required conditions.
  1. Reproduction - Most entities within information systems are installed or copied onto other information systems rather than truly reproducing. This form of reproduction is more akin to replication, which is essentially cloning as opposed to asexual reproduction. In asexual reproduction, each subsequent generation consists of identical or nearly identical copies produced as offspring, while cloning produces only identical copies.
  2. Heredity - The condition for heredity is easily satisfied. Computers are quite effective at producing exact copies of programs and data, and there are numerous methods for performing integrity checks to ensure that a replication event did indeed produce an identical copy.
  3. Individual Variation - Natural Selection requires that there is variability within a population. Within information security, programs that are installed or that replicate in an environment do so without any variability. A program is able to create exact copies of itself, and any errors within the replication routines will often cause fatal errors when the copy attempts to execute. Simply stated, programs are produced by installation or infection. There may be some variation within the population if the entity is polymorphic or metamorphic, but typically a program is created and then processed through a polymorphic encoder to produce the variations.
  4. Character Fitness - The fitness of an entity within a population varies based on the characters (traits) it has inherited. Some characters confer a higher fitness, while others confer a lower fitness. Characters with a higher fitness will tend to dominate in a population because the entities carrying them are more successful in reproducing. Because the population consists of cloned entities, individual variability has been eliminated or minimized, which means that there will be minimal variability in character fitness.
These issues limit the direct application of Natural Selection to information security. Alternatively, Natural Selection can be thought of as an evolutionary algorithm. The conditions can be relaxed and reinterpreted in a more general form to produce a selection algorithm which operates within information security.
  1. Reproduction can be reinterpreted as replication, or simply as a process or an algorithm that replicates an entity.
  2. Heredity can be reinterpreted simply as a process for passing characters from a parent entity to its offspring. Heredity allows individual characters to be linked between offspring and the previous generation.
  3. Individual Variation can be reinterpreted as a process in which different characters are generated based on the previous generation's characters. This could be as simple as an algorithm that incrementally modifies the parameters passed into an entity's control loop, which alters the interactions between the entity and its environment. As in biological systems, these parameters are modified during reproduction and are assumed to be relatively static during the lifespan of an entity.
  4. Character Fitness can be reinterpreted simply as a filtering function. Individual variation causes the fitness of entities to vary, so selection can act on the individual entities within the population: the higher fitness entities survive while the lower fitness entities are pruned from the population.
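
The four relaxed conditions above can be read as a small loop. The following is only a minimal sketch under stated assumptions: each entity is reduced to a single numeric parameter, the fitness function and its optimum are hypothetical, and the names `replicate`, `fitness`, `select`, and `evolve` are illustrative rather than taken from any existing tool.

```python
import random

def replicate(parent, step=0.05, rng=random):
    """Reproduction, heredity, and variation in one step:
    copy the parent's parameter, then nudge it slightly."""
    return parent + rng.uniform(-step, step)

def fitness(trait, optimum=1.0):
    """Hypothetical fitness: the closer to the optimum, the fitter."""
    return -abs(trait - optimum)

def select(population, keep):
    """Filtering function: keep the `keep` fittest entities, prune the rest."""
    return sorted(population, key=fitness, reverse=True)[:keep]

def evolve(population, generations=50, offspring_per_parent=2, seed=0):
    """Run replicate-then-filter for a fixed number of generations."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Parameters change only at reproduction, not during an entity's lifespan.
        children = [replicate(parent, rng=rng)
                    for parent in population
                    for _ in range(offspring_per_parent)]
        population = select(children, keep=size)
    return population

start = [0.0] * 10      # a cloned population with no initial variation
final = evolve(start)
```

Even though the starting population is a set of identical clones, the small variation introduced at replication gives the filter something to act on, so the surviving parameters move toward the assumed optimum over successive generations.
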
Natural Selection is the process operating in the environment. Artificial Selection describes selection processes used by an experimenter to modify a population. Both operate on a population in a similar manner, but Natural Selection is free from an experimenter's influence. Because of the filtering algorithm, natural selection, like artificial selection, is not a random process. Selection occurs in a nonrandom manner; only those entities that are able to survive the selection pressure are able to reproduce. There are three basic selection pressures: directional selection, disruptive selection, and stabilizing selection, which have already been discussed.
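
As a rough illustration, the three selection pressures can be written as three different filtering functions over a single numeric trait. This is a sketch only; the function shapes, the default parameters, and the `survivors` helper are assumptions chosen for brevity.

```python
def directional(trait, direction=1.0):
    """Fitness rises toward one extreme of the trait range."""
    return direction * trait

def stabilizing(trait, optimum=0.0):
    """Intermediate values near the optimum are favored."""
    return -(trait - optimum) ** 2

def disruptive(trait, midpoint=0.0):
    """Both extremes are favored over intermediate values."""
    return (trait - midpoint) ** 2

def survivors(population, fitness, keep):
    """Nonrandom filter: keep the `keep` fittest entities, sorted for display."""
    return sorted(sorted(population, key=fitness, reverse=True)[:keep])

population = [-2, -1, 0, 1, 2]
print(survivors(population, directional, 2))   # [1, 2]
print(survivors(population, stabilizing, 3))   # [-1, 0, 1]
print(survivors(population, disruptive, 2))    # [-2, 2]
```
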

Selection is readily evident in information security: cryptographic algorithms which are broken are slowly removed from general use and newer algorithms are designed. Selection usually does not cause changes to occur instantaneously unless the selection pressure is strong. MD5 is still in widespread use throughout the computing base despite having been known to be a weak algorithm for some time.

When considering how selection is applied to information security, it is important to understand that in evolutionary biology, when an entity is selected against, it has been removed from the reproductive (or effective) population. In most cases, this means that the entity has died. When an entity has been selected against in information security, it does not necessarily mean that the system has died. A more complete statement is that a system which has been selected against is no longer present or is no longer in its intended operating state. In the case of malware, this would mean that the malware has been removed from a system, or that its command and control infrastructure has been eradicated. In the case of an IT system, the system has been selected against if it has been compromised or if it has been wiped (as in the case of a complete rebuild). An organism that has been eliminated cannot be brought back to life; IT systems, on the other hand, can be wiped and rebuilt.

Another significant process within evolutionary biology is Undirected Mutation. Although life replicates with high fidelity, errors are still introduced when replication occurs. The errors cause the cellular processes to vary in ways that can enhance the organism's fitness, reduce its fitness, or have little effect on its fitness. Natural Selection works with Undirected Mutations to select for entities which have a higher fitness and prune out the entities with a lower fitness. Although there are evolutionary algorithms within the field of artificial intelligence, most of the processes that modify the behavior or enhance the functionality of programs are guided by Directed Mutation. Directed Mutations are mutations which are deliberately made with the goal of producing a desired effect. In the case of malware, the goal can range from increasing its ability to infect remote hosts, to hindering a malware analyst's ability to determine the true nature of the application, to simply getting past the latest anti-virus scanner definitions.

If Directed Mutations as a process are reinterpreted to include modifications at a larger scale, then it is tempting to think of directed mutations as being applied by an intelligent entity, most commonly referred to as an intelligent designer. Although the intelligent designer has no scientific basis in evolutionary biology, the concept can apply to information security in a more limited way. Evolutionary biology works without having an "intelligent designer" guiding the evolution and development of an entity. Information systems typically work with a designer and/or engineer who designs a system which is then implemented. The fitness of the system is then determined, and it can be revised during the next design and subsequent implementation. The application of a designer to information security does not require that a single overall designer exist. Indeed, the opposite is true. There are large numbers of individual designers operating and competing by proxy, through their fielded applications and programs, for systems and resources.

Genetic Drift is one of two evolutionary processes which can directly work against Natural Selection. In Genetic Drift, random inheritance of weakly or neutrally selected characters during reproduction can cause characters either to eventually dominate or to be removed from a population. Weakly selected characters are those which have only minor selection pressures working against them, while neutral characters effectively have no selection pressures operating on them. Genetic Drift can work against Natural Selection because even a character that provides a strongly enhanced fitness may, by chance, not be passed along to subsequent generations during reproduction. If there are two characters {A,B} and only one will be passed along, it will be either A or B. The other character can be lost unless it is present in sufficient numbers to ensure that, statistically, it is passed along. Unlike most of the other evolutionary processes, Genetic Drift does not have an easily identifiable analogy in information security other than personal preferences in the choice of browsers, office automation applications, operating systems, etc.
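
The {A,B} example can be sketched as a simple resampling simulation. This is an illustration under stated assumptions, not a model from the sources cited here: the population size is fixed, neither character has any fitness advantage, and the function name `drift` and its parameters are hypothetical.

```python
import random

def drift(count_a, pop_size, rng, max_generations=10_000):
    """Resample a fixed-size population each generation with no selection.

    Returns the final number of entities carrying character A:
    0 means A was lost, pop_size means A displaced B entirely.
    """
    for _ in range(max_generations):
        if count_a in (0, pop_size):
            break  # one character has been lost; nothing is left to drift
        freq_a = count_a / pop_size
        # Each offspring inherits A with probability equal to A's current frequency.
        count_a = sum(rng.random() < freq_a for _ in range(pop_size))
    return count_a

rng = random.Random(42)
# 200 independent populations of 20 entities, each starting with 10 A and 10 B.
outcomes = [drift(count_a=10, pop_size=20, rng=rng) for _ in range(200)]
lost = outcomes.count(0)
fixed = outcomes.count(20)
print(f"A lost in {lost} runs, A fixed in {fixed} runs")
```

With no selection pressure at all, chance alone removes one of the two characters from a small population; which one survives differs from run to run. Raising `pop_size` makes the loss take longer, which is the "sufficient numbers" point made above.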

The last of the four major evolutionary processes is Gene Flow. Gene Flow occurs when two populations having different allele frequencies interbreed (usually due to a period of isolation and then reintroduction). Typically in an isolated population, Natural Selection and Genetic Drift will alter the characters of the population from their original frequencies.  When the population encounters another population with which it can interbreed, the resulting interactions cause gene frequencies to change in the resulting population.  It is not required that the two populations be environmentally isolated.  Gene Flow can also occur if there is a strong selection pressure operating locally within the population (normally on a fringe population in which the environment is different from that of the main body).

When selection favors specific adaptations within a population, that population's adjusted gene frequencies may flow into another population with a different set of allele frequencies, altering the resulting gene frequencies for both populations. This situation can occur because a population has become isolated due to environmental conditions or because an adaptation favors a specific frequency in a subpopulation. Since the genes favored in the different populations can be different, and the intermixing of the genes results in intermediate allele frequencies, this process can actually work against Natural Selection, preventing optimal solutions from being established. As an example of this in evolutionary biology, Stephen Stearns and Richard Sage (Mal-Adaptation in a Marginal Population of the Mosquito Fish, Evolution, 1980) found that specific adaptations which could have increased the overall fitness of a border population of mosquito fish attempting to survive in fresh water were being hindered by gene flow resulting from interbreeding with the main population.
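
The way gene flow can hold a border population at an intermediate frequency can be sketched with a simple deterministic model. This is an assumption-laden illustration, not the analysis used by Stearns and Sage: it tracks a single locally favored allele, with `s` as a hypothetical selection advantage in the border population, `m` as the fraction of the border population replaced by migrants each generation, and `p_main` as the allele's frequency in the main population.

```python
def next_frequency(p, s, m, p_main):
    """One generation in the border population: selection, then immigration."""
    p_selected = p * (1 + s) / (1 + s * p)      # selection favors the local adaptation
    return (1 - m) * p_selected + m * p_main    # migrants pull it back toward p_main

def equilibrium(s, m, p_main=0.0, p0=0.5, generations=2000):
    """Iterate until the frequency has effectively settled."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, m, p_main)
    return p

isolated = equilibrium(s=0.1, m=0.0)    # no gene flow
mixed = equilibrium(s=0.1, m=0.05)      # moderate gene flow
swamped = equilibrium(s=0.1, m=0.2)     # gene flow stronger than selection
print(round(isolated, 3), round(mixed, 3), round(swamped, 3))
```

Under these assumptions the isolated population approaches a frequency of 1 for the favored allele, moderate gene flow holds it at an intermediate value of about 0.45, and strong gene flow removes the local adaptation entirely.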

A close parallel with Gene Flow is found in the formal education of programmers in producing secure code. In order to create and distribute programs, a developer does not need to be trained in how to create a secure program, only in how to create a functional one. Some organizations have deliberately allocated resources to train their developers in methods for developing and implementing secure programs. But if the organization is only able to attract new developers who have not received any training in a secure development lifecycle, it must expend resources to educate them. Depending on the turnover rate of the organization and project schedules, this recurring cost could be significant enough to cause the organization to lose its focus on developing a secure product.

Genetic Drift, Gene Flow, Natural Selection, and Undirected Mutation form the four basic processes of evolutionary biology. With little modification or reinterpretation, these processes can be applied to information security. Natural Selection becomes Artificial Selection, Undirected Mutation becomes Directed Mutation, Gene Flow is still represented, and Genetic Drift becomes a less important evolutionary process.