ABSTRACT
Data mining involves sifting through very large databases in search of frequently occurring patterns to detect trends and produce generalizations about the data content. This paper describes an approach, using genetic algorithms, to search through, or mine, a large set of database transactions in order to determine a relationship between the transactions. Consider an example database consisting of transactions representing products purchased at a retail store. This genetic algorithm implementation focuses on determining, out of 100 possible items, the four items that are most often purchased together. Information of this sort can be used by businesses to aid in inventory, sales, and marketing strategy planning. For example, such information can allow a business to focus on marketing opportunities that encompass sets or groups of products and thus maximize their advertising dollars.
INTRODUCTION
With the widespread proliferation of powerful and affordable computing and information gathering devices, we have seen a dramatic increase in the amount of electronic data that is being obtained and stored. It has been estimated that the amount of electronic information in the world doubles every 20 months, and the size and number of databases that store this information are increasing even faster [1].
Given that we have vast amounts of data stored and available to us, and new mountains of information are being gathered each day with every credit card and point-of-sale bar code transaction, businesses are now faced with the task of making use of all this information. For instance, one of the countries largest retailers, Wal-Mart, as of 1995, had 65 weeks of point-of-sale transaction data on-line, taking up more than 3.5 terabytes of storage [2]. Considering that one terabyte of data is equivalent to approximately two million books, it is obvious that new and powerful techniques will be required to make effective use of this much information.
It is generally recognized within the business community that there is untapped value in large databases [3]. It is also recognized that this information is vital to business operations, and that decision-makers need to make use of the stored data in order to be competitive. To accomplish this, the large businesses that own these terabytes of data have turned to what is being called data mining. Data mining employs methodologies to sift through huge databases in search of frequently occurring patterns to detect trends and produce generalizations about the data content. As one might suspect, this type of analysis not only requires the use of very sophisticated and powerful computers, designed specifically for the purpose of managing large amounts of data, but it also requires data mining application software that can efficiently and effectively search through vast amounts of data in order to recognize the possible relationships that exist.
Data mining encompasses many technologies and techniques which are used to identify hidden pieces of information within a database. Current technologies and techniques employed in many data mining environments include parallel computer architectures, relational database management systems (RDMS), data warehouses (which organize data in such a way that retrieval and analysis is greatly facilitated), and often, artificial intelligence and neural networks. These technologies, although important in the overall implementation of a data mining environment, are beyond the scope of the application examined in this paper.
This paper focuses solely on a genetic algorithm-based search technique that can be used in data mining to aid in determining relationships that may exist within a data set. In order to determine what relationships exist, many combinations of transaction comparisons and database queries must take place. Depending on the nature of the data and the relationships that are being sought, millions of data attribute combinations may have to be investigated.
For example, consider a database that contains N transactions, where each transaction consists of a list of items purchased. Also, assume that there are 100 different items that can appear in a transaction. Now, consider an effort to determine the four items that are most often purchased together. This implies that there are 100 items chosen 4 at a time, or 3,921,225 combinations of items available for investigation. If any useful output from the data mining tool is expected within a reasonable amount of time, an exhaustive comparison of all possible combinations is probably not feasible. Instead, an efficient and effective search technique is required for relationship discovery. This is where a genetic algorithm comes into play, and it is this problem that is addressed in this paper.
The problem addressed in this paper consists of using a genetic algorithm to search a synthetically generated database in order to discover a relationship between data items. The synthetic database contains N transactions, where N varies based on the test being run. Each transaction, similar to transactions found in a retailers database, consists of a list of items purchased, where each item is one of 100 different items. Given a database with the characteristics just described, the goal is then to determine the four items that are most often purchased together.
The remainder of this section discusses an overview of the problem implementation, the format of a synthetic data set, and a supporting program, the synthetic data generator.
Implementation Overview
If we are to find four related items out of a possible 100, this implies that there are 100 choose 4, or 3,921,225, combinations of items that we can potentially investigate. As previously mentioned, it is unreasonable to investigate all 3,921,225 combinations of potential relationships since, for each combination, the entire database (or a sufficiently large subset) has to be scanned to determine how often the four items appear together. To address this problem, a genetic algorithm is used to investigate a subset of the possible combinations. The intent was that by using a genetic-based search, the combination of four items that are most often purchased together could be determined much more efficiently than through an enumerative investigation of every possible combination. Genetic algorithm specifics on coding, fitness, and operators is given in the following section, Genetic Algorithm Specifics. Thus, an overview of the genetic algorithm in the data mining problem domain is presented next.
Before performing any type of genetic search, a coding scheme was determined to represent the four potentially related items. To prepare for the search, X sets (the population size) of four items are picked at random and then coded into strings based on the coding scheme. These X strings represent generation 0 of the search. Then, for each of the X sets of four coded items, the entire database (or a sufficiently large subset) is scanned to generate a statistic (the fitness) as to how often the four items are purchased together. This fitness is based on the frequency with which the set of four items appear within the database under investigation. In addition, if a fractional number of the items in the item list appear within a transaction, the fitness reflects that fraction.
Once a fitness value is determined for each set of items in the population, the three genetic operators (reproduction, crossover, and mutation) are performed to generate a new population of strings. After performing reproduction, crossover, and mutation, if all goes as planned, each successive population should contain more highly fit strings, ultimately leading to a set of the four most frequently occurring items within the database.
Synthetic Data Sets and Data Set Generation
Since the focus of this chapter is on the development and verification of a genetic algorithm-based data mining search approach, an actual database with real data was not used nor was it required to demonstrate the usefulness of the genetic algorithm. Instead, synthetic data sets were generated to demonstrate the application of the genetic algorithm. As described earlier, a synthetically generated data set consists of X transactions. Each transaction consists of a transaction number (Trans_Num), the number of items purchased in the transaction (N), and a list of N items, where each item is specified by a number (i.e. 0=Pretzels, 1=Aspirin, 2=Beer, 3=Bread, etc.). Also, N can be any of 100 different items. The format of a transaction and an example of a transaction containing fifteen items are shown below:
Transaction Format: Trans_Num N Item0 Item1
ItemN
Example Transaction: 0 15 1 2 3 10 19 55 42 41 22 83 20 20 87 92 6
Synthetic data sets were generated using a C program that allowed the user, through command line input, to specify data characteristics. Allowing the user control over data set characteristics facilitated verification of the genetic algorithm search results, since known relationships were embedded within the data sets. Then, the relationships that were discovered by the genetic search were confirmed against the expected results.
The synthetic data generator program required the user to provide at least one option to specify the number of transactions to generate. Other options allowed the user to specify individual items and the probability with which each of the specified items appeared within the synthetically generated data set.
GENETIC ALGORITHM SPECIFICS
The coding scheme, fitness function, and each of the three genetic operators (reproduction, crossover, and mutation) can be implemented in a variety of ways depending on the problem to which a genetic algorithm is being applied. Various implementation alternatives of these for the data mining problem examined in this chapter are discussed below.
Parameter Coding
A main difference between genetic algorithms and more traditional optimization and search algorithms is that genetic algorithms work with a coding of the parameter set and not the parameters themselves. Thus, before any type of genetic search can be performed, a coding scheme must be determined to represent the parameters in the problem at hand. In the data mining problem addressed by this chapter, the parameters of interest are simply four, potentially related, item numbers. Therefore, a coding scheme for four item numbers was determined considering the following factors:
A multi-parameter coding, consisting of four sub-strings, is required
to code each of the four items into a single string.
Each sub-string needs to represent one of a hundred (0 through 99) possible
items.
There are 100 choose 4 (3,921,225) possible combinations of items that
need to be represented.
Given these factors, several coding schemes were considered. First, a scheme
using a base-10 coded sub-string was examined. This coding allows 100 items
to be represented, and no codings correspond to non-existent items (all possible
sub-string codings would represent valid item numbers, 0 through 99). This coding
results in an 8-digit string where each digit (dx) is between 0 and 9, as follows:
A coding of this sort, however, limits the number of schemata that are available for the genetic algorithm to exploit. Therefore, since we want to maximize the number of schemata, a binary coding, as shown below was considered.
Here, each digit (bx) is a 1 or 0. One problem with this coding is that, in order to represent 100 items, a sub-string must consist of seven bits (27 = 128). This means that 284 (128-100=28 values in each sub-string), or 614,656 strings could be manipulated by the genetic algorithm which dont correspond to actual solutions.
Another possibility, when considering the fact that there are 3,921,225 combinations of items that need to be represented, was to code the items as a 22-bit binary string. This corresponds to 222, or 4,194,304 possible solution representations, reducing the number of non-solution strings to 273,079 (4,194,304-3,921,225). Using this approach, however, requires a mapping from a combination number to the actual four items represented by that combination since the item numbers are not specifically embedded within the string. If this combination-mapping approach was used, it would be difficult for the genetic algorithm to exploit similarities between strings since two combination numbers would not necessarily have anything in common, even if the items represented by the combination numbers were similar. Therefore, this coding approach was deemed unacceptable.
Given these possibilities, and considering the implementation of the crossover operator which is discussed in the section, Reproduction, Crossover, and Mutation Operators, the decimal coding appeared to be the best alternative and was therefore implemented for the data mining problem.
Fitness function
If a string of four items is coded as just described, the fitness for such a string can be determined based on the frequency with which the set of four items appear within the transactions contained in the database under investigation. In addition, if a fractional number of the items in the item list appear within a transaction, the fitness value will also reflect that fraction. For example, if the item list for which a fitness value is to be determined is as follows:
and if the database, for the purpose of this discussion, contains only two transactions, as follows:
transaction 1: 34 22 55 01 68 99 02 07 42 24
transaction 2: 76 01 86 99 02 07 42 24
then transaction 1 contains all four items (01, 22, 68, and 07) and transaction 2 contains only two of the four items (01 and 07). If I(ti) represents the fraction of items that are contained within a single transaction ti, then I(t1) would equal 1.0 and I(t2) would equal 0.5 (2/4). If all the I(ti)s are summed over all transactions, and this sum is divided by the total number of transaction, N, then a relationship (referred to as the base fitness) between the item set and the frequency with which the items appear within the database can be obtained. Thus, the base fitness value for the item combination of 01, 22, 68, and 07 for the example database of two transactions would be:
This yields a base fitness function that can be represented as follows:
Assume:
N = total number of transactions in the database file (sdg.db)
ti = transaction number
I(ti) = fraction of items that are contained in ti
M = sum of all I(ti) over all transactions
then M is:
and the fitness, referred to as F1, for a given string of items is:
Furthermore, in an effort to avoid possible premature convergence, and to promote competition between strings throughout a simulation, a scaled fitness, F2, was implemented and tested. F2 is expressed as,
where F2 is the scaled fitness, F1 is the raw fitness, and a and b are determined based on the maximum and average raw fitness values. In the case of the data mining problem, the maximum possible fitness (if every transaction in the database contained all four items of interest) was 1.0 and the average raw fitness was found to be around 0.4.
In a third alternative, the fitness function was based on the following additional assumptions: Since the goal is to determine the four items that are most often purchased together, the fitness function should yield a higher value if more of the items in the item set are contained together within the database transactions. For example, if one item set yielded a base fitness value of 0.25 (implying that, on average, one of the four items in the item set were contained in each of the database transactions) and a second item set yielded a base fitness value of 0.75 (implying that, on average, three of the four items in the item set were contained in each of the database transactions), it may be desirable to give more than a simple linear emphasis to the second item set value since it is closer to the goal of four. Thus, a third fitness, referred to as F3, is expressed as
where x was chosen to be 3.
While this type of fitness expression gives more emphasis to item sets that are closer to the four item goal, it must be realized that a single occurrence of four items purchased together within a large database will not yield a higher fitness than three of the items purchased together many times. Since we are seeking trends within a large set of database transactions, it was felt to be more important to recognize major trends (such as three items purchased together many times) which might lead to a trend of four items purchased together.
The results of simulations run with each of the fitness expressions just described (F1, F2, and F3) are discussed in the section, Fitness Function Simulation Results. Those results show that fitness function F3 provided very good performance, better than both F1 and F2.
Reproduction, Crossover, and Mutation Operators
The reproduction, crossover, and mutation operators that were implemented to support the eight-digit decimal coding are discussed next.
Reproduction Operator
A basic roulette wheel reproduction operator was implemented for this data
mining application. In this type of reproduction, each string in the population
is given a roulette wheel slot sized in proportion to its fitness. Then, by
spinning the wheel N times (the population size), N new offspring
are created for the next generation. By having a weighted, or biased
roulette wheel, it is more probable that higher fit strings receive more copies
in subsequent generations.
Crossover Operator
Crossover is used to improve the population fitness by introducing new strings into the population. Thus, the crossover operator for this problem had to introduce new item combination strings into the population. Due to the fact that the sub-strings (item numbers) within a coded string must be preserved, possible cross sites exist at the item boundaries within a string, as shown by points A, B, and C below:
In determining a crossover operator for this application, it was important to consider the fact that duplicate item numbers cannot exist within a string since this would represent an invalid combination of items. For example, if the following two strings were crossed using simple crossover and a cross site as noted by A,
strings 1' and 2' would result, as shown:
Note that the resulting string 1' only contains three different item numbers (04, 21, and 24) instead of four.
Given this consideration, it was obvious that a simple single point crossover operator would not be adequate. Crossover operators such as partially matched crossover, or PMX [7], and ordered crossover [7] ensure that duplicates do not occur in children strings. However, these crossover operators require that each parent string contain the same characters. The situation in the data mining problem, however, is slightly different in that every item in the first string may, or may not, be in the second string. Considering this fact, two different crossover operators were developed for examination: aligned single-point crossover (ASPX) and unmatched crossover with single child offspring (UXSCO), each of which is described below. The results of simulations run with each of these crossover operators are discussed in the section, Crossover Operator Simulation Results. Those results show that the ASPX crossover operator provided very good performance, better than the UXSCO operator.
Aligned Single-Point Crossover (ASPX)
Recall that the main goal of our crossover operator is that it produces children strings which do not contain duplicate item numbers. This goal can be achieved with aligned single point crossover as follows: Two parent strings, such as those shown below, are chosen at random from the current population.
Next, since the order of the item numbers within a string is not important, a string can be rearranged without affecting its fitness value. Thus, the item numbers in string 2 are rearranged such that each item number that has a corresponding match in string 1 is aligned with that item number. In this example, the 10 and 20 in string 2 are matched in string 1 and would thus be aligned, as follows:
Now, since all matching item numbers are aligned between the two strings, basic single-point crossover can be performed and guarantees that no duplicates occur in either child string. For example, if a cross site is chosen at point A,
the following children strings will result and no duplicates will be present.
Unmatched Crossover with Single Child Offspring (UXSCO)
As with aligned single-point crossover, unmatched crossover with single child offspring operates with the main goal of producing children strings which do not contain duplicate item numbers. In unmatched crossover with single child offspring, this is achieved as follows: Again, two parent strings are chosen, at random, from the current population. Next, it is determined if there are any matching items between the two strings. As part of this determination, the item numbers that are matched between the two strings are saved. In addition, the item numbers in string 2 that do not match any items in string 1 are saved. Next, a random number, X, between one and the number of items that are unmatched is generated. If there are zero unmatched items (the parent strings are identical) then X will be zero and no crossover will take place. If there are greater than zero unmatched items (the parent strings are not identical), then X unmatched item numbers from parent 2 are copied to random locations in child 1. Following the copy of X items to child 1, child 2 will be equal to parent 2 (there is a single new offspring, child 1). The following example demonstrates the operation of unmatched crossover with single child offspring:
Assume the following two strings, which possess two matched items (10 and 20) and two unmatched items, were chosen:
Next, assume that choosing a random number between 1 and the number of unmatched items (two) results in X equal to two. Then, two unmatched items from string 2 (88 and 99) are copied to two random locations in string 1. Assuming the random locations are one and three, the following children strings will result:
By performing crossover in this manner, it is guaranteed that no duplicate item numbers will appear in a string.
Crossover Operator Support of Coding Scheme
Both the ASPX and UXSCO operators supported the choice of a decimal coding over a binary coding. Since the goal of crossover is to create strings with new item number combinations, there would be no advantage to allowing crossover to occur at sites other than between item number boundaries. Given this, there would be no advantage to having a string coded in binary form since no additional useful schemata would be processed. Using a binary coding, more schemata would be available for processing by the genetic algorithm, however, the additional available schemata would represent strings that are not possible solutions to the problem.
Mutation Operator
With the use of a base-10 coding as previously described, an alternate method of mutation was required. In a typical binary coding, mutation is simple in that if it is determined that mutation should take place, a 1 is mutated to a 0, and a 0 is mutated to a 1. In a base-10 coding, however, a method to mutate a decimal value between 0 and 99 must be determined. The following alternatives were considered:
1) generate a random number between 0 and 99 (random mutation)
2) generate a new number within some ±d of the number being mutated (window
mutation)