Artificial Intelligence 151 (2003) 155–176
www.elsevier.com/locate/artint

Consistency-based search in feature selection

Manoranjan Dash a,∗, Huan Liu b

a Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Rd, Evanston, IL 60208-3118, USA
b Department of Computer Science and Engineering, Arizona State University, PO Box 875406, Tempe, AZ 85287-5406, USA

Received 7 March 2002; received in revised form 27 March 2003

Abstract

Feature selection is an effective technique in dealing with dimensionality reduction. For classification, it is used to find an "optimal" subset of relevant features such that the overall accuracy of classification is increased while the data size is reduced and the comprehensibility is improved. Feature selection methods contain two important aspects: evaluation of a candidate feature subset and search through the feature space. Existing algorithms adopt various measures to evaluate the goodness of feature subsets. This work focuses on the inconsistency measure, according to which a feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels. We compare the inconsistency measure with other measures and study different search strategies, such as exhaustive, complete, heuristic and random search, that can be applied to this measure. We conduct an empirical study to examine the pros and cons of these search methods, give some guidelines on choosing a search method, and compare the classifier error rates before and after feature selection.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Classification; Feature selection; Evaluation measures; Search strategies; Random search; Branch and bound

1. Introduction

The basic problem of classification is to classify a given instance (or example) into one of m known classes. A set of features presumably contains enough information to distinguish among the classes. When a classification problem is defined by features, the number of

* Corresponding author.
E-mail address: [email protected] (M. Dash).

0004-3702/$ – see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0004-3702(03)00079-1

features can be quite large, and many of them can be irrelevant or redundant. A relevant feature is defined in [5] as one whose removal deteriorates the performance or accuracy of the classifier; an irrelevant or redundant feature is one that is not relevant. Because irrelevant information is cached inside the totality of the features, these irrelevant features could deteriorate the performance of a classifier that uses all features [6]. The fundamental function of a feature selector is to extract the most useful information from the data and to reduce the dimensionality in such a way that the most significant aspects of the data are represented by the selected features [14]. Its motivation is three-fold: simplifying the classifier by retaining only the relevant features; improving or not significantly reducing the accuracy of the classifier; and reducing the dimensionality of the data, thus reducing the size of the data. The last point is particularly relevant when a classifier is unable to handle large volumes of data. Research on feature selection has been conducted for the last several decades and is still in focus. Reviews and books on feature selection can be found in [11,26,27]. Recent papers such as [2,9,16,17,22,43] address some of the existing issues of feature selection.

In order to select relevant features one needs to measure the goodness of selected features using a selection criterion. Class separability is often used as one of the basic selection criteria, i.e., when a set of features maximizes the class separability, it is considered well suited for classification [37]. From a statistics viewpoint, five different measurements for class separability are analyzed in [14]: error probability, inter-class distance, probabilistic distance, probabilistic dependence and entropy.
Information-theoretic considerations [45] suggested something similar: a good feature of discrimination provides compact descriptions of each class, and these descriptions are maximally distinct. Geometrically, this constraint can be interpreted to mean that (i) such a feature takes on nearly identical values for all examples of the same class, and (ii) it takes on some different values for all examples of the other class. In this work, we use a selection criterion called the consistency measure that does not attempt to maximize the class separability but tries to retain the discriminating power of the data defined by the original features. Using this measure, feature selection is formalized as finding the smallest set of features that can distinguish classes as if with the full set. In other words, if S is a consistent set of features, no two instances with the same values on S have different class labels [1].

Another aspect of feature selection is the study of search strategies, to which extensive research efforts have been devoted [5,11,41]. The search process starts with either an empty set or a full set. For the former, it expands the search space by adding one feature at a time (Forward Selection)—an example is Focus [1]; for the latter, it shrinks the search space by deleting one feature at a time (Backward Selection)—an example is 'Branch & Bound' [34]. Apart from these two starting points, the search process can start from a random subset and continue either probabilistically (LVF [30]) or deterministically (QBB [12]) from there.

The contributions of this paper include:

• a detailed study of the consistency measure vis-à-vis other evaluation measures;
• pros and cons of various search strategies (exhaustive, complete, heuristic, and probabilistic) based on the consistency measure;
• experimental comparison of different search methods; and
• guidelines on choosing a search method.

In the rest of the paper, we describe the feature selection process in Section 2. Section 3 discusses the consistency measure and compares it with other measures. Section 4 describes different search strategies for the consistency measure and their pros and cons. Section 5 shows some experimental results and Section 6 concludes the paper.

2. Feature selection process

In the rest of the paper we use the following notation. P is the total number of instances, N denotes the total number of features, M stands for the number of relevant/selected features, S denotes a subset of features, f1, . . . , fM are the M features, m denotes the number of different class labels, and C stands for the class variable.

Ideally, feature selection methods search through the subsets of features and try to find the best subset among the competing 2^N candidate subsets according to some evaluation measure. But this procedure is exhaustive as it tries to find only the best one, and it may be too costly and practically prohibitive even for a medium-sized N. Other methods based on heuristic or random search attempt to reduce computational complexity by compromising optimality. These methods need a stopping criterion to prevent an exhaustive search of subsets. There are four basic steps in a typical feature selection method (see Fig. 1):

(1) a generation procedure to generate the next candidate subset for evaluation,
(2) an evaluation function to evaluate the candidate subset,
(3) a stopping criterion to decide when to stop, and
(4) a validation procedure to check whether the subset is valid.

The generation procedure uses a search strategy to generate subsets of features for evaluation. It starts (i) with no features, (ii) with all features, or (iii) with a random subset of features. In the first two cases features are iteratively added/removed, whereas in the last

Fig. 1. Feature selection process with validation.

case, features are either iteratively added/removed or produced randomly thereafter during the search.

An evaluation function measures the goodness of a subset produced by some generation procedure, and this value is compared with the previous best. If it is found to be better, it replaces the previous best subset. An optimal subset is always relative to a certain evaluation function (i.e., an optimal subset chosen using one evaluation function may not be the same as that chosen using another).

Without a suitable stopping criterion the feature selection process may run unnecessarily long or possibly forever, depending on the search strategy. Generation procedures and evaluation functions can influence the choice of a stopping criterion. Examples of stopping criteria based on a generation procedure include: (i) whether a predefined number of features has been selected, and (ii) whether a predefined number of iterations has been reached. Examples of stopping criteria based on an evaluation function include: (i) whether further addition (or deletion) of any feature produces a better subset, and (ii) whether an optimal subset (according to some evaluation function) has been obtained. The feature selection process halts by outputting the selected subset of features, which is then validated.

There are many variations of this feature selection process, but the basic steps of generation, evaluation and stopping are present in almost all methods.

The validation procedure is not a part of the feature selection process itself. It tries to test the validity of the selected subset by comparing the results with previously established results, or with the results of competing feature selection methods, using artificial and/or real-world datasets.

3. Consistency measure

3.1. The measure

The suggested measure U is an inconsistency rate over the dataset for a given feature set.
In the following description, a pattern is the part of an instance without the class label, i.e., a set of values of a feature subset. For a feature subset S with nf1, nf2, . . . , nf|S| distinct values for features f1, f2, . . . , f|S| respectively, there are at most nf1 ∗ nf2 ∗ · · · ∗ nf|S| patterns.

Definition. The consistency measure is defined by the inconsistency rate, which is calculated as follows.

(1) A pattern is considered inconsistent if there exist at least two instances that match in all features but their class labels; for example, an inconsistency is caused by instances (0 1, 1) and (0 1, 0), where the two features take the same values in both instances while the class attribute (the last value in each instance) varies.
(2) The inconsistency count for a pattern of a feature subset is the number of times it appears in the data minus the largest count among the different class labels. For example, assume that for a feature subset S a pattern p appears in np instances, out of which c1 instances have class label 1, c2 have label 2, and c3 have label 3, where c1 + c2 + c3 = np.

If c3 is the largest among the three, the inconsistency count is (np − c3). Notice that the sum of all np's over the different patterns p that appear in the data for the feature subset S is the total number of instances P in the dataset, i.e., Σ_p np = P.
(3) The inconsistency rate of a feature subset S, IR(S), is the sum of the inconsistency counts over all patterns of the feature subset that appear in the data, divided by P.

The consistency measure is applied to the feature selection task as follows. Given a candidate feature subset S we calculate its inconsistency rate IR(S). If IR(S) ≤ δ, where δ is a user-given inconsistency rate threshold, the subset S is said to be consistent. Notice that this definition is an extension of the earlier definition used in [1]: the new definition tolerates a given threshold error rate. This suits the characteristics of the consistency measure because real-world data is usually noisy, and if δ is set to 0% it may happen that no feature subset can satisfy the stringent condition.

By employing a hashing mechanism, we can compute the inconsistency rate approximately with a time complexity of O(P) [30]. By definition, the consistency measure works when the data has discrete-valued features. Any continuous feature should first be discretized using some discretization method available in the literature [25].

In the rest of the paper we use consistency measure and inconsistency rate interchangeably.

3.2. Other evaluation measures

In the following we briefly introduce different evaluation measures found in the literature. Typically, an evaluation function tries to measure the discriminating ability of a feature or a subset to distinguish the different class labels. Blum and Langley [5] grouped different feature selection methods into two broad groups (i.e., filter and wrapper) based on their dependence on an inductive algorithm (classifier) that will finally use the selected subset.
By their definition, filter methods are independent of an inductive algorithm, whereas wrapper methods use an inductive algorithm as the evaluation function. Ben-Bassat [3] grouped the evaluation functions up to 1982 into three categories: information or uncertainty measures, distance measures, and dependence measures. He did not consider the classification error rate as an evaluation function. Considering these divisions and later developments, we divide the evaluation functions into five categories: distance, information (or uncertainty), dependence, consistency, and classifier error rate. In the following, we briefly discuss each of them (see [11] for more details).

(1) Distance measures. These are also known as separability, divergence, or discrimination measures. For a two-class problem, a feature fi is preferred to another feature fj if fi induces a greater difference between the two class-conditional probabilities than fj; if the difference is zero then fi and fj are indistinguishable. Distance measures are employed in [20,24,34,39].
(2) Information measures. These measures typically determine the information gain from a feature. The information gain from a feature fi is defined as the difference between the prior uncertainty and the expected posterior uncertainty using fi. Feature fi is preferred to feature fj if the information gain from feature fi is greater than that from

feature fj [3]. An example of this type is entropy. Information measures are employed in [2,8,23,40].
(3) Dependence measures. Dependence measures, or correlation measures, quantify the ability to predict the value of one variable from the value of another. The correlation coefficient is a classical dependence measure and can be used to find the correlation between a feature and the class variable. If the correlation of feature fi with class variable C is higher than that of feature fj with C, then feature fi is preferred to fj. A slight variation of this is to determine the dependence of a feature on other features; this value indicates the degree of redundancy of the feature. All evaluation functions based on dependence measures could be classified under distance and information measures, but they are still kept as a separate category because, conceptually, they represent a different viewpoint [3]. Dependence measures are employed in [31,33].
(4) Consistency measures. This type of evaluation measure is characteristically different from the other measures because of its heavy reliance on the training dataset and its use of the Min-Features bias in selecting a subset of features [1]. The Min-Features bias prefers consistent hypotheses definable over as few features as possible. These measures find the minimal-size subset that satisfies an acceptable inconsistency rate, which is usually set by the user. Consistency measures are employed in [1,30,38]. The above types of evaluation measures are known as "filter" methods because of their independence from any particular classifier that may use the selected features output by the feature selection method.
(5) Classifier error rate measures. In contrast to the above filter methods, classifier error rate measures are called "wrapper" methods, i.e., a classifier is used for evaluating feature subsets [22].
As the features are selected using the classifier that later uses these selected features in predicting the class labels of unseen instances, the accuracy level is very high, although the computational cost is rather high compared to other measures. Classifier error rate measures are employed in [14,15,18,29,32,42,44].

3.3. Consistency measure vis-à-vis other measures

The consistency measure differs from the other types of measures in the following ways [13].

• The consistency measure is monotonic while most others are not. Assuming we have subsets {S0, S1, . . . , Sn} of features and a measure U that evaluates each subset Si, the monotonicity condition requires the following:

  S0 ⊃ S1 ⊃ · · · ⊃ Sn ⇒ U(S0) ≤ U(S1) ≤ · · · ≤ U(Sn).

Theorem 1. The consistency measure is monotonic.

Proof. A proof outline is given to show that the inconsistency rate measure is monotonic, i.e., if Si ⊂ Sj, then U(Si) ≥ U(Sj). Since Si ⊂ Sj, the discriminating power of Si can be no greater than that of Sj. It is known that the discriminating power is inversely proportional to the inconsistency rate. Hence, the inconsistency rate of Si is greater than or

equal to that of Sj, i.e., U(Si) ≥ U(Sj). The monotonicity of the measure can also be proved as follows. Without loss of generality, consider the three simplest cases for Sk (= Sj − Si): (i) features in Sk are irrelevant, (ii) features in Sk are redundant, and (iii) features in Sk are relevant. If the features in Sk are irrelevant then, by the definition of irrelevancy, these extra features do not change the inconsistency rate of Sj since Sj is Si ∪ Sk, so U(Sj) = U(Si). Likewise for case (ii), by the definition of redundancy. If the features in Sk are relevant, then Si does not have as many relevant features as Sj, and obviously U(Si) ≥ U(Sj) in the case of Si ⊂ Sj. It is clear that these results remain true when Sk contains a mixture of irrelevant, redundant and relevant features. ✷

• For the consistency measure, a feature subset can be evaluated in O(P) time. Evaluation is usually costlier for other measures; for example, constructing a decision tree in order to obtain predictive accuracy requires at least O(P log P) time.
• The consistency measure can help remove both redundant and irrelevant features; other measures may not do so. For example, Relief [20] fails to detect redundant features.
• The consistency measure is capable of handling some noise in the data, reflected as a percentage of inconsistencies.
• Unlike the commonly used univariate measures (e.g., distance, information, and dependence measures), this is a multivariate measure that checks a subset of features at a time.

In summary, the consistency measure is monotonic, fast, multivariate, able to remove redundant and/or irrelevant features, and capable of handling some noise. As with other filter measures, it is not clear that it also optimizes the accuracy of a classifier trained on the data after feature selection. In the next section we describe different search strategies for the consistency measure.
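The measure and its monotonicity can be made concrete with a short sketch. The following Python fragment is our own illustration, not the authors' code (all names are ours): it computes IR(S) with the O(P) hashing scheme described in Section 3.1 and checks the monotonicity condition on a toy dataset.

```python
from collections import Counter, defaultdict

def inconsistency_rate(data, labels, subset):
    """IR(S): hash each pattern (the projection of an instance onto the
    subset) to its per-class counts; a pattern's inconsistency count is its
    number of appearances minus the largest per-class count. O(P) overall."""
    counts = defaultdict(Counter)
    for row, label in zip(data, labels):
        counts[tuple(row[f] for f in subset)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return bad / len(data)

# toy data: feature 0 determines the class except for one noisy instance,
# feature 1 is pure noise
data = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0]]
labels = [0, 0, 1, 1, 1]

# monotonicity: along the chain {} ⊂ {f0} ⊂ {f0, f1}, IR never increases
chain = [[], [0], [0, 1]]
rates = [inconsistency_rate(data, labels, s) for s in chain]
assert all(r1 >= r2 for r1, r2 in zip(rates, rates[1:]))
print(rates)  # [0.4, 0.2, 0.2]
```

Adding the noisy feature f1 leaves the rate unchanged, while dropping the relevant feature f0 raises it, matching the chain condition above.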
4. Different search strategies

Search techniques are important because an exhaustive search for the "optimal" subset is impractical for even moderate N. In this section we study and compare different search strategies for the consistency measure.

Five different algorithms represent the standard search strategies: exhaustive—Focus [1], complete—ABB [28], heuristic—SetCover [10], probabilistic—LVF [30], and a hybrid of complete and probabilistic search—QBB [12]. In the rest of this section we examine their advantages and disadvantages, and at the end we give some guidelines about which search strategy to use in different situations.

4.1. Focus: Exhaustive search

Focus [1] is one of the earliest feature selection algorithms in machine learning. Focus starts with an empty set and carries out breadth-first search until it finds a minimal subset that predicts pure classes. If the full set has three features, the root is (0 0 0); its children are (0 0 1), (0 1 0), and (1 0 0), where a '0' means absence of the respective feature and a '1' means

its presence in the feature subset. It is exhaustive in nature and originally works on binary, noise-free data. With a simple modification of Focus, we obtain FocusM, which can work on non-binary data with noise by applying the inconsistency rate (defined earlier) in place of the original consistency measure.

Below we give the FocusM algorithm, where the function inConCal( ) calculates the inconsistency rate of a given feature set S for a given dataset D.

Algorithm. FocusM
Input: Data D, feature set S
Output: Consistent Feature Subset
1. δ = inConCal(S, D)
2. for size = 0 to |S|
3.   for all subsets S′ with |S′| = size
4.     if inConCal(S′, D) ≤ δ
5.       return S′

As FocusM is an exhaustive search, it guarantees an optimal solution. However, a brief analysis tells us that FocusM's time performance can deteriorate fast with increasing M. This issue is directly related to the size of the search space, which in turn is closely related to the number of relevant features. For instance, taking a simple example of four features: if one out of the four features is relevant, the search space is C(4, 1) = 4 subsets; and if all 4 features together are needed to satisfy consistency, the search space is Σ_{i=1}^{4} C(4, i) = 15 subsets. In general, the smaller the number of relevant features M, the smaller the search space of FocusM and the higher its efficiency. But for data with small N − M, FocusM is inefficient, and one requires more efficient techniques. The following is such a technique.

4.2. ABB: Complete search

Branch & Bound for feature selection was first proposed in [34]. In contrast to Focus, it starts with the full set of features and removes one feature at a time. If the full set contains three features, the root is (1 1 1), where '1' means presence of the corresponding feature and '0' its absence. Its child nodes are (1 1 0), (1 0 1), and (0 1 1), etc. When there is no restriction on expanding nodes in the search space, this could lead to an exhaustive search.
However, if each node is evaluated by a measure U and an upper limit is set for the acceptable values of U, then Branch & Bound backtracks whenever an infeasible node is discovered. If U is monotonic, no feasible node is omitted as a result of early backtracking; therefore, the savings in search time do not sacrifice the optimality of the selected subset, and hence it is a non-exhaustive yet complete search. As was pointed out in [41], the measures used in [34], such as accuracy of classification, have disadvantages (e.g., non-monotonicity). As a remedy, the authors proposed the concept of approximate monotonicity, which means that branch and bound can continue even after encountering a node that does not satisfy the bound; but the node that is finally accepted must satisfy the set bound. In ABB (Automatic Branch and Bound) [28], we proposed an automated Branch & Bound algorithm having its bound set to the inconsistency rate of the original feature set. The algorithm is given below.

Algorithm. ABB
Input: Data D, feature set S
Output: Consistent Feature Subset
1. δ = inConCal(S, D)
2. T = S
/* subset generation */
3. for all features f in S
4.   Sj = S − f /* remove one feature at a time */
5.   j++
6. for all Sj
7.   if (Sj is legitimate ∧ inConCal(Sj, D) ≤ δ)
8.     if inConCal(Sj, D) < inConCal(T, D)
9.       T = Sj
10.    ABB(Sj, D) /* recursion */
11. return T

It starts with the full set of features S0 and removes one feature in turn from each subset S_j^{l−1} to generate the subsets S_j^l, where l is the current level and j indexes the different subsets at the lth level. If U(S_j^l) > U(S_j^{l−1}), then S_j^l stops growing (its branch is pruned); otherwise, it grows to level l + 1, i.e., one more feature may be removed.

The legitimacy test is based on whether a node (subset) is a child node of a pruned node. A node is illegitimate if it is a child of a pruned node (which has already been found infeasible). Each node is represented by a binary vector where 1's stand for the presence of a particular feature in that subset and 0's for its absence. The test is done by checking the Hamming distance between the node under consideration and the pruned nodes: if the Hamming distance to any pruned node is 1 (i.e., the two representative binary vectors differ in one position), the node is a child of that pruned node. Notice that in this way, at every level, we are able to determine all the illegitimate nodes.

Example. Referring to Fig. 2, there are four features, of which only the first two (underlined) are relevant. The root S0 = (1 1 1 1) of the search tree is a binary array with four '1's. Following ABB, we expand the root into four child nodes by turning one of the four '1's into a '0' (L2). All four are legitimate: S1 = (1 1 1 0), S2 = (1 1 0 1), S3 = (1 0 1 1), and S4 = (0 1 1 1). Since one of the relevant features is missing, U(S3) and U(S4) will be greater than U(S0), where U is the inconsistency rate over the given data.
Hence, the branches rooted at S3 and S4 are pruned and will not grow further. Only when a new node passes the legitimacy test is its inconsistency rate calculated. Doing so improves the efficiency of ABB because P (the number of instances) is normally much larger than N (the number of features). The rest of the nodes are generated and tested in the same way.

Since inconsistency is a monotonic measure, ABB guarantees an optimal solution. However, a brief analysis suggests that the time performance of ABB can deteriorate as the difference N − M increases. This issue is related to how many nodes (subsets) are generated. The search space of ABB is closely related to the number of relevant features.

Fig. 2.

For instance, taking a simple example of four features: if one out of the four features is relevant, the search space is 15 nodes, which is almost exhaustive (the total is Σ_{i=0}^{4} C(4, i) = 16 nodes). And when all four features are relevant, it is 5 nodes, i.e., the root plus its 4 child nodes. In general, the larger the number of relevant features, the smaller the search space, due to early pruning of the illegitimate nodes. See [28] for detailed results (the number of subsets evaluated, the number of subsets pruned, etc.) of ABB over a number of benchmark datasets.

The analysis of Focus and ABB shows the following:

• Focus is efficient when M is small, and
• ABB is efficient when N − M is small.

In other cases, the two algorithms can be inefficient. An immediate thought is whether we can use the inconsistency measure in heuristic search and how it fares.

4.3. SetCover: Heuristic search

SetCover [10] exploits the observation that the problem of finding the smallest set of consistent features is equivalent to 'covering' each pair of examples that have different class labels. Two instances with different class labels are said to be 'covered' when there exists at least one feature that takes different values for the two instances [35]. This enables us to apply Johnson's algorithm [19] for set cover to this problem. Johnson's analysis shows that the size of the resulting consistent feature subset is O(M log P). However, experimental results show that this estimate is in fact much larger than the obtained results, i.e., the size of the consistent subsets found in the experiments is much less than O(M log P).

The algorithm works as follows. The consistency criterion can be restated by saying that a feature set S is consistent if, for any pair of instances with different class labels, there is a feature in S that takes different values.
Thus, including a feature f in S 'covers' all those example pairs with different class labels on which f takes different values. Once all pairs are 'covered', the resulting set S is consistent.
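The covering formulation can be sketched directly. The fragment below is an illustrative sketch under our own naming, not the authors' implementation: it enumerates the example pairs with different class labels and greedily picks, Johnson-style, the feature that covers the most still-uncovered pairs.

```python
from itertools import combinations

def pairs_covered(data, labels, feature):
    """Example pairs with different class labels on which `feature` takes
    different values (the 'covering' relation from the text)."""
    return {(i, j) for i, j in combinations(range(len(data)), 2)
            if labels[i] != labels[j] and data[i][feature] != data[j][feature]}

def johnson_set_cover(data, labels):
    """Greedy set cover over example pairs: repeatedly pick the feature
    covering the most uncovered pairs until every pair with different
    class labels is covered (i.e., the selected subset is consistent)."""
    n = len(data[0])
    uncovered = {(i, j) for i, j in combinations(range(len(data)), 2)
                 if labels[i] != labels[j]}
    selected = []
    while uncovered:
        best = max(range(n),
                   key=lambda f: len(pairs_covered(data, labels, f) & uncovered))
        newly = pairs_covered(data, labels, best) & uncovered
        if not newly:     # remaining pairs are uncoverable (noisy data)
            break
        selected.append(best)
        uncovered -= newly
    return selected

# feature 0 alone separates the classes; features 1-2 are noise
data = [[0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
labels = [0, 0, 1, 1]
print(johnson_set_cover(data, labels))  # [0]
```

On this toy dataset the loop terminates after one pick, since f0 covers all four differently-labelled pairs; the `break` guards against noisy data in which some pairs cannot be covered by any feature.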

Algorithm. SetCover
Input: Data D, feature set S
Output: Consistent Feature Subset
1. δ = inConCal(S, D)
2. S′ = φ
3. repeat
4.   lowestInCon = C /* C is a large constant */
5.   for all features f ∈ S
6.     S′ = append(S′, f)
7.     tempInCon = inConCal(S′, D)
8.     if tempInCon ≤ δ
9.       return S′
10.    else
11.      if tempInCon < lowestInCon
12.        lowestInCon = tempInCon
13.        lowestFeature = f
14.    S′ = S′ − f /* undo the trial addition */
15.  S′ = append(S′, lowestFeature); S = S − lowestFeature
16. until lowestInCon ≤ δ

In [10] we report extensive experimental results which show that SetCover is fast, close to optimal, and deterministic.1 It works well for datasets whose features are rather independent of each other. It may, however, have problems when features are correlated, because it selects the best feature in each iteration based on the number of instance pairs covered: any feature that is most correlated with the class label is selected first. An example is the CorrAL dataset [21], which has 64 instances with six binary features and two classes. Feature f5 is irrelevant to the target concept, which is (f1 ∧ f2) ∨ (f3 ∧ f4). Feature f6 is correlated with the target concept 75% of the time. SetCover first selects feature f6 because, among all six features, f6 covers the maximum number of instance pairs (75%). Then it selects features f1, f2, f3, and f4; so overall it selects the wrong subset (f6, f1, f2, f3, f4).

So we found that exhaustive methods have an inherent drawback: they require large computational time. Heuristic methods such as SetCover, although very fast and accurate, can encounter problems if the data has highly correlated features. Hence, a new solution is needed that avoids the problems of exhaustive and heuristic search. Probabilistic search is a natural choice.

1 In the literature on search techniques there are many methods, such as genetic algorithms, simulated annealing, tabu search and estimation of distribution algorithms.
Our choice of SetCover is mainly guided by the requirement that a heuristic algorithm must be fast to complete and should return good results. So, although the above-mentioned search techniques are known to return good subsets, they are not chosen here because of their known higher computational time requirements.
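As a complement to the pseudocode, SetCover's greedy loop can be rendered in a few lines of Python. This is our sketch, not the authors' code; the inconsistency-rate helper is inlined so the fragment is self-contained.

```python
from collections import Counter, defaultdict

def inconsistency_rate(data, labels, subset):
    """IR(S) via pattern hashing, as in Section 3.1."""
    counts = defaultdict(Counter)
    for row, label in zip(data, labels):
        counts[tuple(row[f] for f in subset)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in counts.values()) / len(data)

def set_cover(data, labels):
    """Greedy forward selection as in the SetCover pseudocode: each round
    tentatively adds every remaining feature, keeps the one giving the
    lowest inconsistency rate, and stops once the rate is within delta."""
    n = len(data[0])
    delta = inconsistency_rate(data, labels, range(n))
    selected, remaining = [], set(range(n))
    while remaining:
        best_f = min(remaining,
                     key=lambda f: inconsistency_rate(data, labels, selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
        if inconsistency_rate(data, labels, selected) <= delta:
            break
    return selected

data = [[0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
labels = [0, 0, 1, 1]
print(set_cover(data, labels))  # [0]
```

Each round costs O(N · P), so a run is O(N^2 · P) at worst, which matches the "very fast" characterization in the text for moderate N.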

4.4. LVF: Probabilistic search

Las Vegas algorithms [7] for feature subset selection can make probabilistic choices of subsets in search of an optimal set. Another similar type of algorithm is the Monte Carlo algorithm, in which it is often possible to reduce the error probability arbitrarily at the cost of a small increase in computing time [7]. In [29] we proposed a probabilistic algorithm called Las Vegas Filter (LVF) in which the probabilities of generating any subset are equal. LVF adopts the inconsistency rate as the evaluation measure. It generates feature subsets randomly with equal probability, and once a consistent feature subset is obtained that satisfies the threshold inconsistency rate (δ, which by default is set to the inconsistency rate of the original feature set), the size of generated subsets is pegged to the size of that subset, i.e., subsets of larger size are not evaluated anymore. This is based on the fact that the inconsistency rate is monotonic, i.e., a superset of a consistent feature set is also consistent. This guarantees that LVF outputs a sequence of consistent feature subsets of non-increasing size. LVF is fast in reducing the number of features in the early stages and can produce optimal solutions if computing resources permit. The LVF algorithm is given below.

Algorithm. LVF
Input: Data D, feature set S, MaxTries
Output: Consistent Feature Subset
1. δ = inConCal(S, D)
2. T = S
3. for j = 1 to MaxTries
4.   randomly choose a subset of features, Sj
5.   if |Sj| ≤ |T|
6.     if inConCal(Sj, D) ≤ δ
7.       if |Sj| < |T|
8.         T = Sj
9.         output Sj
10.      else
11.        append Sj to T
12.        output Sj as 'yet another solution'
13. return T

In the algorithm, 'yet another solution' means a solution of the same size as the most recently found solution. By appending solutions of equal size, the algorithm produces a list of equal-sized feature subsets at the end. A more sophisticated version of LVF is sampling without replacement.
The number of features in each sampling is determined by the size of the current smallest subset. This could speed up subset generation.

LVF performance. We conducted experiments to observe how the consistent feature subset size (M) drops as the number of randomly generated feature sets increases. A total of 10 benchmark datasets are taken from the UC Irvine machine learning repository [4].

Table 1
LVF performance: The size of consistent feature subsets decreases sharply after several initial runs

Dataset        P     N   M′   M   # Subsets evaluated   # Max subsets
LED-24       1200   24   12   5   230                   2^24
Lung           32   56   19   4   155                   2^56
Lymphography  148   18    8   6   215                   2^18
Mushroom     7125   22    8   4    22                   2^22
Par3+3         64   12    5   3    25                   2^12
Promoters     106   57   15   4   187                   2^57
Soybean        47   35   12   2    42                   2^35
Splice       3190   60   19   9   284                   2^60
Vote          435   16   13   8   215                   2^16
Zoo            74   16    9   5    25                   2^16

Table 1 describes the datasets, where P is the number of instances, N is the original number of features, M is the size of the smallest consistent subset, and M′ is the intermediate size of some consistent subsets output by LVF. The Par3+3 dataset is a modified version of the original Parity3 dataset. The target concept is the parity of the first three bits. It contains twelve features, among which the first three are relevant, the next three are irrelevant, and the other six are redundant (duplicates of the first six). Table 1 also shows the possible maximum number of subsets (i.e., 2^N) and the results (M′) after several runs of LVF (each run stands for an evaluation of a feature subset).

Analysis. The trend found in all the experiments is that M′ drops sharply from N in a small number of runs (as shown in Table 1). Afterwards, it takes quite a long time to further decrease M′. Two typical graphs for the Mushroom and Promoters datasets are shown in Fig. 3. Some analysis can confirm this finding. At the beginning, each feature subset has a probability of 1/2^N of being generated, and many subsets can satisfy the inconsistency criterion. As M′ decreases from N to M, fewer and fewer subsets can satisfy the criterion. However, the probability of a distinct set being generated is still 1/2^N. That explains why the curves have a sharp drop in the beginning and then become flat in Fig. 3.
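The LVF procedure can be sketched compactly. The following is a minimal illustration under our own simplified conventions (dataset as a list of tuples, a single best subset kept instead of the list of equal-sized solutions, and an `incon_rate` helper standing in for inConCal), not the authors' exact implementation:

```python
import random
from collections import Counter, defaultdict

def incon_rate(data, labels, subset):
    # Fraction of instances outside the majority class of their pattern group.
    groups = defaultdict(Counter)
    for row, y in zip(data, labels):
        groups[tuple(row[i] for i in subset)][y] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(data)

def lvf(data, labels, n_features, max_tries, seed=0):
    rng = random.Random(seed)
    # Threshold δ: inconsistency rate of the full feature set.
    delta = incon_rate(data, labels, range(n_features))
    best = tuple(range(n_features))
    for _ in range(max_tries):
        k = rng.randint(1, n_features)
        cand = tuple(sorted(rng.sample(range(n_features), k)))
        # Monotonicity: only strictly smaller consistent subsets can improve.
        if len(cand) < len(best) and incon_rate(data, labels, cand) <= delta:
            best = cand
    return best

# Toy data: class = feature 0 XOR feature 1; feature 2 is irrelevant.
data = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in data]
print(lvf(data, labels, 3, 200))  # finds the relevant pair (0, 1)
```

On this toy problem every single feature and every other pair leaves inconsistent instances, so only the relevant pair {0, 1} passes the threshold, mirroring the sharp early drop in subset size discussed above.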
The finding is that LVF reduces the number of features quickly during the initial stages; after that, LVF still searches in the same way (i.e., blindly), while the computing resource is spent on generating many subsets that are obviously not good. This analysis paves the way for the next algorithm, which is a hybrid of LVF and ABB.

4.5. QBB: Hybrid search

QBB is a hybrid of LVF and ABB. Analyses showed that ABB's performance is good when N − M is small, whereas LVF is good at reducing the size of the consistent feature subset very quickly in the early stages. QBB (Quick Branch and Bound) is a two-phase algorithm that runs LVF in the first phase and ABB in the second phase. In the following the QBB algorithm is given.

Fig. 3. LVF performance over the Mushroom and Promoters datasets: The size of the consistent feature subsets decreases sharply but flattens afterwards.

Algorithm. QBB
Input: Data D, feature set S, MaxTries
Output: Consistent Feature Subset
 1. δ = inConCal(S, D)
 2. T = S
 3. S = LVF(S, MaxTries, D) /* S now holds all consistent subsets output by LVF */
 4. MinSize = |T|
 5. for each subset Sj in S
 6.   Tj = ABB(Sj, D)
 7.   if MinSize > |Tj|
 8.     MinSize = |Tj|
 9.     T = Tj
10. return T

The rationale behind running LVF first is that a small number of runs of LVF from the beginning outputs consistent subsets of small size. Running ABB on these smaller subsets then makes it very focused on finding the smallest consistent subsets.

A performance issue. A key issue in QBB performance is the crossing point at which ABB takes over from LVF. We carried out experiments over the earlier reported datasets by assigning a fixed total number of runs (a run means evaluation of a feature subset) and varying the cross-over point. They showed that dividing the total runs equally between LVF and ABB is a robust solution and is most likely to yield the best results. Details of the experiments are reported in [12]. A simple analysis shows that if the crossing point is quite early, the consistent subsets output by LVF will be too large for ABB to perform well under a time constraint; but if the crossing point is close to the total number of runs, LVF returns very small subsets which ABB may not be able to reduce further.

4.6. Guidelines

Above we described five different search strategies for the consistency measure and discussed their pros and cons. In this section we provide guidelines for a user to select the best algorithm under particular circumstances.

Theoretical analysis and experimental results suggest the following. If M (the size of the set of relevant features) is small, FocusM should be chosen; however, if M is even moderately large, FocusM will take a long time.
If there are a small number of irrelevant and redundant features (i.e., N − M is small), ABB should be chosen; but ABB will take a long time for even moderately large N − M. Consider the dataset Par3+3, which has three relevant features out of twelve (see Table 1). In such cases FocusM is preferred over ABB, as M is very small. Similarly, consider the Vote dataset (see Table 1), which has N = 16 and M = 8. ABB, due to its early pruning of inconsistent subsets, evaluates only 301 subsets to output a subset of minimal size, whereas FocusM has to evaluate 39,967 subsets to output a subset of the minimal size. But for datasets with a large number of features (large N)

and a moderate to large number of relevant features (moderate to large M), FocusM and ABB should not be expected to terminate in realistic time. An example is the Letter dataset (taken from the UCI ML repository [4]), which has 20,000 instances (N = 16 and M = 11). Both FocusM and ABB take a very long time to find minimal-size feature subsets. Hence, in such cases one should resort to heuristic or probabilistic search for faster results. Although these algorithms may not guarantee minimal-size subsets, they are efficient in generating consistent subsets of size close to minimal in much less time. SetCover is heuristic, fast, and deterministic, but it may face problems for data having highly correlated features. An example is the CorrAL dataset, which has a feature correlated with the class variable 75% of the time (see Section 4.3 for discussion). Because of the correlated feature it outputs a wrong subset. LVF is probabilistic and not prone to the problem faced by SetCover, but slow to converge in later stages. As shown in Table 1 and Fig. 3, it reduces the feature subset size quickly in the beginning but slows down afterwards. An example is the Promoters dataset, which has N = 57 and M = 4. LVF needs to evaluate only 187 subsets to output a subset of size 15, but over the next tens of thousands of subsets its output subset size does not decrease further. QBB, by design, captures the best of LVF and ABB. As verified by experiments, it is reasonably fast (slower than SetCover), robust, and can handle features with high correlation. In summary,

(1) If time is not an issue, then FocusM and ABB are preferable because they ensure the smallest consistent subset. In addition, if the user has knowledge of M, then FocusM is preferable if M is small; otherwise ABB is chosen.
In the absence of any a priori knowledge about M, both can be run simultaneously until either one terminates.

(2) In the usual case of limited computing time, a user is best off choosing from LVF, SetCover and QBB. LVF is suitable for quickly producing small (yet not small enough) consistent subsets. SetCover is suitable if it is known that features are not correlated. Otherwise, QBB is a robust choice.

5. Further experiments

We carry out further experiments to examine (1) whether features selected using the consistency measure can achieve the objective of dimensionality reduction without sacrificing predictive accuracy, and (2) how the different algorithms fare in terms of time and size of consistent subsets. The experimental procedure is to run ABB to get the minimal size; compare the performance (average time and size of selected subsets) of the different algorithms; and compare the accuracy of two different classifiers (C4.5 [36] and a Backpropagation neural network [46]) before and after feature selection by QBB (which is the most robust among all methods). The ten benchmark datasets reported earlier in Table 1 are used for the experiments.

5.1. Performance comparison

Fig. 4 shows the average time and average number of selected features over the 10 datasets for the different algorithms. First, ABB is run over the 10 datasets to find the M (minimum

number of relevant features) values for each dataset. Their average is 5, which is denoted as Mavg. This value is used as a reference line (at the bottom of the graph) in Fig. 4. Of the five competing algorithms, FocusM, ABB and SetCover are deterministic, whereas LVF and QBB are non-deterministic due to their probabilistic nature. As per the findings of [12], QBB spends half of its time running LVF and the other half running ABB. For LVF and QBB we show results for five different processing times in terms of the total number of subsets evaluated (1000 ... 5000). Each experiment was repeated 50 times for each dataset. Notice that FocusM and ABB are not shown in the graph, as their average times fall outside the range of the 'processing time' on the X-axis of the graph, although minimal-size subsets are guaranteed in each case. For datasets having large differences between N and M values, such as Lung Cancer, Promoters, Soybean, and Splice, ABB takes a very long time (a number of hours) to terminate. For datasets having large N values and not very small M values, such as the Splice dataset (N = 60, M = 9), FocusM takes many hours to terminate. The comparison in Fig. 4 shows that QBB is more efficient in both average time and number of selected features compared to LVF, FocusM, and ABB.

Fig. 4. Comparison of different algorithms: Results of FocusM and ABB are out of bounds due to their very large processing time.

The average size

of the subsets produced by QBB is close to Mavg, and it approaches Mavg with time. SetCover produces near-optimal subsets in much less time. Between QBB and SetCover, we would say QBB is more robust, while SetCover, although very fast and accurate, may fail to deliver efficient subsets if some features are highly correlated.

5.2. Classification error

Predictive accuracy of a classifier is often used as a validation criterion. Classification error has a direct relationship with predictive accuracy, as the sum of predictive accuracy and error rate is 1. Among the different search strategies discussed in the paper, we choose QBB, due to its robustness, for the following experiments. We choose C4.5 (a widely used decision tree induction algorithm) [36] and a Backpropagation neural network [46] as two classifiers for validation.

For C4.5, we use the default settings: apply it to datasets before and after feature selection, and obtain the results of 10-fold cross-validation in Table 2. In the table we report average tree size and average error rate for each dataset after pruning. As QBB is non-deterministic, it is run 10 times for each dataset, i.e., 10 feature subsets are output by QBB for each dataset. The average for "before feature selection" is over the 10-fold cross-validation of C4.5, and the average for "after feature selection" is over the 10 chosen subsets of selected features, with 10-fold cross-validation performed for each feature subset; i.e., the average is taken over the 10 runs of 10-fold cross-validation for C4.5. We estimate the P-value using a two-tailed t-test for two samples with unequal variances (α = 0.05). For the t-test we test the null hypothesis that the means of the results "before feature selection" and "after feature selection" are equal. The experiment shows an improvement, or no significant decrease in performance, for most datasets (8 out of 10) in C4.5's accuracy after feature selection.
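The significance test used for Table 2 is Welch's two-sample t-test (unequal variances). As an illustration only, with hypothetical per-fold error rates (the numbers below are made up, not taken from the experiments), the test statistic and the Welch–Satterthwaite degrees of freedom can be computed with the standard library:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    with (possibly) unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch–Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical per-fold error rates before and after feature selection.
before = [31.0, 29.5, 33.2, 30.8, 32.1]
after = [25.4, 26.1, 24.9, 27.0, 25.5]
t, df = welch_t(before, after)
print(t, df)
```

The corresponding two-tailed P-value is then read from the t-distribution with df degrees of freedom, e.g., via `scipy.stats.ttest_ind(before, after, equal_var=False)`.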
For the LED-24 dataset the P-values show "divide by 0" because the tree size is 19.0 for all folds both before and after feature selection, and similarly the error rate is 0.0 for all folds. So we put "1" in place of the P-values.

Table 2
C4.5 decision tree results of QBB

Datasets       Tree size                   Error rate
               Before   After   P-value    Before   After   P-value
LED-24          19.0     19.0
Lung            17.8     12.9    1.0         0.0     0.0     1.0
Lymphography    26.9     19.5    0.013      66.7    54.5     0.253
Mushroom        30.8     40.5    0.001      20.8    19.7     0.757
Par3+3          15.6     15.0    0.0         0.0
Promoters       22.6     13.8    0.638      31.4     0.0     0.057
Soybean          7.1      8.0    0.0        16.9     0.0     0.0
Splice         207.7    535.0    0.0         4.0    35.9     0.0
Vote            16.0     16.9    0.0         5.6     0.2     0.57
Zoo             18.0     18.4    0.48        4.4    31.2     0.0   0.777  10.6  4.4  1.0  7.6  0.539

For the experiments with Backpropagation, each dataset is divided into a training set (two-thirds of the original size), with the remaining one-third for testing. Running Backpropagation is very time consuming and involves the setting of some parameters, such as the network

structure (number of layers and number of hidden units), learning rate, momentum, and the number of CYCLES (epochs). In order to focus our attention on the effect of feature selection by QBB, we try to minimize the tuning of the parameters for each dataset. We fix the learning rate at 0.1, the momentum at 0.5, one hidden layer, and the number of hidden units at half the number of original input units for all datasets. The experiment is carried out in two steps: (1) a trial run to find a proper number of CYCLES for each dataset, which is determined by a sustained trend in which the error does not decrease; and (2) two runs on datasets with and without feature selection, respectively, using the number of CYCLES found in step 1. Other parameters remain fixed for the two runs in step 2. The results are shown in Table 3 with an emphasis on the difference before and after feature selection. In most cases, error rates decrease (6 out of 10) or do not change (3 out of 10) after feature selection.

Table 3
Backpropagation neural network results of QBB

Dataset        Cycles   Hidden units   Error rate before   Error rate after
LED-24          1000       12            0.06                0.0
Lung            1000       28           75.0                75.0
Lymphography    7000        9           25.0                25.0
Mushroom        5000       11            0.0                 0.0
Par3+3          1000        6           22.2                 0.0
Promoters       2000       29           46.8                25.0
Soybean         1000       18           10.0                 0.0
Splice          6000       30           25.64               42.33
Vote            4000        8            6.7                 4.0
Zoo             4000        8           10.3                 3.4

6. Conclusion and future directions

In this paper, we carry out a study of the consistency measure with different search strategies. The comparison of the consistency measure with other measures shows that it is monotonic, fast, multivariate, capable of handling some noise, and able to remove redundant and/or irrelevant features. As with other filter measures, it is not clear that it also optimizes the accuracy of a classifier used after feature selection. We investigate different search strategies for the consistency measure: exhaustive, complete, heuristic, probabilistic, and hybrid.
Their advantages and shortcomings are discussed, and an experimental comparison is done to determine their relative performance. Some guidelines are given for choosing the most suitable one in a given situation.

The fact that the consistency measure does not incorporate any search bias with regard to a particular classifier enables it to be used with a variety of different learning algorithms. As shown in the second set of experiments with the two different types of classifiers, selected features improve performance in terms of lower error rates in most cases. Features selected without search bias bring efficiency in the later stage, as the evaluation of a feature subset becomes simpler than that of a full set. On the other hand, since a set of features is deemed consistent if any function maps from the values of the features

to the class labels, any algorithm optimizing this criterion may choose a small set of features that has a complicated function, while overlooking larger sets of features admitting simple rules. Although intuitively this should be relatively rare, it can happen in practice, as apparently was the case for the Splice dataset, where both C4.5's and Backpropagation's performance deteriorated after feature selection.

This work can be extended in various directions. We plan to explore a line of research that focuses on the comparison of different heuristic methods over the consistency measure. We discussed in Section 4 that there are other heuristic search techniques such as genetic algorithms, simulated annealing, tabu search, and estimation of distribution algorithms. SetCover was the choice of heuristic search based on the known fact that the above algorithms are costly in computational time. However, the actual comparison results among these techniques could be interesting from a research point of view.

Acknowledgements

We would like to thank Amit Mandvikar for help in performing the experiments. Our understanding of the subject was also enhanced by numerous discussions with Prof. Hiroshi Motoda.

References

[1] H. Almuallim, T.G. Dietterich, Learning boolean concepts in the presence of many irrelevant features, Artificial Intelligence 69 (1–2) (1994) 279–305.
[2] D.A. Bell, H. Wang, A formalism for relevance and its application in feature subset selection, Machine Learning 41 (2000) 175–195.
[3] M. Ben-Bassat, Pattern recognition and reduction of dimensionality, in: P.R. Krishnaiah, L.N. Kanal (Eds.), Handbook of Statistics, North-Holland, Amsterdam, 1982, pp. 773–791.
[4] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[5] A.L. Blum, P. Langley, Selection of relevant features and examples in machine learning, Artificial Intelligence 97 (1997) 245–271.
[6] A.
Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Occam's razor, in: J.W. Shavlik, T.G. Dietterich (Eds.), Readings in Machine Learning, Morgan Kaufmann, San Mateo, CA, 1990, pp. 201–204.
[7] G. Brassard, P. Bratley, Fundamentals of Algorithms, Prentice Hall, Englewood Cliffs, NJ, 1996.
[8] C. Cardie, Using decision trees to improve case-based learning, in: Proceedings of Tenth International Conference on Machine Learning, Amherst, MA, 1993, pp. 25–32.
[9] S. Das, Filters, wrappers and a boosting-based hybrid for feature selection, in: Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Williamstown, MA, 2001, pp. 74–81.
[10] M. Dash, Feature selection via set cover, in: Proceedings of IEEE Knowledge and Data Engineering Exchange Workshop, Newport, CA, IEEE Computer Society, 1997, pp. 165–171.
[11] M. Dash, H. Liu, Feature selection methods for classification, Intelligent Data Analysis: An Internat. J. 1 (3) (1997).
[12] M. Dash, H. Liu, Hybrid search of feature subsets, in: Proceedings of Pacific Rim International Conference on Artificial Intelligence (PRICAI-98), Singapore, 1998, pp. 238–249.
[13] M. Dash, H. Liu, H. Motoda, Consistency based feature selection, in: Proceedings of Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Kyoto, Japan, 2000, pp. 98–109.
[14] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, Englewood Cliffs, NJ, 1982.

[15] J. Doak, An evaluation of feature selection methods and their application to computer security, Technical Report, University of California, Department of Computer Science, Davis, CA, 1992.
[16] M.A. Hall, Correlation-based feature selection for discrete and numeric class machine learning, in: Proceedings of Seventeenth International Conference on Machine Learning (ICML), Stanford, CA, Morgan Kaufmann, San Mateo, CA, 2000, pp. 359–366.
[17] I. Inza, P. Larranaga, R. Etxeberria, B. Sierra, Feature subset selection by Bayesian network-based optimization, Artificial Intelligence 123 (2000) 157–184.
[18] G.H. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, 1994, pp. 121–129.
[19] D.S. Johnson, Approximation algorithms for combinatorial problems, J. Comput. System Sci. 9 (1974) 256–278.
[20] K. Kira, L.A. Rendell, The feature selection problem: Traditional methods and a new algorithm, in: Proceedings of AAAI-92, San Jose, CA, 1992, pp. 129–134.
[21] R. Kohavi, Wrappers for performance enhancement and oblivious decision graphs, PhD Thesis, Department of Computer Science, Stanford University, CA, 1995.
[22] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1–2) (1997) 273–324.
[23] D. Koller, M. Sahami, Toward optimal feature selection, in: Proceedings of International Conference on Machine Learning, Bari, Italy, 1996, pp. 284–292.
[24] I. Kononenko, Estimating attributes: Analysis and extension of RELIEF, in: Proceedings of European Conference on Machine Learning, Catania, Italy, 1994, pp. 171–182.
[25] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: An enabling technique, J. Data Mining Knowledge Discovery 6 (4) (2002) 393–423.
[26] H. Liu, H.
Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic, Boston, MA, 1998.
[27] H. Liu, H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic, Dordrecht, 1998.
[28] H. Liu, H. Motoda, M. Dash, A monotonic measure for optimal feature selection, in: Proceedings of European Conference on Machine Learning, Chemnitz, Germany, 1998, pp. 101–106.
[29] H. Liu, R. Setiono, Feature selection and classification—A probabilistic wrapper approach, in: Proceedings of Ninth International Conference on Industrial and Engineering Applications of AI and ES, 1996, pp. 419–424.
[30] H. Liu, R. Setiono, A probabilistic approach to feature selection—A filter solution, in: Proceedings of International Conference on Machine Learning, Bari, Italy, 1996, pp. 319–327.
[31] M. Modrzejewski, Feature selection using rough sets theory, in: P.B. Brazdil (Ed.), Proceedings of the European Conference on Machine Learning, Vienna, Austria, 1993, pp. 213–226.
[32] A.W. Moore, M.S. Lee, Efficient algorithms for minimizing cross validation error, in: Proceedings of Eleventh International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, 1994, pp. 190–198.
[33] A.N. Mucciardi, E.E. Gose, A comparison of seven techniques for choosing subsets of pattern recognition, IEEE Trans. Comput. C-20 (1971) 1023–1031.
[34] P.M. Narendra, K. Fukunaga, A branch and bound algorithm for feature selection, IEEE Trans. Comput. C-26 (9) (1977) 917–922.
[35] A.L. Oliveira, A.S. Vincentelli, Constructive induction using a non-greedy strategy for feature selection, in: Proceedings of Ninth International Conference on Machine Learning, Aberdeen, Scotland, Morgan Kaufmann, San Mateo, CA, 1992, pp. 355–360.
[36] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[37] T.W.
Rauber, Inductive pattern classification methods—Features—Sensors, PhD Thesis, Department of Electrical Engineering, Universidade Nova de Lisboa, 1994.
[38] J.C. Schlimmer, Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning, in: Proceedings of Tenth International Conference on Machine Learning, Amherst, MA, 1993, pp. 284–290.
[39] J. Segen, Feature selection and constructive inference, in: Proceedings of Seventh International Conference on Pattern Recognition, Montreal, QB, 1984, pp. 1344–1346.

[40] J. Sheinvald, B. Dom, W. Niblack, A modelling approach to feature selection, in: Proceedings of Tenth International Conference on Pattern Recognition, Atlantic City, NJ, Vol. 1, 1990, pp. 535–539.
[41] W. Siedlecki, J. Sklansky, On automatic feature selection, Internat. J. Pattern Recognition Artificial Intelligence 2 (1988) 197–220.
[42] D.B. Skalak, Prototype and feature selection by sampling and random mutation hill-climbing algorithms, in: Proceedings of Eleventh International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, 1994, pp. 293–301.
[43] P. Soucy, G.W. Mineau, A simple feature selection method for text classification, in: Proceedings of IJCAI-01, Seattle, WA, 2001, pp. 897–903.
[44] H. Vafaie, I.F. Imam, Feature selection methods: Genetic algorithms vs. greedy-like search, in: Proceedings of International Conference on Fuzzy and Intelligent Control Systems, 1994.
[45] S. Watanabe, Pattern Recognition: Human and Mechanical, Wiley Interscience, New York, 1985.
[46] A. Zell, et al., Stuttgart Neural Network Simulator (SNNS), User Manual, Version 4.1, Technical Report, 1995.
