Thursday, April 4, 2019
K Means Clustering With Decision Tree Computer Science Essay
The K-means clustering data mining algorithm is commonly used to find clusters because it is simple to implement and fast to execute. After applying the K-means clustering algorithm to a dataset, it is difficult to interpret the clusters and extract the required results from them unless another data mining algorithm is used. The decision tree (ID3) is used for the interpretation of the clusters of the K-means algorithm because ID3 is faster to use, generates understandable rules more easily and is simpler to explain. In this research paper we integrate the K-means clustering algorithm with the decision tree (ID3) algorithm into one algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). This LIAgent is capable of performing the classification and interpretation of the given dataset. For the visualization of the clusters, 2D scattered graphs are drawn.

Keywords: Classification, LIAgent, Interpretation, Visualization

1. Introduction

Data mining algorithms are applied to discover hidden, new patterns and relations in complex datasets. The use of intelligent mobile agents in data mining algorithms further boosts their study. The term intelligent mobile agent combines two different disciplines: the agent comes from Artificial Intelligence, and code mobility is defined in distributed systems. An agent is an object which has an independent thread of control and can be initiated. The first step is the agent initialization. The agent will then start to operate and may stop and start again depending upon the environment and the tasks that it tries to accomplish. After the agent has finished all the required tasks, it ends in its complete state. Table 1 elaborates the different states of an agent [1][2][3][4].

Table 1. States of an agent

Name of Step | Description
Initialize | Performs one-time setup activity.
Start | Starts its job or task.
Stop | Stops its jobs or tasks after saving intermediate results.
Complete | Performs completion or termination activity.

There is a link between Artificial Intelligence (AI) and Intelligent Agents (IA). Data mining is known as Machine Learning in Artificial Intelligence. Machine Learning deals with the development of techniques which allow the computer to learn. It is a method of creating computer programs by the analysis of datasets. The agents must be able to learn to do classification, clustering and prediction using learning algorithms [5][6][7][8].

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely the K-means clustering and the Decision tree (ID3). Section 3 describes the methodology, a hybrid integration of the data mining algorithms. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusion.

2. Overview of Data Mining Algorithms

The K-means clustering data mining algorithm is used for the classification of a dataset by producing the clusters of that dataset. The K-means clustering algorithm is a kind of unsupervised machine learning. The decision tree (ID3) data mining algorithm is used to interpret these clusters by producing decision rules in if-then-else form. The decision tree (ID3) algorithm is a type of supervised machine learning. Both of these algorithms are combined into one algorithm through intelligent agents, called the Learning Intelligent Agent (LIAgent). In this section we discuss both of these algorithms.

2.1. K-means clustering Algorithm

The following steps explain the K-means clustering algorithm:

Step 1: Enter the number of clusters and the number of iterations, which are the required and basic inputs of the K-means clustering algorithm.

Step 2: Compute the initial centroids by using the Range Method shown in equations 1 and 2.

(1)
(2)

The initial centroid is C(ci, cj), where max X, max Y, min X and min Y represent the maximum and minimum values of the X and Y attributes respectively. k represents the number of clusters, and i, j and n vary from 1 to k, where k is an integer. In this way we can calculate the initial centroids; this is the starting point of the algorithm. The value (max X - min X) gives the range of the X attribute; similarly, the value (max Y - min Y) gives the range of the Y attribute. The value of n varies from 1 to k. The number of iterations should be small, otherwise the time and space complexity will be very high and the values of the initial centroids will also become very high and may fall outside the range of the given dataset. This is a major drawback of the K-means clustering algorithm.

Step 3: Calculate the distance using the Euclidean distance formula in equation 3. On the basis of the distances, generate the partition by assigning each sample to the closest cluster.

Euclidean Distance Formula:

$d(x_i, x_j) = \sqrt{\sum_{m=1}^{N} (x_{im} - x_{jm})^2}$ (3)

where d(x_i, x_j) is the distance between objects x_i and x_j, x_{im} and x_{jm} are their m-th attribute values, and N is the total number of attributes of a given object. i, j, m and N are integers.

Step 4: Compute the new cluster centers as the centroids of the clusters, then compute the distances again and generate the partition. Repeat this until the cluster memberships stabilize [9][10].

The strengths and weaknesses of the K-means clustering algorithm are discussed in table 2.

Table 2. Strengths and Weaknesses of the K-means clustering Algorithm

Strengths | Weaknesses
Time complexity is O(nkl). Linear time complexity in the size of the dataset. | It is easy to implement, but it has the drawback of depending on the initial centres provided.
Space complexity is O(k + n). | If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy.
It is an order-independent algorithm. It generates the same partition of the data irrespective of the order of the samples. | The results obtained from this clustering algorithm can be interpreted in different ways.
Not applicable | No clustering technique addresses all the requirements adequately and concurrently.

The K-means clustering algorithm can be applied in, but is not limited to, the following areas:

Marketing: Finding groups of customers with similar behavior, given a large database of customers containing their profiles and past records.
Biology: Classification of plants and animals given their features.
Libraries: Book ordering.
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
City-planning: Identifying groups of houses according to their house type, value and geographical location.
Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.
WWW: Document classification; clustering web log data to discover groups of similar access patterns.
Medical Sciences: Classification of medicines; patient records according to their doses etc. [11][12].

2.2. Decision Tree (ID3) Algorithm

The decision tree (ID3) produces decision rules as its output. The decision rules obtained from ID3 are in if-then-else form, which can be used for decision support systems, classification and prediction. The decision rules help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the decision tree (ID3) is shown in figure 1.

Figure 1. The Function of Decision Tree (ID3) algorithm

The cluster is the input data for the decision tree (ID3) algorithm, which produces the decision rules for the cluster.

The following steps explain the Decision Tree (ID3) algorithm:

Step 1: Let S be a training set. If all instances in S are positive, then create a YES node and halt. If all instances in S are negative, create a NO node and halt. Otherwise select a feature F with values v1, ..., vn and create a decision node.

Step 2: Partition the training instances in S into subsets S1, S2, ..., Sn according to the values of F.

Step 3: Apply the algorithm recursively to each of the sets Si [13][14].

Table 3 shows the strengths and weaknesses of the ID3 algorithm.

Table 3. Strengths and Weaknesses of Decision Tree (ID3) Algorithm

Strengths | Weaknesses
It generates understandable rules. | It is less appropriate for a continuous attribute.
It performs classification without requiring much computation. | It does not perform well in problems with many classes and a small number of training examples.
It is suitable to handle both continuous and categorical variables. | Growing a decision tree is computationally expensive because each node is sorted before the best split is found.
It provides an indication for prediction or classification. | It is suitable for a single field and does not handle non-rectangular regions well.

3. Methodology

We combine two different data mining algorithms, namely the K-means clustering and the Decision tree (ID3), into one algorithm using an intelligent agent called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of clustering and interpreting the given dataset. The clusters can also be visualized by using 2D scattered graphs. The architecture of this agent system is shown in figure 2.

Figure 2. The Architecture of LIAgent System

The LIAgent is a combination of two data mining algorithms: the first is the K-means clustering algorithm and the second is the Decision tree (ID3) algorithm.
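As a rough sketch of the clustering stage (Steps 1 to 4 of Section 2.1), the Python code below takes the number of clusters and the number of iterations as inputs, assigns each sample to its closest centroid by Euclidean distance, and recomputes the centroids until the memberships stabilize. The Range Method formulas are not reproduced in this text, so `initial_centroids` is an assumed stand-in that spaces the centroids evenly across each attribute's range; all function names are illustrative and not taken from the LIAgent implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length attribute tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def initial_centroids(points, k):
    """Assumed initialisation (not the paper's Range Method):
    spread k centroids evenly across the range of every attribute."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [
        tuple(lo[d] + (hi[d] - lo[d]) * n / (k + 1) for d in dims)
        for n in range(1, k + 1)
    ]


def kmeans(points, k, max_iterations):
    """Return (labels, centroids) after at most max_iterations passes."""
    centroids = initial_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 3: assign every sample to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # memberships have stabilised
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Made-up two-attribute samples, used only to exercise the sketch.
samples = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
labels, centroids = kmeans(samples, k=2, max_iterations=50)
```

Stopping as soon as the labels repeat keeps the iteration count an upper bound rather than a fixed cost, which matches the "repeat until the cluster memberships stabilize" condition in Step 4.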
The K-means clustering algorithm produces the clusters of the given dataset, which is the classification of that dataset, and the Decision tree (ID3) produces the decision rules for each cluster, which are useful for the interpretation of these clusters. The user can access both the clusters and the decision rules from the LIAgent. The LIAgent is thus used for the classification and the interpretation of the given dataset. The clusters of the LIAgent are further used for visualization using 2D scattered graphs. Decision tree (ID3) is faster to use, generates understandable rules more easily and is simpler to explain, since any decision that is made can be understood by viewing the path of the decision. The rules also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form, which can be used for decision support systems, classification and prediction.

A medical dataset, Diabetes, is used in this research paper. This is a dataset/testbed of 790 records. The data of the Diabetes dataset is pre-processed, which is called data standardization. The interval scaled data is properly cleansed. The attributes of the dataset/testbed Diabetes are:

Number of times pregnant (NTP) (min. age = 21, max. age = 81)
Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)
Diastolic blood pressure (mm Hg) (DBP)
Triceps skin fold thickness (mm) (TSFT)
2-Hour serum insulin (mu U/ml) (2HIS)
Body mass index (weight in kg/(height in m)^2) (BMI)
Diabetes pedigree function (DPF)
Age
Class (whether diabetes is cat 1 or cat 2) [15].

We create four vertical partitions of the dataset Diabetes by selecting the proper number of attributes. This is illustrated in tables 4 to 7.

Table 4. 1st Vertical partition of Diabetes Dataset

NTP | DPF | Class
4 | 0.627 | -ive
2 | 0.351 | +ive
2 | 2.288 | -ive

Table 5. 2nd Vertical partition of Diabetes Dataset

DBP | AGE | Class
72 | 50 | -ive
66 | 31 | +ive
64 | 33 | -ive

Table 6. 3rd Vertical partition of Diabetes Dataset

TSFT | BMI | Class
35 | 33.6 | -ive
29 | 28.1 | +ive
0 | 43.1 | -ive

Table 7. 4th Vertical partition of Diabetes Dataset

PGC | 2HIS | Class
148 | 0 | -ive
85 | 94 | +ive
185 | 168 | -ive

Each partitioned table is a dataset of 790 records; only 3 records are shown in each table as examples. For the LIAgent, the number of clusters k is 4 and the number of iterations n in each case is 50, i.e. k = 4 and n = 50. The decision rules of each cluster are obtained. For the visualization of the results of these clusters, 2D scattered graphs are also drawn.

4. Results and Discussion

The results of the LIAgent are discussed in this section. The LIAgent produces two outputs, namely the clusters and the decision rules for the given dataset. In total, sixteen clusters are obtained for the four partitions, four clusters per partition. Not all the clusters are good for classification; only the required and useful clusters are discussed further. Sixteen sets of decision rules are also generated by the LIAgent. We present the decision rules of three different clusters. The number of decision rules varies from cluster to cluster; it depends upon the number of records in the cluster.

The Decision Rules of the 4th partition of the dataset Diabetes:

Rule 1: if PGC = 165 then Class = Cat2
else Rule 2: if PGC = 153 then Class = Cat2
else Rule 3: if PGC = 157 then Class = Cat2
else Rule 4: if PGC = 139 then Class = Cat2
else Rule 5: if HIS = 545 then Class = Cat2
else Rule 6: if HIS = 744 then Class = Cat2
else Class = Cat1

There are only six decision rules for the 4th partition of the dataset.
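Because every rule in this chain tests one attribute for equality and falls through to the next rule otherwise, the chain can be read as a first-match lookup. The sketch below encodes the six rules listed for the 4th partition that way, with the Rule 1 threshold taken as 165; the names `PARTITION4_RULES` and `classify` are illustrative assumptions, not part of the LIAgent.

```python
# The six rules of the 4th partition as (attribute, value, class) triples,
# kept in the order in which they are listed.
PARTITION4_RULES = [
    ("PGC", 165, "Cat2"),
    ("PGC", 153, "Cat2"),
    ("PGC", 157, "Cat2"),
    ("PGC", 139, "Cat2"),
    ("HIS", 545, "Cat2"),
    ("HIS", 744, "Cat2"),
]
PARTITION4_DEFAULT = "Cat1"  # the final "else" branch


def classify(record, rules=PARTITION4_RULES, default=PARTITION4_DEFAULT):
    """Return the class of the first rule whose attribute value matches."""
    for attribute, value, label in rules:
        if record.get(attribute) == value:
            return label
    return default


# Hypothetical records, used only to show the call.
first = classify({"PGC": 153, "HIS": 0})
second = classify({"PGC": 100, "HIS": 744})
third = classify({"PGC": 100, "HIS": 0})
```

Keeping the rules as data rather than as nested `if` statements makes it easy to swap in the rule list of another cluster.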
It is easy for anyone to take a decision and interpret the results of this cluster.

The Decision Rules of the 1st partition of the dataset Diabetes:

Rule 1: if DPF = 1.32 then Class = Cat1
else Rule 2: if DPF = 2.29 then Class = Cat1
else Rule 3: if NTP = 2 then Class = Cat2
else Rule 4: if DPF = 2.42 then Class = Cat1
else Rule 5: if DPF = 2.14 then Class = Cat1
else Rule 6: if DPF = 1.39 then Class = Cat1
else Rule 7: if DPF = 1.29 then Class = Cat1
else Rule 8: if DPF = 1.26 then Class = Cat1
else Class = Cat2

There are eight decision rules for the 1st partition of the dataset. The decision rules make the cluster easy to interpret and also help in taking a decision.

The Decision Rules of the 3rd partition of the dataset Diabetes:

Rule 1: if BMI = 29.9 then Class = Cat1
else Rule 2: if BMI = 32.9 then Class = Cat1
else Rule 3: if TSFT = 23 then
    Rule 4: if BMI = 25.5 then Class = Cat1
    else Rule 5: if BMI = 30.1 then Class = Cat1
    else Rule 6: if BMI = 28.4 then Class = Cat1
    else Class = Cat2
else Rule 7: if BMI = 22.9 then Class = Cat1
else Rule 8: if BMI = 27.6 then Class = Cat1
else Rule 9: if BMI = 29.7 then Class = Cat1
else Rule 10: if BMI = 27.1 then Class = Cat1
else Rule 11: if BMI = 25.8 then Class = Cat1
else Rule 12: if BMI = 28.9 then Class = Cat1
else Rule 13: if BMI = 23.4 then Class = Cat1
else Rule 14: if BMI = 30.5 then
    Rule 15: if TSFT = 18 then Class = Cat2
    else Class = Cat1
else Rule 16: if BMI = 26.6 then
    Rule 17: if TSFT = 18 then Class = Cat2
    else Class = Cat1
else Rule 18: if BMI = 32 then
    Rule 19: if TSFT = 15 then Class = Cat2
    else Class = Cat1
else Rule 20: if BMI = 31.6 then Class = Cat2, Cat1
else Class = Cat2

There are twenty decision rules for the 3rd partition of the dataset. The number of rules for this cluster is higher than for the other two clusters discussed.

Visualization is an important tool which provides a better understanding of the data and illustrates the relationship among the attributes of the data. For the visualization of the clusters, 2D scattered graphs are drawn for all the clusters.
We present four 2D scattered graphs of four different clusters from different partitions.

Figure 3. 2D Scattered Graph between NTP and DPF attributes of Diabetes dataset

The distance between the NTP and DPF attributes of the Diabetes dataset varies at the beginning of the graph, but after some interval the distance becomes constant.

Figure 4. 2D Scattered Graph between DBP and AGE attributes of Diabetes dataset

There is a variable distance between the DBP and AGE attributes of the dataset. It remains variable throughout this graph.

Figure 5. 2D Scattered Graph between TSFT and BMI attributes of Diabetes dataset

The graph shows an almost constant distance between the TSFT and BMI attributes of the dataset. It remains constant throughout the graph.

Figure 6. 2D Scattered Graph between PGC and 2HIS attributes of Diabetes dataset

There is a variable distance between the PGC and 2HIS attributes of the dataset, but in the middle of this graph there is some constant distance between these attributes. The structure of this graph is similar to the graph of figure 5.

5. Conclusion

It is not simple for all users to interpret the clusters and extract the required results from them unless some other data mining algorithms or other tools are used. In this research paper we have tried to address the issue by integrating the K-means clustering algorithm with the Decision tree (ID3) algorithm. ID3 was chosen because its output is a set of decision rules in if-then-else form, which are easy to understand and help in taking a decision. The result is a hybrid combination of supervised and unsupervised machine learning, using an intelligent agent, called the LIAgent. The LIAgent is helpful in the classification and prediction of the given dataset. Furthermore, 2D scattered graphs of the clusters are drawn for the visualization.
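The rule-generating half of this combination can be illustrated with a minimal ID3 sketch that follows the three steps of Section 2.2: stop when a subset is pure, otherwise pick a feature, partition on its values and recurse. It assumes the usual ID3 criterion of choosing the feature with the highest information gain, which the steps above do not spell out. The function names and the toy rows are made up for illustration and are not records from the Diabetes dataset.

```python
import math
from collections import Counter


def entropy(rows, target):
    """Shannon entropy of the target labels in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def id3(rows, features, target):
    """Return a class label (leaf) or a nested dict {feature: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:      # Step 1: pure subset, create a leaf
        return labels[0]
    if not features:               # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(feature):
        # Information gain = entropy before split - weighted entropy after.
        sizes = Counter(r[feature] for r in rows)
        remainder = sum(
            n / len(rows) * entropy([r for r in rows if r[feature] == v], target)
            for v, n in sizes.items()
        )
        return entropy(rows, target) - remainder

    best = max(features, key=gain)  # assumed selection criterion
    rest = [f for f in features if f != best]
    # Steps 2 and 3: partition on the chosen feature and recurse.
    return {
        best: {
            v: id3([r for r in rows if r[best] == v], rest, target)
            for v in sorted({r[best] for r in rows})
        }
    }


# Made-up categorical rows, used only to exercise the sketch.
toy_rows = [
    {"PGC": "high", "HIS": "high", "Class": "Cat2"},
    {"PGC": "high", "HIS": "low", "Class": "Cat2"},
    {"PGC": "high", "HIS": "low", "Class": "Cat2"},
    {"PGC": "low", "HIS": "high", "Class": "Cat2"},
    {"PGC": "low", "HIS": "low", "Class": "Cat1"},
    {"PGC": "low", "HIS": "low", "Class": "Cat1"},
    {"PGC": "low", "HIS": "low", "Class": "Cat1"},
]
tree = id3(toy_rows, ["PGC", "HIS"], "Class")
```

Each path from the root of the returned dictionary to a leaf corresponds to one if-then-else rule of the kind listed in Section 4.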