{"id":484,"date":"2018-10-14T11:00:28","date_gmt":"2018-10-14T17:00:28","guid":{"rendered":"http:\/\/www.jacobsoft.com.mx\/?p=484"},"modified":"2025-02-20T13:37:50","modified_gmt":"2025-02-20T19:37:50","slug":"clasificacion-con-arboles-de-decision","status":"publish","type":"post","link":"https:\/\/www.jacobsoft.com.mx\/en\/clasificacion-con-arboles-de-decision\/","title":{"rendered":"Classification with Decision Trees"},"content":{"rendered":"<h3 class=\"wp-block-heading\">Classification with Decision Trees<\/h3>\n\n\n\n<p>CART (Classification and Regression Tree) trees are decision trees for classification or regression problems. In the previous article, <a href=\"https:\/\/www.jacobsoft.com.mx\/en\/arboles-de-regresion-usando-python\/\">Regression trees using Python<\/a>,\u00a0I explained the types of trees and the regression algorithm; in this article we will discuss the use of trees for classification.<\/p>\n\n\n\n<p>CART trees were introduced in 1984. The algorithm works by posing a series of questions, where each answer determines the next question. The main components of these trees are:\u00a0<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>The rules for splitting the data at each node<\/li><li>The rules for deciding where a branch terminates<\/li><li>The prediction of the target value at each terminal node<\/li><\/ol>\n\n\n\n<p>The main advantages of CART trees are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>They are non-parametric models, so they do not depend on the distribution the data follows.<\/li><li>They are not directly affected by out-of-range values or outliers.<\/li><li>They incorporate both training and test data, along with cross-validation, to evaluate goodness of fit.<\/li><\/ul>\n\n\n\n<p>With all this, a decision tree is a graphical representation of all possible solutions that helps us make a decision. 
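The node-splitting rules listed above are driven by an impurity measure. A minimal sketch of entropy-based splitting (the helper names `entropy` and `information_gain` are illustrative, not from the article's code) could look like:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction achieved by splitting `labels` with boolean `mask`."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# A perfect split of a balanced binary set gains a full bit of information.
y = np.array([0, 0, 1, 1])
mask = np.array([True, True, False, False])
print(information_gain(y, mask))  # → 1.0
```

At each node, CART evaluates candidate splits and keeps the one with the highest gain (equivalently, the lowest weighted child impurity).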
The basic intuition behind this type of tree is to divide a large data set into smaller subsets according to certain rules, until each subset is small enough to be assigned a simple label.<\/p>\n\n\n\n<p>Each feature used to split the data is represented by a node of the tree, while the branches represent the possible decisions. The outcome of a decision is shown in a leaf node, which has no branches.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"443\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_example.png\" alt=\"\" class=\"wp-image-519\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_example.png 700w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_example-300x190.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure><\/div>\n\n\n\n<p>In the image above we see a decision tree model dealing with a binary classification problem, although in many cases we can have decision trees for multiple classes. In this example there are only two decision options: accept or decline the job offer. The branch to select is the one that provides the most information, reducing the degree of randomness in our decision.\u00a0<\/p>\n\n\n\n<p>Finally, the data is split with the intention of minimizing entropy, that is, maximizing the homogeneity of the resulting data groups.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Decision trees with Python<\/h2>\n\n\n\n<p>For the example, we use a data file with information about whether or not customers have made an online purchase. If the customer buys, the dependent variable takes the value 1; if not, it takes the value 0. 
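The pipeline built below (split, scale, fit with the entropy criterion, evaluate) can be sketched end to end. Since the article's CSV is not included here, the sketch uses synthetic stand-in data: the feature ranges and the labeling rule are invented for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the article's dataset: age and salary -> bought (1/0).
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, 400)
salary = rng.uniform(20_000, 150_000, 400)
X = np.column_stack([age, salary])
y = ((age > 40) & (salary > 80_000)).astype(int)  # toy rule, not the real labels

# 75% / 25% train-test split, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training set only, then transform both sets.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train a decision tree using entropy as the split criterion.
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

# Evaluate via the confusion matrix; accuracy is the trace over the total.
cm = confusion_matrix(y_test, clf.predict(X_test))
accuracy = cm.trace() / cm.sum()
print(cm, accuracy)
```

Because the synthetic labels follow a deterministic axis-aligned rule, the tree recovers it almost exactly; on real data like the article's, accuracy will be lower.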
The data in the independent variables are gender, age and estimated salary.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"698\" height=\"537\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/Compras_en_linea.png\" alt=\"\" class=\"wp-image-510\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/Compras_en_linea.png 698w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/Compras_en_linea-300x231.png 300w\" sizes=\"auto, (max-width: 698px) 100vw, 698px\" \/><\/figure>\n\n\n\n<p>We will use the age and the estimated salary to create a classification tree with the labels bought (1) or not bought (0), so that we can classify new records. To do this, we build the tree with a subset of the data called the training set.<\/p>\n\n\n\n\n<pre><span class=\"coments\"># Classification with Decision Trees<\/span>\n\n<span class=\"coments\"># Import the libraries<\/span>\n<span class=\"keyword\">import<\/span> numpy <span class=\"keyword\">as<\/span> np\n<span class=\"keyword\">import<\/span> matplotlib.pyplot <span class=\"keyword\">as<\/span> plt\n<span class=\"keyword\">import<\/span> pandas <span class=\"keyword\">as<\/span> pd\n\n<span class=\"coments\"># Import the dataset<\/span>\ndataset = pd.read_csv(&#039;<span class=\"texto\">Compras_en_Linea.csv<\/span>&#039;)\nX = dataset.iloc[:, [<span class=\"keyword\">2<\/span>, <span class=\"keyword\">3<\/span>]].values\ny = dataset.iloc[:, <span class=\"keyword\">4<\/span>].values\n\n<span class=\"coments\"># Split the data set into training data and test data\n# (the old sklearn.cross_validation module is now sklearn.model_selection)<\/span>\n<span class=\"keyword\">from<\/span> sklearn.model_selection <span class=\"keyword\">import<\/span> train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class=\"keyword\">0.25<\/span>, random_state=<span class=\"keyword\">0<\/span>)\n\n<span class=\"coments\"># Feature scaling<\/span>\n<span class=\"keyword\">from<\/span> sklearn.preprocessing <span class=\"keyword\">import<\/span> StandardScaler\nsc = StandardScaler()\nX_train = sc.fit_transform(X_train)\nX_test = sc.transform(X_test)\n\n<span class=\"coments\"># Create the Decision Tree classifier and train it<\/span>\n<span class=\"keyword\">from<\/span> sklearn.tree <span class=\"keyword\">import<\/span> DecisionTreeClassifier\nclassifier = DecisionTreeClassifier(criterion=&#039;<span class=\"texto\">entropy<\/span>&#039;, random_state=<span class=\"keyword\">0<\/span>)\nclassifier.fit(X_train, y_train)\n<\/pre>\n\n\n\n<p>When executing this piece of code, we load the data set of 400 customer records, of which 25% will be used to test the model and 75% (300 records) to train it, that is, to build the tree from the age and the estimated salary (columns 2 and 3).<\/p>\n\n\n\n\n<pre>\n<span class=\"keyword\">import<\/span> pydotplus\n<span class=\"keyword\">from<\/span> sklearn.tree <span class=\"keyword\">import<\/span> export_graphviz\n\ndot_data = export_graphviz(classifier, out_file=None, filled=True, feature_names=[&#039;Age&#039;, &#039;Salary&#039;])\ngraph = pydotplus.graph_from_dot_data(dot_data)\ngraph.write_pdf(&#039;tree.pdf&#039;)\n<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"1877\" height=\"1440\" 
src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_sin_escalas.png\" alt=\"\" class=\"wp-image-521\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_sin_escalas.png 1877w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_sin_escalas-300x230.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_sin_escalas-768x589.png 768w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_sin_escalas-1024x786.png 1024w\" sizes=\"auto, (max-width: 1877px) 100vw, 1877px\" \/><\/figure><\/div>\n\n\n\n<p>Given the size of the training set, the tree is quite large, but on viewing the image in detail we can see that each leaf (end node) has an entropy equal to zero. Each node also shows the number of records that satisfy its criteria.<\/p>\n\n\n\n<p>If we now fit the tree to the 100 records of the test set, the resulting graph is smaller and easier to read.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"765\" height=\"586\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_test.png\" alt=\"\" class=\"wp-image-522\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_test.png 765w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_test-300x230.png 300w\" sizes=\"auto, (max-width: 765px) 100vw, 765px\" \/><\/figure><\/div>\n\n\n\n<p>In the final node on the right side at the third level, we read entropy = 0, samples = 13 and value = [0, 13]; this blue node indicates that customers older than 42.5 with a salary greater than 84.5 do buy.<\/p>\n\n\n\n<p>Now, if we make the prediction for the test set and check the confusion matrix, we have:<\/p>\n\n\n\n\n<pre><span class=\"coments\"># Predicting the Test set results<\/span>\ny_pred = classifier.predict(X_test)\n\n<span class=\"coments\"># Making the 
Confusion Matrix<\/span>\n<span class=\"keyword\">from<\/span> sklearn.metrics <span class=\"keyword\">import<\/span> confusion_matrix\ncm = confusion_matrix(y_test, y_pred)\n<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"416\" height=\"338\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_decision_matriz_confusion.png\" alt=\"\" class=\"wp-image-523\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_decision_matriz_confusion.png 416w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/arbol_decision_matriz_confusion-300x244.png 300w\" sizes=\"auto, (max-width: 416px) 100vw, 416px\" \/><\/figure><\/div>\n\n\n\n<p>We observe that, out of the 100 records in the test set, there are 9 errors: 3 records that should have been classified as 1 were classified as 0 (these are the false negatives), and 6 records were classified as 1 when they should have been 0 (these are the false positives).\u00a0<\/p>\n\n\n\n<p>This implies that the accuracy of the model is 0.91, that is, 91%, better than the result obtained on the same test set with the <a href=\"https:\/\/www.jacobsoft.com.mx\/en\/regresion-logistica\/\">Logistic regression<\/a> of the previous article.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"656\" height=\"584\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_graph.png\" alt=\"\" class=\"wp-image-524\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_graph.png 656w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/decision_tree_graph-300x267.png 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/figure><\/div>\n\n\n\n<p>The previous graph was created for the test set; in it we see the decision boundaries for the data, the 
green zone is for customers who do buy and the red zone for those who do not.<\/p>\n<style class=\"advgb-styles-renderer\">\n\t.coments{color:gray;}\n\t.keyword{color:blue;}\n\t.texto{color:green;}\n<\/style>","protected":false},"excerpt":{"rendered":"<p>Classification with Decision Trees CART (Classification and Regression Tree) trees constitute trees of &hellip; <\/p>","protected":false},"author":2,"featured_media":485,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advgb_blocks_editor_width":"","advgb_blocks_columns_visual_guide":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[25,35,46],"tags":[66,67,68,57,55,58,56,54,82,50,59],"class_list":["post-484","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-algoritmos","category-inteligencia-artificial","category-machine-learning","tag-analisis-de-datos","tag-arboles-de-clasificacion","tag-arboles-de-decision","tag-ciencia-de-datos","tag-clasificacion","tag-data-mining","tag-data-science","tag-inferencia","tag-inteligencia-artificial","tag-machine-learning","tag-mineria-de-datos"],"aioseo_notices":[],"author_meta":{"display_name":"Jacob Avila 
Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"featured_img":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacata_arboles_de_decision-300x165.png","featured_image_src":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacata_arboles_de_decision.png","featured_image_src_square":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacata_arboles_de_decision.png","author_info":{"display_name":"Jacob Avila Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"coauthors":[],"tax_additional":{"categories":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/algoritmos\/\" class=\"advgb-post-tax-term\">Algoritmos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/\" class=\"advgb-post-tax-term\">Inteligencia Artificial<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Machine Learning<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">Algoritmos<\/span>","<span class=\"advgb-post-tax-term\">Inteligencia Artificial<\/span>","<span class=\"advgb-post-tax-term\">Machine Learning<\/span>"]},"tags":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">\u00e1rboles de clasificaci\u00f3n<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">\u00e1rboles de decisi\u00f3n<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Ciencia de Datos<\/a>","<a 
href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">clasificaci\u00f3n<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Mining<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Science<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">inferencia<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Inteligencia Artificial<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">machine learning<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Miner\u00eda de Datos<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/span>","<span class=\"advgb-post-tax-term\">\u00e1rboles de clasificaci\u00f3n<\/span>","<span class=\"advgb-post-tax-term\">\u00e1rboles de decisi\u00f3n<\/span>","<span class=\"advgb-post-tax-term\">Ciencia de Datos<\/span>","<span class=\"advgb-post-tax-term\">clasificaci\u00f3n<\/span>","<span class=\"advgb-post-tax-term\">Data Mining<\/span>","<span class=\"advgb-post-tax-term\">Data Science<\/span>","<span class=\"advgb-post-tax-term\">inferencia<\/span>","<span class=\"advgb-post-tax-term\">Inteligencia Artificial<\/span>","<span class=\"advgb-post-tax-term\">machine learning<\/span>","<span class=\"advgb-post-tax-term\">Miner\u00eda de Datos<\/span>"]}},"comment_count":"4","relative_dates":{"created":"Posted 8 years ago","modified":"Updated 1 year ago"},"absolute_dates":{"created":"Posted on October 14, 
2018","modified":"Updated on February 20, 2025"},"absolute_dates_time":{"created":"Posted on October 14, 2018 11:00 am","modified":"Updated on February 20, 2025 1:37 pm"},"featured_img_caption":"","series_order":"","_links":{"self":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/484","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/comments?post=484"}],"version-history":[{"count":3,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/484\/revisions"}],"predecessor-version":[{"id":525,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/484\/revisions\/525"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media\/485"}],"wp:attachment":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media?parent=484"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/categories?post=484"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/tags?post=484"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}