{"id":539,"date":"2018-10-21T02:10:37","date_gmt":"2018-10-21T08:10:37","guid":{"rendered":"http:\/\/www.jacobsoft.com.mx\/?p=539"},"modified":"2025-02-20T13:37:50","modified_gmt":"2025-02-20T19:37:50","slug":"k-means-clustering-con-python","status":"publish","type":"post","link":"https:\/\/www.jacobsoft.com.mx\/en\/k-means-clustering-con-python\/","title":{"rendered":"k-Means Clustering with Python"},"content":{"rendered":"<h2 class=\"wp-block-heading\">k-Means Clustering with Python<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">As described in the previous article, <a href=\"https:\/\/www.jacobsoft.com.mx\/en\/clustering-analysis\/\">Cluster Analysis<\/a>, the k-Means method is a non-hierarchical, centroid-based method that is robust and easy to implement. The number of groups to be generated, to which the data will be assigned, must be specified in advance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Additionally, this type of method is recommended for large amounts of data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Watch this topic on video<\/h2>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"K Means Clustering\" width=\"780\" height=\"585\" src=\"https:\/\/www.youtube.com\/embed\/SwVCfiJNfwg?feature=oembed\" 
frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen><\/iframe>\n<\/div><figcaption>Also subscribe to my channel on <a href=\"https:\/\/www.youtube.com\/channel\/UCHQDZW3R0NqPyAE3kOccOAw\" target=\"_blank\" aria-label=\"youtube (opens in a new tab)\" rel=\"noreferrer noopener\" class=\"aioseop-link\">YouTube<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>dataset<\/strong> used in the example can be downloaded from this link: <strong><a href=\"https:\/\/drive.google.com\/file\/d\/1d0P1elh1B3lX9g3tE981ZWpZLRl9zAt_\/view?usp=sharing\" target=\"_blank\" rel=\"noreferrer noopener\">dataset<\/a><\/strong><\/p>\n\n\n\n<div style=\"height:21px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The algorithm works in the following way. Suppose we have the following data set:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"557\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/before_k_means.png\" alt=\"\" class=\"wp-image-558\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/before_k_means.png 722w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/before_k_means-300x231.png 300w\" sizes=\"auto, (max-width: 722px) 100vw, 722px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">After applying the algorithm, we should obtain the following result:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"962\" height=\"374\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/after_k_means.png\" alt=\"\" class=\"wp-image-559\" 
srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/after_k_means.png 962w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/after_k_means-300x117.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/after_k_means-768x299.png 768w\" sizes=\"auto, (max-width: 962px) 100vw, 962px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">To get there, the procedure followed by the algorithm is as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>K-Means Algorithm\n1. Select the number of groups (clusters), k\n2. Randomly generate k points that we will call centroids\n3. Assign each element of the data set to the nearest centroid to form k groups\n4. Reassign the position of each centroid (move it to the mean of the elements assigned to it)\n5. Reassign the data elements to the nearest centroid again\n   5.1 If any element was assigned to a centroid other than its previous one, return to step 4; otherwise the process is over<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To understand the algorithm clearly, let&#039;s go through it step by step, describing it graphically:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">1. Select the number of k groups<\/h4>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"914\" height=\"521\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/1_seleccionar_k.png\" alt=\"\" class=\"wp-image-560\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/1_seleccionar_k.png 914w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/1_seleccionar_k-300x171.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/1_seleccionar_k-768x438.png 768w\" sizes=\"auto, (max-width: 914px) 100vw, 914px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">For this data set, let&#039;s say that k equals 2. (We&#039;ll see how to select k later.)<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2. 
Randomly select k points that we will call centroids (k = 2)<\/h4>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"548\" height=\"305\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/2_centroides.png\" alt=\"\" class=\"wp-image-561\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/2_centroides.png 548w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/2_centroides-300x167.png 300w\" sizes=\"auto, (max-width: 548px) 100vw, 548px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The blue and red dots represent the two centroids located randomly in the space of the data set.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">3. Assign each element of the data set to the nearest centroid to form k = 2 groups<\/h4>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"566\" height=\"316\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_distancias.png\" alt=\"\" class=\"wp-image-562\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_distancias.png 566w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_distancias-300x167.png 300w\" sizes=\"auto, (max-width: 566px) 100vw, 566px\" \/><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"556\" height=\"309\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_estan_asignados.png\" alt=\"\" class=\"wp-image-563\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_estan_asignados.png 556w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/3_estan_asignados-300x167.png 300w\" sizes=\"auto, (max-width: 556px) 100vw, 556px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Each element was assigned to the centroid closest to it and 
in this way the k = 2 groups or clusters are formed, now the next step:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4. Reassign the position of each centroid<\/h4>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"719\" height=\"400\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/4_reubicar.png\" alt=\"\" class=\"wp-image-564\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/4_reubicar.png 719w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/4_reubicar-300x167.png 300w\" sizes=\"auto, (max-width: 719px) 100vw, 719px\" \/><\/figure><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">5. Reassign the data elements to the nearest centroid again<\/h4>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"674\" height=\"373\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_reasignar.png\" alt=\"\" class=\"wp-image-565\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_reasignar.png 674w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_reasignar-300x166.png 300w\" sizes=\"auto, (max-width: 674px) 100vw, 674px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">As we can see, there are blue elements that are now closer to the red centroid and a red element on the side of the blue centroid border, so these elements will be reassigned.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">5.1 If there were elements that were assigned to a centroid other than the original, we return to step 4<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Since there were reassigned elements, we return to step 4 and change the position of the centroids<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"723\" height=\"404\" 
src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_1_regresamos_4.png\" alt=\"\" class=\"wp-image-566\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_1_regresamos_4.png 723w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_1_regresamos_4-300x168.png 300w\" sizes=\"auto, (max-width: 723px) 100vw, 723px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Step 5 again and we reassign<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"690\" height=\"384\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_nuevamente_cambiamos.png\" alt=\"\" class=\"wp-image-567\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_nuevamente_cambiamos.png 690w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/5_nuevamente_cambiamos-300x167.png 300w\" sizes=\"auto, (max-width: 690px) 100vw, 690px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We return to step 4 again<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"722\" height=\"405\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/6_siguiente_reasignacion.png\" alt=\"\" class=\"wp-image-568\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/6_siguiente_reasignacion.png 722w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/6_siguiente_reasignacion-300x168.png 300w\" sizes=\"auto, (max-width: 722px) 100vw, 722px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">and the algorithm continues between step 4 and 5 until there are no elements that have to be reassigned from the cluster<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"686\" height=\"390\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/7_ultima_asignacion.png\" alt=\"\" class=\"wp-image-569\" 
srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/7_ultima_asignacion.png 686w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/7_ultima_asignacion-300x171.png 300w\" sizes=\"auto, (max-width: 686px) 100vw, 686px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">When there are no elements that changed cluster, the model has finished and we have the two clusters with their respective elements of the data sample.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"896\" height=\"503\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/8_fin.png\" alt=\"\" class=\"wp-image-570\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/8_fin.png 896w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/8_fin-300x168.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/8_fin-768x431.png 768w\" sizes=\"auto, (max-width: 896px) 100vw, 896px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Since the centroids are not part of the data set, they are not taken into account.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"449\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/9_sin_centroides.png\" alt=\"\" class=\"wp-image-571\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/9_sin_centroides.png 800w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/9_sin_centroides-300x168.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/9_sin_centroides-768x431.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">As you can guess, the initial position of the centroids can influence the final grouping of all elements and this generates more than one solution for the 
same number of clusters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For example, for the same data set, we could end up with two or more different groupings of the data elements, depending on the initial position of the centroids. In the following comparison picture, we have k = 3 and two final options for the same data set:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/comparativa_k_means.png\" alt=\"\" class=\"wp-image-572\" width=\"523\" height=\"178\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/comparativa_k_means.png 902w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/comparativa_k_means-300x103.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/comparativa_k_means-768x263.png 768w\" sizes=\"auto, (max-width: 523px) 100vw, 523px\" \/><figcaption>Different groupings into 3 clusters for the same data set<\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">This sensitivity to the initial centroids is mitigated by a small modification to the k-Means algorithm, which turns it into <strong>k-Means++<\/strong>: the initial centroids are chosen so that they tend to be far apart from each other.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Selection of the correct number of clusters k<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To determine the optimal number of clusters for a data sample, there are several practical 
methods, both formal and graphical, that can be used, but one of the most common techniques is the <strong>elbow method<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The elbow method is based on the sum of the squared distances between each data element and its corresponding centroid, which is denoted as follows:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"301\" height=\"99\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/wcss.png\" alt=\"\" class=\"wp-image-573\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/wcss.png 301w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/wcss-300x99.png 300w\" sizes=\"auto, (max-width: 301px) 100vw, 301px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Here WCSS stands for Within-Cluster Sum of Squares, that is, the sum of the squared distances; Y<sub>i<\/sub> is the centroid of the cluster to which the element X<sub>i<\/sub> is assigned, and n is the total number of elements in the sample.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The process starts with a single cluster for all the elements of the sample, and the sum of the squared distances from each element to the centroid is computed. Then two centroids are created, each element is assigned to its closest centroid, and the squared distances from each element to its corresponding centroid are added up. The process is repeated for 3, 4, 5 ... n centroids. 
When the number of centroids is equal to the amount of data in the sample (n), the distances are zero, since each element is a centroid.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/un_centroide.png\" alt=\"\" class=\"wp-image-574\" width=\"521\" height=\"323\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/un_centroide.png 763w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/un_centroide-300x186.png 300w\" sizes=\"auto, (max-width: 521px) 100vw, 521px\" \/><figcaption>The sum of the distances for a centroid<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/dos_centroides.png\" alt=\"\" class=\"wp-image-575\" width=\"519\" height=\"320\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/dos_centroides.png 760w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/dos_centroides-300x186.png 300w\" sizes=\"auto, (max-width: 519px) 100vw, 519px\" \/><figcaption>The sum of the distances for two centroids<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"858\" height=\"373\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/tres_centroides.png\" alt=\"\" class=\"wp-image-576\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/tres_centroides.png 858w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/tres_centroides-300x130.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/tres_centroides-768x334.png 768w\" sizes=\"auto, (max-width: 858px) 100vw, 858px\" \/><figcaption>The sum of the distances for three 
centroids<\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The WCSS value for each case (1 centroid, 2 centroids, etc.) is plotted, and we obtain a graph similar to the following:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"647\" height=\"474\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo_1.png\" alt=\"\" class=\"wp-image-577\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo_1.png 647w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo_1-300x220.png 300w\" sizes=\"auto, (max-width: 647px) 100vw, 647px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">In this example, the sum of the squared distances was calculated for 1 to 10 clusters (centroids).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The graph shows that the sum of the distances decreases as the number of clusters increases, and that the rate of decrease slows down as more clusters are added. The point where the curve forms an elbow, after which the sum of the distances changes much less, indicates the optimal number of clusters for the sample. 
In this case the optimal point is 3.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"814\" height=\"537\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo2.png\" alt=\"\" class=\"wp-image-578\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo2.png 814w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo2-300x198.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/metodo_codo2-768x507.png 768w\" sizes=\"auto, (max-width: 814px) 100vw, 814px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">To generate the graph, it is not necessary to go all the way up to n centroids, where n is the number of elements in the data set. A maximum number of clusters large enough to make the elbow visible is sufficient to determine the optimal number of clusters for the k-Means method.<\/p>\n\n\n\n<div style=\"height:23px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Implementation of k-Means with Python<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>dataset<\/strong> used in the example can be downloaded from this link: <strong><a 
href=\"https:\/\/drive.google.com\/file\/d\/1d0P1elh1B3lX9g3tE981ZWpZLRl9zAt_\/view?usp=sharing\" target=\"_blank\" rel=\"noreferrer noopener\">dataset<\/a><\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For this example with Python, we will use a sample of 200 records from a store that has rated its customers with a score ranging from 1 to 100, according to their purchase frequency and other criteria the store has defined. The data set includes the gender, age and annual income (in thousands) of each customer. However, to be able to graph the results, we will only use the annual income and the score to generate the groups of customers that exist in this sample and analyze the result.<\/p>\n\n\n\n\n<pre><span class=\"coment\"># K-Means Clustering\n# Import of libraries<\/span>\n<span class=\"key\">import<\/span> numpy <span class=\"key\">as<\/span> np\n<span class=\"key\">import<\/span> matplotlib.pyplot <span class=\"key\">as<\/span> plt\n<span class=\"key\">import<\/span> pandas <span class=\"key\">as<\/span> pd\n\n<span class=\"coment\"># Loading the data set<\/span>\ndataset = pd.read_csv(&#039;<span class=\"text\">Customers_Shop.csv<\/span>&#039;)\n<span class=\"coment\"># Columns 3 and 4: annual income (k$) and spending score<\/span>\nX = dataset.iloc[:, [3, 4]].values\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We import the libraries and load the data set, indicating that the variable to be analyzed is a matrix with columns 3 and 4 of the data set, which correspond to the annual income in thousands and the customer&#039;s score.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"548\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/clientes_tienda.png\" alt=\"\" class=\"wp-image-581\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/clientes_tienda.png 700w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/clientes_tienda-300x235.png 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The matrix X is as follows:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"416\" height=\"538\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/k_means_X.png\" alt=\"\" class=\"wp-image-582\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/k_means_X.png 416w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/k_means_X-232x300.png 232w\" sizes=\"auto, (max-width: 416px) 100vw, 416px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Our next step will be to create the graph for the elbow method and determine the optimal number of clusters that exist in the sample according to the income and the score that the store assigned to each of the customers.<\/p>\n\n\n\n\n<pre><span class=\"coment\"># Elbow method to find the optimal number of clusters<\/span>\n<span class=\"key\">from<\/span> sklearn.cluster <span class=\"key\">import<\/span> KMeans\n\nwcss = []\n<span class=\"key\">for<\/span> i <span class=\"key\">in<\/span> range(1, 11):\n    kmeans = KMeans(n_clusters=i, init=&#039;<span class=\"text\">k-means++<\/span>&#039;, random_state=42)\n    kmeans.fit(X)\n    <span class=\"coment\"># inertia_ is the within-cluster sum of squares (WCSS)<\/span>\n    wcss.append(kmeans.inertia_)\n\n<span class=\"coment\"># Graph of the sum of the distances<\/span>\nplt.plot(range(1, 11), wcss)\nplt.title(&#039;<span class=\"text\">The Elbow Method<\/span>&#039;)\nplt.xlabel(&#039;<span class=\"text\">Number of clusters<\/span>&#039;)\nplt.ylabel(&#039;<span class=\"text\">WCSS<\/span>&#039;)\nplt.show()\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In the previous block, we fit the model for 1 to 10 clusters (range(1, 11) excludes the upper bound) and obtain, for each of them, the sum of the squared distances from the inertia_ attribute of the kmeans object. The graph obtained is the following:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"656\" height=\"584\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/codo_python.png\" alt=\"\" class=\"wp-image-584\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/codo_python.png 656w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/codo_python-300x267.png 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">In the graph we observe that the decrease in the sum of the distances levels off when the number of clusters is equal to 5, so, for this practical case, the optimal number of clusters will be 5.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With this, we now generate the model for 5 clusters with the kmeans object:<\/p>\n\n\n\n\n<pre><span class=\"coment\"># Creating the k-Means for the 5 groups found<\/span>\nkmeans = KMeans(n_clusters=5, init=&#039;<span class=\"text\">k-means++<\/span>&#039;, random_state=42)\ny_kmeans = kmeans.fit_predict(X)\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The y_kmeans variable stores the group assigned to each row of the data sample, which means that each record, corresponding to a customer, is assigned to one of five groups numbered from 0 to 4.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"416\" height=\"538\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/y_kmeans.png\" alt=\"\" class=\"wp-image-585\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/y_kmeans.png 416w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/y_kmeans-232x300.png 232w\" sizes=\"auto, (max-width: 416px) 100vw, 416px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">In order to graphically observe the assignment of the 200 customers to the 5 groups or clusters, we assign a color to each group and mark the centroids in yellow:<\/p>\n\n\n\n\n<pre><span class=\"coment\"># Graphic visualization of the clusters<\/span>\nplt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c=&#039;<span class=\"text\">red<\/span>&#039;, label=&#039;<span class=\"text\">Cluster 1<\/span>&#039;)\nplt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c=&#039;<span class=\"text\">blue<\/span>&#039;, label=&#039;<span class=\"text\">Cluster 2<\/span>&#039;)\nplt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c=&#039;<span class=\"text\">green<\/span>&#039;, label=&#039;<span class=\"text\">Cluster 3<\/span>&#039;)\nplt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c=&#039;<span class=\"text\">cyan<\/span>&#039;, label=&#039;<span class=\"text\">Cluster 4<\/span>&#039;)\nplt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=100, c=&#039;<span class=\"text\">magenta<\/span>&#039;, label=&#039;<span class=\"text\">Cluster 5<\/span>&#039;)\nplt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c=&#039;<span class=\"text\">yellow<\/span>&#039;, label=&#039;<span class=\"text\">Centroids<\/span>&#039;)\nplt.title(&#039;<span class=\"text\">Clusters of customers<\/span>&#039;)\nplt.xlabel(&#039;<span class=\"text\">Annual Income (k $)<\/span>&#039;)\nplt.ylabel(&#039;<span class=\"text\">Spending Score (1-100)<\/span>&#039;)\nplt.legend()\nplt.show()\n<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"656\" height=\"584\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/grafica_k_means.png\" alt=\"\" class=\"wp-image-586\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/grafica_k_means.png 656w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/grafica_k_means-300x267.png 300w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Looking at the annual income in thousands and the score generated by the store, we observe a group of customers that could be of interest to the store: the customers in magenta, who have a high income and a high score, so they could be a target group for certain promotions. In green we have low-scoring, low-income customers, while in blue we have low-income customers with high scores, which could indicate that these customers buy a lot despite their low incomes. 
In other words, cluster analysis supports making inferences and decisions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the next article, we solve the same case with the <a href=\"https:\/\/www.jacobsoft.com.mx\/en\/clustering-jerarquico-con-python\/\">hierarchical method<\/a>.<\/p>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n<style class=\"advgb-styles-renderer\">\n.coment{color:gray;}\n.key{color:blue;}\n.text{color:green;}\n<\/style>","protected":false},"excerpt":{"rendered":"<p>k-Means Clustering with Python As described in the previous article: Cluster Analysis, the k-Means method &hellip; <\/p>","protected":false},"author":2,"featured_media":557,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advgb_blocks_editor_width":"","advgb_blocks_columns_visual_guide":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[25,35,46],"tags":[135,66,57,85,133,58,56,82,134,86,50,132],"class_list":["post-539","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-algoritmos","category-inteligencia-artificial","category-machine-learning","tag-algoritmo-de-clustering","tag-analisis-de-datos","tag-ciencia-de-datos","tag-clustering","tag-curso-de-analisis-de-datos","tag-data-mining","tag-data-science","tag-inteligencia-artificial","tag-k-means-clustering","tag-k-means","tag-machine-learning","tag-tutorial-clustering"],"aioseo_notices":[],"author_meta":{"display_name":"Jacob Avila 
Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"featured_img":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacada_cluster_k_means-300x165.png","featured_image_src":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacada_cluster_k_means.png","featured_image_src_square":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/10\/destacada_cluster_k_means.png","author_info":{"display_name":"Jacob Avila Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"coauthors":[],"tax_additional":{"categories":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/algoritmos\/\" class=\"advgb-post-tax-term\">Algoritmos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/\" class=\"advgb-post-tax-term\">Inteligencia Artificial<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Machine Learning<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">Algoritmos<\/span>","<span class=\"advgb-post-tax-term\">Inteligencia Artificial<\/span>","<span class=\"advgb-post-tax-term\">Machine Learning<\/span>"]},"tags":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">algoritmo de clustering<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Ciencia de Datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Clustering<\/a>","<a 
href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">curso de an\u00e1lisis de datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Mining<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Science<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Inteligencia Artificial<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">k means clustering<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">k-means<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">machine learning<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">tutorial clustering<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">algoritmo de clustering<\/span>","<span class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/span>","<span class=\"advgb-post-tax-term\">Ciencia de Datos<\/span>","<span class=\"advgb-post-tax-term\">Clustering<\/span>","<span class=\"advgb-post-tax-term\">curso de an\u00e1lisis de datos<\/span>","<span class=\"advgb-post-tax-term\">Data Mining<\/span>","<span class=\"advgb-post-tax-term\">Data Science<\/span>","<span class=\"advgb-post-tax-term\">Inteligencia Artificial<\/span>","<span class=\"advgb-post-tax-term\">k means clustering<\/span>","<span class=\"advgb-post-tax-term\">k-means<\/span>","<span class=\"advgb-post-tax-term\">machine learning<\/span>","<span 
class=\"advgb-post-tax-term\">tutorial clustering<\/span>"]}},"comment_count":"19","relative_dates":{"created":"Posted 8 years ago","modified":"Updated 1 year ago"},"absolute_dates":{"created":"Posted on October 21, 2018","modified":"Updated on February 20, 2025"},"absolute_dates_time":{"created":"Posted on October 21, 2018 2:10 am","modified":"Updated on February 20, 2025 1:37 pm"},"featured_img_caption":"","series_order":"","_links":{"self":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/539","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/comments?post=539"}],"version-history":[{"count":23,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/539\/revisions"}],"predecessor-version":[{"id":1799,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/539\/revisions\/1799"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media\/557"}],"wp:attachment":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media?parent=539"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/categories?post=539"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/tags?post=539"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}