{"id":334,"date":"2018-09-07T21:28:31","date_gmt":"2018-09-08T03:28:31","guid":{"rendered":"https:\/\/jacobsoft.com.mx\/?p=334"},"modified":"2021-08-22T21:36:14","modified_gmt":"2021-08-23T03:36:14","slug":"pre-procesamiento-de-datos-con-python","status":"publish","type":"post","link":"https:\/\/www.jacobsoft.com.mx\/en\/pre-procesamiento-de-datos-con-python\/","title":{"rendered":"Pre-processing data with Python"},"content":{"rendered":"<h1 class=\"wp-block-heading\">Pre-processing of data<\/h1>\n\n\n\n<p>Nowadays a large amount of data is generated by many different sources, and it becomes increasingly necessary to analyze it in order to automatically extract all the <strong><a href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000384&amp;type=3&amp;subid=0\" target=\"_blank\" rel=\"noopener noreferrer\">intelligence<\/a><\/strong> contained in it.<\/p>\n\n\n\n<p>The specialized techniques for <strong>data analysis<\/strong> comprise statistical methods as well as methods from artificial intelligence and <a href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=579862.373&amp;type=3&amp;subid=0&amp;LSNSUBSITE=LSNSUBSITE\"><strong>machine learning<\/strong><\/a>, among others.<\/p>\n\n\n\n<figure 
class=\"wp-block-pullquote\"><blockquote><p>Watch this article in video<\/p><\/blockquote><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Pre-procesamiento de datos con python\" width=\"780\" height=\"439\" src=\"https:\/\/www.youtube.com\/embed\/RHMHKA0pf9U?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<div style=\"height:27px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>In this sense, <strong>data mining<\/strong> itself is a set of techniques aimed at discovering the information contained in large data sets.<\/p>\n\n\n\n<p>The word <strong>discovery<\/strong> refers to the fact that much of the valuable information is previously unknown: the aim is to analyze <strong>behaviors<\/strong>, <strong>patterns<\/strong>, <strong>trends<\/strong>, <strong>associations<\/strong> and other characteristics of the knowledge embedded in the data.<\/p>\n\n\n\n<p>The process of data analysis, or data science, consists of several phases:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/fases-de-analisis.png\"><img loading=\"lazy\" decoding=\"async\" width=\"932\" height=\"540\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/fases-de-analisis.png\" alt=\"\" class=\"wp-image-338\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/fases-de-analisis.png 932w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/fases-de-analisis-300x174.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/fases-de-analisis-768x445.png 768w\" 
sizes=\"auto, (max-width: 932px) 100vw, 932px\" \/><\/a><\/figure>\n\n\n\n<p>Pre-processing is a standardization of the data prior to the <a rel=\"noreferrer noopener\" aria-label=\"El pre-procesamiento es una estandarizaci\u00f3n de los datos previo al modelo de an\u00e1lisis y \u00e9ste consta a su vez de varias fases como se observa en la figura anterior. (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=579862.462&amp;type=3&amp;subid=0&amp;LSNSUBSITE=LSNSUBSITE\" target=\"_blank\"><strong>analysis model<\/strong><\/a>, and it in turn consists of several phases, as seen in the previous figure.<\/p>\n\n\n\n<p>The Selection, Exploration, Cleaning and Transformation phases make up the <strong>pre-processing<\/strong> of the data that is often necessary before applying any data-analysis model, whether predictive or descriptive.<\/p>\n\n\n\n<p><strong>Sampling and Selection<\/strong><\/p>\n\n\n\n<p>In the selection phase, the relevant variables in the data are identified and selected: the variables that will provide the information for the problem we are working on, as well as the sources that may be useful. Once the variables have been selected, appropriate sampling techniques are applied in order to obtain a sample of the <a rel=\"noreferrer noopener\" aria-label=\"En la fase de selecci\u00f3n se identifican y seleccionan las variables relevantes en los datos, la variables que nos van a aportar la informaci\u00f3n para el tema en el que estamos trabajando, as\u00ed como las fuentes que pueden ser \u00fatiles. Una vez seleccionadas las variables se aplican t\u00e9cnicas de muestreo adecuadas con el fin de obtener una muestra de los datos que sea lo suficientemente representativa de la poblaci\u00f3n. 
(opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000671&amp;type=3&amp;subid=0\" target=\"_blank\"><strong>data <\/strong><\/a>that is sufficiently representative of the population.<\/p>\n\n\n\n<p>The sample allows inferring the properties or characteristics of the entire population with a <strong>measurable<\/strong> and quantifiable error.<\/p>\n\n\n\n<p>From the sample, the population characteristics (mean, total, proportion, etc.) are estimated with a quantifiable and controllable error.<\/p>\n\n\n\n<p>The errors are quantified by means of the variance, the standard deviation or the mean square error, which measure the accuracy of the estimates.<\/p>\n\n\n\n<p>It is important to take into account that, to determine the degree of representativeness of the sample, it is necessary to use probabilistic sampling.<\/p>\n\n\n\n<p><strong>Exploration<\/strong><\/p>\n\n\n\n<p>Since the data come from different <a rel=\"noreferrer noopener\" aria-label=\"Dado que los datos provienen de diferentes fuentes, es necesaria su exploraci\u00f3n mediante t\u00e9cnicas de an\u00e1lisis exploratorio para identificar valores inusuales, valores extremos, valores desaparecidos, discontinuidades u otras peculiaridades de los mismos. (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000671&amp;type=3&amp;subid=0\" target=\"_blank\"><strong>sources<\/strong><\/a>, it is necessary to explore them using <strong>exploratory analysis<\/strong> techniques to identify unusual values, extreme values, missing values, discontinuities or other peculiarities in them.<\/p>\n\n\n\n<p>With this, the exploration phase helps determine if the techniques of <strong><a rel=\"noreferrer noopener\" aria-label=\"Con ello la fase de exploraci\u00f3n ayuda a determinar si son adecuadas las t\u00e9cnicas de an\u00e1lisis de datos que se tienen en consideraci\u00f3n. 
Por ello es necesario realizar un an\u00e1lisis precio de la informaci\u00f3n de que se dispone antes del uso de cualquier t\u00e9cnica. (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000384&amp;type=3&amp;subid=0\" target=\"_blank\">data analysis<\/a><\/strong> under consideration are appropriate. Therefore, it is necessary to carry out a preliminary analysis of the available information before using any technique.<\/p>\n\n\n\n<p>We must examine the individual variables and the relationships between them, as well as evaluate and solve problems in the design of the research and the data collection.<\/p>\n\n\n\n<p>The exploration may also indicate the need to transform the data if the technique requires a normal distribution, or that nonparametric tests are needed.<\/p>\n\n\n\n<p>Exploratory analysis includes both formal techniques and graphic or visual techniques.<\/p>\n\n\n\n<p><strong>Cleaning<\/strong><\/p>\n\n\n\n<p>Since the data set may contain outliers, missing values and\/or erroneous values, cleaning is important to solve some of these problems. 
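As a quick illustration of these exploration and cleaning checks, a minimal sketch with pandas follows. The data frame and column names here are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging extreme values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing age and one extreme salary
df = pd.DataFrame({
    "Age":    [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 58000,
               52000, 79000, 83000, 67000, 500000],
})

# Count missing values per variable
missing = df.isna().sum()

# Flag extreme values with the 1.5 * IQR (interquartile range) rule
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)]
```

Here `missing` reports one absent age, and `outliers` contains only the 500000 salary, which falls above the upper fence.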
This phase is a consequence of the exploratory analysis.<\/p>\n\n\n\n<p><strong>Transformation<\/strong><\/p>\n\n\n\n<p>Transformation of the data is necessary when the variables are on different scales, or when there are too many or too few variables; in those cases a normalization or standardization of the data is carried out, together with techniques for reducing or increasing the dimension, as well as simple or multidimensional scaling.<\/p>\n\n\n\n<p>If the exploratory analysis indicates the need to transform some variables, some of these four transformations may be applied:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Logical transformations<\/li><li>Linear transformations<\/li><li>Algebraic transformations<\/li><li>Non-linear transformations<\/li><\/ul>\n\n\n\n<p>The phases described in the previous points constitute the pre-processing of the data. To apply these concepts, the following example is presented:<\/p>\n\n\n\n<p><strong>Pre-processing example<\/strong><\/p>\n\n\n\n<p>Suppose the following <strong><a href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000671&amp;type=3&amp;subid=0\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Supongamos la siguiente tabla de datos, que representa informaci\u00f3n de una tienda que relacion\u00f3 datos de clientes 
que compraron y clientes que no compraron: (opens in a new tab)\">data table<\/a><\/strong>, which represents information from a store that linked data from customers who bought and customers who did not buy:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><b>No.<\/b><\/td><td><b>Country<\/b><\/td><td><b>Age<\/b><\/td><td><b>Salary<\/b><\/td><td><b>Purchase<\/b><\/td><\/tr><tr><td>1<\/td><td>France<\/td><td>44<\/td><td>72000<\/td><td>No<\/td><\/tr><tr><td>2<\/td><td>Spain<\/td><td>27<\/td><td>48000<\/td><td>Yes<\/td><\/tr><tr><td>3<\/td><td>Germany<\/td><td>30<\/td><td>54000<\/td><td>No<\/td><\/tr><tr><td>4<\/td><td>Spain<\/td><td>38<\/td><td>61000<\/td><td>No<\/td><\/tr><tr><td>5<\/td><td>Germany<\/td><td>40<\/td><td>&nbsp;<\/td><td>Yes<\/td><\/tr><tr><td>6<\/td><td>France<\/td><td>35<\/td><td>58000<\/td><td>Yes<\/td><\/tr><tr><td>7<\/td><td>Spain<\/td><td>&nbsp;<\/td><td>52000<\/td><td>No<\/td><\/tr><tr><td>8<\/td><td>France<\/td><td>48<\/td><td>79000<\/td><td>Yes<\/td><\/tr><tr><td>9<\/td><td>Germany<\/td><td>50<\/td><td>83000<\/td><td>No<\/td><\/tr><tr><td>10<\/td><td>France<\/td><td>37<\/td><td>67000<\/td><td>Yes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>As we can see, there are records with missing data: the customer in record 5 has no salary, and the customer in record 7 has no age.<\/p>\n\n\n\n<p>When there are missing data, there are several strategies that can be followed, especially if the randomness of the missing data has been verified. 
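One such strategy, keeping only the fully observed rows, can be sketched with pandas. The frame below is a small hypothetical fragment modeled on the table above (this is an illustration of the idea, not the article's own script):

```python
import numpy as np
import pandas as pd

# Hypothetical fragment of the purchases table, including the two incomplete records
df = pd.DataFrame({
    "Country": ["Germany", "France", "Spain", "France"],
    "Age":     [40, 35, np.nan, 48],
    "Salary":  [np.nan, 58000, 52000, 79000],
})

# Complete-case approach: keep only rows where every variable has a valid value
complete = df.dropna()
```

Only two of the four rows survive, which shows how quickly this strategy can shrink the sample.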
One strategy is to use the method of <strong>approximation of complete cases<\/strong>, which consists of including in the analysis only the cases with complete data: only rows whose values for all the variables are valid.<\/p>\n\n\n\n<p>In this case we have to validate whether or not the sample is affected by eliminating the incomplete cases.<\/p>\n\n\n\n<p>The alternative to data-removal methods is the <strong>imputation of missing information<\/strong>, where the objective is to estimate the absent values based on valid values of other variables or cases.<\/p>\n\n\n\n<p>One method to carry out the imputation of the missing information is the <strong>case substitution method<\/strong>. It consists of replacing the missing data with data from observations outside the sample.<\/p>\n\n\n\n<p>The <strong>substitution method by the mean or median<\/strong> substitutes missing data with the mean or median of all valid values of the corresponding variable. When there are extreme or atypical values, the missing data are replaced by the median; otherwise the mean is used.<\/p>\n\n\n\n<p><strong>The imputation method by interpolation<\/strong> substitutes each missing value with the mean or median of a certain number of adjacent observations, especially when there is too much variability in the data. 
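The substitution by the mean or the median, as well as interpolation with adjacent values, can be sketched with pandas. The series below reuses the salary column of the example, with the value of record 5 missing:

```python
import numpy as np
import pandas as pd

# Salary column from the example; record 5 (index 4) is missing
salary = pd.Series([72000, 48000, 54000, 61000, np.nan,
                    58000, 52000, 79000, 83000, 67000], dtype=float)

# Substitution by the mean of the valid values
by_mean = salary.fillna(salary.mean())

# Substitution by the median (preferred when there are extreme or atypical values)
by_median = salary.fillna(salary.median())

# Linear interpolation with the adjacent values
by_interp = salary.interpolate()
```

The mean fills the gap with 63777.78 (the 63,777.77 used later in the article, up to rounding), the median with 61000, and the interpolation with 59500, halfway between the two neighboring salaries.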
If there is no such variability, each missing value is replaced by the result of an interpolation with the adjacent values.<\/p>\n\n\n\n<p>Additionally, there is the <strong>replacement method with a constant value<\/strong>: as its name indicates, the missing data are replaced by a constant value, valid for the variable in question, derived from external sources or from previous research.<\/p>\n\n\n\n<p>Finally, we can also use the <strong>imputation method by regression<\/strong>, which uses regression to estimate the absent values based on their relationship with other variables in the data set.<\/p>\n\n\n\n<p>In our example, we will use the mean to impute the missing data for both age and salary. The mean of the age values is 38.77, and the mean of the salary values present in the sample is 63,777.77, so the table now looks as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><b>No.<\/b><\/td><td><b>Country<\/b><\/td><td><b>Age<\/b><\/td><td><b>Salary<\/b><\/td><td><b>Purchase<\/b><\/td><\/tr><tr><td>1<\/td><td>France<\/td><td>44<\/td><td>72000<\/td><td>No<\/td><\/tr><tr><td>2<\/td><td>Spain<\/td><td>27<\/td><td>48000<\/td><td>Yes<\/td><\/tr><tr><td>3<\/td><td>Germany<\/td><td>30<\/td><td>54000<\/td><td>No<\/td><\/tr><tr><td>4<\/td><td>Spain<\/td><td>38<\/td><td>61000<\/td><td>No<\/td><\/tr><tr><td>5<\/td><td>Germany<\/td><td>40<\/td><td><strong>63777.77<\/strong><\/td><td>Yes<\/td><\/tr><tr><td>6<\/td><td>France<\/td><td>35<\/td><td>58000<\/td><td>Yes<\/td><\/tr><tr><td>7<\/td><td>Spain<\/td><td><strong>38.77<\/strong><\/td><td>52000<\/td><td>No<\/td><\/tr><tr><td>8<\/td><td>France<\/td><td>48<\/td><td>79000<\/td><td>Yes<\/td><\/tr><tr><td>9<\/td><td>Germany<\/td><td>50<\/td><td>83000<\/td><td>No<\/td><\/tr><tr><td>10<\/td><td>France<\/td><td>37<\/td><td>67000<\/td><td>Yes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>We now have the complete sample, so we continue with the <strong>categorical variables<\/strong> (Country and Purchase).<\/p>\n\n\n\n<p>For the Purchase variable, which has the categorical values Yes and No, we can encode it as 1 for Yes and 0 for No. The case of the Country variable is a little different: if we encoded the countries as 0, 1 and 2, the country with the value 2 would carry more weight than the country with the value 0, so the strategy for this type of variable is different.<\/p>\n\n\n\n<p>Within the logical transformations, interval variables can be converted into ordinal variables (such as Size) or nominal variables (such as Color), and dummy variables can be created.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><b>Country<\/b><\/td><td>&nbsp;<\/td><td><b>France<\/b><\/td><td><b>Spain<\/b><\/td><td><b>Germany<\/b><\/td><\/tr><tr><td>France<\/td><td>&nbsp;<\/td><td>1<\/td><td>0<\/td><td>0<\/td><\/tr><tr><td>Spain<\/td><td>&nbsp;<\/td><td>0<\/td><td>1<\/td><td>0<\/td><\/tr><tr><td>Germany<\/td><td>&nbsp;<\/td><td>0<\/td><td>0<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The Country variable is 
replaced by three dummy variables, one for each value of the Country variable. Instead of the Country variable we now use the 3 new variables, and the table looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><b>No.<\/b><\/td><td><b>France<\/b><\/td><td><b>Spain<\/b><\/td><td><b>Germany<\/b><\/td><td><b>Age<\/b><\/td><td><b>Salary<\/b><\/td><td><b>Purchase<\/b><\/td><\/tr><tr><td>1<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>44<\/td><td>72000<\/td><td>0<\/td><\/tr><tr><td>2<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>27<\/td><td>48000<\/td><td>1<\/td><\/tr><tr><td>3<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>30<\/td><td>54000<\/td><td>0<\/td><\/tr><tr><td>4<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>38<\/td><td>61000<\/td><td>0<\/td><\/tr><tr><td>5<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>40<\/td><td>63777.77<\/td><td>1<\/td><\/tr><tr><td>6<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>35<\/td><td>58000<\/td><td>1<\/td><\/tr><tr><td>7<\/td><td>0<\/td><td>1<\/td><td>0<\/td><td>38.77<\/td><td>52000<\/td><td>0<\/td><\/tr><tr><td>8<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>48<\/td><td>79000<\/td><td>1<\/td><\/tr><tr><td>9<\/td><td>0<\/td><td>0<\/td><td>1<\/td><td>50<\/td><td>83000<\/td><td>0<\/td><\/tr><tr><td>10<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td>37<\/td><td>67000<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Given that age and salary are on different scales, in the equations of regression or of some other classification and\/or prediction method based on the Euclidean distance between two points, the magnitude of the salary could cause the age to stop being representative or important for the analysis. The most convenient option is therefore a scale transformation, either a standardization or a normalization of the data.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Normalizacion.png\"><img loading=\"lazy\" 
decoding=\"async\" width=\"630\" height=\"157\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Normalizacion.png\" alt=\"\" class=\"wp-image-350\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Normalizacion.png 630w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Normalizacion-300x75.png 300w\" sizes=\"auto, (max-width: 630px) 100vw, 630px\" \/><\/a><\/figure>\n\n\n\n<p>Normalization adjusts the scale of a variable by taking each value of the sample, subtracting the minimum value of that variable over the entire data set, and dividing by the difference between the maximum and the minimum value. In this example, however, the table below shows the result of standardizing the variables (subtracting the mean of each variable and dividing by its standard deviation), which is also the transformation the Python code later applies.<\/p>\n\n\n\n<p>Once the scale transformation is applied, the resulting table is as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><b>France<\/b><\/td><td><b>Germany<\/b><\/td><td><b>Spain<\/b><\/td><td><b>Age<\/b><\/td><td><b>Salary<\/b><\/td><td><b>Purchase<\/b><\/td><\/tr><tr><td>1<\/td><td>0<\/td><td>0<\/td><td>0.72003861<\/td><td>0.71101283<\/td><td>0<\/td><\/tr><tr><td>0<\/td><td>0<\/td><td>1<\/td><td>-1.62356783<\/td><td>-1.36437577<\/td><td>1<\/td><\/tr><tr><td>0<\/td><td>1<\/td><td>0<\/td><td>-1.20999022<\/td><td>-0.84552862<\/td><td>0<\/td><\/tr><tr><td>0<\/td><td>0<\/td><td>1<\/td><td>-0.1071166<\/td><td>-0.24020695<\/td><td>0<\/td><\/tr><tr><td>0<\/td><td>1<\/td><td>0<\/td><td>0.1686018<\/td><td>-6.0532E-07<\/td><td>1<\/td><\/tr><tr><td>1<\/td><td>0<\/td><td>0<\/td><td>-0.52069421<\/td><td>-0.49963052<\/td><td>1<\/td><\/tr><tr><td>0<\/td><td>0<\/td><td>1<\/td><td>-0.00096501<\/td><td>-1.01847767<\/td><td>0<\/td><\/tr><tr><td>1<\/td><td>0<\/td><td>0<\/td><td>1.27147542<\/td><td>1.3163345<\/td><td>1<\/td><\/tr><tr><td>0<\/td><td>1<\/td><td>0<\/td><td>1.54719383<\/td><td>1.6622326<\/td><td>0<\/td><\/tr><tr><td>1<\/td><td>0<\/td><td>0<\/td><td>-0.2449758<\/td><td>0.2786402<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>When the 
categorical variables generate many dummy variables, we can use dimension-reduction techniques to make our data set more manageable.<\/p>\n\n\n\n<p><strong>Pre-processing with Python<\/strong><\/p>\n\n\n\n<p>Using <strong><a href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000502&amp;type=3&amp;subid=0\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Utilizando python para realizar el pre-procesamiento del ejemplo anterior se vuelve sencillo con las librer\u00edas de numpy y sklearn dado que el subpaquete preprocessing contiene las clases necesarias para realizar la imputaci\u00f3n de datos ausentes, la codificaci\u00f3n de las variables categ\u00f3ricas, la creaci\u00f3n de variables dummy y el ajuste de escalas. 
(opens in a new tab)\">Python<\/a><\/strong> makes the pre-processing of the previous example simple, thanks to the numpy and sklearn libraries: the preprocessing subpackage contains the classes needed to impute missing data, encode the categorical variables, create dummy variables and adjust scales.<\/p>\n\n\n\n<p>The first step is to import the libraries we are going to use. We then load the data set, store it in the dataset variable, and split the matrix into x (independent variables) and y (dependent variable).<\/p>\n\n\n<div id=\"code\">\n\n\n\n<pre class=\"wp-block-preformatted brush: cpp; gutter: true; first-line: 1\"># Pre-processing of data\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# We load the data set\ndataset = pd.read_csv('Preproc_Datos_Compras.csv')\n<\/pre>\n\n\n<\/div>\n\n\n\n<p>The loaded dataset is as follows and, as we can see, the missing data are shown as NaN (Not a Number):<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/dataset-preproc.png\"><img loading=\"lazy\" decoding=\"async\" width=\"508\" height=\"329\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/dataset-preproc.png\" alt=\"\" class=\"wp-image-354\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/dataset-preproc.png 508w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/dataset-preproc-300x194.png 300w\" sizes=\"auto, (max-width: 508px) 100vw, 508px\" 
\/><\/a><\/figure>\n\n\n\n<p>Now we divide the variables into dependent and independent.<\/p>\n\n\n<div id=\"code\">\n\n\n\n<pre class=\"wp-block-preformatted brush: cpp; gutter: true; first-line: 1\"># Pre-processing of data\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# We load the data set\ndataset = pd.read_csv('Preproc_Datos_Compras.csv')\n\n# We separate dependent and independent variables\nx = dataset.iloc[:, :-1]\ny = dataset.iloc[:, 3]<\/pre>\n\n\n<\/div>\n\n\n\n<p>When executing this last piece of code we obtain the variables x and y:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-x-y-y.png\"><img loading=\"lazy\" decoding=\"async\" width=\"875\" height=\"425\" src=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-x-y-y.png\" alt=\"\" class=\"wp-image-355\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-x-y-y.png 875w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-x-y-y-300x146.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-x-y-y-768x373.png 768w\" sizes=\"auto, (max-width: 875px) 100vw, 875px\" \/><\/a><\/figure>\n\n\n\n<p>We see that in the matrix x the missing data are shown as NaN (Not a Number). 
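A note on library versions before continuing: the Imputer class used in the code of this article belongs to older scikit-learn releases and was later removed (around version 0.22); in modern versions the equivalent mean imputation is done with sklearn.impute.SimpleImputer. A minimal sketch on a small hypothetical array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical subset of the Age and Salary columns, one missing value in each
data = np.array([
    [44.0,   72000.0],
    [np.nan, 52000.0],
    [40.0,   np.nan],
])

# Column-by-column mean imputation with the modern scikit-learn API
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(data)
```

The missing age becomes 42.0 (mean of 44 and 40) and the missing salary becomes 62000.0 (mean of 72000 and 52000), mirroring what the Imputer-based code below does on the full data set.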
The next step is the imputation of the missing data; for this, we can use the Imputer class of the preprocessing subpackage.<\/p>\n\n\n<div id=\"code\">\n\n\n\n<pre class=\"wp-block-preformatted brush: cpp; gutter: true; first-line: 1\"># Pre-processing of data\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# We load the data set\ndataset = pd.read_csv('Preproc_Datos_Compras.csv')\n\n# We separate dependent and independent variables\nx = dataset.iloc[:, :-1]\ny = dataset.iloc[:, 3]\n\n# Imputation of missing data\nfrom sklearn.preprocessing import Imputer\nimputer = Imputer(missing_values='NaN', strategy='mean', axis=0)\nimputer = imputer.fit(x.values[:, 1:3])\nx.iloc[:, 1:3] = imputer.transform(x.values[:, 1:3])\n<\/pre>\n\n\n<\/div>\n\n\n\n<p>Once this fragment of code was executed, we observe that the missing data were filled in with the mean, which is the strategy that was indicated in 
the arguments of the constructor of the Imputer class.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Datos-Imputados.png\"><img loading=\"lazy\" decoding=\"async\" width=\"448\" height=\"419\" src=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Datos-Imputados.png\" alt=\"\" class=\"wp-image-356\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Datos-Imputados.png 448w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Datos-Imputados-300x281.png 300w\" sizes=\"auto, (max-width: 448px) 100vw, 448px\" \/><\/a><\/figure><\/div>\n\n\n\n<p>The mean of the age was 38.77778 and the mean of the salary was 63,777.77778. The next step is the encoding of the categorical variables: first for y, with the values Yes and No, and then for x with the values of the countries.<\/p>\n\n\n<div id=\"code\">\n\n\n\n<pre class=\"wp-block-preformatted brush: cpp; gutter: true; first-line: 1\"># Pre-processing of data\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# We load the data set\ndataset = pd.read_csv('Preproc_Datos_Compras.csv')\n\n# We separate dependent and independent variables\nx = dataset.iloc[:, :-1]\ny = dataset.iloc[:, 3]\n\n# Imputation of missing data\nfrom sklearn.preprocessing import Imputer\nimputer = Imputer(missing_values='NaN', strategy='mean', axis=0)\nimputer = imputer.fit(x.values[:, 1:3])\nx.iloc[:, 1:3] = imputer.transform(x.values[:, 1:3])\n\n# Coding of categorical variables\nfrom sklearn.preprocessing import LabelEncoder\nlabel_encoder_y = LabelEncoder()\ny = label_encoder_y.fit_transform(y)\nlabel_encoder_x = LabelEncoder()\nx.iloc[:, 0] = label_encoder_x.fit_transform(x.values[:, 0])\n\nfrom sklearn.preprocessing import OneHotEncoder\nonehotencoder = OneHotEncoder(categorical_features=[0])\nx = onehotencoder.fit_transform(x).toarray()\n<\/pre>\n\n\n<\/div>\n\n\n\n<p>The LabelEncoder class encodes the values of y as 0 for No and 1 for Yes; the countries it encodes, in alphabetical order, as 0 for France, 1 for Germany and 2 for Spain. Afterwards, the OneHotEncoder class performs the transformation into dummy variables, leaving the data as follows:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-codificadas.png\"><img loading=\"lazy\" decoding=\"async\" width=\"997\" height=\"419\" src=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-codificadas.png\" alt=\"\" 
class=\"wp-image-357\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-codificadas.png 997w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-codificadas-300x126.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/Variables-codificadas-768x323.png 768w\" sizes=\"auto, (max-width: 997px) 100vw, 997px\" \/><\/a><\/figure>\n\n\n\n<p>Finally, we only need to transform the scales of the Age and Salary variables.<\/p>\n\n\n<div id=\"code\">\n\n\n\n<pre class=\"wp-block-preformatted brush: python; gutter: true; first-line: 1\"># Pre-processing of data\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# We load the data set\ndataset = pd.read_csv('Preproc_Datos_Compras.csv')\n\n# We separate dependent and independent variables\nx = dataset.iloc[:, :-1]\ny = dataset.iloc[:, 3]\n\n# Imputation of missing data\n# Note: in scikit-learn 0.22+ Imputer was replaced by SimpleImputer\nfrom sklearn.preprocessing import Imputer\nimputer = Imputer(missing_values='NaN', strategy='mean', axis=0)\nimputer = imputer.fit(x.values[:, 1:3])\nx.iloc[:, 1:3] = imputer.transform(x.values[:, 1:3])\n\n# Coding of categorical variables\nfrom sklearn.preprocessing import LabelEncoder\nlabel_encoder_y = LabelEncoder()\ny = label_encoder_y.fit_transform(y)\nlabel_encoder_x = LabelEncoder()\nx.iloc[:, 0] = label_encoder_x.fit_transform(x.values[:, 0])\n\n# Dummy variables for the country column\n# Note: categorical_features was removed in scikit-learn 0.24;\n# current versions combine OneHotEncoder with ColumnTransformer\nfrom sklearn.preprocessing import OneHotEncoder\nonehotencoder = OneHotEncoder(categorical_features=[0])\nx = onehotencoder.fit_transform(x).toarray()\n\n# Transforming scales\nfrom sklearn.preprocessing import StandardScaler\nsc_x = StandardScaler()\nsc_y = StandardScaler()\nx = sc_x.fit_transform(x)\ny = sc_y.fit_transform(y.reshape(-1, 1))<\/pre>\n\n\n<\/div>\n\n\n\n<p>We apply the scale transformation both to the variables of the set X and to the dependent variable Y, and we see that Yes is now 1 and No is -1, so that the values of all the variables are standardized.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/transformacion-de-escalas.png\"><img loading=\"lazy\" decoding=\"async\" width=\"996\" height=\"412\" src=\"http:\/\/jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/transformacion-de-escalas.png\" alt=\"\" class=\"wp-image-358\" srcset=\"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/transformacion-de-escalas.png 996w, 
https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/transformacion-de-escalas-300x124.png 300w, https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/transformacion-de-escalas-768x318.png 768w\" sizes=\"auto, (max-width: 996px) 100vw, 996px\" \/><\/a><\/figure>\n\n\n\n<p>Now, with the pre-processed data, we can apply a predictive method such as Logistic Regression to predict the purchases of a new client; that is, from their country, age and salary, estimate whether or not they will buy.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>Watch this article on video<\/p><\/blockquote><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Pre-procesamiento de datos con python\" width=\"780\" height=\"439\" src=\"https:\/\/www.youtube.com\/embed\/RHMHKA0pf9U?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><figcaption>Pre-processing of data with Python<\/figcaption><\/figure>\n\n\n\n<p>In the following articles we will present classification and prediction methods.<\/p>\n\n\n\n<p>If you are interested in going deeper into these topics, we have the course <a href=\"https:\/\/www.jacobsoft.com.mx\/en\/cursos\/#python_para_ciencia_de_datos\"><strong>Python for Data Science<\/strong>; check the details here<\/a>.<\/p>\n\n\n\n<p>Another very interesting course to get started with Python can be found <a rel=\"noreferrer noopener\" aria-label=\"Otro curso muy interesante para iniciar con python lo puedes ver aqu\u00ed (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000502&amp;type=3&amp;subid=0\" 
target=\"_blank\"><strong>here<\/strong><\/a><\/p>\n\n\n\n<p>For the basics of programming, we recommend the following course: <strong><a href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=347188.10000500&amp;type=3&amp;subid=0\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Para los fundamentos de programaci\u00f3n, recomendamos el siguiente curso: Introducci\u00f3n a la programaci\u00f3n (opens in a new tab)\">Introduction to programming<\/a><\/strong>.<\/p>\n\n\n\n<p>On the other hand, to learn the details of using cloud services with AWS, this course on the <strong><a rel=\"noreferrer noopener\" aria-label=\"Por otro lado, para conocer los detalles para utilizar servicios en la nube con AWS, este curso sobre la certificaci\u00f3n de Asociado es bastante recomendable (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=579862.373&amp;type=3&amp;subid=0&amp;LSNSUBSITE=LSNSUBSITE\" target=\"_blank\">Associate certification<\/a><\/strong> is highly recommended, as is the <a rel=\"noreferrer noopener\" aria-label=\"Por otro lado, para conocer los detalles para utilizar servicios en la nube con AWS, este curso sobre la certificaci\u00f3n de Asociado es bastante recomendable, as\u00ed como el curso de la certificaci\u00f3n profesional. (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=579862.372&amp;type=3&amp;subid=0&amp;LSNSUBSITE=LSNSUBSITE\" target=\"_blank\"><strong>professional certification<\/strong><\/a> course.<\/p>\n\n\n\n<p>Finally, in this webinar you will see the details of implementing machine learning techniques in <strong><a rel=\"noreferrer noopener\" aria-label=\"Finalmente en este webinar podr\u00e1s ver los detalles para implementar t\u00e9cnicas de machine learning en Azure (opens in a new tab)\" href=\"https:\/\/click.linksynergy.com\/fs-bin\/click?id=cTjR400Zjac&amp;offerid=579862.462&amp;type=3&amp;subid=0&amp;LSNSUBSITE=LSNSUBSITE\" target=\"_blank\">Azure<\/a><\/strong>.<\/p>\n\n\n\n<p>Subscribe to the blog to receive notifications when new articles are added.<\/p>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<script async src=\"https:\/\/pagead2.googlesyndication.com\/pagead\/js\/adsbygoogle.js?client=ca-pub-2380084220870127\"\n     crossorigin=\"anonymous\"><\/script>\n<ins class=\"adsbygoogle\"\n     style=\"display:block; text-align:center;\"\n     data-ad-layout=\"in-article\"\n     data-ad-format=\"fluid\"\n     data-ad-client=\"ca-pub-2380084220870127\"\n     data-ad-slot=\"2437322509\"><\/ins>\n<script>\n     (adsbygoogle = window.adsbygoogle || []).push({});\n<\/script>","protected":false},"excerpt":{"rendered":"<p>Pre-procesamiento de datos Hoy en d\u00eda disponemos de una gran cantidad de datos generados por &hellip; 
<\/p>","protected":false},"author":2,"featured_media":1373,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"advgb_blocks_editor_width":"","advgb_blocks_columns_visual_guide":"","_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[25,35,46],"tags":[65,66,57,58,56,63,50,59,60,61,62,64],"class_list":["post-334","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-algoritmos","category-inteligencia-artificial","category-machine-learning","tag-ajuste-de-escalas","tag-analisis-de-datos","tag-ciencia-de-datos","tag-data-mining","tag-data-science","tag-exploracion","tag-machine-learning","tag-mineria-de-datos","tag-pre-procesamiento","tag-python","tag-seleccion-y-muestreo","tag-transformacion-de-datos"],"aioseo_notices":[],"author_meta":{"display_name":"Jacob Avila Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"featured_img":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/preprocesamiento-300x205.png","featured_image_src":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/preprocesamiento.png","featured_image_src_square":"https:\/\/www.jacobsoft.com.mx\/wp-content\/uploads\/2018\/09\/preprocesamiento.png","author_info":{"display_name":"Jacob Avila Camacho","author_link":"https:\/\/www.jacobsoft.com.mx\/en\/author\/jacob-avila\/"},"coauthors":[],"tax_additional":{"categories":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/algoritmos\/\" class=\"advgb-post-tax-term\">Algoritmos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/\" class=\"advgb-post-tax-term\">Inteligencia Artificial<\/a>","<a 
href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Machine Learning<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">Algoritmos<\/span>","<span class=\"advgb-post-tax-term\">Inteligencia Artificial<\/span>","<span class=\"advgb-post-tax-term\">Machine Learning<\/span>"]},"tags":{"linked":["<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Ajuste de escalas<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Ciencia de Datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Mining<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Data Science<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Exploraci\u00f3n<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">machine learning<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Miner\u00eda de Datos<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Pre-procesamiento<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Python<\/a>","<a 
href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Selecci\u00f3n y Muestreo<\/a>","<a href=\"https:\/\/www.jacobsoft.com.mx\/en\/category\/inteligencia-artificial\/machine-learning\/\" class=\"advgb-post-tax-term\">Transformaci\u00f3n de datos<\/a>"],"unlinked":["<span class=\"advgb-post-tax-term\">Ajuste de escalas<\/span>","<span class=\"advgb-post-tax-term\">an\u00e1lisis de datos<\/span>","<span class=\"advgb-post-tax-term\">Ciencia de Datos<\/span>","<span class=\"advgb-post-tax-term\">Data Mining<\/span>","<span class=\"advgb-post-tax-term\">Data Science<\/span>","<span class=\"advgb-post-tax-term\">Exploraci\u00f3n<\/span>","<span class=\"advgb-post-tax-term\">machine learning<\/span>","<span class=\"advgb-post-tax-term\">Miner\u00eda de Datos<\/span>","<span class=\"advgb-post-tax-term\">Pre-procesamiento<\/span>","<span class=\"advgb-post-tax-term\">Python<\/span>","<span class=\"advgb-post-tax-term\">Selecci\u00f3n y Muestreo<\/span>","<span class=\"advgb-post-tax-term\">Transformaci\u00f3n de datos<\/span>"]}},"comment_count":"10","relative_dates":{"created":"Posted 8 years ago","modified":"Updated 5 years ago"},"absolute_dates":{"created":"Posted on September 7, 2018","modified":"Updated on August 22, 2021"},"absolute_dates_time":{"created":"Posted on September 7, 2018 9:28 pm","modified":"Updated on August 22, 2021 9:36 
pm"},"featured_img_caption":"","series_order":"","_links":{"self":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/334","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/comments?post=334"}],"version-history":[{"count":37,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/334\/revisions"}],"predecessor-version":[{"id":1806,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/posts\/334\/revisions\/1806"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media\/1373"}],"wp:attachment":[{"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/media?parent=334"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/categories?post=334"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jacobsoft.com.mx\/en\/wp-json\/wp\/v2\/tags?post=334"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}