Data is a collection of facts, figures, observations, or descriptions of things, in an organized or unorganized form. Data can exist as images, words, numbers, characters, videos, audio, and so on.
What is data preprocessing?
To analyze our data and extract insights from it, we need to process the data before building our machine learning model, i.e. we need to convert the data into a form the model can understand, since machines cannot work directly with raw images, audio, and so on.
Data is processed into an efficient format that the algorithm can easily interpret, so that it produces the required output accurately.
Real-world data is rarely perfect: it is often incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing the raw data helps us organize, scale, clean (remove outliers), and standardize it, simplifying it so it can be fed to the machine learning algorithm.
The process of data preprocessing involves a few steps:
- Data cleaning: the data we use may have missing points (rows or columns that do not contain any values) or noisy data (irrelevant data that is difficult for the machine to interpret). To solve these problems, we can delete the empty rows and columns or fill them with other values, and we can use methods like regression and clustering to handle noisy data.
- Data transformation: this is the process of transforming the raw data into a format that is suitable for the model. It may include steps such as categorical encoding, scaling, normalization, and standardization.
- Data reduction: this helps to reduce the size of the data we are working on (for easy analysis) while maintaining the integrity of the original data.
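The three steps above can be chained end to end. Below is a minimal sketch: pandas handles the cleaning, StandardScaler the transformation, and PCA (one common reduction technique, chosen here for illustration) the reduction. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# A toy dataset (columns and values invented) with one missing entry.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [10.0, 20.0, 30.0, 40.0],
                   "c": [5.0, 6.0, 7.0, 8.0]})

# 1. Data cleaning: drop rows that contain missing values.
clean = df.dropna()

# 2. Data transformation: scale each feature to zero mean, unit variance.
scaled = StandardScaler().fit_transform(clean)

# 3. Data reduction: project the three features down to two components.
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)  # (3, 2)
```

Each step is covered in more detail in the sections that follow.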
Scikit-learn library for data preprocessing
Scikit-learn is a popular open-source machine learning library. It provides us with various essential tools, including algorithms for random forests, classification, and regression, and of course tools for data preprocessing as well. The library is built on top of NumPy and SciPy and is easy to learn and use.
We can use the following code to import the library in the workspace:
from sklearn import preprocessing
Standardization
Standardization is a technique used to scale the data such that the mean of each feature becomes zero and its standard deviation becomes one. The resulting values are not restricted to a particular range. We can use standardization when the features of the input data set have large differences between their ranges.
The formula for standardization is z = (x − μ) / σ, where μ is the mean of the feature values and σ is their standard deviation.
Let us consider the following example:
from sklearn import preprocessing
import numpy as np
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y_scaled = preprocessing.scale(x)
print(y_scaled)
Here we have an input array of dimension 3x3 with values ranging from one to nine. Using the scale function available in the preprocessing module, we can quickly scale our data.
There is another class available in this library, StandardScaler, which computes the mean and standard deviation on the training set and can then reapply the same transformation to the test set, implementing the Transformer API.
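As a small sketch of that fit-then-transform workflow (the arrays here are made up for illustration):

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1., 2., 3.],
                    [4., 5., 6.],
                    [7., 8., 9.]])

# fit() learns the per-column mean and standard deviation from the training set
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)            # per-column means: [4. 5. 6.]

# transform() reuses those training statistics on new data
X_new = np.array([[4., 5., 6.]])
result = scaler.transform(X_new)
print(result)                  # [[0. 0. 0.]] because X_new equals the training means
```

Because the statistics are learned once from the training data, the test set is scaled consistently with it.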
If we want to scale our features to a given range, we can use the MinMaxScaler (a related class, MaxAbsScaler, differs in that it scales the maximum absolute value of each feature to unit size).
from sklearn.preprocessing import MinMaxScaler
import numpy as np
x = MinMaxScaler(feature_range=(0, 8))
y = np.array([[1, 2, 3],
              [4, -5, -6],
              [7, 8, 9]])
scale = x.fit_transform(y)
scale
Here the values of a 3x3 array are scaled to the given range (0, 8), and we have used the .fit_transform() function, which will help us apply the same transformation to another dataset later.
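For instance, a minimal sketch of fitting the scaler once and reusing it on later data (the new sample is invented for illustration):

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler(feature_range=(0, 8))
y = np.array([[1., 2., 3.],
              [4., -5., -6.],
              [7., 8., 9.]])
scaler.fit(y)                     # learns the per-column min and max from y

# a hypothetical new sample, scaled with the statistics learned from y
new_data = np.array([[4., 2., 3.]])
result = scaler.transform(new_data)
print(result)
```

The new sample is mapped using the min and max learned from y, not its own values, so repeated calls to transform() stay consistent.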
|Scaled data in a specified range|
Normalization
Normalization is the process of scaling individual samples so that each has unit norm.
from sklearn import preprocessing
import numpy as np
X = [[1, 2, 3],
     [4, -5, -6],
     [7, 8, 9]]
y = preprocessing.normalize(X)
y
This can also be done through the Transformer API by using the Normalizer class, which implements the same operation.
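A small sketch of the Normalizer class (values invented for illustration):

```python
from sklearn.preprocessing import Normalizer
import numpy as np

X = np.array([[3., 4.],
              [1., 0.]])

# Normalizer rescales each row (sample) to unit L2 norm by default
normalizer = Normalizer(norm='l2')
result = normalizer.transform(X)
print(result)   # row [3, 4] has L2 norm 5, so it becomes [0.6, 0.8]
```

Note that, unlike scaling, normalization works row by row rather than column by column.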
Encoding categorical features
Often the data we use does not have its feature values in a continuous form, but rather as categories with text labels. To get this data processed by a machine learning model, it is necessary to convert these categorical features into a machine-understandable form.
There are two functions available in this module through which we can encode our categorical features:
- OrdinalEncoder: this converts categorical features to integer values, such that each categorical feature is mapped to one new feature of integers (0 to n_categories - 1).
from sklearn import preprocessing
import numpy as np
enc = preprocessing.OrdinalEncoder()
X = [['a', 'b', 'c', 'd'],
     ['e', 'f', 'g', 'h'],
     ['i', 'j', 'k', 'l']]
enc.fit(X)
enc.transform([['a', 'f', 'g', 'l']])
Each category is encoded with an integer value (0, 1, or 2), and the output for the above input is:
- OneHotEncoder: this encoder transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0. Check the following example for a better understanding.
from sklearn import preprocessing
import numpy as np
enc = preprocessing.OneHotEncoder()
X = [['a', 'b', 'c', 'd'],
     ['e', 'f', 'g', 'h'],
     ['i', 'j', 'k', 'l']]
enc.fit(X)
enc.transform([['a', 'f', 'g', 'l']]).toarray().reshape(4, 3)
Discretization
The process of discretization (also known as binning or quantization) helps us separate continuous features into discrete values. This is similar to creating a histogram from continuous data, where discretization focuses on assigning the feature values to the bins. Discretization can help us introduce non-linearity into linear models in some cases.
from sklearn import preprocessing
import numpy as np
X = np.array([[1, 2, 3],
              [-4, -5, 6],
              [7, 8, 9]])
dis = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal')
dis.fit_transform(X)
Using the KBinsDiscretizer() function, the features are discretized into k bins. By default, the output is one-hot encoded, which we can change with the encode parameter.
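To illustrate the default encoding, here is the same toy array discretized with encode='onehot', which returns one binary column per bin:

```python
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

X = np.array([[1., 2., 3.],
              [-4., -5., 6.],
              [7., 8., 9.]])

# encode='onehot' (the default) returns a sparse matrix with one
# binary column per bin: 3 + 2 + 2 = 7 columns for n_bins=[3, 2, 2]
dis = KBinsDiscretizer(n_bins=[3, 2, 2], encode='onehot')
onehot = dis.fit_transform(X)
print(onehot.shape)   # (3, 7)
```

With encode='ordinal', as in the example above, each value is replaced by its bin index instead, keeping the original number of columns.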
Imputation of missing values
from sklearn.impute import SimpleImputer
import numpy as np
impute = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[np.nan, 1, 2],
     [3, 4, np.nan],
     [5, np.nan, 6]]
impute.fit_transform(X)
We use the SimpleImputer() function for imputing missing values. The parameters used in this function are missing_values, to specify which values are to be treated as missing, and strategy, to specify how we want to impute them; in the example above we used mean, which means the missing values are replaced by the mean of the column values. Other options for strategy are median, most_frequent (based on the frequency of occurrence of a particular value in a column), and constant (a constant value).
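A short sketch of two of the other strategies on the same toy array:

```python
from sklearn.impute import SimpleImputer
import numpy as np

X = [[np.nan, 1., 2.],
     [3., 4., np.nan],
     [5., np.nan, 6.]]

# strategy='constant' fills every missing value with fill_value
const = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
filled_const = const.fit_transform(X)
print(filled_const)

# strategy='median' replaces missing values with the column median
med = SimpleImputer(missing_values=np.nan, strategy='median')
filled_med = med.fit_transform(X)
print(filled_med)   # column medians here are 4, 2.5, and 4
```

The choice of strategy usually depends on the feature: median is more robust to outliers than mean, while most_frequent and constant also work for non-numeric columns.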
|Imputing missing values|
Generating polynomial features
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
x = np.array([[1, 2],
              [3, 4]])
nonl = PolynomialFeatures(2)
nonl.fit_transform(x)
|Generating polynomial features|
Polynomial features are generated using the PolynomialFeatures() function. The feature values of the input array are transformed from (X1, X2) to (1, X1, X2, X1², X1*X2, X2²).
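One related option worth knowing is interaction_only=True, sketched here on the same toy array, which keeps only products of distinct features:

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])

# interaction_only=True keeps the bias, the original features, and the
# cross-product X1*X2, but drops the pure powers X1^2 and X2^2
inter = PolynomialFeatures(degree=2, interaction_only=True)
result = inter.fit_transform(x)
print(result)   # [[ 1.  1.  2.  2.], [ 1.  3.  4. 12.]]
```

This is useful when we want interaction terms without letting the feature count grow as fast as with the full polynomial expansion.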
Custom transformers
We can turn an existing Python function into a transformer by using FunctionTransformer() and passing the required function through it.
from sklearn import preprocessing
import numpy as np
transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
transformer.transform(X)
|Implementing custom transformers|
For a better understanding of these concepts, I recommend you try implementing them on your own. Keep exploring, and I am sure you will discover new features along the way.
If you have any questions or comments, please post them in the comment section.