Python and Machine Learning

Machine Learning

Machine Learning is a branch of computer science and a subfield of Artificial Intelligence. It studies algorithms that improve their performance by learning from data sets repeatedly rather than by following explicitly programmed rules. Machine Learning algorithms are used for tasks such as detecting trends, answering business questions, acquiring and handling data efficiently, detecting unusual transactions, and powering search engines and online shopping.

Python for Machine Learning

Python is one of the most widely used languages for Machine Learning and Data Science. The following features make it well suited to data science:

Packages: Python has extensive packages covering various domains. Packages such as scipy, pandas, numpy, and scikit-learn are particularly helpful for machine learning.
Prototyping: Python makes prototyping quick and easy, which helps in developing new and customized algorithms for complex problems.
Collaboration: Python offers numerous tools that make it easier to collaborate on data science projects.
Multi-purpose Language: Data science projects span many stages, such as data extraction, data manipulation, data analysis, feature extraction, modeling, evaluation, deployment, and updating. Python is a multipurpose language that can address all of these stages.

Python Libraries for Machine Learning

Python provides a wide range of libraries for machine learning. A library is a collection of reusable functions and routines. Machine learning relies heavily on mathematical optimization, probability, and statistics, and these libraries make such complex tasks easier to perform efficiently. Following are some of the Python libraries helpful for machine learning:

Pandas: It is a fast, flexible, and powerful open-source data analysis and manipulation tool. It is built on the NumPy package, which supplies support for multidimensional arrays.
Keras: It is a high-level deep learning API for implementing neural networks quickly. It can run on top of multiple backends that carry out the underlying neural network computations.
Matplotlib: It is an extensive library for creating static, interactive, and animated visualizations in Python. It is a cross-platform graphical plotting library that uses NumPy.
StatsModels: It is a Python library built on NumPy, SciPy, and matplotlib that provides statistical algorithms and data exploration tools.
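
As a rough illustration of how these libraries fit together, the sketch below builds a small pandas DataFrame and exposes its contents as a NumPy array. The column names and values are made up for the example and are not taken from any real data set:

```python
import numpy as np
import pandas as pd

# Illustrative values only (hypothetical columns "x" and "y").
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

# to_numpy() returns the DataFrame contents as a NumPy ndarray.
values = frame.to_numpy()
print(type(values).__name__, values.shape)

# NumPy functions can be applied to a pandas column directly.
print(np.mean(frame["y"]))
```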

Example

Python can implement numerous supervised and unsupervised learning algorithms, including linear regression, logistic regression, k-nearest neighbors, decision trees, random forests, support vector machines, dimension reduction, density estimation, market basket analysis, generative adversarial networks, clustering, and many others. Following is an example of simple linear regression using Python:

Linear Regression

Simple linear regression predicts the output (dependent) variable from a single input feature by fitting a straight line to the data. Following are the steps to perform simple linear regression using sklearn in Python:

Import Libraries

The following code imports the required libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

Read File

The next step is to read the data set from a file and check its first five rows. This example uses a data set of vehicle models:

data = pd.read_csv("Fuel.csv")
data.head()

Feature Selection

In this example, the goal is to predict CO2 emissions from engine size, so only those two columns of the data set are kept.

data = data[["ENGINESIZE", "CO2EMISSIONS"]]

Plotting Data

The following code visualizes the data on a scatter plot:

plt.scatter(data["ENGINESIZE"], data["CO2EMISSIONS"], color="blue")
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

Data Division

The next step is to divide the data into training and testing sets. The model is fitted on the training data, and the testing data, which the model has not seen, is used to check its accuracy. Here the first 80% of the rows are used for training and the remaining 20% for testing.

split_index = int(len(data) * 0.8)  # row position where the first 80% ends
train = data[:split_index]
test = data[split_index:]
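
The slices above split the rows by position, so the test set is simply the last 20% of the file. If the rows happen to be ordered in some way, a shuffled split may be more representative. One possible alternative is sketched here with scikit-learn's `train_test_split` on a small synthetic frame; the values and `random_state=42` are arbitrary assumptions, not part of the original example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data set; values are illustrative only.
data = pd.DataFrame({
    "ENGINESIZE": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5],
    "CO2EMISSIONS": [150, 170, 195, 215, 240, 260, 285, 305, 330, 350],
})

# Shuffle the rows, then hold out 20% of them for testing.
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing `random_state` keeps the split reproducible between runs.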

Model Training

The following lines of code train the model and find the coefficients (slope and intercept) of the best-fit regression line:

regr = linear_model.LinearRegression()
train_x = np.array(train[["ENGINESIZE"]])
train_y = np.array(train[["CO2EMISSIONS"]])
regr.fit(train_x, train_y)
print("Coefficients: ", regr.coef_)  # Slope
print("Intercept: ", regr.intercept_)  # Intercept
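
For a single feature, the least-squares slope and intercept can also be computed directly: the slope is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))^2, and the intercept is mean(y) - slope * mean(x). A minimal NumPy sketch of that calculation follows; it uses made-up numbers rather than the data set above, and the function name `fit_simple_line` is hypothetical:

```python
import numpy as np


def fit_simple_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


# Illustrative values only: these points lie exactly on y = 2x + 1.
slope, intercept = fit_simple_line([1, 2, 3, 4], [3, 5, 7, 9])
print("slope:", slope, "intercept:", intercept)
```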

Plot Best Fit Line

The next step is to plot the best-fit line over the training data:

plt.scatter(train["ENGINESIZE"], train["CO2EMISSIONS"], color="blue")
plt.plot(train_x, regr.coef_ * train_x + regr.intercept_, "-r")
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Prediction Function

The next step is to define a prediction function that applies the fitted line to new input values:

def get_regression_predictions(input_features, intercept, slope):
    # Apply the fitted line: y = slope * x + intercept
    predicted_values = input_features * slope + intercept
    return predicted_values
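
The helper above reproduces what a fitted scikit-learn model already does in its `predict` method. A small self-contained check on synthetic data (the numbers are made up for illustration) suggests the two should agree:

```python
import numpy as np
from sklearn import linear_model


def get_regression_predictions(input_features, intercept, slope):
    # Apply the fitted line: y = slope * x + intercept
    return input_features * slope + intercept


# Synthetic data lying exactly on y = 40x + 100 (illustrative values only).
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 40.0 * x + 100.0

regr = linear_model.LinearRegression()
regr.fit(x, y)

# Compare the manual formula with the model's own predict() method.
manual = get_regression_predictions(3.5, regr.intercept_[0], regr.coef_[0][0])
built_in = regr.predict(np.array([[3.5]]))[0][0]
print("manual:", manual, "predict():", built_in)
```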

Predicting CO2 Emissions

The following code predicts CO2 emissions for a given engine size based on the regression line:

my_engine_size = 3.5
estimated_emission = get_regression_predictions(my_engine_size, regr.intercept_[0], regr.coef_[0][0])
print("Estimated Emission:", estimated_emission)

Checking Test Data Accuracy

Users can compare the actual values with the predicted values on the test data to check the accuracy of the model, using the mean absolute error, the mean squared error, and the R2 score.

from sklearn.metrics import r2_score
test_x = np.array(test[["ENGINESIZE"]])
test_y = np.array(test[["CO2EMISSIONS"]])
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))  # actual values first, then predictions
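
The same figures can also be obtained from helpers in `sklearn.metrics`. The sketch below uses small made-up arrays instead of the test set above and also computes R2 from its definition, 1 - SS_res / SS_tot. Note that `r2_score` expects the actual values first and the predictions second:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
actual = np.array([200.0, 220.0, 250.0, 300.0])
predicted = np.array([210.0, 215.0, 245.0, 290.0])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)

# R2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE: %.2f" % mae)
print("MSE: %.2f" % mse)
print("R2-score: %.2f" % r2)
```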
