scikit-learn-intro¶

Credits: Forked from PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas

Machine Learning Models Cheat Sheet
Estimators
Introduction: Iris Dataset
K-Nearest Neighbors Classifier

In [1]:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.linear_model import LinearRegression
from scipy import stats
import pylab as pl
seaborn.set()

Machine Learning Models Cheat Sheet¶

In [2]:

from IPython.display import Image
Image("http://scikit-learn.org/dev/_static/ml_map.png", width=800)

Out[2]:

[Image: scikit-learn machine learning map (ml_map.png)]

Estimators¶

Given a scikit-learn estimator object named model, the following methods are available:

Available in all Estimators
- model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
Available in supervised estimators
- model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
- model.predict_proba() : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
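- model.score() : for classification or regression problems, most estimators implement a score method. For classifiers the default score is the mean accuracy, which lies between 0 and 1. For regressors it is the coefficient of determination R^2, which is at most 1 and can be negative for a poor fit. In both cases a larger score indicates a better fit.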
Available in unsupervised estimators
- model.predict() : predict labels in clustering algorithms.
- model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
- model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
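
The methods listed above can be tried out on a small example. This is a minimal sketch, not part of the original tutorial: the choice of KNeighborsClassifier, LinearRegression, PCA and KMeans, and the idea of predicting petal width from the other three features, are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised classifier: fit(X, y), then predict / predict_proba / score
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
labels = clf.predict(X[:3])        # one predicted label per row
proba = clf.predict_proba(X[:3])   # one row of class probabilities per sample
accuracy = clf.score(X, y)         # mean accuracy on the given data

# Supervised regressor: score() returns R^2 rather than accuracy
# (illustrative task: predict petal width from the other three features)
reg = LinearRegression()
reg.fit(X[:, :3], X[:, 3])
r2 = reg.score(X[:, :3], X[:, 3])

# Unsupervised transformer: fit(X) takes no labels
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)        # fit and transform in one call
X_new_2d = pca.transform(X[:3])    # reuse the fitted basis on other data

# Unsupervised clustering: predict() returns a cluster index per row
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
clusters = km.predict(X[:3])

print(labels.shape, proba.shape, X_2d.shape, X_new_2d.shape, clusters.shape)
```

Note that the cluster indices returned by KMeans are arbitrary and need not match the numbering of the iris target labels.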

Introduction: Iris Dataset¶

In [3]:

from sklearn.datasets import load_iris
iris = load_iris()
n_samples, n_features = iris.data.shape
print(iris.keys())
print((n_samples, n_features))
print(iris.data.shape)
print(iris.target.shape)
print(iris.target_names)
print(iris.feature_names)

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
(150, 4)
(150, 4)
(150,)
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [4]:

import numpy as np
import matplotlib.pyplot as plt
# 'sepal width (cm)'
x_index = 1
# 'petal length (cm)'
y_index = 2
# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target, cmap=plt.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);

[Figure: scatter plot of sepal width (cm) against petal length (cm), colored by iris species]

K-Nearest Neighbors Classifier¶

The K-Nearest Neighbors (KNN) algorithm is a method used for classification or for regression. In both cases, the prediction for a new point is based on the k closest training examples in the feature space. For classification: given a new, unknown observation, look up which training points have the closest features and assign the predominant class among them.
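
To make the idea concrete, here is a minimal from-scratch sketch of KNN classification. It is not how scikit-learn implements it: the helper name knn_predict, the Euclidean distance, the toy data, and the tie-breaking rule (first label encountered wins) are all assumptions chosen for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups in 2D
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
         [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [0.3, 0.3], k=3))  # → 0
print(knn_predict(X_toy, y_toy, [4.8, 5.0], k=3))  # → 1
```

For regression, the same neighbor lookup would be followed by averaging the neighbors' target values instead of taking a vote.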

In [5]:

from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create the model
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
# fit the model
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
X_pred = [3, 5, 4, 2]
result = knn.predict([X_pred, ])
print(iris.target_names[result])
print(iris.target_names)
print(knn.predict_proba([X_pred, ]))
from fig_code import plot_iris_knn
plot_iris_knn()

['versicolor']
['setosa' 'versicolor' 'virginica']
[[0. 0.8 0.2]]

[Figure: output of plot_iris_knn()]

Note that the K-Nearest Neighbors model above shows overfitting. Overfitting and model validation are addressed in a later notebook.
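
As a rough preview, one common way to look for overfitting is to compare accuracy on the training data with accuracy on data held out from training. The sketch below is an assumption-laden illustration, not the later notebook's method: the 70/30 split, the random_state, and the values of n_neighbors tried are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples; the model never sees them during fit()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # (accuracy on training data, accuracy on held-out data)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(k, scores[k])
```

A training score that is clearly higher than the held-out score suggests the model is fitting noise in the training set. With n_neighbors=1 each training point is typically its own nearest neighbor, so the training score alone says little about how well the model generalizes.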