SVM

Model: SVM Implementation

This notebook applies the support vector machine (SVM) algorithm to the stellar-classification dataset to predict the target variable, class.

Preprocess the Data

Code

# import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

import tensorflow as tf

Code

# load in the dataset
# note: uncomment the following lines to download the dataset
# !kaggle datasets download -d fedesoriano/stellar-classification-dataset-sdss17
# !unzip stellar-classification-dataset-sdss17.zip
# !rm stellar-classification-dataset-sdss17.zip
# !echo 'Dataset Downloaded Successfully'

Code

# read in the dataset
stellar = pd.read_csv("star_classification.csv")
stellar.head()

	obj_ID	alpha	delta	u	g	r	i	z	run_ID	rerun_ID	cam_col	field_ID	spec_obj_ID	class	redshift	plate	MJD	fiber_ID
0	1.237661e+18	135.689107	32.494632	23.87882	22.27530	20.39501	19.16573	18.79371	3606	301	2	79	6.543777e+18	GALAXY	0.634794	5812	56354	171
1	1.237665e+18	144.826101	31.274185	24.77759	22.83188	22.58444	21.16812	21.61427	4518	301	5	119	1.176014e+19	GALAXY	0.779136	10445	58158	427
2	1.237661e+18	142.188790	35.582444	25.26307	22.66389	20.60976	19.34857	18.94827	3606	301	2	120	5.152200e+18	GALAXY	0.644195	4576	55592	299
3	1.237663e+18	338.741038	-0.402828	22.13682	23.77656	21.61162	20.50454	19.25010	4192	301	3	214	1.030107e+19	GALAXY	0.932346	9149	58039	775
4	1.237680e+18	345.282593	21.183866	19.43718	17.58028	16.49747	15.97711	15.54461	8102	301	3	137	6.891865e+18	GALAXY	0.116123	6121	56187	842

Code

# structural overview of the dataset (column dtypes and non-null counts)
stellar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 18 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   obj_ID       100000 non-null  float64
 1   alpha        100000 non-null  float64
 2   delta        100000 non-null  float64
 3   u            100000 non-null  float64
 4   g            100000 non-null  float64
 5   r            100000 non-null  float64
 6   i            100000 non-null  float64
 7   z            100000 non-null  float64
 8   run_ID       100000 non-null  int64  
 9   rerun_ID     100000 non-null  int64  
 10  cam_col      100000 non-null  int64  
 11  field_ID     100000 non-null  int64  
 12  spec_obj_ID  100000 non-null  float64
 13  class        100000 non-null  object 
 14  redshift     100000 non-null  float64
 15  plate        100000 non-null  int64  
 16  MJD          100000 non-null  int64  
 17  fiber_ID     100000 non-null  int64  
dtypes: float64(10), int64(7), object(1)
memory usage: 13.7+ MB

Great! The dataset has no missing values.

Code

# closer look at the features

result = {}
features = stellar.columns

for feature in features:
    if stellar[feature].nunique() > 10:
        result[feature] = stellar[feature].nunique()
    else:
        result[feature] = stellar[feature].unique()
result

{'obj_ID': 78053,
 'alpha': 99999,
 'delta': 99999,
 'u': 93748,
 'g': 92651,
 'r': 91901,
 'i': 92019,
 'z': 92007,
 'run_ID': 430,
 'rerun_ID': array([301]),
 'cam_col': array([2, 5, 3, 4, 6, 1]),
 'field_ID': 856,
 'spec_obj_ID': 100000,
 'class': array(['GALAXY', 'QSO', 'STAR'], dtype=object),
 'redshift': 99295,
 'plate': 6284,
 'MJD': 2180,
 'fiber_ID': 1000}

We see that the target variable class takes 3 values, namely:

GALAXY
QSO
STAR
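The per-column summary above can also be produced in one call with pandas' DataFrame.nunique. The sketch below runs on a small made-up frame (the `toy` data is an assumption, not the real dataset) and also flags constant columns such as rerun_ID, which has the single value 301 in the output above.

```python
import pandas as pd

# Toy stand-in for the `stellar` frame (values are made up for illustration).
toy = pd.DataFrame({
    "rerun_ID": [301, 301, 301, 301],
    "cam_col": [2, 5, 3, 2],
    "class": ["GALAXY", "QSO", "STAR", "GALAXY"],
})

# Number of distinct values per column, computed in one call.
distinct_counts = toy.nunique()

# Columns with a single distinct value carry no information for a classifier.
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()

print(distinct_counts.to_dict())
print(constant_columns)
```

On the real frame, the same two lines would be applied to `stellar` instead of `toy`.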

Code

# number of observations
len(stellar)

Code

# reorder the observations based on the class variable
stellar.sort_values('class', axis=0, ascending=True, inplace=True)

Code

# distribution of the classes' observations
# find the first row position of each class in the sorted dataset

result = {}
target_labels = stellar['class'].tolist()

for target in target_labels:
    result[target] = -1

for index, obs in enumerate(stellar.values):
    if result[obs[13]] == -1:
        result[obs[13]] = index
result

{'GALAXY': 0, 'QSO': 59445, 'STAR': 78406}
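Because the rows are sorted by class, the first position of each class is just the running total of the class counts before it. A minimal sketch of that idea on a made-up label column (the `labels` values are an assumption for illustration):

```python
import pandas as pd

# Made-up labels standing in for stellar['class'].
labels = pd.Series(["STAR", "GALAXY", "QSO", "GALAXY", "STAR", "GALAXY"], name="class")
sorted_labels = labels.sort_values()

# Count each class, ordered the same way the sort orders them (alphabetically).
counts = sorted_labels.value_counts().sort_index()

# Start position of a class = number of rows belonging to earlier classes.
starts = counts.cumsum() - counts
first_position = {name: int(pos) for name, pos in starts.items()}

print(first_position)
```

This avoids looping over every row, which matters little here but scales better than iterating over stellar.values.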

We can now plot the distribution of each class.

Code

# standardize the features before visualizing `i` against `redshift`
class_labels = stellar['class']
scaled_features = StandardScaler().fit_transform(stellar.drop('class', axis=1))
scaled_stellar = pd.DataFrame(scaled_features, index=stellar.index, columns=(stellar.drop('class', axis=1)).columns)

scaled_stellar['class'] = class_labels
scaled_stellar

	obj_ID	alpha	delta	u	g	r	i	z	run_ID	rerun_ID	cam_col	field_ID	spec_obj_ID	redshift	plate	MJD	fiber_ID	class
0	-0.445634	-0.434604	0.425529	0.059755	0.054926	0.403962	0.046007	0.003937	-0.445535	0.0	-0.952553	-0.718947	0.228609	0.079557	0.228633	0.423203	-1.021342	GALAXY
58230	-1.139298	-0.368710	1.379317	-0.000092	0.031505	0.903290	1.010382	0.053337	-1.139260	0.0	-0.322395	-0.121673	0.731651	0.485616	0.731633	0.802528	0.986019	GALAXY
58228	-1.139298	-0.381213	1.351881	0.045294	0.047447	0.809590	0.974606	0.053707	-1.139260	0.0	-0.322395	-0.161939	0.731256	-0.439343	0.731294	0.805846	-1.516760	GALAXY
58227	-1.070015	0.480361	-1.322346	0.006342	0.031496	0.476575	0.233830	0.007300	-1.070040	0.0	0.307763	-0.584728	-0.378349	-0.092486	-0.378354	-0.141914	0.108945	GALAXY
58226	-1.070015	0.405437	-1.327688	0.112584	0.011681	-0.286498	-0.398618	-0.022590	-1.070040	0.0	0.307763	-0.906853	-1.429054	-0.218501	-1.429064	-1.760957	0.090596	GALAXY
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
46329	-1.092601	0.766323	0.734251	0.045131	0.016127	0.035459	0.100766	0.011344	-1.092435	0.0	-1.582710	-0.953830	1.893737	-0.789715	1.893782	1.470494	-1.509421	STAR
46330	-0.296055	0.617464	0.094237	-0.139661	-0.129397	-2.011636	-1.917283	-0.096359	-0.295898	0.0	-1.582710	3.294198	-0.907107	-0.789845	-0.907096	-0.577640	-0.691063	STAR
46331	-2.124159	-0.701998	-1.226114	0.014384	-0.030837	-0.636966	-0.606623	-0.028993	-2.124116	0.0	-0.322395	-1.088049	0.783445	-0.788944	0.783457	0.626690	-0.335096	STAR
75082	-1.700567	-0.203089	-1.300428	-0.000616	0.010098	0.515621	0.775514	0.049766	-1.700653	0.0	0.937920	-1.020939	-0.457924	-0.787913	-0.457953	-0.174538	1.173177	STAR
68798	1.760822	-1.593640	0.337726	0.050271	0.103026	1.872407	1.077138	0.040500	1.760848	0.0	-0.322395	-0.960541	0.878618	-0.789890	0.878637	1.405246	-0.654366	STAR

100000 rows × 18 columns

Code


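# rows are sorted by class, so each class is a contiguous slice
# starting at the positions found above
x_data = scaled_stellar['i']
y_data = scaled_stellar['redshift']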

galaxy_x = x_data[:59445]
galaxy_y = y_data[:59445]

qso_x = x_data[59445:78406]
qso_y = y_data[59445:78406]

star_x = x_data[78406:]
star_y = y_data[78406:]

# create the figure
plt.figure(figsize=(10,7))
plt.scatter(galaxy_x,galaxy_y,marker='+',color='green')
plt.scatter(qso_x,qso_y,marker='_',color='red')
plt.scatter(star_x,star_y,marker='*',color='blue')

plt.title("Target Distribution")
plt.xlabel("Near-infrared filter (i), standardized")
plt.ylabel("Redshift (standardized)")
plt.show()

Feature Engineering Phase

Code

# encode values for class column
stellar.replace({'class': {'GALAXY': 0, 'STAR': 1, 'QSO':2}}, inplace=True)

# remove all columns containing ID at the end
cleaned = stellar.drop(stellar.filter(regex='ID$').columns, axis=1)

# drop the date column
cleaned = stellar.drop('MJD', axis=1)
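The column-dropping step can be checked on a toy frame. This is a sketch under the assumption that the frame has the same kind of column names as the dataset (the `toy` values are made up); it shows that the regex 'ID$' matches only names ending in ID, so MJD has to be dropped separately, and that the second drop must start from the already-cleaned frame.

```python
import pandas as pd

# Made-up rows with a subset of the dataset's column names.
toy = pd.DataFrame({
    "obj_ID": [1.0, 2.0],
    "alpha": [135.7, 144.8],
    "redshift": [0.63, 0.78],
    "MJD": [56354, 58158],
    "fiber_ID": [171, 427],
    "class": [0, 0],
})

# Column names ending in "ID" ("MJD" ends in "JD", so it is not matched).
id_columns = toy.filter(regex="ID$").columns.tolist()

# Chain the drops so both removals apply to the same frame.
cleaned_toy = toy.drop(columns=id_columns).drop(columns="MJD")

print(id_columns)
print(cleaned_toy.columns.tolist())
```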

Code

# make the X and y variables (all features)
X_all = cleaned.drop('class', axis=1)
y = cleaned['class']

Code

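# keep only 2 features: redshift and g
# (build the list of columns to drop: everything else, including class)
features = cleaned.columns.tolist()
features.remove('redshift')
features.remove('g')

# make the X variable for the 2-feature model (y is unchanged)
X_specifc = cleaned.drop(features, axis=1)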

Code

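sc = StandardScaler()
yj = PowerTransformer(method="yeo-johnson")

# note: ColumnTransformer applies its transformers side by side and concatenates
# the outputs, so every numeric column appears twice in the result
# (once power-transformed, once standardized)
preprocessor = ColumnTransformer([("norm", yj, selector(dtype_include="number")),
                ("std_encode", sc, selector(dtype_include="number"))
                ])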
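A ColumnTransformer runs its transformers in parallel on the selected columns and stacks the results, whereas a Pipeline runs them one after another. The sketch below contrasts the two on made-up data (the `toy` frame and the `sequential` pipeline are assumptions for illustration, not part of the notebook's model); it is one possible way to get "normalize, then standardize" if that is the intent.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Made-up numeric data with two columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "g": rng.normal(20.0, 2.0, size=200),
    "redshift": rng.exponential(0.5, size=200),
})

# Parallel: both transformers see the raw columns; outputs are concatenated.
parallel = ColumnTransformer([
    ("norm", PowerTransformer(method="yeo-johnson"), selector(dtype_include="number")),
    ("std_encode", StandardScaler(), selector(dtype_include="number")),
])

# Sequential: the scaler receives the power-transformed columns.
sequential = Pipeline(steps=[
    ("norm", PowerTransformer(method="yeo-johnson", standardize=False)),
    ("std", StandardScaler()),
])

parallel_out = parallel.fit_transform(toy)
sequential_out = sequential.fit_transform(toy)

# The parallel version has twice as many output columns as the input.
print(parallel_out.shape, sequential_out.shape)
```

Note that PowerTransformer already standardizes its output by default (standardize=True), so the separate scaler is only needed when that option is switched off, as in this sketch.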

Build Phase

Using all features

Code

# initialize an SVC model (all features)
X_train, X_test, y_train, y_test = train_test_split(X_all, y, train_size=0.7, random_state=123)

svc_clf = SVC(kernel='rbf')
svc_model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("model", svc_clf),
])
# train the model on our dataset
svc_model_pipeline.fit(X_train,y_train)
# store the predictions of the model
y_pred = svc_model_pipeline.predict(X_test)
# evaluate the model on accuracy
print(accuracy_score(y_test,y_pred))
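The classes are not balanced: from the start positions found earlier, there are 59,445 GALAXY, 18,961 QSO, and 21,594 STAR rows. One option, sketched here on made-up labels with a similar imbalance, is to pass stratify to train_test_split so both splits keep the same class shares. The arrays `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60% class 0, 20% class 1, 20% class 2.
y_toy = np.array([0] * 600 + [1] * 200 + [2] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=123, stratify=y_toy
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print(train_share, test_share)
```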

Code

# initialize an SVC model (2 features)
X_train, X_test, y_train, y_test = train_test_split(X_specifc, y, train_size=0.7, random_state=123)

svc_clf = SVC(kernel='rbf')
# wrap the preprocessing and the model in a pipeline
svc_model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("model", svc_clf),
])
# train the model on our dataset
svc_model_pipeline.fit(X_train,y_train)
# store the predictions of the model
y_pred = svc_model_pipeline.predict(X_test)
# evaluate the model on accuracy
print(accuracy_score(y_test,y_pred))

0.9644666666666667

Code

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x14560f730>

Code

# {'class': {'GALAXY': 0, 'STAR': 1, 'QSO':2}}

array([0, 2, 1])
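The confusion matrix uses the integer encoding from earlier (GALAXY 0, STAR 1, QSO 2), which is easy to misread. A small sketch of attaching the class names and computing per-class recall, using made-up predictions (`y_true_toy` and `y_pred_toy` are assumptions, not the model's output):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Encoding used in the notebook.
label_names = {0: "GALAXY", 1: "STAR", 2: "QSO"}

# Made-up true and predicted labels for illustration.
y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 2, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true_toy, y_pred_toy, labels=list(label_names))

table = pd.DataFrame(
    matrix,
    index=[f"true {name}" for name in label_names.values()],
    columns=[f"pred {name}" for name in label_names.values()],
)

# Recall per class: correct predictions divided by the row total.
per_class_recall = matrix.diagonal() / matrix.sum(axis=1)

print(table)
print(per_class_recall)
```

Per-class recall is worth checking alongside accuracy, since a high overall accuracy can hide a weak minority class.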

Using Deep Learning

Code

# re-read in the dataset
stellar = pd.read_csv("star_classification.csv")
stellar.head()

	obj_ID	alpha	delta	u	g	r	i	z	run_ID	rerun_ID	cam_col	field_ID	spec_obj_ID	class	redshift	plate	MJD	fiber_ID
0	1.237661e+18	135.689107	32.494632	23.87882	22.27530	20.39501	19.16573	18.79371	3606	301	2	79	6.543777e+18	GALAXY	0.634794	5812	56354	171
1	1.237665e+18	144.826101	31.274185	24.77759	22.83188	22.58444	21.16812	21.61427	4518	301	5	119	1.176014e+19	GALAXY	0.779136	10445	58158	427
2	1.237661e+18	142.188790	35.582444	25.26307	22.66389	20.60976	19.34857	18.94827	3606	301	2	120	5.152200e+18	GALAXY	0.644195	4576	55592	299
3	1.237663e+18	338.741038	-0.402828	22.13682	23.77656	21.61162	20.50454	19.25010	4192	301	3	214	1.030107e+19	GALAXY	0.932346	9149	58039	775
4	1.237680e+18	345.282593	21.183866	19.43718	17.58028	16.49747	15.97711	15.54461	8102	301	3	137	6.891865e+18	GALAXY	0.116123	6121	56187	842

Feature Engineering

Code

# encode values for class column
stellar.replace({'class': {'GALAXY': 0, 'STAR': 1, 'QSO':2}}, inplace=True)

# remove all columns containing ID at the end
cleaned = stellar.drop(stellar.filter(regex='ID$').columns, axis=1)

# drop the date column
cleaned = cleaned.drop('MJD', axis=1)

# make the X and y variables
X = cleaned.drop(['class'], axis=1)
y = cleaned['class']

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

Code

X_train

	alpha	delta	u	g	r	i	z	cam_col	redshift	plate
52308	48.573323	-0.609684	18.80211	16.75706	15.66640	15.20351	14.80488	2	0.115262	412
27380	260.094957	27.014951	20.26815	18.68293	18.09193	17.86272	17.77524	1	-0.000209	2193
94588	135.477289	37.187920	24.05482	21.67747	20.38921	19.56683	19.27142	2	0.538213	4608
7361	118.180524	9.267706	20.56594	19.53276	19.22361	18.98544	18.89350	4	0.000171	2945
52298	15.024146	4.027261	24.63314	20.48507	18.76055	18.10046	18.07149	1	0.322435	4309
...	...	...	...	...	...	...	...	...	...	...
63206	115.884742	20.139136	18.93230	17.66927	17.21911	16.99153	16.88261	4	0.000022	1263
61404	133.430333	-0.719735	19.83167	19.90659	19.58608	19.57696	19.64975	2	1.309977	12533
17730	255.223294	23.638965	18.99455	18.04598	17.63787	17.44783	17.38555	3	-0.000624	3290
28030	232.458207	36.143802	20.20453	18.46623	17.64739	17.25447	16.98376	3	0.065986	1401
15725	5.646502	3.562572	23.04084	22.26374	22.04945	21.24024	20.42934	6	1.211771	9444

70000 rows × 10 columns

Code

sc = StandardScaler()
yj = PowerTransformer(method="yeo-johnson")

preprocessor = ColumnTransformer([("normalization", yj, selector(dtype_include="number")),
                ("standardization", sc, selector(dtype_include="number")),
                ])

Building a model

Code

deep_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='tanh'),
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(3, activation='softmax')
])

Code

# compile the model
deep_model.compile(
    loss= 'sparse_categorical_crossentropy',
    optimizer= tf.keras.optimizers.SGD(),
    metrics= ['accuracy']
)

Code

# create the pipeline
deep_model_pipeline = Pipeline(steps=[
  ("preprocessor", preprocessor),
  ("model", deep_model),
])

Code

deep_model_pipeline

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('normalization',
                                                  PowerTransformer(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x148b62980>),
                                                 ('standardization',
                                                  StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x148b61b40>)])),
                ('model',
                 <keras.engine.sequential.Sequential object at 0x148b61ab0>)])

Code

deep_model.summary()

Model: "sequential_37"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_114 (Dense)           (None, 512)               10752     
                                                                 
 dense_115 (Dense)           (None, 64)                32832     
                                                                 
 dense_116 (Dense)           (None, 3)                 195       
                                                                 
=================================================================
Total params: 43,779
Trainable params: 43,779
Non-trainable params: 0
_________________________________________________________________
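The parameter counts in the summary can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The first layer's 10,752 parameters imply 20 inputs, which is consistent with the preprocessor emitting each of the 10 feature columns twice. A worked check (the helper `dense_params` is a hypothetical name introduced here):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 10            # columns in X_train
n_inputs = 2 * n_features  # each column appears twice after preprocessing

layer_params = [
    dense_params(n_inputs, 512),  # first hidden layer
    dense_params(512, 64),        # second hidden layer
    dense_params(64, 3),          # softmax output, one unit per class
]
total_params = sum(layer_params)

print(layer_params, total_params)
```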

Code

# fit the model
# note: Pipeline.fit returns the fitted pipeline itself, not a Keras History object
deep_model_history = deep_model_pipeline.fit(X_train,y_train, model__epochs=10, model__validation_split=0.2)

Epoch 1/10
1750/1750 [==============================] - 13s 7ms/step - loss: 0.2560 - accuracy: 0.9258 - val_loss: 0.1757 - val_accuracy: 0.9546
Epoch 2/10
1750/1750 [==============================] - 15s 9ms/step - loss: 0.1625 - accuracy: 0.9552 - val_loss: 0.1501 - val_accuracy: 0.9616
Epoch 3/10
1750/1750 [==============================] - 15s 8ms/step - loss: 0.1418 - accuracy: 0.9596 - val_loss: 0.1336 - val_accuracy: 0.9631
Epoch 4/10
1750/1750 [==============================] - 16s 9ms/step - loss: 0.1318 - accuracy: 0.9618 - val_loss: 0.1279 - val_accuracy: 0.9635
Epoch 5/10
1750/1750 [==============================] - 20s 11ms/step - loss: 0.1261 - accuracy: 0.9629 - val_loss: 0.1235 - val_accuracy: 0.9656
Epoch 6/10
1750/1750 [==============================] - 21s 12ms/step - loss: 0.1225 - accuracy: 0.9641 - val_loss: 0.1209 - val_accuracy: 0.9651
Epoch 7/10
1750/1750 [==============================] - 21s 12ms/step - loss: 0.1199 - accuracy: 0.9644 - val_loss: 0.1174 - val_accuracy: 0.9665
Epoch 8/10
1750/1750 [==============================] - 23s 13ms/step - loss: 0.1177 - accuracy: 0.9649 - val_loss: 0.1145 - val_accuracy: 0.9672
Epoch 9/10
1750/1750 [==============================] - 25s 14ms/step - loss: 0.1161 - accuracy: 0.9654 - val_loss: 0.1140 - val_accuracy: 0.9674
Epoch 10/10
1750/1750 [==============================] - 26s 15ms/step - loss: 0.1142 - accuracy: 0.9662 - val_loss: 0.1124 - val_accuracy: 0.9677
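The 1750 steps per epoch in the log follow from the data sizes. A quick check, assuming the Keras default batch size of 32 (no batch size is passed to fit above):

```python
import math

n_train = 70000          # rows in X_train
validation_percent = 20  # model__validation_split=0.2
batch_size = 32          # assumed: Keras default when batch_size is not given

# Rows held out for validation, and rows actually used for gradient updates.
n_validation = n_train * validation_percent // 100
n_fit = n_train - n_validation

# One step per batch; a final partial batch would still count as a step.
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, steps_per_epoch)
```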

Code

y_pred = deep_model_pipeline.predict(X_test)
accuracy_score(y_test, y_pred.argmax(axis=-1))

0.9713333333333334
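The softmax output layer returns one probability per class for every row, so argmax(axis=-1) is what turns those probabilities into class labels before scoring. A minimal sketch with made-up probabilities (the arrays below are assumptions, not model output):

```python
import numpy as np

# Made-up softmax outputs: one row per observation, one column per class.
probabilities = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.20, 0.70, 0.10],
])

# Index of the largest probability in each row is the predicted class.
predicted = probabilities.argmax(axis=-1)

# Made-up true labels; accuracy is the share of matching positions.
y_true_toy = np.array([0, 2, 2])
accuracy = (predicted == y_true_toy).mean()

print(predicted, accuracy)
```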