Data Mining Lab File

 Practical No: 1 

Aim: Demonstration of data pre-processing on datasets  

Data pre-processing is an essential step in any data analysis or machine learning project. It involves transforming and preparing raw data into a clean, organized, and suitable format for further analysis or model training. In this explanation, we’ll go through the various steps involved in data pre-processing, using a dataset as an example. 

Let’s assume we have a dataset containing information about customers and their purchases, including attributes such as customer ID, age, gender, purchase amount, and date of purchase. Our goal is to process this data to make it ready for analysis or training a machine learning model. 

  1. Data Cleaning: 

   The first step is to identify and handle any missing or erroneous data. Missing data can occur due to various reasons like human error during data entry or data corruption during collection. Some common techniques for handling missing data are: 

   - Removing the missing data: If the missing data is negligible or doesn’t affect the analysis, we can remove the corresponding rows or columns. 

   - Imputing missing data: If the missing data is important or the missingness is systematic, we can estimate and fill in the missing values using techniques like mean, median, mode, or machine learning algorithms. 
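   As a small illustration of both options, the sketch below uses pandas on a hypothetical customer-purchase table (the column names and values are assumptions for demonstration, not a real dataset):

import numpy as np
import pandas as pd

# Hypothetical customer-purchase data with some missing values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25, np.nan, 47, 31],
    "purchase_amount": [250.0, 120.5, np.nan, 89.9]
})

# Option 1: remove rows containing any missing value
df_dropped = df.dropna()

# Option 2: impute missing values with a statistic (mean here; median or mode are also common)
df_imputed = df.fillna({"age": df["age"].mean(),
                        "purchase_amount": df["purchase_amount"].median()})

print(df_dropped)
print(df_imputed)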

  2. Data Transformation: 

   Data transformation involves converting the raw data into a suitable format. It includes the following steps: 

   - Encoding Categorical Variables: Categorical variables, such as gender or product categories, need to be converted into numerical form for most machine learning algorithms. This process is called encoding. There are two common encoding techniques: 

     - One-Hot Encoding: Each category is transformed into a binary vector, where each vector represents a single category, and only one element in the vector is 1 (hot), while the rest are 0 (cold). 

     - Label Encoding: Each category is assigned a unique numeric label. 
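     A minimal sketch of both techniques, assuming a small DataFrame with a hypothetical gender column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# One-hot encoding: one binary (0/1) column per category
one_hot = pd.get_dummies(df["gender"], prefix="gender")

# Label encoding: one integer label per category
df["gender_label"] = LabelEncoder().fit_transform(df["gender"])

print(one_hot)
print(df)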

   - Feature Scaling: Features with different scales can lead to biased results in some algorithms. Therefore, it’s common to scale numerical features to a standard range. The two most common techniques are: 

     - Standardization (Z-score normalization): It scales the features to have zero mean and unit variance. 

     - Min-Max Scaling: It scales the features to a specific range, usually between 0 and 1. 
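     Both scalers are available in scikit-learn; a small sketch on assumed sample values:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical numeric features (e.g., age and purchase amount)
X = np.array([[25, 250.0], [38, 120.5], [47, 560.0], [31, 89.9]])

# Standardization (Z-score): zero mean, unit variance per column
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standardized)
print(X_minmax)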

   - Handling Outliers: Outliers are extreme values that deviate significantly from the rest of the data. They can affect the performance and accuracy of models. Some techniques for handling outliers include removing them, capping or flooring their values, or replacing them with statistical measures like the mean or median. 
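     A minimal sketch of capping and flooring outliers with the common 1.5 × IQR rule, using assumed sample values:

import pandas as pd

# Hypothetical purchase amounts with one extreme outlier
amounts = pd.Series([12, 15, 18, 20, 22, 25, 30, 35, 40, 500.0])

# Cap/floor values that fall outside 1.5 * IQR of the quartiles
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
capped = amounts.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped)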

   - Handling Skewness: Skewness refers to the asymmetry in the distribution of a variable. Skewed data can lead to biased models. Common techniques for handling skewness include log transformation, square root transformation, or Box-Cox transformation. 
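     A minimal sketch of the log and square-root transformations on assumed right-skewed values:

import numpy as np
import pandas as pd

# Hypothetical right-skewed values (e.g., purchase amounts)
values = pd.Series([5, 8, 12, 15, 20, 30, 45, 80, 150, 400.0])

# Log transformation reduces right skew (log1p handles zeros safely)
log_transformed = np.log1p(values)

# Square-root transformation is a milder alternative
sqrt_transformed = np.sqrt(values)

print(values.skew(), log_transformed.skew(), sqrt_transformed.skew())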

  3. Data Integration: 

   Data integration involves merging or combining multiple datasets into a single dataset. This step is necessary when we have data spread across multiple sources or files, and we want to consolidate it for analysis. It often requires aligning common variables or identifiers between the datasets. 
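   A small sketch of such a merge, assuming two hypothetical tables that share a customer_id column:

import pandas as pd

# Hypothetical customer and purchase tables from two different sources
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [25, 38, 47]})
purchases = pd.DataFrame({"customer_id": [1, 1, 3],
                          "purchase_amount": [250.0, 99.9, 560.0]})

# Merge on the common identifier (an inner join keeps only matching customers)
merged = pd.merge(customers, purchases, on="customer_id", how="inner")
print(merged)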

  4. Data Reduction: 

   Data reduction techniques aim to reduce the dimensionality of the dataset while preserving important information. High-dimensional data can be computationally expensive and prone to overfitting. Techniques like Principal Component Analysis (PCA) or feature selection algorithms can be used to select the most relevant features or transform the data into a lower-dimensional space. 
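   A minimal PCA sketch on the Iris dataset (the same dataset used in the later practicals), reducing four features to two principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale the 4-dimensional Iris features, then project onto 2 components
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component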

  5. Data Discretization: 

   Discretization involves converting continuous variables into categorical ones. It can be useful in certain analyses or when specific algorithms require categorical inputs. Discretization methods include equal width binning, equal frequency binning, or clustering-based approaches. 
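   A small sketch of equal-width and equal-frequency binning with pandas, on assumed age values:

import pandas as pd

# Hypothetical ages to discretize
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 60, 67, 75])

# Equal-width binning: bins cover ranges of equal size
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: bins contain (roughly) equal numbers of values
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(equal_width.value_counts())
print(equal_freq.value_counts())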

  6. Data Sampling: 

   In some cases, the dataset may be too large or imbalanced, leading to biased results or excessive computational cost. Sampling techniques such as simple random sampling, stratified sampling, or over- and under-sampling can be used to obtain a smaller or better-balanced subset for analysis or model training. 
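   A minimal sketch of random and stratified sampling with pandas, on a hypothetical imbalanced dataset:

import pandas as pd

# Hypothetical large, imbalanced dataset
df = pd.DataFrame({"feature": range(1000),
                   "label": ["A"] * 900 + ["B"] * 100})

# Simple random sampling: keep 10% of the rows
random_sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: keep 10% of each class so the proportions are preserved
stratified_sample = (df.groupby("label", group_keys=False)
                       .apply(lambda g: g.sample(frac=0.1, random_state=42)))

print(random_sample["label"].value_counts())
print(stratified_sample["label"].value_counts())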

 

Practical No: 2 

Aim: To list all the categorical (or nominal) attributes and the real-valued attributes 

To list all the categorical (nominal) attributes and the real-valued attributes in a dataset, you need to examine the nature of each attribute and determine its data type. Here’s how you can identify categorical and real-valued attributes: 

  1. Categorical (Nominal) Attributes: 

   Categorical attributes represent discrete values that belong to a specific category or class. Here are some indicators that suggest an attribute is categorical: 

   - The attribute contains a limited number of distinct values or categories. 

   - The values are not numeric or continuous. 

   - The values represent labels or classes rather than quantities. 

   Examples of categorical attributes could be gender, color, product type, or country. 

  2. Real-Valued Attributes: 

   Real-valued attributes, also known as continuous or numeric attributes, represent values on a continuous scale. Here are some indicators that suggest an attribute is real-valued: 

   - The attribute contains numeric values. 

   - The values represent quantities or measurements. 

   - The values can take on a wide range of numeric values. 

   Examples of real-valued attributes could be age, temperature, salary, or purchase amount. 

When examining your dataset, consider each attribute and determine its data type based on the above indicators. Categorical attributes are typically represented as strings or discrete labels, while real-valued attributes are represented as numerical values. Keep in mind that the distinction between categorical and real-valued attributes may vary depending on the context and specific dataset. 

By identifying the categorical and real-valued attributes, you can tailor your data pre-processing steps accordingly. For categorical attributes, you may need to apply one-hot encoding or label encoding techniques, while real-valued attributes may require scaling or normalization before analysis or modeling. 
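A small sketch of this check using pandas (the DataFrame below is a hypothetical example of the customer-purchase data described earlier):

import pandas as pd

# Hypothetical customer-purchase dataset
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["Male", "Female", "Female"],
    "country": ["India", "USA", "UK"],
    "age": [25, 38, 47],
    "purchase_amount": [250.0, 120.5, 560.0]
})

# Real-valued (numeric) attributes
numeric_cols = df.select_dtypes(include=["number"]).columns.tolist()

# Categorical (nominal) attributes stored as strings or categories
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)

Note that an identifier such as customer_id is stored as a number but is really a nominal label, which is why the final decision still depends on context, as mentioned above.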


Practical No: 3 

Aim: Create a data classification model using decision tree  

Step 1: Import Libraries 

Start by importing the necessary libraries for data manipulation and the decision tree algorithm. In this example, we’ll use scikit-learn, a popular machine learning library in Python. 

Step 2: Load and Prepare the Data 

Load your dataset into a pandas DataFrame and split it into features (X) and the target variable (y). Make sure your target variable contains the class labels you want to predict. 

Step 3: Split Data into Training and Testing Sets 

Divide your data into training and testing sets. The training set will be used to train the decision tree model, while the testing set will be used to evaluate its performance. 

Step 4: Create and Train the Decision Tree Model 

Instantiate a decision tree classifier object and fit it to the training data. 

Step 5: Make Predictions 

Use the trained decision tree model to make predictions on the testing data. 

Step 6: Evaluate the Model 

Source code: 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame with the features and target variable
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

# Train the decision tree model
clf.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

This example uses the Iris dataset, which is readily available in scikit-learn. 

Output 

Accuracy: 1.0 


Practical No: 4 

Aim: Create a data classification model using Naïve Bayes  

Step 1: Import the necessary libraries: 

Step 2: Load the Iris dataset: 

Step 3: Split the data into training and testing sets: 

Step 4: Create a Naïve Bayes classifier object: 

Step 5: Train the Naïve Bayes classifier: 

Step 6: Make predictions on the testing data: 

Step 7: Evaluate the model’s accuracy: 

Source code: 

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

from sklearn.naive_bayes import GaussianNB 

from sklearn.metrics import accuracy_score 

# Load the Iris dataset 

iris = load_iris() 

# Split the data into features (X) and target variable (y) 

X = iris.data 

y = iris.target 

# Split the data into training and testing sets 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# Create a Naive Bayes classifier object 

clf = GaussianNB() 

# Train the Naive Bayes classifier 

clf.fit(X_train, y_train) 

# Make predictions on the testing data 

y_pred = clf.predict(X_test) 

# Calculate the accuracy of the model 

accuracy = accuracy_score(y_test, y_pred) 

print("Accuracy:", accuracy) 

Output 

Accuracy: 1.0 

 

Practical No: 5 

Aim: Create a data classification model using rule-based classifiers 

To create a rule-based data classification model using the iris dataset from scikit-learn, we’ll consider a multi-class classification problem where we want to classify iris flowers into three different species: setosa, versicolor, and virginica. We’ll use the features of sepal length, sepal width, petal length, and petal width. 

Step 1: Data Preparation 

- Load the iris dataset from scikit-learn. 

- Split the dataset into features (X) and labels (y). 

Step 2: Rule Creation 

- Examine the features to determine logical conditions that can be used to classify the iris flowers. 

- Define rules based on these conditions to classify the flowers into their respective species. 

Example Rules: 

Rule 1: If petal length <= 2.5, then classify as “setosa.” 

Rule 2: If petal width <= 1.8 and petal length <= 4.9, then classify as “versicolor.” 

Rule 3: If petal width > 1.8 and petal length > 4.9, then classify as “virginica.” 

Step 3: Classification 

  • For each new flower instance, apply the rules sequentially and assign the appropriate class based on the first matching rule. 

In the source code below, the `classify_iris_flower` function takes a flower instance as input and applies the defined rules to classify it into one of the three iris species. If none of the rules match, the function returns “Unknown”, though any other appropriate fallback action could be chosen. 

Source code: 

from sklearn.datasets import load_iris 

# Load the iris dataset 

iris = load_iris() 

X = iris.data  # Features 

y = iris.target  # Labels 

def classify_iris_flower(flower): 

    petal_length = flower[2] 

    petal_width = flower[3] 

    # Rule-based classification 

    if petal_length <= 2.5: 

        return "setosa" 

    elif petal_width <= 1.8 and petal_length <= 4.9: 

        return "versicolor" 

    elif petal_width > 1.8 and petal_length > 4.9: 

        return "virginica" 

    else: 

        return "Unknown" 

# Example usage 

new_flower = [5.1, 3.5, 1.4, 0.2]  # Example flower with features [sepal length, sepal width, petal length, petal width] 

classification = classify_iris_flower(new_flower) 

print("Classification:", classification) 

Output 

Classification: setosa 

 

Practical No: 6 

Aim: Create a data classification model using statistical classifiers.  

Statistical classifiers are machine learning algorithms that make predictions or classify data based on statistical principles and probability theory. These algorithms aim to learn patterns and relationships from labeled training data and use that knowledge to classify new, unseen instances. 

Step 1: Import the necessary libraries and modules. 

Step 2: Load the Iris dataset. 

Step 3: Perform feature scaling. 

Step 4: Split the dataset into training and testing sets. 

Step 5: Create an instance of the Logistic Regression classifier. 

Step 6: Train the classifier. 

Step 7: Make predictions on the test set. 

Step 8: Evaluate the classifier’s performance. 

Source code:  

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

from sklearn.preprocessing import StandardScaler 

from sklearn.linear_model import LogisticRegression 

from sklearn.metrics import classification_report 

# Load the Iris dataset 

iris = load_iris() 

X = iris.data 

y = iris.target 

# Perform feature scaling 

scaler = StandardScaler() 

X_scaled = scaler.fit_transform(X) 

# Split the dataset into training and testing sets 

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42) 

# Create an instance of the Logistic Regression classifier 

lr_classifier = LogisticRegression() 

# Train the classifier 

lr_classifier.fit(X_train, y_train) 

# Make predictions on the test set 

lr_predictions = lr_classifier.predict(X_test) 

# Evaluate the classifier 

report = classification_report(y_test, lr_predictions) 

print(report) 

Output 

              precision    recall  f1-score   support 

           0       1.00      1.00      1.00        10 
           1       1.00      1.00      1.00         9 
           2       1.00      1.00      1.00        11 

    accuracy                           1.00        30 
   macro avg       1.00      1.00      1.00        30 
weighted avg       1.00      1.00      1.00        30 

By following these steps, you will create a data classification model using Logistic Regression and the Iris dataset from scikit-learn. The classifier will be trained on the training set and evaluated on the test set using the classification report, providing performance metrics for each class. 

 

Practical No: 7 

Aim: Create a data classification model using neural networks. 

To create a data classification model using neural networks, we’ll follow a step-by-step process: 

Step 1: Data Preparation 

- Load or prepare your labeled dataset, ensuring it is properly formatted with features (X) and corresponding labels (y). 

- If necessary, perform any preprocessing steps such as data normalization or feature scaling. 

Step 2: Splitting the Data 

- Split your dataset into training and testing sets to evaluate the performance of the neural network. 

- This step helps in assessing the model’s generalization capability. 

Step 3: Building the Neural Network 

- Import the necessary libraries and modules for building and training the neural network. 

- Choose the appropriate architecture, such as the number of layers and nodes, activation functions, and optimizer. 

Step 4: Training the Neural Network 

- Fit the neural network to the training data using the chosen optimizer and loss function. 

- Specify the number of epochs and batch size for training. 

- Monitor the training process to ensure the model is learning and improving. 

Step 5: Evaluating the Model 

- Make predictions on the test set using the trained neural network. 

- Evaluate the performance of the model using appropriate evaluation metrics, such as accuracy, precision, recall, or F1-score. 

Below, we use the Keras library in Python to create a simple neural network for data classification: 

Source code: 

import numpy as np 

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

from sklearn.preprocessing import StandardScaler 

from keras.models import Sequential 

from keras.layers import Dense 

from keras.utils import to_categorical 

# Load the Iris dataset 

iris = load_iris() 

X = iris.data 

y = iris.target 

# Perform feature scaling 

scaler = StandardScaler() 

X_scaled = scaler.fit_transform(X) 

# Convert the labels to categorical format 

y_categorical = to_categorical(y) 

# Split the dataset into training and testing sets 

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_categorical, test_size=0.2, random_state=42) 

# Define the neural network architecture 

model = Sequential() 

model.add(Dense(10, activation='relu', input_shape=(4,))) 

model.add(Dense(10, activation='relu')) 

model.add(Dense(3, activation='softmax')) 

# Compile the model 

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) 

# Train the model 

model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=1) 

# Evaluate the model on the test set 

loss, accuracy = model.evaluate(X_test, y_test) 

print("Accuracy: %.2f%%" % (accuracy * 100)) 

Output 

Epoch 46/50 
4/4 [==============================] - 0s 5ms/step - loss: 0.3315 - accuracy: 0.8667 
Epoch 47/50 
4/4 [==============================] - 0s 6ms/step - loss: 0.3276 - accuracy: 0.8667 
Epoch 48/50 
4/4 [==============================] - 0s 6ms/step - loss: 0.3235 - accuracy: 0.8667 
Epoch 49/50 
4/4 [==============================] - 0s 6ms/step - loss: 0.3200 - accuracy: 0.8667 
Epoch 50/50 
4/4 [==============================] - 0s 5ms/step - loss: 0.3162 - accuracy: 0.8667 
1/1 [==============================] - 0s 272ms/step - loss: 0.2334 - accuracy: 0.9333 
Accuracy: 93.33% 

 

 

Practical No: 8 

Aim: Create a data classification model using a support vector machine (SVM) 

Step 1: Import the necessary libraries and modules. 

Step 2: Load the Iris dataset. 

Step 3: Perform feature scaling. 

Step 4: Split the dataset into training and testing sets. 

Step 5: Create a classifier. 

Step 6: Train the classifier. 

Step 7: Make predictions on the test set. 

Step 8: Evaluate the classifier’s performance. 

Source code: 

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

from sklearn.preprocessing import StandardScaler 

from sklearn.svm import SVC 

from sklearn.metrics import classification_report 

# Load the Iris dataset 

iris = load_iris() 

X = iris.data 

y = iris.target 

# Perform feature scaling 

scaler = StandardScaler() 

X_scaled = scaler.fit_transform(X) 

# Split the dataset into training and testing sets 

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42) 

# Create a Support Vector Machine (SVM) classifier 

svm_classifier = SVC() 

# Train the classifier 

svm_classifier.fit(X_train, y_train) 

# Make predictions on the test set 

svm_predictions = svm_classifier.predict(X_test) 

# Evaluate the classifier 

report = classification_report(y_test, svm_predictions) 

print(report)

Output 

              precision    recall  f1-score   support 

           0       1.00      1.00      1.00        10 
           1       1.00      1.00      1.00         9 
           2       1.00      1.00      1.00        11 

    accuracy                           1.00        30 
   macro avg       1.00      1.00      1.00        30 
weighted avg       1.00      1.00      1.00        30 

By following these steps, you will create a data classification model using the Iris dataset from scikit-learn. The dataset is loaded, feature scaling is performed, and the dataset is split into training and testing sets. Then, a Support Vector Machine (SVM) classifier is created, trained on the training data, and used to make predictions on the test set. Finally, the performance of the classifier is evaluated using the classification report, which provides metrics such as precision, recall, and F1-score for each class. 

 

 

Practical No: 9 

Aim: Demonstrate the working of the k-means algorithm for clustering data. 

Step 1: Choose the number of clusters (k). 

  • Decide on the number of clusters you want to create. This can be based on prior knowledge or by using techniques such as the elbow method to determine an optimal value for k. 

Step 2: Initialize cluster centroids. 

  • Randomly initialize the centroids for each cluster. The centroids are the representative points that define the center of each cluster. 

Step 3: Assign data points to clusters. 

  • Assign each data point to the cluster whose centroid is closest to it. This is done by calculating the distance between each data point and the centroids using a distance metric such as Euclidean distance. 

Step 4: Update cluster centroids. 

  • Recalculate the centroids of each cluster by taking the mean of all the data points assigned to that cluster. This will update the centroid positions. 

Step 5: Repeat steps 3 and 4. 

  • Iterate steps 3 and 4 until a stopping criterion is met. This criterion can be a maximum number of iterations or when the centroids no longer change significantly between iterations. 

Step 6: Obtain the final cluster assignments. 

  • Once the algorithm converges and the centroids stabilize, the final cluster assignments are obtained. Each data point will be assigned to the cluster whose centroid it is closest to. 

Step 7: Analyze the clusters. 

  • Evaluate and interpret the resulting clusters based on your domain knowledge or use additional metrics such as silhouette score or within-cluster sum of squares (WCSS) to assess the quality of the clustering. 

It’s important to note that k-means clustering is sensitive to the initial placement of centroids and can converge to different solutions. To mitigate this, it’s common practice to run the algorithm multiple times with different initializations and choose the clustering solution with the lowest WCSS or highest silhouette score. WCSS stands for Within-Cluster Sum of Squares. It is a metric used to evaluate the quality of clustering in k-means clustering. WCSS measures the compactness or tightness of the clusters by summing up the squared distances between each data point and its cluster centroid. 

By following these general steps, you can perform k-means clustering on your dataset and obtain clusters based on the chosen value of k. 
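As a quick sketch of the elbow method and WCSS described above (using the same kind of synthetic blobs as the source code below; scikit-learn exposes WCSS as the fitted model's inertia_ attribute):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# The same kind of synthetic data used in the source code below
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Compute WCSS (inertia_) for a range of k values
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(kmeans.inertia_)

# The "elbow" in this curve suggests a reasonable value of k
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()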

Source code: 

import numpy as np 

import matplotlib.pyplot as plt 

from sklearn.datasets import make_blobs 

from sklearn.cluster import KMeans 

# Generate some sample data 

X, y = make_blobs(n_samples=500, centers=4, random_state=42) 

# Apply k-means clustering 

kmeans = KMeans(n_clusters=4, random_state=42) 

kmeans.fit(X) 

# Get the cluster labels and cluster centers 

labels = kmeans.labels_ 

centers = kmeans.cluster_centers_ 

# Visualize the clusters 

plt.scatter(X[:, 0], X[:, 1], c=labels) 

plt.scatter(centers[:, 0], centers[:, 1], marker='X', color='red', s=200) 

plt.xlabel("Feature 1") 

plt.ylabel("Feature 2") 

plt.title("k-means Clustering") 

plt.show() 

Output 

 

 

Practical No: 10 

Aim: Create a clustering model using hierarchical clustering algorithm 

Hierarchical clustering is an unsupervised machine learning algorithm used to cluster data points into groups based on their similarity. It creates a hierarchy of clusters by iteratively merging or splitting clusters based on a similarity or dissimilarity measure. 

The basic idea behind hierarchical clustering is to start with each data point as its own cluster and then iteratively merge or split clusters until a stopping criterion is met. The result is a dendrogram, which is a tree-like structure that represents the hierarchy of clusters. 

There are two main approaches to hierarchical clustering: 

1. Agglomerative (Bottom-Up) Clustering: 

   - In agglomerative clustering, each data point starts as its own cluster. 

   - At each iteration, the two most similar clusters are merged together based on a similarity measure such as Euclidean distance, Manhattan distance, or correlation. 

   - This process continues until all data points belong to a single cluster or until a stopping criterion is met. 

2. Divisive (Top-Down) Clustering: 

   - In divisive clustering, all data points start in a single cluster. 

   - At each iteration, the cluster is split into two subclusters based on a dissimilarity measure. 

   - This process continues recursively, with clusters being split into smaller clusters until each data point is in its own cluster or until a stopping criterion is met. 

Hierarchical clustering offers several advantages: 

- It does not require specifying the number of clusters in advance. 

- It provides a hierarchical structure that can be visually represented as a dendrogram. 

- It can capture complex relationships and nested structures in the data. 

However, hierarchical clustering can be computationally expensive, especially for large datasets. The choice of the similarity or dissimilarity measure and the linkage criterion (used to determine the distance between clusters) can also affect the clustering results. 

To determine the final clusters from the dendrogram, you can apply a cutting threshold or use techniques such as the elbow method or silhouette score to decide on the appropriate number of clusters. 

Overall, hierarchical clustering is a powerful technique for exploring and analyzing data by organizing it into a hierarchical structure of clusters based on similarity or dissimilarity. 
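As a small sketch of the dendrogram described above (assuming SciPy is available alongside scikit-learn), the linkage matrix can be computed and plotted as follows:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# A small sample keeps the dendrogram readable
X, _ = make_blobs(n_samples=50, centers=4, random_state=42)

# Build the agglomerative linkage matrix using Ward's criterion
Z = linkage(X, method="ward")

# Cutting the dendrogram at a chosen height yields the final clusters
dendrogram(Z)
plt.xlabel("Sample index")
plt.ylabel("Distance")
plt.title("Dendrogram (Ward linkage)")
plt.show()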

Source code : 

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate some sample data
X, y = make_blobs(n_samples=500, centers=4, random_state=42)

# Perform hierarchical clustering
agglomerative = AgglomerativeClustering(n_clusters=4)
labels = agglomerative.fit_predict(X)

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Hierarchical Clustering")
plt.show()

Output 

 

 

In the code above, we first generate some sample data using the `make_blobs` function from scikit-learn. This generates synthetic data points with four distinct clusters. 

Next, we apply hierarchical clustering using the `AgglomerativeClustering` class from scikit-learn. We specify the number of clusters (`n_clusters`) as four to match the number of true clusters in the data. 

After fitting the agglomerative clustering model to the data using the `fit_predict` method, we obtain the cluster labels for each data point. 

Finally, we visualize the clusters by plotting the data points with different colors based on their assigned cluster labels. 
