SKRUB: A Python library for cleaning, structuring, and visualizing tabular data

INTRODUCTION

How do we deal with messy data? Missing values, inconsistent formats, and unstructured information slow down data analysis; this is where skrub comes into the picture. skrub is a powerful Python library designed to clean, structure, and prepare tabular data efficiently. In this guide, we'll explore its key features.

skrub makes cleaning, organizing, and visualizing messy tables easier and faster:

Data Cleaning – Removes inconsistencies, trims spaces, and fixes column names.

Handling Missing Data – Easily fills missing values.

Dataset Merging – Intelligently links datasets, even with slight variations.

Quick Insights – Generates structured data for visualization & analysis (see the report sketch below).
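
For the quick-insights part, recent skrub releases include a TableReport class that builds an interactive overview of a DataFrame (column types, distributions, missing values). Below is a minimal sketch, assuming the TableReport API shipped with skrub 0.5.x; the sample frame is hypothetical:

import pandas as pd
from skrub import TableReport

# A small hypothetical frame with mixed types and missing values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
})

# Build an interactive report summarizing columns, distributions, and missing values
report = TableReport(df)

# In a Jupyter notebook the report renders inline; report.open() should open it
# in the browser (assumption: .open() is available in this skrub version)
report.open()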

WHY SKRUB?

Assembling Tables with Precision: skrub excels at joining tables on keys of different types (string, numerical, and datetime), and it can handle imprecise correspondences between them.

Fuzzy Joining for Seamless Integration: skrub selects the type of fuzzy matching based on the column types and produces a similarity score, making it easy to spot less-than-perfect matches.

Advanced Analysis Made Simple: skrub takes table joining further with transformers such as Joiner, AggJoiner, and AggTarget.

Efficient Column Selection in Pipelines: beyond joins, skrub also facilitates column selection within a pipeline, letting data scientists keep or discard columns dynamically (see the sketch after this list).
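
As a rough sketch of that last point, skrub exposes SelectCols and DropCols transformers that can sit inside a scikit-learn pipeline. The exact names and parameters used here are assumptions based on recent skrub releases and may differ slightly between versions:

import pandas as pd
from sklearn.pipeline import make_pipeline
from skrub import SelectCols, TableVectorizer  # assumption: SelectCols is exported at the top level

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve'],
    'Age': [25, 31, 29],
    'City': ['New York', 'Paris', 'Paris']
})

# Keep only the columns we care about, then vectorize what is left
pipe = make_pipeline(
    SelectCols(['Age', 'City']),  # drops 'Name' before vectorization
    TableVectorizer()
)

print(pipe.fit_transform(df))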

INSTALLATION PROCESS

pip install skrub
Requirement already satisfied: skrub in c:\users\nihar\appdata\local\programs\python\python312\lib\site-packages (0.5.1)
...
Note: you may need to restart the kernel to use updated packages.
pip install --upgrade skrub

USING SKRUB

1. HANDLING MISSING DATA

One of the most common issues when cleaning data is dealing with missing (NaN) values. With a few lines of pandas, plus skrub for the later vectorization step, you can fill or drop these missing values.

For example, consider this code:

# Sample data with missing values
import pandas as pd


data = {
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
}

df = pd.DataFrame(data)
df
    Name   Age      City
0  Alice  25.0  New York
1    Bob   NaN     Paris
2   None  22.0      None
3    Eve  29.0      None

We can fill the missing values before vectorizing the table. For example, to fill the numerical column with its mean and the categorical columns with a placeholder value, we can use the following code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from skrub import TableVectorizer  # skrub's all-in-one table encoder (used in the next step)

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
}

# Create a DataFrame
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Fill missing values correctly (avoid FutureWarning)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Filling missing Age with the mean
df['Name'] = df['Name'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')

print("\nData After Filling Missing Values:")
print(df)

# Standardize the 'Age' column
scaler = StandardScaler()
df['Age'] = scaler.fit_transform(df[['Age']])

print("\nData After Standardizing 'Age' Column:")
print(df)

# Next step: vectorize the table (convert categorical data into numerical features); see section 2
Original Data:
    Name   Age      City
0  Alice  25.0  New York
1    Bob   NaN     Paris
2   None  22.0      None
3    Eve  29.0      None

Data After Filling Missing Values:
      Name        Age      City
0    Alice  25.000000  New York
1      Bob  25.333333     Paris
2  Unknown  22.000000   Unknown
3      Eve  29.000000   Unknown

Data After Standardizing 'Age' Column:
      Name       Age      City
0    Alice -0.134231  New York
1      Bob  0.000000     Paris
2  Unknown -1.342312   Unknown
3      Eve  1.476543   Unknown

The missing Age value is replaced with the column mean.

The missing Name and City values are replaced with the string 'Unknown'.

In the next section, TableVectorizer handles the categorical features and converts them into binary features (one-hot encoding), so the cleaned data becomes fully numerical and we can see exactly how the missing values were handled.
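
If you prefer to keep the imputation inside a scikit-learn workflow instead of calling fillna by hand, the same mean/mode strategy can be written with SimpleImputer. This is a minimal sketch using plain scikit-learn, not a skrub-specific API (np.nan is used for the missing entries so the imputer detects them):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'Eve'],
    'Age': [25, np.nan, 22, 29],
    'City': ['New York', 'Paris', np.nan, np.nan]
})

# Mean for the numerical column, most frequent value (mode) for the categorical ones
df[['Age']] = SimpleImputer(strategy='mean').fit_transform(df[['Age']])
df[['Name', 'City']] = SimpleImputer(strategy='most_frequent').fit_transform(df[['Name', 'City']])

print(df)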

2. STANDARDIZING DATA

Alongside skrub, scikit-learn's scalers are handy for standardizing numerical data, making sure that values are on a consistent scale across the dataset. For example, if we have an Age column with a large range, we can rescale it to the [0, 1] range using normalization (min-max scaling) or give it zero mean and unit variance using standardization.
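
To see the difference between the two, here is a brief sketch with scikit-learn's MinMaxScaler (rescales to [0, 1]) next to StandardScaler (zero mean, unit variance); this uses plain scikit-learn rather than a skrub-specific API:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = pd.DataFrame({'Age': [22, 25, 29, 41]})

# Normalization: squeeze values into the [0, 1] range
print(MinMaxScaler().fit_transform(ages))

# Standardization: rescale to zero mean and unit variance
print(StandardScaler().fit_transform(ages))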

The full example below applies StandardScaler to the Age column of our sample table:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from skrub import TableVectorizer

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
}

# Create a DataFrame
df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Fill missing values correctly (avoid FutureWarning)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Filling missing Age with the mean
df['Name'] = df['Name'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')

print("\nData After Filling Missing Values:")
print(df)

# Standardize the 'Age' column
scaler = StandardScaler()
df['Age'] = scaler.fit_transform(df[['Age']])

print("\nData After Standardizing 'Age' Column:")
print(df)

# Vectorize the table (convert categorical data into numerical features)
vectorizer = TableVectorizer()
df_vectorized = vectorizer.fit_transform(df)

print("\nVectorized Data:")
print(df_vectorized)
Original Data:
    Name   Age      City
0  Alice  25.0  New York
1    Bob   NaN     Paris
2   None  22.0      None
3    Eve  29.0      None

Data After Filling Missing Values:
      Name        Age      City
0    Alice  25.000000  New York
1      Bob  25.333333     Paris
2  Unknown  22.000000   Unknown
3      Eve  29.000000   Unknown

Data After Standardizing 'Age' Column:
      Name       Age      City
0    Alice -0.134231  New York
1      Bob  0.000000     Paris
2  Unknown -1.342312   Unknown
3      Eve  1.476543   Unknown

Vectorized Data:
   Name_Alice  Name_Bob  Name_Eve  Name_Unknown       Age  City_New York  \
0         1.0       0.0       0.0           0.0 -0.134231            1.0   
1         0.0       1.0       0.0           0.0  0.000000            0.0   
2         0.0       0.0       0.0           1.0 -1.342312            0.0   
3         0.0       0.0       1.0           0.0  1.476543            0.0   

   City_Paris  City_Unknown  
0         0.0           0.0  
1         1.0           0.0  
2         0.0           1.0  
3         0.0           1.0  

The code fills missing values in the Age, Name, and City columns, then standardizes the Age column using StandardScaler. Afterward, it uses TableVectorizer to convert the entire dataframe into numerical features for machine learning.

3. TRANSFORMING CATEGORICAL VARIABLES

Categorical variables, such as names of cities or product categories, must be encoded into numerical values before they can be fed to machine learning algorithms.

Skrub makes this easy with the TableVectorizer, which handles one-hot encoding and other encoding methods automatically.

Observe the code below (it reuses the df and TableVectorizer from the previous section):

# Use TableVectorizer to transform categorical columns
vectorizer = TableVectorizer()
df_transformed = vectorizer.fit_transform(df)

print("\nTransformed Data (Categorical Encoded):")
print(df_transformed)

Transformed Data (Categorical Encoded):
   Name_Alice  Name_Bob  Name_Eve  Name_Unknown       Age  City_New York  \
0         1.0       0.0       0.0           0.0 -0.134231            1.0   
1         0.0       1.0       0.0           0.0  0.000000            0.0   
2         0.0       0.0       0.0           1.0 -1.342312            0.0   
3         0.0       0.0       1.0           0.0  1.476543            0.0   

   City_Paris  City_Unknown  
0         0.0           0.0  
1         1.0           0.0  
2         0.0           1.0  
3         0.0           1.0  

The code uses TableVectorizer to change text data (like 'Name' and 'City') into numbers. It creates a new column for each category, with 1 or 0 to show whether that category is present. This turns the data into a form that machine learning models can work with.
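
Because TableVectorizer behaves like any other scikit-learn transformer, it can be placed at the front of a Pipeline with a model behind it. Here is a minimal sketch with a hypothetical label column y; the classifier choice is only an example:

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# Hypothetical training data: mixed text and numeric columns, y is the label
X = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve', 'Carol'],
    'Age': [25, 31, 29, 40],
    'City': ['New York', 'Paris', 'Paris', 'London']
})
y = [0, 1, 1, 0]

# TableVectorizer turns the raw table into numbers, then the classifier trains on them
model = make_pipeline(TableVectorizer(), HistGradientBoostingClassifier())
model.fit(X, y)
print(model.predict(X))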

4. MERGING SIMILAR TEXT DATA

If we have messy text (for example, "NewYork" vs "New Yor"), skrub can match and merge the corresponding rows, as shown below:

from skrub import fuzzy_join
import pandas as pd

# Sample data: Customer names with typos
df1 = pd.DataFrame({'Name': ['John Doe', 'Alice W.', 'Eve Smith']})
df2 = pd.DataFrame({'Name': ['Jon Doe', 'Eve s.','Alice '], 'Purchase': [100, 200, 150]})

# Use Skrub to match similar names
merged_df = fuzzy_join(df1, df2, on='Name')

print("\nFuzzy Matched Data:\n", merged_df)

Fuzzy Matched Data:
         Name Name__skrub_d4fa04a9__  Purchase
0   John Doe                Jon Doe       100
1   Alice W.                 Alice        150
2  Eve Smith                 Eve s.       200

The fuzzy join is dedicated to joins between two tables whose keys don't match exactly.

It works by computing a distance between key values and pairing each row with its closest match; a similarity score lets you spot, or filter out, less confident matches (see the sketch below).
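
skrub records the matched key from the right-hand table in an auto-generated suffixed column (the Name__skrub_...__ column above; the exact suffix differs between runs). To make the threshold idea concrete, the sketch below continues from the merged_df produced above and computes a simple similarity score between the two key columns with Python's difflib, keeping only confident matches; the 0.6 cutoff is an arbitrary example, not a skrub default:

from difflib import SequenceMatcher

# The column holding the matched keys from df2; its suffix is auto-generated by skrub,
# so look it up instead of hard-coding it
matched_col = [c for c in merged_df.columns if c.startswith('Name__skrub')][0]

# Similarity between the original key and the matched key, between 0 and 1
merged_df['similarity'] = [
    SequenceMatcher(None, a.lower(), b.lower()).ratio()
    for a, b in zip(merged_df['Name'], merged_df[matched_col])
]

# Keep only matches above an (arbitrary) confidence threshold
confident_matches = merged_df[merged_df['similarity'] >= 0.6]
print(confident_matches)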

5. DETECTING AND FILTERING OUTLIERS

There are 4 ways to detect outliers:

1. Sorting method

2. Data visualization method

3. Statistical tests (z-scores)

4. Interquartile range method

To understand this in detail, we will use the interquartile range (IQR) method.

First of all, let's analyze the code below:

import pandas as pd

# Sample data with an outlier
data = {'Salary': [30000, 50000, 70000, 999000]}  

# Create DataFrame
df = pd.DataFrame(data)

# Detect and remove outliers using IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_cleaned = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

print("Original Data:")
print(df)

print("\nCleaned Data (Outliers Removed):")
print(df_cleaned)
Original Data:
   Salary
0   30000
1   50000
2   70000
3  999000

Cleaned Data (Outliers Removed):
   Salary
0   30000
1   50000
2   70000

The interquartile range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create “fences” around your data and then define outliers as any values that fall outside those fences.

This method is helpful if you have a few values on the extreme ends of your dataset.

Interquartile range method

Sort your data from low to high

Identify the first quartile (Q1), the median, and the third quartile (Q3).

Calculate your IQR = Q3 - Q1

Calculate your upper fence = Q3 + (1.5 * IQR)

Calculate your lower fence = Q1 - (1.5 * IQR)

Use your fences to flag outliers: any value greater than your upper fence or less than your lower fence is an outlier. (A small reusable helper implementing these steps is sketched below.)
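
The same steps can be wrapped in a small reusable helper. This is a minimal sketch in plain pandas; the function name is just for illustration:

import pandas as pd

def remove_outliers_iqr(df, column, factor=1.5):
    # Compute the IQR fences for the column and keep only rows inside them
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

salaries = pd.DataFrame({'Salary': [30000, 50000, 70000, 999000]})
print(remove_outliers_iqr(salaries, 'Salary'))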

CONCLUSION: In the end, skrub has many uses for handling data in Python. It helps simplify and organize messy data, and it lets users clean data efficiently with minimum effort. This saves a lot of time while preserving accuracy and consistency. Without tools like skrub, cleaning large datasets manually can be slow and error-prone. skrub helps with tasks like filling missing values, fixing inconsistencies, and standardizing formats.

REFERENCES:

https://blog.stackademic.com/unleashing-the-power-of-skrub-revolutionizing-table-preparation-for-machine-learning-cdfe9dee8804

https://doc.dataiku.com/dss/latest/other_recipes/fuzzy-join.html

https://www.scribbr.com/statistics/outliers/