SKRUB: A Python library for cleaning, structuring, and visualizing tabular data
INTRODUCTION
How do you deal with messy data? Missing values, inconsistent formats, and unstructured information slow down data analysis, and this is where skrub comes into the picture. skrub is a powerful Python library designed to clean, structure, and prepare tabular data efficiently. In this guide, we'll explore its key features.
skrub makes cleaning, organizing, and visualizing messy tables easier and faster:
Data Cleaning – Removes inconsistencies, trims spaces, and fixes column names.
Handling Missing Data – Easily fills missing values.
Dataset Merging – Intelligently links datasets, even with slight variations.
Quick Insights – Generates structured data for visualization & analysis.
WHY SKRUB?
Assembling Tables with Precision: Skrub excels at joining tables on keys of different types, including string, numerical, and datetime, with an impressive ability to handle imprecise correspondences.
Fuzzy Joining for Seamless Integration: skrub selects the type of fuzzy matching based on column types and produces a similarity score, making it easy to identify less-than-perfect matches.
Advanced Analysis Made Simple: Skrub takes table joining to the next level with features like Joiner, AggJoiner, and AggTarget.
Efficient Column Selection in Pipelines: Apart from joins, skrub also facilitates column selection within a pipeline, allowing data scientists to choose and discard columns dynamically (a minimal sketch follows this list).
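To illustrate that last point, here is a minimal sketch of dynamic column selection inside a scikit-learn pipeline. It assumes skrub's SelectCols transformer (available in recent releases) together with TableVectorizer; the toy table and column names are made up for illustration:

import pandas as pd
from sklearn.pipeline import make_pipeline
from skrub import SelectCols, TableVectorizer

# Hypothetical toy table
df = pd.DataFrame({'Name': ['Alice', 'Bob'],
                   'Age': [25, 30],
                   'City': ['Paris', 'Rome']})

# Keep only the columns we care about, then vectorize them
pipe = make_pipeline(SelectCols(['Age', 'City']), TableVectorizer())
features = pipe.fit_transform(df)
print(features)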
INSTALLATION PROCESS
pip install skrub
Requirement already satisfied: skrub in c:\users\nihar\appdata\local\programs\python\python312\lib\site-packages (0.5.1)
Note: you may need to restart the kernel to use updated packages.
pip install --upgrade skrub
USING SKRUB
1. HANDLING MISSING DATA
One of the most common issues when cleaning data is dealing with missing or NaN values. With a few lines of code combining pandas, scikit-learn, and skrub, you can fill or drop these missing values.
For example, to fill the numerical Age column with its mean and the categorical columns with a placeholder, we can use the following code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from skrub import TableVectorizer

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', None, 'Eve'],
        'Age': [25, None, 22, 29],
        'City': ['New York', 'Paris', None, None]}

# Create a DataFrame
df = pd.DataFrame(data)
print("Original Data:")
print(df)

# Fill missing values (avoids the pandas FutureWarning on chained assignment)
df['Age'] = df['Age'].fillna(df['Age'].mean())   # fill missing Age with the mean
df['Name'] = df['Name'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')
print("\nData After Filling Missing Values:")
print(df)

# Standardize the 'Age' column
scaler = StandardScaler()
df['Age'] = scaler.fit_transform(df[['Age']])
print("\nData After Standardizing 'Age' Column:")
print(df)
Original Data:
Name Age City
0 Alice 25.0 New York
1 Bob NaN Paris
2 None 22.0 None
3 Eve 29.0 None
Data After Filling Missing Values:
Name Age City
0 Alice 25.000000 New York
1 Bob 25.333333 Paris
2 Unknown 22.000000 Unknown
3 Eve 29.000000 Unknown
Data After Standardizing 'Age' Column:
Name Age City
0 Alice -0.134231 New York
1 Bob 0.000000 Paris
2 Unknown -1.342312 Unknown
3 Eve 1.476543 Unknown
The missing Age value is replaced with the column mean, and the missing Name and City values are replaced with the string 'Unknown'.
In the next step, TableVectorizer handles the categorical features and converts them into binary features (one-hot encoding), so the cleaned data becomes fully numerical; its output is shown in the next section.
2. STANDARDIZING DATA
Alongside skrub, it is often useful to standardize numerical data so that values are on a consistent scale across the dataset. For example, if we have an Age column with a large range, we can standardize it to zero mean and unit variance (as StandardScaler does below) or normalize it to the 0-1 range (a min-max sketch follows this example's output).
Analyze the code below:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from skrub import TableVectorizer

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', None, 'Eve'],
        'Age': [25, None, 22, 29],
        'City': ['New York', 'Paris', None, None]}

# Create a DataFrame
df = pd.DataFrame(data)
print("Original Data:")
print(df)

# Fill missing values (avoids the pandas FutureWarning on chained assignment)
df['Age'] = df['Age'].fillna(df['Age'].mean())   # fill missing Age with the mean
df['Name'] = df['Name'].fillna('Unknown')
df['City'] = df['City'].fillna('Unknown')
print("\nData After Filling Missing Values:")
print(df)

# Standardize the 'Age' column
scaler = StandardScaler()
df['Age'] = scaler.fit_transform(df[['Age']])
print("\nData After Standardizing 'Age' Column:")
print(df)

# Vectorize the table (convert categorical data into numerical features)
vectorizer = TableVectorizer()
df_vectorized = vectorizer.fit_transform(df)
print("\nVectorized Data:")
print(df_vectorized)
Original Data:
Name Age City
0 Alice 25.0 New York
1 Bob NaN Paris
2 None 22.0 None
3 Eve 29.0 None
Data After Filling Missing Values:
Name Age City
0 Alice 25.000000 New York
1 Bob 25.333333 Paris
2 Unknown 22.000000 Unknown
3 Eve 29.000000 Unknown
Data After Standardizing 'Age' Column:
Name Age City
0 Alice -0.134231 New York
1 Bob 0.000000 Paris
2 Unknown -1.342312 Unknown
3 Eve 1.476543 Unknown
Vectorized Data:
Name_Alice Name_Bob Name_Eve Name_Unknown Age City_New York \
0 1.0 0.0 0.0 0.0 -0.134231 1.0
1 0.0 1.0 0.0 0.0 0.000000 0.0
2 0.0 0.0 0.0 1.0 -1.342312 0.0
3 0.0 0.0 1.0 0.0 1.476543 0.0
City_Paris City_Unknown
0 0.0 0.0
1 1.0 0.0
2 0.0 1.0
3 0.0 1.0
The code fills missing values in the Age, Name, and City columns, then standardizes the Age column using StandardScaler. Afterward, it uses TableVectorizer to convert the entire DataFrame into numerical features for machine learning.
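If, as mentioned earlier, you prefer to scale Age into the 0-1 range rather than to zero mean and unit variance, scikit-learn's MinMaxScaler can be swapped in for StandardScaler. A minimal sketch, reusing the same toy data after the missing values have been filled:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Unknown', 'Eve'],
                   'Age': [25.0, 25.333333, 22.0, 29.0],
                   'City': ['New York', 'Paris', 'Unknown', 'Unknown']})

# Rescale Age so the smallest value maps to 0 and the largest to 1
scaler = MinMaxScaler()
df['Age'] = scaler.fit_transform(df[['Age']])
print(df)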
3. TRANSFORMING CATEGORICAL VARIABLES
Categorical variables, such as names of cities or product categories, must be encoded into numerical values before they can be used in machine learning algorithms.
Skrub makes this easy with the TableVectorizer, which handles one-hot encoding and other encoding methods automatically.
Observe the code below:
# Use TableVectorizer to transform categorical columns
vectorizer = TableVectorizer()
df_transformed = vectorizer.fit_transform(df)
print("\nTransformed Data (Categorical Encoded):")
print(df_transformed)
The code uses TableVectorizer to change text data (like 'Name' and 'City') into numbers. It creates new columns for each category, with 1 or 0 to show whether that category is present. This turns the data into a form that machine learning algorithms can work with.
4. MERGING SIMILAR TEXT DATA
If we have messy text (for example "NewYork" vs "New Yor"), skrub can match and merge the entries as shown below:
from skrub import fuzzy_join
import pandas as pd

# Sample data: customer names with typos
df1 = pd.DataFrame({'Name': ['John Doe', 'Alice W.', 'Eve Smith']})
df2 = pd.DataFrame({'Name': ['Jon Doe', 'Eve s.', 'Alice '],
                    'Purchase': [100, 200, 150]})

# Use skrub to match similar names
merged_df = fuzzy_join(df1, df2, on='Name')
print("\nFuzzy Matched Data:\n", merged_df)
Fuzzy Matched Data:
Name Name__skrub_d4fa04a9__ Purchase
0 John Doe Jon Doe 100
1 Alice W. Alice 150
2 Eve Smith Eve s. 200
Fuzzy joining is dedicated to joins between two tables whose keys do not match exactly.
It works by computing a similarity (or distance) between keys and comparing it to a threshold, so each row is matched to its closest counterpart and less-than-perfect matches can be identified or rejected.
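As a sketch of how such a threshold can be applied: recent skrub versions expose a max_dist parameter on fuzzy_join (this is an assumption about the installed version's API; check the documentation of your release). Candidate rows whose closest match is too far away are then left unmatched instead of being paired up:

# Sketch only: max_dist is assumed to be available on fuzzy_join in this skrub version
merged_strict = fuzzy_join(df1, df2, on='Name', max_dist=0.9)
print(merged_strict)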
5. DETECTING AND FILTERING OUTLIERS
There are 4 ways to detect outliers:
1. Sorting method
2. Data visualization method
3. Statistical tests (z-scores)
4. Interquartile range method
To understand this in detail, we will use the Interquartile Range (IQR) method.
First of all, let's analyse the code below:
import pandas as pd

# Sample data with an outlier
data = {'Salary': [30000, 50000, 70000, 999000]}

# Create DataFrame
df = pd.DataFrame(data)

# Detect and remove outliers using the IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_cleaned = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

print("Original Data:")
print(df)
print("\nCleaned Data (Outliers Removed):")
print(df_cleaned)
The interquartile range (IQR) tells you the range of the middle half of your dataset. You can use the IQR to create "fences" around your data and then define outliers as any values that fall outside those fences.
This method is helpful if you have a few values on the extreme ends of your dataset. In the example above, the 999000 salary lies far beyond the upper fence, so it is the row that gets removed.
Interquartile range method
Sort your data from low to high
Identify the first quartile (Q1), the median, and the third quartile (Q3).
Calculate your IQR = Q3 – Q1
Calculate your upper fence = Q3 + (1.5 * IQR)
Calculate your lower fence = Q1 – (1.5 * IQR)
Use your fences to highlight any outliers: any value greater than your upper fence or less than your lower fence is an outlier.
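The steps above can be wrapped in a small reusable helper. This is just a sketch; remove_outliers_iqr is a hypothetical name, not part of skrub or pandas:

import pandas as pd

def remove_outliers_iqr(df, column):
    # Hypothetical helper: keeps only the rows whose values fall inside the IQR fences
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_fence = Q1 - 1.5 * IQR
    upper_fence = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_fence) & (df[column] <= upper_fence)]

# Usage on the salary example from above
salaries = pd.DataFrame({'Salary': [30000, 50000, 70000, 999000]})
print(remove_outliers_iqr(salaries, 'Salary'))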
CONCLUSION: In the end, skrub has many uses for handling data in Python. It helps simplify and organize messy data, and it lets users clean data efficiently with minimal effort. This saves a lot of time while preserving accuracy and consistency. Without tools like skrub, cleaning large datasets manually can be slow and error-prone. Skrub helps with tasks like filling missing values, fixing inconsistencies, and standardizing formats.