Task¶
Please analyze the used car dataset "cars.csv" to build a price prediction model. Follow these steps:
- Setup and EDA: Initialize the notebook, configure GPU settings, and load "cars.csv". Perform EDA, including a heatmap crosstab of `kb_body_type` vs `gb_body_type`, and identify quasi-constant columns using `feature_engine.selection.DropConstantFeatures(tol=0.99)`.
- Preprocessing: Handle 0 CC and HP anomalies, calculate boundary means (`power_hp_val` and `engine_cc_val`), drop redundant columns, and apply hierarchical median imputation for technical features.
- Feature Engineering: Write a fully English-documented `preprocess_damage_score` function to aggregate expertise columns into an `expert_risk_score`, and create derived features like vehicle age.
- Baseline Modeling: Establish a cross-validation framework and train a GPU-accelerated CatBoost baseline model, explicitly assigning the `model` column as a text feature.
- Outlier Analysis: Detect anomalies using Z-scores on CatBoost residuals and Isolation Forest. Consolidate the flags into a single decision DataFrame and create an interactive Plotly chart showing which car models are most frequently flagged as outliers.
- LOFO & Ablation Study: Implement a well-documented custom LOFO function for feature selection. Then, perform an ablation study by training separate models using only 'brand', only 'series', and only 'model' (as a text feature) to compare their individual impacts.
- Final Refined Model: Train the final CatBoost model on the outlier-cleaned, cross-validated dataset using the optimal features from the LOFO and ablation study. Output comparison metrics and save the model artifacts.
Ensure the notebook uses clear Markdown hierarchy and English docstrings for all functions.
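Before any of the pipeline exists, the outlier-analysis step (Z-scores on residuals, consolidated into a decision DataFrame) can be prototyped in miniature. The residuals below are synthetic stand-ins for CatBoost residuals, and the 3-sigma threshold is an assumption for illustration, not a value fixed by the task.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for CatBoost residuals: 500 well-priced cars
# plus two badly mispriced ones (hypothetical data).
rng = np.random.default_rng(42)
residuals = pd.Series(np.r_[rng.normal(0.0, 1.0, 500), [8.0, -9.0]])

# Z-score flag: |z| > 3 marks a listing the model badly mispriced
z = (residuals - residuals.mean()) / residuals.std()
zscore_flag = z.abs() > 3

# Consolidate the flags into a single decision DataFrame, as the task describes
decisions = pd.DataFrame({'residual': residuals, 'zscore_flag': zscore_flag})
print(decisions['zscore_flag'].iloc[-2:].tolist())  # → [True, True]
```

In the full notebook the same DataFrame would also carry an Isolation Forest flag, so the two detectors can be compared row by row.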
Reasoning:
Install the required package feature_engine as instructed.
!pip install feature_engine
Requirement already satisfied: feature_engine in /usr/local/lib/python3.12/dist-packages (1.9.4)
!pip install catboost
Requirement already satisfied: catboost in /usr/local/lib/python3.12/dist-packages (1.2.10)
!pip install --upgrade plotly
Requirement already satisfied: plotly in /usr/local/lib/python3.12/dist-packages (6.6.0)
Reasoning: Import the required libraries, set the random seed for reproducibility, configure GPU parameters for CatBoost, and load the dataset into a pandas DataFrame.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor, Pool
import sklearn
from feature_engine.selection import DropConstantFeatures
import os
import warnings
warnings.filterwarnings("ignore")
# Set random seed
np.random.seed(42)
# Configure CatBoost GPU parameters
gpu_params = {
'task_type': 'GPU',
'devices': '0',
'bootstrap_type': 'Poisson'
}
# Load data
df_raw = pd.read_csv('cars.csv')
print(f"Dataset shape: {df_raw.shape}")
Dataset shape: (13904, 94)
Reasoning:
I will create a code block to calculate the crosstab between kb_body_type and gb_body_type, visualize it using a seaborn heatmap, and use DropConstantFeatures to identify quasi-constant columns.
# Crosstab and Heatmap
plt.figure(figsize=(10, 8))
body_type_crosstab = pd.crosstab(df_raw['kb_body_type'], df_raw['gb_body_type'])
sns.heatmap(body_type_crosstab, annot=True, fmt='d', cmap='Blues')
plt.title('Crosstab of kb_body_type vs gb_body_type')
plt.xlabel('gb_body_type')
plt.ylabel('kb_body_type')
plt.show()
# Identify Quasi-constant features
constant_dropper = DropConstantFeatures(tol=0.99, missing_values='ignore')
constant_dropper.fit(df_raw)
print(f"Quasi-constant columns to drop (tol=0.99): {constant_dropper.features_to_drop_}")
Quasi-constant columns to drop (tol=0.99): ['kb_condition', 'gb_usage_type', 'gb_is_first_owner', 'tavan_degisen', 'tavan_boyali', 'tavan_lokal', 'bagaj_degisen', 'bagaj_boyali', 'bagaj_lokal', 'fender_rl_degisen', 'fender_rr_degisen']
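`DropConstantFeatures(tol=0.99)` drops any column whose most frequent value accounts for at least 99% of its observations. A minimal pandas re-implementation (illustrative only; the notebook itself relies on feature_engine) makes the semantics concrete:

```python
import pandas as pd

def quasi_constant_columns(df, tol=0.99):
    """Columns whose most frequent value covers >= tol of their non-missing rows."""
    drop = []
    for col in df.columns:
        freq = df[col].value_counts(normalize=True, dropna=True)
        if not freq.empty and freq.iloc[0] >= tol:
            drop.append(col)
    return drop

# Toy frame: 'a' is 99% one value, 'b' is fully unique
demo = pd.DataFrame({'a': [1] * 99 + [2], 'b': range(100)})
print(quasi_constant_columns(demo, tol=0.99))  # → ['a']
```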
"I am dropping the variables kb_condition, gb_usage_type, and gb_is_first_owner, while keeping the remaining features for damage scoring. Regarding body type, I will initially prioritize kb_body_type due to its more generalized structure; however, I plan to perform a LOFO (Leave-One-Feature-Out) importance analysis to finalise the selection between the two."
print("Checking for identical columns in the dataset...")
# Dictionary to store groups of identical columns
identical_groups = {}
cols = df_raw.columns.tolist()
# Track columns already assigned to a group to avoid redundant checks
checked = set()
for i in range(len(cols)):
    col1 = cols[i]
    if col1 in checked:
        continue
    current_group = [col1]
    for j in range(i + 1, len(cols)):
        col2 = cols[j]
        if col2 in checked:
            continue
        # Check if columns are equal (pandas .equals() handles NaNs correctly)
        if df_raw[col1].equals(df_raw[col2]):
            current_group.append(col2)
            checked.add(col2)
    if len(current_group) > 1:
        identical_groups[col1] = current_group
if identical_groups:
    print("\nFound identical columns:")
    for key, group in identical_groups.items():
        print(f"  - Group: {group}")
else:
    print("\nNo perfectly identical columns found.")
Checking for identical columns in the dataset...
Found identical columns:
  - Group: ['kb_year', 'gb_year']
  - Group: ['kb_mileage', 'gb_mileage']
  - Group: ['kb_transmission', 'gb_transmission']
  - Group: ['kb_fuel', 'gb_fuel']
  - Group: ['kb_color', 'gb_color']
  - Group: ['kb_is_heavy_damaged', 'is_heavy_damaged']
  - Group: ['bagaj_degisen', 'bagaj_boyali', 'bagaj_lokal']
Phase 2: Drop Columns and Fill the Missing Values with Group-Based Imputation¶
During our initial data validation, we identified several pairs of features containing perfectly identical data across all rows (such as kb_year vs. gb_year, and kb_transmission vs. gb_transmission). These duplicates likely originate from overlapping data fields in the scraped source website (e.g., general listing specifications versus detailed technical tabs). Retaining these identical columns provides zero additional information to the predictive model and introduces perfect multicollinearity. By systematically removing one column from each duplicate pair, we streamline the feature space, reduce computational overhead, and eliminate redundant noise, ensuring our machine learning model trains more efficiently and robustly.
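The removal described above can be expressed as a small helper that keeps the first column of each identical group (a sketch with invented data; the notebook itself drops the kb_* duplicates explicitly in the next cell):

```python
import pandas as pd

def drop_identical_duplicates(df):
    """Keep only the first column of each group of perfectly identical columns
    (pandas .equals() treats NaNs in the same positions as equal)."""
    kept = []
    for col in df.columns:
        if not any(df[col].equals(df[k]) for k in kept):
            kept.append(col)
    return df[kept]

demo = pd.DataFrame({'kb_year': [2019, 2020], 'gb_year': [2019, 2020], 'price': [10, 12]})
print(list(drop_identical_duplicates(demo).columns))  # → ['kb_year', 'price']
```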
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# 1. Plot for Horsepower
sns.scatterplot(data=df_raw, x='power_hp_low', y='power_hp_up',
alpha=0.5, color='#3498db', s=20, ax=axes[0])
axes[0].set_title('Distribution of HP Lower vs Upper Bounds (Before Imputation)')
axes[0].set_xlabel('HP Lower Bound')
axes[0].set_ylabel('HP Upper Bound')
# 2. Plot for Engine CC
sns.scatterplot(data=df_raw, x='engine_cc_low', y='engine_cc_up',
alpha=0.5, color='#e74c3c', s=20, ax=axes[1])
axes[1].set_title('Distribution of Engine CC Lower vs Upper Bounds (Before Imputation)')
axes[1].set_xlabel('Engine CC Lower Bound')
axes[1].set_ylabel('Engine CC Upper Bound')
plt.tight_layout()
plt.show()
import seaborn as sns
import matplotlib.pyplot as plt
# Check that the required columns exist
if 'torque_nm' in df_raw.columns and 'power_hp_val' in df_raw.columns:
    # Configure the plot theme and figure size
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(10, 6))
    # Create the scatter plot
    scatter = sns.scatterplot(
        data=df_raw,
        x='power_hp_val',
        y='torque_nm',
        alpha=0.6,
        edgecolor=None
    )
    # Add title and axis labels
    plt.title('Torque (Nm) vs. Horsepower (HP)', fontsize=15)
    plt.xlabel('Horsepower (HP)')
    plt.ylabel('Torque (Nm)')
    # Note: Plotly's interactive hover_data has no static equivalent
    # in seaborn/matplotlib.
    plt.show()
    # Compute the correlation with pandas
    correlation = df_raw['power_hp_val'].corr(df_raw['torque_nm'])
    print(f"\nPearson Correlation between Horsepower and Torque: {correlation:.4f}")
else:
    print("One or both of 'torque_nm' and 'power_hp_val' columns are not available in the DataFrame.")
Pearson Correlation between Horsepower and Torque: 0.7623
We are going to use the upper boundary to derive engine_cc_val.
# 1. Calculate boundary means
df_raw['power_hp_val'] = df_raw[['power_hp_low', 'power_hp_up']].mean(axis=1)
df_raw['engine_cc_val'] = df_raw['engine_cc_up']
print(f"Remaining missing values in power_hp_val after boundary mean: {df_raw['power_hp_val'].isnull().sum()}")
print(f"Remaining missing values in engine_cc_val after boundary mean: {df_raw['engine_cc_val'].isnull().sum()}")
# 2. Drop quasi-constant, boundary, info-only, and identical-duplicate columns
boundary_cols = ['power_hp_low', 'power_hp_up', 'engine_cc_low', 'engine_cc_up']
quasi_constant_cols = ['kb_condition', 'gb_usage_type', 'gb_is_first_owner']
info_cols = ['ad_id', 'listing_date', 'ad_title', 'location', 'power_hp_is_range', 'description_text', 'scraped_at', 'search_date']
identical_cols = ['kb_year', 'kb_mileage', 'kb_transmission', 'kb_fuel', 'kb_color', 'kb_is_heavy_damaged']
cols_to_drop = list(set(quasi_constant_cols + boundary_cols + info_cols + identical_cols))
cols_to_drop = [c for c in cols_to_drop if c in df_raw.columns]
df_raw = df_raw.drop(columns=cols_to_drop)
print(f"Dropped {len(cols_to_drop)} columns.\n")
print(f"Dataset shape after dropped features: {df_raw.shape}")
Remaining missing values in power_hp_val after boundary mean: 218
Remaining missing values in engine_cc_val after boundary mean: 185
Dropped 21 columns.
Dataset shape after dropped features: (13904, 73)
Phase 2b: Looking for Missing Value Patterns¶
import matplotlib.pyplot as plt
import seaborn as sns
print("\n--- Missing Values Heatmap ---")
plt.figure(figsize=(15, 8))
sns.heatmap(df_raw.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in DataFrame')
plt.show()
--- Missing Values Heatmap ---
print("--- Missing Values Analysis ---")
# Calculate missing counts and percentages
missing_values = df_raw.isnull().sum()
missing_percent = (missing_values / len(df_raw)) * 100
# Build a summary DataFrame for the missing values
missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percent': missing_percent
}).sort_values(by='Missing Percent', ascending=False)
# Keep only columns that actually have missing values
missing_info = missing_info[missing_info['Missing Count'] > 0]
# Print the top 10 rows
print("\n--- Top 10 Columns with Missing Values ---")
print(missing_info.head(10).to_string())
# Visualize with seaborn
if not missing_info.empty:
    # Configure the plot theme and figure size
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(12, 6))
    # Take the 20 columns with the most missing data
    top_20_missing = missing_info.head(20)
    # Bar plot: column names on x, percentages on y
    ax = sns.barplot(
        x=top_20_missing.index,
        y=top_20_missing['Missing Percent'],
        palette='viridis'  # Color scale
    )
    # Title and labels
    plt.title('Top 20 Columns with Missing Values (%)', fontsize=16)
    plt.xlabel('Column Name', fontsize=12)
    plt.ylabel('Missing Percentage (%)', fontsize=12)
    # Rotate x tick labels (column names) for readability
    plt.xticks(rotation=45, ha='right')
    # Optionally annotate bars with their percentage values:
    # for p in ax.patches:
    #     ax.annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
    #                 ha='center', va='center', xytext=(0, 9), textcoords='offset points')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found in the dataset.")
--- Missing Values Analysis ---
--- Top 10 Columns with Missing Values ---
Missing Count Missing Percent
tramer_fee 11778 84.709436
gb_mtv_yearly 5590 40.204258
kb_fuel_cons_avg 5435 39.089471
city_fuel_cons 4328 31.127733
highway_fuel_cons 4328 31.127733
weight_kg 3800 27.330265
kb_fuel_tank 3784 27.215190
trunk_capacity_lt 3784 27.215190
accel_0_100 3776 27.157652
torque_nm 3775 27.150460
# Calculate missing values and percentages
missing_values = df_raw.isnull().sum()
missing_percent = (missing_values / len(df_raw)) * 100
# Build a summary DataFrame and keep only columns with missing values
missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percent': missing_percent
}).sort_values(by='Missing Percent', ascending=False)
missing_info = missing_info[missing_info['Missing Count'] > 0]
print("\n--- Systemic Missingness Patterns (Columns Grouped by Similar Missing %) ---")
# Group columns by rounded missing percentage to identify block-wise (systemic) missingness;
# rounding to 2 decimal places catches near-identical percentages
missing_groups = missing_info.groupby(missing_info['Missing Percent'].round(2)).apply(
    lambda g: list(g.index)).sort_index(ascending=False)
for percent, columns in missing_groups.items():
    # Only print multi-column groups, or single columns with very high missingness
    if len(columns) > 1 or percent >= 20.0:
        print(f"  Missing ~{percent:.2f}%: {columns}")
# Explicitly highlight the 5 columns with the highest missing rates, as they are clearly systemic
print("\n--- Top 5 Columns with Highest Missing Percentages (most systemic) ---")
for index, row in missing_info.head(5).iterrows():
    print(f"  - {index}: {row['Missing Percent']:.2f}% (Count: {int(row['Missing Count'])})")
--- Systemic Missingness Patterns (Columns Grouped by Similar Missing %) ---
  Missing ~84.71%: ['tramer_fee']
  Missing ~40.20%: ['gb_mtv_yearly']
  Missing ~39.09%: ['kb_fuel_cons_avg']
  Missing ~31.13%: ['city_fuel_cons', 'highway_fuel_cons']
  Missing ~27.33%: ['weight_kg']
  Missing ~27.22%: ['kb_fuel_tank', 'trunk_capacity_lt']
  Missing ~27.16%: ['accel_0_100']
  Missing ~27.15%: ['torque_nm']
  Missing ~27.14%: ['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment', 'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']
  Missing ~21.65%: ['gb_warranty_status']
--- Top 5 Columns with Highest Missing Percentages (most systemic) ---
  - tramer_fee: 84.71% (Count: 11778)
  - gb_mtv_yearly: 40.20% (Count: 5590)
  - kb_fuel_cons_avg: 39.09% (Count: 5435)
  - city_fuel_cons: 31.13% (Count: 4328)
  - highway_fuel_cons: 31.13% (Count: 4328)
print("\n--- Missing Values Heatmap ---")
plt.figure(figsize=(15, 8))
sns.heatmap(df_raw[['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment', 'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']].isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in DataFrame')
plt.show()
--- Missing Values Heatmap ---
The same rows are empty across all of these technical-spec columns: the missingness is systemic, not random, and most likely corresponds to listings where the source website's technical specification block was absent entirely.
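Because the spec block is absent for the same rows, a single binary indicator can encode the pattern before imputation overwrites it. A minimal sketch with a toy frame (column names mirror the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

tech_cols = ['wheelbase_mm', 'width_mm', 'length_mm']  # subset, for illustration
demo = pd.DataFrame({
    'wheelbase_mm': [2700.0, np.nan, 2600.0],
    'width_mm':     [1800.0, np.nan, 1750.0],
    'length_mm':    [4500.0, np.nan, 4300.0],
})
# 1 where the entire technical-spec block is missing, else 0
demo['missing_tech_specs'] = demo[tech_cols].isnull().all(axis=1).astype(int)
print(demo['missing_tech_specs'].tolist())  # → [0, 1, 0]
```

The resulting flag can later be fed to the model so it can learn whether spec-less listings price differently.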
print("\n--- Distribution Plots for ALL Features ---")
# Identify numerical and categorical columns from df_raw
all_num_cols = df_raw.select_dtypes(include=[np.number]).columns.tolist()
num_features_to_plot = all_num_cols
all_cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
# Exclude high-cardinality text features (model, series, brand, gb_color) from countplots
cat_features_to_plot = [col for col in all_cat_cols if col not in ['model', 'series', 'brand', 'gb_color']]
# --- Plot Numerical Features ---
if num_features_to_plot:
    print("\n--- Numerical Feature Distributions ---")
    n_num = len(num_features_to_plot)
    n_cols = 4  # Number of columns for the subplot grid
    n_rows = (n_num + n_cols - 1) // n_cols  # Calculate rows needed
    fig_num, axes_num = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4))
    axes_num = axes_num.flatten()  # Flatten for easy iteration
    for i, col in enumerate(num_features_to_plot):
        sns.histplot(df_raw[col].dropna(), kde=True, ax=axes_num[i], color='#3498db')
        axes_num[i].set_title(f'Distribution of {col}')
        axes_num[i].set_xlabel('')  # Clear labels to avoid clutter
        axes_num[i].set_ylabel('')
    # Hide any unused subplots
    for j in range(i + 1, len(axes_num)):
        fig_num.delaxes(axes_num[j])
    plt.tight_layout()
    plt.suptitle('Numerical Feature Distributions', y=1.02, fontsize=16)
    plt.show()
else:
    print("No numerical features to plot.")
# --- Plot Categorical Features ---
if cat_features_to_plot:
    print("\n--- Categorical Feature Distributions ---")
    n_cat = len(cat_features_to_plot)
    n_cols = 3  # Number of columns for the subplot grid
    n_rows = (n_cat + n_cols - 1) // n_cols  # Calculate rows needed
    fig_cat, axes_cat = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 5))
    axes_cat = axes_cat.flatten()  # Flatten for easy iteration
    for i, col in enumerate(cat_features_to_plot):
        sns.countplot(y=df_raw[col].dropna(), ax=axes_cat[i], order=df_raw[col].value_counts().index, palette='viridis')
        axes_cat[i].set_title(f'Distribution of {col}')
        axes_cat[i].set_xlabel('')
        axes_cat[i].set_ylabel('')
    # Hide any unused subplots
    for j in range(i + 1, len(axes_cat)):
        fig_cat.delaxes(axes_cat[j])
    plt.tight_layout()
    plt.suptitle('Categorical Feature Distributions', y=1.02, fontsize=16)
    plt.show()
else:
    print("No categorical features to plot.")
--- Distribution Plots for ALL Features ---
--- Numerical Feature Distributions ---
--- Categorical Feature Distributions ---
df_raw['cylinder_count'].value_counts()
# Convert into the actual cylinder count (values are stored x10)
| cylinder_count | count |
|---|---|
| 40.0 | 8516 |
| 30.0 | 1339 |
| 60.0 | 245 |
| 80.0 | 30 |
df_raw['cylinder_count'] = df_raw['cylinder_count'] / 10
df_raw['cylinder_count'].value_counts()
| cylinder_count | count |
|---|---|
| 4.0 | 8516 |
| 3.0 | 1339 |
| 6.0 | 245 |
| 8.0 | 30 |
# 4. Hierarchical median imputation for numeric columns with missing values
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns
cols_with_nan = [col for col in numeric_cols if df_raw[col].isnull().any()]
# Define the expanded hierarchy
hierarchy_levels = [
    (['model', 'gb_year'], 'Model & Year'),
    (['model'], 'Model'),
    (['series', 'gb_year'], 'Series & Year'),
    (['series'], 'Series'),
    (['brand', 'gb_year'], 'Brand & Year'),
    (['brand'], 'Brand')
]
for col in cols_with_nan:
    # Initialize tracking column with 'Original' for non-missing data
    df_raw[f'{col}_imputed'] = 'Original'
    for group_cols, level_name in hierarchy_levels:
        if all(c in df_raw.columns for c in group_cols):
            missing_before = df_raw[col].isnull()
            if missing_before.sum() == 0:
                continue
            # Calculate group median and fill
            df_raw[col] = df_raw.groupby(group_cols)[col].transform(lambda x: x.fillna(x.median()))
            # Find newly filled rows and update the tracker
            filled_mask = missing_before & ~df_raw[col].isnull()
            df_raw.loc[filled_mask, f'{col}_imputed'] = level_name
    # Final fallback: Overall median
    missing_before = df_raw[col].isnull()
    if missing_before.sum() > 0:
        df_raw[col] = df_raw[col].fillna(df_raw[col].median())
        filled_mask = missing_before & ~df_raw[col].isnull()
        df_raw.loc[filled_mask, f'{col}_imputed'] = 'Overall Median'
print(f"Dataset shape after preprocessing: {df_raw.shape}")
print(f"Remaining missing values in power_hp_val overall: {df_raw['power_hp_val'].isnull().sum()}")
print(f"Remaining missing values in engine_cc_val overall: {df_raw['engine_cc_val'].isnull().sum()}")
# Show logging summary for the main columns
if 'power_hp_val_imputed' in df_raw.columns:
    print("\n--- Imputation Summary for power_hp_val ---")
    print(df_raw['power_hp_val_imputed'].value_counts())
if 'engine_cc_val_imputed' in df_raw.columns:
    print("\n--- Imputation Summary for engine_cc_val ---")
    print(df_raw['engine_cc_val_imputed'].value_counts())
Dataset shape after preprocessing: (13904, 92)
Remaining missing values in power_hp_val overall: 0
Remaining missing values in engine_cc_val overall: 0
--- Imputation Summary for power_hp_val ---
power_hp_val_imputed
Original         13686
Model & Year       167
Model               45
Series & Year        6
Name: count, dtype: int64
--- Imputation Summary for engine_cc_val ---
engine_cc_val_imputed
Original         13719
Model & Year       144
Model               35
Series & Year        6
Name: count, dtype: int64
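The same walk-the-hierarchy idea generalizes to a reusable function. The sketch below shows the fill order on a toy frame: most specific grouping first, then coarser groups, then the overall median. The names and data are illustrative.

```python
import numpy as np
import pandas as pd

def hierarchical_median_impute(df, col, hierarchy):
    """Fill NaNs in `col` by group medians, walking from the most specific
    grouping to the least, then fall back to the overall median."""
    out = df.copy()
    for group_cols in hierarchy:
        out[col] = out.groupby(group_cols)[col].transform(
            lambda s: s.fillna(s.median()))
    out[col] = out[col].fillna(out[col].median())
    return out

demo = pd.DataFrame({
    'brand': ['A', 'A', 'A', 'B'],
    'model': ['m1', 'm1', 'm2', 'm3'],
    'hp':    [100.0, np.nan, 150.0, np.nan],
})
# The m1 gap is filled at the 'model' level; the m3 gap has no group data
# at any level and falls through to the overall median.
filled = hierarchical_median_impute(demo, 'hp', [['model'], ['brand']])
print(filled['hp'].tolist())  # → [100.0, 100.0, 150.0, 100.0]
```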
# Define the hierarchy levels for mode imputation
hierarchy_levels = [
    (['model', 'gb_year'], 'Model & Year'),
    (['model'], 'Model'),
    (['series', 'gb_year'], 'Series & Year'),
    (['series'], 'Series'),
    (['brand', 'gb_year'], 'Brand & Year'),
    (['brand'], 'Brand')
]
# Columns to apply hierarchical mode imputation to
cols_to_impute = ['gb_segment', 'kb_drivetrain']
print("\n--- Applying Hierarchical Mode Imputation to specified Categorical Columns ---")
for col in cols_to_impute:
    if col not in df_raw.columns:
        print(f"  Skipping '{col}': Column not found in DataFrame.")
        continue
    print(f"  Imputing '{col}'...")
    # Work on a copy to avoid SettingWithCopyWarning issues during transform
    temp_series = df_raw[col].copy()
    # Iterate through hierarchy levels for mode imputation
    for group_cols, level_name in hierarchy_levels:
        if all(c in df_raw.columns for c in group_cols):
            # Only try to fill if there are still NaNs in the temporary series
            if temp_series.isnull().any():
                # Use transform with mode; handle empty modes (e.g., all-NaN groups)
                filled_values = df_raw.groupby(group_cols)[col].transform(
                    lambda x: x.fillna(x.mode()[0] if not x.mode().empty else np.nan))
                # Only update the NaNs that this level can actually fill
                temp_series = temp_series.fillna(filled_values)
    # Final fallback: overall mode if any NaNs still remain
    if temp_series.isnull().any():
        overall_mode = df_raw[col].mode()[0] if not df_raw[col].mode().empty else 'Unknown'
        temp_series = temp_series.fillna(overall_mode)
    # Assign the imputed series back to the original DataFrame
    df_raw[col] = temp_series
print("\n--- Final Missing Values Check for Imputed Columns ---")
final_missing_check = df_raw[cols_to_impute].isnull().sum()
if final_missing_check.sum() == 0:
    print("All specified columns have been imputed successfully (no missing values remaining).")
else:
    print("Missing values still found in:")
    print(final_missing_check[final_missing_check > 0])
--- Applying Hierarchical Mode Imputation to specified Categorical Columns ---
  Imputing 'gb_segment'...
  Imputing 'kb_drivetrain'...
--- Final Missing Values Check for Imputed Columns ---
All specified columns have been imputed successfully (no missing values remaining).
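The categorical counterpart in miniature: a groupwise mode fill that leaves a group untouched when it has no observed values, so a coarser level (or the overall mode) can handle it later. Toy data; names are illustrative.

```python
import numpy as np
import pandas as pd

def groupwise_mode_fill(df, col, group_cols):
    """Fill NaNs in `col` with each group's modal value; groups with no
    observed values are returned unchanged for the next (coarser) level."""
    return df.groupby(group_cols)[col].transform(
        lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)

demo = pd.DataFrame({
    'brand':   ['A', 'A', 'A', 'B'],
    'segment': ['SUV', 'SUV', np.nan, np.nan],
})
# Brand A's gap gets the modal 'SUV'; brand B has no observed segment,
# so its NaN survives for a fallback step.
demo['segment'] = groupwise_mode_fill(demo, 'segment', ['brand'])
print(demo['segment'].tolist())
```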
import seaborn as sns
import matplotlib.pyplot as plt
# Sütunların varlığını kontrol et
if 'torque_nm' in df_raw.columns and 'power_hp_val' in df_raw.columns:
# Imputation durumunu belirleyen fonksiyon (Aynı mantık)
def get_impute_status(row):
t_imp = str(row.get('torque_nm_imputed', 'Original'))
p_imp = str(row.get('power_hp_val_imputed', 'Original'))
if t_imp == 'Original' and p_imp == 'Original':
return 'Original Data'
elif t_imp == p_imp:
return f'Both: {t_imp}'
elif t_imp == 'Original':
return f'HP: {p_imp}'
elif p_imp == 'Original':
return f'Torque: {t_imp}'
else:
return f'Mixed (HP:{p_imp}, Tq:{t_imp})'
df_raw['imputation_status'] = df_raw.apply(get_impute_status, axis=1)
# Grafik ayarları
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 7))
# Seaborn Scatter Plot
scatter = sns.scatterplot(
data=df_raw,
x='power_hp_val',
y='torque_nm',
hue='imputation_status', # Renklendirme burada yapılıyor
alpha=0.6,
palette='viridis', # Renk paleti (Opsiyonel: 'tab10', 'magma' vb.)
edgecolor=None
)
# Başlık ve Etiketler
plt.title('Torque (Nm) vs. Horsepower (HP)', fontsize=15)
plt.xlabel('Horsepower (HP)')
plt.ylabel('Torque (Nm)')
# Legend (Açıklama) kutusunu dışarıya veya uygun bir yere taşıma
plt.legend(title='Imputation Hierarchy', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
# Compute the correlation
correlation = df_raw['power_hp_val'].corr(df_raw['torque_nm'])
print(f"\nPearson Correlation between Horsepower and Torque: {correlation:.4f}")
else:
print("One or both of 'torque_nm' and 'power_hp_val' columns are not available.")
Pearson Correlation between Horsepower and Torque: 0.7414
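As a side check on this relationship, a rank-based coefficient is less sensitive to rounded, banded torque values than Pearson. A small synthetic sketch (the `hp`/`torque` arrays below are made up, not the dataset):

```python
import numpy as np
from scipy import stats

# Synthetic HP values with torque snapped onto 50 Nm bands, mimicking
# the discretized torque readings visible in the scatter plot.
rng = np.random.default_rng(1)
hp = rng.uniform(60, 400, size=1000)
torque = np.round(hp * 1.4 / 50) * 50  # banded but monotone in hp

pearson = stats.pearsonr(hp, torque)[0]
spearman = stats.spearmanr(hp, torque)[0]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Both coefficients stay high for a monotone banded relationship; a large gap between them would instead suggest non-linearity or heavy outliers.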
# import pandas as pd
# import plotly.express as px
# # Create a simplified binary flag for Torque Imputation
# df_raw['is_torque_imputed'] = df_raw['torque_nm_imputed'].apply(lambda x: 'Original Data' if str(x) == 'Original' else 'Imputed Data')
# # Analyze the most frequent Torque values (the horizontal bands)
# print("--- Top 5 Most Frequent Torque Values (The Bands) ---")
# top_torques = df_raw['torque_nm'].value_counts().head(5)
# print(top_torques)
# print("\n--- Imputation Breakdown for these Top Bands ---")
# band_analysis = df_raw[df_raw['torque_nm'].isin(top_torques.index)].groupby(['torque_nm', 'is_torque_imputed']).size().unstack(fill_value=0)
# display(band_analysis)
# # Visual proof: Side-by-side scatter plot
# fig = px.scatter(
# df_raw,
# x='power_hp_val',
# y='torque_nm',
# color='is_torque_imputed',
# facet_col='is_torque_imputed',
# title='Causation Check: Are the horizontal bands caused by imputation?',
# labels={'power_hp_val': 'Horsepower (HP)', 'torque_nm': 'Torque (Nm)'},
# opacity=0.3,
# color_discrete_map={'Original Data': '#3498db', 'Imputed Data': '#e74c3c'}
# )
# fig.update_layout(template='plotly_white')
# fig.show()
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# 1. Data preparation (same logic as the Plotly version)
df_raw['is_torque_imputed'] = df_raw['torque_nm_imputed'].apply(
lambda x: 'Original Data' if str(x) == 'Original' else 'Imputed Data'
)
# 2. Analysis (same logic)
print("--- Top 5 Most Frequent Torque Values (The Bands) ---")
top_torques = df_raw['torque_nm'].value_counts().head(5)
print(top_torques)
print("\n--- Imputation Breakdown for these Top Bands ---")
band_analysis = df_raw[df_raw['torque_nm'].isin(top_torques.index)].groupby(['torque_nm', 'is_torque_imputed']).size().unstack(fill_value=0)
print(band_analysis) # 'print' instead of 'display' for terminal output
# 3. Side-by-side (facet) scatter plot with Seaborn
sns.set_theme(style="whitegrid")
# The 'col' parameter does the same job as Plotly's 'facet_col'
g = sns.relplot(
data=df_raw,
x='power_hp_val',
y='torque_nm',
col='is_torque_imputed',
hue='is_torque_imputed',
palette={'Original Data': '#3498db', 'Imputed Data': '#e74c3c'},
alpha=0.3,
kind='scatter',
height=5,
aspect=1.2
)
# Title and label adjustments
g.set_axis_labels("Horsepower (HP)", "Torque (Nm)")
g.fig.suptitle('Causation Check: Are the horizontal bands caused by imputation?', y=1.05, fontsize=14)
plt.show()
--- Top 5 Most Frequent Torque Values (The Bands) ---
torque_nm
250.0    5630
380.0    1517
220.0    1457
400.0    1214
320.0     479
Name: count, dtype: int64

--- Imputation Breakdown for these Top Bands ---
is_torque_imputed  Imputed Data  Original Data
torque_nm
220.0                       203           1254
250.0                      1885           3745
320.0                       210            269
380.0                       484           1033
400.0                       317            897
The horizontal banding at specific torque levels (e.g., 220, 250, 400 Nm) is a clear artifact of the hierarchical imputation. Fixing it properly would take too long, so instead of correcting it here I will let the LOFO feature-selection stage decide whether the affected feature is worth keeping.
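Rather than re-imputing, one cheap alternative worth noting is to expose the imputation flag itself as a feature, so LOFO can judge its value directly. A minimal sketch on a toy frame (the column names mirror the `_imputed` tracking columns above, but the rows are made up):

```python
import pandas as pd

# Toy frame standing in for df_raw; 'torque_nm_imputed' is the tracking
# column written during hierarchical imputation ('Original' vs. a level name).
df = pd.DataFrame({
    "torque_nm": [250.0, 380.0, 250.0, 200.0],
    "torque_nm_imputed": ["Original", "series_median", "brand_median", "Original"],
})

# Binary flag the model (and LOFO) can use to discount banded values.
df["torque_is_imputed"] = (df["torque_nm_imputed"] != "Original").astype(int)
print(df["torque_is_imputed"].tolist())  # [0, 1, 1, 0]
```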
# Identify debug columns explicitly created for plotting
debug_cols = ['imputation_status', 'is_torque_imputed']
# Identify all imputation tracking columns (ending with '_imputed')
imputed_tracking_cols = [c for c in df_raw.columns if str(c).endswith('_imputed')]
# Combine and keep only those that actually exist in the dataframe
cols_to_drop = [c for c in (debug_cols + imputed_tracking_cols) if c in df_raw.columns]
cols_to_drop.append("engine_cc_is_range")
# Drop them
df_raw.drop(columns=cols_to_drop, inplace=True)
print(f"Removed {len(cols_to_drop)} debug variables.")
print(f"Current dataset shape: {df_raw.shape}")
Removed 23 debug variables.
Current dataset shape: (13904, 72)
Reasoning:
I will define the preprocess_damage_score function with clear English documentation, apply the requested logic to aggregate condition columns into an expert_risk_score and drop the original columns. Then, I will apply this function to df_raw, calculate vehicle_age, and print the summary statistics as requested.
import pandas as pd
def preprocess_damage_score(df: pd.DataFrame) -> pd.DataFrame:
"""
Calculate an aggregated 'expert_risk_score' based on the condition of various car parts
and remove the original sparse expertise columns.
Parameters:
-----------
df : pd.DataFrame
The input DataFrame containing vehicle expertise data.
Returns:
--------
pd.DataFrame
The modified DataFrame with the new 'expert_risk_score' column and without the
original expertise condition columns.
Details:
--------
Penalty points are assigned based on the severity of the damage:
- Roof (tavan): Replaced = 150, Painted = 75, Locally Painted = 40
- Hood (kaput): Replaced = 60, Painted = 30, Locally Painted = 15
- Trunk (bagaj): Replaced = 40, Painted = 20, Locally Painted = 10
- Doors (door_*): Replaced = 10, Painted = 5, Locally Painted = 2
- Fenders (fender_*): Replaced = 8, Painted = 4, Locally Painted = 2
"""
df_out = df.copy()
# Identify relevant columns
door_cols = [c for c in df_out.columns if 'door_' in c]
fender_cols = [c for c in df_out.columns if 'fender_' in c]
# Calculate scores using .get() with default 0 to handle missing columns gracefully
tavan_score = (
df_out.get('tavan_degisen', 0) * 150 +
df_out.get('tavan_boyali', 0) * 75 +
df_out.get('tavan_lokal', 0) * 40
)
kaput_score = (
df_out.get('kaput_degisen', 0) * 60 +
df_out.get('kaput_boyali', 0) * 30 +
df_out.get('kaput_lokal', 0) * 15
)
bagaj_score = (
df_out.get('bagaj_degisen', 0) * 40 +
df_out.get('bagaj_boyali', 0) * 20 +
df_out.get('bagaj_lokal', 0) * 10
)
doors_score = 0
if door_cols:
doors_degisen = df_out[[c for c in door_cols if 'degisen' in c]].sum(axis=1) if any('degisen' in c for c in door_cols) else 0
doors_boyali = df_out[[c for c in door_cols if 'boyali' in c]].sum(axis=1) if any('boyali' in c for c in door_cols) else 0
doors_lokal = df_out[[c for c in door_cols if 'lokal' in c]].sum(axis=1) if any('lokal' in c for c in door_cols) else 0
doors_score = doors_degisen * 10 + doors_boyali * 5 + doors_lokal * 2
fenders_score = 0
if fender_cols:
fenders_degisen = df_out[[c for c in fender_cols if 'degisen' in c]].sum(axis=1) if any('degisen' in c for c in fender_cols) else 0
fenders_boyali = df_out[[c for c in fender_cols if 'boyali' in c]].sum(axis=1) if any('boyali' in c for c in fender_cols) else 0
fenders_lokal = df_out[[c for c in fender_cols if 'lokal' in c]].sum(axis=1) if any('lokal' in c for c in fender_cols) else 0
fenders_score = fenders_degisen * 8 + fenders_boyali * 4 + fenders_lokal * 2
# Total score
df_out['expert_risk_score'] = tavan_score + kaput_score + bagaj_score + doors_score + fenders_score
# Clean up: Drop the original detail columns
cols_to_drop = [
c for c in df_out.columns
if any(x in c for x in ['tavan_', 'kaput_', 'bagaj_', 'door_', 'fender_'])
]
df_out.drop(columns=cols_to_drop, inplace=True)
return df_out
# Apply the function to the dataset
df_raw = preprocess_damage_score(df_raw)
# Create derived feature 'vehicle_age' (using 2026 as the reference year)
year_col = 'kb_year' if 'kb_year' in df_raw.columns else 'gb_year'
df_raw['vehicle_age'] = 2026 - df_raw[year_col]
# Print the results
print(f"Dataset shape after feature engineering: {df_raw.shape}")
print("\nStatistics for 'expert_risk_score' and 'vehicle_age':")
display(df_raw[['expert_risk_score', 'vehicle_age']].describe())
Dataset shape after feature engineering: (13904, 41)

Statistics for 'expert_risk_score' and 'vehicle_age':
| | expert_risk_score | vehicle_age |
|---|---|---|
| count | 13904.000000 | 13904.000000 |
| mean | 12.981085 | 10.794232 |
| std | 21.244036 | 4.687457 |
| min | 0.000000 | 1.000000 |
| 25% | 0.000000 | 8.000000 |
| 50% | 4.000000 | 11.000000 |
| 75% | 18.000000 | 14.000000 |
| max | 274.000000 | 21.000000 |
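The penalty weights in the docstring can be sanity-checked on a toy frame. This sketch recomputes the weighted sum directly for two hypothetical cars (the flag columns and their values are made-up examples, not dataset rows):

```python
import pandas as pd

# Toy expertise flags (1 = condition applies), using the penalty weights
# documented in preprocess_damage_score: roof replaced = 150,
# hood painted = 30, door painted = 5 per door.
df = pd.DataFrame({
    "tavan_degisen": [1, 0],    # roof replaced
    "kaput_boyali":  [0, 1],    # hood painted
    "door_fl_boyali": [1, 1],   # front-left door painted
})

score = (df["tavan_degisen"] * 150
         + df["kaput_boyali"] * 30
         + df["door_fl_boyali"] * 5)
print(score.tolist())  # [155, 35]
```

Car 1 (replaced roof) scores far above car 2 (painted hood), matching the intent that structural damage dominates cosmetic repairs.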
Phase 3b: Numerical Feature Relationship Analysis Part 2¶
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Select numerical columns for correlation analysis
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns
# Drop identifiers if present to avoid spurious noise
features_to_plot = [c for c in numeric_cols if c not in ['ad_id']]
# 1. Scale data before calculating correlation
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_raw[features_to_plot]), columns=features_to_plot)
# Compute the correlation matrix on scaled data
corr_matrix = df_scaled.corr()
# Handle potential NaNs in the correlation matrix (can happen if a column is constant)
corr_matrix = corr_matrix.fillna(0)
# 2. Create a clustered heatmap to visualize feature clusters
clustermap = sns.clustermap(
corr_matrix,
cmap='coolwarm',
center=0,
annot=False,
figsize=(12, 12),
dendrogram_ratio=(.1, .2),
cbar_pos=(1.02, 0.15, .03, .6),
linewidths=0.5
)
clustermap.fig.suptitle('Hierarchical Clustermap of Scaled Numerical Features', y=1.03, fontsize=16)
plt.show()
# 3. Detect Multicollinearity (exclude price)
print("=== Multicollinearity Detection (Threshold |r| > 0.85) ===")
predictor_cols = [c for c in corr_matrix.columns if c != 'price']
pred_corr = corr_matrix.loc[predictor_cols, predictor_cols]
multicollinear_pairs = []
for i in range(len(pred_corr.columns)):
for j in range(i + 1, len(pred_corr.columns)):
if abs(pred_corr.iloc[i, j]) > 0.85:
multicollinear_pairs.append((pred_corr.columns[i], pred_corr.columns[j], pred_corr.iloc[i, j]))
multicollinear_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
if multicollinear_pairs:
for col1, col2, val in multicollinear_pairs:
print(f" [!] Highly correlated: {col1} & {col2} (r = {val:.3f})")
else:
print(" No highly multicollinear pairs found above threshold.")
# 4. Show top correlations specifically with 'price'
if 'price' in corr_matrix.columns:
print("\n=== Top 7 Features Correlated with Price ===")
price_corr = corr_matrix['price'].abs().sort_values(ascending=False).drop('price').head(7)
for col, val in price_corr.items():
sign = "+" if corr_matrix.loc['price', col] > 0 else "-"
print(f" - {col}: {sign}{val:.3f}")
=== Multicollinearity Detection (Threshold |r| > 0.85) ===
 [!] Highly correlated: gb_year & vehicle_age (r = -1.000)
 [!] Highly correlated: weight_kg & curb_weight_kg (r = 0.953)
 [!] Highly correlated: length_mm & wheelbase_mm (r = 0.946)
 [!] Highly correlated: kb_fuel_cons_avg & highway_fuel_cons (r = 0.937)
 [!] Highly correlated: kb_fuel_cons_avg & city_fuel_cons (r = 0.924)
 [!] Highly correlated: length_mm & trunk_capacity_lt (r = 0.910)
 [!] Highly correlated: city_fuel_cons & highway_fuel_cons (r = 0.904)
 [!] Highly correlated: length_mm & weight_kg (r = 0.901)
 [!] Highly correlated: weight_kg & wheelbase_mm (r = 0.886)
 [!] Highly correlated: length_mm & curb_weight_kg (r = 0.859)
 [!] Highly correlated: trunk_capacity_lt & wheelbase_mm (r = 0.856)

=== Top 7 Features Correlated with Price ===
 - vehicle_age: -0.772
 - gb_year: +0.772
 - gb_mtv_yearly: +0.638
 - gb_mileage: -0.627
 - max_speed_kmh: +0.501
 - weight_kg: +0.500
 - curb_weight_kg: +0.482
I will compare the vehicle age and year variables during LOFO feature selection; we do not need both, and their performance will most likely be the same. The capacity variables are highly correlated with each other, as are the fuel-consumption variables; both patterns are expected. I will not use these variables: they have many missing values and may hurt performance.
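As a complement to the pairwise threshold scan above, the variance inflation factor captures multi-way collinearity (a feature explained by a combination of others, not just one). A small sketch on synthetic standardized features (`x1`, `x2`, `x3` are made up for illustration):

```python
import numpy as np
import pandas as pd

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on
# the others. For standardized data, the VIFs are the diagonal of the
# inverse correlation matrix.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 * 0.95 + rng.normal(scale=0.3, size=500)  # near-duplicate of x1
x3 = rng.normal(size=500)                          # independent feature
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(bool((vif["x1"] > 5) and (vif["x3"] < 2)))  # True
```

A common rule of thumb flags VIF > 5 (or > 10) as problematic; the collinear pair inflates each other's VIF while the independent feature stays near 1.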
Phase 3c: Categorical Feature Relationship Analysis¶
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from sklearn.metrics import mutual_info_score
print("=== Phase 3c: Categorical Feature Relationship Analysis ===")
# Select categorical columns with a reasonable number of unique values
cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
exclude_cats = ['ad_title', 'description_text', 'ad_id', 'listing_date', 'scraped_at', 'search_date', 'location', 'power_hp_is_range']
# Keep every categorical column with more than one unique value (no upper cardinality cap, so the high-cardinality 'model' column is included)
cat_cols = [c for c in cat_cols if c not in exclude_cats and df_raw[c].nunique() > 1]
# Function for Cramer's V (Symmetric)
def cramers_v(x, y):
confusion_matrix = pd.crosstab(x, y)
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum().sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
denom = min((kcorr-1), (rcorr-1))
if denom <= 0:
return 0.0
return np.sqrt(phi2corr / denom)
# Function for Theil's U / Uncertainty Coefficient (Asymmetric)
def theils_u(x, y):
s_xy = pd.crosstab(x, y)
pi = s_xy.sum(axis=1) / s_xy.sum().sum()
entropy_x = ss.entropy(pi)
if entropy_x == 0:
return 1.0
mis = mutual_info_score(x, y)
return mis / entropy_x
# Calculate matrices
n_cats = len(cat_cols)
cv_matrix = pd.DataFrame(np.zeros((n_cats, n_cats)), index=cat_cols, columns=cat_cols)
tu_matrix = pd.DataFrame(np.zeros((n_cats, n_cats)), index=cat_cols, columns=cat_cols)
print(f"Computing association matrices for {n_cats} categorical features. This might take a few seconds...")
for i in range(n_cats):
for j in range(n_cats):
col1, col2 = cat_cols[i], cat_cols[j]
# Drop NA to compute
valid_idx = df_raw[[col1, col2]].dropna().index
x = df_raw.loc[valid_idx, col1].astype(str)
y = df_raw.loc[valid_idx, col2].astype(str)
if i == j:
cv_matrix.loc[col1, col2] = 1.0
tu_matrix.loc[col1, col2] = 1.0
else:
cv_matrix.loc[col1, col2] = cramers_v(x, y)
# Theil's U is asymmetric: U(x|y) - How much y explains x
tu_matrix.loc[col1, col2] = theils_u(x, y)
# Plotting
fig, axes = plt.subplots(2, 1, figsize=(14, 20))
# Cramer's V Heatmap
sns.heatmap(cv_matrix,
annot=True, # Added: Show values
fmt=".2f", # Added: 2 decimal places
cmap="Blues",
ax=axes[0],
vmin=0,
vmax=1,
annot_kws={"size": 9}) # Optional: Adjust font size for clarity
axes[0].set_title("Cramer's V (Symmetric Association)", fontsize=16, pad=20)
# Theil's U Heatmap
sns.heatmap(tu_matrix,
annot=True, # Added: Show values
fmt=".2f", # Added: 2 decimal places
cmap="Oranges",
ax=axes[1],
vmin=0,
vmax=1,
annot_kws={"size": 9}) # Optional: Adjust font size for clarity
axes[1].set_title("Theil's U (Asymmetric Uncertainty Coefficient)\nHow much knowing [Column] predicts [Row]", fontsize=16, pad=20)
plt.tight_layout(pad=5.0) # Added padding to prevent title overlap
plt.show()
# --- Interpretation ---
print("\n--- Interpretation ---")
print("1. Cramer's V is a SYMMETRIC metric. It evaluates the overall correlation.")
print("2. Theil's U is an ASYMMETRIC metric. U(Row|Column) shows how much the Column explains the Row.")
print("3. High values (close to 1.0) indicate strong redundancy or predictive power.")
# Highly Related Variables Printout
print("\n=== Top Categorical Associations (Cramer's V > 0.8) ===")
high_cv_pairs = []
for i in range(len(cv_matrix.columns)):
for j in range(i + 1, len(cv_matrix.columns)):
val = cv_matrix.iloc[i, j]
if val > 0.8:
high_cv_pairs.append((cv_matrix.columns[i], cv_matrix.columns[j], val))
high_cv_pairs.sort(key=lambda x: x[2], reverse=True)
if high_cv_pairs:
for col1, col2, val in high_cv_pairs:
print(f" - {col1} & {col2}: {val:.3f}")
else:
print(" None found.")
print("\n=== Top Categorical Associations (Theil's U > 0.8) ===")
high_tu_pairs = []
for i in range(len(tu_matrix.columns)):
for j in range(len(tu_matrix.columns)):
if i == j: continue
val = tu_matrix.iloc[i, j]
if val > 0.8:
high_tu_pairs.append((tu_matrix.index[i], tu_matrix.columns[j], val))
high_tu_pairs.sort(key=lambda x: x[2], reverse=True)
if high_tu_pairs:
for target, predictor, val in high_tu_pairs:
print(f" - Knowing '{predictor}' predicts '{target}': {val:.3f}")
else:
print(" None found.")
=== Phase 3c: Categorical Feature Relationship Analysis ===
Computing association matrices for 14 categorical features. This might take a few seconds...
--- Interpretation ---
1. Cramer's V is a SYMMETRIC metric. It evaluates the overall correlation.
2. Theil's U is an ASYMMETRIC metric. U(Row|Column) shows how much the Column explains the Row.
3. High values (close to 1.0) indicate strong redundancy or predictive power.

=== Top Categorical Associations (Cramer's V > 0.8) ===
 - kb_body_type & gb_body_type: 1.000
 - brand & series: 0.999
 - brand & model: 0.977
 - model & gb_segment: 0.965
 - series & model: 0.941
 - series & gb_segment: 0.915
 - brand & kb_drivetrain: 0.855

=== Top Categorical Associations (Theil's U > 0.8) ===
 - Knowing 'gb_body_type' predicts 'kb_body_type': 1.000
 - Knowing 'model' predicts 'brand': 1.000
 - Knowing 'series' predicts 'brand': 1.000
 - Knowing 'model' predicts 'series': 0.999
 - Knowing 'model' predicts 'gb_segment': 0.996
 - Knowing 'series' predicts 'gb_segment': 0.972
 - Knowing 'model' predicts 'kb_drivetrain': 0.971
 - Knowing 'model' predicts 'kb_body_type': 0.891
 - Knowing 'model' predicts 'gb_fuel': 0.881
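The asymmetry that makes Theil's U informative here can be shown on a toy brand/model pair, using a compact reimplementation of the same formula (the example labels are made up):

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

# 'model' fully determines 'brand', but 'brand' only narrows down 'model'.
model = ["A3", "A4", "320i", "520d", "A3", "320i"]
brand = ["audi", "audi", "bmw", "bmw", "audi", "bmw"]

def theils_u(x, y):
    """U(x|y): fraction of H(x) explained by knowing y."""
    _, counts = np.unique(x, return_counts=True)
    h_x = entropy(counts / counts.sum())
    return mutual_info_score(x, y) / h_x if h_x > 0 else 1.0

print(round(theils_u(brand, model), 2))  # 1.0 -> model fully predicts brand
print(theils_u(model, brand) < 1.0)      # True -> brand does not pin down model
```

This is exactly the pattern in the output above: U(brand|model) = 1.0 while the reverse direction is lower, which is why 'model' alone is such a strong candidate feature.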
import pandas as pd
import numpy as np
import scipy.stats as ss
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
def estimate_shannon_entropy(series):
"""Estimate the entropy (uncertainty) of a continuous or categorical series."""
# For continuous data (float/int), estimate the entropy via binning
if series.dtype in ['float64', 'int64']:
# Smart binning via the Freedman-Diaconis rule
counts, _ = np.histogram(series.dropna(), bins='fd')
else:
counts = series.value_counts()
probs = counts / counts.sum()
return ss.entropy(probs)
def calculate_redundancy(df, target_col, features):
results = []
for col in features:
# Drop missing values
valid_data = df[[col, target_col]].dropna()
x = valid_data[col]
y = valid_data[target_col].astype('category').cat.codes
# 1. The feature's own entropy H(X) - total uncertainty
h_x = estimate_shannon_entropy(x)
# 2. Mutual information I(X;Y)
# Regression-based MI for continuous features, classification-based for categorical ones
if x.dtype in ['float64', 'int64']:
# y holds the integer category codes of the target column
mi = mutual_info_regression(x.values.reshape(-1, 1), y, random_state=42)[0]
else:
x_enc = x.astype('category').cat.codes
mi = mutual_info_classif(x_enc.values.reshape(-1, 1), y, random_state=42)[0]
# 3. Redundancy ratio (%)
# What fraction of the feature's uncertainty does the target explain?
redundancy_pct = (mi / h_x) * 100 if h_x > 0 else 0
results.append({
'Feature': col,
'Entropy (Total Uncertainty)': h_x,
'Mutual Info with Model': mi,
'Redundancy (%)': min(redundancy_pct, 100.0) # cap at 100
})
return pd.DataFrame(results).sort_values(by='Redundancy (%)', ascending=False)
# --- Application ---
features_to_test = ['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment',
'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']
redundancy_df = calculate_redundancy(df_raw, 'model', features_to_test)
print("=== Information Redundancy Analysis (Relative to the 'model' Column) ===")
print(redundancy_df.to_string(index=False))
# Decision suggestions
print("\n--- Interpretation and Decision ---")
for _, row in redundancy_df.iterrows():
if row['Redundancy (%)'] > 90:
print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% redundant. Already captured by MODEL, CAN BE DROPPED.")
elif row['Redundancy (%)'] > 70:
print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% high dependence. Inspect and consider keeping.")
else:
print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% unique information. Keep in the dataset.")
=== Information Redundancy Analysis (Relative to the 'model' Column) ===
Feature Entropy (Total Uncertainty) Mutual Info with Model Redundancy (%)
wheelbase_mm 2.597758 2.707323 100.000000
width_mm 2.831880 2.865687 100.000000
length_mm 2.575656 3.223208 100.000000
gb_segment 1.195644 1.467208 100.000000
curb_weight_kg 3.206310 3.422261 100.000000
height_mm 2.870829 2.829981 98.577158
max_speed_kmh 2.977917 2.742922 92.108745
cylinder_count 0.000000 0.467000 0.000000
--- Interpretation and Decision ---
- [wheelbase_mm]: 100.0% redundant. Already captured by MODEL, CAN BE DROPPED.
- [width_mm]: 100.0% redundant. Already captured by MODEL, CAN BE DROPPED.
- [length_mm]: 100.0% redundant. Already captured by MODEL, CAN BE DROPPED.
- [gb_segment]: 100.0% redundant. Already captured by MODEL, CAN BE DROPPED.
- [curb_weight_kg]: 100.0% redundant. Already captured by MODEL, CAN BE DROPPED.
- [height_mm]: 98.6% redundant. Already captured by MODEL, CAN BE DROPPED.
- [max_speed_kmh]: 92.1% redundant. Already captured by MODEL, CAN BE DROPPED.
- [cylinder_count]: 0.0% unique information. Keep in the dataset.
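One caveat behind the rows that clip at exactly 100%: the k-NN mutual-information estimate and the histogram-based entropy estimate live on different scales, so their ratio is only a rough redundancy indicator and can exceed 1 before clipping. A quick synthetic illustration (made-up data, not the dataset):

```python
import numpy as np
import scipy.stats as ss
from sklearn.feature_selection import mutual_info_regression

# Nearly deterministic pair: the estimated MI can exceed the binned
# entropy estimate of x, which is exactly why the ratio above is capped.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(scale=0.001, size=2000)  # y almost fully determined by x

counts, _ = np.histogram(x, bins='fd')       # same Freedman-Diaconis binning
h_x = ss.entropy(counts / counts.sum())
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(f"H(x) ~ {h_x:.2f} nats, MI ~ {mi:.2f} nats, ratio > 1: {mi > h_x}")
```

So the 100% rows should be read as "very strongly explained by 'model'", not as an exact percentage.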
Reasoning: I will prepare the dataset by removing non-predictive columns, define the target variable as the log of the price, handle missing values in categorical/text columns, and set up a 5-fold cross-validation loop to train and evaluate the GPU-accelerated CatBoost baseline model.
df_raw.columns
Index(['brand', 'series', 'model', 'price', 'engine_cc_val', 'power_hp_val',
'gb_year', 'gb_mileage', 'gb_transmission', 'gb_fuel', 'kb_body_type',
'gb_body_type', 'gb_color', 'kb_drivetrain', 'kb_trade_available',
'kb_seller_type', 'kb_fuel_cons_avg', 'kb_fuel_tank',
'gb_warranty_status', 'gb_segment', 'gb_mtv_yearly', 'torque_nm',
'cylinder_count', 'max_speed_kmh', 'accel_0_100', 'city_fuel_cons',
'highway_fuel_cons', 'length_mm', 'width_mm', 'height_mm', 'weight_kg',
'curb_weight_kg', 'trunk_capacity_lt', 'wheelbase_mm',
'is_heavy_damaged', 'tramer_fee', 'count_changed', 'count_painted',
'count_local_painted', 'expert_risk_score', 'vehicle_age'],
dtype='object')
print("\n--- Checking for empty categories (missing values in categorical columns) ---")
cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns
missing_cat_counts = df_raw[cat_cols].isnull().sum()
if missing_cat_counts.sum() == 0:
print("No empty categories (missing values) found in categorical columns after imputation.")
else:
print("Remaining empty categories (missing values) in categorical columns:")
print(missing_cat_counts[missing_cat_counts > 0])
print("\n--- How CatBoost handles null categories ---")
print("CatBoost handles missing values natively only for NUMERICAL features.")
print("For CATEGORICAL features, NaN raises an error when building a Pool: the values must first be converted to a string sentinel such as 'missing'.")
print("Once filled, the sentinel acts as one more category, and CatBoost learns during training whether it should be grouped with other categories or form its own group for optimal splits.")
--- Checking for empty categories (missing values in categorical columns) ---
Remaining empty categories (missing values) in categorical columns:
gb_warranty_status    3010
dtype: int64

--- How CatBoost handles null categories ---
CatBoost handles missing values natively only for NUMERICAL features.
For CATEGORICAL features, NaN raises an error when building a Pool: the values must first be converted to a string sentinel such as 'missing'.
Once filled, the sentinel acts as one more category, and CatBoost learns during training whether it should be grouped with other categories or form its own group for optimal splits.
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool
# 1. Prepare dataset
target_col = 'price'
drop_cols = [target_col]
drop_cols = [c for c in drop_cols if c in df_raw.columns]
# Drop rows where target is missing or zero
df_model = df_raw.dropna(subset=[target_col]).copy()
df_model = df_model[df_model[target_col] > 0]
X = df_model.drop(columns=drop_cols)
y = df_model[target_col]
# 2. Handle categorical and text features
text_features = ['model'] if 'model' in X.columns else []
cat_features_all = [col for col in X.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]
# Fill missing values for categorical and text columns with 'missing'
for col in cat_features_all + text_features:
X[col] = X[col].fillna('missing').astype(str)
# 3. Define 2x2 combinations for Year/Age and KB/GB Body Types
configs = [
{"name": "vehicle_age + kb_body_type", "drop": ["gb_year", "gb_body_type"]},
{"name": "vehicle_age + gb_body_type", "drop": ["gb_year", "kb_body_type"]},
{"name": "gb_year + kb_body_type", "drop": ["vehicle_age", "gb_body_type"]},
{"name": "gb_year + gb_body_type", "drop": ["vehicle_age", "kb_body_type"]},
]
# Pre-build feature sets and indices
datasets = {}
for cfg in configs:
cols_to_drop = [c for c in cfg["drop"] if c in X.columns]
X_cfg = X.drop(columns=cols_to_drop)
c_feats = [c for c in cat_features_all if c in X_cfg.columns]
t_feats = [c for c in text_features if c in X_cfg.columns]
datasets[cfg["name"]] = {
"X": X_cfg,
"c_idx": [X_cfg.columns.get_loc(c) for c in c_feats],
"t_idx": [X_cfg.columns.get_loc(c) for c in t_feats],
"scores": []
}
# 4. Establish cross-validation framework
kf = KFold(n_splits=5, shuffle=True, random_state=42)
print("Starting 5-fold Cross-Validation with CatBoost (GPU) for 2x2 combinations...")
for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
print(f"\n--- Fold {fold + 1} ---")
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
for cfg_name, data in datasets.items():
X_train, X_val = data["X"].iloc[train_idx], data["X"].iloc[val_idx]
train_pool = Pool(data=X_train, label=y_train, cat_features=data["c_idx"], text_features=data["t_idx"])
val_pool = Pool(data=X_val, label=y_val, cat_features=data["c_idx"], text_features=data["t_idx"])
model = CatBoostRegressor(
iterations=1000, learning_rate=0.05, depth=6,
loss_function='RMSE', eval_metric='RMSE',
early_stopping_rounds=100, verbose=False,
random_seed=42, **gpu_params
)
model.fit(train_pool, eval_set=val_pool)
preds = model.predict(val_pool)
rmse = np.sqrt(mean_squared_error(y_val, preds))
data["scores"].append(rmse)
print(f" {cfg_name:30s} RMSE : {rmse:.4f}")
print("\n=== FINAL COMPARISON (Mean CV RMSE) ===")
for cfg_name, data in datasets.items():
mean_rmse = np.mean(data["scores"])
std_rmse = np.std(data["scores"])
print(f"{cfg_name:30s} : {mean_rmse:.4f} (±{std_rmse:.4f})")
Starting 5-fold Cross-Validation with CatBoost (GPU) for 2x2 combinations...

--- Fold 1 ---
 vehicle_age + kb_body_type     RMSE : 194203.8777
 vehicle_age + gb_body_type     RMSE : 194190.9607
 gb_year + kb_body_type         RMSE : 193190.8369
 gb_year + gb_body_type         RMSE : 192402.1037

--- Fold 2 ---
 vehicle_age + kb_body_type     RMSE : 193917.3367
 vehicle_age + gb_body_type     RMSE : 193861.2270
 gb_year + kb_body_type         RMSE : 193329.4420
 gb_year + gb_body_type         RMSE : 193920.6988

--- Fold 3 ---
 vehicle_age + kb_body_type     RMSE : 179982.0014
 vehicle_age + gb_body_type     RMSE : 179724.9643
 gb_year + kb_body_type         RMSE : 178079.4357
 gb_year + gb_body_type         RMSE : 178011.3103

--- Fold 4 ---
 vehicle_age + kb_body_type     RMSE : 195770.9350
 vehicle_age + gb_body_type     RMSE : 196586.5364
 gb_year + kb_body_type         RMSE : 195930.1660
 gb_year + gb_body_type         RMSE : 197363.5721

--- Fold 5 ---
 vehicle_age + kb_body_type     RMSE : 195289.7405
 vehicle_age + gb_body_type     RMSE : 194321.8538
 gb_year + kb_body_type         RMSE : 192736.4721
 gb_year + gb_body_type         RMSE : 193285.6588

=== FINAL COMPARISON (Mean CV RMSE) ===
vehicle_age + kb_body_type     : 191832.7783 (±5964.3621)
vehicle_age + gb_body_type     : 191737.1084 (±6083.1363)
gb_year + kb_body_type         : 190653.2705 (±6385.7312)
gb_year + gb_body_type         : 190996.6687 (±6707.0361)
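The mean gaps between configurations are far smaller than the fold-to-fold spread, so a paired comparison on per-fold RMSEs is more informative than comparing means alone. A quick sketch for the age-vs-year comparison, with values copied from the CV output above:

```python
import numpy as np
from scipy import stats

# Per-fold RMSEs from the CV run above:
# 'vehicle_age + kb_body_type' vs. 'gb_year + kb_body_type'.
rmse_age = np.array([194203.8777, 193917.3367, 179982.0014, 195770.9350, 195289.7405])
rmse_year = np.array([193190.8369, 193329.4420, 178079.4357, 195930.1660, 192736.4721])

# Paired test removes the large shared fold-to-fold variance, testing
# whether one configuration is consistently better across folds.
t_stat, p_value = stats.ttest_rel(rmse_age, rmse_year)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With only 5 folds the test is underpowered, so a p-value in this range mainly confirms that the year variant is at best marginally better, consistent with the decision to settle the question in LOFO.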
print("Dropping 'gb_body_type' and 'vehicle_age' to finalize the feature set...")
cols_to_drop_final = ['gb_body_type', 'vehicle_age']
X = X.drop(columns=[c for c in cols_to_drop_final if c in X.columns])
# Assign text and cat features to variables for the next steps
text_features = ['model'] if 'model' in X.columns else []
cat_features = [col for col in X.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]
# Get integer indices for CatBoost
cat_indices = [X.columns.get_loc(col) for col in cat_features]
text_indices = [X.columns.get_loc(col) for col in text_features]
print(f"\nFinal feature count: {X.shape[1]}")
print(f"Categorical features ({len(cat_features)}): {cat_features}")
print(f"Text features ({len(text_features)}): {text_features}")
Dropping 'gb_body_type' and 'vehicle_age' to finalize the feature set...

Final feature count: 38
Categorical features (12): ['brand', 'series', 'gb_transmission', 'gb_fuel', 'kb_body_type', 'gb_color', 'kb_drivetrain', 'kb_trade_available', 'kb_seller_type', 'gb_warranty_status', 'gb_segment', 'is_heavy_damaged']
Text features (1): ['model']
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest
import plotly.express as px
# 1. Get CatBoost residuals on the full dataset
full_pool = Pool(data=X, label=y, cat_features=cat_indices, text_features=text_indices)
baseline_full = CatBoostRegressor(
iterations=1000,
learning_rate=0.05,
depth=6,
loss_function='RMSE',
verbose=False,
random_seed=42,
**gpu_params
)
baseline_full.fit(full_pool)
preds_full = baseline_full.predict(full_pool)
residuals = y - preds_full
# 2. Calculate Z-Scores of residuals
z_scores = np.abs(stats.zscore(residuals))
outlier_zscore = z_scores > 3 # Flag as outlier if |Z| > 3
# 3. Isolation Forest on numeric features
# Fill missing values with 0 just for the Isolation Forest (if any remain)
numeric_X = X.select_dtypes(include=[np.number]).fillna(0)
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_iso = iso_forest.fit_predict(numeric_X) == -1
# 4. Consolidate into a Decision DataFrame
outlier_df = pd.DataFrame({
'brand': X['brand'],
'series': X['series'],
'model': X['model'],
'price': y,
'predicted_price': preds_full,
'residual': residuals,
'z_score': z_scores,
'outlier_zscore': outlier_zscore,
'outlier_iso': outlier_iso
})
# Final decision: flagged if identified by either method
outlier_df['is_outlier'] = outlier_df['outlier_zscore'] | outlier_df['outlier_iso']
print(f"Total outliers detected by Z-Score: {outlier_df['outlier_zscore'].sum()}")
print(f"Total outliers detected by Isolation Forest: {outlier_df['outlier_iso'].sum()}")
print(f"Total combined outliers: {outlier_df['is_outlier'].sum()}")
# 5. Plotly chart of outliers by car model
outlier_counts = outlier_df[outlier_df['is_outlier']]['model'].value_counts().reset_index()
outlier_counts.columns = ['model', 'outlier_count']
outlier_counts = outlier_counts.head(20) # Get top 20 models
fig = px.bar(
outlier_counts,
x='outlier_count',
y='model',
orientation='h',
title='Top 20 Car Models Most Frequently Flagged as Outliers',
labels={'outlier_count': 'Number of Outliers', 'model': 'Car Model'},
color='outlier_count',
color_continuous_scale='Reds'
)
fig.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig.show()
# Display a sample of the outliers
display(outlier_df[outlier_df['is_outlier']].head())
Total outliers detected by Z-Score: 197
Total outliers detected by Isolation Forest: 696
Total combined outliers: 843
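A note on the consolidation rule: the OR used here flags a row if either detector fires (aggressive, removes more data), while AND would require both to agree (conservative). A toy sketch of the difference:

```python
import pandas as pd

# Toy flags for four rows, standing in for the two detector columns above.
flags = pd.DataFrame({
    "outlier_zscore": [True, False, True, False],
    "outlier_iso":    [True, True, False, False],
})
flags["or_flag"] = flags["outlier_zscore"] | flags["outlier_iso"]   # either fires
flags["and_flag"] = flags["outlier_zscore"] & flags["outlier_iso"]  # both agree
print(flags["or_flag"].sum(), flags["and_flag"].sum())  # 3 1
```

OR is a reasonable default when the two detectors target different failure modes (price mispricing vs. unusual feature combinations), since their flagged sets overlap only partially.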
| | brand | series | model | price | predicted_price | residual | z_score | outlier_zscore | outlier_iso | is_outlier |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | audi | A3 | A3 Hatchback 1.6 Attraction | 475000.0 | 589385.141417 | -114385.141417 | 0.718707 | False | True | True |
| 11 | audi | A3 | A3 Sportback 1.6 FSI | 490000.0 | 463285.414422 | 26714.585578 | 0.158544 | False | True | True |
| 12 | audi | A3 | A3 Sportback 1.6 Attraction | 499999.0 | 535147.887318 | -35148.887318 | 0.226076 | False | True | True |
| 13 | audi | A3 | A3 Sportback 1.6 Ambition | 445000.0 | 467605.476146 | -22605.476146 | 0.148091 | False | True | True |
| 18 | audi | A3 | A3 Sportback 1.6 Ambiente | 475000.0 | 505031.477057 | -30031.477057 | 0.194260 | False | True | True |
display(
outlier_df[outlier_df['is_outlier']]
.sort_values(by="residual", key=lambda x: x.abs(), ascending=False)
.head()
)
| | brand | series | model | price | predicted_price | residual | z_score | outlier_zscore | outlier_iso | is_outlier |
|---|---|---|---|---|---|---|---|---|---|---|
| 10897 | bmw | 6 Serisi | 640i | 6150000.0 | 3.712775e+06 | 2.437225e+06 | 15.145276 | True | True | True |
| 8853 | bmw | 7 Serisi | 730d xDrive M Sport | 1950000.0 | 3.619131e+06 | -1.669131e+06 | 10.384940 | True | True | True |
| 10948 | bmw | M Serisi | M3 | 5999000.0 | 4.534616e+06 | 1.464384e+06 | 9.096882 | True | True | True |
| 4345 | audi | A5 | A5 Sportback 40 TDI Quattro S Line | 5330000.0 | 3.921290e+06 | 1.408710e+06 | 8.750743 | True | False | True |
| 10931 | bmw | 4 Serisi | 420i Edition M Sport | 5900000.0 | 4.584921e+06 | 1.315079e+06 | 8.168619 | True | False | True |
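The z_score column above is the absolute standardized residual: how many standard deviations a residual sits from the mean residual. A minimal pure-Python sketch of the idea, with toy residuals and a purely illustrative threshold of 1.5 (the notebook's actual threshold is set in Phase 5):

```python
import statistics

# Toy residuals (hypothetical values, not from the car data)
residuals = [1, -1, 2, -2, 100]

mu = statistics.mean(residuals)       # mean residual
sigma = statistics.pstdev(residuals)  # population std, as np.std would give
z_scores = [abs(r - mu) / sigma for r in residuals]

# Threshold of 1.5 is illustrative only
flags = [z > 1.5 for z in z_scores]
print(flags)  # → [False, False, False, False, True]
```

Only the extreme residual crosses the threshold; with a union rule like the one above, such rows would be flagged even if Isolation Forest missed them.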
Phase 6: LOFO Feature Selection & Hierarchy Ablation Study¶
Subtask:¶
Implement a custom LOFO function for feature selection. Then, perform an ablation study by training separate models using only 'brand', only 'series', and only 'model' (as a text feature) to compare their individual impacts on predicting car prices.
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
import time
# 1. Clean dataset by dropping outliers identified in Phase 5
X_clean = X[~outlier_df['is_outlier']].copy()
y_clean = y[~outlier_df['is_outlier']].copy()
X_clean = X_clean.drop(columns=[
"gb_color", "kb_trade_available", "kb_seller_type", "kb_fuel_cons_avg",
"kb_fuel_tank", "gb_warranty_status", "gb_segment", "torque_nm",
"cylinder_count", "max_speed_kmh", "accel_0_100", "city_fuel_cons",
"highway_fuel_cons", "length_mm", "width_mm", "height_mm", "weight_kg",
"curb_weight_kg", "trunk_capacity_lt", "wheelbase_mm", "tramer_fee"
]).copy()
# Create a standard train/val split for other experiments
X_train_clean, X_val_clean, y_train_clean, y_val_clean = train_test_split(
X_clean, y_clean, test_size=0.2, random_state=42
)
def custom_lofo_importance(X, y, features_to_test, cat_features, text_features, gpu_params, n_splits=3):
"""
Calculate Leave-One-Feature-Out (LOFO) importance using Cross-Validation.
This function iteratively removes one feature at a time from the dataset,
trains a CatBoost model using K-Fold CV, and calculates the mean RMSE.
The LOFO importance is the difference between the CV RMSE without the feature
and the baseline CV RMSE (with all features). A positive value indicates that
removing the feature worsens the model, meaning the feature is important.
Parameters:
-----------
X : pd.DataFrame
Features dataset.
y : pd.Series
Target variable.
features_to_test : list
List of specific features to evaluate using LOFO.
cat_features : list
List of categorical feature names.
text_features : list
List of text feature names.
gpu_params : dict
Dictionary containing CatBoost GPU configuration.
n_splits : int
Number of cross-validation folds.
Returns:
--------
pd.DataFrame
A DataFrame containing the features and their calculated LOFO importance scores,
sorted in descending order of importance.
"""
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
print(f"Calculating Baseline {n_splits}-Fold CV RMSE with all features...")
# Get index positions for the baseline
all_feats = list(X.columns)
c_idx_base = [all_feats.index(f) for f in cat_features if f in all_feats]
t_idx_base = [all_feats.index(f) for f in text_features if f in all_feats]
base_rmses = []
for train_idx, val_idx in kf.split(X, y):
X_tr, X_va = X.iloc[train_idx], X.iloc[val_idx]
y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]
base_train_pool = Pool(X_tr, y_tr, cat_features=c_idx_base, text_features=t_idx_base)
base_val_pool = Pool(X_va, y_va, cat_features=c_idx_base, text_features=t_idx_base)
base_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
base_model.fit(base_train_pool, eval_set=base_val_pool)
base_rmses.append(np.sqrt(mean_squared_error(y_va, base_model.predict(base_val_pool))))
base_rmse = np.mean(base_rmses)
print(f"Baseline CV RMSE: {base_rmse:.4f}\n")
lofo_scores = []
for i, feature in enumerate(features_to_test):
start_time = time.time()
# Features without the current one
lofo_feats = [f for f in all_feats if f != feature]
lofo_c_idx = [lofo_feats.index(f) for f in cat_features if f in lofo_feats]
lofo_t_idx = [lofo_feats.index(f) for f in text_features if f in lofo_feats]
feature_rmses = []
for train_idx, val_idx in kf.split(X, y):
X_tr, X_va = X.iloc[train_idx][lofo_feats], X.iloc[val_idx][lofo_feats]
y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]
lofo_train_pool = Pool(X_tr, y_tr, cat_features=lofo_c_idx, text_features=lofo_t_idx)
lofo_val_pool = Pool(X_va, y_va, cat_features=lofo_c_idx, text_features=lofo_t_idx)
lofo_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
lofo_model.fit(lofo_train_pool, eval_set=lofo_val_pool)
feature_rmses.append(np.sqrt(mean_squared_error(y_va, lofo_model.predict(lofo_val_pool))))
lofo_rmse = np.mean(feature_rmses)
importance = lofo_rmse - base_rmse
elapsed = time.time() - start_time
print(f"Dropped '{feature}': CV RMSE = {lofo_rmse:.4f} | Importance = {importance:+.4f} ({elapsed:.1f}s)")
lofo_scores.append({'Feature': feature, 'LOFO_Importance': importance})
return pd.DataFrame(lofo_scores).sort_values('LOFO_Importance', ascending=False)
# 2. Rank features by baseline importance (from the Phase 5 full baseline) for reference
fi = baseline_full.get_feature_importance()
fi_df = pd.DataFrame({'feature': X.columns, 'importance': fi}).sort_values('importance', ascending=False)
top_15_features = fi_df['feature'].head(15).tolist()  # top 15 by baseline importance
# Note: the baseline ranking is kept for reference only; LOFO below runs on every column of X_clean
print("--- Starting Custom LOFO Feature Selection (with 5-Fold CV) ---")
lofo_results = custom_lofo_importance(
X_clean, y_clean,
features_to_test=X_clean.columns,
cat_features=cat_features,
text_features=text_features,
gpu_params=gpu_params,
n_splits=5
)
# Plot LOFO Results
fig_lofo = px.bar(
lofo_results, x='LOFO_Importance', y='Feature', orientation='h',
title='LOFO Feature Importance (Positive = Important)',
color='LOFO_Importance', color_continuous_scale='RdYlGn'
)
fig_lofo.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig_lofo.show()
--- Starting Custom LOFO Feature Selection (with 5-Fold CV) --- Calculating Baseline 5-Fold CV RMSE with all features... Baseline CV RMSE: 142418.6554 Dropped 'brand': CV RMSE = 142944.7356 | Importance = +526.0802 (7.9s) Dropped 'series': CV RMSE = 147722.1699 | Importance = +5303.5145 (7.3s) Dropped 'model': CV RMSE = 155501.2192 | Importance = +13082.5638 (13.4s) Dropped 'engine_cc_val': CV RMSE = 143325.1902 | Importance = +906.5347 (7.9s) Dropped 'power_hp_val': CV RMSE = 148070.7937 | Importance = +5652.1382 (7.8s) Dropped 'gb_year': CV RMSE = 187706.7761 | Importance = +45288.1207 (7.6s) Dropped 'gb_mileage': CV RMSE = 200604.2567 | Importance = +58185.6013 (8.1s) Dropped 'gb_transmission': CV RMSE = 142493.4145 | Importance = +74.7591 (7.7s) Dropped 'gb_fuel': CV RMSE = 142309.2237 | Importance = -109.4318 (7.8s) Dropped 'kb_body_type': CV RMSE = 142991.0902 | Importance = +572.4348 (7.6s) Dropped 'kb_drivetrain': CV RMSE = 142801.6870 | Importance = +383.0316 (7.6s) Dropped 'gb_mtv_yearly': CV RMSE = 143178.4193 | Importance = +759.7639 (7.7s) Dropped 'is_heavy_damaged': CV RMSE = 148924.1099 | Importance = +6505.4545 (7.9s) Dropped 'count_changed': CV RMSE = 144209.8375 | Importance = +1791.1821 (7.8s) Dropped 'count_painted': CV RMSE = 142550.1559 | Importance = +131.5004 (7.9s) Dropped 'count_local_painted': CV RMSE = 142355.8654 | Importance = -62.7900 (7.8s) Dropped 'expert_risk_score': CV RMSE = 142072.9914 | Importance = -345.6640 (7.7s)
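A negative LOFO importance means CV RMSE actually improved when the feature was removed, so gb_fuel, count_local_painted, and expert_risk_score are natural drop candidates. A minimal pandas sketch of that selection (using a subset of the scores printed above; the threshold of 0 is an assumption, not part of the pipeline):

```python
import pandas as pd

# Subset of the LOFO scores printed above
lofo_scores = pd.DataFrame({
    'Feature': ['gb_mileage', 'gb_year', 'gb_transmission',
                'gb_fuel', 'count_local_painted', 'expert_risk_score'],
    'LOFO_Importance': [58185.6, 45288.1, 74.8, -109.4, -62.8, -345.7],
})

# Features whose removal did not hurt the model (importance <= 0)
drop_candidates = lofo_scores.loc[
    lofo_scores['LOFO_Importance'] <= 0, 'Feature'
].tolist()
print(drop_candidates)  # → ['gb_fuel', 'count_local_painted', 'expert_risk_score']
```

In practice a small tolerance band around zero may be preferable, since tiny positive scores (e.g. gb_transmission at +74.8) are well within CV noise.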
import seaborn as sns
import matplotlib.pyplot as plt
# Sort by importance before plotting (mirrors Plotly's 'total ascending' ordering)
lofo_results_sorted = lofo_results.sort_values('LOFO_Importance', ascending=False)
# Configure figure size and style
plt.figure(figsize=(10, 8))
sns.set_theme(style="whitegrid")
# Build the bar plot
# 'palette' sets the color scale (similar to RdYlGn)
ax = sns.barplot(
data=lofo_results_sorted,
x='LOFO_Importance',
y='Feature',
hue='LOFO_Importance', # Used for coloring
palette='RdYlGn', # Red-Yellow-Green scale
legend=False
)
# Title and labels
plt.title('LOFO Feature Importance (Positive = Important)', fontsize=14)
plt.xlabel('LOFO Importance')
plt.ylabel('Feature')
# Remove unnecessary spines (to resemble the Plotly White theme)
sns.despine(left=True, bottom=True)
plt.show()
import itertools
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool
print("--- Hierarchy Ablation Study: Combinations & CV ---")
# Define the base hierarchical features
base_features = ['brand', 'series', 'model']
hierarchy_feature_sets = []
# Generate all combinations of 1, 2, and 3 features
for r in range(1, len(base_features) + 1):
hierarchy_feature_sets.extend([list(c) for c in itertools.combinations(base_features, r)])
ablation_results = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for feats in hierarchy_feature_sets:
comb_name = " + ".join(feats)
print(f"Evaluating: {comb_name}...")
cv_rmses = []
for train_idx, val_idx in kf.split(X_clean, y_clean):
# Subset to only the selected features for this combination
X_tr, X_va = X_clean.iloc[train_idx][feats], X_clean.iloc[val_idx][feats]
y_tr, y_va = y_clean.iloc[train_idx], y_clean.iloc[val_idx]
# Get dynamic indices for categorical and text features based on current subset
c_idx = [feats.index(f) for f in cat_features if f in feats]
t_idx = [feats.index(f) for f in text_features if f in feats]
ab_train_pool = Pool(X_tr, y_tr, cat_features=c_idx, text_features=t_idx)
ab_val_pool = Pool(X_va, y_va, cat_features=c_idx, text_features=t_idx)
# Train CatBoost model
ab_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
ab_model.fit(ab_train_pool, eval_set=ab_val_pool)
fold_rmse = np.sqrt(mean_squared_error(y_va, ab_model.predict(ab_val_pool)))
cv_rmses.append(fold_rmse)
mean_rmse = np.mean(cv_rmses)
ablation_results.append({'Feature_Combination': comb_name, 'CV_RMSE': mean_rmse})
# Create DataFrame and display
ablation_df = pd.DataFrame(ablation_results).sort_values('CV_RMSE', ascending=False)
display(ablation_df)
# Visualize ablation combinations
fig_ablation = px.bar(
ablation_df, x='CV_RMSE', y='Feature_Combination', orientation='h',
title='Ablation Study: 5-Fold CV RMSE by Hierarchical Feature Combinations (Lower is Better)',
color='CV_RMSE', color_continuous_scale='Blues_r', text_auto='.0f'
)
fig_ablation.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig_ablation.show()
--- Hierarchy Ablation Study: Combinations & CV --- Evaluating: brand... Evaluating: series... Evaluating: model... Evaluating: brand + series... Evaluating: brand + model... Evaluating: series + model... Evaluating: brand + series + model...
| | Feature_Combination | CV_RMSE |
|---|---|---|
| 0 | brand | 1.042775e+06 |
| 1 | series | 9.451142e+05 |
| 3 | brand + series | 9.450829e+05 |
| 2 | model | 4.512768e+05 |
| 4 | brand + model | 4.503697e+05 |
| 6 | brand + series + model | 4.459154e+05 |
| 5 | series + model | 4.455466e+05 |
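The table reads more intuitively as relative error reduction: the free-text model feature alone removes most of the brand-only error, and series + model essentially matches the full hierarchy. A quick sketch using the rounded CV RMSEs from the table above:

```python
# Rounded 5-fold CV RMSEs from the ablation table above
rmse = {
    'brand': 1_042_775, 'series': 945_114, 'brand + series': 945_083,
    'model': 451_277, 'brand + model': 450_370,
    'series + model': 445_547, 'brand + series + model': 445_915,
}

baseline = rmse['brand']  # weakest combination
# Percent RMSE reduction relative to the brand-only model
reduction = {k: round(100 * (baseline - v) / baseline, 1) for k, v in rmse.items()}
print(reduction['model'], reduction['series + model'])  # → 56.7 57.3
```

The extra 0.6 points from adding series on top of model is marginal, which motivates keeping model as a text feature and treating brand/series as cheap categorical complements.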
import seaborn as sns
import matplotlib.pyplot as plt
# 1. Sort the data (mirrors Plotly's 'total ascending' ordering)
# RMSE is sorted from lowest to highest (so the best result sits on top)
ablation_df_sorted = ablation_df.sort_values('CV_RMSE', ascending=True)
# 2. Configure the plot style
plt.figure(figsize=(12, 8))
sns.set_theme(style="whitegrid")
# 3. Build the bar plot
# Either 'Blues' (low values light, high values dark)
# or 'Blues_r' (reversed) works as the palette.
ax = sns.barplot(
data=ablation_df_sorted,
x='CV_RMSE',
y='Feature_Combination',
hue='CV_RMSE', # For the color scale
palette='Blues_r', # Matches Plotly's Blues_r
legend=False
)
# 4. Annotate the bars with their values (equivalent of text_auto='.0f')
ax.bar_label(ax.containers[0], fmt='%.0f', padding=3)
# 5. Title and labels
plt.title('Ablation Study: 5-Fold CV RMSE by Hierarchical Feature Combinations (Lower is Better)', fontsize=14)
plt.xlabel('CV RMSE')
plt.ylabel('Feature Combination')
# Remove unnecessary spines for a Plotly White look
sns.despine(left=True, bottom=True)
plt.tight_layout()
plt.show()
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool
# 1. Use the outlier-cleaned dataset from Phase 6 (X_clean, y_clean) directly
# y_clean_log = np.log1p(y_clean)  # Log-target transformation removed; train on raw prices
# Set up K-Fold Cross-Validation for the final evaluation
kf_final = KFold(n_splits=5, shuffle=True, random_state=42)
final_cv_scores = []
# Drop the features that LOFO flagged as near-zero or negative contributors
X_clean = X_clean.drop(columns=["gb_fuel", "gb_transmission"])
cat_features = [col for col in X_clean.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]
cat_indices = [X_clean.columns.get_loc(col) for col in cat_features]
print("--- Training Refined Final Model with 5-Fold CV (GPU) ---")
for fold, (train_idx, val_idx) in enumerate(kf_final.split(X_clean, y_clean)):
X_tr, X_va = X_clean.iloc[train_idx], X_clean.iloc[val_idx]
y_tr, y_va = y_clean.iloc[train_idx], y_clean.iloc[val_idx] # Use original y_clean
# Pools
tr_pool = Pool(data=X_tr, label=y_tr, cat_features=cat_indices, text_features=text_indices)
va_pool = Pool(data=X_va, label=y_va, cat_features=cat_indices, text_features=text_indices)
# Initialize and fit CatBoostRegressor
model_cv = CatBoostRegressor(
iterations=1500,
learning_rate=0.05,
depth=8, # Slightly deeper for the final model to capture complex interactions
loss_function='RMSE',
eval_metric='RMSE',
early_stopping_rounds=150,
verbose=False,
random_seed=42,
**gpu_params
)
model_cv.fit(tr_pool, eval_set=va_pool)
# Evaluate on validation set in original price scale for interpretable RMSE
preds_price = model_cv.predict(va_pool) # Direct prediction, no expm1 needed
y_va_price = y_va # Use original y_va
rmse_price = np.sqrt(mean_squared_error(y_va_price, preds_price))
final_cv_scores.append(rmse_price)
print(f"Fold {fold + 1} RMSE (Price): {rmse_price:,.0f} TL")
print(f"\nFinal Cleaned CV RMSE: {np.mean(final_cv_scores):,.0f} TL (±{np.std(final_cv_scores):,.0f} TL)")
# 2. Train the Final Production Model on the entire clean dataset
final_full_pool = Pool(data=X_clean, label=y_clean, cat_features=cat_indices, text_features=text_indices) # Use original y_clean
final_model = CatBoostRegressor(
iterations=1500,
learning_rate=0.05,
depth=8,
loss_function='RMSE',
verbose=250,
random_seed=42,
**gpu_params
)
print("\n--- Training Final Production Model on Full Clean Dataset ---")
final_model.fit(final_full_pool)
# 3. Save the model artifacts
model_filename = 'catboost_price_predictor_final.cbm'
final_model.save_model(model_filename)
print(f"\nSuccess: Final model saved as '{model_filename}'")
--- Training Refined Final Model with 5-Fold CV (GPU) --- Fold 1 RMSE (Price): 142,988 TL Fold 2 RMSE (Price): 137,344 TL Fold 3 RMSE (Price): 138,810 TL Fold 4 RMSE (Price): 137,884 TL Fold 5 RMSE (Price): 135,478 TL Final Cleaned CV RMSE: 138,501 TL (±2,493 TL) --- Training Final Production Model on Full Clean Dataset --- 0: learn: 999465.6109703 total: 18.2ms remaining: 27.3s 250: learn: 148304.0677714 total: 4.29s remaining: 21.3s 500: learn: 137700.4829686 total: 8.37s remaining: 16.7s 750: learn: 133015.6479159 total: 12.8s remaining: 12.7s 1000: learn: 130233.8410758 total: 17.2s remaining: 8.58s 1250: learn: 127973.3492347 total: 21.5s remaining: 4.29s 1499: learn: 126393.8429959 total: 25.7s remaining: 0us Success: Final model saved as 'catboost_price_predictor_final.cbm'
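The summary line can be reproduced from the per-fold RMSEs; note that np.std defaults to the population standard deviation (ddof=0), which statistics.pstdev matches. A sketch using the rounded fold values printed above:

```python
import statistics

# Per-fold RMSEs (TL), rounded, from the CV run above
fold_rmses = [142_988, 137_344, 138_810, 137_884, 135_478]

mean_rmse = statistics.mean(fold_rmses)   # equivalent to np.mean
std_rmse = statistics.pstdev(fold_rmses)  # equivalent to np.std with ddof=0

print(f"{mean_rmse:,.0f} TL (±{std_rmse:,.0f} TL)")  # → 138,501 TL (±2,493 TL)
```

If fold-to-fold variability were to be reported as an estimate of generalization uncertainty, the sample standard deviation (statistics.stdev, i.e. ddof=1) would be the slightly more conservative choice.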
from sklearn.metrics import r2_score, mean_absolute_error
# 1. Metrics for the last Validation Fold (Out-of-sample performance)
fold_mae = mean_absolute_error(y_va_price, preds_price)
fold_r2 = r2_score(y_va_price, preds_price)
print("--- Validation Metrics (Last CV Fold - Out-of-Sample) ---")
print(f"R² Score : {fold_r2:.4f}")
print(f"MAE : {fold_mae:,.0f} TL\n")
# 2. Metrics for the Final Production Model (In-sample performance on full clean data)
preds_price_full = final_model.predict(final_full_pool) # Predictions are already in price scale
y_true_price = y_clean # Use original y_clean, no log transformation
full_mae = mean_absolute_error(y_true_price, preds_price_full)
full_r2 = r2_score(y_true_price, preds_price_full)
print("--- Final Model Metrics (Full Clean Dataset - In-Sample) ---")
print(f"R² Score : {full_r2:.4f}")
print(f"MAE : {full_mae:,.0f} TL")
--- Validation Metrics (Last CV Fold - Out-of-Sample) --- R² Score : 0.9825 MAE : 103,333 TL --- Final Model Metrics (Full Clean Dataset - In-Sample) --- R² Score : 0.9856 MAE : 94,797 TL
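The gap between the in-sample R² (0.9856) and the out-of-fold R² (0.9825) is small, suggesting limited overfitting. As a sanity check on the metric itself, R² is 1 − SS_res / SS_tot; a minimal pure-Python sketch of the definition that sklearn's r2_score implements (toy numbers, not the car data):

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy prices vs. predictions (hypothetical values)
print(r2_manual([100, 200, 300, 400], [110, 190, 310, 390]))  # → 0.992
```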