Task¶

Please analyze the used car dataset "cars.csv" to build a price prediction model. Follow these steps:

  1. Setup and EDA: Initialize the notebook, configure GPU settings, and load "cars.csv". Perform EDA, including a heatmap crosstab of kb_body_type vs gb_body_type, and identify quasi-constant columns using feature_engine.selection.DropConstantFeatures (tol=0.99).
  2. Preprocessing: Handle 0 CC and HP anomalies, calculate boundary means (power_hp_val and engine_cc_val), drop redundant columns, and apply hierarchical median imputation for technical features.
  3. Feature Engineering: Write a fully English-documented preprocess_damage_score function to aggregate expertise columns into an expert_risk_score, and create derived features like vehicle age.
  4. Baseline Modeling: Establish a cross-validation framework and train a GPU-accelerated CatBoost baseline model, explicitly assigning the model column as a text feature.
  5. Outlier Analysis: Detect anomalies using Z-Scores on CatBoost residuals and Isolation Forest. Consolidate the flags into a single decision DataFrame and create an interactive Plotly chart showing which car models are most frequently flagged as outliers.
  6. LOFO & Ablation Study: Implement a well-documented custom LOFO function for feature selection. Then, perform an ablation study by training separate models using only 'brand', only 'series', and only 'model' (as a text feature) to compare their individual impacts.
  7. Final Refined Model: Train the final CatBoost model on the outlier-cleaned, cross-validated dataset using the optimal features from the LOFO and ablation study. Output comparison metrics and save the model artifacts.

Ensure the notebook uses clear Markdown hierarchy and English docstrings for all functions.

Phase 0: Setup and Data Loading¶

Subtask:¶

Initialize notebook, import libraries, configure GPU, and load data.

Reasoning: Install the required package feature_engine as instructed.

In [5]:
!pip install feature_engine
Requirement already satisfied: feature_engine in /usr/local/lib/python3.12/dist-packages (1.9.4)
In [6]:
!pip install catboost
Requirement already satisfied: catboost in /usr/local/lib/python3.12/dist-packages (1.2.10)
In [7]:
!pip install --upgrade plotly
Requirement already satisfied: plotly in /usr/local/lib/python3.12/dist-packages (6.6.0)

Reasoning: Import the required libraries, set the random seed for reproducibility, configure GPU parameters for CatBoost, and load the dataset into a pandas DataFrame.

In [8]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostRegressor, Pool
import sklearn
from feature_engine.selection import DropConstantFeatures
import os
import warnings

warnings.filterwarnings("ignore")

# Set random seed
np.random.seed(42)

# Configure CatBoost GPU parameters
gpu_params = {
    'task_type': 'GPU',
    'devices': '0',
    'bootstrap_type': 'Poisson'
}

# Load data
df_raw = pd.read_csv('cars.csv')
print(f"Dataset shape: {df_raw.shape}")
Dataset shape: (13904, 94)
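The GPU settings above assume a CUDA device is visible. A defensive sketch (the helper name `make_task_params` is ours, not from the notebook) that falls back to CPU when CatBoost or a GPU is unavailable — note that the Poisson bootstrap is supported only on GPU:

```python
def make_task_params(devices: str = '0') -> dict:
    """Return CatBoost task parameters, falling back to CPU when no GPU is present."""
    try:
        # catboost.utils.get_gpu_device_count reports the number of visible CUDA devices
        from catboost.utils import get_gpu_device_count
        n_gpus = get_gpu_device_count()
    except Exception:  # catboost missing or CUDA not available
        n_gpus = 0
    if n_gpus > 0:
        # 'Poisson' bootstrap is GPU-only in CatBoost
        return {'task_type': 'GPU', 'devices': devices, 'bootstrap_type': 'Poisson'}
    return {'task_type': 'CPU'}

print(make_task_params())
```

Passing the resulting dict as `**make_task_params()` to `CatBoostRegressor` keeps the rest of the notebook unchanged whether or not a GPU is present.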

Phase 1: EDA & Variance Analysis¶

Subtask:¶

Perform Exploratory Data Analysis including a heatmap of body types and identify quasi-constant columns using feature_engine.

Reasoning: I will create a code block to calculate the crosstab between kb_body_type and gb_body_type, visualize it using a seaborn heatmap, and use DropConstantFeatures to identify quasi-constant columns.

In [9]:
# Crosstab and Heatmap
plt.figure(figsize=(10, 8))
body_type_crosstab = pd.crosstab(df_raw['kb_body_type'], df_raw['gb_body_type'])
sns.heatmap(body_type_crosstab, annot=True, fmt='d', cmap='Blues')
plt.title('Crosstab of kb_body_type vs gb_body_type')
plt.xlabel('gb_body_type')
plt.ylabel('kb_body_type')
plt.show()

# Identify Quasi-constant features
constant_dropper = DropConstantFeatures(tol=0.99, missing_values='ignore')
constant_dropper.fit(df_raw)

print(f"Quasi-constant columns to drop (tol=0.99): {constant_dropper.features_to_drop_}")
No description has been provided for this image
Quasi-constant columns to drop (tol=0.99): ['kb_condition', 'gb_usage_type', 'gb_is_first_owner', 'tavan_degisen', 'tavan_boyali', 'tavan_lokal', 'bagaj_degisen', 'bagaj_boyali', 'bagaj_lokal', 'fender_rl_degisen', 'fender_rr_degisen']
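For intuition, `tol=0.99` flags a column when its single most frequent value accounts for at least 99% of observations. A minimal re-implementation of that rule on toy data (the helper name and toy frame are ours, not feature_engine's API):

```python
import pandas as pd

def quasi_constant_cols(df: pd.DataFrame, tol: float = 0.99) -> list:
    """Columns whose dominant value covers at least `tol` of non-null rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the most frequent value
        top_share = df[col].value_counts(normalize=True, dropna=True).iloc[0]
        if top_share >= tol:
            flagged.append(col)
    return flagged

toy = pd.DataFrame({
    'almost_const': ['A'] * 99 + ['B'],  # 99% 'A' -> flagged at tol=0.99
    'balanced': ['X', 'Y'] * 50,         # 50/50 split -> kept
})
print(quasi_constant_cols(toy))  # ['almost_const']
```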

"I am dropping the variables kb_condition, gb_usage_type, and gb_is_first_owner, while keeping the remaining features for damage scoring. Regarding body type, I will initially prioritize kb_body_type due to its more generalized structure; however, I plan to perform a LOFO (Leave-One-Feature-Out) importance analysis to finalise the selection between the two."

Phase 1b: Numerical Feature Relationship Analysis¶

Subtask:¶

Check df_raw for perfectly identical (duplicate) columns to detect redundant features before analyzing numerical relationships.

In [10]:
print("Checking for identical columns in the dataset...")

# Dictionary to store groups of identical columns
identical_groups = {}
cols = df_raw.columns.tolist()

# Track columns already assigned to a group to avoid redundant checks
checked = set()

for i in range(len(cols)):
    col1 = cols[i]
    if col1 in checked:
        continue

    current_group = [col1]
    for j in range(i + 1, len(cols)):
        col2 = cols[j]
        if col2 in checked:
            continue

        # Check if columns are equal (pandas .equals() handles NaNs correctly)
        if df_raw[col1].equals(df_raw[col2]):
            current_group.append(col2)
            checked.add(col2)

    if len(current_group) > 1:
        identical_groups[col1] = current_group

if identical_groups:
    print("\nFound identical columns:")
    for key, group in identical_groups.items():
        print(f"  - Group: {group}")
else:
    print("\nNo perfectly identical columns found.")
Checking for identical columns in the dataset...

Found identical columns:
  - Group: ['kb_year', 'gb_year']
  - Group: ['kb_mileage', 'gb_mileage']
  - Group: ['kb_transmission', 'gb_transmission']
  - Group: ['kb_fuel', 'gb_fuel']
  - Group: ['kb_color', 'gb_color']
  - Group: ['kb_is_heavy_damaged', 'is_heavy_damaged']
  - Group: ['bagaj_degisen', 'bagaj_boyali', 'bagaj_lokal']

Phase 2: Drop Columns and Fill Missing Values with Group-Based Imputation¶

During our initial data validation, we identified several pairs of features containing perfectly identical data across all rows (such as kb_year vs. gb_year, and kb_transmission vs. gb_transmission). These duplicates likely originate from overlapping data fields in the scraped source website (e.g., general listing specifications versus detailed technical tabs). Retaining these identical columns provides zero additional information to the predictive model and introduces perfect multicollinearity. By systematically removing one column from each duplicate pair, we streamline the feature space, reduce computational overhead, and eliminate redundant noise, ensuring our machine learning model trains more efficiently and robustly.
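The dedup step described above can be sketched as a single generic pass that keeps the first member of each identical group (toy frame for illustration; the notebook itself drops an explicit `identical_cols` list later):

```python
import pandas as pd

def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first column of each group of perfectly identical columns."""
    seen, to_drop = {}, []
    for col in df.columns:
        for kept in seen:
            if df[col].equals(df[kept]):  # .equals() treats aligned NaNs as equal
                to_drop.append(col)
                break
        else:
            seen[col] = True
    return df.drop(columns=to_drop)

toy = pd.DataFrame({'kb_year': [2019, 2020],
                    'gb_year': [2019, 2020],   # identical to kb_year -> dropped
                    'price': [10, 12]})
print(list(drop_duplicate_columns(toy).columns))  # ['kb_year', 'price']
```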

In [11]:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Plot for Horsepower
sns.scatterplot(data=df_raw, x='power_hp_low', y='power_hp_up',
                alpha=0.5, color='#3498db', s=20, ax=axes[0])
axes[0].set_title('Distribution of HP Lower vs Upper Bounds (Before Imputation)')
axes[0].set_xlabel('HP Lower Bound')
axes[0].set_ylabel('HP Upper Bound')

# 2. Plot for Engine CC
sns.scatterplot(data=df_raw, x='engine_cc_low', y='engine_cc_up',
                alpha=0.5, color='#e74c3c', s=20, ax=axes[1])
axes[1].set_title('Distribution of Engine CC Lower vs Upper Bounds (Before Imputation)')
axes[1].set_xlabel('Engine CC Lower Bound')
axes[1].set_ylabel('Engine CC Upper Bound')

plt.tight_layout()
plt.show()
[Figure: side-by-side scatter plots of HP and engine CC lower vs upper bounds]
In [12]:

import seaborn as sns
import matplotlib.pyplot as plt

# Check that the required columns exist
if 'torque_nm' in df_raw.columns and 'power_hp_val' in df_raw.columns:
    # Set the plot theme and figure size
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(10, 6))

    # Create the scatter plot
    scatter = sns.scatterplot(
        data=df_raw,
        x='power_hp_val',
        y='torque_nm',
        alpha=0.6,
        edgecolor=None
    )

    # Add title and axis labels
    plt.title('Torque (Nm) vs. Horsepower (HP)', fontsize=15)
    plt.xlabel('Horsepower (HP)')
    plt.ylabel('Torque (Nm)')

    # Note: Plotly's hover_data (tooltips) has no static equivalent
    # in seaborn/matplotlib.

    plt.show()

    # Compute the correlation (still using pandas)
    correlation = df_raw['power_hp_val'].corr(df_raw['torque_nm'])
    print(f"\nPearson Correlation between Horsepower and Torque: {correlation:.4f}")

else:
    print("One or both of 'torque_nm' and 'power_hp_val' columns are not available in the DataFrame.")
[Figure: scatter plot of torque (Nm) vs horsepower (HP)]
Pearson Correlation between Horsepower and Torque: 0.7623

We will use the upper bound (engine_cc_up) directly as engine_cc_val.

In [13]:
# 1. Derive point estimates: mean of the HP bounds; upper bound for engine CC
df_raw['power_hp_val'] = df_raw[['power_hp_low', 'power_hp_up']].mean(axis=1)
df_raw['engine_cc_val'] = df_raw['engine_cc_up']

print(f"Remaining missing values in power_hp_val after boundary mean: {df_raw['power_hp_val'].isnull().sum()}")
print(f"Remaining missing values in engine_cc_val after boundary mean: {df_raw['engine_cc_val'].isnull().sum()}")

# 2. Drop quasi-constant, boundary, listing-metadata, and identical columns
boundary_cols = ['power_hp_low', 'power_hp_up', 'engine_cc_low', 'engine_cc_up']
quasi_constant_cols = ['kb_condition', 'gb_usage_type', 'gb_is_first_owner']
info_cols = ['ad_id', 'listing_date', 'ad_title', 'location', 'power_hp_is_range', 'description_text', 'scraped_at', 'search_date']
identical_cols = ['kb_year', 'kb_mileage', 'kb_transmission', 'kb_fuel', 'kb_color', 'kb_is_heavy_damaged']

cols_to_drop = list(set(quasi_constant_cols + boundary_cols + info_cols + identical_cols))
cols_to_drop = [c for c in cols_to_drop if c in df_raw.columns]

df_raw = df_raw.drop(columns=cols_to_drop)
print(f"Dropped {len(cols_to_drop)} columns.\n")
print(f"Dataset shape after dropped features: {df_raw.shape}")
Remaining missing values in power_hp_val after boundary mean: 218
Remaining missing values in engine_cc_val after boundary mean: 185
Dropped 21 columns.

Dataset shape after dropped features: (13904, 73)

2b: Looking For Missing Value Patterns¶

In [14]:
import matplotlib.pyplot as plt
import seaborn as sns

print("\n--- Missing Values Heatmap ---")
plt.figure(figsize=(15, 8))
sns.heatmap(df_raw.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in DataFrame')
plt.show()
--- Missing Values Heatmap ---
[Figure: missing-values heatmap across all columns]
In [15]:
# print("--- Missing Values Heatmap ---")
# # Calculate missing values
# missing_values = df_raw.isnull().sum()
# missing_percent = (missing_values / len(df_raw)) * 100

# # Create a DataFrame for missing values
# missing_info = pd.DataFrame({
#     'Missing Count': missing_values,
#     'Missing Percent': missing_percent
# }).sort_values(by='Missing Percent', ascending=False)

# # Filter for columns with actual missing values
# missing_info = missing_info[missing_info['Missing Count'] > 0]

# # Print top 10 missing rates
# print("\n--- Top 10 Columns with Missing Values ---")
# print(missing_info.head(10).to_string())

# # Plotting missing values
# if not missing_info.empty:
#     fig = px.bar(
#         missing_info.head(20),
#         y='Missing Percent',
#         x=missing_info.head(20).index,
#         title='Top 20 Columns with Missing Values (% Remaining)',
#         labels={'x': 'Column Name', 'y': 'Missing Percentage'},
#         color='Missing Percent',
#         color_continuous_scale='Viridis',
#         height=500
#     )
#     fig.update_layout(xaxis={'categoryorder': 'total descending'}, template='plotly_white')
#     fig.show()
# else:
#     print("No missing values found in the dataset.")

print("--- Missing Values Analysis ---")

# Compute missing counts and percentages
missing_values = df_raw.isnull().sum()
missing_percent = (missing_values / len(df_raw)) * 100

# Build a DataFrame of missing-value information
missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percent': missing_percent
}).sort_values(by='Missing Percent', ascending=False)

# Keep only columns that actually have missing values
missing_info = missing_info[missing_info['Missing Count'] > 0]

# Print the top 10 rows
print("\n--- Top 10 Columns with Missing Values ---")
print(missing_info.head(10).to_string())

# Visualization with seaborn
if not missing_info.empty:
    # Set the plot theme and figure size
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(12, 6))

    # Take the top 20 columns
    top_20_missing = missing_info.head(20)

    # Bar plot: column names on the x-axis, missing percentages on the y-axis
    ax = sns.barplot(
        x=top_20_missing.index,
        y=top_20_missing['Missing Percent'],
        palette='viridis'  # color scale
    )

    # Title and labels
    plt.title('Top 20 Columns with Missing Values (%)', fontsize=16)
    plt.xlabel('Column Name', fontsize=12)
    plt.ylabel('Missing Percentage (%)', fontsize=12)

    # Rotate x tick labels (column names) for readability
    plt.xticks(rotation=45, ha='right')

    # Optional: annotate each bar with its percentage value:
    # for p in ax.patches:
    #    ax.annotate(f'{p.get_height():.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
    #                ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points')

    plt.tight_layout()
    plt.show()
else:
    print("No missing values found in the dataset.")
--- Missing Values Analysis ---

--- Top 10 Columns with Missing Values ---
                   Missing Count  Missing Percent
tramer_fee                 11778        84.709436
gb_mtv_yearly               5590        40.204258
kb_fuel_cons_avg            5435        39.089471
city_fuel_cons              4328        31.127733
highway_fuel_cons           4328        31.127733
weight_kg                   3800        27.330265
kb_fuel_tank                3784        27.215190
trunk_capacity_lt           3784        27.215190
accel_0_100                 3776        27.157652
torque_nm                   3775        27.150460
[Figure: bar chart of the top 20 columns by missing percentage]
In [16]:
# Calculate missing values and percentages
missing_values = df_raw.isnull().sum()
missing_percent = (missing_values / len(df_raw)) * 100

# Create a DataFrame for missing information and filter for actual missing values
missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percent': missing_percent
}).sort_values(by='Missing Percent', ascending=False)
missing_info = missing_info[missing_info['Missing Count'] > 0]

print("\n--- Systemic Missingness Patterns (Columns Grouped by Similar Missing %) ---")
# Group columns by rounded missing percentage to reveal systemic, block-wise missingness
# (rounding to 2 decimal places catches near-identical percentages)
missing_groups = missing_info.groupby(missing_info['Missing Percent'].round(2))['Missing Count'].agg(
    columns=lambda x: list(x.index)
)

# Sort by missing percentage (the group index), descending
missing_groups = missing_groups.sort_index(ascending=False)

for percent, group_data in missing_groups.iterrows():
    # Print multi-column groups, plus single columns missing at least 20%
    if len(group_data['columns']) > 1 or percent >= 20.0:
        print(f"  Missing ~{percent:.2f}%: {group_data['columns']}")

# Also explicitly highlight the top 5 individual highest missing columns as they are clearly systemic
print("\n--- Top 5 Columns with Highest Missing Percentages (most systemic) ---")
for index, row in missing_info.head(5).iterrows():
    print(f"  - {index}: {row['Missing Percent']:.2f}% (Count: {int(row['Missing Count'])})")
--- Systemic Missingness Patterns (Columns Grouped by Similar Missing %) ---
  Missing ~84.71%: ['tramer_fee']
  Missing ~40.20%: ['gb_mtv_yearly']
  Missing ~39.09%: ['kb_fuel_cons_avg']
  Missing ~31.13%: ['city_fuel_cons', 'highway_fuel_cons']
  Missing ~27.33%: ['weight_kg']
  Missing ~27.22%: ['kb_fuel_tank', 'trunk_capacity_lt']
  Missing ~27.16%: ['accel_0_100']
  Missing ~27.15%: ['torque_nm']
  Missing ~27.14%: ['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment', 'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']
  Missing ~21.65%: ['gb_warranty_status']

--- Top 5 Columns with Highest Missing Percentages (most systemic) ---
  - tramer_fee: 84.71% (Count: 11778)
  - gb_mtv_yearly: 40.20% (Count: 5590)
  - kb_fuel_cons_avg: 39.09% (Count: 5435)
  - city_fuel_cons: 31.13% (Count: 4328)
  - highway_fuel_cons: 31.13% (Count: 4328)
In [17]:
print("\n--- Missing Values Heatmap ---")
plt.figure(figsize=(15, 8))
sns.heatmap(df_raw[['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment', 'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']].isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in DataFrame')
plt.show()
--- Missing Values Heatmap ---
[Figure: missing-values heatmap restricted to the dimension/spec columns]

The same rows are missing across all of these dimension/spec columns, confirming that the missingness is systemic (row-level) rather than column-specific.
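The visual pattern can also be confirmed numerically: check the share of rows where these columns are missing or present together (hypothetical helper on a toy frame; a value near 1.0 means the missingness is row-level):

```python
import numpy as np
import pandas as pd

def comissing_share(df: pd.DataFrame, cols: list) -> float:
    """Share of rows where the given columns are either all missing or all present."""
    mask = df[cols].isnull()
    all_or_none = mask.all(axis=1) | (~mask).all(axis=1)
    return float(all_or_none.mean())

toy = pd.DataFrame({
    'wheelbase_mm': [2700, np.nan, 2600, np.nan],
    'width_mm':     [1800, np.nan, 1790, np.nan],
})
print(comissing_share(toy, ['wheelbase_mm', 'width_mm']))  # 1.0 -> fully systemic
```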

In [18]:
print("\n--- Distribution Plots for ALL Features ---")

# Identify numerical and categorical columns from df_raw
# Exclude 'price' from numerical for plotting general distributions, as it's the target
# Exclude text features like 'model' and 'series' from categorical plotting as they have too many unique values for countplot

all_num_cols = df_raw.select_dtypes(include=[np.number]).columns.tolist()
# Exclude target variable and any potential ID columns or scores that are not raw numerical features
num_features_to_plot = all_num_cols
all_cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
# Exclude text features (model, series) and other high cardinality or irrelevant categorical features for countplots
cat_features_to_plot = [col for col in all_cat_cols if col not in ['model', 'series', 'brand', 'gb_color']]


# --- Plot Numerical Features ---
if num_features_to_plot:
    print("\n--- Numerical Feature Distributions ---")
    n_num = len(num_features_to_plot)
    n_cols = 4  # Number of columns for the subplot grid
    n_rows = (n_num + n_cols - 1) // n_cols # Calculate rows needed

    fig_num, axes_num = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4))
    axes_num = axes_num.flatten() # Flatten for easy iteration

    for i, col in enumerate(num_features_to_plot):
        sns.histplot(df_raw[col].dropna(), kde=True, ax=axes_num[i], color='#3498db')
        axes_num[i].set_title(f'Distribution of {col}')
        axes_num[i].set_xlabel('') # Clear x-label to avoid clutter
        axes_num[i].set_ylabel('') # Clear y-label

    # Hide any unused subplots
    for j in range(i + 1, len(axes_num)):
        fig_num.delaxes(axes_num[j])

    plt.tight_layout()
    plt.suptitle('Numerical Feature Distributions', y=1.02, fontsize=16)
    plt.show()
else:
    print("No numerical features to plot.")


# --- Plot Categorical Features ---
if cat_features_to_plot:
    print("\n--- Categorical Feature Distributions ---")
    n_cat = len(cat_features_to_plot)
    n_cols = 3 # Number of columns for the subplot grid
    n_rows = (n_cat + n_cols - 1) // n_cols # Calculate rows needed

    fig_cat, axes_cat = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 5))
    axes_cat = axes_cat.flatten() # Flatten for easy iteration

    for i, col in enumerate(cat_features_to_plot):
        sns.countplot(y=df_raw[col].dropna(), ax=axes_cat[i], order=df_raw[col].value_counts().index, palette='viridis')
        axes_cat[i].set_title(f'Distribution of {col}')
        axes_cat[i].set_xlabel('')
        axes_cat[i].set_ylabel('')

    # Hide any unused subplots
    for j in range(i + 1, len(axes_cat)):
        fig_cat.delaxes(axes_cat[j])

    plt.tight_layout()
    plt.suptitle('Categorical Feature Distributions', y=1.02, fontsize=16)
    plt.show()
else:
    print("No categorical features to plot.")
--- Distribution Plots for ALL Features ---

--- Numerical Feature Distributions ---
[Figure: grid of numerical feature histograms]
--- Categorical Feature Distributions ---
[Figure: grid of categorical feature count plots]
In [19]:
df_raw['cylinder_count'].value_counts()
# Values appear scaled by 10x; convert to actual cylinder counts below
Out[19]:
cylinder_count
40.0    8516
30.0    1339
60.0     245
80.0      30
Name: count, dtype: int64

In [20]:
df_raw['cylinder_count'] = df_raw['cylinder_count'] / 10
In [21]:
df_raw['cylinder_count'].value_counts()
Out[21]:
cylinder_count
4.0    8516
3.0    1339
6.0     245
8.0      30
Name: count, dtype: int64

In [22]:
# 4. Hierarchical median imputation for numeric columns with missing values
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns
cols_with_nan = [col for col in numeric_cols if df_raw[col].isnull().any()]

# Define the expanded hierarchy
hierarchy_levels = [
    (['model', 'gb_year'], 'Model & Year'),
    (['model'], 'Model'),
    (['series', 'gb_year'], 'Series & Year'),
    (['series'], 'Series'),
    (['brand', 'gb_year'], 'Brand & Year'),
    (['brand'], 'Brand')
]

for col in cols_with_nan:
    # Initialize tracking column with 'Original' for non-missing data
    df_raw[f'{col}_imputed'] = 'Original'

    for group_cols, level_name in hierarchy_levels:
        if all(c in df_raw.columns for c in group_cols):
            missing_before = df_raw[col].isnull()
            if missing_before.sum() == 0:
                continue

            # Calculate group median and fill
            df_raw[col] = df_raw.groupby(group_cols)[col].transform(lambda x: x.fillna(x.median()))

            # Find newly filled rows and update the tracker
            filled_mask = missing_before & ~df_raw[col].isnull()
            df_raw.loc[filled_mask, f'{col}_imputed'] = level_name

    # Final fallback: Overall median
    missing_before = df_raw[col].isnull()
    if missing_before.sum() > 0:
        df_raw[col] = df_raw[col].fillna(df_raw[col].median())
        filled_mask = missing_before & ~df_raw[col].isnull()
        df_raw.loc[filled_mask, f'{col}_imputed'] = 'Overall Median'

print(f"Dataset shape after preprocessing: {df_raw.shape}")
print(f"Remaining missing values in power_hp_val overall: {df_raw['power_hp_val'].isnull().sum()}")
print(f"Remaining missing values in engine_cc_val overall: {df_raw['engine_cc_val'].isnull().sum()}")

# Show logging summary for the main columns
if 'power_hp_val_imputed' in df_raw.columns:
    print("\n--- Imputation Summary for power_hp_val ---")
    print(df_raw['power_hp_val_imputed'].value_counts())

if 'engine_cc_val_imputed' in df_raw.columns:
    print("\n--- Imputation Summary for engine_cc_val ---")
    print(df_raw['engine_cc_val_imputed'].value_counts())
Dataset shape after preprocessing: (13904, 92)
Remaining missing values in power_hp_val overall: 0
Remaining missing values in engine_cc_val overall: 0

--- Imputation Summary for power_hp_val ---
power_hp_val_imputed
Original         13686
Model & Year       167
Model               45
Series & Year        6
Name: count, dtype: int64

--- Imputation Summary for engine_cc_val ---
engine_cc_val_imputed
Original         13719
Model & Year       144
Model               35
Series & Year        6
Name: count, dtype: int64
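The imputation loop above can be distilled into a small reusable helper — a sketch of the same coarse-to-fine idea (the notebook's in-place version additionally records which hierarchy level filled each row):

```python
import pandas as pd

def hierarchical_median_fill(df: pd.DataFrame, col: str, hierarchy: list) -> pd.Series:
    """Fill NaNs in `col` using group medians over progressively coarser
    groupings, then fall back to the overall median."""
    s = df[col].copy()
    for group_cols in hierarchy:
        if s.isnull().any():
            # Median computed per group on the original column, aligned by index
            s = s.fillna(df.groupby(group_cols)[col].transform('median'))
    return s.fillna(s.median())

toy = pd.DataFrame({
    'model': ['a', 'a', 'b', 'b'],
    'brand': ['x', 'x', 'x', 'x'],
    'hp':    [100.0, None, None, 200.0],
})
print(hierarchical_median_fill(toy, 'hp', [['model'], ['brand']]).tolist())
# [100.0, 100.0, 200.0, 200.0]
```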
In [23]:
# Define the hierarchy levels for mode imputation
hierarchy_levels = [
    (['model', 'gb_year'], 'Model & Year'),
    (['model'], 'Model'),
    (['series', 'gb_year'], 'Series & Year'),
    (['series'], 'Series'),
    (['brand', 'gb_year'], 'Brand & Year'),
    (['brand'], 'Brand')
]

# Columns to apply hierarchical mode imputation to
cols_to_impute = ['gb_segment', 'kb_drivetrain']

print("\n--- Applying Hierarchical Mode Imputation to specified Categorical Columns ---")
for col in cols_to_impute:
    if col not in df_raw.columns:
        print(f"  Skipping '{col}': Column not found in DataFrame.")
        continue

    print(f"  Imputing '{col}'...")
    # Create a copy to avoid SettingWithCopyWarning issues during transform
    temp_series = df_raw[col].copy()

    # Iterate through hierarchy levels for mode imputation
    for group_cols, level_name in hierarchy_levels:
        if all(c in df_raw.columns for c in group_cols):
            # Only try to fill if there are still NaNs in the temporary series
            if temp_series.isnull().any():
                # Use transform with mode. Handle cases where mode might be empty (e.g., all NaNs in group)
                filled_values = df_raw.groupby(group_cols)[col].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else np.nan))
                # Only update the temp_series for the NaNs that were actually filled by this level
                temp_series = temp_series.fillna(filled_values)

    # Final fallback: Overall mode if any NaNs still remain
    if temp_series.isnull().any():
        overall_mode = df_raw[col].mode()[0] if not df_raw[col].mode().empty else 'Unknown'
        # Fill remaining NaNs with the overall mode
        temp_series = temp_series.fillna(overall_mode)

    # Assign the imputed series back to the original DataFrame
    df_raw[col] = temp_series

print("\n--- Final Missing Values Check for Imputed Columns ---")
final_missing_check = df_raw[cols_to_impute].isnull().sum()
if final_missing_check.sum() == 0:
    print("All specified columns have been imputed successfully (no missing values remaining).")
else:
    print("Missing values still found in:")
    print(final_missing_check[final_missing_check > 0])
--- Applying Hierarchical Mode Imputation to specified Categorical Columns ---
  Imputing 'gb_segment'...
  Imputing 'kb_drivetrain'...

--- Final Missing Values Check for Imputed Columns ---
All specified columns have been imputed successfully (no missing values remaining).
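The hierarchical fallback above can be sketched on a toy frame (hypothetical `brand`/`series`/`segment` values): each NaN is first filled with the mode of its most specific group, and anything still missing falls back to broader groups and finally to the overall mode.

```python
import numpy as np
import pandas as pd

# Toy data: one NaN fillable from its (brand, series) group,
# one NaN whose group is entirely missing (needs the fallback).
df = pd.DataFrame({
    "brand":   ["BMW", "BMW", "BMW", "Fiat"],
    "series":  ["3",   "3",   "3",   "Egea"],
    "segment": ["D",   "D",   np.nan, np.nan],
})

col = "segment"
imputed = df[col].copy()

# Walk the hierarchy from most to least specific, filling with the group mode.
for group_cols in [["brand", "series"], ["brand"]]:
    if imputed.isnull().any():
        group_mode = df.groupby(group_cols)[col].transform(
            lambda x: x.mode()[0] if not x.mode().empty else np.nan
        )
        imputed = imputed.fillna(group_mode)

# Final fallback: overall mode.
if imputed.isnull().any():
    imputed = imputed.fillna(df[col].mode()[0])

print(imputed.tolist())  # ['D', 'D', 'D', 'D']
```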
In [24]:

import seaborn as sns
import matplotlib.pyplot as plt

# Check that the required columns exist
if 'torque_nm' in df_raw.columns and 'power_hp_val' in df_raw.columns:

    # Function determining each row's imputation status
    def get_impute_status(row):
        t_imp = str(row.get('torque_nm_imputed', 'Original'))
        p_imp = str(row.get('power_hp_val_imputed', 'Original'))

        if t_imp == 'Original' and p_imp == 'Original':
            return 'Original Data'
        elif t_imp == p_imp:
            return f'Both: {t_imp}'
        elif t_imp == 'Original':
            return f'HP: {p_imp}'
        elif p_imp == 'Original':
            return f'Torque: {t_imp}'
        else:
            return f'Mixed (HP:{p_imp}, Tq:{t_imp})'

    df_raw['imputation_status'] = df_raw.apply(get_impute_status, axis=1)

    # Plot settings
    sns.set_theme(style="whitegrid")
    plt.figure(figsize=(12, 7))

    # Seaborn scatter plot
    scatter = sns.scatterplot(
        data=df_raw,
        x='power_hp_val',
        y='torque_nm',
        hue='imputation_status',  # color by imputation status
        alpha=0.6,
        palette='viridis',        # color palette (optional: 'tab10', 'magma', etc.)
        edgecolor=None
    )

    # Title and axis labels
    plt.title('Torque (Nm) vs. Horsepower (HP)', fontsize=15)
    plt.xlabel('Horsepower (HP)')
    plt.ylabel('Torque (Nm)')

    # Move the legend outside the plot area
    plt.legend(title='Imputation Hierarchy', bbox_to_anchor=(1.05, 1), loc='upper left')

    plt.tight_layout()
    plt.show()

    # Compute the correlation
    correlation = df_raw['power_hp_val'].corr(df_raw['torque_nm'])
    print(f"\nPearson Correlation between Horsepower and Torque: {correlation:.4f}")

else:
    print("One or both of 'torque_nm' and 'power_hp_val' columns are not available.")
[Figure: scatter plot of Torque (Nm) vs. Horsepower (HP), colored by imputation hierarchy]
Pearson Correlation between Horsepower and Torque: 0.7414
In [25]:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# 1. Prepare the data: create a binary flag for torque imputation
df_raw['is_torque_imputed'] = df_raw['torque_nm_imputed'].apply(
    lambda x: 'Original Data' if str(x) == 'Original' else 'Imputed Data'
)

# 2. Analysis: most frequent torque values (the horizontal bands)
print("--- Top 5 Most Frequent Torque Values (The Bands) ---")
top_torques = df_raw['torque_nm'].value_counts().head(5)
print(top_torques)

print("\n--- Imputation Breakdown for these Top Bands ---")
band_analysis = df_raw[df_raw['torque_nm'].isin(top_torques.index)].groupby(['torque_nm', 'is_torque_imputed']).size().unstack(fill_value=0)
print(band_analysis)  # 'print' instead of 'display' for plain-text output

# 3. Side-by-side (facet) scatter plot with seaborn
sns.set_theme(style="whitegrid")

# the 'col' parameter plays the same role as Plotly's 'facet_col'
g = sns.relplot(
    data=df_raw,
    x='power_hp_val',
    y='torque_nm',
    col='is_torque_imputed',
    hue='is_torque_imputed',
    palette={'Original Data': '#3498db', 'Imputed Data': '#e74c3c'},
    alpha=0.3,
    kind='scatter',
    height=5,
    aspect=1.2
)

# Title and label adjustments
g.set_axis_labels("Horsepower (HP)", "Torque (Nm)")
g.fig.suptitle('Causation Check: Are the horizontal bands caused by imputation?', y=1.05, fontsize=14)

plt.show()
--- Top 5 Most Frequent Torque Values (The Bands) ---
torque_nm
250.0    5630
380.0    1517
220.0    1457
400.0    1214
320.0     479
Name: count, dtype: int64

--- Imputation Breakdown for these Top Bands ---
is_torque_imputed  Imputed Data  Original Data
torque_nm                                     
220.0                       203           1254
250.0                      1885           3745
320.0                       210            269
380.0                       484           1033
400.0                       317            897
[Figure: faceted scatter plots of Torque vs. Horsepower, original data vs. imputed data]

The horizontal banding at specific torque levels (e.g., 220, 250, and 400 Nm) is a clear artifact of the hierarchical imputation. Repairing these values here would take too long, so instead of fixing them now I will decide how to handle the affected features during LOFO feature selection.

In [26]:
# Identify debug columns explicitly created for plotting
debug_cols = ['imputation_status', 'is_torque_imputed']

# Identify all imputation tracking columns (ending with '_imputed')
imputed_tracking_cols = [c for c in df_raw.columns if str(c).endswith('_imputed')]

# Combine and keep only those that actually exist in the dataframe
cols_to_drop = [c for c in (debug_cols + imputed_tracking_cols) if c in df_raw.columns]

cols_to_drop.append("engine_cc_is_range")

# Drop them
df_raw.drop(columns=cols_to_drop, inplace=True)

print(f"Removed {len(cols_to_drop)} debug variables.")
print(f"Current dataset shape: {df_raw.shape}")
Removed 23 debug variables.
Current dataset shape: (13904, 72)

Phase 3: Domain Feature Engineering¶

Subtask:¶

Write a fully English-documented preprocess_damage_score function to create an expert_risk_score and generate other derived features like vehicle_age.

Reasoning: I will define the preprocess_damage_score function with clear English documentation, apply the requested logic to aggregate condition columns into an expert_risk_score and drop the original columns. Then, I will apply this function to df_raw, calculate vehicle_age, and print the summary statistics as requested.

In [27]:
import pandas as pd

def preprocess_damage_score(df: pd.DataFrame) -> pd.DataFrame:
    """
    Calculate an aggregated 'expert_risk_score' based on the condition of various car parts
    and remove the original sparse expertise columns.

    Parameters:
    -----------
    df : pd.DataFrame
        The input DataFrame containing vehicle expertise data.

    Returns:
    --------
    pd.DataFrame
        The modified DataFrame with the new 'expert_risk_score' column and without the
        original expertise condition columns.

    Details:
    --------
    Penalty points are assigned based on the severity of the damage:
    - Roof (tavan): Replaced = 150, Painted = 75, Locally Painted = 40
    - Hood (kaput): Replaced = 60, Painted = 30, Locally Painted = 15
    - Trunk (bagaj): Replaced = 40, Painted = 20, Locally Painted = 10
    - Doors (door_*): Replaced = 10, Painted = 5, Locally Painted = 2
    - Fenders (fender_*): Replaced = 8, Painted = 4, Locally Painted = 2
    """
    df_out = df.copy()

    # Identify relevant columns
    door_cols = [c for c in df_out.columns if 'door_' in c]
    fender_cols = [c for c in df_out.columns if 'fender_' in c]

    # Calculate scores using .get() with default 0 to handle missing columns gracefully
    tavan_score = (
        df_out.get('tavan_degisen', 0) * 150 +
        df_out.get('tavan_boyali', 0) * 75 +
        df_out.get('tavan_lokal', 0) * 40
    )

    kaput_score = (
        df_out.get('kaput_degisen', 0) * 60 +
        df_out.get('kaput_boyali', 0) * 30 +
        df_out.get('kaput_lokal', 0) * 15
    )

    bagaj_score = (
        df_out.get('bagaj_degisen', 0) * 40 +
        df_out.get('bagaj_boyali', 0) * 20 +
        df_out.get('bagaj_lokal', 0) * 10
    )

    doors_score = 0
    if door_cols:
        doors_degisen = df_out[[c for c in door_cols if 'degisen' in c]].sum(axis=1) if any('degisen' in c for c in door_cols) else 0
        doors_boyali = df_out[[c for c in door_cols if 'boyali' in c]].sum(axis=1) if any('boyali' in c for c in door_cols) else 0
        doors_lokal = df_out[[c for c in door_cols if 'lokal' in c]].sum(axis=1) if any('lokal' in c for c in door_cols) else 0
        doors_score = doors_degisen * 10 + doors_boyali * 5 + doors_lokal * 2

    fenders_score = 0
    if fender_cols:
        fenders_degisen = df_out[[c for c in fender_cols if 'degisen' in c]].sum(axis=1) if any('degisen' in c for c in fender_cols) else 0
        fenders_boyali = df_out[[c for c in fender_cols if 'boyali' in c]].sum(axis=1) if any('boyali' in c for c in fender_cols) else 0
        fenders_lokal = df_out[[c for c in fender_cols if 'lokal' in c]].sum(axis=1) if any('lokal' in c for c in fender_cols) else 0
        fenders_score = fenders_degisen * 8 + fenders_boyali * 4 + fenders_lokal * 2

    # Total score
    df_out['expert_risk_score'] = tavan_score + kaput_score + bagaj_score + doors_score + fenders_score

    # Clean up: Drop the original detail columns
    cols_to_drop = [
        c for c in df_out.columns
        if any(x in c for x in ['tavan_', 'kaput_', 'bagaj_', 'door_', 'fender_'])
    ]
    df_out.drop(columns=cols_to_drop, inplace=True)

    return df_out

# Apply the function to the dataset
df_raw = preprocess_damage_score(df_raw)

# Create derived feature 'vehicle_age' (reference year: 2026)
year_col = 'kb_year' if 'kb_year' in df_raw.columns else 'gb_year'
df_raw['vehicle_age'] = 2026 - df_raw[year_col]

# Print the results
print(f"Dataset shape after feature engineering: {df_raw.shape}")
print("\nStatistics for 'expert_risk_score' and 'vehicle_age':")
display(df_raw[['expert_risk_score', 'vehicle_age']].describe())
Dataset shape after feature engineering: (13904, 41)

Statistics for 'expert_risk_score' and 'vehicle_age':
       expert_risk_score   vehicle_age
count       13904.000000  13904.000000
mean           12.981085     10.794232
std            21.244036      4.687457
min             0.000000      1.000000
25%             0.000000      8.000000
50%             4.000000     11.000000
75%            18.000000     14.000000
max           274.000000     21.000000
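As a sanity check on the weighting scheme, the penalty arithmetic can be verified on a toy frame (hypothetical column subset; weights copied from `preprocess_damage_score`):

```python
import pandas as pd

# Two toy cars: (1) replaced roof + painted hood, (2) two painted doors + one locally painted fender.
toy = pd.DataFrame({
    "tavan_degisen":   [1, 0],  # roof replaced          -> 150 points
    "kaput_boyali":    [1, 0],  # hood painted           -> 30 points
    "door_fl_boyali":  [0, 1],  # door painted           -> 5 points each
    "door_fr_boyali":  [0, 1],
    "fender_fl_lokal": [0, 1],  # fender locally painted -> 2 points
})

score = (
    toy["tavan_degisen"] * 150
    + toy["kaput_boyali"] * 30
    + (toy["door_fl_boyali"] + toy["door_fr_boyali"]) * 5
    + toy["fender_fl_lokal"] * 2
)
print(score.tolist())  # [180, 12]
```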

Phase 3b: Numerical Feature Relationship Analysis Part 2¶

In [28]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Select numerical columns for correlation analysis
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns
# Drop identifiers if present to avoid spurious noise
features_to_plot = [c for c in numeric_cols if c not in ['ad_id']]

# 1. Scale data before calculating correlation
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_raw[features_to_plot]), columns=features_to_plot)

# Compute the correlation matrix on scaled data
corr_matrix = df_scaled.corr()

# Handle potential NaNs in the correlation matrix (can happen if a column is constant)
corr_matrix = corr_matrix.fillna(0)

# 2. Create a clustered heatmap to visualize feature clusters
clustermap = sns.clustermap(
    corr_matrix,
    cmap='coolwarm',
    center=0,
    annot=False,
    figsize=(12, 12),
    dendrogram_ratio=(.1, .2),
    cbar_pos=(1.02, 0.15, .03, .6),
    linewidths=0.5
)
clustermap.fig.suptitle('Hierarchical Clustermap of Scaled Numerical Features', y=1.03, fontsize=16)
plt.show()

# 3. Detect Multicollinearity (exclude price)
print("=== Multicollinearity Detection (Threshold |r| > 0.85) ===")
predictor_cols = [c for c in corr_matrix.columns if c != 'price']
pred_corr = corr_matrix.loc[predictor_cols, predictor_cols]

multicollinear_pairs = []
for i in range(len(pred_corr.columns)):
    for j in range(i + 1, len(pred_corr.columns)):
        if abs(pred_corr.iloc[i, j]) > 0.85:
            multicollinear_pairs.append((pred_corr.columns[i], pred_corr.columns[j], pred_corr.iloc[i, j]))

multicollinear_pairs.sort(key=lambda x: abs(x[2]), reverse=True)

if multicollinear_pairs:
    for col1, col2, val in multicollinear_pairs:
        print(f"  [!] Highly correlated: {col1} & {col2} (r = {val:.3f})")
else:
    print("  No highly multicollinear pairs found above threshold.")

# 4. Show top correlations specifically with 'price'
if 'price' in corr_matrix.columns:
    print("\n=== Top 7 Features Correlated with Price ===")
    price_corr = corr_matrix['price'].abs().sort_values(ascending=False).drop('price').head(7)
    for col, val in price_corr.items():
        sign = "+" if corr_matrix.loc['price', col] > 0 else "-"
        print(f"  - {col}: {sign}{val:.3f}")
[Figure: hierarchical clustermap of scaled numerical feature correlations]
=== Multicollinearity Detection (Threshold |r| > 0.85) ===
  [!] Highly correlated: gb_year & vehicle_age (r = -1.000)
  [!] Highly correlated: weight_kg & curb_weight_kg (r = 0.953)
  [!] Highly correlated: length_mm & wheelbase_mm (r = 0.946)
  [!] Highly correlated: kb_fuel_cons_avg & highway_fuel_cons (r = 0.937)
  [!] Highly correlated: kb_fuel_cons_avg & city_fuel_cons (r = 0.924)
  [!] Highly correlated: length_mm & trunk_capacity_lt (r = 0.910)
  [!] Highly correlated: city_fuel_cons & highway_fuel_cons (r = 0.904)
  [!] Highly correlated: length_mm & weight_kg (r = 0.901)
  [!] Highly correlated: weight_kg & wheelbase_mm (r = 0.886)
  [!] Highly correlated: length_mm & curb_weight_kg (r = 0.859)
  [!] Highly correlated: trunk_capacity_lt & wheelbase_mm (r = 0.856)

=== Top 7 Features Correlated with Price ===
  - vehicle_age: -0.772
  - gb_year: +0.772
  - gb_mtv_yearly: +0.638
  - gb_mileage: -0.627
  - max_speed_kmh: +0.501
  - weight_kg: +0.500
  - curb_weight_kg: +0.482

I will compare the vehicle_age and gb_year variables during LOFO feature selection; they are perfectly collinear, so we do not need both, and their performance will most likely be identical. The dimension/capacity variables are highly correlated with one another, as are the fuel-consumption variables; both patterns are expected. I will not use these variables: they have many missing values and may hurt performance.
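The nested pair-scan used for the multicollinearity check can also be written with an upper-triangle mask; a minimal sketch on a hypothetical 3x3 correlation matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix with one redundant pair (a, b).
corr = pd.DataFrame(
    [[1.00, 0.95, 0.10],
     [0.95, 1.00, 0.20],
     [0.10, 0.20, 1.00]],
    index=list("abc"), columns=list("abc"),
)

# Keep the strict upper triangle only, then stack into (row, col) -> r pairs.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.85].sort_values(key=np.abs, ascending=False)
print(high)  # only (a, b) with r = 0.95 survives the threshold
```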

Phase 3c: Categorical Feature Relationship Analysis¶

In [29]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

print("=== Phase 1c: Categorical Feature Relationship Analysis ===")

# Select categorical columns for the association analysis
cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
exclude_cats = ['ad_title', 'description_text', 'ad_id', 'listing_date', 'scraped_at', 'search_date', 'location', 'power_hp_is_range']
# Keep every non-excluded column with more than one unique value (no cardinality cap, so 'model' is included)
cat_cols = [c for c in cat_cols if c not in exclude_cats and df_raw[c].nunique() > 1]

# Function for Cramer's V (Symmetric)
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    denom = min((kcorr-1), (rcorr-1))
    if denom <= 0:
        return 0.0
    return np.sqrt(phi2corr / denom)

# Function for Theil's U / Uncertainty Coefficient (Asymmetric)
def theils_u(x, y):
    s_xy = pd.crosstab(x, y)
    pi = s_xy.sum(axis=1) / s_xy.sum().sum()
    entropy_x = ss.entropy(pi)
    if entropy_x == 0:
        return 1.0
    mis = mutual_info_score(x, y)
    return mis / entropy_x

# Calculate matrices
n_cats = len(cat_cols)
cv_matrix = pd.DataFrame(np.zeros((n_cats, n_cats)), index=cat_cols, columns=cat_cols)
tu_matrix = pd.DataFrame(np.zeros((n_cats, n_cats)), index=cat_cols, columns=cat_cols)

print(f"Computing association matrices for {n_cats} categorical features. This might take a few seconds...")
for i in range(n_cats):
    for j in range(n_cats):
        col1, col2 = cat_cols[i], cat_cols[j]
        # Drop NA to compute
        valid_idx = df_raw[[col1, col2]].dropna().index
        x = df_raw.loc[valid_idx, col1].astype(str)
        y = df_raw.loc[valid_idx, col2].astype(str)

        if i == j:
            cv_matrix.loc[col1, col2] = 1.0
            tu_matrix.loc[col1, col2] = 1.0
        else:
            cv_matrix.loc[col1, col2] = cramers_v(x, y)
            # Theil's U is asymmetric: U(x|y) - How much y explains x
            tu_matrix.loc[col1, col2] = theils_u(x, y)

# Plotting
fig, axes = plt.subplots(2, 1, figsize=(14, 20))

# Cramer's V Heatmap
sns.heatmap(cv_matrix,
            annot=True,      # Added: Show values
            fmt=".2f",       # Added: 2 decimal places
            cmap="Blues",
            ax=axes[0],
            vmin=0,
            vmax=1,
            annot_kws={"size": 9}) # Optional: Adjust font size for clarity
axes[0].set_title("Cramer's V (Symmetric Association)", fontsize=16, pad=20)

# Theil's U Heatmap
sns.heatmap(tu_matrix,
            annot=True,      # Added: Show values
            fmt=".2f",       # Added: 2 decimal places
            cmap="Oranges",
            ax=axes[1],
            vmin=0,
            vmax=1,
            annot_kws={"size": 9}) # Optional: Adjust font size for clarity
axes[1].set_title("Theil's U (Asymmetric Uncertainty Coefficient)\nHow much knowing [Column] predicts [Row]", fontsize=16, pad=20)

plt.tight_layout(pad=5.0) # Added padding to prevent title overlap
plt.show()

# --- Interpretation ---
print("\n--- Interpretation ---")
print("1. Cramer's V is a SYMMETRIC metric. It evaluates the overall correlation.")
print("2. Theil's U is an ASYMMETRIC metric. U(Row|Column) shows how much the Column explains the Row.")
print("3. High values (close to 1.0) indicate strong redundancy or predictive power.")

# Highly Related Variables Printout
print("\n=== Top Categorical Associations (Cramer's V > 0.8) ===")
high_cv_pairs = []
for i in range(len(cv_matrix.columns)):
    for j in range(i + 1, len(cv_matrix.columns)):
        val = cv_matrix.iloc[i, j]
        if val > 0.8:
            high_cv_pairs.append((cv_matrix.columns[i], cv_matrix.columns[j], val))

high_cv_pairs.sort(key=lambda x: x[2], reverse=True)
if high_cv_pairs:
    for col1, col2, val in high_cv_pairs:
        print(f"  - {col1} & {col2}: {val:.3f}")
else:
    print("  None found.")

print("\n=== Top Categorical Associations (Theil's U > 0.8) ===")
high_tu_pairs = []
for i in range(len(tu_matrix.columns)):
    for j in range(len(tu_matrix.columns)):
        if i == j: continue
        val = tu_matrix.iloc[i, j]
        if val > 0.8:
            high_tu_pairs.append((tu_matrix.index[i], tu_matrix.columns[j], val))

high_tu_pairs.sort(key=lambda x: x[2], reverse=True)
if high_tu_pairs:
    for target, predictor, val in high_tu_pairs:
        print(f"  - Knowing '{predictor}' predicts '{target}': {val:.3f}")
else:
    print("  None found.")
=== Phase 3c: Categorical Feature Relationship Analysis ===
Computing association matrices for 14 categorical features. This might take a few seconds...
[Figure: Cramer's V and Theil's U association heatmaps for the 14 categorical features]
--- Interpretation ---
1. Cramer's V is a SYMMETRIC metric. It evaluates the overall correlation.
2. Theil's U is an ASYMMETRIC metric. U(Row|Column) shows how much the Column explains the Row.
3. High values (close to 1.0) indicate strong redundancy or predictive power.

=== Top Categorical Associations (Cramer's V > 0.8) ===
  - kb_body_type & gb_body_type: 1.000
  - brand & series: 0.999
  - brand & model: 0.977
  - model & gb_segment: 0.965
  - series & model: 0.941
  - series & gb_segment: 0.915
  - brand & kb_drivetrain: 0.855

=== Top Categorical Associations (Theil's U > 0.8) ===
  - Knowing 'gb_body_type' predicts 'kb_body_type': 1.000
  - Knowing 'model' predicts 'brand': 1.000
  - Knowing 'series' predicts 'brand': 1.000
  - Knowing 'model' predicts 'series': 0.999
  - Knowing 'model' predicts 'gb_segment': 0.996
  - Knowing 'series' predicts 'gb_segment': 0.972
  - Knowing 'model' predicts 'kb_drivetrain': 0.971
  - Knowing 'model' predicts 'kb_body_type': 0.891
  - Knowing 'model' predicts 'gb_fuel': 0.881
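A quick toy check of the asymmetry measured above (the `theils_u` formula is restated so the snippet is self-contained; the model/brand labels are hypothetical): when a child category fully determines its parent, U(parent|child) = 1 while U(child|parent) stays below 1.

```python
import pandas as pd
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

def theils_u(x, y):
    """U(x|y): fraction of H(x) explained by knowing y."""
    px = pd.crosstab(x, y).sum(axis=1)
    h_x = ss.entropy(px / px.sum())
    return 1.0 if h_x == 0 else mutual_info_score(x, y) / h_x

# Each model belongs to exactly one brand, but each brand has two models.
model = ["320i", "520d", "Egea", "Panda"]
brand = ["BMW",  "BMW",  "Fiat", "Fiat"]

print(round(theils_u(brand, model), 3))  # 1.0 -> knowing model pins down brand
print(round(theils_u(model, brand), 3))  # 0.5 -> knowing brand leaves model uncertain
```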
In [30]:
import pandas as pd
import numpy as np
import scipy.stats as ss
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif

def estimate_shannon_entropy(series):
    """Estimate the Shannon entropy (uncertainty) of a continuous or categorical series."""
    # For continuous data (float/int), estimate entropy via histogram binning
    if series.dtype in ['float64', 'int64']:
        # Freedman-Diaconis rule for adaptive bin widths
        counts, _ = np.histogram(series.dropna(), bins='fd')
    else:
        counts = series.value_counts()

    probs = counts / counts.sum()
    return ss.entropy(probs)

def calculate_redundancy(df, target_col, features):
    """For each feature, estimate what fraction of its entropy is explained by the target column."""
    results = []

    for col in features:
        # Drop rows with missing values in either column
        valid_data = df[[col, target_col]].dropna()
        x = valid_data[col]
        y = valid_data[target_col].astype('category').cat.codes

        # 1. The feature's own entropy H(X): total uncertainty
        h_x = estimate_shannon_entropy(x)

        # 2. Mutual information I(X;Y)
        # Regression-based MI for continuous features, classification-based for categoricals
        if x.dtype in ['float64', 'int64']:
            mi = mutual_info_regression(x.values.reshape(-1, 1), y, random_state=42)[0]
        else:
            x_enc = x.astype('category').cat.codes
            mi = mutual_info_classif(x_enc.values.reshape(-1, 1), y, random_state=42)[0]

        # 3. Redundancy ratio (%): how much of the feature's uncertainty does the target explain?
        redundancy_pct = (mi / h_x) * 100 if h_x > 0 else 0

        results.append({
            'Feature': col,
            'Entropy (Total Uncertainty)': h_x,
            'Mutual Info with Model': mi,
            'Redundancy (%)': min(redundancy_pct, 100.0)  # cap at 100%
        })

    return pd.DataFrame(results).sort_values(by='Redundancy (%)', ascending=False)

# --- Application ---
features_to_test = ['wheelbase_mm', 'width_mm', 'length_mm', 'gb_segment',
                   'max_speed_kmh', 'cylinder_count', 'curb_weight_kg', 'height_mm']

redundancy_df = calculate_redundancy(df_raw, 'model', features_to_test)

print("=== Information Redundancy Analysis (Relative to the 'model' Column) ===")
print(redundancy_df.to_string(index=False))

# Decision suggestions
print("\n--- Interpretation and Decision ---")
for _, row in redundancy_df.iterrows():
    if row['Redundancy (%)'] > 90:
        print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% redundant. Already encoded in MODEL; CAN BE DROPPED.")
    elif row['Redundancy (%)'] > 70:
        print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% high dependency. Inspect and consider keeping.")
    else:
        print(f" - [{row['Feature']}]: {row['Redundancy (%)']:.1f}% unique information. Keep in the dataset.")
=== Information Redundancy Analysis (Relative to the 'model' Column) ===
       Feature  Entropy (Total Uncertainty)  Mutual Info with Model  Redundancy (%)
  wheelbase_mm                     2.597758                2.707323      100.000000
      width_mm                     2.831880                2.865687      100.000000
     length_mm                     2.575656                3.223208      100.000000
    gb_segment                     1.195644                1.467208      100.000000
curb_weight_kg                     3.206310                3.422261      100.000000
     height_mm                     2.870829                2.829981       98.577158
 max_speed_kmh                     2.977917                2.742922       92.108745
cylinder_count                     0.000000                0.467000        0.000000

--- Interpretation and Decision ---
 - [wheelbase_mm]: 100.0% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [width_mm]: 100.0% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [length_mm]: 100.0% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [gb_segment]: 100.0% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [curb_weight_kg]: 100.0% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [height_mm]: 98.6% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [max_speed_kmh]: 92.1% redundant. Already encoded in MODEL; CAN BE DROPPED.
 - [cylinder_count]: 0.0% unique information. Keep in the dataset.
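A toy check of the entropy-normalized mutual-information ratio used above, with discrete labels instead of the binned estimator (hypothetical `model`/`segment` values): when the target fully determines the feature, I(X;Y) = H(X) and the ratio hits 100%.

```python
import pandas as pd
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

# 'segment' is fully determined by 'model' in this toy sample.
model   = ["320i", "520d", "Egea", "320i", "Egea", "520d"]
segment = ["D",    "E",    "C",    "D",    "C",    "E"]

h_segment = ss.entropy(pd.Series(segment).value_counts(normalize=True))
mi = mutual_info_score(segment, model)
print(f"Redundancy: {100 * mi / h_segment:.1f}%")  # Redundancy: 100.0%
```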

Phase 4: Cross-Validation & Baseline Modeling¶

Subtask:¶

Set up a cross-validation framework and train a GPU-accelerated CatBoost baseline model using the 'model' column as a text feature.

Reasoning: I will prepare the dataset by removing non-predictive columns, use the raw price as the target variable, handle missing values in categorical/text columns, and set up a 5-fold cross-validation loop to train and evaluate the GPU-accelerated CatBoost baseline model.

In [31]:
df_raw.columns
Out[31]:
Index(['brand', 'series', 'model', 'price', 'engine_cc_val', 'power_hp_val',
       'gb_year', 'gb_mileage', 'gb_transmission', 'gb_fuel', 'kb_body_type',
       'gb_body_type', 'gb_color', 'kb_drivetrain', 'kb_trade_available',
       'kb_seller_type', 'kb_fuel_cons_avg', 'kb_fuel_tank',
       'gb_warranty_status', 'gb_segment', 'gb_mtv_yearly', 'torque_nm',
       'cylinder_count', 'max_speed_kmh', 'accel_0_100', 'city_fuel_cons',
       'highway_fuel_cons', 'length_mm', 'width_mm', 'height_mm', 'weight_kg',
       'curb_weight_kg', 'trunk_capacity_lt', 'wheelbase_mm',
       'is_heavy_damaged', 'tramer_fee', 'count_changed', 'count_painted',
       'count_local_painted', 'expert_risk_score', 'vehicle_age'],
      dtype='object')
In [32]:
print("\n--- Checking for empty categories (missing values in categorical columns) ---")
cat_cols = df_raw.select_dtypes(include=['object', 'category', 'bool']).columns
missing_cat_counts = df_raw[cat_cols].isnull().sum()

if missing_cat_counts.sum() == 0:
    print("No empty categories (missing values) found in categorical columns after imputation.")
else:
    print("Remaining empty categories (missing values) in categorical columns:")
    print(missing_cat_counts[missing_cat_counts > 0])

print("\n--- How CatBoost handles null categories ---")
print("CatBoost treats missing values in categorical features as a special, distinct category.")
print("It learns during training whether these 'missing' values should be grouped with other categories or form their own group for optimal splits.")
print("This means you generally do not need to explicitly impute missing values in categorical columns before training with CatBoost, as it handles them intelligently internally.")
--- Checking for empty categories (missing values in categorical columns) ---
Remaining empty categories (missing values) in categorical columns:
gb_warranty_status    3010
dtype: int64

--- How CatBoost handles null categories ---
CatBoost handles missing values natively only for NUMERICAL features.
For categorical (and text) features, NaN values are not accepted and must be converted to strings before building the Pool.
Filling them with an explicit placeholder such as 'missing' makes missingness its own category, which CatBoost can then use in splits.
In [33]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool

# 1. Prepare dataset
target_col = 'price'
drop_cols = [target_col]
drop_cols = [c for c in drop_cols if c in df_raw.columns]

# Drop rows where target is missing or zero
df_model = df_raw.dropna(subset=[target_col]).copy()
df_model = df_model[df_model[target_col] > 0]

X = df_model.drop(columns=drop_cols)
y = df_model[target_col]

# 2. Handle categorical and text features
text_features = ['model'] if 'model' in X.columns else []
cat_features_all = [col for col in X.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]

# Fill missing values for categorical and text columns with 'missing'
for col in cat_features_all + text_features:
    X[col] = X[col].fillna('missing').astype(str)

# 3. Define 2x2 combinations for Year/Age and KB/GB Body Types
configs = [
    {"name": "vehicle_age + kb_body_type", "drop": ["gb_year", "gb_body_type"]},
    {"name": "vehicle_age + gb_body_type", "drop": ["gb_year", "kb_body_type"]},
    {"name": "gb_year + kb_body_type",     "drop": ["vehicle_age", "gb_body_type"]},
    {"name": "gb_year + gb_body_type",     "drop": ["vehicle_age", "kb_body_type"]},
]

# Pre-build feature sets and indices
datasets = {}
for cfg in configs:
    cols_to_drop = [c for c in cfg["drop"] if c in X.columns]
    X_cfg = X.drop(columns=cols_to_drop)

    c_feats = [c for c in cat_features_all if c in X_cfg.columns]
    t_feats = [c for c in text_features if c in X_cfg.columns]

    datasets[cfg["name"]] = {
        "X": X_cfg,
        "c_idx": [X_cfg.columns.get_loc(c) for c in c_feats],
        "t_idx": [X_cfg.columns.get_loc(c) for c in t_feats],
        "scores": []
    }

# 4. Establish cross-validation framework
kf = KFold(n_splits=5, shuffle=True, random_state=42)

print("Starting 5-fold Cross-Validation with CatBoost (GPU) for 2x2 combinations...")
for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"\n--- Fold {fold + 1} ---")

    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    for cfg_name, data in datasets.items():
        X_train, X_val = data["X"].iloc[train_idx], data["X"].iloc[val_idx]

        train_pool = Pool(data=X_train, label=y_train, cat_features=data["c_idx"], text_features=data["t_idx"])
        val_pool = Pool(data=X_val, label=y_val, cat_features=data["c_idx"], text_features=data["t_idx"])

        model = CatBoostRegressor(
            iterations=1000, learning_rate=0.05, depth=6,
            loss_function='RMSE', eval_metric='RMSE',
            early_stopping_rounds=100, verbose=False,
            random_seed=42, **gpu_params
        )
        model.fit(train_pool, eval_set=val_pool)

        preds = model.predict(val_pool)
        rmse = np.sqrt(mean_squared_error(y_val, preds))
        data["scores"].append(rmse)

        print(f"  {cfg_name:30s} RMSE : {rmse:.4f}")

print("\n=== FINAL COMPARISON (Mean CV RMSE) ===")
for cfg_name, data in datasets.items():
    mean_rmse = np.mean(data["scores"])
    std_rmse = np.std(data["scores"])
    print(f"{cfg_name:30s} : {mean_rmse:.4f} (±{std_rmse:.4f})")
Starting 5-fold Cross-Validation with CatBoost (GPU) for 2x2 combinations...

--- Fold 1 ---
  vehicle_age + kb_body_type     RMSE : 194203.8777
  vehicle_age + gb_body_type     RMSE : 194190.9607
  gb_year + kb_body_type         RMSE : 193190.8369
  gb_year + gb_body_type         RMSE : 192402.1037

--- Fold 2 ---
  vehicle_age + kb_body_type     RMSE : 193917.3367
  vehicle_age + gb_body_type     RMSE : 193861.2270
  gb_year + kb_body_type         RMSE : 193329.4420
  gb_year + gb_body_type         RMSE : 193920.6988

--- Fold 3 ---
  vehicle_age + kb_body_type     RMSE : 179982.0014
  vehicle_age + gb_body_type     RMSE : 179724.9643
  gb_year + kb_body_type         RMSE : 178079.4357
  gb_year + gb_body_type         RMSE : 178011.3103

--- Fold 4 ---
  vehicle_age + kb_body_type     RMSE : 195770.9350
  vehicle_age + gb_body_type     RMSE : 196586.5364
  gb_year + kb_body_type         RMSE : 195930.1660
  gb_year + gb_body_type         RMSE : 197363.5721

--- Fold 5 ---
  vehicle_age + kb_body_type     RMSE : 195289.7405
  vehicle_age + gb_body_type     RMSE : 194321.8538
  gb_year + kb_body_type         RMSE : 192736.4721
  gb_year + gb_body_type         RMSE : 193285.6588

=== FINAL COMPARISON (Mean CV RMSE) ===
vehicle_age + kb_body_type     : 191832.7783 (±5964.3621)
vehicle_age + gb_body_type     : 191737.1084 (±6083.1363)
gb_year + kb_body_type         : 190653.2705 (±6385.7312)
gb_year + gb_body_type         : 190996.6687 (±6707.0361)
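The mean-RMSE gap between the best and worst combination (~1.2k TL) is small next to the ±6k fold-to-fold spread, so a paired per-fold comparison is more telling than comparing means. A minimal sketch, using the fold RMSEs printed above for the two `kb_body_type` variants (values copied from the CV output; this is an illustrative check, not part of the pipeline):

```python
# Fold RMSEs copied from the 5-fold CV output above.
rmse = {
    "vehicle_age + kb_body_type": [194203.8777, 193917.3367, 179982.0014, 195770.9350, 195289.7405],
    "gb_year + kb_body_type":     [193190.8369, 193329.4420, 178079.4357, 195930.1660, 192736.4721],
}

def paired_diffs(a, b):
    """Per-fold differences a_i - b_i; a positive value means b wins that fold."""
    return [x - y for x, y in zip(a, b)]

diffs = paired_diffs(rmse["vehicle_age + kb_body_type"], rmse["gb_year + kb_body_type"])
wins = sum(d > 0 for d in diffs)  # folds where the gb_year variant is better
print(f"gb_year wins {wins}/5 folds; mean paired gap = {sum(diffs)/len(diffs):+.1f}")
```

Because the same folds are reused for every configuration, these paired gaps are directly comparable; the gb_year variant wins on most folds, which supports dropping `vehicle_age` below.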
In [34]:
print("Dropping 'gb_body_type' and 'vehicle_age' to finalize the feature set...")
cols_to_drop_final = ['gb_body_type', 'vehicle_age']
X = X.drop(columns=[c for c in cols_to_drop_final if c in X.columns])

# Assign text and cat features to variables for the next steps
text_features = ['model'] if 'model' in X.columns else []
cat_features = [col for col in X.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]

# Get integer indices for CatBoost
cat_indices = [X.columns.get_loc(col) for col in cat_features]
text_indices = [X.columns.get_loc(col) for col in text_features]

print(f"\nFinal feature count: {X.shape[1]}")
print(f"Categorical features ({len(cat_features)}): {cat_features}")
print(f"Text features ({len(text_features)}): {text_features}")
Dropping 'gb_body_type' and 'vehicle_age' to finalize the feature set...

Final feature count: 38
Categorical features (12): ['brand', 'series', 'gb_transmission', 'gb_fuel', 'kb_body_type', 'gb_color', 'kb_drivetrain', 'kb_trade_available', 'kb_seller_type', 'gb_warranty_status', 'gb_segment', 'is_heavy_damaged']
Text features (1): ['model']

Phase 5: Outlier Analysis¶

Subtask:¶

Perform anomaly detection using Z-Scores on CatBoost residuals and Isolation Forest. Consolidate the flags into a single decision DataFrame and create an interactive Plotly chart showing which car models are most frequently flagged as outliers.

In [35]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest
import plotly.express as px

# 1. Get CatBoost residuals on the full dataset
full_pool = Pool(data=X, label=y, cat_features=cat_indices, text_features=text_indices)
baseline_full = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    loss_function='RMSE',
    verbose=False,
    random_seed=42,
    **gpu_params
)
baseline_full.fit(full_pool)
preds_full = baseline_full.predict(full_pool)
residuals = y - preds_full

# 2. Calculate Z-Scores of residuals
z_scores = np.abs(stats.zscore(residuals))
outlier_zscore = z_scores > 3 # Flag as outlier if |Z| > 3

# 3. Isolation Forest on numeric features
# Fill missing values with 0 just for the Isolation Forest (if any remain)
numeric_X = X.select_dtypes(include=[np.number]).fillna(0)
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_iso = iso_forest.fit_predict(numeric_X) == -1

# 4. Consolidate into a Decision DataFrame
outlier_df = pd.DataFrame({
    'brand': X['brand'],
    'series': X['series'],
    'model': X['model'],
    'price': y,
    'predicted_price': preds_full,
    'residual': residuals,
    'z_score': z_scores,
    'outlier_zscore': outlier_zscore,
    'outlier_iso': outlier_iso
})

# Final decision: flagged if identified by either method
outlier_df['is_outlier'] = outlier_df['outlier_zscore'] | outlier_df['outlier_iso']

print(f"Total outliers detected by Z-Score: {outlier_df['outlier_zscore'].sum()}")
print(f"Total outliers detected by Isolation Forest: {outlier_df['outlier_iso'].sum()}")
print(f"Total combined outliers: {outlier_df['is_outlier'].sum()}")

# 5. Plotly chart of outliers by car model
outlier_counts = outlier_df[outlier_df['is_outlier']]['model'].value_counts().reset_index()
outlier_counts.columns = ['model', 'outlier_count']
outlier_counts = outlier_counts.head(20) # Get top 20 models

fig = px.bar(
    outlier_counts,
    x='outlier_count',
    y='model',
    orientation='h',
    title='Top 20 Car Models Most Frequently Flagged as Outliers',
    labels={'outlier_count': 'Number of Outliers', 'model': 'Car Model'},
    color='outlier_count',
    color_continuous_scale='Reds'
)
fig.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig.show()

# Display a sample of the outliers
display(outlier_df[outlier_df['is_outlier']].head())
Total outliers detected by Z-Score: 197
Total outliers detected by Isolation Forest: 696
Total combined outliers: 843
brand series model price predicted_price residual z_score outlier_zscore outlier_iso is_outlier
7 audi A3 A3 Hatchback 1.6 Attraction 475000.0 589385.141417 -114385.141417 0.718707 False True True
11 audi A3 A3 Sportback 1.6 FSI 490000.0 463285.414422 26714.585578 0.158544 False True True
12 audi A3 A3 Sportback 1.6 Attraction 499999.0 535147.887318 -35148.887318 0.226076 False True True
13 audi A3 A3 Sportback 1.6 Ambition 445000.0 467605.476146 -22605.476146 0.148091 False True True
18 audi A3 A3 Sportback 1.6 Ambiente 475000.0 505031.477057 -30031.477057 0.194260 False True True
In [36]:
display(
    outlier_df[outlier_df['is_outlier']]
    .sort_values(by="residual", key=lambda x: x.abs(), ascending=False)
    .head()
)
brand series model price predicted_price residual z_score outlier_zscore outlier_iso is_outlier
10897 bmw 6 Serisi 640i 6150000.0 3.712775e+06 2.437225e+06 15.145276 True True True
8853 bmw 7 Serisi 730d xDrive M Sport 1950000.0 3.619131e+06 -1.669131e+06 10.384940 True True True
10948 bmw M Serisi M3 5999000.0 4.534616e+06 1.464384e+06 9.096882 True True True
4345 audi A5 A5 Sportback 40 TDI Quattro S Line 5330000.0 3.921290e+06 1.408710e+06 8.750743 True False True
10931 bmw 4 Serisi 420i Edition M Sport 5900000.0 4.584921e+06 1.315079e+06 8.168619 True False True
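One caveat worth noting: the mean/std z-score used above is itself inflated by the extreme residuals it is meant to flag, which can hide moderate outliers. A median/MAD ("robust") z-score is a common alternative; below is a self-contained sketch on illustrative residual values (not taken from the dataset), using only the standard library:

```python
import statistics

def robust_zscores(values):
    """Z-scores using median and MAD, scaled by 1.4826 for consistency
    with the standard deviation under normality."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]

residuals = [-120, -80, -10, 0, 15, 60, 90, 2_400_000]  # one extreme outlier
z = robust_zscores(residuals)
flagged = [i for i, zi in enumerate(z) if abs(zi) > 3]
print(flagged)  # -> [7], only the extreme residual is flagged
```

Swapping `stats.zscore` for a robust variant in the cell above would be a one-line change if the z-score flags look too conservative.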

Phase 6: LOFO Feature Selection & Hierarchy Ablation Study¶

Subtask:¶

Implement a custom LOFO function for feature selection. Then, perform an ablation study training separate models using only 'brand', only 'series', and only 'model' (as a text feature) to compare their individual impacts on predicting car prices.

In [37]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error
import time

# 1. Clean dataset by dropping outliers identified in Phase 5
X_clean = X[~outlier_df['is_outlier']].copy()
y_clean = y[~outlier_df['is_outlier']].copy()
X_clean = X_clean.drop(columns=["gb_color","kb_trade_available", 'kb_seller_type',
       'kb_fuel_cons_avg', 'kb_fuel_tank', 'gb_warranty_status', 'gb_segment', 'torque_nm', 'cylinder_count', 'max_speed_kmh',
       'accel_0_100', 'city_fuel_cons', 'highway_fuel_cons', 'length_mm',
       'width_mm', 'height_mm', 'weight_kg', 'curb_weight_kg',
       'trunk_capacity_lt','wheelbase_mm', 'tramer_fee']).copy()
# Create a standard train/val split for other experiments
X_train_clean, X_val_clean, y_train_clean, y_val_clean = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=42
)

def custom_lofo_importance(X, y, features_to_test, cat_features, text_features, gpu_params, n_splits=3):
    """
    Calculate Leave-One-Feature-Out (LOFO) importance using Cross-Validation.

    This function iteratively removes one feature at a time from the dataset,
    trains a CatBoost model using K-Fold CV, and calculates the mean RMSE.
    The LOFO importance is the difference between the CV RMSE without the feature
    and the baseline CV RMSE (with all features). A positive value indicates that
    removing the feature worsens the model, meaning the feature is important.

    Parameters:
    -----------
    X : pd.DataFrame
        Features dataset.
    y : pd.Series
        Target variable.
    features_to_test : list
        List of specific features to evaluate using LOFO.
    cat_features : list
        List of categorical feature names.
    text_features : list
        List of text feature names.
    gpu_params : dict
        Dictionary containing CatBoost GPU configuration.
    n_splits : int
        Number of cross-validation folds.

    Returns:
    --------
    pd.DataFrame
        A DataFrame containing the features and their calculated LOFO importance scores,
        sorted in descending order of importance.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    print(f"Calculating Baseline {n_splits}-Fold CV RMSE with all features...")

    # Get index positions for the baseline
    all_feats = list(X.columns)
    c_idx_base = [all_feats.index(f) for f in cat_features if f in all_feats]
    t_idx_base = [all_feats.index(f) for f in text_features if f in all_feats]

    base_rmses = []
    for train_idx, val_idx in kf.split(X, y):
        X_tr, X_va = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]

        base_train_pool = Pool(X_tr, y_tr, cat_features=c_idx_base, text_features=t_idx_base)
        base_val_pool = Pool(X_va, y_va, cat_features=c_idx_base, text_features=t_idx_base)

        base_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
                                       loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
        base_model.fit(base_train_pool, eval_set=base_val_pool)
        base_rmses.append(np.sqrt(mean_squared_error(y_va, base_model.predict(base_val_pool))))

    base_rmse = np.mean(base_rmses)
    print(f"Baseline CV RMSE: {base_rmse:.4f}\n")

    lofo_scores = []

    for i, feature in enumerate(features_to_test):
        start_time = time.time()

        # Features without the current one
        lofo_feats = [f for f in all_feats if f != feature]
        lofo_c_idx = [lofo_feats.index(f) for f in cat_features if f in lofo_feats]
        lofo_t_idx = [lofo_feats.index(f) for f in text_features if f in lofo_feats]

        feature_rmses = []
        for train_idx, val_idx in kf.split(X, y):
            X_tr, X_va = X.iloc[train_idx][lofo_feats], X.iloc[val_idx][lofo_feats]
            y_tr, y_va = y.iloc[train_idx], y.iloc[val_idx]

            lofo_train_pool = Pool(X_tr, y_tr, cat_features=lofo_c_idx, text_features=lofo_t_idx)
            lofo_val_pool = Pool(X_va, y_va, cat_features=lofo_c_idx, text_features=lofo_t_idx)

            lofo_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
                                           loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
            lofo_model.fit(lofo_train_pool, eval_set=lofo_val_pool)
            feature_rmses.append(np.sqrt(mean_squared_error(y_va, lofo_model.predict(lofo_val_pool))))

        lofo_rmse = np.mean(feature_rmses)
        importance = lofo_rmse - base_rmse

        elapsed = time.time() - start_time
        print(f"Dropped '{feature}': CV RMSE = {lofo_rmse:.4f} | Importance = {importance:+.4f} ({elapsed:.1f}s)")
        lofo_scores.append({'Feature': feature, 'LOFO_Importance': importance})

    return pd.DataFrame(lofo_scores).sort_values('LOFO_Importance', ascending=False)

# 2. Keep the baseline CatBoost feature importances for reference only.
#    LOFO below is run on *all* remaining features of the cleaned dataset,
#    not a pre-filtered top-k subset.
fi = baseline_full.get_feature_importance()
fi_df = pd.DataFrame({'feature': X.columns, 'importance': fi}).sort_values('importance', ascending=False)

print("--- Starting Custom LOFO Feature Selection (with 5-Fold CV) ---")
lofo_results = custom_lofo_importance(
    X_clean, y_clean,
    features_to_test=X_clean.columns,
    cat_features=cat_features,
    text_features=text_features,
    gpu_params=gpu_params,
    n_splits=5
)

# Plot LOFO Results
fig_lofo = px.bar(
    lofo_results, x='LOFO_Importance', y='Feature', orientation='h',
    title='LOFO Feature Importance (Positive = Important)',
    color='LOFO_Importance', color_continuous_scale='RdYlGn'
)
fig_lofo.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig_lofo.show()
--- Starting Custom LOFO Feature Selection (with 5-Fold CV) ---
Calculating Baseline 5-Fold CV RMSE with all features...
Baseline CV RMSE: 142418.6554

Dropped 'brand': CV RMSE = 142944.7356 | Importance = +526.0802 (7.9s)
Dropped 'series': CV RMSE = 147722.1699 | Importance = +5303.5145 (7.3s)
Dropped 'model': CV RMSE = 155501.2192 | Importance = +13082.5638 (13.4s)
Dropped 'engine_cc_val': CV RMSE = 143325.1902 | Importance = +906.5347 (7.9s)
Dropped 'power_hp_val': CV RMSE = 148070.7937 | Importance = +5652.1382 (7.8s)
Dropped 'gb_year': CV RMSE = 187706.7761 | Importance = +45288.1207 (7.6s)
Dropped 'gb_mileage': CV RMSE = 200604.2567 | Importance = +58185.6013 (8.1s)
Dropped 'gb_transmission': CV RMSE = 142493.4145 | Importance = +74.7591 (7.7s)
Dropped 'gb_fuel': CV RMSE = 142309.2237 | Importance = -109.4318 (7.8s)
Dropped 'kb_body_type': CV RMSE = 142991.0902 | Importance = +572.4348 (7.6s)
Dropped 'kb_drivetrain': CV RMSE = 142801.6870 | Importance = +383.0316 (7.6s)
Dropped 'gb_mtv_yearly': CV RMSE = 143178.4193 | Importance = +759.7639 (7.7s)
Dropped 'is_heavy_damaged': CV RMSE = 148924.1099 | Importance = +6505.4545 (7.9s)
Dropped 'count_changed': CV RMSE = 144209.8375 | Importance = +1791.1821 (7.8s)
Dropped 'count_painted': CV RMSE = 142550.1559 | Importance = +131.5004 (7.9s)
Dropped 'count_local_painted': CV RMSE = 142355.8654 | Importance = -62.7900 (7.8s)
Dropped 'expert_risk_score': CV RMSE = 142072.9914 | Importance = -345.6640 (7.7s)
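A natural next step, sketched below under the convention used by `custom_lofo_importance` (positive importance = removing the feature hurts): collect the features whose removal actually lowered CV RMSE as drop candidates. The importance values are copied from the run output; the `lofo` dict here is a hand-built excerpt, not the full result frame.

```python
# Excerpt of LOFO importances copied from the output above (not exhaustive).
lofo = {
    "gb_fuel": -109.4318,
    "count_local_painted": -62.7900,
    "expert_risk_score": -345.6640,
    "gb_transmission": +74.7591,
    "gb_mileage": +58185.6013,
}
# Negative importance => dropping the feature improved CV RMSE.
drop_candidates = sorted(f for f, imp in lofo.items() if imp < 0)
print(drop_candidates)
```

With the full `lofo_results` DataFrame this is simply `lofo_results.loc[lofo_results['LOFO_Importance'] < 0, 'Feature']`; note that importances near zero (well within fold noise) are weaker evidence than the large positive scores of `gb_year` and `gb_mileage`.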
In [38]:
import seaborn as sns
import matplotlib.pyplot as plt

# Sort by importance before plotting (mirrors Plotly's 'total ascending' ordering)
lofo_results_sorted = lofo_results.sort_values('LOFO_Importance', ascending=False)

# Set figure size and style
plt.figure(figsize=(10, 8))
sns.set_theme(style="whitegrid")

# Build the barplot; 'palette' approximates the RdYlGn scale used in the Plotly chart
ax = sns.barplot(
    data=lofo_results_sorted,
    x='LOFO_Importance',
    y='Feature',
    hue='LOFO_Importance',      # colour bars by importance
    palette='RdYlGn',           # red-yellow-green scale
    legend=False
)

# Title and axis labels
plt.title('LOFO Feature Importance (Positive = Important)', fontsize=14)
plt.xlabel('LOFO Importance')
plt.ylabel('Feature')

# Remove spines to mimic the plotly_white theme
sns.despine(left=True, bottom=True)

plt.show()
[Figure: seaborn horizontal barplot of LOFO feature importance]
In [39]:
import itertools
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool

print("--- Hierarchy Ablation Study: Combinations & CV ---")

# Define the base hierarchical features
base_features = ['brand', 'series', 'model']
hierarchy_feature_sets = []

# Generate all combinations of 1, 2, and 3 features
for r in range(1, len(base_features) + 1):
    hierarchy_feature_sets.extend([list(c) for c in itertools.combinations(base_features, r)])

ablation_results = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for feats in hierarchy_feature_sets:
    comb_name = " + ".join(feats)
    print(f"Evaluating: {comb_name}...")

    cv_rmses = []
    for train_idx, val_idx in kf.split(X_clean, y_clean):
        # Subset to only the selected features for this combination
        X_tr, X_va = X_clean.iloc[train_idx][feats], X_clean.iloc[val_idx][feats]
        y_tr, y_va = y_clean.iloc[train_idx], y_clean.iloc[val_idx]

        # Get dynamic indices for categorical and text features based on current subset
        c_idx = [feats.index(f) for f in cat_features if f in feats]
        t_idx = [feats.index(f) for f in text_features if f in feats]

        ab_train_pool = Pool(X_tr, y_tr, cat_features=c_idx, text_features=t_idx)
        ab_val_pool = Pool(X_va, y_va, cat_features=c_idx, text_features=t_idx)

        # Train CatBoost model
        ab_model = CatBoostRegressor(iterations=300, learning_rate=0.08, depth=6,
                                     loss_function='RMSE', verbose=False, random_seed=42, **gpu_params)
        ab_model.fit(ab_train_pool, eval_set=ab_val_pool)

        fold_rmse = np.sqrt(mean_squared_error(y_va, ab_model.predict(ab_val_pool)))
        cv_rmses.append(fold_rmse)

    mean_rmse = np.mean(cv_rmses)
    ablation_results.append({'Feature_Combination': comb_name, 'CV_RMSE': mean_rmse})

# Create DataFrame and display
ablation_df = pd.DataFrame(ablation_results).sort_values('CV_RMSE', ascending=False)
display(ablation_df)

# Visualize ablation combinations
fig_ablation = px.bar(
    ablation_df, x='CV_RMSE', y='Feature_Combination', orientation='h',
    title='Ablation Study: 5-Fold CV RMSE by Hierarchical Feature Combinations (Lower is Better)',
    color='CV_RMSE', color_continuous_scale='Blues_r', text_auto='.0f'
)
fig_ablation.update_layout(yaxis={'categoryorder':'total ascending'}, template="plotly_white")
fig_ablation.show()
--- Hierarchy Ablation Study: Combinations & CV ---
Evaluating: brand...
Evaluating: series...
Evaluating: model...
Evaluating: brand + series...
Evaluating: brand + model...
Evaluating: series + model...
Evaluating: brand + series + model...
Feature_Combination CV_RMSE
0 brand 1.042775e+06
1 series 9.451142e+05
3 brand + series 9.450829e+05
2 model 4.512768e+05
4 brand + model 4.503697e+05
6 brand + series + model 4.459154e+05
5 series + model 4.455466e+05
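A hedged reading of the table: the `model` text feature alone recovers most of the hierarchy's signal, while `brand` and `series` add little on top of it. The sketch below quantifies this as relative RMSE reduction versus the brand-only baseline, using values rounded from the ablation output above:

```python
# CV RMSEs rounded from the ablation table above.
cv_rmse = {
    "brand": 1_042_775,
    "series": 945_114,
    "model": 451_277,
    "brand + series + model": 445_915,
}
base = cv_rmse["brand"]
# Percentage reduction in CV RMSE relative to brand-only.
reduction = {k: round(100 * (base - v) / base, 1) for k, v in cv_rmse.items()}
print(reduction)
```

The ~57% reduction from `model` alone, versus ~57% for the full triple, is why the final model keeps `model` as a text feature rather than relying on the coarser hierarchy levels.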
In [40]:
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Sort so the best (lowest RMSE) combination appears first
#    (mirrors Plotly's 'total ascending' ordering)
ablation_df_sorted = ablation_df.sort_values('CV_RMSE', ascending=True)

# 2. Figure size and style
plt.figure(figsize=(12, 8))
sns.set_theme(style="whitegrid")

# 3. Build the barplot; 'Blues_r' matches the colour scale of the Plotly chart
#    (use 'Blues' to reverse: low values light, high values dark)
ax = sns.barplot(
    data=ablation_df_sorted,
    x='CV_RMSE',
    y='Feature_Combination',
    hue='CV_RMSE',           # colour bars by RMSE
    palette='Blues_r',
    legend=False
)

# 4. Annotate bars with their values (equivalent of Plotly's text_auto='.0f')
ax.bar_label(ax.containers[0], fmt='%.0f', padding=3)

# 5. Title and axis labels
plt.title('Ablation Study: 5-Fold CV RMSE by Hierarchical Feature Combinations (Lower is Better)', fontsize=14)
plt.xlabel('CV RMSE')
plt.ylabel('Feature Combination')

# Remove spines to mimic the plotly_white theme
sns.despine(left=True, bottom=True)

plt.tight_layout()
plt.show()
[Figure: seaborn horizontal barplot of ablation study CV RMSE by feature combination]

Phase 7: Final Refined Model¶

Subtask:¶

Train the final CatBoost model on the outlier-cleaned, cross-validated dataset using the optimal features. Output comparison metrics and save the model artifacts.

In [41]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor, Pool

# 1. Use the outlier-cleaned dataset from Phase 6 (X_clean, y_clean) directly
# y_clean_log = np.log1p(y_clean)  # log-target transform tried earlier, then removed

# Set up K-Fold Cross-Validation for the final evaluation
kf_final = KFold(n_splits=5, shuffle=True, random_state=42)
final_cv_scores = []
X_clean = X_clean.drop(columns=["gb_fuel", "gb_transmission"])
cat_features = [col for col in X_clean.select_dtypes(include=['object', 'category', 'bool']).columns if col not in text_features]
cat_indices = [X_clean.columns.get_loc(col) for col in cat_features]
# Recompute text indices too: positions from the original X are stale after the column drops
text_indices = [X_clean.columns.get_loc(col) for col in text_features]
print("--- Training Refined Final Model with 5-Fold CV (GPU) ---")
for fold, (train_idx, val_idx) in enumerate(kf_final.split(X_clean, y_clean)):
    X_tr, X_va = X_clean.iloc[train_idx], X_clean.iloc[val_idx]
    y_tr, y_va = y_clean.iloc[train_idx], y_clean.iloc[val_idx] # Use original y_clean

    # Pools
    tr_pool = Pool(data=X_tr, label=y_tr, cat_features=cat_indices, text_features=text_indices)
    va_pool = Pool(data=X_va, label=y_va, cat_features=cat_indices, text_features=text_indices)

    # Initialize and fit CatBoostRegressor
    model_cv = CatBoostRegressor(
        iterations=1500,
        learning_rate=0.05,
        depth=8,  # Slightly deeper for the final model to capture complex interactions
        loss_function='RMSE',
        eval_metric='RMSE',
        early_stopping_rounds=150,
        verbose=False,
        random_seed=42,
        **gpu_params
    )

    model_cv.fit(tr_pool, eval_set=va_pool)

    # Evaluate on validation set in original price scale for interpretable RMSE
    preds_price = model_cv.predict(va_pool) # Direct prediction, no expm1 needed
    y_va_price = y_va # Use original y_va

    rmse_price = np.sqrt(mean_squared_error(y_va_price, preds_price))
    final_cv_scores.append(rmse_price)
    print(f"Fold {fold + 1} RMSE (Price): {rmse_price:,.0f} TL")

print(f"\nFinal Cleaned CV RMSE: {np.mean(final_cv_scores):,.0f} TL (±{np.std(final_cv_scores):,.0f} TL)")

# 2. Train the Final Production Model on the entire clean dataset
final_full_pool = Pool(data=X_clean, label=y_clean, cat_features=cat_indices, text_features=text_indices) # Use original y_clean

final_model = CatBoostRegressor(
    iterations=1500,
    learning_rate=0.05,
    depth=8,
    loss_function='RMSE',
    verbose=250,
    random_seed=42,
    **gpu_params
)

print("\n--- Training Final Production Model on Full Clean Dataset ---")
final_model.fit(final_full_pool)

# 3. Save the model artifacts
model_filename = 'catboost_price_predictor_final.cbm'
final_model.save_model(model_filename)
print(f"\nSuccess: Final model saved as '{model_filename}'")
--- Training Refined Final Model with 5-Fold CV (GPU) ---
Fold 1 RMSE (Price): 142,988 TL
Fold 2 RMSE (Price): 137,344 TL
Fold 3 RMSE (Price): 138,810 TL
Fold 4 RMSE (Price): 137,884 TL
Fold 5 RMSE (Price): 135,478 TL

Final Cleaned CV RMSE: 138,501 TL (±2,493 TL)

--- Training Final Production Model on Full Clean Dataset ---
0:	learn: 999465.6109703	total: 18.2ms	remaining: 27.3s
250:	learn: 148304.0677714	total: 4.29s	remaining: 21.3s
500:	learn: 137700.4829686	total: 8.37s	remaining: 16.7s
750:	learn: 133015.6479159	total: 12.8s	remaining: 12.7s
1000:	learn: 130233.8410758	total: 17.2s	remaining: 8.58s
1250:	learn: 127973.3492347	total: 21.5s	remaining: 4.29s
1499:	learn: 126393.8429959	total: 25.7s	remaining: 0us

Success: Final model saved as 'catboost_price_predictor_final.cbm'
In [42]:
from sklearn.metrics import r2_score, mean_absolute_error

# 1. Metrics for the last Validation Fold (Out-of-sample performance)
fold_mae = mean_absolute_error(y_va_price, preds_price)
fold_r2 = r2_score(y_va_price, preds_price)

print("--- Validation Metrics (Last CV Fold - Out-of-Sample) ---")
print(f"R² Score : {fold_r2:.4f}")
print(f"MAE      : {fold_mae:,.0f} TL\n")

# 2. Metrics for the Final Production Model (In-sample performance on full clean data)
preds_price_full = final_model.predict(final_full_pool) # Predictions are already in price scale
y_true_price = y_clean # Use original y_clean, no log transformation

full_mae = mean_absolute_error(y_true_price, preds_price_full)
full_r2 = r2_score(y_true_price, preds_price_full)

print("--- Final Model Metrics (Full Clean Dataset - In-Sample) ---")
print(f"R² Score : {full_r2:.4f}")
print(f"MAE      : {full_mae:,.0f} TL")
--- Validation Metrics (Last CV Fold - Out-of-Sample) ---
R² Score : 0.9825
MAE      : 103,333 TL

--- Final Model Metrics (Full Clean Dataset - In-Sample) ---
R² Score : 0.9856
MAE      : 94,797 TL
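As a closing sanity check, the headline CV RMSE fell from roughly 190.7k TL (Phase 4 baseline on the full data) to 138.5k TL here. The figures are not strictly comparable, since removing ~843 outliers also makes the validation folds easier, but the apparent improvement can still be quantified as a rough indicator (values copied from the outputs above):

```python
# CV RMSEs copied from the Phase 4 and Phase 7 outputs above (TL).
baseline_rmse = 190_653  # best 2x2 combination, full data
final_rmse = 138_501     # final model, outlier-cleaned data
improvement_pct = round(100 * (baseline_rmse - final_rmse) / baseline_rmse, 1)
print(f"Apparent RMSE improvement: {improvement_pct}%  (caveat: different eval sets)")
```

A cleaner comparison would score both models on the same held-out set that includes the flagged outliers; the sketch above only summarizes the numbers already reported.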