End-to-End Car Price Prediction System and MLOps Architecture
The second-hand vehicle market is one of the places where the "asymmetric information" problem is felt most acutely. The seller knows the vehicle's history; the buyer usually has to rely solely on the claims in the advertisement. To address this problem as a Data Scientist, I designed an architecture that offers a decision support system, not just a model that "predicts prices."
In this article, I will explain the journey of transforming raw data into a live MLOps system.
1. Model Selection: Why CatBoost?
Vehicle datasets are inherently dense with Categorical variables: Brand, Model, Gear Type, Fuel, Color, City...
When dealing with high-cardinality categorical features, traditional approaches face a bottleneck. One-Hot Encoding causes the dataset matrix to swell (Curse of Dimensionality).
Standard tree-based models (XGBoost, Random Forest) offer alternatives, but with significant tradeoffs:
- Label Encoding: Requires pruning low-frequency categories to avoid noise, leading to data loss.
- Target Encoding: Replaces categories with their average price. However, for rare categories, these averages are unreliable and require manual smoothing with the global mean to prevent overfitting.
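To make the Target Encoding pitfall concrete, here is a minimal sketch (illustrative only, not from the project) of the manual smoothing that CatBoost renders unnecessary: a rare category's mean price is blended with the global mean, weighted by how often the category occurs.

```python
# Illustrative sketch: smoothed target encoding. A category seen only a few
# times gets pulled toward the global mean instead of trusting its noisy average.
def smoothed_target_encode(prices_by_category, global_mean, m=10):
    """m controls how strongly rare categories are shrunk to the global mean."""
    encoding = {}
    for category, prices in prices_by_category.items():
        n = len(prices)
        cat_mean = sum(prices) / n
        # Weighted blend: large n -> trust the category mean; small n -> global mean
        encoding[category] = (n * cat_mean + m * global_mean) / (n + m)
    return encoding

# Hypothetical data: one trim level appears only once in the dataset
prices = {"CommonTrim": [900_000, 950_000, 1_000_000], "RareTrim": [2_000_000]}
all_prices = [p for ps in prices.values() for p in ps]
global_mean = sum(all_prices) / len(all_prices)
enc = smoothed_target_encode(prices, global_mean, m=10)
# "RareTrim" (n=1) is shrunk heavily toward the global mean rather than
# encoded as its own unreliable single-sample average.
```

Choosing `m` per dataset is exactly the kind of manual tuning that Ordered Target Statistics handles for you.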
I chose CatBoost because it handles these challenges natively using Ordered Target Statistics, eliminating the need for these complex preprocessing steps.
```python
# Model parameters
model = CatBoostRegressor(
    iterations=2000,
    learning_rate=0.1,
    depth=6,
    loss_function="MultiQuantile:alpha=0.05,0.5,0.95",  # This is the critical point!
    verbose=200,
)
```
Note: I used the MultiQuantile loss function instead of RMSE. Thanks to this, the model predicts the 5% (lowest), 50% (median), and 95% (highest) quantiles of the range where the price might fall, instead of giving a single price point.
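Each alpha in the MultiQuantile loss optimizes the pinball (quantile) loss. The sketch below (a stdlib illustration, not project code) shows why minimizing it at alpha = 0.95 produces an upper bound: under-prediction is punished 19x harder than over-prediction.

```python
# Pinball (quantile) loss: the objective behind each alpha in MultiQuantile.
def pinball_loss(y_true, y_pred, alpha):
    error = y_true - y_pred
    # Asymmetric penalty: alpha weighs under-prediction, (1 - alpha) over-prediction
    return alpha * error if error >= 0 else (alpha - 1) * error

# At alpha=0.95 (upper bound), under-predicting by 20 costs 0.95 * 20 = 19.0 ...
upper_under = pinball_loss(100, 80, alpha=0.95)   # 19.0
# ... while over-predicting by 20 costs only 0.05 * 20 = 1.0.
upper_over = pinball_loss(100, 120, alpha=0.95)   # 1.0
# At alpha=0.05 (lower bound) the asymmetry is reversed.
lower_over = pinball_loss(100, 120, alpha=0.05)   # 19.0
```

This asymmetry is what pushes the alpha = 0.05 and alpha = 0.95 predictions apart into a calibrated price interval.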
2. Feature Engineering: "Expert Risk Score"
Year and Mileage (KM) alone are not enough to predict the price. A painted roof does not affect the price the same way a painted bumper does.
To feed this Domain Knowledge into the model, I developed a special Risk Score Algorithm:
```python
def calculate_expert_risk_score(row):
    score = 0

    # 1. Critical parts (roof, hood, trunk)
    # The roof implies structural integrity, thus carrying the highest penalty.
    if row.get("tavan_degisen") == 1:
        score += 150
    elif row.get("tavan_boyali") == 1:
        score += 75
    elif row.get("tavan_lokal") == 1:
        score += 40

    # Hood and trunk operations
    if row.get("kaput_degisen") == 1:
        score += 60
    if row.get("bagaj_degisen") == 1:
        score += 40
    # ... (lower penalties for painting/local repairs)

    # 2. Side parts (doors)
    # Aggregating total damage across all 4 doors
    doors_changed = sum([row.get(c, 0) for c in ["door_fl", "door_fr", ...]])
    doors_painted = sum([row.get(c, 0) for c in ["door_fl_boyali", ...]])
    # Weights: replaced door = 10 pts, painted door = 5 pts
    score += (doors_changed * 10) + (doors_painted * 5)

    # 3. Fenders (aggregated the same way as the doors above)
    # Fenders have slightly less impact than doors (8 pts for replacement)
    score += (fenders_changed * 8) + (fenders_painted * 4)

    return score
```

Thanks to this score, the model can distinguish which of two vehicles with the same model/year/km is "clean" and which is "processed" (damaged/repaired).
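As a self-contained sketch of the idea (with hypothetical column names and a trimmed weight table, since the full function above is abbreviated), here is how the score separates two otherwise identical listings:

```python
# Minimal runnable sketch: structural damage is penalized far more heavily
# than cosmetic paint, so identical model/year/km listings get separated.
WEIGHTS = {
    "roof_replaced": 150,  # structural -> highest penalty
    "roof_painted": 75,
    "hood_replaced": 60,
    "door_replaced": 10,
    "door_painted": 5,
}

def risk_score(flags):
    # flags: dict of 0/1 damage indicators; missing keys count as 0
    return sum(WEIGHTS[k] * flags.get(k, 0) for k in WEIGHTS)

clean_car = {"door_painted": 1}                         # one painted door -> 5
repaired_car = {"roof_replaced": 1, "door_painted": 1}  # structural -> 155
```

The resulting scalar feeds into the model as a single feature, letting the trees split on "damage severity" directly instead of on dozens of sparse part-level flags.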
3. Explainability: Opening the "Black Box" with SHAP
Accuracy is not enough; users need to trust the prediction. A "Black Box" model that spits out a price without reasoning is useless for decision support.
To solve this, I integrated SHAP (SHapley Additive exPlanations).
- Global Interpretability: I analyzed which features drive the market generally. Unsurprisingly, Year and Engine Power are positive drivers, while Risk Score and Mileage are negative drivers.
Model explainability plays a key role in deriving market insights.
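The project uses the shap library; to illustrate what it computes, here is an exact Shapley attribution for a toy two-feature pricing rule (the model, baseline, and feature names below are invented for the example). "Absent" features are replaced by a baseline value, and each feature's credit is its average marginal contribution over all orderings.

```python
import itertools

def model(year, risk):  # hypothetical linear pricing rule
    return 1_000_000 + 20_000 * (year - 2010) - 500 * risk

BASELINE = {"year": 2010, "risk": 0}  # reference point for "absent" features

def value(subset, x):
    # Model output when only features in `subset` take their real values
    inputs = {f: (x[f] if f in subset else BASELINE[f]) for f in BASELINE}
    return model(**inputs)

def shapley(x):
    # Exact Shapley values: average marginal contribution over all orderings
    features = list(BASELINE)
    phi = {f: 0.0 for f in features}
    perms = list(itertools.permutations(features))
    for order in perms:
        seen = set()
        for f in order:
            phi[f] += value(seen | {f}, x) - value(seen, x)
            seen.add(f)
    return {f: phi[f] / len(perms) for f in features}

x = {"year": 2018, "risk": 100}
attr = shapley(x)
# Additivity: attributions sum to model(x) - model(baseline)
```

This additivity property is what makes a SHAP force plot readable: the per-feature bars stack exactly from the baseline price to the predicted price.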
4. MLOps: The Model as a Living Entity
The biggest mistake is training the model once and setting it aside. The market changes: inflation rises, or buyer preferences shift toward SUVs. This situation is called Data Drift.
The system I established in this project:
- Versioning: Every model training run (`v1`, `v2`) is stored along with `metrics.json` and the model file.
- Drift Detection: The distribution of incoming new data is compared with the training data.
- Automated Alerting: If the price distribution shows statistical deviation (KS Test p-value < 0.05), the system flags the specific drifted features.
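The core of the drift check is the two-sample KS statistic: the maximum vertical gap between the empirical CDFs of the reference and current data. In production one would convert this to a p-value (e.g. with `scipy.stats.ks_2samp`); the stdlib sketch below uses a fixed statistic threshold as a stand-in.

```python
# Two-sample Kolmogorov-Smirnov statistic, computed by hand for illustration.
def ks_statistic(ref, cur):
    ref, cur = sorted(ref), sorted(cur)

    def ecdf(sample, x):
        # Fraction of the sample at or below x
        return sum(v <= x for v in sample) / len(sample)

    # Max vertical distance between the two empirical CDFs
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)

reference_prices = [100, 110, 120, 130, 140]   # training-time distribution
current_prices = [150, 160, 170, 180, 190]     # market shifted upward
drifted = ks_statistic(reference_prices, current_prices) > 0.5  # hypothetical threshold
```

Here the two samples do not overlap at all, so the statistic hits its maximum of 1.0 and the drift flag fires; identical distributions yield 0.0.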
Monitoring via Dashboard
The MLOps Panel I developed on the Frontend side (Next.js) visualizes the distribution of the reference model vs. current data.
(The orange area shifting to the right indicates that prices in the market are rising and the model needs to be updated.)
5. Tech Stack
This project was developed with the "Full Stack Data Science" principle:
- Model Training: Python, Pandas, CatBoost, SHAP
- Backend API: FastAPI (Asynchronous architecture, Pydantic validations)
- Frontend: Next.js 16 (App Router), TypeScript, Tailwind CSS, Shadcn/UI
- Data Visualization: Interactive charts powered by Recharts via shadcn/ui (MIT licensed) and Observable Plot (ISC licensed)
- Database: PostgreSQL (Data storage)
Conclusion
This project is proof of how a machine learning model transforms from a "Jupyter Notebook" into a living, monitorable product that provides value to the real user.
You can visit the Main Page to view the project.