Post 15: Your First Property Prediction Pipeline

Module 4 · Practical Tools

Bring everything together: fetch data from the Materials Project, engineer physics-informed features, train and compare multiple ML models, evaluate with cross-validation, and export a ready-to-use predictor — all in one reproducible Python pipeline for transition-metal chalcogenides.

Abderrahmane REGGAD · June 20, 2026

Module 4

🔧

Task

End-to-end ML pipeline

⚛️

Target

Band gap + magnetic ordering

🧪

Dataset

MₓCᵧ from Materials Project

🏆

Output

Trained model + SHAP analysis

Every post in this course has built one piece of the puzzle: data fetching (Post 14), feature engineering (Posts 3–4), regression (Post 5), classification (Post 6), ensemble methods (Post 7), and neural networks (Posts 9–10). This final post of Module 4 assembles all of these into a single reproducible pipeline that goes from a Materials Project query to a trained, evaluated, and interpretable ML model — the kind of pipeline you would actually use in a research publication.

🔧

What we will build

A six-stage pipeline for MₓCᵧ transition-metal chalcogenides: Fetch → Clean → Engineer features → Train & compare → Evaluate → Interpret with SHAP. Two targets: band gap regression (eV) and magnetic ordering classification (NM / AFM / FM). All code is copy-paste ready.

Pipeline Architecture

🗄️
1. Fetch
mp-api

→

🧹

2. Clean

pandas

→

⚗️

3. Features

physics rules

→

🤖

4. Train

sklearn

→

📊

5. Evaluate

CV + metrics

→

🔍

6. Interpret

SHAP

Stage 1 — Fetch Data from the Materials Project

from mp_api.client import MPRester

import pandas as pd, numpy as np

CHEMSYS = [

  f"{m}-{c}" for m in ["Fe","Ni","Co","Mn","Cr","Ti","V","Cu"]

  for c  in ["S", "Se", "Te"]

]

with MPRester() as mpr:  # reads MP_API_KEY env var

    docs = mpr.materials.summary.search(

        chemsys=CHEMSYS,

        energy_above_hull=(0, 0.1),

        fields=["material_id", "formula_pretty",

                "band_gap", "ordering", "total_magnetization",

                "energy_above_hull", "crystal_system",

                "spacegroup_number", "volume",

                "nsites", "density"]

    )

df_raw = pd.DataFrame([{

    'mpid':       d.material_id,

    'formula':    d.formula_pretty,

    'band_gap':   d.band_gap,

    'ordering':   str(d.ordering),

    'mag_total':  d.total_magnetization,

    'e_hull':     d.energy_above_hull,

    'crystal_sys':d.crystal_system,

    'spacegroup': d.spacegroup_number,

    'volume':     d.volume,

    'nsites':     d.nsites,

    'density':    d.density,

} for d in docs])

print(f"Raw: {len(df_raw)} entries")

Stage 2 — Clean and Label

from pymatgen.core import Composition, Element

def extract_features_from_formula(f):

    """Extract metal element, chalcogen, and stoichiometry."""

    comp = Composition(f)

    chalcogens = {'S','Se','Te'}

    metals = {el.symbol:float(amt) for el,amt in comp.items()

              if el.symbol not in chalcogens}

    chalcs = {el.symbol:float(amt) for el,amt in comp.items()

              if el.symbol in chalcogens}

    if not metals or not chalcs: return None

    M = max(metals, key=metals.get)

    C = max(chalcs, key=chalcs.get)

    return pd.Series({'metal':M,'chalcogen':C,

                      'M_amt':metals[M],'C_amt':chalcs[C]})

# Apply and merge

extra = df_raw['formula'].apply(extract_features_from_formula)

df = pd.concat([df_raw, extra], axis=1).dropna(subset=['metal'])

# Clean ordering labels

df['ordering'] = df['ordering'].str.upper().str.replace('ORDERING.','')

df = df[df['ordering'].isin(['FM','AFM','NM'])]

# Dedup — keep lowest E-hull per formula

df = df.sort_values('e_hull').drop_duplicates('formula',keep='first')

print(f"Clean: {len(df)} entries | ordering: {df['ordering'].value_counts().to_dict()}")

Stage 3 — Physics-Informed Feature Engineering

Raw database fields are good starting features, but the most powerful descriptors encode domain knowledge: electronegativity differences drive ionic/covalent character, d-electron count governs crystal-field splitting and Hund's coupling, and the chalcogen period controls covalency (S < Se < Te).

Feature	Physics motivation	Source
ΔEN	Electronegativity difference → ionic/covalent character → gap size	Pauling scale (pymatgen)
n_d	d-electron count → crystal field, Hund's rule, Stoner criterion	Element data
Chalc. period	S(3p) < Se(4p) < Te(5p): increasing covalency, smaller gaps	Periodic table
M/C ratio	Stoichiometry controls band filling and bonding geometry	Formula parsing
Volume/site	Proxy for bond length → bandwidth → gap vs metal character	MP
Spacegroup	Crystal symmetry encodes coordination environment	MP
U/W proxy	n_d(5−n_d)/ΔEN ≈ Mott criterion: large → correlated insulator	Derived
GKA flag	180° M-C-M super-exchange → AFM if half-filled d-shell	Derived (n_d=5)

from pymatgen.core import Element

# Periodic table lookups

CHALC_PERIOD = {'S':3, 'Se':4, 'Te':5}

D_ELECTRONS  = {'Ti':2,'V':3,'Cr':4,'Mn':5,

               'Fe':6,'Co':7,'Ni':8,'Cu':9}

def engineer_features(row):

    M, C = row['metal'], row['chalcogen']

    eM = Element(M).X  # Pauling electronegativity

    eC = Element(C).X

    nd = D_ELECTRONS.get(M, 0)

    return {

        'delta_EN':      eC - eM,

        'n_d':           nd,

        'chalc_period':  CHALC_PERIOD[C],

        'MC_ratio':      row['M_amt'] / (row['C_amt'] + 1e-6),

        'vol_per_site':  row['volume'] / (row['nsites'] + 1e-6),

        'spacegroup':    row['spacegroup'],

        'e_hull':        row['e_hull'],

        'density':       row['density'],

        # Physics-derived descriptors

        'UW_proxy':      nd * (10 - nd) / (eC - eM + 0.1),  # Mott criterion proxy

        'GKA_flag':      1 if nd == 5 else 0,               # half-filled → AFM

        'hund_coupling': min(nd, 10-nd),                     # Hund's rule energy

    }

feat_df = df.apply(engineer_features, axis=1, result_type='expand')

df = pd.concat([df, feat_df], axis=1)

FEATURES = ['delta_EN','n_d','chalc_period','MC_ratio',

            'vol_per_site','spacegroup','e_hull','density',

            'UW_proxy','GKA_flag','hund_coupling']

Stage 4 — Train and Compare Multiple Models

A good pipeline never commits to a single model. We train five algorithms simultaneously and let cross-validation decide.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.linear_model import Ridge

from sklearn.svm import SVR

from sklearn.neural_network import MLPRegressor

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import cross_val_score, KFold

from sklearn.metrics import mean_absolute_error, r2_score

X = df[FEATURES].values

y_reg = df['band_gap'].values    # regression target

y_cls = df['ordering'].values    # classification target

MODELS = {

    'Ridge':  make_pipeline(StandardScaler(), Ridge(alpha=1.0)),

    'RF':     RandomForestRegressor(n_estimators=200, random_state=42),

    'GBM':    GradientBoostingRegressor(n_estimators=200, random_state=42),

    'SVR':    make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10)),

    'MLP':    make_pipeline(StandardScaler(),

                          MLPRegressor(hidden_layer_sizes=(64,32),

                                      max_iter=1000, random_state=42)),

}

cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}

for name, model in MODELS.items():

    scores = cross_val_score(model, X, y_reg, cv=cv,

                            scoring='neg_mean_absolute_error')

    results[name] = -scores.mean()

    print(f"{name:8s}  CV-MAE = {-scores.mean():.3f} ± {scores.std():.3f} eV")

Stage 5 — Evaluate the Best Model

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Pick best model from CV results

best_name = min(results, key=results.get)

best_model = MODELS[best_name]

print(f"Best model: {best_name} (MAE = {results[best_name]:.3f} eV)")

# Final train/test split for reporting

X_tr, X_te, y_tr, y_te = train_test_split(

    X, y_reg, test_size=0.2, random_state=42)

best_model.fit(X_tr, y_tr)

y_pred = best_model.predict(X_te)

print(f"Test MAE : {mean_absolute_error(y_te, y_pred):.3f} eV")

print(f"Test R²  : {r2_score(y_te, y_pred):.3f}")

# Parity plot

plt.figure(figsize=(5,5))

plt.scatter(y_te, y_pred, alpha=0.7)

plt.plot([0,max(y_te)],[0,max(y_te)],'--r')

plt.xlabel('DFT Band Gap (eV)'); plt.ylabel('ML Predicted (eV)')

plt.title(f'{best_name} — Test set parity plot')

plt.tight_layout(); plt.savefig('parity_plot.png', dpi=150)

Stage 6 — Interpret with SHAP

A prediction is only publishable if you can explain why the model made it. SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for each prediction — grounded in cooperative game theory.

import shap

# Fit final model on all data before SHAP

best_model.fit(X, y_reg)

# TreeExplainer works for RF and GBM; use KernelExplainer for SVR/MLP

if best_name in ['RF', 'GBM']:

    explainer = shap.TreeExplainer(best_model)

    shap_values = explainer.shap_values(X)

else:

    X_bg = shap.sample(X, 50)  # background dataset for KernelExplainer

    explainer = shap.KernelExplainer(best_model.predict, X_bg)

    shap_values = explainer.shap_values(X)

# Summary plot — feature importance + direction

shap.summary_plot(shap_values, X, feature_names=FEATURES,

                  plot_type='dot', show=False)

plt.tight_layout(); plt.savefig('shap_summary.png', dpi=150)

# Force plot for a single prediction (e.g. compound index 0)

shap.initjs()

shap.force_plot(explainer.expected_value, shap_values[0],

                X[0], feature_names=FEATURES)

🔬

What to look for in the SHAP summary

For transition-metal chalcogenides, a physically well-behaved model should show: ΔEN as the top feature (ionic character → gap), n_d second (d-band filling → metal/insulator boundary), chalc_period third (S<Se<Te covalency trend), and UW_proxy positively correlated with band gap size (Mott criterion). If spacegroup or density dominate, your model may be learning spurious structural correlations rather than electronic physics.

7. Saving and Deploying the Pipeline

import joblib, json, datetime

# Save the complete pipeline (preprocessor + model)

joblib.dump(best_model, f"MxCy_pipeline_{best_name}_v1.joblib")

# Save metadata alongside the model

meta = {

    'model':        best_name,

    'features':     FEATURES,

    'target':       'band_gap_eV',

    'cv_mae_eV':    round(results[best_name], 4),

    'n_train':      len(df),

    'chemsys':      'MxCy (M=Fe,Ni,Co,Mn,Cr,Ti,V,Cu; C=S,Se,Te)',

    'date':         datetime.date.today().isoformat(),

}

json.dump(meta, open('MxCy_pipeline_meta.json','w'), indent=2)

# Load and predict on a new compound

pipeline = joblib.load(f"MxCy_pipeline_{best_name}_v1.joblib")

new_compound = np.array([[1.10, 6, 4, 0.5, 18.3, 194, 0.01, 5.2, 24, 0, 4]])

print(f"Predicted band gap: {pipeline.predict(new_compound)[0]:.3f} eV")

8. Common Pitfalls and How to Avoid Them

Pitfall	Symptom	Fix
Data leakage	Test R² = 0.99, new compound prediction terrible	Never fit scalers or imputers on test data; use pipelines
Wrong CV strategy	CV score looks good but real-world performance poor	Use `GroupKFold` by metal element to test genuine generalisation
PBE gap = 0 ≠ metal	Many correlated insulators labelled as metals	Use DFT+U values from MP for 3d transition metals; flag band_gap < 0.1 as "uncertain"
Class imbalance	Ordering classifier always predicts NM (majority)	Use `class_weight='balanced'` or SMOTE oversampling
Feature leakage	Including band_gap as a feature when predicting ordering	Carefully separate features from targets; use only composition-based descriptors for true prediction

Quick Check

1. You achieve R² = 0.98 on your test set but only 0.61 on new compounds from a different database. What is the most likely cause?

A. The model has too few parameters
B. Data leakage — the test set was not truly independent, or the scaler was fitted on all data including the test set
C. The learning rate was too high
D. The SHAP values were computed incorrectly

2. Your SHAP summary shows "spacegroup_number" as the most important feature for band gap prediction. What should concern you?

A. Nothing — space group is a valid physical descriptor
B. The model may be learning a spurious correlation specific to your training set. Space group encodes crystal symmetry but not the electronic mechanism behind the gap — ΔEN and n_d should dominate for a physically meaningful model
C. SHAP cannot handle categorical features like space group
D. The model is underfitting because space group has too many values

3. For a magnetic ordering classifier (NM/AFM/FM) trained on MₓCᵧ data, why should you use GroupKFold by metal element rather than standard KFold?

A. GroupKFold is faster for large datasets
B. Standard KFold can place Fe-S and Fe-Se compounds in different folds, giving the model information about Fe behaviour during validation. GroupKFold ensures all Fe compounds are either all-train or all-test — testing true generalisation to new metal systems
C. KFold does not support classification targets
D. GroupKFold automatically balances the class distribution

Pipeline Feature Engineering Cross-validation SHAP Random Forest Band Gap MₓCᵧ joblib scikit-learn

Header Ads Widget

Last Posts

Post 15: Your First Property Prediction Pipeline — Step by Step

Pipeline Architecture

Stage 1 — Fetch Data from the Materials Project

Stage 2 — Clean and Label

Stage 3 — Physics-Informed Feature Engineering

Stage 4 — Train and Compare Multiple Models

Stage 5 — Evaluate the Best Model

Stage 6 — Interpret with SHAP

7. Saving and Deploying the Pipeline

8. Common Pitfalls and How to Avoid Them

Quick Check

About me

My page

Popular Posts

Post 1: What is Artificial Intelligence? A researcher's first look

Post 2: ML vs Traditional Simulation — Where does DFT end and ML begin?

Post 4: Types of ML — Supervised, Unsupervised & Reinforcement Learning

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Post 3: Key Mathematical Tools — Vectors, Matrices & Probability

Post 6 — Classification: Is This Material a Metal or Insulator?

Post 8: Support Vector Machines (SVMs) — with a crystal property example

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Post 14: Working with Materials Databases — ICSD, Materials Project, AFLOW

Post 12: Graph Neural Networks — ALIGNN & CGCNN Explained for Beginners

Categories

Pageviews past week

You may contact me here

Foundations

Core Algorithms

Neural Networks & Deep Learning

Menu Footer Widget