Header Ads Widget

AI & Machine Learning for Materials Sciences

Last Posts

10/recent/ticker-posts

Post 15: Your First Property Prediction Pipeline — Step by Step

Bring everything together: fetch data from the Materials Project, engineer physics-informed features, train and compare multiple ML models, evaluate with cross-validation, and export a ready-to-use predictor — all in one reproducible Python pipeline for transition-metal chalcogenides.

Module 4
🔧
Task

End-to-end ML pipeline

⚛️
Target

Band gap + magnetic ordering

🧪
Dataset

MₓCᵧ from Materials Project

🏆
Output

Trained model + SHAP analysis

Every post in this course has built one piece of the puzzle: data fetching (Post 14), feature engineering (Posts 3–4), regression (Post 5), classification (Post 6), ensemble methods (Post 7), and neural networks (Posts 9–10). This final post of Module 4 assembles all of these into a single reproducible pipeline that goes from a Materials Project query to a trained, evaluated, and interpretable ML model — the kind of pipeline you would actually use in a research publication.

🔧
What we will build

A six-stage pipeline for MₓCᵧ transition-metal chalcogenides: Fetch → Clean → Engineer features → Train & compare → Evaluate → Interpret with SHAP. Two targets: band gap regression (eV) and magnetic ordering classification (NM / AFM / FM). All code is copy-paste ready.

Pipeline Architecture

🗄️
1. Fetch
mp-api
🧹
2. Clean
pandas
⚗️
3. Features
physics rules
🤖
4. Train
sklearn
📊
5. Evaluate
CV + metrics
🔍
6. Interpret
SHAP

Stage 1 — Fetch Data from the Materials Project

from mp_api.client import MPRester
import pandas as pd, numpy as np

CHEMSYS = [
  f"{m}-{c}" for m in ["Fe","Ni","Co","Mn","Cr","Ti","V","Cu"]
  for c in ["S", "Se", "Te"]
]

with MPRester() as mpr: # reads MP_API_KEY env var
    docs = mpr.materials.summary.search(
        chemsys=CHEMSYS,
        energy_above_hull=(0, 0.1),
        fields=["material_id", "formula_pretty",
                "band_gap", "ordering", "total_magnetization",
                "energy_above_hull", "crystal_system",
                "spacegroup_number", "volume",
                "nsites", "density"]
    )

df_raw = pd.DataFrame([{
    'mpid': d.material_id,
    'formula': d.formula_pretty,
    'band_gap': d.band_gap,
    'ordering': str(d.ordering),
    'mag_total': d.total_magnetization,
    'e_hull': d.energy_above_hull,
    'crystal_sys':d.crystal_system,
    'spacegroup': d.spacegroup_number,
    'volume': d.volume,
    'nsites': d.nsites,
    'density': d.density,
} for d in docs])
print(f"Raw: {len(df_raw)} entries")

Stage 2 — Clean and Label

from pymatgen.core import Composition, Element

def extract_features_from_formula(f):
    """Extract metal element, chalcogen, and stoichiometry."""
    comp = Composition(f)
    chalcogens = {'S','Se','Te'}
    metals = {el.symbol:float(amt) for el,amt in comp.items()
              if el.symbol not in chalcogens}
    chalcs = {el.symbol:float(amt) for el,amt in comp.items()
              if el.symbol in chalcogens}
    if not metals or not chalcs: return None
    M = max(metals, key=metals.get)
    C = max(chalcs, key=chalcs.get)
    return pd.Series({'metal':M,'chalcogen':C,
                      'M_amt':metals[M],'C_amt':chalcs[C]})

# Apply and merge
extra = df_raw['formula'].apply(extract_features_from_formula)
df = pd.concat([df_raw, extra], axis=1).dropna(subset=['metal'])

# Clean ordering labels
df['ordering'] = df['ordering'].str.upper().str.replace('ORDERING.','')
df = df[df['ordering'].isin(['FM','AFM','NM'])]

# Dedup — keep lowest E-hull per formula
df = df.sort_values('e_hull').drop_duplicates('formula',keep='first')
print(f"Clean: {len(df)} entries | ordering: {df['ordering'].value_counts().to_dict()}")

Stage 3 — Physics-Informed Feature Engineering

Raw database fields are good starting features, but the most powerful descriptors encode domain knowledge: electronegativity differences drive ionic/covalent character, d-electron count governs crystal-field splitting and Hund's coupling, and the chalcogen period controls covalency (S < Se < Te).

FeaturePhysics motivationSource
ΔENElectronegativity difference → ionic/covalent character → gap sizePauling scale (pymatgen)
ndd-electron count → crystal field, Hund's rule, Stoner criterionElement data
Chalc. periodS(3p) < Se(4p) < Te(5p): increasing covalency, smaller gapsPeriodic table
M/C ratioStoichiometry controls band filling and bonding geometryFormula parsing
Volume/siteProxy for bond length → bandwidth → gap vs metal characterMP
SpacegroupCrystal symmetry encodes coordination environmentMP
U/W proxynd(5−nd)/ΔEN ≈ Mott criterion: large → correlated insulatorDerived
GKA flag180° M-C-M super-exchange → AFM if half-filled d-shellDerived (nd=5)
from pymatgen.core import Element

# Periodic table lookups
CHALC_PERIOD = {'S':3, 'Se':4, 'Te':5}
D_ELECTRONS = {'Ti':2,'V':3,'Cr':4,'Mn':5,
               'Fe':6,'Co':7,'Ni':8,'Cu':9}

def engineer_features(row):
    M, C = row['metal'], row['chalcogen']
    eM = Element(M).X # Pauling electronegativity
    eC = Element(C).X
    nd = D_ELECTRONS.get(M, 0)

    return {
        'delta_EN': eC - eM,
        'n_d': nd,
        'chalc_period': CHALC_PERIOD[C],
        'MC_ratio': row['M_amt'] / (row['C_amt'] + 1e-6),
        'vol_per_site': row['volume'] / (row['nsites'] + 1e-6),
        'spacegroup': row['spacegroup'],
        'e_hull': row['e_hull'],
        'density': row['density'],
        # Physics-derived descriptors
        'UW_proxy': nd * (10 - nd) / (eC - eM + 0.1), # Mott criterion proxy
        'GKA_flag': 1 if nd == 5 else 0, # half-filled → AFM
        'hund_coupling': min(nd, 10-nd), # Hund's rule energy
    }

feat_df = df.apply(engineer_features, axis=1, result_type='expand')
df = pd.concat([df, feat_df], axis=1)
FEATURES = ['delta_EN','n_d','chalc_period','MC_ratio',
            'vol_per_site','spacegroup','e_hull','density',
            'UW_proxy','GKA_flag','hund_coupling']

Stage 4 — Train and Compare Multiple Models

A good pipeline never commits to a single model. We train five algorithms simultaneously and let cross-validation decide.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_absolute_error, r2_score

X = df[FEATURES].values
y_reg = df['band_gap'].values # regression target
y_cls = df['ordering'].values # classification target

MODELS = {
    'Ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'RF': RandomForestRegressor(n_estimators=200, random_state=42),
    'GBM': GradientBoostingRegressor(n_estimators=200, random_state=42),
    'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10)),
    'MLP': make_pipeline(StandardScaler(),
                          MLPRegressor(hidden_layer_sizes=(64,32),
                                      max_iter=1000, random_state=42)),
}

cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}

for name, model in MODELS.items():
    scores = cross_val_score(model, X, y_reg, cv=cv,
                            scoring='neg_mean_absolute_error')
    results[name] = -scores.mean()
    print(f"{name:8s} CV-MAE = {-scores.mean():.3f} ± {scores.std():.3f} eV")

Stage 5 — Evaluate the Best Model

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Pick best model from CV results
best_name = min(results, key=results.get)
best_model = MODELS[best_name]
print(f"Best model: {best_name} (MAE = {results[best_name]:.3f} eV)")

# Final train/test split for reporting
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_reg, test_size=0.2, random_state=42)
best_model.fit(X_tr, y_tr)
y_pred = best_model.predict(X_te)

print(f"Test MAE : {mean_absolute_error(y_te, y_pred):.3f} eV")
print(f"Test R² : {r2_score(y_te, y_pred):.3f}")

# Parity plot
plt.figure(figsize=(5,5))
plt.scatter(y_te, y_pred, alpha=0.7)
plt.plot([0,max(y_te)],[0,max(y_te)],'--r')
plt.xlabel('DFT Band Gap (eV)'); plt.ylabel('ML Predicted (eV)')
plt.title(f'{best_name} — Test set parity plot')
plt.tight_layout(); plt.savefig('parity_plot.png', dpi=150)

Stage 6 — Interpret with SHAP

A prediction is only publishable if you can explain why the model made it. SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for each prediction — grounded in cooperative game theory.

import shap

# Fit final model on all data before SHAP
best_model.fit(X, y_reg)

# TreeExplainer works for RF and GBM; use KernelExplainer for SVR/MLP
if best_name in ['RF', 'GBM']:
    explainer = shap.TreeExplainer(best_model)
    shap_values = explainer.shap_values(X)
else:
    X_bg = shap.sample(X, 50) # background dataset for KernelExplainer
    explainer = shap.KernelExplainer(best_model.predict, X_bg)
    shap_values = explainer.shap_values(X)

# Summary plot — feature importance + direction
shap.summary_plot(shap_values, X, feature_names=FEATURES,
                  plot_type='dot', show=False)
plt.tight_layout(); plt.savefig('shap_summary.png', dpi=150)

# Force plot for a single prediction (e.g. compound index 0)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0],
                X[0], feature_names=FEATURES)
🔬
What to look for in the SHAP summary

For transition-metal chalcogenides, a physically well-behaved model should show: ΔEN as the top feature (ionic character → gap), nd second (d-band filling → metal/insulator boundary), chalc_period third (S<Se<Te covalency trend), and UW_proxy positively correlated with band gap size (Mott criterion). If spacegroup or density dominate, your model may be learning spurious structural correlations rather than electronic physics.

7. Saving and Deploying the Pipeline

import joblib, json, datetime

# Save the complete pipeline (preprocessor + model)
joblib.dump(best_model, f"MxCy_pipeline_{best_name}_v1.joblib")

# Save metadata alongside the model
meta = {
    'model': best_name,
    'features': FEATURES,
    'target': 'band_gap_eV',
    'cv_mae_eV': round(results[best_name], 4),
    'n_train': len(df),
    'chemsys': 'MxCy (M=Fe,Ni,Co,Mn,Cr,Ti,V,Cu; C=S,Se,Te)',
    'date': datetime.date.today().isoformat(),
}
json.dump(meta, open('MxCy_pipeline_meta.json','w'), indent=2)

# Load and predict on a new compound
pipeline = joblib.load(f"MxCy_pipeline_{best_name}_v1.joblib")
new_compound = np.array([[1.10, 6, 4, 0.5, 18.3, 194, 0.01, 5.2, 24, 0, 4]])
print(f"Predicted band gap: {pipeline.predict(new_compound)[0]:.3f} eV")

8. Common Pitfalls and How to Avoid Them

PitfallSymptomFix
Data leakage Test R² = 0.99, new compound prediction terrible Never fit scalers or imputers on test data; use pipelines
Wrong CV strategy CV score looks good but real-world performance poor Use GroupKFold by metal element to test genuine generalisation
PBE gap = 0 ≠ metal Many correlated insulators labelled as metals Use DFT+U values from MP for 3d transition metals; flag band_gap < 0.1 as "uncertain"
Class imbalance Ordering classifier always predicts NM (majority) Use class_weight='balanced' or SMOTE oversampling
Feature leakage Including band_gap as a feature when predicting ordering Carefully separate features from targets; use only composition-based descriptors for true prediction
🔧
App 15 — Interactive ML Pipeline Builder
Select features, choose a model, tune hyperparameters, and run cross-validation on the MₓCᵧ dataset live — see parity plots, CV scores, feature importances, and SHAP values update instantly.
Open App →

Quick Check

1. You achieve R² = 0.98 on your test set but only 0.61 on new compounds from a different database. What is the most likely cause?

  • A. The model has too few parameters
  • B. Data leakage — the test set was not truly independent, or the scaler was fitted on all data including the test set
  • C. The learning rate was too high
  • D. The SHAP values were computed incorrectly

2. Your SHAP summary shows "spacegroup_number" as the most important feature for band gap prediction. What should concern you?

  • A. Nothing — space group is a valid physical descriptor
  • B. The model may be learning a spurious correlation specific to your training set. Space group encodes crystal symmetry but not the electronic mechanism behind the gap — ΔEN and n_d should dominate for a physically meaningful model
  • C. SHAP cannot handle categorical features like space group
  • D. The model is underfitting because space group has too many values

3. For a magnetic ordering classifier (NM/AFM/FM) trained on MₓCᵧ data, why should you use GroupKFold by metal element rather than standard KFold?

  • A. GroupKFold is faster for large datasets
  • B. Standard KFold can place Fe-S and Fe-Se compounds in different folds, giving the model information about Fe behaviour during validation. GroupKFold ensures all Fe compounds are either all-train or all-test — testing true generalisation to new metal systems
  • C. KFold does not support classification targets
  • D. GroupKFold automatically balances the class distribution
Pipeline Feature Engineering Cross-validation SHAP Random Forest Band Gap MₓCᵧ joblib scikit-learn