Bring everything together: fetch data from the Materials Project, engineer physics-informed features, train and compare multiple ML models, evaluate with cross-validation, and export a ready-to-use predictor — all in one reproducible Python pipeline for transition-metal chalcogenides.
End-to-end ML pipeline
Band gap + magnetic ordering
MₓCᵧ from Materials Project
Trained model + SHAP analysis
Every post in this course has built one piece of the puzzle: data fetching (Post 14), feature engineering (Posts 3–4), regression (Post 5), classification (Post 6), ensemble methods (Post 7), and neural networks (Posts 9–10). This final post of Module 4 assembles all of these into a single reproducible pipeline that goes from a Materials Project query to a trained, evaluated, and interpretable ML model — the kind of pipeline you would actually use in a research publication.
A six-stage pipeline for MₓCᵧ transition-metal chalcogenides: Fetch → Clean → Engineer features → Train & compare → Evaluate → Interpret with SHAP. Two targets: band gap regression (eV) and magnetic ordering classification (NM / AFM / FM). All code is copy-paste ready.
Pipeline Architecture
Stage 1 — Fetch Data from the Materials Project
import pandas as pd, numpy as np
CHEMSYS = [
f"{m}-{c}" for m in ["Fe","Ni","Co","Mn","Cr","Ti","V","Cu"]
for c in ["S", "Se", "Te"]
]
with MPRester() as mpr: # reads MP_API_KEY env var
docs = mpr.materials.summary.search(
chemsys=CHEMSYS,
energy_above_hull=(0, 0.1),
fields=["material_id", "formula_pretty",
"band_gap", "ordering", "total_magnetization",
"energy_above_hull", "crystal_system",
"spacegroup_number", "volume",
"nsites", "density"]
)
df_raw = pd.DataFrame([{
'mpid': d.material_id,
'formula': d.formula_pretty,
'band_gap': d.band_gap,
'ordering': str(d.ordering),
'mag_total': d.total_magnetization,
'e_hull': d.energy_above_hull,
'crystal_sys':d.crystal_system,
'spacegroup': d.spacegroup_number,
'volume': d.volume,
'nsites': d.nsites,
'density': d.density,
} for d in docs])
print(f"Raw: {len(df_raw)} entries")
Stage 2 — Clean and Label
def extract_features_from_formula(f):
"""Extract metal element, chalcogen, and stoichiometry."""
comp = Composition(f)
chalcogens = {'S','Se','Te'}
metals = {el.symbol:float(amt) for el,amt in comp.items()
if el.symbol not in chalcogens}
chalcs = {el.symbol:float(amt) for el,amt in comp.items()
if el.symbol in chalcogens}
if not metals or not chalcs: return None
M = max(metals, key=metals.get)
C = max(chalcs, key=chalcs.get)
return pd.Series({'metal':M,'chalcogen':C,
'M_amt':metals[M],'C_amt':chalcs[C]})
# Apply and merge
extra = df_raw['formula'].apply(extract_features_from_formula)
df = pd.concat([df_raw, extra], axis=1).dropna(subset=['metal'])
# Clean ordering labels
df['ordering'] = df['ordering'].str.upper().str.replace('ORDERING.','')
df = df[df['ordering'].isin(['FM','AFM','NM'])]
# Dedup — keep lowest E-hull per formula
df = df.sort_values('e_hull').drop_duplicates('formula',keep='first')
print(f"Clean: {len(df)} entries | ordering: {df['ordering'].value_counts().to_dict()}")
Stage 3 — Physics-Informed Feature Engineering
Raw database fields are good starting features, but the most powerful descriptors encode domain knowledge: electronegativity differences drive ionic/covalent character, d-electron count governs crystal-field splitting and Hund's coupling, and the chalcogen period controls covalency (S < Se < Te).
| Feature | Physics motivation | Source |
|---|---|---|
| ΔEN | Electronegativity difference → ionic/covalent character → gap size | Pauling scale (pymatgen) |
| nd | d-electron count → crystal field, Hund's rule, Stoner criterion | Element data |
| Chalc. period | S(3p) < Se(4p) < Te(5p): increasing covalency, smaller gaps | Periodic table |
| M/C ratio | Stoichiometry controls band filling and bonding geometry | Formula parsing |
| Volume/site | Proxy for bond length → bandwidth → gap vs metal character | MP |
| Spacegroup | Crystal symmetry encodes coordination environment | MP |
| U/W proxy | nd(5−nd)/ΔEN ≈ Mott criterion: large → correlated insulator | Derived |
| GKA flag | 180° M-C-M super-exchange → AFM if half-filled d-shell | Derived (nd=5) |
# Periodic table lookups
CHALC_PERIOD = {'S':3, 'Se':4, 'Te':5}
D_ELECTRONS = {'Ti':2,'V':3,'Cr':4,'Mn':5,
'Fe':6,'Co':7,'Ni':8,'Cu':9}
def engineer_features(row):
M, C = row['metal'], row['chalcogen']
eM = Element(M).X # Pauling electronegativity
eC = Element(C).X
nd = D_ELECTRONS.get(M, 0)
return {
'delta_EN': eC - eM,
'n_d': nd,
'chalc_period': CHALC_PERIOD[C],
'MC_ratio': row['M_amt'] / (row['C_amt'] + 1e-6),
'vol_per_site': row['volume'] / (row['nsites'] + 1e-6),
'spacegroup': row['spacegroup'],
'e_hull': row['e_hull'],
'density': row['density'],
# Physics-derived descriptors
'UW_proxy': nd * (10 - nd) / (eC - eM + 0.1), # Mott criterion proxy
'GKA_flag': 1 if nd == 5 else 0, # half-filled → AFM
'hund_coupling': min(nd, 10-nd), # Hund's rule energy
}
feat_df = df.apply(engineer_features, axis=1, result_type='expand')
df = pd.concat([df, feat_df], axis=1)
FEATURES = ['delta_EN','n_d','chalc_period','MC_ratio',
'vol_per_site','spacegroup','e_hull','density',
'UW_proxy','GKA_flag','hund_coupling']
Stage 4 — Train and Compare Multiple Models
A good pipeline never commits to a single model. We train five algorithms simultaneously and let cross-validation decide.
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_absolute_error, r2_score
X = df[FEATURES].values
y_reg = df['band_gap'].values # regression target
y_cls = df['ordering'].values # classification target
MODELS = {
'Ridge': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
'RF': RandomForestRegressor(n_estimators=200, random_state=42),
'GBM': GradientBoostingRegressor(n_estimators=200, random_state=42),
'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10)),
'MLP': make_pipeline(StandardScaler(),
MLPRegressor(hidden_layer_sizes=(64,32),
max_iter=1000, random_state=42)),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in MODELS.items():
scores = cross_val_score(model, X, y_reg, cv=cv,
scoring='neg_mean_absolute_error')
results[name] = -scores.mean()
print(f"{name:8s} CV-MAE = {-scores.mean():.3f} ± {scores.std():.3f} eV")
Stage 5 — Evaluate the Best Model
import matplotlib.pyplot as plt
# Pick best model from CV results
best_name = min(results, key=results.get)
best_model = MODELS[best_name]
print(f"Best model: {best_name} (MAE = {results[best_name]:.3f} eV)")
# Final train/test split for reporting
X_tr, X_te, y_tr, y_te = train_test_split(
X, y_reg, test_size=0.2, random_state=42)
best_model.fit(X_tr, y_tr)
y_pred = best_model.predict(X_te)
print(f"Test MAE : {mean_absolute_error(y_te, y_pred):.3f} eV")
print(f"Test R² : {r2_score(y_te, y_pred):.3f}")
# Parity plot
plt.figure(figsize=(5,5))
plt.scatter(y_te, y_pred, alpha=0.7)
plt.plot([0,max(y_te)],[0,max(y_te)],'--r')
plt.xlabel('DFT Band Gap (eV)'); plt.ylabel('ML Predicted (eV)')
plt.title(f'{best_name} — Test set parity plot')
plt.tight_layout(); plt.savefig('parity_plot.png', dpi=150)
Stage 6 — Interpret with SHAP
A prediction is only publishable if you can explain why the model made it. SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for each prediction — grounded in cooperative game theory.
# Fit final model on all data before SHAP
best_model.fit(X, y_reg)
# TreeExplainer works for RF and GBM; use KernelExplainer for SVR/MLP
if best_name in ['RF', 'GBM']:
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X)
else:
X_bg = shap.sample(X, 50) # background dataset for KernelExplainer
explainer = shap.KernelExplainer(best_model.predict, X_bg)
shap_values = explainer.shap_values(X)
# Summary plot — feature importance + direction
shap.summary_plot(shap_values, X, feature_names=FEATURES,
plot_type='dot', show=False)
plt.tight_layout(); plt.savefig('shap_summary.png', dpi=150)
# Force plot for a single prediction (e.g. compound index 0)
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0],
X[0], feature_names=FEATURES)
For transition-metal chalcogenides, a physically well-behaved model should show: ΔEN as the top feature (ionic character → gap), nd second (d-band filling → metal/insulator boundary), chalc_period third (S<Se<Te covalency trend), and UW_proxy positively correlated with band gap size (Mott criterion). If spacegroup or density dominate, your model may be learning spurious structural correlations rather than electronic physics.
7. Saving and Deploying the Pipeline
# Save the complete pipeline (preprocessor + model)
joblib.dump(best_model, f"MxCy_pipeline_{best_name}_v1.joblib")
# Save metadata alongside the model
meta = {
'model': best_name,
'features': FEATURES,
'target': 'band_gap_eV',
'cv_mae_eV': round(results[best_name], 4),
'n_train': len(df),
'chemsys': 'MxCy (M=Fe,Ni,Co,Mn,Cr,Ti,V,Cu; C=S,Se,Te)',
'date': datetime.date.today().isoformat(),
}
json.dump(meta, open('MxCy_pipeline_meta.json','w'), indent=2)
# Load and predict on a new compound
pipeline = joblib.load(f"MxCy_pipeline_{best_name}_v1.joblib")
new_compound = np.array([[1.10, 6, 4, 0.5, 18.3, 194, 0.01, 5.2, 24, 0, 4]])
print(f"Predicted band gap: {pipeline.predict(new_compound)[0]:.3f} eV")
8. Common Pitfalls and How to Avoid Them
| Pitfall | Symptom | Fix |
|---|---|---|
| Data leakage | Test R² = 0.99, new compound prediction terrible | Never fit scalers or imputers on test data; use pipelines |
| Wrong CV strategy | CV score looks good but real-world performance poor | Use GroupKFold by metal element to test genuine generalisation |
| PBE gap = 0 ≠ metal | Many correlated insulators labelled as metals | Use DFT+U values from MP for 3d transition metals; flag band_gap < 0.1 as "uncertain" |
| Class imbalance | Ordering classifier always predicts NM (majority) | Use class_weight='balanced' or SMOTE oversampling |
| Feature leakage | Including band_gap as a feature when predicting ordering | Carefully separate features from targets; use only composition-based descriptors for true prediction |
Quick Check
1. You achieve R² = 0.98 on your test set but only 0.61 on new compounds from a different database. What is the most likely cause?
2. Your SHAP summary shows "spacegroup_number" as the most important feature for band gap prediction. What should concern you?
3. For a magnetic ordering classifier (NM/AFM/FM) trained on MₓCᵧ data, why should you use GroupKFold by metal element rather than standard KFold?