Header Ads Widget

AI & Machine Learning for Materials Sciences

Last Posts

10/recent/ticker-posts

Post 13: Python Essentials for ML — NumPy, pandas & scikit-learn

The three-library starter kit every materials scientist needs before writing a single ML model.

Module 4
🔢
NumPy

Fast arrays, vectorised math, linear algebra

📊
pandas

Tabular data — load, filter, encode, clean

⚙️
scikit-learn

Unified fit / predict / score for any ML model

🔑
Key rule

Never fit_transform on test data — data leakage!

In Posts 5–12 we built intuition for ML algorithms — from linear regression to GNNs. Now we move to Module 4: practical tools. Before you can train any model on real materials data, you need to master three Python libraries that form the backbone of almost every ML workflow in science.

These are not abstract concepts — they are the actual code you will write in Post 15 to build your first property-prediction pipeline. Think of them as the DFT equivalent of setting up your INCAR, KPOINTS, and POSCAR files: essential scaffolding before the calculation can run.

🔬
Good news for DFT researchers

Python is already the scripting language of the DFT ecosystem — ASE, pymatgen, AiiDA, and the JARVIS tools from Post 12 all speak it natively. If you have run any of these, you already have a working Python + NumPy environment. This post shows you the ML-specific patterns on top of that foundation.

1. NumPy — The Language of Arrays

NumPy gives Python a fast, compiled array type called ndarray. In DFT you work with matrices constantly — charge density grids, k-point meshes, eigenvalue lists, Hamiltonian matrices. NumPy handles all of these efficiently in Python, without writing a single loop in C or Fortran.

1.1 Creating arrays from materials data

import numpy as np

# 1-D: band gaps (eV) for 5 materials
band_gaps = np.array([1.12, 3.37, 3.37, 0.66, 5.47])

# 2-D: rows = materials, cols = [a, b, c] lattice params (Å)
lattice = np.array([
    [5.43, 5.43, 5.43], # Si
    [4.51, 4.51, 7.31], # GaN (wurtzite)
    [3.25, 3.25, 5.21], # ZnO
])

print(band_gaps.shape) # (5,)
print(lattice.shape) # (3, 3)

1.2 Vectorised operations — no loops needed

# Convert eV → kJ/mol for all materials at once
band_gaps_kJ = band_gaps * 96.485

# Normalise to [0, 1] (min-max scaling, manual version)
bg_norm = (band_gaps - band_gaps.min()) / (band_gaps.max() - band_gaps.min())

# Boolean mask — select only wide-gap materials (Eg > 2 eV)
wide = band_gaps[band_gaps > 2.0]
print(wide) # [3.37 3.37 5.47]

# Reshape: (5,) → (5, 1) — required before sklearn models
X = band_gaps.reshape(-1, 1)
Key habit: avoid Python loops over array elements

NumPy operations execute in compiled C at orders-of-magnitude higher speed than Python for loops. With 50,000 materials from the Materials Project, a vectorised operation takes milliseconds; a loop takes minutes.

1.3 NumPy operations you will use in every ML script

# Statistics
np.mean(band_gaps) # 2.798
np.std(band_gaps) # 1.713
np.corrcoef(band_gaps, lattice[:,0]) # correlation matrix

# Linear algebra — used in PCA, SVD-based descriptors
A = np.array([[2,1],[1,3]])
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Stacking arrays (combine feature columns)
density = np.array([2.33, 6.15, 5.61, 5.32, 3.51])
X = np.column_stack([band_gaps, density]) # shape (5, 2)

2. pandas — Data Tables for Materials

A pandas DataFrame is a table with named columns and labelled rows — a smart spreadsheet living inside Python. For materials ML, it is how you store and manipulate your feature matrix before handing it to scikit-learn.

2.1 Building a materials DataFrame

import pandas as pd

data = {
    'material' : ['Si', 'GaN', 'ZnO', 'Ge', 'Diamond'],
    'band_gap' : [1.12, 3.37, 3.37, 0.66, 5.47],
    'lattice_a' : [5.43, 4.51, 3.25, 5.66, 3.57],
    'density' : [2.33, 6.15, 5.61, 5.32, 3.51],
    'crystal' : ['cubic', 'wurtzite', 'wurtzite', 'cubic', 'cubic'],
}
df = pd.DataFrame(data)
print(df.head())

2.2 Exploring and cleaning data

# Always start with these three lines on any new dataset
df.describe() # statistics: mean, std, min, quartiles
df.dtypes # check column types (float64, object...)
df.isnull().sum() # count missing values per column

# Filtering rows
wide = df[df['band_gap'] > 2.0]
cubic = df[df['crystal'] == 'cubic']

# Add computed features
df['volume'] = df['lattice_a'] ** 3 # cubic approximation
df['is_wide'] = (df['band_gap'] > 2.0).astype(int) # binary label
⚠️
Missing values — the silent accuracy killer

DFT databases always have gaps: some properties are not computed for every structure. Before training any model, run df.isnull().sum(). Then decide: drop incomplete rows (df.dropna()), fill with the column median (df.fillna(df.median())), or use scikit-learn's SimpleImputer. Never silently pass NaN values to a model — most algorithms will silently produce wrong predictions.

2.3 Reading from files — Materials Project, AFLOW, ICSD exports

# CSV exported from Materials Project or AFLOW
df = pd.read_csv('mp_data.csv')

# Excel file (e.g. your own Heusler alloy compilation)
df = pd.read_excel('heusler_alloys.xlsx', sheet_name='Band gaps')

# JSON from Materials Project API (mp-api)
df = pd.json_normalize(api_results) # flatten nested JSON

# Save your cleaned dataset
df.to_csv('features_clean.csv', index=False)

2.4 One-hot encoding crystal structures

ML models only understand numbers. Text categories like cubic, wurtzite, hexagonal must be converted into numeric columns before training.

# pd.get_dummies converts one column → several binary (0/1) columns
df_enc = pd.get_dummies(df, columns=['crystal'])
# New columns: crystal_cubic crystal_wurtzite
print(df_enc.columns.tolist())

# Drop original text column if it was not auto-handled
df_num = df.drop(columns=['material', 'crystal'])

3. scikit-learn — ML in Five Lines

scikit-learn provides every classical ML algorithm from Posts 5–8 (linear regression, SVMs, random forests) plus all the utilities you need to evaluate and tune them — all sharing the same clean API. You can swap one algorithm for another without rewriting any surrounding code.

3.1 The unified API — fit / predict / score

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train) # learn patterns from training data
y_pred = model.predict(X_test) # predict on unseen test set
score = model.score(X_test, y_test) # returns R²

# Swap the model — everything else stays identical:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # same fit / predict / score calls work

3.2 Train / test split

from sklearn.model_selection import train_test_split

features = ['lattice_a', 'density', 'volume', 'crystal_cubic']
X = df_enc[features].values # NumPy array, shape (n_materials, n_features)
y = df_enc['band_gap'].values # 1-D target vector

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 80 % training · 20 % held-out test — never used during training

3.3 Feature scaling with StandardScaler

SVMs, ridge regression, and neural networks are sensitive to feature scale — a density in g/cm³ and a lattice parameter in Å have very different numerical ranges. StandardScaler maps each feature to mean = 0, std = 1.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # compute mean/std on train, then scale
X_test = scaler.transform(X_test) # apply the SAME transform (no refit!)

# ⚠️ NEVER call fit_transform on X_test — that is data leakage
🔑
The most common beginner mistake: data leakage

Calling fit_transform(X_test) computes normalisation statistics (mean, std) using test data, which the model is not supposed to have seen. Your evaluation metric becomes optimistic, and your model will underperform when deployed on new data. Always: fit on train, transform both.

3.4 Cross-validation — essential for small datasets

Materials datasets are often small (a few hundred DFT calculations). A single 80/20 split can be misleading — you might get lucky or unlucky with which structures land in the test set. K-fold cross-validation repeats the split K times and averages the score.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y, cv=5, scoring='neg_mean_absolute_error'
)
print(f"CV MAE = {-scores.mean():.3f} ± {scores.std():.3f} eV")
# 5 independent train/test splits — much more reliable than one split

3.5 Evaluation metrics for regression

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE = {mae:.3f} eV")
print(f"RMSE = {rmse:.3f} eV")
print(f"R² = {r2:.3f}")
MetricFormulaInterpretationMaterials context
MAE mean|y − ŷ| Average absolute error — easy to read in target units ALIGNN achieves MAE ≈ 0.022 eV/atom for formation energy
RMSE √mean(y−ŷ)² Penalises large errors more — sensitive to outliers Useful when rare but catastrophically wrong predictions matter
1 − SSres/SStot 1 = perfect · 0 = predicts mean · <0 = worse than mean Target R² > 0.9 for publishable property models

4. A Complete End-to-End Pipeline

Let us assemble everything into a single script that loads a materials CSV, engineers a feature matrix, trains a random forest, and prints evaluation metrics. This is a miniature version of what you will build fully in Post 15.

1
Load

pd.read_csv, dropna

2
Engineer

get_dummies, add computed cols

3
Split

train_test_split 80/20

4
Scale

fit on train, transform both

5
Train

model.fit(X_train, y_train)

6
Evaluate

MAE · RMSE · R² · CV

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score

# ── 1. Load ──
df = pd.read_csv('materials.csv')
df = df.dropna(subset=['band_gap', 'lattice_a', 'density'])

# ── 2. Feature engineering ──
df['volume'] = df['lattice_a'] ** 3
df = pd.get_dummies(df, columns=['crystal_system'])

features = ['lattice_a', 'density', 'volume',
            'atomic_number', 'electronegativity']
X = df[features].values
y = df['band_gap'].values

# ── 3. Split ──
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ── 4. Scale ──
sc = StandardScaler()
X_tr = sc.fit_transform(X_tr)
X_te = sc.transform(X_te)

# ── 5. Train ──
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)

# ── 6. Evaluate ──
y_pred = model.predict(X_te)
print(f"MAE = {mean_absolute_error(y_te, y_pred):.3f} eV")
print(f"R² = {r2_score(y_te, y_pred):.3f}")

cv = cross_val_score(model, X, y, cv=5,
                     scoring='neg_mean_absolute_error')
print(f"CV MAE = {-cv.mean():.3f} ± {cv.std():.3f} eV")

# ── Feature importance ──
for name, imp in zip(features, model.feature_importances_):
    print(f" {name:22s} {imp:.3f}")
🎯
What this gives you

A trained model that predicts band gap from structural features in milliseconds — versus hours for a full DFT calculation. Feature importances tell you which descriptors carry the most predictive power, directly guiding your next DFT screening campaign: spend compute time on structures where those top features are most informative.

📦
App 13 — Python ML Sandbox
Load a simulated materials dataset, choose features, pick an algorithm, toggle StandardScaler, and watch MAE / R² / CV scores update in real time — all without installing Python.
Open App →

Quick Check

1. You have a DataFrame with 1,200 materials. After calling train_test_split(X, y, test_size=0.2), how many samples are in X_train?

  • A. 200
  • B. 960
  • C. 1,000
  • D. 240

2. A colleague scales features by calling scaler.fit_transform(X_test). What is wrong with this?

  • A. Nothing — StandardScaler always needs to see all data to work correctly
  • B. It will crash because X_test has a different number of columns
  • C. It is data leakage: the scaler learns statistics (mean, std) from the test set, making evaluation artificially optimistic and the deployed model less accurate
  • D. It produces incorrect feature order, shuffling the column names

3. You train a random forest on 300 materials and get R² = 0.95 on the test set. A reviewer asks for a more reliable estimate. What should you report instead?

  • A. Train on all 300 and report the training R²
  • B. Increase the test split to 50% and retrain
  • C. Run 5-fold or 10-fold cross-validation with cross_val_score and report mean ± std R² across folds
  • D. Use MAE instead of R² — it is always more reliable
NumPy pandas scikit-learn train_test_split StandardScaler cross_val_score RandomForest MAE data leakage