Post 13: Python Essentials for ML — NumPy, pandas & scikit-learn

Module 4 · Post 13

The three-library starter kit every materials scientist needs before writing a single ML model.

⏱ 18 min read · Level: Beginner · AI & ML for Materials Scientists

Module 4

🔢

NumPy

Fast arrays, vectorised math, linear algebra

📊

pandas

Tabular data — load, filter, encode, clean

⚙️

scikit-learn

Unified fit / predict / score for any ML model

🔑

Key rule

Never fit_transform on test data — data leakage!

In Posts 5–12 we built intuition for ML algorithms — from linear regression to GNNs. Now we move to Module 4: practical tools. Before you can train any model on real materials data, you need to master three Python libraries that form the backbone of almost every ML workflow in science.

These are not abstract concepts — they are the actual code you will write in Post 15 to build your first property-prediction pipeline. Think of them as the DFT equivalent of setting up your INCAR, KPOINTS, and POSCAR files: essential scaffolding before the calculation can run.

🔬

Good news for DFT researchers

Python is already the scripting language of the DFT ecosystem — ASE, pymatgen, AiiDA, and the JARVIS tools from Post 12 all speak it natively. If you have run any of these, you already have a working Python + NumPy environment. This post shows you the ML-specific patterns on top of that foundation.

1. NumPy — The Language of Arrays

NumPy gives Python a fast, compiled array type called ndarray. In DFT you work with matrices constantly — charge density grids, k-point meshes, eigenvalue lists, Hamiltonian matrices. NumPy handles all of these efficiently in Python, without writing a single loop in C or Fortran.

1.1 Creating arrays from materials data

import numpy as np

# 1-D: band gaps (eV) for 5 materials

band_gaps = np.array([1.12, 3.37, 3.37, 0.66, 5.47])

# 2-D: rows = materials, cols = [a, b, c] lattice params (Å)

lattice = np.array([

    [5.43, 5.43, 5.43],   # Si

    [4.51, 4.51, 7.31],   # GaN (wurtzite)

    [3.25, 3.25, 5.21],   # ZnO

])

print(band_gaps.shape)   # (5,)

print(lattice.shape)     # (3, 3)

1.2 Vectorised operations — no loops needed

# Convert eV → kJ/mol for all materials at once

band_gaps_kJ = band_gaps * 96.485

# Normalise to [0, 1]  (min-max scaling, manual version)

bg_norm = (band_gaps - band_gaps.min()) / (band_gaps.max() - band_gaps.min())

# Boolean mask — select only wide-gap materials (Eg > 2 eV)

wide = band_gaps[band_gaps > 2.0]

print(wide)   # [3.37 3.37 5.47]

# Reshape: (5,) → (5, 1)  — required before sklearn models

X = band_gaps.reshape(-1, 1)

✅

Key habit: avoid Python loops over array elements

NumPy operations execute in compiled C at orders-of-magnitude higher speed than Python for loops. With 50,000 materials from the Materials Project, a vectorised operation takes milliseconds; a loop takes minutes.

1.3 NumPy operations you will use in every ML script

# Statistics

np.mean(band_gaps)         # 2.798

np.std(band_gaps)          # 1.713

np.corrcoef(band_gaps, lattice[:,0])  # correlation matrix

# Linear algebra — used in PCA, SVD-based descriptors

A = np.array([[2,1],[1,3]])

eigenvalues, eigenvectors = np.linalg.eigh(A)

# Stacking arrays (combine feature columns)

density = np.array([2.33, 6.15, 5.61, 5.32, 3.51])

X = np.column_stack([band_gaps, density])  # shape (5, 2)

2. pandas — Data Tables for Materials

A pandas DataFrame is a table with named columns and labelled rows — a smart spreadsheet living inside Python. For materials ML, it is how you store and manipulate your feature matrix before handing it to scikit-learn.

2.1 Building a materials DataFrame

import pandas as pd

data = {

    'material'  : ['Si', 'GaN', 'ZnO', 'Ge', 'Diamond'],

    'band_gap'  : [1.12, 3.37, 3.37, 0.66, 5.47],

    'lattice_a' : [5.43, 4.51, 3.25, 5.66, 3.57],

    'density'   : [2.33, 6.15, 5.61, 5.32, 3.51],

    'crystal'   : ['cubic', 'wurtzite', 'wurtzite', 'cubic', 'cubic'],

}

df = pd.DataFrame(data)

print(df.head())

2.2 Exploring and cleaning data

# Always start with these three lines on any new dataset

df.describe()          # statistics: mean, std, min, quartiles

df.dtypes              # check column types (float64, object...)

df.isnull().sum()     # count missing values per column

# Filtering rows

wide = df[df['band_gap'] > 2.0]

cubic = df[df['crystal'] == 'cubic']

# Add computed features

df['volume']  = df['lattice_a'] ** 3           # cubic approximation

df['is_wide'] = (df['band_gap'] > 2.0).astype(int)  # binary label

⚠️

Missing values — the silent accuracy killer

DFT databases always have gaps: some properties are not computed for every structure. Before training any model, run df.isnull().sum(). Then decide: drop incomplete rows (df.dropna()), fill with the column median (df.fillna(df.median())), or use scikit-learn's SimpleImputer. Never silently pass NaN values to a model — most algorithms will silently produce wrong predictions.

2.3 Reading from files — Materials Project, AFLOW, ICSD exports

# CSV exported from Materials Project or AFLOW

df = pd.read_csv('mp_data.csv')

# Excel file (e.g. your own Heusler alloy compilation)

df = pd.read_excel('heusler_alloys.xlsx', sheet_name='Band gaps')

# JSON from Materials Project API (mp-api)

df = pd.json_normalize(api_results)   # flatten nested JSON

# Save your cleaned dataset

df.to_csv('features_clean.csv', index=False)

2.4 One-hot encoding crystal structures

ML models only understand numbers. Text categories like cubic, wurtzite, hexagonal must be converted into numeric columns before training.

# pd.get_dummies converts one column → several binary (0/1) columns

df_enc = pd.get_dummies(df, columns=['crystal'])

# New columns: crystal_cubic  crystal_wurtzite

print(df_enc.columns.tolist())

# Drop original text column if it was not auto-handled

df_num = df.drop(columns=['material', 'crystal'])

3. scikit-learn — ML in Five Lines

scikit-learn provides every classical ML algorithm from Posts 5–8 (linear regression, SVMs, random forests) plus all the utilities you need to evaluate and tune them — all sharing the same clean API. You can swap one algorithm for another without rewriting any surrounding code.

3.1 The unified API — fit / predict / score

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)

model.fit(X_train, y_train)          # learn patterns from training data

y_pred = model.predict(X_test)       # predict on unseen test set

score  = model.score(X_test, y_test)  # returns R²

# Swap the model — everything else stays identical:

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)   # same fit / predict / score calls work

3.2 Train / test split

from sklearn.model_selection import train_test_split

features = ['lattice_a', 'density', 'volume', 'crystal_cubic']

X = df_enc[features].values   # NumPy array, shape (n_materials, n_features)

y = df_enc['band_gap'].values  # 1-D target vector

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.2, random_state=42

)

# 80 % training · 20 % held-out test — never used during training

3.3 Feature scaling with StandardScaler

SVMs, ridge regression, and neural networks are sensitive to feature scale — a density in g/cm³ and a lattice parameter in Å have very different numerical ranges. StandardScaler maps each feature to mean = 0, std = 1.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)   # compute mean/std on train, then scale

X_test  = scaler.transform(X_test)         # apply the SAME transform (no refit!)

# ⚠️  NEVER call fit_transform on X_test — that is data leakage

🔑

The most common beginner mistake: data leakage

Calling fit_transform(X_test) computes normalisation statistics (mean, std) using test data, which the model is not supposed to have seen. Your evaluation metric becomes optimistic, and your model will underperform when deployed on new data. Always: fit on train, transform both.

3.4 Cross-validation — essential for small datasets

Materials datasets are often small (a few hundred DFT calculations). A single 80/20 split can be misleading — you might get lucky or unlucky with which structures land in the test set. K-fold cross-validation repeats the split K times and averages the score.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(

    model, X, y, cv=5, scoring='neg_mean_absolute_error'

)

print(f"CV MAE = {-scores.mean():.3f} ± {scores.std():.3f} eV")

# 5 independent train/test splits — much more reliable than one split

3.5 Evaluation metrics for regression

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae  = mean_absolute_error(y_test, y_pred)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2   = r2_score(y_test, y_pred)

print(f"MAE  = {mae:.3f} eV")

print(f"RMSE = {rmse:.3f} eV")

print(f"R²   = {r2:.3f}")

Metric	Formula	Interpretation	Materials context
MAE	mean\|y − ŷ\|	Average absolute error — easy to read in target units	ALIGNN achieves MAE ≈ 0.022 eV/atom for formation energy
RMSE	√mean(y−ŷ)²	Penalises large errors more — sensitive to outliers	Useful when rare but catastrophically wrong predictions matter
R²	1 − SS_res/SS_tot	1 = perfect · 0 = predicts mean · <0 = worse than mean	Target R² > 0.9 for publishable property models

4. A Complete End-to-End Pipeline

Let us assemble everything into a single script that loads a materials CSV, engineers a feature matrix, trains a random forest, and prints evaluation metrics. This is a miniature version of what you will build fully in Post 15.

Load

pd.read_csv, dropna

Engineer

get_dummies, add computed cols

Split

train_test_split 80/20

Scale

fit on train, transform both

Train

model.fit(X_train, y_train)

Evaluate

MAE · RMSE · R² · CV

import numpy as np

import pandas as pd

from sklearn.ensemble      import RandomForestRegressor

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.preprocessing  import StandardScaler

from sklearn.metrics        import mean_absolute_error, r2_score

# ── 1. Load ──

df = pd.read_csv('materials.csv')

df = df.dropna(subset=['band_gap', 'lattice_a', 'density'])

# ── 2. Feature engineering ──

df['volume'] = df['lattice_a'] ** 3

df = pd.get_dummies(df, columns=['crystal_system'])

features = ['lattice_a', 'density', 'volume',

            'atomic_number', 'electronegativity']

X = df[features].values

y = df['band_gap'].values

# ── 3. Split ──

X_tr, X_te, y_tr, y_te = train_test_split(

    X, y, test_size=0.2, random_state=42)

# ── 4. Scale ──

sc = StandardScaler()

X_tr = sc.fit_transform(X_tr)

X_te = sc.transform(X_te)

# ── 5. Train ──

model = RandomForestRegressor(n_estimators=200, random_state=42)

model.fit(X_tr, y_tr)

# ── 6. Evaluate ──

y_pred = model.predict(X_te)

print(f"MAE = {mean_absolute_error(y_te, y_pred):.3f} eV")

print(f"R²  = {r2_score(y_te, y_pred):.3f}")

cv = cross_val_score(model, X, y, cv=5,

                     scoring='neg_mean_absolute_error')

print(f"CV MAE = {-cv.mean():.3f} ± {cv.std():.3f} eV")

# ── Feature importance ──

for name, imp in zip(features, model.feature_importances_):

    print(f"  {name:22s}  {imp:.3f}")

🎯

What this gives you

A trained model that predicts band gap from structural features in milliseconds — versus hours for a full DFT calculation. Feature importances tell you which descriptors carry the most predictive power, directly guiding your next DFT screening campaign: spend compute time on structures where those top features are most informative.

Quick Check

1. You have a DataFrame with 1,200 materials. After calling train_test_split(X, y, test_size=0.2), how many samples are in X_train?

A. 200
B. 960
C. 1,000
D. 240

2. A colleague scales features by calling scaler.fit_transform(X_test). What is wrong with this?

A. Nothing — StandardScaler always needs to see all data to work correctly
B. It will crash because X_test has a different number of columns
C. It is data leakage: the scaler learns statistics (mean, std) from the test set, making evaluation artificially optimistic and the deployed model less accurate
D. It produces incorrect feature order, shuffling the column names

3. You train a random forest on 300 materials and get R² = 0.95 on the test set. A reviewer asks for a more reliable estimate. What should you report instead?

A. Train on all 300 and report the training R²
B. Increase the test split to 50% and retrain
C. Run 5-fold or 10-fold cross-validation with cross_val_score and report mean ± std R² across folds
D. Use MAE instead of R² — it is always more reliable

NumPy pandas scikit-learn train_test_split StandardScaler cross_val_score RandomForest MAE R² data leakage

Header Ads Widget

Last Posts

Post 13: Python Essentials for ML — NumPy, pandas & scikit-learn

1. NumPy — The Language of Arrays

1.1 Creating arrays from materials data

1.2 Vectorised operations — no loops needed

1.3 NumPy operations you will use in every ML script

2. pandas — Data Tables for Materials

2.1 Building a materials DataFrame

2.2 Exploring and cleaning data

2.3 Reading from files — Materials Project, AFLOW, ICSD exports

2.4 One-hot encoding crystal structures

3. scikit-learn — ML in Five Lines

3.1 The unified API — fit / predict / score

3.2 Train / test split

3.3 Feature scaling with StandardScaler

3.4 Cross-validation — essential for small datasets

3.5 Evaluation metrics for regression

4. A Complete End-to-End Pipeline

Quick Check

About me

My page

Popular Posts

Post 1: What is Artificial Intelligence? A researcher's first look

Post 2: ML vs Traditional Simulation — Where does DFT end and ML begin?

Post 4: Types of ML — Supervised, Unsupervised & Reinforcement Learning

Post 7: Decision Trees and Random Forests — Handling Non-linear Boundaries

Post 3: Key Mathematical Tools — Vectors, Matrices & Probability

Post 6 — Classification: Is This Material a Metal or Insulator?

Post 8: Support Vector Machines (SVMs) — with a crystal property example

Post 5: Linear Regression — Predicting Band Gap from Simple Features.

Post 14: Working with Materials Databases — ICSD, Materials Project, AFLOW

Post 12: Graph Neural Networks — ALIGNN & CGCNN Explained for Beginners

Categories

Pageviews past week

You may contact me here

Foundations

Core Algorithms

Neural Networks & Deep Learning

Menu Footer Widget