Learn to query the world's largest computational and experimental materials databases programmatically: fetch crystal structures, band gaps, magnetic moments, and thermodynamic stability data directly into your Python workflow.
Databases: ICSD, Materials Project, AFLOW, OQMD
Libraries: mp-api, aflow, pymatgen, requests
Data: structures, band gaps, DOS, E-hull
Focus: transition-metal chalcogenides MₓCᵧ
Every ML model we built in Modules 2 and 3 used a hand-crafted 12-compound dataset. Real materials ML requires thousands of DFT-computed entries, far more than any single research group can compute from scratch. Fortunately, three major open databases have collectively computed millions of compounds and made the results freely accessible via web APIs: the Materials Project, AFLOW, and the OQMD. The experimental counterpart, the ICSD, contains over 270,000 experimentally determined crystal structures but requires an institutional licence. This post shows you how to query the three open databases from Python.
The target dataset covers transition-metal chalcogenides MₓCᵧ (M = Fe, Ni, Co, Mn, Cr, Ti; C = S, Se, Te), with band gap, magnetic ordering, energy above hull, crystal system, and space group for each entry. This is the exact dataset used in the MxCy_pipeline from the research pipeline series.
1. The Major Materials Databases — At a Glance
| Database | Type | Entries | Key data | Access |
|---|---|---|---|---|
| Materials Project | Computational (DFT-PBE/HSE) | ~160,000 inorganic | Eg, E-hull, magnetic moment, DOS, band structure | Free API key at materialsproject.org |
| AFLOW | Computational (DFT-PBE) | ~3.5 million | Eg, elastic constants, Debye temp, Bader charges | REST API, no key needed |
| OQMD | Computational (DFT-PBE) | ~1.2 million | Formation energy, E-hull, structure | REST API, no key needed |
| ICSD | Experimental (X-ray / neutron) | ~270,000 | Crystal structure, space group, lattice parameters | Institutional licence (FIZ Karlsruhe) |
| NOMAD | Computational (multi-code) | ~14 million calculations | Raw DFT output files (VASP, QE, Wien2k, SIESTA…) | Free, nomad-lab.eu |
Start with the Materials Project — it has the most complete
property set (band gap, magnetic ordering, E-hull, DOS) and the best Python
API (mp-api). Cross-check stability with AFLOW's
larger dataset. For raw Wien2k/VASP output files of your own calculations,
upload to NOMAD for long-term FAIR data management.
2. Materials Project — The Essential Starting Point
The Materials Project provides a modern REST API accessible via the official
mp-api Python client. You need a free API key from
materialsproject.org
(sign up → Dashboard → API Key).
Installation and setup
pip install mp-api pymatgen pandas requests
# Set your API key as an environment variable (recommended)
# On Linux/Mac: export MP_API_KEY="your_key_here"
# On Windows: set MP_API_KEY=your_key_here
# Or pass directly in code (not recommended for shared scripts)
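A minimal sketch of reading the key from the environment and failing early with a clear message when it is missing. The helper name `load_mp_api_key` is hypothetical; only the `MP_API_KEY` variable name comes from the setup notes above.

```python
import os


def load_mp_api_key(env=None):
    """Return the Materials Project API key from the environment.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("MP_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MP_API_KEY is not set. Export it in your shell before running the script."
        )
    return key
```

Calling `load_mp_api_key()` at the top of a script surfaces a missing key before any network request is made.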
Fetching transition-metal chalcogenides
import os

import pandas as pd
from mp_api.client import MPRester

# Read the key from the environment (MPRester also falls back to MP_API_KEY when given None)
API_KEY = os.environ.get("MP_API_KEY")

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        chemsys=["Fe-S", "Fe-Se", "Fe-Te",
                 "Ni-S", "Ni-Se", "Ni-Te",
                 "Co-S", "Co-Se", "Co-Te",
                 "Mn-S", "Mn-Se", "Mn-Te",
                 "Cr-S", "Cr-Se", "Cr-Te",
                 "Ti-S", "Ti-Se", "Ti-Te"],
        energy_above_hull=(0, 0.1),  # eV/atom, near-stable only
        # Crystal system and space group are returned inside the "symmetry" field
        fields=["material_id", "formula_pretty",
                "band_gap", "is_magnetic",
                "ordering", "total_magnetization",
                "energy_above_hull", "symmetry", "volume"],
    )

# Convert to DataFrame
df = pd.DataFrame([{
    'mpid': str(d.material_id),
    'formula': d.formula_pretty,
    'band_gap': d.band_gap,
    'is_magnetic': d.is_magnetic,
    'ordering': d.ordering,
    'mag_moment': d.total_magnetization,
    'e_above_hull': d.energy_above_hull,
    'crystal_sys': str(d.symmetry.crystal_system),
    'spacegroup': d.symmetry.number,
    'volume': d.volume,  # needed later as an ML feature
} for d in docs])
print(f"Fetched {len(df)} compounds")
df.head()
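The eighteen chemical systems in the query above are every pairing of the six metals with the three chalcogens, so the list can also be generated rather than typed. A small sketch; the variable names are illustrative:

```python
from itertools import product

metals = ["Fe", "Ni", "Co", "Mn", "Cr", "Ti"]
chalcogens = ["S", "Se", "Te"]

# One "M-C" string per metal/chalcogen pair, in the same order as the hand-written list
chemsys_list = [f"{m}-{c}" for m, c in product(metals, chalcogens)]

print(len(chemsys_list))   # 18
print(chemsys_list[:3])    # ['Fe-S', 'Fe-Se', 'Fe-Te']
```

Passing `chemsys=chemsys_list` to the search keeps the query in sync if you later add a metal or a chalcogen.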
Fetching a full crystal structure (CIF / pymatgen Structure)
# Get pymatgen Structure object for FeS2 (pyrite)
# The client is only usable inside a `with MPRester(...)` block, so open a new one here
with MPRester(API_KEY) as mpr:
    structure = mpr.get_structure_by_material_id("mp-226")
print(structure)  # lattice + sites
structure.to(fmt="cif", filename="FeS2_pyrite.cif")  # save CIF

# Extract lattice parameters
lat = structure.lattice
print(f"a={lat.a:.3f} Å b={lat.b:.3f} Å c={lat.c:.3f} Å")
print(f"α={lat.alpha:.2f}° β={lat.beta:.2f}° γ={lat.gamma:.2f}°")
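The six lattice parameters printed above fix the unit-cell volume through the general triclinic formula. The sketch below evaluates it with the standard library only; the function name `cell_volume` and the numbers are illustrative, not values fetched from the database.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume in Å^3 from edge lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# Illustrative cubic cell (a = 5.4 Å is a made-up value, not fetched data)
print(f"{cell_volume(5.4, 5.4, 5.4, 90, 90, 90):.3f} Å^3")
```

pymatgen exposes the same quantity as `lat.volume`, which is a convenient sanity check against the `volume` field returned by the API.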
3. AFLOW — Three Million Compounds, No API Key Needed
AFLOW provides a REST API at aflow.org/API. No registration is required.
Queries return JSON and can be filtered by compound, property, or prototype structure.
import pandas as pd
import requests

# AFLOW REST API: search for Fe-S binary compounds
BASE = "http://aflow.org/API/aflowlib.php"
params = {
    "species": "Fe,S",
    "Egap,gt": "0",  # only semiconductors/insulators
    "paging": "1",
    "format": "json",
    "fields": "compound,Egap,Emag,spacegroup_orig,volume_cell",
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()  # stop on HTTP errors instead of parsing an error page
data = resp.json()
df_aflow = pd.DataFrame(data)
print(f"AFLOW: {len(df_aflow)} Fe-S entries with Eg > 0")
print(df_aflow[['compound', 'Egap', 'Emag', 'spacegroup_orig']].head(8))
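When a REST query returns nothing or fails, it helps to look at the exact query string that was sent. The sketch below encodes a parameter dictionary like the one above with the standard library; `requests` applies similar percent-encoding when it builds the URL.

```python
from urllib.parse import urlencode

params = {
    "species": "Fe,S",
    "Egap,gt": "0",
    "paging": "1",
    "format": "json",
}
query = urlencode(params)
print(query)   # species=Fe%2CS&Egap%2Cgt=0&paging=1&format=json
```

On a live response, `resp.url` shows the final URL that was actually requested, which is the quickest way to compare your call against the database documentation.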
AFLOW uses DFT-PBE without a Hubbard U correction for most entries. For transition-metal compounds with strong correlation effects (the monoxides NiO, CoO, and MnO are the classic examples), the PBE band gap is severely underestimated, or even zero, for compounds that are experimentally insulating. Always cross-check with the Materials Project (which applies DFT+U for transition-metal oxides) or with NOMAD raw outputs.
4. OQMD — Formation Energies and Phase Stability
The Open Quantum Materials Database (OQMD) is particularly strong for
thermodynamic stability data. Its REST API at
oqmd.org/oqmdapi returns formation energies and stability information.
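import pandas as pd
import requests

BASE = "https://oqmd.org/oqmdapi/formationenergy"
params = {
    # OQMD takes property conditions as one `filter` expression joined by AND
    # (check the OQMD REST documentation for the current keyword list)
    "filter": "element_set=Fe,S AND ntypes=2 AND stability<0.1",  # binary Fe-S, near-stable (eV/atom)
    "fields": "name,spacegroup,stability,band_gap,delta_e",
    "limit": "50",
    "format": "json",
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()
results = resp.json()["data"]  # entries are returned under the "data" key
df_oqmd = pd.DataFrame(results)
print(df_oqmd[['name', 'spacegroup', 'stability',
               'band_gap', 'delta_e']].head(10))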
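A single request with a `limit` of 50 returns only the first page of matches. The sketch below shows one generic way to collect every page; it assumes the API accepts an offset alongside the limit, and `fetch_all` and `fetch_page` are hypothetical names. The fake data stands in for a real HTTP call so the loop can be run offline.

```python
def fetch_all(fetch_page, page_size=50, max_pages=100):
    """Collect records page by page until a short or empty page is returned.

    fetch_page(offset, limit) is a caller-supplied function returning a list of
    records; for a web API it would wrap requests.get with those two values.
    """
    records = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size, page_size)
        records.extend(batch)
        if len(batch) < page_size:   # last page reached
            break
    return records


# Stand-in for a real API call: 120 fake records served in slices
fake_db = [{"entry": i} for i in range(120)]
rows = fetch_all(lambda offset, limit: fake_db[offset:offset + limit])
print(len(rows))   # 120
```

The `max_pages` cap is a safety net so a misbehaving endpoint cannot keep the loop running indefinitely.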
5. Merging Databases — Building a Richer Training Set
Each database has different strengths. The most powerful approach for materials ML is to merge entries across databases, using the chemical formula and space group as the join key, and flag discrepancies (e.g. MP says Eg = 0, AFLOW says 0.3 eV) for manual review.
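from pymatgen.core import Composition

# Normalise formula strings for comparison
def norm_formula(f):
    return Composition(f).reduced_formula

df['formula_norm'] = df['formula'].apply(norm_formula)
df_aflow['formula_norm'] = df_aflow['compound'].apply(norm_formula)

# Make the join keys and the gap numeric on both sides (APIs may return strings)
df['spacegroup'] = pd.to_numeric(df['spacegroup'], errors='coerce').astype('Int64')
df_aflow['spacegroup'] = pd.to_numeric(df_aflow['spacegroup_orig'], errors='coerce').astype('Int64')
df_aflow['Egap'] = pd.to_numeric(df_aflow['Egap'], errors='coerce')

# Merge MP + AFLOW on normalised formula and space group, so polymorphs are not mixed
merged = df.merge(df_aflow[['formula_norm', 'spacegroup', 'Egap', 'Emag']],
                  on=['formula_norm', 'spacegroup'], how='left')

# Flag band gap discrepancies > 0.5 eV
merged['Eg_discrepancy'] = (merged['band_gap'] - merged['Egap']).abs()
flagged = merged[merged['Eg_discrepancy'] > 0.5]
print(f"{len(flagged)} compounds with >0.5 eV gap disagreement between MP and AFLOW")
print(flagged[['formula', 'band_gap', 'Egap', 'Eg_discrepancy']])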
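A left merge silently leaves gaps where one database has no matching entry, and those rows never show up as discrepancies because the difference is NaN. The toy example below (made-up values, not real database entries) uses the pandas `indicator=True` option to list the unmatched rows separately from the flagged ones.

```python
import pandas as pd

# Toy values for illustration only, not real database entries
mp = pd.DataFrame({
    "formula_norm": ["FeS2", "MnS", "NiS"],
    "band_gap": [1.0, 2.3, 0.0],
})
aflow = pd.DataFrame({
    "formula_norm": ["FeS2", "MnS"],
    "Egap": [0.9, 0.8],
})

merged = mp.merge(aflow, on="formula_norm", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]   # MP rows with no AFLOW partner
merged["Eg_discrepancy"] = (merged["band_gap"] - merged["Egap"]).abs()
flagged = merged[merged["Eg_discrepancy"] > 0.5]

print(unmatched["formula_norm"].tolist())   # ['NiS']
print(flagged["formula_norm"].tolist())     # ['MnS']
```

Reporting both lists keeps "no counterpart found" distinct from "counterpart found and agrees".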
6. Data Cleaning — The Step Everyone Skips (Don't)
- Remove duplicates: the same formula can appear multiple times as different polymorphs. Keep the lowest-energy entry, or all distinct space groups, depending on your goal.
- Handle missing values: not all entries have all properties (e.g. magnetic ordering is only computed if spin-polarised calculations were run). Use df.dropna(subset=['band_gap']) or impute carefully.
- Filter by stability: restrict to energy_above_hull < 0.1 eV/atom for experimentally plausible compounds. Highly metastable entries (Ehull > 0.3 eV/atom) distort regression models.
- Check for DFT-PBE artefacts: PBE systematically underestimates band gaps. For transition-metal compounds, consider DFT+U values (available in MP) or flag all gaps < 0.05 eV as "effectively zero".
- Validate units: MP gives band gaps in eV and volumes in ų. AFLOW's Egap is in eV, but check the documentation version, because units changed between API versions.
import pandas as pd
import numpy as np

def clean_dataset(df):
    # 1. Drop entries with missing essential properties
    df = df.dropna(subset=['band_gap', 'e_above_hull']).copy()
    # 2. Filter thermodynamically unstable entries
    df = df[df['e_above_hull'] < 0.1]
    # 3. Classify band gap into Metal / Semiconductor / Insulator
    df['class'] = pd.cut(df['band_gap'],
                         bins=[-0.01, 0.05, 2.0, np.inf],
                         labels=['Metal', 'Semiconductor', 'Insulator'])
    # 4. Remove duplicates: keep lowest E-hull per formula
    df = df.sort_values('e_above_hull')
    df = df.drop_duplicates(subset='formula', keep='first')
    # 5. Reset index
    return df.reset_index(drop=True)

df_clean = clean_dataset(df)
print(f"Clean dataset: {len(df_clean)} compounds")
print(df_clean['class'].value_counts())
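The bin edges passed to `pd.cut` are right-inclusive by default, so a gap of exactly 0.05 eV is labelled Metal and exactly 2.0 eV is labelled Semiconductor. A quick check on illustrative values makes the boundary behaviour explicit:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([0.0, 0.05, 0.06, 2.0, 2.5])   # eV, illustrative values
labels = pd.cut(gaps,
                bins=[-0.01, 0.05, 2.0, np.inf],
                labels=['Metal', 'Semiconductor', 'Insulator'])
print(labels.tolist())
# ['Metal', 'Metal', 'Semiconductor', 'Semiconductor', 'Insulator']
```

The lower edge of -0.01 is what lets a gap of exactly 0.0 eV fall inside the first bin rather than being left unlabelled.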
7. Saving Your Dataset — Ready for ML
df_clean.to_csv("MxCy_dataset.csv", index=False)      # portable
df_clean.to_parquet("MxCy_dataset.parquet")           # fast + typed (needs pyarrow or fastparquet)
df_clean.to_excel("MxCy_dataset.xlsx", index=False)   # for review (needs openpyxl)

# Load and pass to sklearn pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv("MxCy_dataset.csv")
# 'class' is derived from band_gap, so band_gap must not be a feature (target leakage)
features = ['e_above_hull', 'mag_moment', 'volume']
df = df.dropna(subset=features + ['class'])  # sklearn estimators reject NaN inputs
X = df[features].values
y = df['class'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
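Stratified splitting needs at least two samples in every class; with `stratify=y`, `train_test_split` raises a `ValueError` if any class has a single member. A small pre-check, using the hypothetical helper name `rare_classes` and made-up labels, can catch this before the split:

```python
from collections import Counter


def rare_classes(labels, min_count=2):
    """Return the classes that occur fewer than min_count times."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n < min_count)


y_example = ["Metal"] * 6 + ["Semiconductor"] * 3 + ["Insulator"]   # made-up labels
print(rare_classes(y_example))   # ['Insulator']
```

If the list is not empty, either gather more data for those classes, merge them into a neighbouring class, or drop the `stratify` argument.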
Quick Check
1. You query the Materials Project for FeS and get band_gap = 0.0 eV. Does this definitively mean FeS is metallic?
2. Why should you filter by energy_above_hull < 0.1 eV/atom before training an ML model?
3. You find that the same compound (MnS) appears with band_gap = 2.3 eV in MP and 0.8 eV in AFLOW. What is the most likely cause?