AI & Machine Learning for Materials Sciences

Post 14: Working with Materials Databases — ICSD, Materials Project, AFLOW

Learn to query the world's largest computational and experimental materials databases programmatically: fetch crystal structures, band gaps, magnetic moments, and thermodynamic stability data directly into your Python workflow.

🗄️
Databases

ICSD, Materials Project, AFLOW, OQMD

🐍
Tools

mp-api, aflow, pymatgen, requests

⚛️
Data types

Structures, band gaps, DOS, E-hull

🔬
Focus

Transition-metal chalcogenides MₓCᵧ

Every ML model we built in Modules 2 and 3 used a hand-crafted 12-compound dataset. Real materials ML requires thousands of DFT-computed entries, far more than any single research group can compute from scratch. Fortunately, three major open databases have collectively computed millions of compounds and made the results freely accessible via web APIs: the Materials Project, AFLOW, and the OQMD. The experimental counterpart, the ICSD, contains over 270,000 experimentally determined crystal structures but requires an institutional licence. This post shows you how to query the three open databases from Python.

🗄️
What we will fetch

Transition-metal chalcogenides MₓCᵧ (M = Fe, Ni, Co, Mn, Cr, Ti; C = S, Se, Te) — band gaps, magnetic ordering, energy above hull, crystal system, and space group. This is the exact dataset used in the MxCy_pipeline from the research pipeline series.

1. The Major Materials Databases — At a Glance

Database | Type | Entries | Key data | Access
Materials Project | Computational (DFT-PBE/HSE) | ~160,000 inorganic | Eg, E-hull, magnetic moment, DOS, band structure | Free API key at materialsproject.org
AFLOW | Computational (DFT-PBE) | ~3.5 million | Eg, elastic constants, Debye temp, Bader charges | REST API, no key needed
OQMD | Computational (DFT-PBE) | ~1.2 million | Formation energy, E-hull, structure | REST API, no key needed
ICSD | Experimental (X-ray / neutron) | ~270,000 | Crystal structure, space group, lattice parameters | Institutional licence (FIZ Karlsruhe)
NOMAD | Computational (multi-code) | ~14 million calculations | Raw DFT output files (VASP, QE, Wien2k, SIESTA…) | Free, nomad-lab.eu
🔬
Which database for transition-metal chalcogenides?

Start with the Materials Project — it has the most complete property set (band gap, magnetic ordering, E-hull, DOS) and the best Python API (mp-api). Cross-check stability with AFLOW's larger dataset. For raw Wien2k/VASP output files of your own calculations, upload to NOMAD for long-term FAIR data management.

2. Materials Project — The Essential Starting Point

The Materials Project provides a modern REST API accessible via the official mp-api Python client. You need a free API key from materialsproject.org (sign up → Dashboard → API Key).

Installation and setup

# Install the official client
pip install mp-api pymatgen

# Set your API key as an environment variable (recommended)
# On Linux/Mac: export MP_API_KEY="your_key_here"
# On Windows: set MP_API_KEY=your_key_here
# Or pass directly in code (not recommended for shared scripts)
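
A minimal sketch of reading the key from the environment instead of hard-coding it. The helper name `load_mp_api_key` is hypothetical and not part of mp-api; only the `MP_API_KEY` variable name comes from the setup notes above.

```python
import os

def load_mp_api_key(env=None):
    """Return the Materials Project API key from the environment.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    Raises RuntimeError with a readable message if the key is missing or blank.
    """
    env = os.environ if env is None else env
    key = env.get("MP_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MP_API_KEY is not set; export it before running this script."
        )
    return key
```

The returned string can then be passed to the client, for example `MPRester(load_mp_api_key())`, so a missing key fails early with a clear message rather than as an authentication error mid-query.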

Fetching transition-metal chalcogenides

from mp_api.client import MPRester
import pandas as pd

API_KEY = "your_api_key_here" # or omit if MP_API_KEY env var is set

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        chemsys=["Fe-S", "Fe-Se", "Fe-Te",
                 "Ni-S", "Ni-Se", "Ni-Te",
                 "Co-S", "Co-Se", "Co-Te",
                 "Mn-S", "Mn-Se", "Mn-Te",
                 "Cr-S", "Cr-Se", "Cr-Te",
                 "Ti-S", "Ti-Se", "Ti-Te"],
        energy_above_hull=(0, 0.1), # eV/atom — near-stable only
        fields=["material_id", "formula_pretty",
                "band_gap", "is_magnetic",
                "ordering", "total_magnetization",
                "energy_above_hull", "volume",
                "symmetry"]  # crystal system and space group live in `symmetry`
    )

# Convert to DataFrame (one row per material)
df = pd.DataFrame([{
    'mpid': str(d.material_id),
    'formula': d.formula_pretty,
    'band_gap': d.band_gap,
    'is_magnetic': d.is_magnetic,
    'ordering': d.ordering,
    'mag_moment': d.total_magnetization,
    'e_above_hull': d.energy_above_hull,
    'crystal_sys': str(d.symmetry.crystal_system),
    'spacegroup': d.symmetry.number,
    'volume': d.volume,  # cell volume in Å^3; used as an ML feature later
} for d in docs])

print(f"Fetched {len(df)} compounds")
df.head()
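
The eighteen chemical systems in the query above follow a simple pattern (every metal paired with every chalcogen), so the list can be generated rather than typed by hand. This is a small sketch; the function name `binary_chemsys` is made up for illustration.

```python
from itertools import product

METALS = ["Fe", "Ni", "Co", "Mn", "Cr", "Ti"]
CHALCOGENS = ["S", "Se", "Te"]

def binary_chemsys(metals, chalcogens):
    """Return 'M-C' chemical-system strings for every metal/chalcogen pair."""
    return [f"{m}-{c}" for m, c in product(metals, chalcogens)]

chemsys = binary_chemsys(METALS, CHALCOGENS)
print(len(chemsys), chemsys[:3])  # 18 ['Fe-S', 'Fe-Se', 'Fe-Te']
```

Adding a metal or chalcogen then means editing one list, which avoids silently dropping a system when the selection grows.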

Fetching a full crystal structure (CIF / pymatgen Structure)

with MPRester(API_KEY) as mpr:
    # Get pymatgen Structure object for FeS2 (pyrite)
    structure = mpr.get_structure_by_material_id("mp-226")

print(structure) # lattice + sites
structure.to(fmt="cif", filename="FeS2_pyrite.cif") # save CIF

# Extract lattice parameters
lat = structure.lattice
print(f"a={lat.a:.3f} Å b={lat.b:.3f} Å c={lat.c:.3f} Å")
print(f"α={lat.alpha:.2f}° β={lat.beta:.2f}° γ={lat.gamma:.2f}°")
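
As a quick sanity check on printed lattice parameters, the cell volume can be recomputed from the six values with the general triclinic formula and compared with the volume the database reports. The sketch below uses only the standard library; the numbers in the example call are round illustrative values, not fetched data.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume in Å^3 from lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)

# Illustrative cubic cell with a = 5.4 Å (a round number chosen for the example)
print(f"{cell_volume(5.4, 5.4, 5.4, 90, 90, 90):.2f} Å^3")  # 157.46 Å^3
```

For a cubic cell the formula collapses to a³; a large mismatch against the reported volume usually means a primitive cell is being compared with a conventional one.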

3. AFLOW — Three Million Compounds, No API Key Needed

AFLOW provides a REST API at aflow.org/API. No registration is required. Queries return JSON and can be filtered by compound, property, or prototype structure.

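import requests
import pandas as pd

# AFLOW REST API: search for Fe-S binary compounds
# NOTE: AFLOW's search interface (AFLUX, aflow.org/API/aflux/) has its own
# query syntax; check the current AFLOW documentation if this endpoint or
# these keywords are rejected.
BASE = "http://aflow.org/API/aflowlib.php"

params = {
    "species": "Fe,S",
    "Egap,gt": "0", # only semiconductors/insulators
    "paging": "1",
    "format": "json",
    "fields": "compound,Egap,Emag,spacegroup_orig,volume_cell"
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
data = resp.json()

df_aflow = pd.DataFrame(data)
print(f"AFLOW: {len(df_aflow)} Fe-S entries with Eg > 0")
print(df_aflow[['compound','Egap','Emag','spacegroup_orig']].head(8))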
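
Public endpoints like the one queried above occasionally time out or return transient errors. A small retry wrapper with exponential backoff is one way to make such calls more robust. This is a generic sketch, not something the AFLOW API requires; `fetch_with_retry` and its parameters are hypothetical names.

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `retries` attempts are used up.

    `fetch` is any zero-argument callable that raises on failure, e.g.
    lambda: requests.get(url, params=params, timeout=30).json()
    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Passing the request as a callable keeps the helper independent of any one HTTP library, and the injectable `sleep` lets it be tested without real waiting.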
⚠️
AFLOW band gap caveat

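AFLOW band gaps are computed at the PBE level. For strongly correlated transition-metal compounds (the textbook cases are the oxides NiO, CoO, and MnO), the PBE band gap is severely underestimated, or even zero for compounds that are experimentally insulating, and the result is sensitive to whether and how a Hubbard U correction is applied. Databases differ on this point: the Materials Project applies DFT+U to selected transition metals in oxides and fluorides only, so its sulfide, selenide, and telluride gaps are also plain PBE values. Always cross-check gaps between databases, and against NOMAD raw outputs or experiment where available.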

4. OQMD — Formation Energies and Phase Stability

The Open Quantum Materials Database (OQMD) is particularly strong for thermodynamic stability data. Its REST API at oqmd.org/oqmdapi returns formation energies and stability information.

import requests
import pandas as pd

BASE = "https://oqmd.org/oqmdapi/formationenergy"

params = {
    # Conditions are combined in a single `filter` string:
    # contains Fe AND S, exactly two element types, near-stable (eV/atom)
    "filter": "element_set=Fe,S AND ntypes=2 AND stability<0.1",
    "fields": "name,spacegroup,stability,band_gap,delta_e",
    "limit": "50",
    "format": "json"
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors
results = resp.json()['data']  # matching entries are returned under 'data'

df_oqmd = pd.DataFrame(results)
print(df_oqmd[['name','spacegroup','stability',
               'band_gap','delta_e']].head(10))
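
The query above caps the result at 50 rows. To collect a larger set, results are usually fetched page by page. The sketch below assumes the endpoint accepts an `offset` parameter alongside `limit` (check the OQMD documentation before relying on this); `fetch_page` is a hypothetical callable that wraps the actual request.

```python
def fetch_all(fetch_page, limit=50, max_pages=100):
    """Collect rows from a paged endpoint.

    `fetch_page(offset, limit)` is a hypothetical callable returning a list
    of rows; a page shorter than `limit` marks the end of the results.
    `max_pages` is a safety cap so a misbehaving server cannot loop forever.
    """
    rows = []
    for page in range(max_pages):
        batch = fetch_page(page * limit, limit)
        rows.extend(batch)
        if len(batch) < limit:
            break
    return rows
```

In practice `fetch_page` would issue the `requests.get` call with `offset` and `limit` added to the query parameters and return the list of entries from the JSON response.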

5. Merging Databases — Building a Richer Training Set

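Each database has different strengths. A powerful approach for materials ML is to merge entries across databases and flag discrepancies (e.g. MP says Eg = 0, AFLOW says 0.3 eV) for manual review. The normalised chemical formula is the simplest join key; add the space group to the key when you need to keep polymorphs apart, because a formula-only join pairs every polymorph in one database with every polymorph of that formula in the other.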

import pandas as pd
from pymatgen.core import Composition

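# Normalise formula strings for comparison (e.g. "Fe2S4" -> "FeS2")
def norm_formula(f):
    return Composition(f).reduced_formula

df['formula_norm'] = df['formula'].apply(norm_formula)
df_aflow['formula_norm'] = df_aflow['compound'].apply(norm_formula)

# JSON values may arrive as strings; coerce to numbers before doing arithmetic
df_aflow['Egap'] = pd.to_numeric(df_aflow['Egap'], errors='coerce')

# Merge MP + AFLOW on normalised formula
# (formula-only key: polymorphs yield one row per MP/AFLOW pairing)
merged = df.merge(df_aflow[['formula_norm','Egap','Emag']],
                 on='formula_norm', how='left', suffixes=('_mp','_aflow'))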

# Flag band gap discrepancies > 0.5 eV
merged['Eg_discrepancy'] = (merged['band_gap'] - merged['Egap']).abs()
flagged = merged[merged['Eg_discrepancy'] > 0.5]
print(f"{len(flagged)} compounds with >0.5 eV gap disagreement between MP and AFLOW")
print(flagged[['formula','band_gap','Egap','Eg_discrepancy']])
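
To keep polymorphs apart, the join key can include the space group as well as the formula. The following standard-library sketch shows the idea on plain dictionaries. `reduced_formula` is a simplified stand-in for pymatgen's `Composition.reduced_formula` (flat formulas with integer counts only, element order kept as given), and the row layout and numeric values are made up for illustration.

```python
import re
from functools import reduce
from math import gcd

def reduced_formula(formula):
    """Reduce integer stoichiometry, e.g. 'Fe2S4' -> 'FeS2' (simplified)."""
    counts = {}
    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[el] = counts.get(el, 0) + int(n or 1)
    divisor = reduce(gcd, counts.values())
    parts = []
    for el, c in counts.items():
        n = c // divisor
        parts.append(el if n == 1 else f"{el}{n}")
    return "".join(parts)

def join_on_formula_and_spacegroup(mp_rows, aflow_rows):
    """Left-join MP rows to AFLOW rows on (reduced formula, space group)."""
    index = {(reduced_formula(r["compound"]), int(r["spacegroup"])): r
             for r in aflow_rows}
    joined = []
    for r in mp_rows:
        key = (reduced_formula(r["formula"]), int(r["spacegroup"]))
        match = index.get(key)
        joined.append({**r, "Egap_aflow": match["Egap"] if match else None})
    return joined

# Made-up rows: two FeS2 polymorphs on the MP side, one on the AFLOW side
mp_rows = [{"formula": "FeS2", "spacegroup": 205, "band_gap": 1.0},
           {"formula": "FeS2", "spacegroup": 58, "band_gap": 0.9}]
aflow_rows = [{"compound": "Fe4S8", "spacegroup": 205, "Egap": 0.8}]
joined = join_on_formula_and_spacegroup(mp_rows, aflow_rows)
```

Only the space-group-205 row finds a partner here; the other polymorph keeps `Egap_aflow = None` instead of inheriting a gap computed for a different structure. Space groups reported before and after relaxation can differ, so unmatched rows deserve a manual look rather than automatic removal.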

6. Data Cleaning — The Step Everyone Skips (Don't)

  • Remove duplicates: same formula can appear multiple times with different polymorphs. Keep the lowest-energy entry or all distinct space groups depending on your goal.
  • Handle missing values: not all entries have all properties (e.g. magnetic ordering is only computed if spin-polarised calculations were run). Use df.dropna(subset=['band_gap']) or impute carefully.
  • Filter by stability: restrict to energy_above_hull < 0.1 eV/atom for experimentally plausible compounds. Highly metastable entries (Ehull > 0.3 eV) distort regression models.
  • Check for DFT-PBE artefacts: PBE systematically underestimates band gaps. For transition-metal compounds, consider DFT+U values (available in MP) or flag all gaps < 0.05 eV as "effectively zero".
  • Validate units: MP gives band gaps in eV, volumes in ų. AFLOW's Egap is in eV but check the documentation version — units changed between API versions.
# Full cleaning pipeline for MxCy dataset
import pandas as pd
import numpy as np

def clean_dataset(df):
    # 1. Drop entries with missing essential properties
    df = df.dropna(subset=['band_gap', 'e_above_hull'])

    # 2. Keep near-stable entries only; copy so the column added in step 3
    #    is written to a new frame, not to a view of the input
    df = df[df['e_above_hull'] < 0.1].copy()

    # 3. Classify band gap into Metal / Semiconductor / Insulator
    df['class'] = pd.cut(df['band_gap'],
        bins=[-0.01, 0.05, 2.0, np.inf],
        labels=['Metal', 'Semiconductor', 'Insulator'])

    # 4. Remove duplicates — keep lowest E-hull per formula
    df = df.sort_values('e_above_hull')
    df = df.drop_duplicates(subset='formula', keep='first')

    # 5. Reset index
    return df.reset_index(drop=True)

df_clean = clean_dataset(df)
print(f"Clean dataset: {len(df_clean)} compounds")
print(df_clean['class'].value_counts())
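
The pipeline above keeps one entry per formula. When distinct polymorphs matter (the other option named in the duplicates bullet), a variant keeps the lowest-E-hull entry for each formula and space group pair instead. A standard-library sketch on plain dictionaries, with a hypothetical function name and made-up example rows:

```python
def dedupe_keep_polymorphs(rows):
    """Keep the lowest-E-hull row for each (formula, spacegroup) pair."""
    best = {}
    for row in rows:
        key = (row["formula"], row["spacegroup"])
        if key not in best or row["e_above_hull"] < best[key]["e_above_hull"]:
            best[key] = row
    return list(best.values())

# Made-up rows: two entries share formula and space group, one is a polymorph
rows = [
    {"formula": "FeS2", "spacegroup": 205, "e_above_hull": 0.02},
    {"formula": "FeS2", "spacegroup": 205, "e_above_hull": 0.00},
    {"formula": "FeS2", "spacegroup": 58, "e_above_hull": 0.01},
]
kept = dedupe_keep_polymorphs(rows)
```

With a DataFrame, the same effect comes from `df.sort_values('e_above_hull').drop_duplicates(subset=['formula', 'spacegroup'])`.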

7. Saving Your Dataset — Ready for ML

# Save to multiple formats
df_clean.to_csv("MxCy_dataset.csv", index=False) # portable
df_clean.to_parquet("MxCy_dataset.parquet") # fast + typed
df_clean.to_excel("MxCy_dataset.xlsx", index=False) # for review

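# Load and pass to sklearn pipeline
from sklearn.model_selection import train_test_split

df = pd.read_csv("MxCy_dataset.csv")

# 'class' is derived from band_gap, so band_gap must not be a feature
# (that would leak the label). Use only columns present in the file.
candidates = ['e_above_hull', 'mag_moment', 'volume']
features = [c for c in candidates if c in df.columns]
df = df.dropna(subset=features + ['class'])
X = df[features].values
y = df['class'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)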
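
A stratified split needs at least two samples in every class, otherwise `train_test_split` raises a `ValueError`. With a small chalcogenide dataset a rare class is quite possible, so it can help to check the label counts first. The helper name below is hypothetical and the labels in the example are illustrative.

```python
from collections import Counter

def classes_too_small(labels, min_count=2):
    """Return {label: count} for classes with fewer than `min_count` samples."""
    return {lab: n for lab, n in Counter(labels).items() if n < min_count}

# Illustrative labels: the single 'Semiconductor' would break stratify=y
labels = ["Metal", "Metal", "Semiconductor", "Insulator", "Insulator"]
print(classes_too_small(labels))  # {'Semiconductor': 1}
```

If the result is non-empty, either merge the rare class into a neighbour, fetch more data, or drop `stratify=y` and accept an unbalanced split.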
🗄️
App 14 — Materials Database Explorer
Search, filter, and visualise Materials Project data for transition-metal chalcogenides interactively — band gap vs E-hull scatter, class distribution, and export-ready tables — no API key needed (uses pre-fetched data).

Quick Check

1. You query the Materials Project for FeS and get band_gap = 0.0 eV. Does this definitively mean FeS is metallic?

  • A. Yes — the Materials Project is always correct
  • B. Not necessarily — DFT-PBE systematically underestimates band gaps; FeS could be a small-gap semiconductor that PBE incorrectly predicts as metallic
  • C. Yes, because DFT is an exact theory
  • D. No — band_gap = 0 always means the API returned a missing value

2. Why should you filter by energy_above_hull < 0.1 eV/atom before training an ML model?

  • A. To reduce the dataset size for faster training
  • B. Highly metastable compounds (large E-hull) may never be synthesisable and could introduce spurious patterns; focusing on near-stable entries improves physical relevance and generalisation
  • C. Because the Materials Project only stores entries with E-hull < 0.1
  • D. To ensure all band gaps are non-zero

3. You find that the same compound (MnS) appears with band_gap = 2.3 eV in MP and 0.8 eV in AFLOW. What is the most likely cause?

  • A. One database made a coding error
  • B. The two calculations use different Hubbard U settings (DFT+U): a U on the Mn d states opens the gap, while plain DFT-PBE underestimates it
  • C. The two databases use different crystal structures for MnS
  • D. Band gaps are not stored correctly in either database