- the chat convo: https://chatgpt.com/share/68d0a6be-f704-800d-ac58-540a42b6bc35
| Schema | Primary sources (stable) | What you can pull cleanly | Notes / gaps |
|---|---|---|---|
| filetype (ext, name, category, mime, magic, open_with, related) | IANA media types, mime-db (npm), freedesktop/shared-mime-info XML, PRONOM/DROID (signature files), libmagic database, Wikidata | ext↔MIME, canonical names, categories; magic signatures (PRONOM/libmagic); aliases; common apps (Wikidata) | FileInfo is useful but semi-structured; use it last for titles/"how to open" blurbs |
| mime | mime-db JSON (GitHub/npm), IANA registries | type/subtype, extensions, notes | mime-db already merges IANA + community; treat as ground truth for ext lists |
| magic_signature | PRONOM (DROID ZIP/XML), libmagic text db, community file-signature DB repos | hex patterns, offsets, description, associated extensions | PRONOM is very complete but uses bureaucratic IDs; libmagic is pragmatic for detection |
| container | FFmpeg docs/ffprobe -formats, Matroska & MP4 official docs, Wikipedia container pages | extensions, MIME, supported stream types | FFmpeg tells read/write/pipe support; pair with official specs for MIME |
| codec | ffmpeg -codecs / docs, Wikipedia codec pages, AOM/SVT/x264/x265 repos | names, kind (video/audio), common containers, profiles/levels, hw support | HW support best from vendor docs (NVIDIA/Intel/Apple); keep limited to "common" |
| software (openers/handlers) | Wikidata (SPARQL), app vendor pages, chocolatey/homebrew formulae | app name, platforms, homepage, supported extensions | Wikidata has many app→filetype relations |
| support (browser/OS) | Can I Use (for AVIF/WebP etc.), Apple/Android docs, MDN | "yes/partial/no" by OS/browser | Don’t over-promise; "partial" when decoder exists but UI missing |
| subtitle/archives/raw clusters | Wikipedia format pages, Matroska specs, vendor docs | format descriptions, typical containers | Good for cross-linking ("VTT in HLS") |
| manifests | HLS/DASH specs, MDN, player docs (Shaka/HLS.js) | tags/attributes and examples | More textual than tabular; scrape for examples, not truth tables |
- Bootstrap with machine-readable sources
  - mime-db → seed MIME ↔ extensions (single JSON).
  - shared-mime-info → parse XML for categories + magic patterns.
  - PRONOM/DROID → unzip signature files; extract hex patterns + offsets + PUIDs.
  - libmagic → parse `/usr/share/file/magic` text; secondary support.
  - ffprobe → programmatically list containers/codecs from your own FFmpeg build (`ffprobe -formats`, `-codecs`, `-protocols`; these listings are plain text, so `-of json` does not apply and the tables need parsing).
- Enrich with semi-structured sources
  - Wikidata → SPARQL queries for "software X supports extension Y", "format family", "developer", etc.
  - Wikipedia → per-page infobox scrape for missing descriptions/aliases (cache and manual review).
- Human-curate deltas
  - Where sources disagree, keep priority order: IANA/mime-db > shared-mime-info > PRONOM/libmagic > Wikidata > Wikipedia > FileInfo.
  - Open a "review" sheet for oddities (e.g., `.bin` with overlapping meanings).
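The priority order above can be applied mechanically when two sources report different values for the same field. A minimal sketch, assuming each candidate value is tagged with a source label; the label spellings and the `pick_by_priority` helper are hypothetical names for illustration, not part of any library:

```python
# Trust tiers mirroring the priority order in the text (lower = more trusted).
# Sources separated by "/" in the text share a tier. Labels are assumptions.
RANK = {
    "iana": 0, "mime-db": 0,
    "shared-mime-info": 1,
    "pronom": 2, "libmagic": 2,
    "wikidata": 3,
    "wikipedia": 4,
    "fileinfo": 5,
}


def pick_by_priority(candidates):
    """Return (value, needs_review) for a list of (source, value) pairs.

    The value from the most trusted source wins. needs_review is True when
    any other source disagrees, so the item can go on the review sheet.
    """
    if not candidates:
        return None, False
    # Unknown sources sort last; sorted() is stable, so ties keep input order.
    ranked = sorted(candidates, key=lambda c: RANK.get(c[0], len(RANK)))
    winner = ranked[0][1]
    disagreement = any(value != winner for _, value in ranked)
    return winner, disagreement


value, review = pick_by_priority([
    ("wikidata", "image/heif"),
    ("mime-db", "image/heic"),
])
print(value, review)
```

Returning the disagreement flag alongside the winner keeps the automatic choice and the human review queue in one pass.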
- Normalization rules
  - `ext`: lowercase, no dot.
  - `mime`: unique, lowercase; prefer mime-db entries.
  - `category`: map from shared-mime-info "generic-icons" (image, video, audio, text, app → collapse to your enum).
  - `magic`: store as normalized hex with wildcards; keep source in `notes`.
  - `open_with`: cap to 3–5 popular, per OS; source = Wikidata/vendor.
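The first few normalization rules are simple enough to express as small pure functions. A minimal sketch; the function names are hypothetical, and treating `??` as a single-byte wildcard is an assumed convention, not something the sources prescribe:

```python
def normalize_ext(raw):
    """Lowercase an extension and drop surrounding whitespace and leading dots."""
    return raw.strip().lower().lstrip(".")


def normalize_mimes(values):
    """Lowercase MIME strings and de-duplicate them, keeping first-seen order."""
    seen, out = set(), []
    for v in values:
        m = v.strip().lower()
        if m and m not in seen:
            seen.add(m)
            out.append(m)
    return out


def normalize_magic(raw):
    """Normalize a hex pattern to upper-case byte pairs separated by spaces.

    '??' stands for one wildcard byte (assumed convention for this sketch).
    """
    compact = "".join(raw.split()).upper()
    if len(compact) % 2:
        raise ValueError(f"odd-length hex pattern: {raw!r}")
    pairs = [compact[i:i + 2] for i in range(0, len(compact), 2)]
    for p in pairs:
        if p != "??" and any(ch not in "0123456789ABCDEF" for ch in p):
            raise ValueError(f"invalid byte {p!r} in {raw!r}")
    return " ".join(pairs)


print(normalize_ext(" .HEIC "))
print(normalize_mimes(["Image/HEIC", "image/heic", "image/heif"]))
print(normalize_magic("ffd8 ff??"))
```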
- Versioning
  - Store raw snapshots in `/sources/...` with dates.
  - Build a deterministic pipeline (same inputs → same JSONs). Emit a `build_id` with timestamps + git SHA.
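One way to make a `build_id` that carries the timestamp and git SHA while still letting you check that two builds used the same inputs is to append a digest of the raw snapshots. A minimal sketch; `compute_build_id` and the id layout are hypothetical, and the content digest is an addition beyond the "timestamps + git SHA" described above:

```python
import hashlib
from datetime import datetime, timezone


def compute_build_id(snapshot_bytes, git_sha, built_at):
    """Derive a build id from raw inputs, the git SHA, and a timestamp.

    snapshot_bytes maps a snapshot name to its raw bytes. The digest part
    depends only on the inputs and the code revision, so identical inputs
    give an identical digest; the timestamp is a human-readable prefix.
    """
    h = hashlib.sha256()
    for name in sorted(snapshot_bytes):  # sorted so dict order cannot matter
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(hashlib.sha256(snapshot_bytes[name]).digest())
    h.update(git_sha.encode("ascii"))
    stamp = built_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{git_sha[:7]}-{h.hexdigest()[:12]}"


# Illustrative inputs only; real snapshots would be read from /sources/...
build_id = compute_build_id(
    {"mime-db/db.json": b"{}", "shared-mime-info/freedesktop.org.xml": b"<mime-info/>"},
    git_sha="0123456789abcdef0123456789abcdef01234567",
    built_at=datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(build_id)
```

Comparing only the trailing digest of two ids tells you whether a rebuild should have produced byte-identical JSON.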
- mime-db (JS/JSON): gives you `mime.full` and `extensions[]`. Map into `mime.schema.json` and use a reverse index to tee up `filetype.mime`.
- shared-mime-info (XML db):
  - Fields: `<mime-type type="image/heic">`, `<glob pattern="*.heic"/>`, `<magic>` with `<match value="..." offset="...">`.
  - Use this to fill `filetype.magic[]`, `category`, and alternate extensions.
- PRONOM/DROID (signature files):
  - XML with byte sequences (`Pos`, `ByteSequenceValue`) and PUIDs.
  - Perfect for `magic_signature.schema.json`; map PUID → `id`, include extensions.
- ffprobe (your build):
  - `ffprobe -hide_banner -formats` → mux/demux flags for containers.
  - `ffprobe -hide_banner -codecs` → codec names + decoders/encoders.
  - `ffprobe -hide_banner -protocols` → protocols list.
  - These listing options print plain-text tables (`-of json` only affects `-show_*` output), so parse the text.
  - Populate `container`, `codec`, and your taxonomy buckets 10–11–14.
- Wikidata SPARQL (JSON results, no scraping):
  - Query: apps that open a given extension; or file formats with filename extension ".heic".
  - Populates `software.handles_extensions[]` and `filetype.open_with[]`.
- I/O: always save raw → `/sources/{provider}/{date}/...` (don’t parse in-place).
- Parser: pure functions from raw to normalized records matching your schemas.
- Joiners: merge by normalized keys (`ext`, `mime.full`) with priority rules.
- Emitted: one JSON per entity type (`/build/filetypes.json`, `mimes.json`, ...) and (optionally) one JSON-per-item for static page generation.
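The save-raw, parse-purely, emit-deterministically split can be sketched in a few functions. This is an illustration under assumptions: the helper names (`snapshot_path`, `parse_mime_db`, `emit_json`) and the record shape are hypothetical, and a temporary directory stands in for `/sources` and `/build`:

```python
import json
import tempfile
from pathlib import Path


def snapshot_path(root, provider, date, filename):
    """Build the raw-snapshot location: <root>/<provider>/<date>/<filename>."""
    return Path(root) / provider / date / filename


def parse_mime_db(raw_text):
    """Pure parser: raw mime-db JSON text -> sorted list of normalized records."""
    db = json.loads(raw_text)
    return [
        {"full": full.lower(),
         "extensions": sorted(e.lower() for e in meta.get("extensions", []))}
        for full, meta in sorted(db.items())
    ]


def emit_json(records, out_path):
    """Write records with stable key order so identical inputs give identical bytes."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(records, sort_keys=True, indent=2, ensure_ascii=False) + "\n"
    out_path.write_text(text, encoding="utf-8")
    return text


with tempfile.TemporaryDirectory() as tmp:
    # 1. Save the raw payload untouched (a tiny stand-in for the real download).
    raw = snapshot_path(tmp, "mime-db", "2025-01-02", "db.json")
    raw.parent.mkdir(parents=True)
    raw.write_text('{"image/png": {"extensions": ["PNG"]}}', encoding="utf-8")
    # 2. Parse from the snapshot, never from the live response.
    records = parse_mime_db(raw.read_text(encoding="utf-8"))
    # 3. Emit one JSON per entity type.
    emitted = emit_json(records, Path(tmp) / "build" / "mimes.json")

print(emitted)
```

Because the parser takes text and returns plain records, it can be re-run against any dated snapshot without touching the network.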
mime-db → mime + filetype seeds

```python
import requests

# Fetch the merged mime-db JSON (keys are full MIME types).
URL = "https://raw.githubusercontent.com/jshttp/mime-db/master/db.json"
resp = requests.get(URL, timeout=30)
resp.raise_for_status()
db = resp.json()

mimes, ext_to_mimes = [], {}
for full, meta in db.items():
    exts = meta.get("extensions", [])
    mime_type, _, subtype = full.partition("/")
    mimes.append({"type": mime_type, "subtype": subtype, "full": full, "extensions": exts})
    for e in exts:
        ext_to_mimes.setdefault(e.lower(), set()).add(full)

# Seed filetypes from ext_to_mimes (one record per extension).
filetypes = [
    {"id": ext, "ext": ext, "name": f".{ext.upper()} file", "category": "other", "mime": sorted(m)}
    for ext, m in ext_to_mimes.items()
]
```
ffprobe → containers/codecs

```shell
# These listing options print plain-text tables; -of json does not apply to them.
ffprobe -hide_banner -formats > formats.txt
ffprobe -hide_banner -codecs > codecs.txt
ffprobe -hide_banner -protocols > protocols.txt
```
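In typical FFmpeg builds the `-formats` listing is a fixed-width text table rather than JSON, so turning it into container records takes a small parser. A minimal sketch, assuming the layout of a legend, a separator line of dashes, then one row per format with a flags column (`D` = demux, `E` = mux), a comma-separated name list, and a description; `parse_formats_table` and `list_formats` are hypothetical helpers, and the sample text is hand-written for illustration:

```python
import subprocess


def parse_formats_table(text):
    """Parse `ffprobe -formats` text output into a list of container records."""
    records, in_body = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # The table body starts after a separator line made only of dashes.
            in_body = bool(stripped) and set(stripped) == {"-"}
            continue
        parts = stripped.split(None, 2)
        if len(parts) < 2:
            continue  # blank or malformed row
        flags, names = parts[0], parts[1]
        records.append({
            "names": names.split(","),
            "demux": "D" in flags,
            "mux": "E" in flags,
            "description": parts[2] if len(parts) > 2 else "",
        })
    return records


def list_formats():
    """Run ffprobe from PATH and parse its table (not called in this sketch)."""
    out = subprocess.run(["ffprobe", "-hide_banner", "-formats"],
                         capture_output=True, text=True, check=True).stdout
    return parse_formats_table(out)


SAMPLE = """File formats:
 D. = Demuxing supported
 .E = Muxing supported
 --
 D  3dostr          3DO STR
  E 3g2             3GP2 (3GPP2 file format)
 DE matroska        Matroska
 D  matroska,webm   Matroska / WebM
"""

for rec in parse_formats_table(SAMPLE):
    print(rec)
```

The exact legend and separator width vary between FFmpeg versions, which is why the parser keys on "a line of only dashes" instead of a fixed header.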
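shared-mime-info XML → magic

```python
from lxml import etree

# shared-mime-info elements live in this XML namespace; without it findall() matches nothing.
NS = {"m": "http://www.freedesktop.org/standards/shared-mime-info"}

root = etree.parse("freedesktop.org.xml")
sigs = []
for mt in root.findall(".//m:mime-type", NS):
    t = mt.get("type")
    # Keep plain "*.ext" globs only and drop the "*." prefix.
    exts = [g.get("pattern")[2:] for g in mt.findall("m:glob", NS)
            if g.get("pattern", "").startswith("*.")]
    for m in mt.findall(".//m:magic//m:match", NS):
        value = m.get("value", "")
        sigs.append({
            "id": f"{t}-{value[:8]}",
            # Replace "\x" escapes before upper-casing. Values whose type is
            # "string" are literal text and still need converting to hex.
            "hex": value.replace("\\x", " ").strip().upper(),
            "value_type": m.get("type"),
            # offset is either "N" or a "start:end" range; keep the start.
            "offset": int(m.get("offset", "0").split(":")[0]),
            "meaning": t,
            "extensions": exts,
        })
```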
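Once signatures are stored as normalized hex with an offset, checking a file against one is a short comparison. A minimal sketch, assuming patterns are space-separated byte pairs with `??` as a single-byte wildcard (an assumed storage convention) and a single fixed offset; `parse_pattern` and `matches` are hypothetical helper names:

```python
def parse_pattern(hex_pattern):
    """Turn 'FF D8 ?? E0' into a list of ints, with None for wildcard bytes."""
    return [None if tok == "??" else int(tok, 16) for tok in hex_pattern.split()]


def matches(data, hex_pattern, offset=0):
    """Return True if data matches the pattern starting at the given byte offset."""
    pattern = parse_pattern(hex_pattern)
    window = data[offset:offset + len(pattern)]
    if len(window) < len(pattern):
        return False  # file too short to contain the signature
    return all(p is None or p == b for p, b in zip(pattern, window))


png_head = b"\x89PNG\r\n\x1a\n"
jpeg_head = b"\xff\xd8\xff\xe0\x00\x10JFIF"
heic_head = b"\x00\x00\x00\x18ftypheic"

print(matches(png_head, "89 50 4E 47"))             # PNG signature at offset 0
print(matches(jpeg_head, "FF D8 FF ??"))            # wildcard in the fourth byte
print(matches(heic_head, "66 74 79 70", offset=4))  # "ftyp" box type at offset 4
print(matches(png_head, "FF D8 FF"))
```

Offset ranges, as some signature sources allow, would need a loop over candidate offsets; this sketch deliberately handles only the fixed-offset case.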
Wikidata SPARQL (apps that open HEIC)

```sparql
SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "heic".   # filename extension
  ?app wdt:P1072 ?fmt.     # software supports file format
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```
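A query like this can be sent to the public Wikidata Query Service and the JSON result flattened into app records. A minimal sketch using only the standard library; `build_request`, `parse_apps`, and `fetch_apps` are hypothetical helpers, the User-Agent string is a placeholder you should replace with your own contact details, and the sample result is hand-written in the SPARQL JSON results shape rather than real data:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_TEMPLATE = """
SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "%s".
  ?app wdt:P1072 ?fmt.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request(ext, user_agent):
    """Build a GET request for apps that read the given extension."""
    if not ext.isalnum():
        raise ValueError(f"unexpected characters in extension: {ext!r}")
    query = QUERY_TEMPLATE % ext.lower()
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def parse_apps(result):
    """Extract (entity URI, label) pairs from a SPARQL JSON result document."""
    return [
        (row["app"]["value"], row.get("appLabel", {}).get("value", ""))
        for row in result["results"]["bindings"]
    ]


def fetch_apps(ext, user_agent):
    """Run the query over the network (not called in this sketch)."""
    with urllib.request.urlopen(build_request(ext, user_agent), timeout=30) as resp:
        return parse_apps(json.load(resp))


# Hand-written sample in the SPARQL JSON results layout; the entity is made up.
SAMPLE_RESULT = {
    "head": {"vars": ["app", "appLabel"]},
    "results": {"bindings": [
        {"app": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0"},
         "appLabel": {"type": "literal", "value": "Example Viewer"}},
    ]},
}
print(parse_apps(SAMPLE_RESULT))
```

Keeping `parse_apps` separate from the network call means saved raw responses can be re-parsed later without hitting the endpoint again.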
- `filetype.ext/name/mime/category/magic/open_with/related` → mime-db + shared-mime-info + PRONOM + Wikidata
- `codec.kind/profiles/common_containers` → ffprobe + Wikipedia
- `container.extensions/mime/streams_supported` → ffprobe formats + specs
- `magic_signature.hex/offset/meaning` → PRONOM/libmagic/shared-mime-info
- `software.platforms/handles_extensions` → Wikidata + vendor docs
- `support.browsers/oses` → Can I Use/MDN (only for a handful like AVIF/WebP/HEVC)
- Licensing: PRONOM is free to use with attribution; mime-db is MIT; shared-mime-info is GPL-2.0-or-later; Wikipedia is CC BY-SA and Wikidata is CC0. Attribute where required.
- Rate limits: cache requests and back off on errors; for Wikipedia/Wikidata use the official APIs, not HTML.
- Consistency: extensions map many-to-many to MIME types; your UI must handle multiple MIME types per extension.
- Ambiguity: generic extensions (`.bin`, `.dat`) → keep but mark `category: other`, and do not auto-suggest risky openers.
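The last two points can be made concrete with a reverse index and a guard for generic extensions. A minimal sketch; `GENERIC_EXTS`, `invert`, and `describe` are hypothetical names, the generic list is only a starter set taken from the examples above, and the app names in the sample data are placeholders:

```python
from collections import defaultdict

# Assumed starter list; extend it from the review sheet.
GENERIC_EXTS = {"bin", "dat"}


def invert(ext_to_mimes):
    """Build the MIME -> extensions index from the extension -> MIME index."""
    mime_to_exts = defaultdict(set)
    for ext, mimes in ext_to_mimes.items():
        for mime in mimes:
            mime_to_exts[mime].add(ext)
    return {mime: sorted(exts) for mime, exts in mime_to_exts.items()}


def describe(ext, ext_to_mimes, openers_by_ext, max_openers=5):
    """Return a display record: all MIME types, and openers unless the ext is generic."""
    generic = ext in GENERIC_EXTS
    return {
        "ext": ext,
        "mime": sorted(ext_to_mimes.get(ext, [])),
        "category": "other" if generic else None,  # None = decided elsewhere
        "open_with": [] if generic else openers_by_ext.get(ext, [])[:max_openers],
    }


ext_to_mimes = {
    "heic": {"image/heic"},
    "heif": {"image/heif", "image/heic"},
    "bin": {"application/octet-stream"},
}
openers = {"heic": ["Viewer A", "Viewer B"], "bin": ["Some Tool"]}

print(invert(ext_to_mimes))
print(describe("heif", ext_to_mimes, openers))
print(describe("bin", ext_to_mimes, openers))
```

Keeping both indexes as sets until the final sort avoids accidental duplicates when several sources contribute the same pairing.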