devinschumacher/ecosystem

A. Source map (what to scrape for each silo)

| Schema | Primary sources (stable) | What you can pull cleanly | Notes / gaps |
|---|---|---|---|
| filetype (ext, name, category, mime, magic, open_with, related) | IANA media types, mime-db (npm), freedesktop/shared-mime-info XML, PRONOM/DROID (signature files), libmagic database, Wikidata | ext↔MIME, canonical names, categories; magic signatures (PRONOM/libmagic); aliases; common apps (Wikidata) | FileInfo is useful but semi-structured; use it last for titles/"how to open" blurbs |
| mime | mime-db JSON (GitHub/npm), IANA registries | type/subtype, extensions, notes | mime-db already merges IANA + community; treat as ground truth for ext lists |
| magic_signature | PRONOM (DROID ZIP/XML), libmagic text db, community file-signature DB repos | hex patterns, offsets, description, associated extensions | PRONOM is very complete but uses bureaucratic IDs; libmagic is pragmatic for detection |
| container | FFmpeg docs/ffprobe -formats, Matroska & MP4 official docs, Wikipedia container pages | extensions, MIME, supported stream types | FFmpeg tells read/write/pipe support; pair with official specs for MIME |
| codec | ffmpeg -codecs / docs, Wikipedia codec pages, AOM/SVT/x264/x265 repos | names, kind (video/audio), common containers, profiles/levels, hw support | HW support best from vendor docs (NVIDIA/Intel/Apple); keep limited to "common" |
| software (openers/handlers) | Wikidata (SPARQL), app vendor pages, chocolatey/homebrew formulae | app name, platforms, homepage, supported extensions | Wikidata has many app→filetype relations |
| support (browser/OS) | Can I Use (for AVIF/WebP etc.), Apple/Android docs, MDN | "yes/partial/no" by OS/browser | Don’t over-promise; "partial" when decoder exists but UI missing |
| subtitle/archives/raw clusters | Wikipedia format pages, Matroska specs, vendor docs | format descriptions, typical containers | Good for cross-linking ("VTT in HLS") |
| manifests | HLS/DASH specs, MDN, player docs (Shaka/HLS.js) | tags/attributes and examples | More textual than tabular; scrape for examples, not truth tables |

B. First-pass ETL plan (fast + reproducible)

  1. Bootstrap with machine-readable sources

    • mime-db → seed MIME ↔ extensions (single JSON).
    • shared-mime-info → parse XML for categories + magic patterns.
    • PRONOM/DROID → unzip signature files; extract hex patterns + offsets + PUIDs.
    • libmagic → parse /usr/share/file/magic text; secondary support.
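    • ffprobe → programmatically list containers/codecs/protocols from your own FFmpeg build (ffprobe -hide_banner -formats, -codecs, -protocols). These listing flags print plain-text tables; -of json only affects -show_* probe output, so parse the text.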
  2. Enrich with semi-structured sources

    • Wikidata → SPARQL queries for "software X supports extension Y", "format family", "developer", etc.
    • Wikipedia → per-page infobox scrape for missing descriptions/aliases (cache and manual review).
  3. Human-curate deltas

    • Where sources disagree, keep priority order: IANA/mime-db > shared-mime-info > PRONOM/libmagic > Wikidata > Wikipedia > FileInfo.
    • Open a "review" sheet for oddities (e.g., .bin overlapping meanings).
  4. Normalization rules

    • ext lowercase, no dot.
    • mime unique, lowercase; prefer mime-db entries.
    • category map from shared-mime-info <generic-icon> names (image, video, audio, text, app → collapse to your enum).
    • magic store as normalized hex with wildcards; keep source in notes.
    • open_with: cap to 3–5 popular, per OS; source = Wikidata/vendor.
  5. Versioning

    • Store raw snapshots in /sources/... with dates.
    • Build a deterministic pipeline (same inputs → same JSONs). Emit a build_id with timestamps + git SHA.
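
The normalization rules in step 4 can be sketched as a few pure functions. This is a minimal illustration, not the project's actual code: the function names, the icon-to-category mapping, and the default cap of 5 are assumptions made for the example. The icon names themselves (image-x-generic and so on) are standard freedesktop generic icon names.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from shared-mime-info generic icon names to a local enum.
GENERIC_ICON_TO_CATEGORY = {
    "image-x-generic": "image",
    "video-x-generic": "video",
    "audio-x-generic": "audio",
    "text-x-generic": "text",
}


def normalize_ext(raw: str) -> str:
    """Lowercase the extension and drop surrounding whitespace and leading dots."""
    return raw.strip().lstrip(".").lower()


def normalize_mime(raw: str) -> str:
    """Lowercase a MIME type and drop parameters such as '; charset=utf-8'."""
    return raw.split(";", 1)[0].strip().lower()


def category_for(generic_icon: Optional[str]) -> str:
    """Collapse a generic icon name to the local enum, defaulting to 'other'."""
    return GENERIC_ICON_TO_CATEGORY.get(generic_icon or "", "other")


def cap_open_with(apps_by_os: Dict[str, List[str]], limit: int = 5) -> Dict[str, List[str]]:
    """Keep at most `limit` apps per OS; assumes each list is already ordered by popularity."""
    return {os_name: apps[:limit] for os_name, apps in apps_by_os.items()}
```

Keeping these as pure functions means the same raw snapshot always normalizes to the same records, which is what the deterministic-build rule in step 5 relies on.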

C. Concrete pulls you can implement immediately

  • mime-db (JS/JSON): gives you mime.full and extensions[]. Map into mime.schema.json and use reverse index to tee up filetype.mime.

  • shared-mime-info (XML db):

    • Fields: <mime-type type="image/heic">, <glob pattern="*.heic"/>, <magic> with <match value="..." offset="...">.
    • Use this to fill filetype.magic[], category, and alternate extensions.
  • PRONOM/DROID (signature files):

    • XML with byte sequences (Pos, ByteSequenceValue) and PUIDs.
    • Perfect for magic_signature.schema.json; map PUID → id, include extensions.
  • ffprobe (your build):

    • ffprobe -hide_banner -formats → mux/demux flags for containers.
    • ffprobe -hide_banner -codecs → codec names + decoder/encoder flags.
    • ffprobe -hide_banner -protocols → protocols list.
    • These listing flags print plain-text tables (-of json only affects -show_* probe output), so parse the text. Populate container, codec, and your taxonomy buckets 10–11–14.
  • Wikidata SPARQL (JSON results, no scraping):

    • Query: apps that open a given extension; or file formats with filename extension ".heic".
    • Populates software.handles_extensions[] and filetype.open_with[].

D. Minimal extractor specs (so your scrapers are small)

  • I/O: always save raw → /sources/{provider}/{date}/... (don’t parse in-place).
  • Parser: pure functions from raw to normalized records matching your schemas.
  • Joiners: merge by normalized keys (ext, mime.full) with priority rules.
  • Emitted: one JSON per entity type (/build/filetypes.json, mimes.json, ...) and (optionally) one JSON-per-item for static page generation.
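
The joiner rule above (merge by normalized key, with source priority) might look like the following sketch. The record shape, the `source` field, and the rank table are assumptions for illustration; the ranks simply encode the priority order stated in section B, with IANA/mime-db and PRONOM/libmagic tied.

```python
from typing import Any, Dict, List

# Lower rank wins. Mirrors: IANA/mime-db > shared-mime-info > PRONOM/libmagic
# > Wikidata > Wikipedia > FileInfo.
RANK = {
    "iana": 0, "mime-db": 0,
    "shared-mime-info": 1,
    "pronom": 2, "libmagic": 2,
    "wikidata": 3,
    "wikipedia": 4,
    "fileinfo": 5,
}


def merge_records(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Merge records that share a normalized 'ext' key.

    Scalar fields keep the value from the highest-priority source that has one;
    list fields are unioned in priority order without duplicates.
    """
    merged: Dict[str, Dict[str, Any]] = {}
    # sorted() is stable, so sources with equal rank keep their input order.
    for rec in sorted(records, key=lambda r: RANK[r["source"]]):
        target = merged.setdefault(rec["ext"], {"ext": rec["ext"]})
        for field, value in rec.items():
            if field in ("ext", "source") or value is None or value == "" or value == []:
                continue
            if isinstance(value, list):
                existing = target.setdefault(field, [])
                for item in value:
                    if item not in existing:
                        existing.append(item)
            else:
                # First writer wins, and the first writer is the best-ranked source.
                target.setdefault(field, value)
    return merged
```

For example, a `name` supplied by both mime-db and Wikipedia resolves to the mime-db value, while their `mime` lists are combined.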

E. Example: tiny pipelines (pseudo-Python)

mime-db → mime + filetype seeds

import requests

# Fetch the merged MIME database; fail loudly on HTTP errors.
resp = requests.get("https://raw.githubusercontent.com/jshttp/mime-db/master/db.json", timeout=30)
resp.raise_for_status()
db = resp.json()

mimes, ext_to_mimes = [], {}
for full, meta in db.items():
    exts = [e.lower() for e in meta.get("extensions", [])]
    mime_type, _, subtype = full.partition("/")
    mimes.append({"type": mime_type, "subtype": subtype, "full": full, "extensions": exts})
    for e in exts:
        ext_to_mimes.setdefault(e, set()).add(full)

# Seed filetypes from the reverse index; "other" is a placeholder category until enrichment.
filetypes = [
    {"id": ext, "ext": ext, "name": f".{ext.upper()} file", "category": "other", "mime": sorted(m)}
    for ext, m in sorted(ext_to_mimes.items())
]

ffprobe → containers/codecs

# The listing flags print plain-text tables, so save them as text and parse afterwards.
ffprobe -hide_banner -formats > formats.txt
ffprobe -hide_banner -codecs > codecs.txt
ffprobe -hide_banner -protocols > protocols.txt
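
As far as I can tell, ffprobe's -formats listing is a plain-text table rather than JSON, so it needs a small parser. The sketch below assumes the layout printed by recent FFmpeg builds: a legend, a separator line made only of dashes, then one row per format with flag columns (D for demux, E for mux) ahead of the name. The function names are hypothetical, and the flag width is read from the separator so that both two-column and three-column flag layouts should work. Verify against the output of your own build.

```python
import subprocess
from typing import Dict, List


def parse_formats(text: str) -> List[Dict[str, object]]:
    """Parse the plain-text table printed by `ffprobe -formats`."""
    rows: List[Dict[str, object]] = []
    flag_width = None
    for line in text.splitlines():
        stripped = line.strip()
        if flag_width is None:
            # The legend ends with a separator made only of dashes ("--" or "---").
            if stripped and set(stripped) == {"-"}:
                flag_width = len(stripped)
            continue
        if not stripped:
            continue
        # Rows start with one space, then the flag columns, then the name.
        flags = line[1:1 + flag_width]
        parts = line[1 + flag_width:].split(None, 1)
        if not parts:
            continue
        rows.append({
            "names": parts[0].split(","),  # e.g. "mov,mp4,m4a" lists several aliases
            "description": parts[1].strip() if len(parts) > 1 else "",
            "demux": "D" in flags,
            "mux": "E" in flags,
        })
    return rows


def list_formats(binary: str = "ffprobe") -> List[Dict[str, object]]:
    """Run the local ffprobe build and parse its format listing."""
    result = subprocess.run(
        [binary, "-hide_banner", "-formats"],
        capture_output=True, text=True, check=True,
    )
    return parse_formats(result.stdout)
```

Splitting `parse_formats` from `list_formats` keeps the parser a pure function, so it can be run against a saved formats.txt snapshot as well as live output.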

shared-mime-info XML → magic

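from lxml import etree

# freedesktop.org.xml declares a default namespace, so element lookups must be qualified.
NS = {"smi": "http://www.freedesktop.org/standards/shared-mime-info"}
root = etree.parse("freedesktop.org.xml")
sigs = []
for mt in root.findall(".//smi:mime-type", NS):
    t = mt.get("type")
    # Keep only simple "*.ext" globs; other patterns (e.g. "Makefile") are not extensions.
    exts = [g.get("pattern")[2:].lower() for g in mt.findall(".//smi:glob", NS)
            if g.get("pattern", "").startswith("*.")]
    for m in mt.findall(".//smi:magic//smi:match", NS):
        value = m.get("value", "")
        sigs.append({
            "id": f"{t}-{value[:8]}",
            "value": value,                  # raw pattern; convert to normalized hex in a later step
            "value_type": m.get("type"),     # e.g. string, byte, big32
            "offset": m.get("offset", "0"),  # may be a range such as "0:64", so keep it as text
            "meaning": t,
            "extensions": exts,
        })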
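
To store magic as normalized hex (section B, step 4), the `value` attribute of a string-typed `<match>` element has to be decoded first. The sketch below assumes those values use C-style escapes (hex like \x89, octal like \211, and \n, \r, \t), which is how I read the shared-mime-info format. The function name is hypothetical, and `mask` attributes (the wildcard case) are deliberately not handled.

```python
import string

# Single-character escapes handled by this sketch.
_SIMPLE_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "\\": 0x5C}


def magic_string_to_hex(value: str) -> str:
    """Decode C-style escapes in a magic string and return space-separated hex bytes."""
    out = bytearray()
    i = 0
    while i < len(value):
        ch = value[i]
        nxt = value[i + 1] if i + 1 < len(value) else ""
        if ch != "\\" or not nxt:
            # Literal character (or a trailing lone backslash).
            out.extend(ch.encode("utf-8"))
            i += 1
        elif nxt == "x":
            # Hex escape: one or two hex digits after "\x".
            j = i + 2
            while j < len(value) and j < i + 4 and value[j] in string.hexdigits:
                j += 1
            if j == i + 2:
                raise ValueError(f"malformed \\x escape at index {i}")
            out.append(int(value[i + 2:j], 16))
            i = j
        elif nxt in "01234567":
            # Octal escape: one to three octal digits.
            j = i + 1
            while j < len(value) and j < i + 4 and value[j] in "01234567":
                j += 1
            out.append(int(value[i + 1:j], 8) & 0xFF)
            i = j
        elif nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])
            i += 2
        else:
            # Unknown escape: keep the escaped character itself.
            out.extend(nxt.encode("utf-8"))
            i += 2
    return " ".join(f"{b:02X}" for b in out)
```

Numeric match types (byte, big32, and similar) would need their own conversion that respects width and endianness, so this helper should only be applied when the match type is string.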

Wikidata SPARQL (apps that open HEIC)

SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "heic".   # P1195: file extension
  ?app wdt:P1072 ?fmt.     # P1072: readable file format (software that can open it)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
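
A query like the one above can be sent to the public Wikidata endpoint and the JSON bindings flattened into records. The following is a sketch under assumptions: the helper names are hypothetical, the User-Agent string is a placeholder you should replace with your own contact details, and the extension is restricted to lowercase alphanumerics so it can be interpolated into the query safely.

```python
import json
import re
import urllib.parse
import urllib.request
from typing import Any, Dict, List

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_TEMPLATE = """
SELECT ?app ?appLabel WHERE {
  ?fmt wdt:P1195 "%s".
  ?app wdt:P1072 ?fmt.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_query(ext: str) -> str:
    """Build the SPARQL query for one extension, rejecting unsafe input."""
    ext = ext.strip().lstrip(".").lower()
    if not re.fullmatch(r"[a-z0-9]+", ext):
        raise ValueError(f"unsupported extension: {ext!r}")
    return QUERY_TEMPLATE % ext


def parse_bindings(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into {uri, label} records."""
    rows = payload.get("results", {}).get("bindings", [])
    return [
        {"uri": row["app"]["value"], "label": row.get("appLabel", {}).get("value", "")}
        for row in rows
    ]


def fetch_apps(ext: str, user_agent: str = "ecosystem-etl/0.1 (placeholder contact)") -> List[Dict[str, str]]:
    """Run the query against the live endpoint (network access required)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": build_query(ext), "format": "json"})
    request = urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_bindings(json.load(response))
```

The records map directly onto software.handles_extensions[] and filetype.open_with[]; saving the raw JSON response under /sources/ before parsing keeps the run reproducible.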

F. Filling your schemas (coverage matrix)

  • filetype.ext/name/mime/category/magic/open_with/related → mime-db + shared-mime-info + PRONOM + Wikidata
  • codec.kind/profiles/common_containers → ffprobe + Wikipedia
  • container.extensions/mime/streams_supported → ffprobe formats + specs
  • magic_signature.hex/offset/meaning → PRONOM/libmagic/shared-mime-info
  • software.platforms/handles_extensions → Wikidata + vendor docs
  • support.browsers/oses → Can I Use/MDN (only for a handful like AVIF/WebP/HEVC)

G. Practical cautions

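  • Licensing: PRONOM is free to use with attribution; mime-db is MIT; shared-mime-info is distributed under the GPL (version 2 or later), so confirm against the COPYING file in your snapshot; Wikipedia text is CC BY-SA and Wikidata is CC0. Attribute where required.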
  • Rate limits: cache requests, backoff; for Wikipedia/Wikidata use official APIs, not HTML.
  • Consistency: extensions are many-to-many to MIME; your UI must handle multiple MIME per ext.
  • Ambiguity: generic extensions (.bin, .dat) → keep but mark category: other, and do not auto-suggest risky openers.
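
The rate-limit advice (cache requests, back off on failure) can be sketched as a small wrapper around whatever download function you use. Everything here is illustrative: the function name, the retry count, and the delays are assumptions, and the `fetch` and `sleep` parameters are injected so the logic can be exercised without network access.

```python
import time
from pathlib import Path
from typing import Callable


def fetch_cached(
    url: str,
    cache_path: Path,
    fetch: Callable[[str], bytes],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Return cached bytes if present; otherwise download with exponential backoff."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    if cache_path.exists():
        return cache_path.read_bytes()
    for attempt in range(retries):
        try:
            data = fetch(url)
        except OSError:
            # urllib's URLError and socket timeouts are OSError subclasses.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            # Save the raw response before any parsing happens.
            cache_path.parent.mkdir(parents=True, exist_ok=True)
            cache_path.write_bytes(data)
            return data
    raise RuntimeError("unreachable: the loop either returns or raises")
```

Pointing `cache_path` at /sources/{provider}/{date}/... gives the raw-snapshot layout from section D for free, and repeated runs on the same day never hit the remote service twice.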
