
Data and code associated with Green Building Blocks research


complexly/gbb_code


Patent Green-Technology Matching & Clustering

This repo provides notebooks to (i) identify green/CCMT patent families via CPC codes (Y02/Y04), (ii) match them to non-green patents using text embeddings and approximate nearest-neighbor (ANN) search with HNSW, (iii) cluster and name green technology groups (HDBSCAN/UMAP), (iv) aggregate the results to firms, cities, and countries, and (v) analyze the data for the paper "Green building blocks reveal the complex anatomy of climate change mitigation technologies".
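As a rough illustration of steps (i) and (ii), the sketch below flags a family as green when any of its CPC codes starts with Y02 or Y04, then pairs each green family with its most similar non-green family by cosine similarity. It is not the notebooks' code: the brute-force search stands in for the HNSW index, and the family IDs, CPC lists, three-dimensional vectors, and function names are all hypothetical.

```python
import math

GREEN_PREFIXES = ("Y02", "Y04")  # CPC prefixes named in this README


def is_green(cpc_codes):
    """Return True if any CPC code of the family starts with a green prefix."""
    return any(code.strip().upper().startswith(GREEN_PREFIXES) for code in cpc_codes)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_green_to_non_green(families):
    """Map each green family ID to its most similar non-green family ID.

    Brute-force search over all non-green families; an ANN index such as
    HNSW would replace this loop on the full dataset.
    """
    green = {fid: f for fid, f in families.items() if is_green(f["cpc"])}
    other = {fid: f for fid, f in families.items() if fid not in green}
    return {
        gid: max(other, key=lambda oid: cosine(g["vec"], other[oid]["vec"]))
        for gid, g in green.items()
    }


# Hypothetical toy data: real embeddings have hundreds of dimensions.
families = {
    "F1": {"cpc": ["Y02E 10/50", "H01L 31/04"], "vec": [0.9, 0.1, 0.0]},
    "F2": {"cpc": ["H01L 31/18"], "vec": [0.8, 0.2, 0.1]},
    "F3": {"cpc": ["A61K 9/00"], "vec": [0.0, 0.1, 0.9]},
    "F4": {"cpc": ["Y04S 10/50"], "vec": [0.1, 0.9, 0.2]},
    "F5": {"cpc": ["H02J 3/00"], "vec": [0.2, 0.8, 0.1]},
}
matches = match_green_to_non_green(families)
print(matches)
```

The sketch assumes at least one non-green family exists; with none, `max` raises a `ValueError`.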

Requirements

  • OS: Linux / macOS / Windows (WSL2 recommended on Windows)
  • Python: 3.11 (see requirements.txt for pinned packages)
  • R: 4.4.3 (for 5.3_entry_reg_R.ipynb, 7.3_complement_reg_R.ipynb)
  • Optional: Stata (for 5.4_marginplot_stata.ipynb).

Install with conda:

conda create -n gbb python=3.11 -y
conda activate gbb
conda install r=4.4.3 r-irkernel
pip install -r requirements.txt

Typical install time on a normal desktop: 10–25 minutes (most time is Python wheel downloads; nmslib may compile on some platforms).

Quickstart (demo)

Open the notebooks to see the processing code and the results on the whole dataset.

A tiny demo dataset is in sampledata/. It is not the full data used in the paper, only a few sampled rows that show the data structure, so running the notebooks on it will not reproduce the reported results.

Paths are now configured via an auto-inserted setup cell. You can also set an environment variable to relocate the project:

export GBB_PROJECT_ROOT=/path/to/this/repo
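One way a setup cell could resolve paths from this variable is sketched below. The actual auto-inserted cell may work differently; the helper names `project_root` and `sample_data_dir` and the fallback to the current working directory are assumptions made for illustration.

```python
import os
from pathlib import Path


def project_root(env_var="GBB_PROJECT_ROOT"):
    """Return the project root from the environment, else the working directory."""
    value = os.environ.get(env_var)
    if value:
        return Path(value).expanduser().resolve()
    # Assumed fallback: treat the directory Jupyter was started in as the root.
    return Path.cwd().resolve()


def sample_data_dir():
    """Location of the demo data relative to the project root."""
    return project_root() / "sampledata"


print(sample_data_dir())
```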

Typical workflow

You need to download the PatentsView dataset and PATSTAT (license required). Storing the raw data as Parquet files is recommended for faster loading.

Run in order:

  • 0_get_familiy_id.ipynb
  • 1_ccmt_matching.ipynb
  • 2_cpc.ipynb
  • 3.1_cluster_GBB.ipynb
  • 3.2_cluster_sourcefields.ipynb
  • 3.3_name_clusters.ipynb
  • 4_assocition_source_gbb.ipynb
  • 5.1_firm_agg.ipynb
  • 5.2_data4reg.ipynb
  • 5.3_entry_reg_R.ipynb (R)
  • 5.4_marginplot_stata.ipynb (Stata)
  • 6_viz_firm_cities.ipynb
  • 7.1_firm_primary_ctry.ipynb
  • 7.2_complement_firm_ctry.ipynb
  • 7.3_complement_reg_R.ipynb (R)
  • 7.4_fig_complement.ipynb
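The run order follows the numeric prefixes of the file names, so it can be derived rather than typed out. The sketch below sorts notebook names by prefix and builds a `jupyter nbconvert` command for each one without running it. The helpers `stage_key` and `nbconvert_command` are hypothetical, and headless execution is an assumption: the R and Stata notebooks would still need their own kernels installed.

```python
import re


def stage_key(filename):
    """Sort key from the numeric prefix, e.g. '3.2_cluster...' -> (3, 2)."""
    m = re.match(r"(\d+)(?:\.(\d+))?_", filename)
    if m is None:
        raise ValueError(f"no numeric stage prefix in {filename!r}")
    return (int(m.group(1)), int(m.group(2) or 0))


def nbconvert_command(notebook):
    """Command line that executes one notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]


# A shuffled subset of the notebooks listed above.
notebooks = [
    "5.1_firm_agg.ipynb",
    "1_ccmt_matching.ipynb",
    "3.2_cluster_sourcefields.ipynb",
    "0_get_familiy_id.ipynb",
    "3.1_cluster_GBB.ipynb",
]
ordered = sorted(notebooks, key=stage_key)
for nb in ordered:
    print(" ".join(nbconvert_command(nb)))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` so that a failing notebook stops the chain.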

Notes

  • Parquet IO requires pyarrow (already pinned).
  • R notebooks require IRkernel if run in Jupyter.
  • The Stata notebook can be run in Stata directly or through a Jupyter kernel configured for Stata.
  • The matching and clustering steps are memory-intensive: 64 GB of RAM or more is typically needed for the whole patent dataset.
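To see why memory matters, a back-of-the-envelope estimate of the raw embedding matrix can help. The numbers below are illustrative assumptions, not figures from this repo or the paper: 8 million vectors, 768 dimensions, float32 storage. The ANN index and clustering intermediates add to this footprint.

```python
def embedding_memory_gb(n_vectors, dim, bytes_per_value=4):
    """Size in GB (10^9 bytes) of a dense n_vectors x dim matrix."""
    return n_vectors * dim * bytes_per_value / 1e9


# Hypothetical sizes chosen only to illustrate the arithmetic.
estimate = embedding_memory_gb(n_vectors=8_000_000, dim=768)
print(f"{estimate:.1f} GB")
```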

License

See LICENSE.
