lciric

sycophancy-construct-validity Public

Python
safety-concept-vectors Public

Extracting and validating safety concept vectors (eval-awareness, deception, sycophancy, etc.) from open-weight LLMs — extending Anthropic's emotion vectors methodology to alignment-critical concepts

Python 1
eval-awareness-detection Public

Mechanistic detection of eval-awareness in language models via representation engineering

Python 1
does-quantization-kill-interpretability Public

Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.

Python 1
gptq-from-scratch Public

GPTQ post-training quantization from scratch — GPT-2, OPT, LLaMA support

Jupyter Notebook 1
pcm-bitslicing Public

Python 1
