Releases: Baron-Sun/socialscikit

SocialSciKit v0.1.0 — Zero-code text analysis toolkit

18 Apr 09:23
@Baron-Sun Baron-Sun

SocialSciKit v0.1.0 — Initial Release

SocialSciKit is an open-source, zero-code toolkit for social science text analysis. Its Gradio-based UI runs in the browser, it supports GPT / Claude / Ollama backends, and it ships with a bilingual interface (English / 中文).

This initial release covers the full research lifecycle — from raw data to a publication-ready Methods section — through three independent modules plus a unified visualization dashboard.

📦 Three Core Modules

QuantiKit — Text Classification

End-to-end pipeline for supervised text classification.

  • Method recommendation with CSS-literature citations (zero-shot / few-shot / fine-tuning)
  • Annotation budget estimation via power-law learning-curve fitting, with 80% CI and marginal-return curves
  • Built-in annotator (skip / undo / flag) with real-time progress chart
  • Three classification paths: prompt classification (with APE-based prompt optimization), local transformer fine-tuning, OpenAI fine-tuning API
  • Pipeline log export in JSON for downstream tools
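
The annotation budget estimate rests on fitting a power-law learning curve. Below is a minimal sketch of that idea, assuming an error model of the form err(n) = a * n^(-b) fitted by least squares in log-log space. The function names and pilot numbers are hypothetical and are not taken from SocialSciKit's code. The confidence interval and marginal-return steps are omitted.

```python
import math

def fit_power_law(sizes, errors):
    """Fit err = a * n**(-b) by ordinary least squares on log-transformed data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def labels_needed(a, b, target_error):
    """Invert err = a * n**(-b) to get the smallest n that reaches target_error."""
    return math.ceil((a / target_error) ** (1.0 / b))

# Hypothetical pilot results: error rate measured at increasing training sizes.
# Each doubling of the training set multiplies the error by 0.8.
sizes = [50, 100, 200, 400]
errors = [0.40, 0.32, 0.256, 0.2048]

a, b = fit_power_law(sizes, errors)
budget = labels_needed(a, b, target_error=0.15)
print(f"a={a:.3f}, b={b:.3f}, labels needed for 15% error: {budget}")
```

With these made-up points the extrapolated budget falls between 800 and 1600 labels, which matches the doubling pattern in the pilot data. Real learning curves are noisier, which is why an interval around the estimate matters.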

QualiKit — Qualitative Coding

End-to-end pipeline for interview transcripts, focus groups, and open-ended surveys.

  • PII de-identification with Chinese + English NER, per-item review and bulk acceptance
  • Interactive research framework (RQs + sub-themes) with LLM-assisted sub-theme suggestion
  • LLM batch coding grounded in a verbatim evidence_span extracted from the source text
  • Review workflow with confidence ranking, bulk accept, manual coding, cascading dropdowns
  • Structured Excel export + pipeline log

Toolbox — Research Methods Tools

Three standalone utilities that work with any CSV or pipeline log.

  • ICR Calculator: Cohen's Kappa, Krippendorff's Alpha, Multi-label Jaccard — supports 2 or more coders with auto metric selection
  • Consensus Coding: dispatch the same coding task to 2–5 LLMs in parallel and aggregate via majority vote
  • Methods Section Generator: auto-draft a bilingual Methods paragraph from an imported pipeline log or a short form
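
For readers unfamiliar with the metrics behind the ICR Calculator and Consensus Coding, here is a minimal sketch of two of them: Cohen's Kappa for two coders and a simple majority vote. The function names are hypothetical. The tie handling (returning None) is an assumption, since the release notes do not say how ties are resolved.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Expected agreement under independence, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def majority_vote(labels):
    """Return the most common label, or None when the top count is tied (assumed policy)."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

coder_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(coder_a, coder_b)
consensus = majority_vote(["theme_1", "theme_2", "theme_1"])
print(round(kappa, 3), consensus)
```

In this toy data the coders agree on 5 of 6 items and chance agreement is 0.5, so kappa is 2/3.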

📊 Visualization Dashboard

Academic-style matplotlib charts embedded throughout both pipelines and the Toolbox:

  • QuantiKit Step 5 (Evaluation) — metric summary cards + row-normalized confusion-matrix heatmap + per-class P/R/F1 grouped bar chart
  • QuantiKit Step 3 (Annotation) — live progress donut, updated after every action
  • QualiKit Step 5 (Review) — review-progress donut + confidence histogram (with tier shading and median marker) + theme-distribution horizontal bar chart
  • Toolbox ICR — pairwise agreement bar chart with "Good" and "Moderate" reference lines

All charts use a consistent blue / green / orange palette and include full CJK font support.
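
The numbers behind the evaluation charts can be sketched in a few lines. The example below is an illustration only, with hypothetical function names and toy labels; it computes a confusion matrix, normalizes each row by the true-class count, and derives per-class precision, recall, and F1. Plotting is left out.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

def row_normalize(matrix):
    """Divide each row by its total so every row sums to 1 (or stays 0 if empty)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([cell / total if total else 0.0 for cell in row])
    return normalized

def per_class_prf(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column total
        actual = sum(matrix[k])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores

labels = ["a", "b"]
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
cm = confusion_matrix(y_true, y_pred, labels)
cm_norm = row_normalize(cm)
prf = per_class_prf(cm)
```

Row normalization makes the diagonal of the heatmap read directly as per-class recall, which keeps small classes visible next to large ones.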

🔍 Evidence Highlighting

LLM coding in QualiKit is grounded in verbatim evidence rather than opaque labels.

  • The coding prompt requires the LLM to return an evidence_span — the exact phrase or sentence from the source text that supports the assigned RQ / sub-theme.
  • In the review UI, the original text is rendered with the supporting quote highlighted in green at the correct position.
  • When the quote can't be matched verbatim (e.g. paraphrased), a fallback "Evidence" block displays the cited text so reviewers can still audit the coding decision.
  • Case-insensitive substring matching makes highlighting robust to minor capitalization differences.

This makes every LLM decision auditable — a critical step for IRB-facing qualitative research.
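
The matching logic described above can be sketched as follows. This is an illustration under assumptions, not SocialSciKit's implementation: the function names and the `[[`/`]]` markers (standing in for the green highlight) are hypothetical.

```python
def locate_evidence(source_text, evidence_span):
    """Return (start, end) of the span in source_text, or None if it is not found.

    Matching is case-insensitive. This assumes lowercasing does not change
    string length, which holds for most text but not every Unicode character.
    """
    if not evidence_span:
        return None
    start = source_text.lower().find(evidence_span.lower())
    if start == -1:
        return None
    return start, start + len(evidence_span)

def render_evidence(source_text, evidence_span, open_mark="[[", close_mark="]]"):
    """Highlight the span in place, or append a fallback Evidence block."""
    match = locate_evidence(source_text, evidence_span)
    if match is None:
        # Paraphrased or otherwise unmatched quote: show it separately for audit.
        return f"{source_text}\nEvidence: {evidence_span}"
    start, end = match
    return (
        source_text[:start]
        + open_mark + source_text[start:end] + close_mark
        + source_text[end:]
    )

highlighted = render_evidence("I felt Very Tired after shifts.", "very tired")
fallback = render_evidence("I felt tired.", "exhausted")
print(highlighted)
print(fallback)
```

The highlighted output keeps the source text's original capitalization, because the offsets are applied to the untouched string rather than to the lowercased copy.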

💾 Project Save & Restore

Save the full state of your research session — loaded DataFrames, annotation sessions (with cursor + history + elapsed time), extraction review sessions, research questions, de-identification results — to a single JSON file. Reload from the Home tab to resume work later. Tagged-union serialization keeps complex types (DataFrames, dataclasses, enums) losslessly round-tripping.
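
As an illustration of what tagged-union serialization means here, the sketch below wraps each non-JSON value in a small dict with a `__type__` tag so the decoder knows how to rebuild it. The tag names, the `Status` and `Annotation` types, and the registries are all hypothetical; the project's actual schema may differ, and DataFrame handling is left out to keep the example dependency-free.

```python
import json
from dataclasses import dataclass, is_dataclass
from enum import Enum

class Status(Enum):  # hypothetical enum
    PENDING = "pending"
    ACCEPTED = "accepted"

@dataclass
class Annotation:  # hypothetical dataclass
    text: str
    label: str
    status: Status

# Registries mapping tag names back to the types they stand for.
ENUMS = {"Status": Status}
DATACLASSES = {"Annotation": Annotation}

def encode(obj):
    """Recursively convert obj into JSON-safe values, tagging special types."""
    if isinstance(obj, Enum):
        return {"__type__": "enum", "name": type(obj).__name__, "value": obj.value}
    if is_dataclass(obj) and not isinstance(obj, type):
        fields = {k: encode(getattr(obj, k)) for k in obj.__dataclass_fields__}
        return {"__type__": "dataclass", "name": type(obj).__name__, "fields": fields}
    if isinstance(obj, list):
        return [encode(x) for x in obj]
    if isinstance(obj, dict):
        return {k: encode(v) for k, v in obj.items()}
    return obj

def decode(obj):
    """Inverse of encode: rebuild tagged values, pass plain JSON through."""
    if isinstance(obj, list):
        return [decode(x) for x in obj]
    if isinstance(obj, dict):
        tag = obj.get("__type__")
        if tag == "enum":
            return ENUMS[obj["name"]](obj["value"])
        if tag == "dataclass":
            fields = {k: decode(v) for k, v in obj["fields"].items()}
            return DATACLASSES[obj["name"]](**fields)
        return {k: decode(v) for k, v in obj.items()}
    return obj

state = {"cursor": 3, "items": [Annotation("hello", "greeting", Status.ACCEPTED)]}
saved = json.dumps(encode(state))
restored = decode(json.loads(saved))
```

The explicit tag is what allows a lossless round trip: without it, an enum would come back as a bare string and a dataclass as a plain dict.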

🌐 Runtime

Component | Tested
Python | 3.9 – 3.12
Gradio | 4.44+
LLM backends | OpenAI (gpt-4o / gpt-4o-mini / gpt-4.1), Anthropic (Claude Sonnet 4 / Haiku 4.5), Ollama (Llama 3 / Mistral / Qwen 2.5)
Test suite | 676 tests passing

🚀 Install & Launch

pip install socialscikit
socialscikit # launches the unified UI at http://127.0.0.1:7860