Add South Carolina dataset exploration #120

Open
DTrim99 wants to merge 12 commits into PolicyEngine:main from DTrim99:sc-data-exploration

DTrim99 commented Feb 26, 2026

Summary

  • Adds data exploration notebook for South Carolina (SC) state dataset
  • Includes comprehensive summary CSV with weighted population estimates
  • Analyzes AGI distribution at both household and person levels (median, average, percentiles)
  • Breaks down households by number of children and children by age groups
  • NEW: Adds SC H.4216 tax reform analysis with RFA comparison

Key SC Statistics

| Metric | Value |
| --- | --- |
| Household count (weighted) | 1,887,388 |
| Person count (weighted) | 5,451,832 |
| Average household size | 2.9 |
| Weighted median household AGI | $43,222 |
| Weighted average household AGI | $103,858 |
| Weighted median person AGI | $38,962 |
| Weighted average person AGI | $93,926 |
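
The weighted figures above come from records that each carry a survey weight. Below is a minimal sketch of how a weighted average and a weighted median can be computed. The AGI values and weights are made up for illustration (they are not SC data), and the notebook's actual method may differ, for example in how it interpolates the median.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    raise ValueError("weights must sum to a positive number")


# Illustrative records only: household AGI and survey weight.
agi = [0, 20_000, 40_000, 60_000, 1_000_000]
weight = [300, 500, 600, 400, 50]

mean_agi = weighted_mean(agi, weight)
median_agi = weighted_median(agi, weight)
print(f"weighted mean: {mean_agi:,.0f}, weighted median: {median_agi:,.0f}")
```

In this toy data a single high-income record pulls the average well above the median, the same pattern as the $103,858 average against the $43,222 median reported for households.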

H.4216 Tax Reform Analysis

Compares PolicyEngine (PE) microsimulation results against the official RFA (Revenue & Fiscal Affairs) analysis.

| Metric | RFA | PolicyEngine |
| --- | --- | --- |
| General Fund Impact | -$119.1M | +$39.8M |
| Tax Decrease % | 38.7% | 20.0% |
| Tax Increase % | 26.7% | 24.0% |
| No Change % | 34.6% | 56.0% |
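
As a rough illustration of how the PolicyEngine column could be derived from microsimulation output, the sketch below sums the weighted change in tax liability and classifies each tax unit as a decrease, an increase, or no change. The arrays and the $1 tolerance are hypothetical; in the notebooks these values would come from the baseline and reform simulations.

```python
def summarize_reform(baseline_tax, reform_tax, weights, tolerance=1.0):
    """Return the weighted revenue change and the share of units in each group.

    `tolerance` (in dollars) is an assumed threshold below which a change
    counts as "no change".
    """
    total_weight = sum(weights)
    revenue_change = 0.0
    group_weight = {"decrease": 0.0, "increase": 0.0, "no_change": 0.0}
    for base, reform, w in zip(baseline_tax, reform_tax, weights):
        change = reform - base
        revenue_change += w * change
        if change < -tolerance:
            group_weight["decrease"] += w
        elif change > tolerance:
            group_weight["increase"] += w
        else:
            group_weight["no_change"] += w
    shares = {name: wt / total_weight for name, wt in group_weight.items()}
    return revenue_change, shares


# Made-up tax units: baseline tax, reform tax, and survey weight.
baseline = [0.0, 1_200.0, 3_000.0, 9_000.0]
reform = [0.0, 1_000.0, 3_400.0, 9_000.0]
weights = [400.0, 300.0, 200.0, 100.0]

revenue_change, shares = summarize_reform(baseline, reform, weights)
print(f"revenue change: {revenue_change:+,.0f}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The toy data shows how a reform can raise revenue overall even when more weighted units see a cut than an increase, because the sign of the total depends on the size of each change and not only on the counts.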

Key Differences

The $159M gap between the two General Fund estimates (RFA's -$119.1M vs PE's +$39.8M) is primarily due to:

  1. Upper-middle income ($100k-$500k): PE shows larger tax increases due to SCIAD phase-out
  2. Middle income ($30k-$100k): PE shows smaller tax cuts
  3. Data source: RFA uses actual SC tax returns; PE uses CPS-based synthetic data

See h4216_analysis_comparison.md for detailed analysis.

Files Added

  • us/states/sc/data_exploration.ipynb - SC dataset exploration
  • us/states/sc/sc_dataset_summary_weighted.csv - Dataset summary
  • us/states/sc/sc_h4216_reform_analysis.ipynb - H.4216 reform analysis
  • us/states/sc/sc_h4216_tax_impact_analysis.csv - PE analysis results
  • us/states/sc/rfa_h4216_analysis.csv - RFA official analysis
  • us/states/sc/h4216_analysis_comparison.md - Comparison analysis

Test plan

  • Data exploration notebook runs successfully
  • H.4216 reform analysis notebook runs with correct 5.39% top rate
  • All weighted statistics calculated correctly

🤖 Generated with Claude Code

DTrim99 and others added 7 commits February 26, 2026 15:00
Adds data exploration notebook and summary CSV for South Carolina (SC) dataset:
- Household and person counts (weighted)
- AGI distribution (median, average, percentiles) at household and person level
- Households with children breakdown
- Children by age group demographics
- Income bracket analysis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add H.4216 reform analysis notebook using PolicyEngine microsimulation
- Include RFA official analysis data for comparison
- Add detailed comparison markdown explaining $159M difference:
  - PE shows +$40M revenue vs RFA's -$119M
  - Key difference: SCIAD phase-out treatment for upper-middle income
  - Implementation uses AGI - SCIAD vs federal taxable income

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key findings:
- PE has 7.85x more $0 income returns vs RFA
- PE has ~50% fewer returns in $100k-$300k brackets
- PE has 1.9x more millionaire returns paying 78% higher avg tax
- Total baseline revenue similar ($6.52B vs $6.40B) but composition differs
- PE derives 48% of SC income tax from millionaires vs RFA's 15%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PE includes non-filers which explains 540k extra returns in $0 bracket

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add implementation note about sc_additions bug fix
- Add RFA comparison section to notebook
- Update comparison markdown with post-fix accuracy (~93%)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add data_exploration_staging.ipynb for staging SC dataset
- Add sc_h4216_budget_impact.py for quick budget impact calculation
- Add staging dataset summary CSV
- Update reform analysis notebook with RFA comparison fixes
- Update tax impact CSV with corrected results (staging data)

Staging vs Production dataset comparison:
- Staging has 17% fewer households (more focused on filers)
- Staging median AGI is 39% higher ($60k vs $43k)
- Budget impact with staging: -$?46.6M (5.21%) / -$110.9M (5.39%)
- RFA estimate: -$119.1M (93% accuracy with 5.39% rate)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

DTrim99 commented Mar 2, 2026

Update: Staging Dataset Analysis & PR #7514 Fix

Changes

  • Added a notebook that explores the staging SC dataset
  • Added a quick budget impact calculation script
  • Updated reform analysis notebook with correct RFA column names
  • Re-ran analysis with staging dataset and policyengine-us 1.589.1 (includes PR #7514 fix)

Results with PR #7514 Fix

| Scenario | Top Rate | PE Estimate | RFA Estimate | Accuracy |
| --- | --- | --- | --- | --- |
| H.4216 (bill text) | 5.21% | -$?46.6M | N/A | - |
| H.4216 (RFA version) | 5.39% | -$110.9M | -$119.1M | 93% |

Staging vs Production Dataset

| Metric | Production | Staging | Change |
| --- | --- | --- | --- |
| Households | 1,887,388 | 1,573,988 | -17% |
| Median HH AGI | $43,222 | $60,027 | +39% |
| 25th pctl AGI | $9,425 | $25,465 | +170% |

The staging dataset better represents actual tax filers (fewer zero/low income units), which explains the improved alignment with RFA estimates.

DTrim99 and others added 4 commits March 2, 2026 14:07
- Remove staging dataset files (broken data)
- Add data_exploration_test.ipynb for test dataset (hf://policyengine/test/mar/SC.h5)
- Update all notebooks to use .values for raw arrays (avoid double-weighting)
- Update sc_h4216_budget_impact.py to use test dataset and correct RFA estimate
- Update sc_h4216_reform_analysis.ipynb to use test dataset
- Add sc_h4216_dataset_comparison.py comparing production vs test datasets

RFA estimates:
- 5.21% rate: -$309M
- 5.39% rate: -$119.1M

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
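
The double-weighting issue named in the commit message above arises when a result object already applies survey weights in its aggregations and is then multiplied by the weights again. The sketch below uses a hypothetical `WeightedSeries` stand-in, not any real PolicyEngine or microdf class, so it only illustrates the arithmetic; reading `.values` gives the raw array, which can then be weighted by hand.

```python
class WeightedSeries:
    """Hypothetical stand-in for a series whose aggregations apply weights."""

    def __init__(self, values, weights):
        self.values = list(values)    # raw, unweighted per-record values
        self.weights = list(weights)  # survey weight for each record

    def __mul__(self, other):
        # Element-wise product that keeps the original weights attached.
        products = [v * o for v, o in zip(self.values, other)]
        return WeightedSeries(products, self.weights)

    def sum(self):
        # The aggregation already multiplies each value by its weight.
        return sum(v * w for v, w in zip(self.values, self.weights))


weights = [100.0, 200.0]
tax = WeightedSeries([1_000.0, 2_000.0], weights)

# Weights applied once by the aggregation itself.
correct_total = tax.sum()
# Weights applied twice: once in the product, once in the aggregation.
double_weighted = (tax * weights).sum()
# Raw values taken via .values and weighted by hand.
manual_total = sum(v * w for v, w in zip(tax.values, weights))

print(correct_total, double_weighted, manual_total)
```

Working from the raw values makes it explicit where the weights enter the calculation, so they are applied exactly once.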
- Produces output in exact RFA format for direct comparison
- Uses test dataset (hf://policyengine/test/mar/SC.h5)
- Uses 5.39% top rate (RFA version)
- Exports to pe_h4216_test_analysis.csv
- Includes side-by-side comparison with RFA data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add detailed analysis explaining why Production overestimates and Test underestimates
- Core issue: baseline revenue calibration ($6.5B Production vs $4.0B Test vs $6.4B RFA)
- Add test dataset exploration notebook and summary CSV
- Update comparison markdown with recommendations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

DTrim99 commented Mar 6, 2026

Dataset Comparison Findings

Updated analysis comparing Production and Test datasets against RFA fiscal notes for SC H.4216.

Budget Impact Results

| Dataset | 5.21% Impact | vs RFA (-$309M) | 5.39% Impact | vs RFA (-$119M) |
| --- | --- | --- | --- | --- |
| Production | -$393M | 73% accuracy | -$198M | 34% accuracy |
| Test (Mar) | -$212M | 69% accuracy | -$93M | 78% accuracy |
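
The accuracy percentages in this table are consistent with one minus the absolute error relative to the RFA estimate. That formula is inferred from the reported numbers, not taken from the analysis code, so the sketch below should be read as an assumption.

```python
def accuracy_vs_rfa(pe_estimate, rfa_estimate):
    """Return 1 - |PE - RFA| / |RFA| (an inferred definition of accuracy)."""
    return 1 - abs(pe_estimate - rfa_estimate) / abs(rfa_estimate)


# Budget impacts in $ millions, taken from the table above.
rfa = {"5.21%": -309, "5.39%": -119}
pe = {
    "Production": {"5.21%": -393, "5.39%": -198},
    "Test (Mar)": {"5.21%": -212, "5.39%": -93},
}

accuracy = {
    dataset: {rate: accuracy_vs_rfa(value, rfa[rate]) for rate, value in impacts.items()}
    for dataset, impacts in pe.items()
}
for dataset, by_rate in accuracy.items():
    for rate, value in by_rate.items():
        print(f"{dataset} at {rate}: {value:.0%}")
```

Because the metric uses the absolute error, it treats overestimates and underestimates the same way; the sign of the miss has to be read from the impact columns.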

Key Finding: Baseline Revenue Calibration

| Source | Baseline Revenue | vs RFA |
| --- | --- | --- |
| RFA | ~$6.4B | - |
| Production | $6.5B | +2% |
| Test | $4.0B | -37% |

Production overestimates the revenue loss because it has higher average incomes ($104k vs $74k in Test) and more tax units affected by the rate cuts.

Test underestimates the revenue loss because its baseline revenue is 37% below RFA's, even though its return count is closer to RFA's (2.71M vs 2.76M).

Ideal Dataset Would Have:

  • Test's return count (~2.7M matching RFA's 2.76M filers)
  • Production's baseline revenue (~$6.5B matching RFA's ~$6.4B)

See h4216_analysis_comparison.md for full details and recommendations.

- Restructure into h4216_analysis/ folder with rate-specific subfolders
- Add analysis notebooks for both State and Test datasets at each rate
- Add comprehensive comparison markdown with bracket-by-bracket analysis
- Remove unused intermediate scripts and notebooks

Key findings:
- 5.21% rate: State -$393M, Test -$212M vs RFA -$309M
- 5.39% rate: State -$198M, Test -$93M vs RFA -$119M
- Primary driver: millionaire distribution (State has 90% more, Test has 41% fewer)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

DTrim99 commented Mar 6, 2026

SC H.4216 Analysis Update - Comprehensive RFA Comparison

Reorganized analysis with full bracket-by-bracket comparison against RFA fiscal notes.

RFA Fiscal Notes

5.21% Rate: https://legiscan.com/SC/supplement/H4216/id/682946/South_Carolina-2025-H4216-H4216_2026-02-24_Amended.pdf
5.39% Rate: https://legiscan.com/SC/supplement/H4216/id/636685/South_Carolina-2025-H4216-H4216_2026-01-13_Updated.pdf

Budget Impact Summary

| Rate | RFA | State Dataset | Test Dataset |
| --- | --- | --- | --- |
| 5.21% | -$308.7M | -$393.0M (27% over) | -$212.0M (31% under) |
| 5.39% | -$119.1M | -$198.2M (66% over) | -$92.7M (22% under) |

Key Finding: Millionaire Distribution

The primary driver of discrepancies is the millionaire bracket:

| Metric | RFA | State | Test |
| --- | --- | --- | --- |
| Millionaire Returns | 11,936 | 22,686 (+90%) | 6,993 (-41%) |
| 5.21% Impact | -$45M | -$333M | -$142M |

  • State dataset has nearly double the millionaires RFA reports
  • Test dataset has 41% fewer millionaires but an extreme outlier ($418.7M AGI)
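
A sketch of how the millionaire bracket could be inspected in a weighted dataset: count the weighted returns with AGI of at least $1M, then measure how much of the bracket's weighted AGI comes from its single largest record. The records below are invented for illustration (only the $418.7M figure echoes the outlier mentioned above), and defining the bracket as AGI of $1M or more is an assumption.

```python
MILLION = 1_000_000


def millionaire_summary(agi, weights, threshold=MILLION):
    """Weighted return count and top-record AGI share for the top bracket."""
    bracket = [(a, w) for a, w in zip(agi, weights) if a >= threshold]
    weighted_returns = sum(w for _, w in bracket)
    weighted_agi = sum(a * w for a, w in bracket)
    # Largest single contribution to the bracket's weighted AGI.
    top_contribution = max((a * w for a, w in bracket), default=0.0)
    top_share = top_contribution / weighted_agi if weighted_agi else 0.0
    return weighted_returns, top_share


# Invented records: AGI in dollars and survey weight per tax unit.
agi = [50_000, 250_000, 1_500_000, 3_000_000, 418_700_000]
weights = [900.0, 400.0, 120.0, 60.0, 2.0]

returns, top_share = millionaire_summary(agi, weights)
print(f"weighted millionaire returns: {returns:,.0f}")
print(f"share of bracket AGI from largest record: {top_share:.1%}")
```

A high top-record share signals that the bracket's revenue impact rests on very few sampled records, which makes it sensitive to how those records are weighted.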

File Structure

h4216_analysis/
├── h4216_analysis_comparison.md    # Full comparison document
├── 5.21_rate/
│   ├── rfa_h4216_5.21_analysis.csv
│   ├── state/  (notebook + PE output)
│   └── test/   (notebook + PE output)
└── 5.39_rate/
    ├── rfa_h4216_analysis.csv
    ├── state/  (notebook + PE output)
    └── test/   (notebook + PE output)

Conclusion

Policy encoding is correct. All discrepancies stem from dataset characteristics, primarily millionaire weighting. See h4216_analysis_comparison.md for full bracket-by-bracket analysis and recommendations.
