The data file we have provided, raw_rank_trunk_chars.npz, should be placed in the data/ directory of the project if it is not already there. This file can be found under this link:
Note that there are two differences between the data-set we have provided and the data-set used in the paper:
- The data-set we have provided is a truncated version of the data-set in the paper. This is because the full data-set is around 20 GB, and running all the results requires making multiple copies of the data-set, which is very time and space consuming. For this replication package, we have taken the last 5 years of the characteristic data. Note that we also provide the full data set of the characteristic data for all the years and the code can be applied accordingly to the full data.
- The returns in the data-set we have provided have been altered so as not to violate the terms of service of their source; we have done this by adding noise to them. Specifically, we have added i.i.d. noise with a Normal(0, 0.1) distribution to each return observation. Therefore, one should not expect this data to exactly replicate any of the numerical results in the paper concerning returns, nor should it replicate standard results. However, the qualitative results are similar. The permnos corresponding to the returns have not been modified in any way, so users with WRDS access can easily replace the contaminated returns with the correct returns and exactly replicate the numerical results in the paper.
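As a rough illustration of the second point, the sketch below adds i.i.d. Gaussian noise to a handful of returns and then swaps the noisy values back out, using the unmodified permnos as part of the join key. It is not the code used to build the data file: the record layout, the `add_noise` and `restore_returns` helpers, the sample values, and the reading of 0.1 as a standard deviation (rather than a variance) are all assumptions made for the example.

```python
import random

NOISE_SD = 0.1  # assumption: the 0.1 in Normal(0, 0.1) is read as a standard deviation


def add_noise(returns, sd=NOISE_SD, seed=0):
    """Return a copy of `returns` with i.i.d. Normal(0, sd) noise added to each value.

    `returns` maps (permno, date) -> return; the keys are left untouched.
    """
    rng = random.Random(seed)
    return {key: value + rng.gauss(0.0, sd) for key, value in returns.items()}


def restore_returns(noisy, clean):
    """Replace noisy returns with clean ones wherever the (permno, date) key matches."""
    return {key: clean.get(key, value) for key, value in noisy.items()}


# Hypothetical (permno, date) -> return observations, for illustration only.
clean = {
    (10107, "2018-03-31"): 0.012,
    (10107, "2018-04-30"): -0.004,
    (14593, "2018-03-31"): 0.031,
}
noisy = add_noise(clean)
restored = restore_returns(noisy, clean)
```

Because the keys survive the noise step unchanged, restoring the returns is a plain lookup; any observation without a matching clean return simply keeps its noisy value.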
NOTE: all of the commands below expect that the user is running them from the base directory of the repository.
- Install the required packages:
  $ pip install -r requirements.txt
- Create the necessary directories:
  $ ./setup_directories.sh
- Generate the masked data:
  $ cd src && python generate_masked_data.py
- Run the imputations:
  $ cd src && python run_data_imputations.py
- Run the desired notebook for the particular results in question. Be sure to run the first cell of the notebook to import the required modules and load the required data. To start the notebook server, run:
  $ cd src && jupyter notebook
- We have indicated in each result in the notebook how long it took us to run on a Macbook Pro with 2.8 GHz Quad-Core Intel Core i7 processor and 16 GB 2133 MHz LPDDR3 for memory.
- Some of the results take quite some time to run (for example the simulations).
- Generating the data and running the imputations should take on the order of an hour or two.
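The non-interactive setup steps above can also be chained from a small driver script. The following is only a sketch under stated assumptions: `run_pipeline`, `STEPS`, and the up-front check for the data file are not part of the repository, and `python -m pip` is used in place of the bare `pip` command. Passing `cwd` to `subprocess.run` plays the role of `cd src &&` in the commands above.

```python
import subprocess
import sys
from pathlib import Path

DATA_FILE = Path("data") / "raw_rank_trunk_chars.npz"

# (command, working directory relative to the repository root)
STEPS = [
    ([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], "."),
    (["./setup_directories.sh"], "."),
    ([sys.executable, "generate_masked_data.py"], "src"),
    ([sys.executable, "run_data_imputations.py"], "src"),
]


def run_pipeline(repo_root=".", dry_run=True):
    """Run the setup steps in order, stopping at the first failure.

    With dry_run=True the commands are only printed and returned; nothing is executed.
    """
    root = Path(repo_root).resolve()
    # Assumption: the data file must already be in place before anything runs.
    if not dry_run and not (root / DATA_FILE).is_file():
        raise FileNotFoundError(f"expected the data file at {root / DATA_FILE}")
    planned = []
    for command, workdir in STEPS:
        planned.append((command, root / workdir))
        print(f"[{workdir}] {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with an error
            subprocess.run(command, cwd=root / workdir, check=True)
    return planned


planned = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the base directory would execute the steps for real; the notebook server is left out because it is interactive.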
Each of the paper results can be found in the notebook corresponding to its section in the paper.
- src/appendix.ipynb
- src/section2.ipynb
- src/section3.ipynb
- src/section4.ipynb
- src/section5.ipynb
- src/section6.ipynb
Within each notebook, the first cell corresponds to data-loading and imports, and then each section corresponds to a table or figure from the text. These are clearly labeled along with a description of what is being done. Running the cell corresponding to a result will produce and display the table or figure for the result as well as writing either a pdf of the plot for figures or a tex file of the table for tables to the images-pdfs directory. Additionally, below we have listed (1) the notebook containing each result in the paper (2) the file and line number of the code to generate that result in the repository and (3) the location of files which are generated for that plot.
Main Text
- Figure 1: Missing Values over Time
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/MissingValuesOverTime.pdf
- Figure 2: Missing Observations by Characteristic
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/MissingObservationByCharacteristic_by_permno_first.pdf
- Figure 3: Missing Observations by Characteristic Quintiles
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/[MissingValuesByCharQuintile.pdf, MissingValuesBySizeQuintile.pdf]
- Table 1: Logistic Regressions Explaining Missingness
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/MissingLogitRegressions.tex
- Figure 4: Autocorrelation of Characteristic Ranks
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/AutocorrOfChars.pdf
- Figure 5: Heatmap of Pairwise Correlation
  - run_section_2_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - images-pdfs/section2/HeatmatOfCorr.pdf
- Figure 6: Joint Distribution of Missing Patterns
  - run_section_3_plots.ipynb
  - code located (src/plots_and_tables/section_3.py)
  - result written to images-pdfs/section3/missing--20180331.pdf
- Figure 7: Eigenvalues of Σ
  - run_section_4_plots.ipynb
  - code located (src/plots_and_tables/section_4.py)
  - result written to images-pdfs/section4/figure_2_avg_cov_ev.pdf
- Figure 8: Number of Factors and Regularization
  - run_section_4_plots.ipynb
  - code located (src/plots_and_tables/section_4.py)
  - result written to images-pdfs/section4/[number_of_factors_determination_xs-MAR0-True.pdf, number_of_factors_determination_xs-MAR0_0001-True.pdf, number_of_factors_determination_xs-MAR0_001-True.pdf, number_of_factors_determination_xs-MAR0_01-True.pdf, number_of_factors_determination_xs-logit0-True.pdf, number_of_factors_determination_xs-logit0_0001-True.pdf, number_of_factors_determination_xs-logit0_001-True.pdf, number_of_factors_determination_xs-logit0_01-True.pdf, number_of_factors_determination_xs-prob_block0-True.pdf, number_of_factors_determination_xs-prob_block0_0001-True.pdf, number_of_factors_determination_xs-prob_block0_001-True.pdf, number_of_factors_determination_xs-prob_block0_01-True.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-MAR0-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-MAR0_0001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-MAR0_001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-MAR0_01-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-logit0-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-logit0_0001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-logit0_001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-logit0_01-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-prob_block0-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-prob_block0_0001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-prob_block0_001-True-incremental.pdf, metrics_by_char_vol_sort-number_of_factors_determination_xs-prob_block0_01-True-incremental.pdf]
- Figure 9: Optimal Regularization
  - run_section_4_plots.ipynb
  - code located (src/plots_and_tables/section_4.py)
  - result written to images-pdfs/section4/[optimal_reg_determination_xs-MARk=200_0,0001_0,0005_0,001_0,005_0,01_0,05_0,1_0,5_1-.pdf, optimal_reg_determination_xs-logitk=200_0,0001_0,0005_0,001_0,005_0,01_0,05_0,1_0,5_1-.pdf, optimal_reg_determination_xs-prob_blockk=200_0,0001_0,0005_0,001_0,005_0,01_0,05_0,1_0,5_1-.pdf]
- Table 3: Imputation Error for Different Imputation Methods
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/[AggregateImputationErrorsFullDataset.tex, AggregateImputationR2FullDataset.tex]
- Table 4: Imputation Error for Extreme Characteristic Quintiles
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/ImputationErrorsByCharQuintileFullDS.tex
- Figure 10: Illustrative Model-Implied and Imputed Time-Series
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/[HAS-one-year-mask-AT.pdf, HAS-one-year-mask-ME.pdf, HAS-one-year-mask-Q.pdf, HAS-one-year-mask-VAR.pdf, MSFT-one-year-mask-AT.pdf, MSFT-one-year-mask-ME.pdf, MSFT-one-year-mask-Q.pdf, MSFT-one-year-mask-VAR.pdf]
- Table 5: Imputation Error for Types of Missingness
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/[ImputationErrorsByMissingTypeEnd.tex, ImputationErrorsByMissingTypeMiddle.tex, ImputationErrorsByMissingTypeStart.tex]
- Figure 11: Imputation Error for Individual Characteristics
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/[metrics_by_char_vol_sort-table_1_in_sample.pdf, metrics_by_char_vol_sort-table_1_out_of_sample_MAR.pdf, metrics_by_char_vol_sort-table_1_out_of_sample_block.pdf, metrics_by_char_vol_sort-table_1_out_of_sample_logit.pdf]
- Figure 12: Information Used for Imputation
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/InfoUsedForImputationBW-bw_beta_weights.pdf
- Table 6: Imputation Error for Alternative Methods
  - run_section_5_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/ComparisonWithAlternativeMethods.tex
- Figure 13: Market Premium Conditional on Observing a Firm Characteristic
  - run_section_6_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/ls-missing-obs-ports.pdf
- Figure 14: Sharpe Ratios with IPCA Factors
  - run_section_6_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/[ipca_sharpes_in_sample.pdf, ipca_sharpes_outof_sample.pdf]
- Figure 15: Univariate Sorts with and without Missing Values
  - run_section_6_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/[portfolio-sorts-B2M-Mean_Return.pdf, portfolio-sorts-B2M-Percent_Used.pdf, portfolio-sorts-B2M-Sharpe_Ratio.pdf, portfolio-sorts-B2M-Volatility.pdf, portfolio-sorts-INV-Mean_Return.pdf, portfolio-sorts-INV-Percent_Used.pdf, portfolio-sorts-INV-Sharpe_Ratio.pdf, portfolio-sorts-INV-Volatility.pdf, portfolio-sorts-ME-Mean_Return.pdf, portfolio-sorts-ME-Percent_Used.pdf, portfolio-sorts-ME-Sharpe_Ratio.pdf, portfolio-sorts-ME-Volatility.pdf, portfolio-sorts-OP-Mean_Return.pdf, portfolio-sorts-OP-Percent_Used.pdf, portfolio-sorts-OP-Sharpe_Ratio.pdf, portfolio-sorts-OP-Volatility.pdf]
- Figure 16: Imputation Bias in Pure-Play Mimicking Portfolios
  - run_section_6_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/[masked-factor_regression-pure_play-bw-xsmed-mean-abs-error.pdf, masked-factor_regression-pure_play-bw-xsmed-corr.pdf]
- Figure 17: Characteristic Mimicking Factor Portfolios
  - run_section_6_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/[masked-factor_regression-pure_play-B2M-bw-xsmed.pdf, masked-factor_regression-pure_play-INV-bw-xsmed.pdf, masked-factor_regression-pure_play-ME-bw-xsmed.pdf, masked-factor_regression-pure_play-S2P-bw-xsmed.pdf]
- Table A.1: Imputation Error for Alternative Implementations
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/appendix.py)
  - result written to images-pdfs/appendix/[ComparisonOfModelConfigs.tex]
- Simulations
  - Figure A.1: Errors with Missing-Completely-at-Random
    - run_appendix_plots.ipynb
    - code located (src/plots_and_tables/appendix.py)
    - result written to images-pdfs/appendix/[MAR_simulation_CCMSE_residreg_L=100_K=10.pdf, MAR_simulation_CCMSE_residreg_L=100_K=15.pdf, MAR_simulation_CCMSE_residreg_L=100_K=5.pdf, MAR_simulation_CCMSE_residreg_L=500_K=10.pdf, MAR_simulation_CCMSE_residreg_L=500_K=15.pdf, MAR_simulation_CCMSE_residreg_L=500_K=5.pdf, MAR_simulation_CCMSE_residreg_L=50_K=10.pdf, MAR_simulation_CCMSE_residreg_L=50_K=15.pdf, MAR_simulation_CCMSE_residreg_L=50_K=5.pdf]
  - Figure A.2: Imputation Errors with Missing-Conditionally-at-Random
    - run_appendix_plots.ipynb
    - code located (src/plots_and_tables/appendix.py)
    - result written to images-pdfs/appendix/[Lmbda_simulation_CCMSE_residreg_L=100_K=10.pdf, Lmbda_simulation_CCMSE_residreg_L=100_K=15.pdf, Lmbda_simulation_CCMSE_residreg_L=100_K=5.pdf, Lmbda_simulation_CCMSE_residreg_L=500_K=10.pdf, Lmbda_simulation_CCMSE_residreg_L=500_K=15.pdf, Lmbda_simulation_CCMSE_residreg_L=500_K=5.pdf, Lmbda_simulation_CCMSE_residreg_L=50_K=10.pdf, Lmbda_simulation_CCMSE_residreg_L=50_K=15.pdf, Lmbda_simulation_CCMSE_residreg_L=50_K=5.pdf]
- Table C.2: Missing by Characteristic Quintiles
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - result written to images-pdfs/section2/MssingByQuintile.tex
- Table C.3: Lengths of Missing Blocks
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - result written to images-pdfs/section2/MssingBlockLengths.tex
- Figure D.1: Missing Observations over Time By Characteristics
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - result written to images-pdfs/section2/HeatmatOfMissingPerc.pdf
- Figure D.2: Missing Observations by Characteristic Pooled by Stocks
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - result written to images-pdfs/section2/[MissingObservationByCharacteristic_by_date_first.pdf, MissingObservationByCharacteristic_by_date_first_value_weight.pdf]
- Figure D.3: Heatmap of Pairwise Correlation from 1967–1976
  - not included, as the truncated data does not include these years
- Figure D.4: Standard Deviation of Characteristic Ranks
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_2.py)
  - result written to images-pdfs/section2/StdOfChars.pdf
- Figure D.5: Generalized Correlation of Global and Local Factor Weights
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_4.py)
  - result written to images-pdfs/section4/generalized_corr.pdf
- Figure D.6: Composition of Proxy Factors by Characteristic Categories
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/appendix.py)
  - result written to images-pdfs/appendix/[factor_vis_0_full_factors.pdf, factor_vis_0_sparse_factors.pdf, factor_vis_1_full_factors.pdf, factor_vis_1_sparse_factors.pdf, factor_vis_2_full_factors.pdf, factor_vis_2_sparse_factors.pdf, factor_vis_3_full_factors.pdf, factor_vis_3_sparse_factors.pdf, factor_vis_4_full_factors.pdf, factor_vis_4_sparse_factors.pdf, factor_vis_5_full_factors.pdf, factor_vis_5_sparse_factors.pdf, factor_vis_6_full_factors.pdf, factor_vis_6_sparse_factors.pdf, factor_vis_7_full_factors.pdf, factor_vis_7_sparse_factors.pdf, factor_vis_8_full_factors.pdf, factor_vis_8_sparse_factors.pdf, factor_vis_9_full_factors.pdf, factor_vis_9_sparse_factors.pdf]
- Figure D.8: Global and Local Imputation for Individual Characteristics
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_5.py)
  - result written to images-pdfs/section5/metrics_by_char_vol_sort-table_2_out_of_sample_block.pdf
- Figure D.9: Top and Bottom Deciles with and without Missing Values
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/section_6.py)
  - result written to images-pdfs/section6/[hl-portfolios-Intangibles,TradingFrictions,Other-MeanReturn.pdf, hl-portfolios-Intangibles,TradingFrictions,Other-SharpeRatio.pdf, hl-portfolios-Investment,Profitability-MeanReturn.pdf, hl-portfolios-Investment,Profitability-SharpeRatio.pdf, hl-portfolios-PastReturns,Value-MeanReturn.pdf, hl-portfolios-PastReturns,Value-SharpeRatio.pdf]
- Figure D.10: Sharpe Ratios with Non-parametric IPCA Factors
  - run_appendix_plots.ipynb
  - code located (src/plots_and_tables/appendix.py)
  - result written to images-pdfs/appendix/[decile_ipca_sharpes_in_sample.pdf, decile_ipca_sharpes_outof_sample.pdf]
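
Once the notebooks have been run, the listing above can double as a checklist. The sketch below reports which expected files have not been written yet. It is not part of the repository: `missing_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, and only a few of the paths listed above are included, so the list would need to be extended for a full check.

```python
from pathlib import Path

# A few of the output paths listed above; extend as needed.
EXPECTED_OUTPUTS = [
    "images-pdfs/section2/MissingValuesOverTime.pdf",
    "images-pdfs/section2/MissingLogitRegressions.tex",
    "images-pdfs/section3/missing--20180331.pdf",
    "images-pdfs/section4/figure_2_avg_cov_ev.pdf",
]


def missing_outputs(repo_root=".", expected=EXPECTED_OUTPUTS):
    """Return the expected output paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [path for path in expected if not (root / path).is_file()]


if __name__ == "__main__":
    # Run from the base directory of the repository.
    for path in missing_outputs():
        print(f"not generated yet: {path}")
```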