This document describes some of the operations performed to generate the downloadable bulk files from the NCI Open Database structures and biological test data (cancer and AIDS, see here for more information). Its aim is to show how to combine the tools of the SDF Toolkit and to provide tricks and recipes through real examples. All of these examples should be run on a Unix system.
All input files are based on the publicly and freely available data from NCI's Developmental Therapeutics Program (DTP). We collected the structures and biological data from DTP (cancer data as of August 1999, AIDS data as of October 1999), combined them where applicable, and generated MDL SD files from this information.
The SDF_Toolkit can be downloaded here. You'll need version 1.06 or later.
append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf > new.sdf
append_sdf -prop NSC nciopen_LMCH_aug99_0D.sdf aids_o99_chemical_structs.sdf | extract_prop_sdf -prop NSC > temp.list
> 42687 entries read and 2212 entries added from aids_o99_chemical_structs.sdf
tail -2212 temp.list > 2212_oct99.list
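The tail step appears to rely on append_sdf writing the newly added entries after the existing ones, so that the last 2212 labels in temp.list are the added ones. Below is a minimal sketch of the same idea on a toy list, using the POSIX form tail -n. The directory, the file contents and the count of 2 are made up for illustration; only the name temp.list comes from the text.

```shell
# Toy stand-in for temp.list: five labels, the last two playing the
# role of the entries added by append_sdf (values are hypothetical).
workdir=$(mktemp -d)
printf '%s\n' 101 102 103 900001 900002 > "$workdir/temp.list"

# In the text this count is 2212, as reported by append_sdf.
added=2

# Keep only the last $added labels, i.e. the appended ones.
tail -n "$added" "$workdir/temp.list" > "$workdir/added.list"
cat "$workdir/added.list"
```

The count has to be copied from the append_sdf message by hand, so it is worth checking it against the number of lines in the resulting list.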
select_sdf -labelfile 2212_oct99.list -property_name NSC < aids_o99_chemical_structs.sdf > 2212_oct99_3D.sdf
remove_h_sdf < 2212_oct99_3D.sdf | remove_charge_sdf | tee 2212_oct99_3D_no_H.sdf | zero_sdf > 2212_oct99_0D.sdf
cactus_2d_nci 2212_oct99_0D.sdf | remove_stereo_sdf > 2212_oct99_2D.sdf
Repeat the same steps for the 689 entries listed in 689_aug99.list:
select_sdf -labelfile 689_aug99.list -property_name NSC < cancer_screened_a99_chemical_structs.sdf > 689_aug99_3D.sdf
remove_h_sdf < 689_aug99_3D.sdf | remove_charge_sdf | tee 689_aug99_3D_no_H.sdf | zero_sdf > 689_aug99_0D.sdf
cactus_2d_nci 689_aug99_0D.sdf | remove_stereo_sdf > 689_aug99_2D.sdf
Source code of remove_900000.pm:
##################################
# Filter used with select_sdf -perlfile: returns true if the record is kept.
sub is_sdf_record_kept
{
    my $sdf_entry = shift;
    my $record_number = shift;   # not used by this filter
    defined $sdf_entry || die "Assertion failed";
    my $value = $sdf_entry->data_for_field_name("NSC");
    defined $value || die "Assertion failed: undefined property";
    # print STDERR $value, "\n";
    return $value < 900000;   # Keep NSC's < 900000
}
1;
##################################
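For readers who do not use Perl, the effect of this filter can be sketched with awk alone. This is not how select_sdf works internally. It is only an illustration, and it assumes that each record ends with a line starting with "$$$$" and that the NSC value sits on the line right after a "> <NSC>" header. The function name keep_nsc_below and the sample records are hypothetical. Unlike the Perl version, which dies on a record without NSC, this sketch silently drops such a record.

```shell
# keep_nsc_below LIMIT: read an SD file on stdin and write only the
# records whose NSC value is below LIMIT to stdout.
keep_nsc_below() {
  awk -v limit="$1" '
    { rec = rec $0 "\n" }                       # buffer every line of the current record
    want { nsc = $0 + 0; have = 1; want = 0 }   # the line after the "> <NSC>" header holds the value
    /^> *<NSC>/ { want = 1 }
    /^\$\$\$\$/ {                               # "$$$$" closes an SD record
      if (have && nsc < limit) printf "%s", rec
      rec = ""; have = 0; want = 0
    }
  '
}

# Two made-up records: NSC 123 is kept, NSC 900005 is removed.
sample=$(mktemp)
printf '%s\n' 'mol A' '' '' 'M  END' '> <NSC>' '123' '' '$$$$' \
              'mol B' '' '' 'M  END' '> <NSC>' '900005' '' '$$$$' > "$sample"
keep_nsc_below 900000 < "$sample"
```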
cat open_397.mol 689_aug99_0D.sdf 2212_oct99_0D.sdf | select_sdf -perlfile remove_900000.pm > temp.sdf
cat open_397.sdf 689_aug99_3D.sdf 2212_oct99_3D.sdf | sort_sdf -prop NSC > nciopen_LMCH_oct99_3D.sdf
nciscreen2csv < cancer_screened_gi50_a99 > cancer_screened_gi50_a99.csv
nciscreen2csv < cancer_screened_lc50_a99 > cancer_screened_lc50_a99.csv
nciscreen2csv < cancer_screened_tgi_a99 > cancer_screened_tgi_a99.csv
add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv -noskip -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv -noskip -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -noskip -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_ec50_oct99.csv -noskip -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_ic50_oct99.csv -noskip -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_conc_oct99.csv -noskip -silent > nciopen_LMCH_oct99_2D_AIDS_cancer.sdf
-perlclass is a special option of add_prop_sdf. Its argument, NCI_screen, is the name of a customized Perl class derived from the class that processes standard CSV (comma-separated values) table files. Its purpose is to reformat the biological data. See the file NCI_screen.pm in the toolkit (this will probably interest only Perl 5 programmers).
-noskip instructs add_prop_sdf to keep all entries, even those for which no biological data is available.
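The difference between keeping and skipping entries without data can be illustrated on toy input. The awk sketch below is not the add_prop_sdf implementation and does not touch SD files. It only mimics the row selection described above, with a list of NSC numbers standing in for the structure file. The function name match_table, the file names and all values are hypothetical.

```shell
# match_table MODE IDS TABLE
#   MODE "keep" behaves like -noskip: every id is written, with an empty value if unmatched.
#   MODE "skip" writes only the ids that have a row in TABLE.
match_table() {
  awk -F, -v mode="$1" '
    NR == FNR { value[$1] = $2; next }               # first file: table rows keyed by NSC
    ($1 in value) { print $1 "," value[$1]; next }   # entry with data: always kept
    mode == "keep" { print $1 "," }                  # entry without data: kept only in keep mode
  ' "$3" "$2"
}

# Made-up data: three NSC numbers, only two of which have a value.
demo=$(mktemp -d)
printf '%s\n' 1 2 3 > "$demo/ids.txt"
printf '%s\n' '1,5.2' '3,4.8' > "$demo/table.csv"

match_table keep "$demo/ids.txt" "$demo/table.csv"   # three lines, NSC 2 with an empty value
match_table skip "$demo/ids.txt" "$demo/table.csv"   # two lines, NSC 2 dropped
```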
add_prop_sdf < nciopen_LMCH_oct99_2D.sdf -match NSC -table cancer_screened_gi50_a99.csv -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table cancer_screened_lc50_a99.csv -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table cancer_screened_tgi_a99.csv -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_ec50_oct99.csv -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_ic50_oct99.csv -perlclass NCI_screen -silent \
  | add_prop_sdf -match NSC -table aids_conc_oct99.csv -silent > nciopen_LMCH_oct99_2D_all_have_AIDS_cancer.sdf
The -perlclass NCI_screen option is used as in the previous command. The -noskip option is not used, so entries for which biological data is not available are not kept.
Bruno Bienfait 1-11-2000
Last Update: 2024-02-09
Center for Cancer Research
National Cancer Institute
National Institutes of Health
Department of Health and Human Services