Problems in using FC dataset #61
Hello! Thank you for your great work on TorchDrug, GearNet, and ESM-GearNet!
Sorry to bother you. I'm trying to extract feature embeddings with GearNet (as discussed in several earlier issues) on the EC, GO, and FC datasets (as provided at https://zenodo.org/records/7593591). Unlike EC and GO, where the proteins are provided in PDB format, the proteins in FC are in HDF5 format, so I use the Fold3D class from GearNet (https://github.com/DeepGraphLearning/GearNet/blob/main/gearnet/dataset.py) to preprocess the data.
However, when I pass the resulting Protein objects to the GearNet network following the TorchDrug instructions, I get the following errors when running on GPU:
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
and then
RuntimeError: Error building extension 'torch_ext':
...
... ...site-packages/torchdrug/utils/extension/torch_ext.cpp:1:
/usr/include/features.h:424:12: fatal error: sys/cdefs.h: No such file or directory
424 | # include <sys/cdefs.h>
| ^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
When running on CPU, I get:
NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCPU' backend
I searched online for the cause of these errors but could not resolve them, since they appear to be environment-related. What I don't understand is why neither problem occurs when I use Protein.from_pdb() directly on EC and GO, yet both occur on FC, where I use your Fold3D class to obtain data.Protein instances as well.
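For the GPU build failure, the compiler message itself says that `sys/cdefs.h` could not be found while `/usr/include/features.h` was. A minimal sketch for checking whether that header is visible on the machine is given below. The helper name `find_header` and the list of candidate directories are assumptions made for illustration (the multiarch paths are typical of Debian/Ubuntu layouts and may not apply elsewhere); the compiler's real search path can differ.

```python
import os

# Assumed candidate include directories (hypothetical list for illustration).
# The multiarch entries are typical of Debian/Ubuntu; other systems may differ.
CANDIDATE_INCLUDE_DIRS = [
    "/usr/include",
    "/usr/include/x86_64-linux-gnu",
    "/usr/include/aarch64-linux-gnu",
]


def find_header(relative_path, include_dirs=CANDIDATE_INCLUDE_DIRS):
    """Return every existing file path for `relative_path` under `include_dirs`."""
    hits = []
    for include_dir in include_dirs:
        candidate = os.path.join(include_dir, relative_path)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


# Report where (if anywhere) the header from the error message is found.
found = find_header("sys/cdefs.h")
print(found if found else "sys/cdefs.h not found in the assumed include dirs")
```

An empty result would suggest that the C library development headers are not installed or not on the include path in this environment, which would be independent of which dataset is being loaded.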
For reference, my code is as follows:
    ...
    # graph
    graph_construction_model = layers.GraphConstruction(
        node_layers=[geometry.AlphaCarbonNode()],
        edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                     geometry.KNNEdge(k=10, min_distance=5),
                     geometry.SequentialEdge(max_distance=2)],
        edge_feature="gearnet")

    # model
    gearnet_edge = models.GearNet(input_dim=21, hidden_dims=[512, 512, 512, 512, 512, 512],
                                  num_relation=7, edge_input_dim=59, num_angle_bin=8,
                                  batch_norm=True, concat_hidden=True, short_cut=True, readout="sum")
    pthfile = 'models/mc_gearnet_edge.pth'
    net = torch.load(pthfile, map_location=torch.device(device))
    #print('torch succesfully load model')
    gearnet_edge.load_state_dict(net)
    gearnet_edge.eval()
    print('successfully load gearnet')

    def get_subdataset_rep(pdbs: list, proteins: list, subroot: str):
        for idx in range(0, len(pdbs), bs):
            # reformulate to batches
            pdb_batch = pdbs[idx : min(len(pdbs), idx + bs)]
            protein_batch = proteins[idx : min(len(pdbs), idx + bs)]
            # protein
            _protein = data.Protein.pack(protein_batch)
            _protein.view = "residue"
            print(_protein)
            final_protein = graph_construction_model(_protein)
            print(final_protein)
            # output
            with torch.no_grad():
                output = gearnet_edge(final_protein, final_protein.node_feature.float(), all_loss=None, metric=None)
            print(output['graph_feature'].shape, output['node_feature'].shape)
            counter = 0
            for idx in range(len(final_protein.num_residues)):
                # idx: protein/graph id in this batch
                this_graph_feature = output['graph_feature'][idx]
                this_node_feature = output['node_feature'][counter : counter + final_protein.num_residues[idx], :]
                print(this_graph_feature.shape, this_node_feature.shape)
                torch.save((this_graph_feature, this_node_feature), f"{subroot}/{os.path.splitext(pdb_batch[idx])[0].split('/')[-1]}.pt")
                counter += final_protein.num_residues[idx]
            break

    # get representations
    if args.task not in ['FC', 'fc']:
        for root in roots:
            pdbs = [os.path.join(root, i) for i in os.listdir(root)]
            proteins = []
            for pdb_file in pdbs:
                try:
                    protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
                    protein.view = "residue"
                    proteins.append(protein)
                except:
                    error_fn = os.path.basename(root) + '_' if args.task in ['EC', 'ec', 'GO', 'go'] else ''
                    with open(f"{error_path}/{args.task}_{error_fn}error.txt", "a") as f:
                        f.write(os.path.splitext(pdb_file)[0].split('/')[-1] + '\n')
                        f.close()
                if len(proteins) == bs:  # for debug
                    break
            subroot = os.path.join(output_dir, root.split('/')[-1]) if args.task in ['EC', 'ec', 'GO', 'go'] else output_dir
            get_subdataset_rep(pdbs, proteins, subroot)
            break
    else:
        transform = transforms.Compose([transforms.ProteinView(view='residue')])
        dataset = Fold3D(root, transform=transform)  #, atom_feature=None, bond_feature=None
        split_sets = dataset.split()  # train_set, valid_set, test_fold_set
        print('There are', len(split_sets), 'sets in total.')
        for split_set in split_sets:
            print(split_set.indices)
            this_slice = slice(list(split_set.indices)[0], (list(split_set.indices)[-1] + 1))
            this_pdbs, this_datas = dataset.pdb_files[this_slice], dataset.data[this_slice]
            #for fn, protein in zip(this_pdbs, this_datas):
            #    print(fn, protein)
            #    break
            get_subdataset_rep(this_pdbs, this_datas, os.path.join(output_dir, this_pdbs[0].split('/')[0]))
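Two bookkeeping steps in the code above can be sketched in isolation, without torch or TorchDrug: splitting the concatenated per-residue output back into one chunk per protein using a running offset, and turning a split's indices into a slice. This is only an illustrative sketch. The helper names `split_by_counts` and `contiguous_slice` are hypothetical, plain Python lists stand in for tensors, and the slice helper makes explicit the assumption that each split's indices form one contiguous ascending run.

```python
def split_by_counts(rows, counts):
    """Split a flat sequence of per-residue rows into one chunk per protein.

    `counts[i]` is the number of residues of protein i in the batch.
    """
    if sum(counts) != len(rows):
        raise ValueError("counts must sum to the number of rows")
    chunks, offset = [], 0
    for n in counts:
        chunks.append(rows[offset:offset + n])
        offset += n  # advance the running offset past this protein
    return chunks


def contiguous_slice(indices):
    """Return slice(first, last + 1), or raise if the indices have gaps."""
    indices = list(indices)
    if not indices:
        return slice(0, 0)
    if indices != list(range(indices[0], indices[-1] + 1)):
        raise ValueError("indices are not a contiguous ascending run")
    return slice(indices[0], indices[-1] + 1)


# Example: 6 residue rows belonging to 3 proteins with 2, 1, and 3 residues.
print(split_by_counts(list(range(6)), [2, 1, 3]))
print(contiguous_slice([3, 4, 5]))
```

If a split's indices were not contiguous, building the slice from only the first and last index would silently include items from other splits, so the explicit check is a cheap safeguard.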
Is there a way to solve these problems, or is my understanding of TorchDrug wrong? I sincerely look forward to your help. Thank you very much!