Memory consumption issue #53

Open

Opened on Mar 31, 2022

I ran the following command:

bash run.sh train RotatE FB15k-237 0 0 1024 256 1000 9.0 1.0 0.00005 100000 16 -de

to train RotatE on an 11 GB GPU. I made sure the GPU was completely free before starting.
Even so, I get the following error:

2022-03-31 19:32:37,370 INFO negative_adversarial_sampling = False
2022-03-31 19:32:37,370 INFO learning_rate = 0
2022-03-31 19:32:39,079 INFO Training average positive_sample_loss at step 0: 5.635527
2022-03-31 19:32:39,079 INFO Training average negative_sample_loss at step 0: 0.003591
2022-03-31 19:32:39,079 INFO Training average loss at step 0: 2.819559
2022-03-31 19:32:39,079 INFO Evaluating on Valid Dataset...
2022-03-31 19:32:39,552 INFO Evaluating the model... (0/2192)
2022-03-31 19:33:38,650 INFO Evaluating the model... (1000/2192)
2022-03-31 19:34:38,503 INFO Evaluating the model... (2000/2192)
2022-03-31 19:34:49,981 INFO Valid MRR at step 0: 0.005509
2022-03-31 19:34:49,982 INFO Valid MR at step 0: 6894.798660
2022-03-31 19:34:49,982 INFO Valid HITS@1 at step 0: 0.004733
2022-03-31 19:34:49,982 INFO Valid HITS@3 at step 0: 0.005076
2022-03-31 19:34:49,982 INFO Valid HITS@10 at step 0: 0.005646
Traceback (most recent call last):
  File "codes/run.py", line 371, in <module>
    main(parse_args())
  File "codes/run.py", line 315, in main
    log = kge_model.train_step(kge_model, optimizer, train_iterator, args)
  File "/home/prachi/related_work/KnowledgeGraphEmbedding/codes/model.py", line 315, in train_step
    loss.backward()
  File "/home/prachi/anaconda3/envs/py36/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/prachi/anaconda3/envs/py36/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 10.92 GiB total capacity; 7.41 GiB already allocated; 1.51 GiB free; 1.52 GiB cached)
run.sh: line 79: 
CUDA_VISIBLE_DEVICES=$GPU_DEVICE python -u $CODE_PATH/run.py --do_train \
 --cuda \
 --do_valid \
 --do_test \
 --data_path $FULL_DATA_PATH \
 --model $MODEL \
 -n $NEGATIVE_SAMPLE_SIZE -b $BATCH_SIZE -d $HIDDEN_DIM \
 -g $GAMMA -a $ALPHA -adv \
 -lr $LEARNING_RATE --max_steps $MAX_STEPS \
 -save $SAVE --test_batch_size $TEST_BATCH_SIZE \
 ${14} ${15} ${16} ${17} ${18} ${19} ${20}
: No such file or directory

I get similar out-of-memory errors when I try to train on FB15k using the command from the best_config.sh file.
When I reduced the batch size to 500, training ran, but the performance is much lower than the numbers reported in the paper.

I am not sure what the issue is.
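
For reference, here is a rough back-of-the-envelope check of the 1.95 GiB allocation in the traceback. It is only a sketch under assumptions that the log does not confirm: that the positional arguments 1024, 256 and 1000 are the batch size, negative sample size and hidden dimension, that -de doubles the entity embedding dimension, and that the failing allocation is a single float32 tensor of shape (batch size, negative sample size, entity dimension). The helper `tensor_gib` is a hypothetical name used only for this illustration.

```python
def tensor_gib(*shape, bytes_per_element=4):
    """Size in GiB of a dense tensor with the given shape (float32 by default)."""
    total_bytes = bytes_per_element
    for dim in shape:
        total_bytes *= dim
    return total_bytes / 2**30

# Assumed mapping of the positional arguments in the command above.
batch_size = 1024
negative_sample_size = 256
hidden_dim = 1000
entity_dim = hidden_dim * 2  # assumption: -de doubles the entity embedding dimension

full = tensor_gib(batch_size, negative_sample_size, entity_dim)
reduced = tensor_gib(500, negative_sample_size, entity_dim)

print(f"batch size 1024: {full:.2f} GiB per tensor")    # 1.95 GiB
print(f"batch size 500:  {reduced:.2f} GiB per tensor")  # 0.95 GiB
```

If these assumptions hold, one such tensor already accounts for the 1.95 GiB request, and the backward pass needs several tensors of this size at once. Since the size scales linearly with both batch size and negative sample size, that would be consistent with the run fitting in memory at batch size 500.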
