Launches git cat-file processes which never die #1209
I'm using this package to clone many repos. It's part of a utility which clones all repos under a GitLab group.
While running it I noticed over 6k git processes (and growing). See output below.
They're all git cat-file, some with --batch-check and some with --batch.
Doing a search of this repo for batch_check I found these lines:
cmd = self._get_persistent_cmd("cat_file_header", "cat_file", batch_check=True)
# and ...
cmd = self._get_persistent_cmd("cat_file_all", "cat_file", batch=True)
I guess the emphasis is on persistent, right?
Are these processes meant to never die? What purpose do they serve?
Gather, count, and look at the procs
[ec2-user@ip-10-10-10-10 ~]$ ps -eaf | grep git > ~/git_procs
[ec2-user@ip-10-10-10-10 ~]$ wc -l ~/git_procs
6086 /home/ec2-user/git_procs
[ec2-user@ip-10-10-10-10 ~]$ head ~/git_procs && tail ~/git_procs
ec2-user 306 21895 0 23:22 pts/1 00:00:00 git cat-file --batch-check
ec2-user 309 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 310 21895 0 22:46 pts/1 00:00:00 git cat-file --batch-check
ec2-user 312 21895 0 21:50 pts/1 00:00:00 git cat-file --batch
ec2-user 321 21895 0 23:22 pts/1 00:00:00 git cat-file --batch-check
ec2-user 323 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 325 21895 0 22:46 pts/1 00:00:00 git cat-file --batch-check
ec2-user 326 21895 0 23:22 pts/1 00:00:00 git cat-file --batch
ec2-user 333 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 335 21895 0 23:22 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32731 21895 0 22:46 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32732 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32735 21895 0 21:50 pts/1 00:00:00 git cat-file --batch
ec2-user 32741 21895 0 23:22 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32746 21895 0 22:46 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32748 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32750 21895 0 21:50 pts/1 00:00:00 git cat-file --batch
ec2-user 32759 21895 0 23:22 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32762 21895 0 21:50 pts/1 00:00:00 git cat-file --batch-check
ec2-user 32764 21895 0 22:46 pts/1 00:00:00 git cat-file --batch-check
I think it's good to start off this discussion with a reference to known resource leakage and ways to fix it.
Something worth investigating here is whether the same repository is somehow creating multiple git commands, leaving 'zombies' of previous invocations as child processes or, worse, detached from their parent process (python).
If you think that is probably not the case, then it's worth trying to explicitly call the destructor on a Repo instance once you are done with it.
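As a rough sketch of what "cleaning up once you are done" could look like when walking many repositories: the helper below closes each repository object before opening the next one. The names for_each_repo, open_repo and action are hypothetical and only for illustration; the sketch assumes the object returned by open_repo has a close() method, as GitPython's Repo does.

```python
from contextlib import closing


def for_each_repo(paths, open_repo, action):
    """Open one repository at a time, run `action` on it, then close it.

    `open_repo` is any callable returning an object with a close() method
    (for example git.Repo); `action` is a callable taking that object.
    """
    results = []
    for path in paths:
        # closing() calls repo.close() on exit, even if `action` raises,
        # so resources held by one repo are released before the next opens.
        with closing(open_repo(path)) as repo:
            results.append(action(repo))
    return results
```

With GitPython installed this could be called as for_each_repo(paths, git.Repo, some_function), where some_function is whatever you need to do with each repository.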
Thanks for the support. I confirmed that calling del() on each of the repos does in fact clean everything up.
It is an interesting choice to have persistent subprocesses. I'm guessing this was done for performance reasons... to re-use an existing process. I may dig into the code more if I find myself curious.
Fortunately for my use case, while this process does run for a long time, it's not a daemon. It'll be run periodically.
Thanks again for the quick reply. Next time I'll RTFReadme
"It is an interesting choice to have persistent subprocesses. I'm guessing this was done for performance reasons... to re-use an existing process. I may dig into the code more if I find myself curious."
That's true; these are used to read object headers and object data, respectively. This can also cause surprising behaviour if objects are cached, though, so it's better not to do that and to read one object at a time in GitPython.
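To illustrate the general idea of a persistent, reusable child process (this is not GitPython's actual implementation, just a sketch of the pattern): the hypothetical PersistentCommand class below starts a line-oriented command once, sends it one request per line, and reads one response line back. With git, the command could be ["git", "cat-file", "--batch-check"], which reads object names on stdin and answers with one header line per object. The process stays alive until close() is called, which is why forgetting to close leaves it running.

```python
import subprocess


class PersistentCommand:
    """Keep one line-oriented child process alive and reuse it for many queries."""

    def __init__(self, argv):
        # Start the child once; every query() call reuses the same pipes.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def query(self, line):
        # Send one request line and read exactly one response line.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end-of-input, so the child can exit.
        if not self.proc.stdin.closed:
            self.proc.stdin.close()
        self.proc.wait()
        if not self.proc.stdout.closed:
            self.proc.stdout.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

Used in a with block, the child is started once, serves any number of query() calls, and is shut down when the block exits.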
I noticed a repo acts as a context manager via __enter__ and __exit__. In fact, __exit__ calls self.close(), same as the __del__() you recommended calling.
Question... is it safe to re-use existing repo objects more than once or should I create a new one each time?
Create and re-use just one Repo object
for path in paths:
    repo = git.Repo(path)
    with repo:
        repo.some_operation()
    # ... other code
    with repo:
        repo.another_operation()
Create Repo objects as needed
for path in paths:
    with git.Repo(path) as repo:
        repo.some_operation()
    # ... other code
    with git.Repo(path) as repo:
        repo.another_operation()