I am running a training job using an MLflow recipe. In the recipe.train step, the training starts an experiment and runs for 350 epochs. After the 350 epochs are completed and I try to log the artifacts, the process gets stuck for a long time and this error is printed repeatedly:
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
During this time, the CPU usage reaches almost 100% (98% of which is from system usage), and after an hour or so the recipe.train() step fails with
Fatal error: The Python kernel is unresponsive.
Throughout the training step itself, CPU and GPU usage mostly stay below 40%. I am also using the Databricks recipe to log the regular artifacts as part of the experiment.
Has anyone faced the above issue? Please let me know if any logs would help in identifying the real problem.
1 Answer
It looks like you are logging a large number of artifacts, which leaves so many file descriptors open that their numbers exceed the limit select() can handle (FD_SETSIZE, typically 1024). That is why the Py4J gateway keeps failing inside select.select().
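If you want to confirm this before the failure, you can watch the number of open descriptors in the driver process during training. The sketch below is Linux-only (it reads /proc/self/fd), and the 1024 threshold is just the usual FD_SETSIZE default, not something reported by MLflow:

    import os

    def open_fd_count() -> int:
        # Count the file descriptors currently open in this process (Linux only).
        return len(os.listdir("/proc/self/fd"))

    # Call this periodically, e.g. once per epoch or before each artifact upload.
    # Once descriptor numbers approach 1024, select()-based code such as the
    # Py4J gateway starts raising "filedescriptor out of range in select()".
    if open_fd_count() > 900:
        print("Warning: open file descriptor count is approaching the select() limit")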
Here are two ways you can try to resolve the issue:
- Reduce the frequency of artifact logging. For example, log only every 10 epochs (a variation that batches everything into a single upload is sketched after this list):

      if epoch % 10 == 0:
          mlflow.log_artifact("your_artifact_file.xxx")
- Upgrade to a newer version of the Py4J library (and the Spark runtime, if applicable), where the underlying system call used to wait for socket events has been changed from select() to poll(). This helps avoid the file descriptor limit that select() imposes.
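Building on the first option, you can avoid per-epoch uploads entirely by writing artifacts to a local staging directory during training and uploading the directory once at the end with mlflow.log_artifacts(). This is only a sketch; the directory name and the commented-out save_checkpoint helper are placeholders, not part of the recipe API:

    import os
    import mlflow

    ARTIFACT_DIR = "training_artifacts"  # hypothetical local staging directory
    os.makedirs(ARTIFACT_DIR, exist_ok=True)

    # During training, write checkpoints/plots to local disk only, e.g.:
    # save_checkpoint(os.path.join(ARTIFACT_DIR, f"epoch_{epoch:03d}.ckpt"))

    # After training, one upload call replaces hundreds of per-epoch calls,
    # keeping the number of Py4J socket round-trips (and open descriptors) small.
    mlflow.log_artifacts(ARTIFACT_DIR)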
Also, consider calling mlflow.end_run() after training to ensure all resources are properly released.
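If you manage the run yourself rather than letting the recipe own it, wrapping training in a with block is an easy way to guarantee the run is closed even when training raises; the run name and logged metric below are purely illustrative:

    import mlflow

    # The context manager calls mlflow.end_run() on exit, including when an
    # exception escapes the block, so resources are always released.
    with mlflow.start_run(run_name="recipe-train"):
        # run_training_loop(...)   # your 350-epoch training loop goes here
        mlflow.log_metric("epochs", 350)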