
I am running a training job using MLflow and an MLflow recipe. In the recipe.train step, the training starts an experiment run and trains for 350 epochs. After the 350 epochs complete and the artifacts are being logged, the process gets stuck for a long time and I repeatedly get this error:

ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
 File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
 readable, writable, errored = select.select(
 ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
 File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
 readable, writable, errored = select.select(
 ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

During this time, CPU usage reaches almost 100% (98% of it system time), and after an hour or so the recipe.train() step fails with: Fatal error: The Python kernel is unresponsive.

Throughout the training step itself, CPU and GPU usage mostly stay below 40%. I am also using the Databricks recipe to log the regular artifacts as part of the experiment.

Has anyone faced this issue? Please let me know if any logs would help identify the root cause.

President James K. Polk
asked Jun 2, 2025 at 3:08

1 Answer


It looks like you are logging too many artifacts, so the process accumulates more open file descriptors than select() can handle: select() raises "filedescriptor out of range" as soon as any descriptor number reaches FD_SETSIZE (typically 1024), regardless of the process's overall file descriptor limit.
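To check whether descriptor exhaustion is actually what you are hitting, you can compare the process's limit against the number of currently open descriptors. This is a Linux-only sketch (it reads /proc), which matches a Databricks driver environment:

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process. Note that
# select() fails once any FD number reaches FD_SETSIZE (typically 1024),
# even if the soft limit here is much higher.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"FD limit: soft={soft}, hard={hard}")

# Count currently open file descriptors via the Linux /proc interface.
open_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
print(f"Open file descriptors: {open_fds}")
```

If open_fds climbs toward 1024 while artifacts are being logged, that confirms the diagnosis.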

Here are two ways you can try to resolve the issue:

  1. Reduce the frequency of artifact logging. For example, log only every 10 epochs:

     if epoch % 10 == 0:
         mlflow.log_artifact("your_artifact_file.xxx")

  2. Upgrade to a newer version of the Py4J library (and the Spark runtime, if applicable), in which the code that waits for socket events uses poll() instead of select(). poll() has no FD_SETSIZE ceiling, so it avoids this error.
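A variation on option 1 that cuts per-epoch overhead even further: write artifacts to a local directory during training and upload the whole directory in a single mlflow.log_artifacts() call after the loop. The directory and checkpoint file names below are placeholders, and the training step is elided:

```python
import os
import tempfile

# Write per-epoch artifacts locally during training, then upload once
# at the end instead of making 350 per-epoch logging calls.
artifact_dir = tempfile.mkdtemp(prefix="artifacts_")
num_epochs = 350
log_every = 10  # keep only every 10th epoch to limit FD/socket churn

logged_epochs = []
for epoch in range(num_epochs):
    # ... training step ...
    if epoch % log_every == 0:
        path = os.path.join(artifact_dir, f"checkpoint_{epoch}.txt")
        with open(path, "w") as f:
            f.write(f"epoch {epoch}")
        logged_epochs.append(epoch)

# One upload at the end of training:
# mlflow.log_artifacts(artifact_dir)  # requires an active MLflow run
print(f"Wrote {len(logged_epochs)} checkpoints")
```

Each mlflow.log_artifact() call involves socket/file activity through the gateway, so collapsing them into one batched upload directly reduces descriptor pressure.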

Also, make sure mlflow.end_run() is called after training, or wrap the run in a with mlflow.start_run(): block, so that all resources are released even if training fails partway.
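The context-manager form guarantees the run is closed even when training raises. A minimal sketch of why, using a stand-in for mlflow.start_run() (the real call returns an ActiveRun object that is also a context manager and calls mlflow.end_run() on exit):

```python
import contextlib

# Stand-in for mlflow.start_run(): models a run that must always be
# ended so its sockets and file handles are released.
@contextlib.contextmanager
def start_run():
    state = {"active": True}
    try:
        yield state
    finally:
        state["active"] = False  # end_run() equivalent: always executed

run_state = None
try:
    with start_run() as run_state:
        raise RuntimeError("training crashed at epoch 200")
except RuntimeError:
    pass

print(run_state["active"])  # the run was closed despite the exception
```

Without the with block (or an explicit try/finally around mlflow.end_run()), a crash mid-training can leave the run open and its descriptors leaked.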

answered Jun 11, 2025 at 7:55