I am running a training job using an MLflow recipe. In the recipe.train step, the training starts an experiment and runs for 350 epochs. After the 350 epochs are completed and I try to log the artifacts, the process gets stuck for a long time and this error is printed repeatedly:
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
During this time, the CPU usage reaches almost 100% (98% of which is from system usage), and after an hour or so the recipe.train() step fails with
Fatal error: The Python kernel is unresponsive.
Throughout the training step itself, CPU and GPU usage mostly stay below 40%. I am also using the Databricks recipe to log the regular artifacts as part of the experiment.
Has anyone faced the above issue? Please let me know if any logs would help in identifying the real problem.
1 Answer
It looks like you are logging a large number of artifacts, which leaves so many file descriptors open that their numbers exceed the limit select() can handle (FD_SETSIZE, typically 1024). That is why the Py4J gateway keeps failing inside select.select().
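If you want to confirm this before the failure, you can watch the number of open descriptors in the driver process during training. The sketch below is Linux-only (it reads /proc/self/fd), and the 1024 threshold is just the usual FD_SETSIZE default, not something reported by MLflow:

    import os

    def open_fd_count() -> int:
        # Count the file descriptors currently open in this process (Linux only).
        return len(os.listdir("/proc/self/fd"))

    # Call this periodically, e.g. once per epoch or before each artifact upload.
    # Once descriptor numbers approach 1024, select()-based code such as the
    # Py4J gateway starts raising "filedescriptor out of range in select()".
    if open_fd_count() > 900:
        print("Warning: open file descriptor count is approaching the select() limit")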
Here are two ways you can try to resolve the issue:
- Reduce the frequency of artifact logging. For example, log only every 10 epochs (a variation that batches everything into a single upload is sketched after this list):

      if epoch % 10 == 0:
          mlflow.log_artifact("your_artifact_file.xxx")
- Upgrade to a newer version of the Py4J library (and the Spark runtime, if applicable), where the underlying system call used to wait for socket events has been changed from select() to poll(). This helps avoid the file descriptor limit that select() imposes.
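Building on the first option, you can avoid per-epoch uploads entirely by writing artifacts to a local staging directory during training and uploading the directory once at the end with mlflow.log_artifacts(). This is only a sketch; the directory name and the commented-out save_checkpoint helper are placeholders, not part of the recipe API:

    import os
    import mlflow

    ARTIFACT_DIR = "training_artifacts"  # hypothetical local staging directory
    os.makedirs(ARTIFACT_DIR, exist_ok=True)

    # During training, write checkpoints/plots to local disk only, e.g.:
    # save_checkpoint(os.path.join(ARTIFACT_DIR, f"epoch_{epoch:03d}.ckpt"))

    # After training, one upload call replaces hundreds of per-epoch calls,
    # keeping the number of Py4J socket round-trips (and open descriptors) small.
    mlflow.log_artifacts(ARTIFACT_DIR)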
Also, consider calling mlflow.end_run() after training to ensure all resources are properly released.
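If you manage the run yourself rather than letting the recipe own it, wrapping training in a with block is an easy way to guarantee the run is closed even when training raises; the run name and logged metric below are purely illustrative:

    import mlflow

    # The context manager calls mlflow.end_run() on exit, including when an
    # exception escapes the block, so resources are always released.
    with mlflow.start_run(run_name="recipe-train"):
        # run_training_loop(...)   # your 350-epoch training loop goes here
        mlflow.log_metric("epochs", 350)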