【Wikidata preprocess_dump error】AttributeError when closing writer during data preprocessing #32

Open
@YYForReal

Description

I encountered an AttributeError during the data preprocessing step. The error occurs after the process has successfully written over 112 million lines. Here is the terminal output:

...
112000000 lines written in 3.91s. 
112200000 lines written in 4.43s. 
112400000 lines written in 4.69s. 
Done! Read 112473858 lines
Process Process-2:
Traceback (most recent call last):
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/szu/miniconda3/envs/tog/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 89, in write_data
    writer.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 79, in close
    v.close()
  File "/home/szu/code/kg_project/ToG/Wikidata/simple_wikidata_db/preprocess_utils/writer_process.py", line 51, in close
    self.cur_file_writer.close()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'close'
Finished processing 112473858 in 4191.5129499435425s
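
The traceback shows that close() was reached while cur_file_writer was still None, which would happen if a writer never opened an output file or had already released it. I have not confirmed the cause in writer_process.py; the following is only a minimal sketch of a guarded close. ShardWriter and the 0.jsonl file name are hypothetical stand-ins, not the repository's actual class or naming.

```python
import os


class ShardWriter:
    """Hypothetical stand-in for a per-table writer (not the repository's class)."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.cur_file_writer = None  # opened lazily on the first write

    def write(self, line):
        # Open the output file only when there is something to write.
        if self.cur_file_writer is None:
            path = os.path.join(self.out_dir, "0.jsonl")  # assumed file name
            self.cur_file_writer = open(path, "w", encoding="utf-8")
        self.cur_file_writer.write(line + "\n")

    def close(self):
        # Guard: skip the close when no file was ever opened (or it was
        # already closed), instead of calling .close() on None.
        if self.cur_file_writer is not None:
            self.cur_file_writer.close()
            self.cur_file_writer = None
```

With this guard, calling close() on a writer that received no rows is a no-op rather than an AttributeError.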

Reproduction Steps:

  1. Run the data preprocessing script:
     python3 preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./wiki_process
  2. Observe the terminal output as the script processes the data.

Additional Questions:

  • I am not sure if this error affects subsequent steps. Can someone confirm?
  • The disk still has free space, but is 1024 GB of space really required? Can the process be tested on a subset of the dataset?
  • If a subset can be used for testing, could you please provide instructions on how to do so?
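
For reference, one way I could imagine building a subset myself: the Wikidata JSON dump is a single JSON array with one entity per line, so copying the first N entity lines into a new .gz file should give a smaller dump with the same layout. This is only a sketch; write_dump_subset is a hypothetical helper, and I am assuming that preprocess_dump.py accepts such a truncated file.

```python
import gzip


def write_dump_subset(src_path, dst_path, num_entities):
    """Copy the first num_entities entities of a gzipped Wikidata JSON dump.

    The output keeps the dump's layout: "[" on the first line, one entity
    per line separated by trailing commas, and "]" on the last line.
    Returns the number of entities actually written.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.write("[\n")
        for line in src:
            if kept >= num_entities:
                break
            # Drop the newline and the trailing comma that separates entities.
            entity = line.strip().rstrip(",")
            if entity in ("", "[", "]"):
                continue  # skip the array brackets and blank lines
            if kept:
                dst.write(",\n")
            dst.write(entity)
            kept += 1
        dst.write("\n]\n")
    return kept
```

For example, write_dump_subset("./latest-all.json.gz", "./subset.json.gz", 10000) followed by python3 preprocess_dump.py --input_file ./subset.json.gz --out_dir ./wiki_subset would exercise the pipeline on 10,000 entities, if the assumption above holds.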

Thank you in advance for your assistance and for any insights you may offer regarding these queries. Your help is greatly appreciated.
