I want to convert a text file into JSON Lines format using Python. This needs to work for a text file of any length (in characters or words).

As an example, I want to convert the following text:

A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. 
These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.

To this:

{"text": "A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so."}
{"text": "These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics."}

I tried this:

text = ""
with open("text.txt", encoding="utf8") as f:
    for line in f:
        text = {"text": line}

But no luck.

asked Dec 28, 2021 at 1:05
  • So you mean you want to iterate over your text lines, put each in a dictionary using "text" as the key, convert it to JSON and append it to a file? Commented Dec 28, 2021 at 1:12
  • Something like this could work. I'll need to save it as a .jsonl though Commented Dec 28, 2021 at 1:13
  • Then use a filename ending in .jsonl when opening a file for writing. Commented Dec 28, 2021 at 1:14
  • So at which of these steps was there a problem? Commented Dec 28, 2021 at 1:15
  • I'm not sure how to iterate over the text lines as you've mentioned. Commented Dec 28, 2021 at 1:17

2 Answers

The basic idea of your for loop was correct, but the line text = {"text": line} overwrites the previous value on every iteration, whereas what you want is to build up a list of lines.

Try the following:

import json

# Generate a list of dictionaries, one per non-blank line
lines = []
with open("text.txt", encoding="utf8") as f:
    for line in f.read().splitlines():
        if line:
            lines.append({"text": line})

# Convert to a list of JSON strings
json_lines = [json.dumps(entry) for entry in lines]

# Join lines and save to .jsonl file
json_data = '\n'.join(json_lines)
with open('my_file.jsonl', 'w', encoding="utf8") as f:
    f.write(json_data)

splitlines removes the \n characters and if line: ignores blank lines.
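
Since the question asks for something that works on a file of any length, the same idea can also be written as a streaming loop that writes each record as soon as it is read, instead of holding the whole file in memory. The following is only a sketch under some assumptions: the function name text_to_jsonl and its parameters are made up for illustration, surrounding whitespace is stripped from each line, and ensure_ascii=False is passed so non-ASCII characters are written as-is rather than as \u escapes.

```python
import json


def text_to_jsonl(src_path, dst_path):
    """Write one {"text": ...} JSON object per non-blank line of src_path.

    Hypothetical helper for illustration; returns the number of records written.
    """
    count = 0
    with open(src_path, encoding="utf8") as src, \
            open(dst_path, "w", encoding="utf8") as dst:
        for line in src:  # reads one line at a time, so file size does not matter
            text = line.strip()  # assumption: drop the newline and surrounding spaces
            if text:  # skip blank lines
                dst.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Calling text_to_jsonl("text.txt", "my_file.jsonl") should produce the same kind of output as the code above, with a trailing newline after the last record.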

answered Dec 28, 2021 at 17:29

1 Comment

This works really well thank you!

A hacky way of doing this is to paste the text file into a CSV. Make sure to write "text" in the first cell of the CSV so it becomes the column header, then use this code:

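import pandas as pd

# knowledge and knowledge_jsonl are assumed to hold the input CSV path
# and the output .jsonl path; they are not defined in this snippet
df = pd.read_csv(knowledge)
df.to_json(knowledge_jsonl,
           orient="records",
           lines=True)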

Not ideal but it works.

answered Dec 28, 2021 at 16:29
