I have a .txt file that looks like this:
SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH
SHT1 E: T1:30.53°C H1:58.69 %RH
SHT2 S: T2:29.95°C H2:67.22 %RH
I want to have a DataFrame that looks like this:
T1 H1 T2 H2
0 30.45 59.14 29.93 67.38
1 30.49 58.87 29.94 67.22
2 30.53 58.69 29.95 67.22
I parse this by:
- Reading the text file line by line
- Parsing the lines, e.g. matching only the parts with T1, T2, H1, and H2, splitting on ":", and removing "°C" and "%RH"
- The above produces a list of lists, each having two items
- Flattening the list of lists
- Just to chop it up into a list of four-item lists
- Dumping that into a df
- Writing it to an Excel file
Here's the code:
import itertools
import pandas as pd

def read_lines(file_object) -> list:
    return [
        parse_line(line) for line in file_object.readlines() if line.strip()
    ]

def parse_line(line: str) -> list:
    return [
        i.split(":")[-1].replace("°C", "").replace("%RH", "")
        for i in line.strip().split()
        if i.startswith(("T1", "T2", "H1", "H2"))
    ]

def flatten(parsed_lines: list) -> list:
    return list(itertools.chain.from_iterable(parsed_lines))

def cut_into_pieces(flattened_lines: list, piece_size: int = 4) -> list:
    return [
        flattened_lines[i:i + piece_size]
        for i in range(0, len(flattened_lines), piece_size)
    ]

with open("your_text_data.txt") as data:
    df = pd.DataFrame(
        cut_into_pieces(flatten(read_lines(data))),
        columns=["T1", "H1", "T2", "H2"],
    )
    print(df)
    df.to_excel("your_table.xlsx", index=False)
This works and I get what I want, but I feel like points 3, 4, and 5 are a bit of redundant work, especially creating a list of lists just to flatten it and then chop it up again.
Question:
How could I simplify the whole parsing process? Or maybe most of the heavy lifting can be done with pandas alone?
Also, any other feedback is more than welcomed.
2 Answers
Disclaimer: I know this is a very liberal interpretation of a code review since it suggests an entirely different approach. I still thought it might provide a useful perspective when thinking about such problems in the future and reducing coding effort.
I would suggest the following approach using regex to extract all the numbers that match the format "12.34".
import re
import pandas as pd
with open("your_text_data.txt") as data_file:
    data_list = re.findall(r"\d\d\.\d\d", data_file.read())

result = [data_list[i:i + 4] for i in range(0, len(data_list), 4)]
df = pd.DataFrame(result, columns=["T1", "H1", "T2", "H2"])
print(df)
df.to_excel("your_table.xlsx", index=False)
This will of course only work for the current data format you provided. The code will need to be adjusted if the format of your data changes. For example: if relevant numbers may contain a varying number of digits, you might use the regex "\d+\.\d+" to match all numbers that contain at least one digit on either side of the decimal point.
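A quick check of the difference between the two patterns (the sample string here is my own, chosen to show the edge cases):

```python
import re

sample = "T1:9.5°C H1:59.145 %RH"

# The fixed-width pattern misses 9.5 entirely and truncates 59.145.
print(re.findall(r"\d\d\.\d\d", sample))  # ['59.14']

# The flexible pattern captures both numbers in full.
print(re.findall(r"\d+\.\d+", sample))    # ['9.5', '59.145']
```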
Also, please note the use of the context manager (with open(...) as x:). Only code that accesses the file object needs to be, and should be, part of the managed context.
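A minimal sketch of that scoping, with io.StringIO standing in for the opened file so the snippet is self-contained:

```python
import io
import re

# Stand-in for open("your_text_data.txt"); real code would use open().
data_file = io.StringIO("SHT1 E: T1:30.45°C H1:59.14 %RH\n")
with data_file:
    text = data_file.read()  # only the read happens inside the context

# The handle is already closed here; all parsing happens outside the block.
print(data_file.closed)                 # True
print(re.findall(r"\d\d\.\d\d", text))  # ['30.45', '59.14']
```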
- I absolutely don't mind that you've offered a new approach. I totally forgot about regex, I was so much into those lists of lists. This is short, simple, and does the job. Nice! Thank you for your time and insight. – baduker, Mar 26, 2021 at 20:36
- PS. You've got your imports the other way round: re should be first and then pandas. – baduker, Mar 26, 2021 at 20:37
- You're right, I fixed the import order! – riskypenguin, Mar 26, 2021 at 23:01
You can use numpy.loadtxt() to read the data and numpy.reshape() to get the shape you want. The default is to split on whitespace, with a dtype of float. usecols selects the columns we want. converters is a dict mapping column numbers to functions that convert the column data; here they chop off the unwanted text. The .reshape() call converts the resulting numpy array from two columns to four columns (the -1 lets numpy calculate the number of rows).
import numpy as np
import pandas as pd

with open("your_text_data.txt") as src:
    data = np.loadtxt(src,
                      usecols=(2, 3),
                      converters={2: lambda s: s[3:-2], 3: lambda s: s[3:]}
                      ).reshape(-1, 4)
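The -1 can be checked on a toy array built from the sample values:

```python
import numpy as np

flat = np.array([30.45, 59.14, 29.93, 67.38,
                 30.49, 58.87, 29.94, 67.22])

# -1 tells numpy to infer the row count: 8 values / 4 columns = 2 rows.
table = flat.reshape(-1, 4)
print(table.shape)  # (2, 4)
```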
Then just load it in a dataframe and name the columns:
df = pd.DataFrame(data, columns='T1 H1 T2 H2'.split())
print(df)
Output:
T1 H1 T2 H2
0 30.45 59.14 29.93 67.38
1 30.49 58.87 29.94 67.22
2 30.53 58.69 29.95 67.22
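As an aside, the "pandas alone" route the question asks about can be sketched with Series.str.extractall. This is my own variation, not part of either answer; the inline string mirrors the sample data so the snippet runs on its own:

```python
import pandas as pd

raw = """SHT1 E: T1:30.45°C H1:59.14 %RH
SHT2 S: T2:29.93°C H2:67.38 %RH
SHT1 E: T1:30.49°C H1:58.87 %RH
SHT2 S: T2:29.94°C H2:67.22 %RH"""

# Pull every 12.34-style number out of the text with pandas' own regex
# machinery, then reshape the flat column into four-wide rows.
values = pd.Series([raw]).str.extractall(r"(\d+\.\d+)")[0].astype(float)
df = pd.DataFrame(values.to_numpy().reshape(-1, 4),
                  columns=["T1", "H1", "T2", "H2"])
print(df)
```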