I have a dataframe with measurement data from different runs under the same conditions. Each row contains both the constant conditions under which the experiment was conducted and all the results from the different runs.
Since I am not able to provide a real dataset, the code snippet below generates some dummy data.
I was able to achieve the desired output, but my function transform_columns() seems unnecessarily complicated:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)), columns=['constant 1', 'constant 2', 1, 2, 3, 4])
def transform_columns(data):
    # Integer column labels hold run results; everything else is a condition.
    factor_columns = []
    response_columns = []
    for col in data:
        if isinstance(col, int):
            response_columns.append(col)
        else:
            factor_columns.append(col)
    # Emit one output row per (conditions, single measurement) pair.
    collected = []
    for index, row in data.iterrows():
        conditions = row.loc[factor_columns]
        data_values = row.loc[response_columns].dropna()
        for val in data_values:
            out = conditions.copy()
            out['value'] = val
            collected.append(out)
    return pd.DataFrame(collected).reset_index(drop=True)
print(transform_columns(df))
Is there any Pythonic or Pandas way to do this nicely?
2 Answers
It is probably easier to work with the underlying NumPy array directly than through pandas. Ensure that all factor columns come before all data columns; then this code will work:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
n_rows = 100
n_cols = 6
n_factor_cols = 2
n_data_cols = n_cols - n_factor_cols
arr = np.random.randint(0, 100, size=(n_rows, n_cols))
# Split the array into the factor block and one column vector per data column.
factor_cols = arr[:, :n_factor_cols]
data_cols = [arr[:, i][:, np.newaxis] for i in range(n_factor_cols, n_cols)]
# Pair the factor block with each data column, then stack the pairs vertically.
stacks = [np.hstack((factor_cols, data_col)) for data_col in data_cols]
output = np.concatenate(stacks)
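If a DataFrame is wanted back (the question works with one), the flat array can be wrapped up again; a minimal sketch, reusing the column names from the question's dummy data:
# A sketch: rebuild a DataFrame from the flat array. The column names
# are taken from the question's example; adjust as needed.
result = pd.DataFrame(output, columns=['constant 1', 'constant 2', 'value'])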
The np.concatenate call above assumes that row order is not important. If it is, use the following instead:
output = np.empty((n_rows * n_data_cols, n_factor_cols + 1), dtype=arr.dtype)
# Interleave the stacks so that each input row's measurements stay adjacent.
for i, stack in enumerate(stacks):
    output[i::n_data_cols] = stack
This is the best I can do, but I wouldn't be surprised if someone comes along and rewrites it as a Numpy one-liner. :)
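For what it's worth, the interleaving loop can likely be vectorized; a sketch (not from the original answer) using np.repeat and a row-major reshape, which preserves the same row order as the loop above:
# Repeat each factor row once per data column, then flatten the data
# block row by row so each input row's measurements stay adjacent.
factors_rep = np.repeat(arr[:, :n_factor_cols], n_data_cols, axis=0)
values = arr[:, n_factor_cols:].reshape(-1, 1)
output = np.hstack((factors_rep, values))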
The pandas library has rich functionality and allows building complex pipelines as chains of routine calls.
In your case the whole idea is achievable with the following single pipeline:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)),
columns=['constant 1', 'constant 2', 1, 2, 3, 4])
def transform_columns(df):
    return (df.set_index(df.filter(regex=r'\D').columns.tolist())
              .stack()
              .reset_index(name='value')
              .drop(columns='level_2'))
print(transform_columns(df))
Details:
- df.filter(regex=r'\D').columns.tolist() — df.filter returns the subset of columns whose names match the specified regex pattern regex=r'\D' (i.e. the column name contains non-digit characters).
- df.set_index(...) — sets the input dataframe's index (row labels) using the column names from the previous step.
- .stack() — reshapes the dataframe from columns into the index, yielding a multi-level index.
- .reset_index(name='value') — pandas.Series.reset_index turns the index levels back into columns; name='value' gives the desired name to the column containing the crucial values.
- .drop(columns='level_2') — drops the supplementary label level_2 from the columns.
You may check/debug each step separately to see what the intermediate series/dataframe looks like and how it's transformed.
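As an aside, the same reshape can also be expressed with DataFrame.melt; a sketch, selecting the factor columns the same way as in the pipeline above (note that melt orders the result by source column rather than row by row, so the row order differs from the stack() version):
# Alternative using DataFrame.melt; the intermediate 'variable' column
# holds the original run number and is dropped here.
factor_cols = df.filter(regex=r'\D').columns.tolist()
long_df = (df.melt(id_vars=factor_cols, value_name='value')
             .drop(columns='variable'))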
Sample output:
constant 1 constant 2 value
0 47 83 38
1 47 83 53
2 47 83 76
3 47 83 24
4 15 49 23
.. ... ... ...
395 16 16 80
396 16 92 46
397 16 92 77
398 16 92 68
399 16 92 83
[400 rows x 3 columns]