I have a dataframe with measurement data from different runs under the same conditions. Each row contains both the constant conditions under which the experiment was conducted and all the results from the different runs.
Since I am not able to provide a real dataset, the code snippet below generates some dummy data.
I was able to achieve the desired output, but my function transform_columns() seems unnecessarily complicated:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)), columns=['constant 1', 'constant 2', 1, 2, 3, 4])
def transform_columns(data):
    # Integer column labels hold run results; everything else is a condition.
    factor_columns = []
    response_columns = []
    for col in data:
        if isinstance(col, int):
            response_columns.append(col)
        else:
            factor_columns.append(col)
    # Emit one output row per (conditions, single measurement) pair.
    collected = []
    for index, row in data.iterrows():
        conditions = row.loc[factor_columns]
        data_values = row.loc[response_columns].dropna()
        for val in data_values:
            out = conditions.copy()
            out['value'] = val
            collected.append(out)
    return pd.DataFrame(collected).reset_index(drop=True)
print(transform_columns(df))
Is there any Pythonic or Pandas way to do this nicely?
2 Answers
It is probably easier to work with the underlying NumPy array directly than through pandas. Ensure that all factor columns come before all data columns; then this code will work:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
n_rows = 100
n_cols = 6
n_factor_cols = 2
n_data_cols = n_cols - n_factor_cols
arr = np.random.randint(0, 100, size=(n_rows, n_cols))
# Split the array into the factor block and one column vector per data column.
factor_cols = arr[:, :n_factor_cols]
data_cols = [arr[:, i][:, np.newaxis] for i in range(n_factor_cols, n_cols)]
# Pair the factor block with each data column, then stack the pairs vertically.
stacks = [np.hstack((factor_cols, data_col)) for data_col in data_cols]
output = np.concatenate(stacks)
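If a DataFrame is wanted back (the question works with one), the flat array can be wrapped up again; a minimal sketch, reusing the column names from the question's dummy data:
# A sketch: rebuild a DataFrame from the flat array. The column names
# are taken from the question's example; adjust as needed.
result = pd.DataFrame(output, columns=['constant 1', 'constant 2', 'value'])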
The np.concatenate call above assumes that row order is not important. If it is, use the following instead:
output = np.empty((n_rows * n_data_cols, n_factor_cols + 1), dtype=arr.dtype)
# Interleave the stacks so that each input row's measurements stay adjacent.
for i, stack in enumerate(stacks):
    output[i::n_data_cols] = stack
This is the best I can do, but I wouldn't be surprised if someone comes along and rewrites it as a Numpy one-liner. :)
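For what it's worth, the interleaving loop can likely be vectorized; a sketch (not from the original answer) using np.repeat and a row-major reshape, which preserves the same row order as the loop above:
# Repeat each factor row once per data column, then flatten the data
# block row by row so each input row's measurements stay adjacent.
factors_rep = np.repeat(arr[:, :n_factor_cols], n_data_cols, axis=0)
values = arr[:, n_factor_cols:].reshape(-1, 1)
output = np.hstack((factors_rep, values))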
The pandas library has rich functionality and allows building complex pipelines as chains of routine calls.
In your case the whole idea is achievable with the following single pipeline:
import pandas as pd
import numpy as np
np.random.seed(seed=1234)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 6)),
columns=['constant 1', 'constant 2', 1, 2, 3, 4])
def transform_columns(df):
    return (df.set_index(df.filter(regex=r'\D').columns.tolist())
              .stack()
              .reset_index(name='value')
              .drop(columns='level_2'))
print(transform_columns(df))
Details:
- df.filter(regex=r'\D').columns.tolist() — df.filter returns the subset of columns whose names match the specified regex pattern regex=r'\D' (i.e. the column name contains non-digit characters).
- df.set_index(...) — sets the input dataframe's index (row labels) using the column names from the previous step.
- .stack() — reshapes the dataframe from columns into the index, yielding a multi-level index.
- .reset_index(name='value') — pandas.Series.reset_index turns the index levels back into columns; name='value' gives the desired name to the column containing the crucial values.
- .drop(columns='level_2') — drops the supplementary label level_2 from the columns.
You may check/debug each step separately to see what the intermediate series/dataframe looks like and how it's transformed.
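As an aside, the same reshape can also be expressed with DataFrame.melt; a sketch, selecting the factor columns the same way as in the pipeline above (note that melt orders the result by source column rather than row by row, so the row order differs from the stack() version):
# Alternative using DataFrame.melt; the intermediate 'variable' column
# holds the original run number and is dropped here.
factor_cols = df.filter(regex=r'\D').columns.tolist()
long_df = (df.melt(id_vars=factor_cols, value_name='value')
             .drop(columns='variable'))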
Sample output:
constant 1 constant 2 value
0 47 83 38
1 47 83 53
2 47 83 76
3 47 83 24
4 15 49 23
.. ... ... ...
395 16 16 80
396 16 92 46
397 16 92 77
398 16 92 68
399 16 92 83
[400 rows x 3 columns]