This listing selects the best features from the 1011 columns available in a given dataset. The first three columns are dropped because they contain no useful data. The dataset is huge, so it is read in 25 chunks. Please focus on threshold, feature_importance, and selected_features.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from collections import Counter
# Path to the dataset
file_path = 'dataset.csv'
output_file_path = 'feature_selection_output.txt'
# Chunk size calculation
total_rows = 1000000
num_chunks = 25
chunk_size = total_rows // num_chunks
# Check for GPU availability
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
    print("Using GPU")
else:
    print("No GPU found, using CPU")
# Function to process a chunk
def process_chunk(chunk):
    # Fill empty values with zeros
    chunk.fillna(0, inplace=True)
    # Remove the first three columns
    chunk = chunk.iloc[:, 3:]
    # Filter rows based on the first column's value
    chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
    # Separate target and features
    y = chunk.iloc[:, 0]
    X = chunk.iloc[:, 1:]
    # One-hot encode the target column
    y = pd.get_dummies(y)
    return X, y
# Placeholder for selected features count
feature_counter = Counter()
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing chunk {i+1}/{num_chunks}")
    X, y = process_chunk(chunk)
    # Normalize features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # Build and train a simple neural network model
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, batch_size=64, verbose=0)
    # Extract feature importance from the model weights
    weights = model.layers[0].get_weights()[0]
    feature_importance = np.mean(np.abs(weights), axis=1)
    # Select features based on importance
    threshold = np.median(feature_importance)
    selected_features = np.where(feature_importance > threshold)[0]
    # Update feature counter
    feature_counter.update(selected_features)
    # Clear the Keras session to free memory
    tf.keras.backend.clear_session()
# Get the most frequently selected features and sort by frequency in descending order
sorted_features = feature_counter.most_common()
# Save the selected features and their frequencies to a file
with open(output_file_path, 'w') as f:
    for feature, count in sorted_features:
        f.write(f"{feature}: {count}\n")
print(f"Selected features saved to {output_file_path}")
2 Answers
Series.isin is much faster than Series.str.contains since you're only checking for literal strings (rather than substrings):

- chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
+ chunk = chunk[chunk.iloc[:, 0].isin(list('ABC'))]
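For a concrete sense of the difference, here is a small sketch (the toy frame and its label column are made up for illustration). Note that isin tests whole-cell equality while str.contains('A|B|C') is a regex substring search, so the two only agree when the column holds the bare labels:

import pandas as pd

# Hypothetical stand-in for one chunk; the column names are illustrative.
chunk = pd.DataFrame({'label': ['A', 'B', 'C', 'D', 'AB'],
                      'x1': range(5)})

# Regex substring search: matches 'A', 'B', 'C', and also 'AB'.
by_contains = chunk[chunk['label'].astype(str).str.contains('A|B|C')]

# Exact membership test: matches only the bare labels.
by_isin = chunk[chunk['label'].isin(list('ABC'))]

print(by_contains['label'].tolist())  # ['A', 'B', 'C', 'AB']
print(by_isin['label'].tolist())      # ['A', 'B', 'C']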
inplace is proposed for deprecation in PDEP-8 and is not recommended:

- chunk.fillna(0, inplace=True)
+ chunk = chunk.fillna(0)
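A side benefit of the copy-returning style is that calls chain; a minimal sketch (clean_chunk is a hypothetical helper, not part of the original code):

import pandas as pd

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # fillna returns a new frame, so the column drop chains right onto it.
    return chunk.fillna(0).iloc[:, 3:]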
That was a less important part of the code. Please focus on threshold, feature_importance, and selected_features. (user366312, Jul 7, 2024 at 5:00)
new concept? new name!
y = chunk.iloc[:, 0]
...
# One-hot encode the target column
y = pd.get_dummies(y)
Imagine a telephone conversation with a colleague about this code.
Maybe it's a FaceTime conversation with ~~Abe Lincoln~~ Bob Newhart.
Yeah, you want to do that with y? You mean the old y? Or maybe the new y. You know, like y-prime. No, not like that transformer. You know, the new y, the hot one. Uhhhh, the one hot one. Oh, you know what I mean! No? Well, let me give it a name so you'll know what I mean.
You had an opening where you could give each concept a distinct name. Next time, perhaps you will seize the opportunity.
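For instance, a minimal sketch of that separation with distinct names (the names themselves are only illustrative):

import pandas as pd

def split_features_and_target(chunk: pd.DataFrame):
    # The raw class labels keep their own name...
    labels = chunk.iloc[:, 0]
    features = chunk.iloc[:, 1:]
    # ...so the one-hot encoding can be named for what it actually is.
    labels_onehot = pd.get_dummies(labels)
    return features, labels_onehot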
giant number
total_rows = 1000000
I ran out of fingers on one hand to count the powers of ten there. Prefer to spell it
total_rows = 1_000_000
so we can see at once that you mean "a million".
Please focus on threshold, feature_importance, and selected_features. (user366312, Jul 7, 2024 at 6:58)