This listing selects the best features from the 1011 columns available in a given dataset. The first three columns are dropped because they contain no useful data. The dataset is huge, so it is read in 25 chunks. Please focus on threshold, feature_importance, and selected_features.
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from collections import Counter
# Path to the dataset
file_path = 'dataset.csv'
output_file_path = 'feature_selection_output.txt'
# Chunk size calculation
total_rows = 1000000
num_chunks = 25
chunk_size = total_rows // num_chunks
# Check for GPU availability
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
    print("Using GPU")
else:
    print("No GPU found, using CPU")
# Function to process a chunk
def process_chunk(chunk):
    # Fill empty values with zeros
    chunk.fillna(0, inplace=True)
    # Remove the first three columns
    chunk = chunk.iloc[:, 3:]
    # Filter rows based on the first column's value
    chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
    # Separate target and features
    y = chunk.iloc[:, 0]
    X = chunk.iloc[:, 1:]
    # One-hot encode the target column
    y = pd.get_dummies(y)
    return X, y
# Placeholder for selected features count
feature_counter = Counter()
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing chunk {i+1}/{num_chunks}")
    X, y = process_chunk(chunk)
    # Normalize features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # Build and train a simple neural network model
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, batch_size=64, verbose=0)
    # Extract feature importance from the model weights
    weights = model.layers[0].get_weights()[0]
    feature_importance = np.mean(np.abs(weights), axis=1)
    # Select features based on importance
    threshold = np.median(feature_importance)
    selected_features = np.where(feature_importance > threshold)[0]
    # Update feature counter
    feature_counter.update(selected_features)
    # Clear the Keras session to free memory
    tf.keras.backend.clear_session()
# Get the most frequently selected features and sort by frequency in descending order
sorted_features = feature_counter.most_common()
# Save the selected features and their frequencies to a file
with open(output_file_path, 'w') as f:
    for feature, count in sorted_features:
        f.write(f"{feature}: {count}\n")
print(f"Selected features saved to {output_file_path}")
2 Answers
Series.isin is much faster than Series.str.contains since you're only checking for literal strings (rather than substrings):

- chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
+ chunk = chunk[chunk.iloc[:, 0].isin(list('ABC'))]
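For a concrete sense of the difference, here is a small sketch (the toy frame and its label column are made up for illustration). Note that isin tests whole-cell equality while str.contains('A|B|C') is a regex substring search, so the two only agree when the column holds the bare labels:

import pandas as pd

# Hypothetical stand-in for one chunk; the column names are illustrative.
chunk = pd.DataFrame({'label': ['A', 'B', 'C', 'D', 'AB'],
                      'x1': range(5)})

# Regex substring search: matches 'A', 'B', 'C', and also 'AB'.
by_contains = chunk[chunk['label'].astype(str).str.contains('A|B|C')]

# Exact membership test: matches only the bare labels.
by_isin = chunk[chunk['label'].isin(list('ABC'))]

print(by_contains['label'].tolist())  # ['A', 'B', 'C', 'AB']
print(by_isin['label'].tolist())      # ['A', 'B', 'C']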
inplace is proposed for deprecation in PDEP-8 and is not recommended:

- chunk.fillna(0, inplace=True)
+ chunk = chunk.fillna(0)
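A side benefit of the copy-returning style is that calls chain; a minimal sketch (clean_chunk is a hypothetical helper, not part of the original code):

import pandas as pd

def clean_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # fillna returns a new frame, so the column drop chains right onto it.
    return chunk.fillna(0).iloc[:, 3:]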
That was a less important part of the code. Please focus on threshold, feature_importance, and selected_features. (user366312, Jul 7, 2024 at 5:00)
new concept? new name!
y = chunk.iloc[:, 0]
...
# One-hot encode the target column
y = pd.get_dummies(y)
Imagine a telephone conversation with a colleague about this code.
Maybe it's a FaceTime conversation with ~~Abe Lincoln~~ Bob Newhart.
Yeah, you want to do that with y? You mean the old y? Or maybe the new y. You know, like y-prime. No, not like that transformer. You know, the new y, the hot one. Uhhhh, the one hot one. Oh, you know what I mean! No? Well, let me give it a name so you'll know what I mean.
You had an opening where you could give each concept a distinct name. Next time, perhaps you will seize the opportunity.
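For instance, a minimal sketch of that separation with distinct names (the names themselves are only illustrative):

import pandas as pd

def split_features_and_target(chunk: pd.DataFrame):
    # The raw class labels keep their own name...
    labels = chunk.iloc[:, 0]
    features = chunk.iloc[:, 1:]
    # ...so the one-hot encoding can be named for what it actually is.
    labels_onehot = pd.get_dummies(labels)
    return features, labels_onehot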
giant number
total_rows = 1000000
I ran out of fingers on one hand to count the powers of ten there. Prefer to spell it
total_rows = 1_000_000
so we can see at once that you mean "a million".
Please focus on threshold, feature_importance, and selected_features. (user366312, Jul 7, 2024 at 6:58)