
This listing selects the best features from the 1011 available columns in a given dataset.

The first three columns are dropped because they contain no useful data.

The dataset is huge, so it is read in 25 chunks.

Please focus on threshold, feature_importance, and selected_features.

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from collections import Counter
# Path to the dataset
file_path = 'dataset.csv'
output_file_path = 'feature_selection_output.txt'
# Chunk size calculation
total_rows = 1000000
num_chunks = 25
chunk_size = total_rows // num_chunks
# Check for GPU availability
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
 print("Using GPU")
else:
 print("No GPU found, using CPU")
# Function to process a chunk
def process_chunk(chunk):
    # Fill empty values with zeros
    chunk.fillna(0, inplace=True)
    # Remove the first three columns
    chunk = chunk.iloc[:, 3:]
    # Filter rows based on the first column's value
    chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
    # Separate target and features
    y = chunk.iloc[:, 0]
    X = chunk.iloc[:, 1:]
    # One-hot encode the target column
    y = pd.get_dummies(y)
    return X, y
# Placeholder for selected features count
feature_counter = Counter()
for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing chunk {i+1}/{num_chunks}")
    X, y = process_chunk(chunk)
    # Normalize features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # Build and train a simple neural network model
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, batch_size=64, verbose=0)
    # Extract feature importance from the model weights
    weights = model.layers[0].get_weights()[0]
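    # weights has shape (n_features, 128): row j holds the outgoing weights of input feature j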
    feature_importance = np.mean(np.abs(weights), axis=1)
    # Select features based on importance
    threshold = np.median(feature_importance)
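    # Keep the indices of features scoring above the chunk's median, i.e. roughly the top half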
    selected_features = np.where(feature_importance > threshold)[0]
    # Update feature counter
    feature_counter.update(selected_features)
    # Clear the Keras session to free memory
    tf.keras.backend.clear_session()
# Get the most frequently selected features and sort by frequency in descending order
sorted_features = feature_counter.most_common()
# Save the selected features and their frequencies to a file
with open(output_file_path, 'w') as f:
    for feature, count in sorted_features:
        f.write(f"{feature}: {count}\n")
print(f"Selected features saved to {output_file_path}")
toolic
asked Jul 7, 2024 at 2:36

2 Answers

  • Series.isin is much faster than Series.str.contains since you're only checking for literal strings rather than substrings (see the sketch after this list):

    - chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]
    + chunk = chunk[chunk.iloc[:, 0].isin(list('ABC'))]
    
  • inplace is slated for deprecation under PDEP-8 and is not recommended:

    - chunk.fillna(0, inplace=True)
    + chunk = chunk.fillna(0)
    
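To make the behavioral difference concrete (beyond speed), here is a minimal standalone sketch with made-up labels; note that the regex in str.contains also keeps values that merely contain 'A', 'B', or 'C' as a substring:

    import pandas as pd

    s = pd.Series(['A', 'B', 'AB', 'CAT', 'D'])

    # Substring regex match: 'AB' and 'CAT' slip through because they contain A, B, or C
    print(s[s.str.contains('A|B|C')].tolist())  # ['A', 'B', 'AB', 'CAT']

    # Exact membership test: keeps only the literal values 'A', 'B', 'C'
    print(s[s.isin(list('ABC'))].tolist())      # ['A', 'B']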
answered Jul 7, 2024 at 4:31
  • That was the less important part of the code. Please focus on threshold, feature_importance, and selected_features. Commented Jul 7, 2024 at 5:00

new concept? new name!

y = chunk.iloc[:, 0]
...
# One-hot encode the target column
y = pd.get_dummies(y)

Imagine a telephone conversation with a colleague about this code. Maybe it's a FaceTime conversation with ~~Abe Lincoln~~ Bob Newhart.

Yeah, you want to do that with y? You mean the old y? Or maybe the new y. You know, like y-prime. No, not like that transformer. You know, the new y, the hot one. Uhhhh, the one hot one. Oh, you know what I mean! No? Well, let me give it a name so you'll know what I mean.

You had an opening where you could give each concept a distinct name. Next time, perhaps you will seize the opportunity.
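A minimal self-contained sketch of the rename (y_labels, y_onehot, and the toy chunk are my own illustrations, not names from the original):

import pandas as pd

# Hypothetical chunk standing in for one slice of the real dataset
chunk = pd.DataFrame({'label': ['A', 'B', 'A'], 'f1': [1, 2, 3], 'f2': [4, 5, 6]})

# Raw class labels from the first column: one name for one concept
y_labels = chunk.iloc[:, 0]
X = chunk.iloc[:, 1:]

# One-hot encoded targets: a new concept, so a new name
y_onehot = pd.get_dummies(y_labels)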

giant number

total_rows = 1000000

I ran out of fingers on one hand to count the powers of ten there. Prefer to spell it

total_rows = 1_000_000

so we can see at once that you mean "a million".

answered Jul 7, 2024 at 5:47
  • Please focus on threshold, feature_importance, and selected_features. Commented Jul 7, 2024 at 6:58
