The given dataset is a graph data structure representing social interactions.
The nodes are people, represented as People{node_id, age, gender, occupation},
and the edges indicate how often individuals communicate with each other, whether through phone calls, emails, or messages; a higher weight signifies more frequent communication.
There are 10 people: 60% are female, 30% are male, and 10% are of unknown gender.
The 'target' label for each node is the person's gender.
People Nodes
node_id | age | gender | occupation |
---|---|---|---|
1 | 41 | Female | 0 |
2 | 40 | Unknown | 3 |
3 | 21 | Female | 1 |
4 | 43 | Female | 4 |
5 | 31 | Male | 3 |
6 | 30 | Female | 0 |
7 | 28 | Female | 1 |
8 | 29 | Male | 4 |
9 | 39 | Female | 1 |
10 | 32 | Male | 0 |
Social Interaction Edges
from | to | weight |
---|---|---|
1 | 2 | 0.450499 |
1 | 3 | 0.833195 |
1 | 4 | 0.449754 |
1 | 5 | 0.539692 |
1 | 6 | 0.293488 |
1 | 7 | 0.496794 |
1 | 8 | 0.514994 |
1 | 9 | 0.840499 |
1 | 10 | 0.412794 |
2 | 3 | 0.684105 |
2 | 4 | 0.963660 |
2 | 5 | 0.943470 |
2 | 6 | 0.192717 |
2 | 7 | 0.924245 |
2 | 8 | 0.201781 |
2 | 9 | 0.425102 |
2 | 10 | 0.613922 |
3 | 4 | 0.749955 |
3 | 5 | 0.675580 |
3 | 6 | 0.293714 |
3 | 7 | 0.816202 |
3 | 8 | 0.043064 |
3 | 9 | 0.922738 |
3 | 10 | 0.458666 |
4 | 5 | 0.034314 |
4 | 6 | 0.840261 |
4 | 7 | 0.925287 |
4 | 8 | 0.118203 |
4 | 9 | 0.547889 |
4 | 10 | 0.779928 |
5 | 6 | 0.624413 |
5 | 7 | 0.227053 |
5 | 8 | 0.695268 |
5 | 9 | 0.318876 |
5 | 10 | 0.960750 |
6 | 7 | 0.428481 |
6 | 8 | 0.798711 |
6 | 9 | 0.543386 |
6 | 10 | 0.277181 |
7 | 8 | 0.215006 |
7 | 9 | 0.285211 |
7 | 10 | 0.772858 |
8 | 9 | 0.963206 |
8 | 10 | 0.676292 |
9 | 10 | 0.412905 |
Adjacency matrix representation of the edges (symmetric, zero diagonal; weights match the edge list above)
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0.0000 | 0.4505 | 0.8332 | 0.4498 | 0.5397 | 0.2935 | 0.4968 | 0.5149 | 0.8405 | 0.4128 |
2 | 0.4505 | 0.0000 | 0.6841 | 0.9637 | 0.9435 | 0.1927 | 0.9242 | 0.2018 | 0.4251 | 0.6139 |
3 | 0.8332 | 0.6841 | 0.0000 | 0.7500 | 0.6756 | 0.2937 | 0.8162 | 0.0431 | 0.9227 | 0.4587 |
4 | 0.4498 | 0.9637 | 0.7500 | 0.0000 | 0.0343 | 0.8403 | 0.9253 | 0.1182 | 0.5479 | 0.7799 |
5 | 0.5397 | 0.9435 | 0.6756 | 0.0343 | 0.0000 | 0.6244 | 0.2271 | 0.6953 | 0.3189 | 0.9608 |
6 | 0.2935 | 0.1927 | 0.2937 | 0.8403 | 0.6244 | 0.0000 | 0.4285 | 0.7987 | 0.5434 | 0.2772 |
7 | 0.4968 | 0.9242 | 0.8162 | 0.9253 | 0.2271 | 0.4285 | 0.0000 | 0.2150 | 0.2852 | 0.7729 |
8 | 0.5149 | 0.2018 | 0.0431 | 0.1182 | 0.6953 | 0.7987 | 0.2150 | 0.0000 | 0.9632 | 0.6763 |
9 | 0.8405 | 0.4251 | 0.9227 | 0.5479 | 0.3189 | 0.5434 | 0.2852 | 0.9632 | 0.0000 | 0.4129 |
10 | 0.4128 | 0.6139 | 0.4587 | 0.7799 | 0.9608 | 0.2772 | 0.7729 | 0.6763 | 0.4129 | 0.0000 |
This is an implementation of a GNN that takes the adjacency matrix as its edge input.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
# Sample data for the nodes in the graph
nodes_data = {
'node_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'age': [41, 40, 21, 43, 31, 30, 28, 29, 39, 32],
'gender': [2, 0, 2, 2, 1, 2, 2, 1, 2, 1], # 0: Unknown, 1: Male, 2: Female
'occupation': [0, 3, 1, 4, 3, 0, 1, 4, 1, 0]
}
# Adjacency matrix representing the edges data between nodes
adjacency_matrix = np.array([
[0.0000, 0.4505, 0.8332, 0.4498, 0.5397, 0.2935, 0.4968, 0.5149, 0.8405, 0.4128],
[0.4505, 0.0000, 0.6841, 0.9637, 0.9435, 0.1927, 0.9242, 0.2018, 0.4251, 0.6139],
[0.8332, 0.6841, 0.0000, 0.7500, 0.6756, 0.2937, 0.8162, 0.0431, 0.9227, 0.4587],
[0.4498, 0.9637, 0.7500, 0.0000, 0.0343, 0.8403, 0.9253, 0.1182, 0.5479, 0.7799],
[0.5397, 0.9435, 0.6756, 0.0343, 0.0000, 0.6244, 0.2271, 0.6953, 0.3189, 0.9608],
[0.2935, 0.1927, 0.2937, 0.8403, 0.6244, 0.0000, 0.4285, 0.7987, 0.5434, 0.2772],
[0.4968, 0.9242, 0.8162, 0.9253, 0.2271, 0.4285, 0.0000, 0.2150, 0.2852, 0.7729],
[0.5149, 0.2018, 0.0431, 0.1182, 0.6953, 0.7987, 0.2150, 0.0000, 0.9632, 0.6763],
[0.8405, 0.4251, 0.9227, 0.5479, 0.3189, 0.5434, 0.2852, 0.9632, 0.0000, 0.4129],
[0.4128, 0.6139, 0.4587, 0.7799, 0.9608, 0.2772, 0.7729, 0.6763, 0.4129, 0.0000]
], dtype=np.float32)
# Convert the node data into a DataFrame
nodes_df = pd.DataFrame(nodes_data)
# Convert node_id to zero-based indexing (if needed for model consistency)
nodes_df['node_id'] = nodes_df['node_id'] - 1
# Extract features from the DataFrame and convert to numpy array
features = nodes_df[['age', 'gender', 'occupation']].to_numpy()
num_features = features.shape[1] # Number of features for each node
num_nodes = features.shape[0] # Number of nodes in the graph
# Target labels representing genders
target_labels = nodes_df['gender'].to_numpy()
# Define a custom Graph Convolution Layer
class GraphConvLayer(layers.Layer):
def __init__(self, output_dim, **kwargs):
super(GraphConvLayer, self).__init__(**kwargs)
self.output_dim = output_dim
def build(self, input_shape):
feature_shape = input_shape[0][-1] # Shape of the input features
# Initialize the weights for the layer
self.kernel = self.add_weight(
shape=(feature_shape, self.output_dim),
initializer='glorot_uniform',
name='kernel'
)
def call(self, inputs):
features, adj_matrix = inputs
# Perform graph convolution by multiplying adjacency matrix with features
output = tf.matmul(adj_matrix, features)
# Apply the learned weights
output = tf.matmul(output, self.kernel)
return output
# Function to create the GNN model
def create_gnn_model(input_shape, output_dim, num_nodes):
# Define input layers for features and adjacency matrix
features_input = keras.Input(shape=(num_nodes, input_shape), name='features')
adj_matrix_input = keras.Input(shape=(num_nodes, num_nodes), name='adj_matrix')
# Apply the first Graph Convolution Layer
x = GraphConvLayer(16)([features_input, adj_matrix_input])
x = layers.ReLU()(x) # Apply ReLU activation
# Apply the second Graph Convolution Layer
x = GraphConvLayer(output_dim)([x, adj_matrix_input])
return keras.Model(inputs=[features_input, adj_matrix_input], outputs=x)
# Create the GNN model
gnn_model = create_gnn_model(num_features, 3, num_nodes) # 3 output classes for gender (Unknown, Male, Female)
# Compile the model with Adam optimizer and Sparse Categorical Crossentropy loss
gnn_model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.01),
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[keras.metrics.SparseCategoricalAccuracy(name='acc')]
)
# Data preparation for training
features_input = features.astype(np.float32) # Convert features to float32 type
adj_matrix_input = adjacency_matrix.astype(np.float32) # Convert adjacency matrix to float32 type
# Expand dimensions to match the input shape (batch size, num_nodes, num_features)
features_input = np.expand_dims(features_input, axis=0)
adj_matrix_input = np.expand_dims(adj_matrix_input, axis=0)
target_labels = np.expand_dims(target_labels, axis=0) # Expand dimensions of target_labels to match the batch size
# Print shapes of inputs and targets for verification
print("features_input shape:", features_input.shape)
print("adj_matrix_input shape:", adj_matrix_input.shape)
print("target_labels shape:", target_labels.shape)
# Train the model
history = gnn_model.fit(
x=[features_input, adj_matrix_input], # Inputs to the model
y=target_labels, # Target labels
epochs=100, # Number of epochs
batch_size=1, # Batch size
validation_split=0 # Set validation_split to 0
)
# Plot training loss over epochs
plt.plot(history.history['loss'])
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()
# Plot training accuracy over epochs
plt.plot(history.history['acc'])
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()
Answer
For the most part this is textbook, bog-standard boilerplate of uncertain original authorship.
zero origin
This doesn't seem like a convenient notation.
nodes_data = {
'node_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...
# Convert node_id to zero-based indexing (if needed for model consistency)
nodes_df['node_id'] = nodes_df['node_id'] - 1
Why not just adopt sensible node identifiers from the get-go? As written, I need to keep worrying about "node 3, now is that the adjusted node 3 or the input node 3?"
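A minimal sketch of what that might look like, assuming the same ten people and the same encoding the question uses later:
import pandas as pd
# Zero-based identifiers from the start, so "node 3" means the same thing
# in the input data, the DataFrame, and the adjacency matrix.
nodes_df = pd.DataFrame({
    'node_id': range(10),  # 0..9, no re-indexing step needed later
    'age': [41, 40, 21, 43, 31, 30, 28, 29, 39, 32],
    'gender': [2, 0, 2, 2, 1, 2, 2, 1, 2, 1],  # 0: Unknown, 1: Male, 2: Female
    'occupation': [0, 3, 1, 4, 3, 0, 1, 4, 1, 0],
})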
Also, you appear to be describing an undirected graph using digraph notation. Consider writing down just the lower-left triangle, and then have code express the notion that the upper-right is definitely the exact mirror image, with no typographic transcription errors.
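One way to express that, assuming the edges are stored as (from, to, weight) triples over zero-based identifiers for the lower triangle only:
import numpy as np
num_nodes = 10
# Lower-triangle edges only (from > to); the first few weights shown here are
# taken from the question's edge list, the rest follow the same pattern.
lower_edges = [
    (1, 0, 0.450499), (2, 0, 0.833195), (2, 1, 0.684105),
    # ... remaining lower-triangle edges ...
]
adjacency_matrix = np.zeros((num_nodes, num_nodes), dtype=np.float32)
for i, j, w in lower_edges:
    adjacency_matrix[i, j] = w
# Mirror the lower triangle onto the upper triangle, so symmetry holds by
# construction rather than by hand-typed duplication.
adjacency_matrix = adjacency_matrix + adjacency_matrix.T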
redundant conversion
adjacency_matrix = np.array([ ...
], dtype=np.float32)
...
adj_matrix_input = adjacency_matrix.astype(np.float32) # Convert adjacency matrix to float32 type
We already had floats.
Also, the code already told us we were converting to type float; there's no need for an English sentence that says the exact same thing. We express "how?" in the code, and "why?" in the comments.
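Concretely, something like this (reusing the question's variable names) carries the same information with less noise:
# adjacency_matrix was already created with dtype=np.float32, so use it directly.
adj_matrix_input = adjacency_matrix
# The DataFrame columns are integers, hence the cast for the feature matrix.
features_input = features.astype(np.float32)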
motivation
You didn't set up the problem to be solved, and you cited no authors; it's unclear what an edge weight of, say, 0.8332 is supposed to mean. If the task is to infer gender from observed occupation, then tell us that. We can only judge correctness against some written specification. As written, it's hard to say much more about the code than "it ran, and it didn't crash".