Description
🚀 Describe the improvement or the new tutorial
The current tutorial inference example uses an input tensor of shape (N, 28, 28), which is correct for the shown MLP model because it applies nn.Flatten(start_dim=1).
This can confuse users later when they transition to CNN-based models (nn.Conv2d), which require inputs in (N, C, H, W) format (e.g., (1, 1, 28, 28) for grayscale images).
The example is technically correct as written, but without an explicit clarification, learners may incorrectly assume that (N, 28, 28) is a general input requirement for vision models.
A short explanatory comment near the inference snippet could help distinguish:
- MLP-style models that flatten inputs
- CNN-style models that preserve spatial and channel dimensions
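To make the distinction concrete, here is a minimal sketch (the two models are illustrative toy architectures, not the tutorial's exact model):

```python
import torch
from torch import nn

# MLP-style model: nn.Flatten(start_dim=1) collapses the 28x28 image,
# so an (N, 28, 28) input works without an explicit channel dimension.
mlp = nn.Sequential(
    nn.Flatten(start_dim=1),          # (N, 28, 28) -> (N, 784)
    nn.Linear(28 * 28, 10),
)
x_mlp = torch.randn(1, 28, 28)        # no channel dimension needed
print(mlp(x_mlp).shape)               # torch.Size([1, 10])

# CNN-style model: nn.Conv2d preserves spatial and channel dimensions,
# so it expects (N, C, H, W) -- e.g. (1, 1, 28, 28) for one grayscale image.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.Flatten(start_dim=1),
    nn.Linear(8 * 28 * 28, 10),
)
x_cnn = x_mlp.unsqueeze(1)            # (1, 28, 28) -> (1, 1, 28, 28)
print(cnn(x_cnn).shape)               # torch.Size([1, 10])
```

Passing `x_mlp` directly to `cnn` would raise a shape error, which is exactly the confusion the proposed note aims to prevent.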
The proposed clarification would be documentation-only:
- No API changes
- No behavior changes
- No modification to the existing example code
The intent is to improve conceptual understanding for beginners, especially those progressing from fully connected networks to convolutional networks.
I would appreciate feedback on whether adding a brief note clarifying this distinction would be acceptable, and if so, whether the suggested wording aligns with tutorial style guidelines.
If this clarification sounds reasonable, I’d be happy to submit a small documentation PR incorporating it.
Existing tutorials on this topic
No response
Additional context
No response