I really like using MongoDB to store my data, and I recently tried out GridFS, which fits my use case well.
My problem is the space requirement, which seems quite odd. I have ~107GB of images in Amazon S3, around one million files (all images, mostly small ones). I wrote a simple Java project to download the images from S3 and insert them into two separate MongoDB GridFS collections (single server, MongoDB 3.6.5, 64-bit, Windows Server 2016). The problem is that when the transfer completes, the GridFS collections take up more than 300GB of storage on the server. Is this acceptable for this kind of collection, or should I worry about the tripled size?
Note: I simply inserted the images using the Java Mongo driver (Spring Boot) without any significant changes; the growth is in the image chunks. I never delete or update any images (though I defined a unique index on the MD5 field to skip duplicates), so compact and repair do not change the collection sizes. As far as I can see, the collections are not overly preallocated (I don't think my problem is similar to this: Huge size on mongodb's gridfs. Should I compact?).
Also, this is currently a single MongoDB server, without a replica set.
Thank you very much for your help!
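For reference, one way to tell whether the extra ~200GB is real chunk data or just allocated-but-unused storage is to compare the `dataSize` and `storageSize` fields reported by `db.fs.chunks.stats()` (the `collStats` command). A minimal sketch of that comparison, with hypothetical numbers matching the question (the class and method names are my own, not part of any driver API):

```java
// Sketch: interpret the `dataSize` and `storageSize` fields that
// db.fs.chunks.stats() reports. dataSize is the logical BSON size of the
// chunk documents; storageSize is the on-disk allocation. A ratio well
// above 1.0 points at fragmentation or uncompacted space rather than
// duplicate chunk data.
public class GridFsStorageCheck {

    // Ratio of on-disk storage to logical BSON data.
    static double storageOverhead(long dataSizeBytes, long storageSizeBytes) {
        if (dataSizeBytes <= 0) {
            throw new IllegalArgumentException("dataSize must be positive");
        }
        return (double) storageSizeBytes / dataSizeBytes;
    }

    public static void main(String[] args) {
        // Hypothetical numbers taken from the question: ~107GB of images,
        // ~300GB on disk. Real values come from db.fs.chunks.stats().
        long dataSize = 107L * 1024 * 1024 * 1024;
        long storageSize = 300L * 1024 * 1024 * 1024;
        System.out.printf("overhead factor: %.2f%n",
                storageOverhead(dataSize, storageSize)); // prints "overhead factor: 2.80"
    }
}
```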
1 Answer
Add the MongoDB Java driver dependency to your project's pom.xml file:
```xml
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>4.4.2</version>
</dependency>
```
Create a MongoDB client bean in your application configuration class:
```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MongoConfig {

    @Value("${spring.data.mongodb.uri}")
    private String mongoUri;

    @Bean
    public MongoClient mongoClient() {
        ConnectionString connectionString = new ConnectionString(mongoUri);
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(connectionString)
                .build();
        return MongoClients.create(settings);
    }

    @Bean
    public MongoDatabase mongoDatabase(MongoClient mongoClient) {
        return mongoClient.getDatabase("your_database_name");
    }
}
```
Define a service class to handle the GridFS operations:
```java
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import com.mongodb.client.gridfs.GridFSDownloadStream;
import com.mongodb.client.gridfs.model.GridFSUploadOptions;
import org.bson.Document;
import org.bson.types.ObjectId;
import org.springframework.stereotype.Service;

import java.io.InputStream;

@Service
public class GridFsService {

    private final GridFSBucket gridFSBucket;

    public GridFsService(MongoDatabase mongoDatabase) {
        // Uses the default bucket name "fs" (collections fs.files and fs.chunks)
        this.gridFSBucket = GridFSBuckets.create(mongoDatabase);
    }

    public ObjectId uploadFile(String filename, InputStream inputStream, String contentType) {
        GridFSUploadOptions options = new GridFSUploadOptions()
                .chunkSizeBytes(256 * 1024) // Set the desired chunk size
                .metadata(new Document("contentType", contentType)); // Additional metadata if needed
        return gridFSBucket.uploadFromStream(filename, inputStream, options);
    }

    public GridFSDownloadStream downloadFile(ObjectId fileId) {
        return gridFSBucket.openDownloadStream(fileId);
    }
}
```
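It can also help to reason about how many `fs.chunks` documents a given file produces. A rough sketch of that arithmetic (the class and method names are mine, and the per-document overhead is only qualitative): each chunk is its own BSON document with `_id`, `files_id`, `n`, and `data` fields plus index entries, so a collection of many small images carries proportionally more per-document overhead than a few large files.

```java
// Sketch: number of fs.chunks documents produced by a file of a given size
// at a given chunkSizeBytes (ceiling division).
public class ChunkMath {

    static long chunkCount(long fileSizeBytes, int chunkSizeBytes) {
        return (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes;
    }

    public static void main(String[] args) {
        int chunkSize = 256 * 1024; // matches the upload options above
        System.out.println(chunkCount(1_000_000, chunkSize)); // ~1MB image -> prints 4
        System.out.println(chunkCount(50_000, chunkSize));    // small thumbnail -> prints 1
    }
}
```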
Use the GridFsService in your application logic to upload and download files:
```java
import com.mongodb.client.gridfs.GridFSDownloadStream;
import org.bson.types.ObjectId;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.io.InputStream;

@Service
public class YourService {

    private final GridFsService gridFsService;

    public YourService(GridFsService gridFsService) {
        this.gridFsService = gridFsService;
    }

    public void uploadFile(MultipartFile file) throws IOException {
        try (InputStream inputStream = file.getInputStream()) {
            gridFsService.uploadFile(file.getOriginalFilename(), inputStream, file.getContentType());
        }
    }

    public InputStream downloadFile(ObjectId fileId) {
        // GridFSDownloadStream extends InputStream, so it can be returned directly
        GridFSDownloadStream downloadStream = gridFsService.downloadFile(fileId);
        return downloadStream;
    }
}
```
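Since `GridFSDownloadStream` extends `InputStream`, any `InputStream`-consuming API can persist the download directly. A minimal offline sketch (the in-memory stream below stands in for the real `gridFsService.downloadFile(fileId)` result, so this runs without a MongoDB server):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SaveToDisk {

    // Files.copy drains the stream and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for gridFsService.downloadFile(fileId), to keep the sketch offline.
        InputStream fake = new ByteArrayInputStream("image bytes".getBytes());
        Path tmp = Files.createTempFile("gridfs-", ".bin");
        System.out.println(save(fake, tmp)); // prints 11
        Files.delete(tmp);
    }
}
```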
But why, what does it do? – Rohit Gupta, Jul 20, 2023 at 20:25
It would be helpful to include the `ls -l` output in the question. Please also post the output of `db.fs.chunks.stats()` (assuming your chunks collection is using the default name `fs.chunks`).