a simple test project for me to try out pushing the limits of fast reading & processing of large files in c++

This repository was archived on 2025-01-12. You can view files and clone it, but you cannot make any changes to its state, such as pushing and creating new issues, pull requests or comments.

22 commits 1 branch 0 tags 1.6 MiB

C++ 97.8%
Meson 2.2%

ab34 cb9ca81abc edit readme		2024-08-12 08:56:06 +02:00
img	edit readme	2024-08-12 08:56:06 +02:00
src	edit readme	2024-08-12 08:56:06 +02:00
.gitignore	seriously nearly finished	2024-03-22 14:15:18 +01:00
example_data.csv	finish touches	2024-03-23 15:56:30 +01:00
LICENSE	fileparsing works...	2024-03-12 16:35:25 +01:00
meson.build	finish touches	2024-03-23 15:56:30 +01:00
README.md	edit readme	2024-08-12 08:56:06 +02:00

README.md

fast file-processing

a simple test project for me to try out pushing the limits of fast reading & processing of large files in c++. [it's mainly about optimisation]

how to compile?

this project uses the meson build system. commands for compilation (run in the root directory of the project):

meson setup <build_directory_name> --buildtype=release -Dcpp_args='/GL'
cd <build_directory_name> && meson compile
./fast_file_processing

data-set

the program's purpose was to run on a file containing 1 billion lines (one record per line) with a total size of ~16GB. however, such a file is too large to upload to the git repository, so there's only a small example file with ~44'000 lines.

performance:

on my laptop, where it is bottlenecked by the intel i7-9750 processor, the program finishes the large data-set in ~6 minutes on average. on the small data-set, it finishes within 58ms.

the most time-consuming operation by far is the add_to_map function, which adds and updates data in the std::map.
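
to illustrate the kind of work such a function does, here is a minimal sketch. the readme doesn't show the real signature or record layout, so the Stats struct, the StatsMap alias and the (key, value) parameters below are assumptions made purely for illustration. every call performs a tree lookup that costs O(log n) key comparisons, which is one plausible reason a function like this dominates the runtime at 1 billion calls.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <string_view>

// Hypothetical per-key aggregate; the project's real record layout is not shown.
struct Stats {
    std::uint64_t count = 0;
    double sum = 0.0;
};

// std::less<> is a transparent comparator, so find() accepts a std::string_view
// without constructing a temporary std::string for every lookup.
using StatsMap = std::map<std::string, Stats, std::less<>>;

// Hypothetical stand-in for add_to_map: insert the key if it is new,
// then update the aggregate stored for it.
void add_to_map(StatsMap& map, std::string_view key, double value) {
    auto it = map.find(key);  // O(log n) lookup in the ordered tree
    if (it == map.end()) {
        // Only a key seen for the first time pays for a string allocation.
        it = map.emplace(std::string(key), Stats{}).first;
    }
    it->second.count += 1;
    it->second.sum += value;
}
```

usage would look like `add_to_map(map, "some_key", 1.5);` called once per parsed line. whether a hash-based container such as std::unordered_map would be faster here depends on the key distribution and was not measured in this project.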

when only parsing the large data-set, without calling the add_to_map function, the program takes ~49 seconds to run. in this case, it is bottlenecked by my ssd's read-speed, sitting at ~330MB/s.
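
the read-speed claim can be sanity-checked with a bit of arithmetic. the sketch below assumes decimal units (1GB = 10^9 bytes, 1MB = 10^6 bytes) and takes the two rounded figures from this readme; the names are made up for the example.

```cpp
// Seconds needed to stream `bytes` at a sustained rate of `bytes_per_second`.
constexpr double read_time_seconds(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second;
}

// Rounded figures from the readme, in decimal units (an assumption).
constexpr double file_bytes = 16e9;             // ~16GB data-set
constexpr double ssd_bytes_per_second = 330e6;  // ~330MB/s read-speed

constexpr double expected_seconds = read_time_seconds(file_bytes, ssd_bytes_per_second);

// 16e9 / 330e6 is roughly 48.5 seconds, close to the measured ~49 seconds.
static_assert(expected_seconds > 48.0 && expected_seconds < 49.0,
              "parse-only run time should be close to the pure read time");
```

since the estimate lands within about a second of the measured time, the parse-only run spends almost all of its time waiting for the disk.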

notes to the implementation:

this project is realised using memory-mapping functions provided by the windows-api. it isn't multithreaded and simply served as a playground for trying out different methods of file reading.

since the primary goal was speed, some parts of the codebase are deliberately written to not be extensible or pretty for the sake of performance.

licence

this project is released under the gnu affero general public license