Why is reading lines from stdin much slower in C++ than Python?
tl;dr: Because of different default settings in C++ requiring more system calls.
cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:
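A minimal sketch of that setting, assuming the call meant here is the standard `std::ios_base::sync_with_stdio(false)`. The helper names `count_lines`, `count_stdin_lines_unsynced`, and `self_check` are illustrative and not from the original post:

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Count newline-delimited lines on any input stream.
std::size_t count_lines(std::istream& in) {
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}

// Hypothetical helper: disable synchronization with C stdio, then count
// the lines arriving on std::cin. Call it before any other input on
// std::cin or stdin, and avoid mixing cin with scanf/getchar afterwards.
std::size_t count_stdin_lines_unsynced() {
    std::ios_base::sync_with_stdio(false);
    return count_lines(std::cin);
}

// Quick check of the counting logic on an in-memory stream.
void self_check() {
    std::istringstream sample("one\ntwo\nthree\n");
    assert(count_lines(sample) == 3);
}
```

In a real program the `sync_with_stdio(false)` line would simply be the first statement of `main`.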
Normally, when an input stream is buffered, the stream is read in larger chunks instead of one character at a time. This reduces the number of system calls, which are typically relatively expensive. However, since stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:
int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d", &myvalue2);
If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.
To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.
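To make the cost of unbuffered reads concrete, here is a small simulation, not the real iostream implementation. `CountingSource` is a hypothetical stand-in for a file descriptor whose `read` calls each represent one system call, and `count_newlines` is an illustrative helper that scans the same input with different chunk sizes:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a file descriptor: every read() represents one system call.
class CountingSource {
public:
    explicit CountingSource(std::string data) : data_(std::move(data)) {}

    // Copy up to `capacity` bytes into `out`; returns 0 at end of input.
    std::size_t read(char* out, std::size_t capacity) {
        ++calls_;
        const std::size_t n = std::min(capacity, data_.size() - pos_);
        std::memcpy(out, data_.data() + pos_, n);
        pos_ += n;
        return n;
    }

    std::size_t calls() const { return calls_; }

private:
    std::string data_;
    std::size_t pos_ = 0;
    std::size_t calls_ = 0;
};

struct ReadStats {
    std::size_t lines;  // newline characters seen
    std::size_t calls;  // simulated system calls issued
};

// Count newlines in `input`, fetching `chunk_size` bytes per simulated call.
ReadStats count_newlines(const std::string& input, std::size_t chunk_size) {
    assert(chunk_size > 0);
    CountingSource source(input);
    std::vector<char> buffer(chunk_size);
    std::size_t lines = 0;
    for (;;) {
        const std::size_t n = source.read(buffer.data(), buffer.size());
        if (n == 0) break;  // end of input
        const auto end = buffer.begin() + static_cast<std::ptrdiff_t>(n);
        lines += static_cast<std::size_t>(std::count(buffer.begin(), end, '\n'));
    }
    return {lines, source.calls()};
}
```

With 2000 bytes of input, a chunk size of 1 costs 2001 simulated calls (one per byte plus the final end-of-input read), while a 4096-byte chunk costs 2.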
Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.
Just out of curiosity I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.
./a.out < in
Saw 6512403 lines in 8 seconds. Crunch speed: 814050
sudo dtruss -c ./a.out < in
CALL            COUNT
__mac_syscall   1
<snip>
open            6
pread           8
mprotect        17
mmap            22
stat64          30
read_nocancel   25958
./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402
sudo dtruss -c ./a.py < in
CALL            COUNT
__mac_syscall   1
<snip>
open            5
pread           8
mprotect        17
mmap            21
stat64          29
I'm a few years behind here, but:
In 'Edit 4/5/6' of the original post, you are using the construction:
$ /usr/bin/time cat big_file | program_to_benchmark
This is wrong in a couple of different ways:
1. You're actually timing the execution of cat, not your benchmark. The 'user' and 'sys' CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the 'real' time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.
2. cat is unnecessary and in fact counterproductive; you're adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and, in certain generations of computers, I/O faster than CPU), the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a 'Useless Use Of Cat' award if I were Randal Schwartz.)
A better construction would be:
$ /usr/bin/time program_to_benchmark < big_file
In this statement it is the shell which opens big_file, passing it to your program (well, actually to time, which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you're trying to benchmark. This gets you a real reading of its performance without spurious complications.
I will mention two possible, but actually wrong, 'fixes' which could also be considered (but I 'number' them differently as these are not things which were wrong in the original post):
A. You could 'fix' this by timing only your program:
$ cat big_file | /usr/bin/time program_to_benchmark
B. or by timing the entire pipeline:
$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'
These are wrong for the same reasons as #2: they're still using cat unnecessarily. I mention them for a few reasons:
they're more 'natural' for people who aren't entirely comfortable with the I/O redirection facilities of the POSIX shell
there may be cases where cat is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output)
in practice, on modern machines, the added cat in the pipeline is probably of no real consequence.
But I say that last thing with some hesitation. If we examine the last result in 'Edit 5' --
$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...
-- this claims that cat consumed 74% of the CPU during the test; and indeed 1.34/1.83 is approximately 74%. Perhaps a run of:
$ /usr/bin/time wc -l < temp_big_file
would have taken only the remaining .49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from 'disk' (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.
Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time, which would however amount to a smaller fraction of its larger total time.
In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually 'best of 3' results; after priming the cache, of course):
$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)
Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I'm using the shell (bash)'s built-in 'time' command, which is cognizant of the pipeline; and I'm on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime, showing that it can only time the single pipeline element passed to it on its command line. Also, the shell's output gives milliseconds while /usr/bin/time only gives hundredths of a second.
So at the efficiency level of wc -l, the cat makes a huge difference: 407 / 280 = 1.453, or 45.3% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.
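The percentages follow directly from the timing listing. A tiny sketch of the arithmetic, where `percent_more` is an illustrative helper name and the inputs are the 'real' and 'total cpu' figures quoted above:

```cpp
#include <cassert>

// How much larger `measured` is than `baseline`, as a percentage of `baseline`.
double percent_more(double baseline, double measured) {
    assert(baseline > 0.0);
    return (measured / baseline - 1.0) * 100.0;
}
```

For example, `percent_more(0.280, 0.407)` gives the extra wall-clock time of the first pipeline run, and `percent_more(0.280, 0.775)` gives its extra CPU time.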
I should add that there is at least one other significant difference between these styles of testing, and I can't say whether it is a benefit or fault; you have to decide this yourself:
When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.
When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program, or in many cases the I/O libraries of the language in which it was written, may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running cat.
Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to "forgive" the cost of running cat.
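Which of the two cases a given run falls into can be checked from inside the program. A minimal POSIX-only sketch, assuming fstat(2) is available; `describe_fd`, `describe_stdin`, and `check_pipe_detection` are illustrative names, not part of any library:

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Classify what a file descriptor refers to, using POSIX fstat(2).
std::string describe_fd(int fd) {
    struct stat info;
    if (fstat(fd, &info) != 0) return "unknown";  // e.g. invalid descriptor
    if (S_ISREG(info.st_mode)) return "regular file";
    if (S_ISFIFO(info.st_mode)) return "pipe";
    if (S_ISCHR(info.st_mode)) return "character device";
    return "other";
}

// What standard input is attached to for this run.
std::string describe_stdin() {
    return describe_fd(STDIN_FILENO);
}

// Demonstration: the read end of a freshly created pipe is reported as a pipe.
void check_pipe_detection() {
    int fds[2];
    const int rc = pipe(fds);
    assert(rc == 0);
    (void)rc;  // keeps builds with NDEBUG free of unused-variable warnings
    assert(describe_fd(fds[0]) == "pipe");
    close(fds[0]);
    close(fds[1]);
}
```

Run as my_program < big_file, `describe_stdin()` would be expected to report a regular file; under cat big_file | my_program it would be expected to report a pipe.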