Dvector::fancy_read) is slow. Really slow. So I decided to switch to a C implementation keeping the same functionalities.
I quickly did a rough translation of the function into C, using basically the same mechanics (and in particular using Ruby regular expressions for parsing), and I was surprised when I found out that I was only winning a factor of around three in the speed of reading. I was even more surprised to see that the reading is O(n2) (reading 100000 lines is around 100 times slower than reading 100 times 1000 lines !). So, I decided I should try my luck with a profiler.
My first step was naively to compile the library with the
-pg gcc option, but that didn't produce any output file (although it might have been due to the fact that I forgot to add the switch again for linking). I attributed that to the fact that the whole program should be compiled with the switch, and not only the shared library. So I did write a small C wrapper, compiled it, and ran it. It did produce a
gmon.out file, but gprof was unable to give me any interesting information from that. I guess I needed a finer granularity that my own functions, and for that I should have compile Ruby with profiling support. Well. Drop it.
So, I was about to give up when I thought about valgrind. Valgrind also comes with a profiler tool,
callgrind. So, I ended up doing the following:
~ valgrind --tool=callgrind ./fancy_read ~ callgrind_annotate callgrind.out.24425| less
The first command runs the program with valgrind, saving data into a file called something like
callgrind.out.24425. The second parses this file and displays the number of intructions spent in each of the most significative functions. Here is a extract of the output:
306,777,466 ???:__printf_fp [/lib/libc-2.6.1.so] 226,575,122 ???:0x0000000000041C20 [/lib/libc-2.6.1.so] 121,900,026 ???:0x00000000000805F0 [/usr/lib/libruby1.8.so.1.8.6] 98,208,889 ???:__strtod_internal [/lib/libc-2.6.1.so] 90,449,474 ???:0x00000000000482D0 [/lib/libc-2.6.1.so] 87,300,200 ???:vfprintf [/lib/libc-2.6.1.so] 84,197,045 ???:ruby_re_search [/usr/lib/libruby1.8.so.1.8.6] 71,866,550 ???:0x0000000000071D60 [/lib/libc-2.6.1.so] 65,387,136 ???:0x0000000000071380 [/lib/libc-2.6.1.so] 62,784,967 ???:0x000000000004A9F0'2 [/usr/lib/libruby1.8.so.1.8.6] 61,279,027 ???:0x000000000004AC80 [/usr/lib/libruby1.8.so.1.8.6] 48,357,879 ???:malloc [/lib/libc-2.6.1.so] 41,915,985 ???:free [/lib/libc-2.6.1.so] 34,200,000 ???:rb_reg_search [/usr/lib/libruby1.8.so.1.8.6] 31,432,232 ???:ruby_xmalloc [/usr/lib/libruby1.8.so.1.8.6]
This shows that most of the time is spent displaying the data. Normal, half of my program does only that. Then, a fair amount of time is spent in
strtod. Nothing to improve there. Another fair amount is spent in regular expression matching, and then, a significant part of the processing time is actually spent on memory management ! Dreadful ! I guess there's not much more I could do
The conclusion to this is that if you need to profile something, use
valgrind. This is much more powerful than