Profiling memory growth and execution time

Introduction

I've performed call graph profiling, memory profiling, and system call tracing to look for reasons why some of our mappers and harvesters appear to rise steadily in memory allocation, and why at least one of our mappings (Hathi) has periods where it stalls out and stops performing work.

I've produced some files, below, that point, in both the execution wall time and memory usage contexts, to the SPARQL::Client gem (or another gem that it includes) as a subject of interest, and further research.

This work is related to Memory Profiling

Patterns

Run any large mapping or harvest, and you see the process's memory grow steadily over time until the job is finished. Note that the Resque worker, which listens to the queue and starts new jobs, forks a child process which lives for the duration of the job and then exits, freeing its memory.

Run the Hathi mapping and, by tailing the logfile of the Nginx proxy in front of Marmotta, observe how HTTP requests stop occasionally for minutes at a time, and then start up again. When the requests are paused, observe in top that the worker child process is at near 100% CPU. Simultaneously, trace the process with strace and observe that no system calls are being made, as would be the case if it were interacting with disk or network – that is to say, performing "work." After some number of seconds, observe then how the worker child starts making brk() system calls, setting the top of the process's heap to higher and higher memory addresses (allocating memory). Observe top reporting the process growing in size. Finally, when the brk() calls end, observe thousands of stat() system calls that appear to test the existence of files throughout the Krikri and Heidrun installations that match the word "grammar," such as directories named "grammar" and files named "grammar.rb". Thousands of those calls pass by (over 100K sometimes) and then the "work" resumes, where you see system calls for network requests, etc. Then the work stops after a minute or two and it's back to the brk() calls.

Facts and Figures

Though I'm not an expert in performance profiling and working with call graphs, I can use KCachegrind to examine the output of trace files that I created with RubyProf, and I believe that I see IO#select calls taking a significant amount of time (42% of the wall time!), which may be normal, but the closest methods to these calls that are not part of the Ruby standard library are Faraday::Adapter::NetHttp#perform_request and SPARQL::Client#request. See the trace dump file included below and check it out in KCachegrind. Let me know what you think if you have some experience looking at call graphs! (IO#select makes a select() system call, and is used to determine if a file (including a network socket) is capable of being read or written.)

One note on the select() calls: They do not correspond with the Hathi mapper actually stalling out. They occur regularly and frequently throughout a mapping run, every few seconds.

I've run strace -p 17307 on a worker process with PID 17307. I can grep and tail -f this output to see all of the open() and select() calls. There are some select() calls, which visibly pause the output in my terminal, indicating they're taking a decent amount of time, and follow this pattern:

open("/proc/17307/smaps", O_RDONLY|O_CLOEXEC) = 11
select(12, [11], NULL, NULL, {60, 0})   = 1 (in [11], left {59, 983846})

The first argument to select() is the highest file descriptor number plus one. The second argument is an array of file descriptors, in this case including just 11, the number of the smaps file.

There is another select() pattern for network sockets, as follows

socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 11
... various calls to fstat(), connect(), getsockname(), getsockopt(), write() ...
read(11, 0x7f6434d52cc0, 16384)         = -1 EAGAIN (Resource temporarily unavailable)
select(12, [11], NULL, NULL, {60, 0})   = 1 (in [11], left {59, 990628})

So I think this is possibly the IO#select call mentioned above, because it's for a socket.

See the file below, more-system-call-tracing.txt

One very interesting pattern in that file is a loop that I see often with lots of open() and read() calls on /proc/<pid>/smaps, which looks like garbage-collection activity. It doesn't look very efficient, repeatedly opening and reading that file.

I've implemented debug logging in specific files, and I can rule out a few places that are not taking a significant amount of wall time:

The bodies of Krikri::Activity#entity_uris and the block passed to query.each_solution.lazy.map. These took about 3 seconds out of a recent 287 second run. Note that each_solution is getting into SPARQL::Client territory.
The body of Krikri::Activity#run. 8 seconds here, out of 287.
SPARQL::Client's use of Net::HTTP:: Persistent. I was eyeing that module's monkeypatch of Net::BufferedIO in https://github.com/drbrain/net-http-persistent/blob/v2.9.4/lib/net/http/faster.rb as a possible issue, but I observe the same results in profiling if I modify SPARQL::Client to use NET::HTTP instead (which does not override Net::BufferedIO#rbuf_fill).

Similarly, there are notable regions of code that do not allocate significant amounts of memory:

The body of Mapping#process_record
The body of OriginalRecordEntityBehavior#entities
The body of each_solution in Krikri::Activity#entity_uris
Ditto with regard to Net::HTTP:: Persistent, above, also in the context of memory.

All signs so far point to the enumerators, specifically, I think, the RDF::SPARQL::Query that is returned by ProvenanceQueryClient#find_by_activity.

Open Questions and Vexations

I haven't figured out yet what causes the brk() calls (which are just what you see on the kernel end for malloc() calls in C). The trace dump below was generated by RubyProf, which isn't profiling Ruby's C function calls.

I can't tell what in the call graphs corresponds with the stall-out periods in the Hathi mappings. It's possible that something so low-level is happening that it's preventing the execution of the RubyProf code, so RubyProf isn't profiling whatever is causing the process to hang.

After days of being able to run a mapping and see it stall out with brk() calls, I find today (3/10) that I am unable to observe this behavior. Instead, I see it stall, but I see it going through a loop of calling open() on /proc/<the pid>/smaps and then issuing a number of successful read() calls. This is probably part of the Ruby garbage-collection process. I've seen this behavior and switched over to gdb, to observe a number of garbage-collection-related function calls. The thing is, this loop of open() and read() lasts a long time – minutes – with no requests being logged in Nginx's log.

Since 3/9/2016, when I could see the brk() calls, I increased the size of my dev2 VM to 4 Gb memory, and adjusted some kernel memory-related parameters (kernel.shmmax=4080218931, kernel.shmall=1022361, ~~and vm.overcommit_memory=2~~) to try to get MemoryProfiler to function with more than 5 records (with no success). It's possible that this has changed Ruby's behavior. (Note 3/11/2016: vm.overcommit_memory was already at 2 because it's set that way for the 'ingestion_app' inventory group. Running an ingest again, I get the same "new" behavior of open() and read() ... So I don't think the kernel parameters had any effect.)

I've been able to use gdb to examine call stack backtraces of the running worker child process, during periods when Marmotta requests have stalled See the attached file, below. This deserves further analysis, but I have a couple of backtraces that show garbage collection and memory allocation being performed. I have not yet analyzed the calls higher up the stack leading to the memory allocation and garbage collection.

It's interesting that we don't see the mappings stall out with all of our providers. The current issue is getting Hathi running beyond 24K records, but other providers that we've mapped recently have been able to map hundreds of thousands of records in an acceptable amount of time. Does record size, or some other quality of the record, have something to do with it?

We may benefit from some more time spent here to gather more information.

Miscellaneous Notes

MemoryProfiler uses a lot of memory. Normally a worker exits after consuming 450 Mb of resident memory from a 200-record run. With MemoryProfiler wrapped around Krikri::Mapper.map in Krikri::Mapper::Agent#run, the process ends up using almost 2 Gb for a 5-record run. Therefore, we can only profile 5 records being mapped, which may not be realistic. I hope that it reveals something useful that can be acted upon, nevertheless.

Profiling and logging code

See https://github.com/dpla/KriKri/tree/example-mem-prof-201603 for some of the changes that were made to profile the code.

Specifically, https://github.com/dpla/KriKri/blob/example-mem-prof-201603/lib/krikri/mapper.rb

I have another branch that's similar to the one tagged above, in which RubyProf is wrapped around Krikri::Mapper::Agent#run, instead of MemoryProfiler. I could push that, too, but I imagine it's clear how you could implement it.

Files

System calls issued during a mapper run, focusing on where it stalls out for minutes:

MemoryProfiler report, for a 5-record Hathi mapper run:

Profiling trace dump for a 200-record Hathi mapper run (open in KCachegrind). Note that this run includes a stallout described above, where memory was allocated and brk() was called:

Callee map of Krikri::Mapper::Agent#run (from trace dump above):

Call graph of Krikri::Mapper::Agent#run (from trace dump above):

A set of backtraces grabbed from the running worker child process with gdb:

Application logfile excerpts from two mapping runs, showing the difference in execution time for 200 and 800 records:

More system call tracing: