
Closed (moved)
Opened Feb 06, 2018 by Karsten Loesing (@karsten)

Fix any memory problem caused by the number of log files to be imported by the webstats module

This ticket should enable the webstats module to import any number (within reasonable bounds) of log files that are below the 2G limit (cf. comment:12). The handling of larger logs is treated in #25317 (moved).


Initial report: I'm running a modified CollecTor that sanitizes webstats with some tweaks towards bulk importing existing webstats. In particular, it reads files in slices of 10 MiB plus another 2 MiB that overlap with the next slice. I just pushed the changes here.
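
For reference, here is a minimal sketch of that overlapping-slice scheme; the class and method names are hypothetical and not taken from the actual branch:

import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical helper illustrating the 10 MiB + 2 MiB overlap slicing. */
public class SliceReader {

  private static final int SLICE = 10 * 1024 * 1024;   // 10 MiB per slice
  private static final int OVERLAP = 2 * 1024 * 1024;  // 2 MiB shared with the next slice

  /** Returns the bytes of slice {@code index}, i.e. at most 12 MiB. */
  static byte[] readSlice(String path, int index) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      long start = (long) index * SLICE;
      long remaining = raf.length() - start;
      if (remaining <= 0) {
        return new byte[0];
      }
      int toRead = (int) Math.min(SLICE + OVERLAP, remaining);
      byte[] buf = new byte[toRead];
      raf.seek(start);
      raf.readFully(buf);
      return buf;
    }
  }
}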

First of all, the runtime is okay. Not great, but okay. It takes 36 minutes to sanitize 10 MiB. We have 927 MiB of files, so 93 slices, which is going to take ~2.5 days.

However, I ran into an out-of-memory problem at the 6th slice:

2018-02-06 13:30:36,499 INFO o.t.c.w.SanitizeWeblogs:116 Processing 20 logs for dist.torproject.org on archeotrichon.torproject.org.
2018-02-06 13:40:28,968 ERROR o.t.c.c.CollecTorMain:71 The webstats module failed: null
java.lang.OutOfMemoryError: null
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
	at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at org.torproject.collector.webstats.SanitizeWeblogs.findCleanWrite(SanitizeWeblogs.java:127)
	at org.torproject.collector.webstats.SanitizeWeblogs.startProcessing(SanitizeWeblogs.java:91)
	at org.torproject.collector.cron.CollecTorMain.run(CollecTorMain.java:67)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
	at java.lang.StringCoding.encode(StringCoding.java:344)
	at java.lang.StringCoding.encode(StringCoding.java:387)
	at java.lang.String.getBytes(String.java:958)
	at org.torproject.descriptor.log.LogDescriptorImpl.collectionToBytes(LogDescriptorImpl.java:119)
	at org.torproject.descriptor.log.WebServerAccessLogImpl.<init>(WebServerAccessLogImpl.java:72)
	at org.torproject.collector.webstats.SanitizeWeblogs.storeSanitized(SanitizeWeblogs.java:147)
	at org.torproject.collector.webstats.SanitizeWeblogs.lambda$findCleanWrite$3(SanitizeWeblogs.java:127)
	at org.torproject.collector.webstats.SanitizeWeblogs$$Lambda$38/1233367077.accept(Unknown Source)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.concurrent.ConcurrentHashMap$EntrySpliterator.forEachRemaining(ConcurrentHashMap.java:3606)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

I didn't look very closely, but I believe we're running out of memory while writing a sanitized file to disk, in particular while converting the list of strings into a byte array that is then compressed and written out. If that is the case, can we avoid creating that second "copy" of the file in memory and instead write the lines to the file directly?
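
A minimal sketch of what such a streaming write could look like, assuming the sanitized lines are available as an Iterable; it uses the JDK's GZIPOutputStream only to keep the example self-contained, so the compression format and the helper name are assumptions rather than the actual CollecTor code:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: stream lines straight into a compressing stream
 *  instead of first materializing one big byte[] in memory. */
public class StreamingWriter {

  static void writeCompressed(Iterable<String> lines, String outPath)
      throws IOException {
    try (Writer writer = new OutputStreamWriter(
        new GZIPOutputStream(
            new BufferedOutputStream(new FileOutputStream(outPath))),
        StandardCharsets.UTF_8)) {
      for (String line : lines) {
        writer.write(line);   // each line is compressed as it is written
        writer.write('\n');   // no second in-memory copy of the whole log
      }
    }
  }
}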

Or is this just the operation at which we happen to run out of memory because of state accumulated over time, so that fixing it here would only mean failing somewhere else shortly after?

Should I instead let each module run sanitize a single slice and then exit, record which slice has been processed, and run CollecTor in an endless loop? Or something along those lines that at least lets all the garbage be collected between slices?
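
A minimal sketch of that one-slice-per-run idea, assuming a small state file and an external wrapper (cron or a shell loop) that restarts the process; all names here are hypothetical:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical driver: process exactly one slice per JVM invocation and
 *  record progress, so every slice starts with a fresh heap. */
public class SliceDriver {

  public static void main(String[] args) throws IOException {
    Path stateFile = Paths.get("webstats-slice.state");
    int nextSlice = Files.exists(stateFile)
        ? Integer.parseInt(new String(Files.readAllBytes(stateFile),
            StandardCharsets.UTF_8).trim())
        : 0;
    // processSlice(nextSlice);  // placeholder for the actual sanitizing step
    Files.write(stateFile,
        Integer.toString(nextSlice + 1).getBytes(StandardCharsets.UTF_8));
    // Exiting here frees the entire heap; the wrapper decides when to stop.
  }
}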

(Note that I still need to check the output and whether that looks okay across slices. Doing that now, unrelated to the issue at hand.)

Reference: legacy/trac#25161