BETA: As of November 14, 2017, this document is still under discussion and subject to change without prior notice. Feel free to contact us for questions or concerns regarding this document.
Tor's web servers, like most web servers, keep request logs for maintenance and informational purposes.
However, unlike most other web servers, Tor's web servers use a privacy-aware log format that avoids logging overly sensitive data about their users.
Also unlike most other web server logs, Tor's logs are neither archived nor analyzed until a number of post-processing steps have further reduced any privacy-sensitive parts.
This document describes 1) meta-data contained in log file names written by Tor's web servers, 2) the privacy-aware log format used in these files, and 3) subsequent sanitizing steps that are applied before archiving and analyzing these log files.
Because our current implementation relies on them, this document also describes the naming conventions for the input log files. That description only reflects the current state and is subject to change.
As a convention for this document, all format strings conform to the format strings used by Apache's mod_log_config module.
Log files have meta-data that is not part of the file's contents, in particular, the names of the virtual and physical hosts.
All access log files written by Tor's web servers follow the naming convention <virtual-host>-access.log-YYYYMMDD, where "YYYYMMDD" is the date of the rotation and finalization of the log file, which is not used in the further sanitizing process. The "access.log" part serves as a marker for web server access logs.
The virtual hostname can be inferred from the input log's name, whereas the physical hostname needs to be provided by other means. Currently, log files are made available to the sanitizer in a separate directory per physical web server host. Log files are typically gz-compressed, which is indicated by appending ".gz" to log file names, but this is subject to change. Files with an unknown compression type are discarded (currently ".xz", ".gz", and ".bz2" are recognized). Overall, the sanitizer expects log files to use the following path format:
As a first safeguard against publishing log files that are too sensitive, we discard all files that do not match the naming convention for access logs. This prevents, for example, error logs from slipping through.
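The file name check described above can be sketched as follows. This is only an illustration, not the sanitizer's actual code: the regular expression and the function name parse_input_name are assumptions derived from the stated convention <virtual-host>-access.log-YYYYMMDD and the three recognized compression suffixes.

```python
import re

# Assumed pattern: <virtual-host>-access.log-YYYYMMDD, optionally followed by
# one of the recognized compression suffixes (.xz, .gz, .bz2).
INPUT_NAME = re.compile(
    r"(?P<vhost>.+)-access\.log-(?P<date>\d{8})(?P<ext>\.(?:xz|gz|bz2))?"
)


def parse_input_name(filename):
    """Return (virtual_host, rotation_date, compression), or None to discard.

    Hypothetical helper: anything that is not an access log, or that carries
    an unrecognized suffix, does not match and is therefore dropped.
    """
    match = INPUT_NAME.fullmatch(filename)
    if match is None:
        return None
    ext = match.group("ext")
    compression = ext[1:] if ext else None
    return match.group("vhost"), match.group("date"), compression
```

With this sketch, an error log or a file with an unrecognized suffix yields None, so the caller can skip it before reading any content.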
Tor's Apache web servers are configured to write log files that extend Apache's Combined Log Format with a couple of tweaks towards privacy. For example, the following Apache configuration lines were in use at the time of writing (subject to change):
The main difference from Apache's Common Log Format is that request IP addresses are removed and the field is instead used to encode whether the request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the site's onion service (0.0.0.2).
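A log consumer can recover the transport from the first field. The following is a minimal sketch; the dictionary and the function name transport_of are hypothetical, and only the three address values come from this document.

```python
# Placeholder addresses written in place of the client IP address.
TRANSPORT_BY_ADDRESS = {
    "0.0.0.0": "http",
    "0.0.0.1": "https",
    "0.0.0.2": "onion",
}


def transport_of(log_line):
    """Return 'http', 'https', or 'onion', or None for any other first field."""
    address = log_line.split(" ", 1)[0]
    return TRANSPORT_BY_ADDRESS.get(address)
```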
Tor's web servers are configured to use UTC as timezone, which is also highly recommended when rewriting request times to "00:00:00" in order for the subsequent sanitizing steps to work correctly. Alternatively, if the system timezone is not set to UTC, web servers should keep request times unchanged and let them be handled by the subsequent sanitizing steps.
Tor's web servers are configured to rotate logs at least once per day, which does not necessarily happen at 00:00:00 UTC. As a result, log files may contain requests from up to two UTC days and several log files may contain requests that have been started on the same UTC day.
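Because one input file can span two UTC days, a consumer has to derive the UTC day from each line rather than from the file name. The sketch below shows one way to do that, assuming the time field has the usual Apache %t shape ("[dd/Mon/yyyy:HH:MM:SS +zzzz]") and sits in the fourth and fifth space-separated positions. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def utc_date(time_field):
    """Convert an Apache %t value to the UTC date of the request.

    Month names are parsed with %b, which assumes the default C locale.
    """
    parsed = datetime.strptime(time_field, "[%d/%b/%Y:%H:%M:%S %z]")
    return parsed.astimezone(timezone.utc).date()


def split_by_utc_day(lines):
    """Group log lines by the UTC day on which the request started."""
    by_day = defaultdict(list)
    for line in lines:
        parts = line.split(" ", 5)
        # parts[3] is "[dd/Mon/yyyy:HH:MM:SS" and parts[4] is "+zzzz]".
        by_day[utc_date(parts[3] + " " + parts[4])].append(line)
    return dict(by_day)
```

A request logged at 23:30:00 with offset -0500 falls on the following UTC day, which is why a non-UTC server should leave request times unchanged instead of truncating them to "00:00:00".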
The request logs written by Tor's web servers still contain more detail than we are comfortable publishing. Therefore, we apply several sanitizing steps to these log files before making them public and analyzing them ourselves. Some of these steps could also be performed directly by Apache, but others can only be performed with a delay.
Log files are expected to contain exactly one request per line. We process these files line by line and discard any lines not matching the following criteria:
Any lines not meeting all these criteria will be discarded, and processing continues with the next line.
In addition, log lines are treated differently according to the date they contain:
All matching lines, which are already checked to match Apache's Common Log Format ("%h %l %u %t \"%r\" %>s %b"), are rewritten following these rules:
Any columns beyond those of Apache's Common Log Format are discarded.
The result is intended to remain fully compatible with the Common Log Format, so that it can be processed by any tool capable of processing that format.
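The Common Log Format check and the removal of extra columns can be sketched as follows. The regular expression is an assumed approximation of "%h %l %u %t \"%r\" %>s %b", and to_common_log is a hypothetical name; the field-level rewrite rules of the sanitizer are not reproduced here.

```python
import re

# Approximate Common Log Format matcher; anything after the %b field
# (for example referrer or user agent columns) is matched but not kept.
COMMON_LOG = re.compile(
    r'(?P<h>\S+) (?P<l>\S+) (?P<u>\S+) \[(?P<t>[^\]]+)\] '
    r'"(?P<r>(?:[^"\\]|\\.)*)" (?P<s>\d{3}) (?P<b>\d+|-)'
    r'(?: .*)?'
)


def to_common_log(line):
    """Return the line cut down to its seven Common Log Format fields.

    Returns None for lines that do not match, so the caller can discard them.
    """
    match = COMMON_LOG.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    return '{h} {l} {u} [{t}] "{r}" {s} {b}'.format(**match.groupdict())
```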
Rewritten log lines are re-assembled into sanitized log files based on physical host, virtual host, and request start date.
All rewritten log lines are sorted alphabetically, so that request order cannot be inferred from sanitized log files.
Many of the sanitized log lines will now be identical. We keep these identical lines rather than collapsing them, so that request counts are preserved and typical web log analyzers can operate on the sanitized log files.
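Re-assembly, sorting, and the retention of duplicates can be illustrated with a short sketch. The function name assemble, the tuple layout of its input, and the use of Python's default string ordering for "alphabetically" are assumptions made for this example.

```python
from collections import defaultdict


def assemble(records):
    """Group rewritten lines into sanitized files and sort each file.

    records: iterable of (physical_host, virtual_host, date, line) tuples.
    Returns a dict keyed by (physical_host, virtual_host, date).
    """
    files = defaultdict(list)
    for physical_host, virtual_host, date, line in records:
        files[(physical_host, virtual_host, date)].append(line)
    # sorted() keeps identical lines; only the original request order is lost.
    return {key: sorted(lines) for key, lines in files.items()}
```

Keeping the lines in a list rather than a set is what preserves duplicates, and with them the number of requests per resource.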
The naming convention for sanitized log files is:
The underscore is a separator symbol between the various parts of the filename.
Sanitized log files may additionally be sorted into directories by virtual host and date as in:
The virtual hostnames, like 'metrics.torproject.org' or 'dist.torproject.org', are more familiar to the public and were therefore chosen to be the first naming component.
Sanitized log files are typically compressed before publication. Because sorting places identical and similar lines next to each other, it also allows for high compression ratios. We typically use XZ for compression, which is indicated by appending ".xz" to log file names, but this is subject to change.
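As an illustration of the compression step, the following sketch compresses a list of sanitized lines into XZ data using Python's standard lzma module. The function name compress_sanitized and the choice of UTF-8 encoding are assumptions; the tooling actually used for publication may differ.

```python
import lzma


def compress_sanitized(lines):
    """Join sanitized lines with newlines and return XZ-compressed bytes."""
    data = "".join(line + "\n" for line in lines).encode("utf-8")
    return lzma.compress(data, format=lzma.FORMAT_XZ)
```

The result can be written to a file whose name ends in ".xz" and read back with lzma.decompress or any XZ-capable tool.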
This material is supported in part by the National Science Foundation under Grant No. CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. "Tor" and the "Onion Logo" are registered trademarks of The Tor Project, Inc. Data on this site is freely available under a CC0 no copyright declaration: To the extent possible under law, the Tor Project has waived all copyright and related or neighboring rights in the data. Graphs are licensed under a Creative Commons Attribution 3.0 United States License.