Tor Metrics Portal: Data Formats


Statistical analysis in the Tor network can be performed using various kinds of data. This page gives an overview of three major data sources for statistics in the Tor network:

  1. First, we recap measuring the Tor network from public directory information (PDF) by describing the data format of server descriptors and network statuses, and we explain the sanitization process of (non-public) bridge directory information.
  2. Second, we describe the numerous aggregate statistics that relays publish about their usage (PDF), including byte histories, directory request statistics, connecting client statistics, bridge user statistics, cell-queue statistics, exit-port statistics, and bidirectional connection use.
  3. Third, we delineate the output of various Tor services such as BridgeDB and Tor Check, as well as specific measurement tools such as Torperf.

All data described on this page are available for download on the data page. This page is based on a technical report (PDF) and is very likely more recent than the report.



Descriptor types


Any file containing descriptors described on this page may contain metadata in its first text line using the format @type $descriptortype $major.$minor. A tool that processes these descriptors can safely parse files with a known descriptor type and the same major version number, should not parse files with a known descriptor type and a higher major version number, and parses files without metadata or with an unknown descriptor type at its own risk.
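
The parsing rule above can be sketched in a few lines of Python. This is only an illustration: the SUPPORTED_MAJOR table and the check_annotation function are hypothetical names, the table lists only the two type names that appear in examples on this page, and the treatment of lower major versions is an assumption because the rule does not cover that case.

```python
import re

# Hypothetical registry: descriptor type -> major version this tool supports.
# Only the two type names that appear in the examples on this page are listed.
SUPPORTED_MAJOR = {
    "bridge-server-descriptor": 1,
    "bridge-network-status": 1,
}

TYPE_LINE = re.compile(r"^@type (\S+) (\d+)\.(\d+)$")


def check_annotation(first_line):
    """Classify a file by its first line: 'safe', 'own-risk', or 'do-not-parse'."""
    match = TYPE_LINE.match(first_line.strip())
    if match is None:
        return "own-risk"  # no @type line at all
    descriptor_type, major = match.group(1), int(match.group(2))
    supported = SUPPORTED_MAJOR.get(descriptor_type)
    if supported is None:
        return "own-risk"  # unknown descriptor type
    if major == supported:
        return "safe"  # known type, same major version
    if major > supported:
        return "do-not-parse"  # known type, higher major version
    return "own-risk"  # assumption: lower major versions are not covered by the rule


print(check_annotation("@type bridge-network-status 1.0"))
```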

The following descriptor types and versions are known. Gray entries are deprecated, black entries are recent:



Server descriptors and network statuses


Relays in the Tor network report their capabilities by publishing server descriptors to the directory authorities. The directory authorities confirm reachability of relays and assign flags to help clients make good path selections. Every hour, the directory authorities publish a network status consensus with all known running relays at the time. Both server descriptors and network statuses constitute a solid data basis for statistical analysis in the Tor network. We described the approach to measure the Tor network from public directory information in the HotPETS 2009 paper (PDF) and provide interactive graphs on the metrics website. We briefly describe the most interesting pieces of the two descriptor formats that can be used for statistics.

The server descriptors published by relays at least once every 18 hours contain the necessary information for clients to build circuits using a given relay. These server descriptors can also be useful for statistical analysis of the Tor network infrastructure.

We assume that the majority of server descriptors are correct. But when performing statistical analysis on server descriptors, one has to keep in mind that only a small subset of the information written to server descriptors is confirmed by the trusted directory authorities. In theory, relays can provide false information in their server descriptors, even though the incentive to do so is probably low.

Server descriptor published by relay blutmagie (without cryptographic keys and hashes):

router blutmagie 192.251.226.206 443 0 80
platform Tor 0.2.2.20-alpha on Linux x86_64
opt protocols Link 1 2 Circuit 1
published 2010-12-27 14:35:27
opt fingerprint 6297 B13A 687B 521A 59C6 BD79 188A 2501 EC03 A065
uptime 445412
bandwidth 14336000 18432000 15905178
opt extra-info-digest 5C1D5D6F8B243304079BC15CD96C7FCCB88322D4
opt caches-extra-info
onion-key
[...]
signing-key
[...]
family $66CA87E164F1CFCE8C3BB5C095217A28578B8BAF $67EC84376D9C4C467DCE8621AACA109160B5264E $7B698D327F1695590408FED95CDEE1565774D136
opt hidden-service-dir
contact abuse@blutmagie.de
reject 0.0.0.0/8:*
reject 169.254.0.0/16:*
reject 127.0.0.0/8:*
reject 192.168.0.0/16:*
reject 10.0.0.0/8:*
reject 172.16.0.0/12:*
reject 192.251.226.206:*
reject *:25
reject *:119
reject *:135-139
reject *:445
reject *:465
reject *:563
reject *:587
reject *:1214
reject *:4661-4666
reject *:6346-6429
reject *:6660-6999
accept *:*
router-signature
[...]

The document above shows an example server descriptor. The following data fields in server descriptors may be relevant to statistical analysis:

These are just a subset of the fields in a server descriptor that seem relevant for statistical analysis. For a complete list of fields in server descriptors, see the directory protocol specification.
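
As an illustration, a few of the lines in the example descriptor can be extracted with a small Python sketch. It assumes the argument order given in the directory protocol specification (router: nickname, address, ORPort, SOCKSPort, DirPort; bandwidth: average, burst, and observed bandwidth in bytes per second). The function name and the dictionary keys are hypothetical, and the sketch does not verify signatures or handle multi-line objects.

```python
SAMPLE = """\
router blutmagie 192.251.226.206 443 0 80
platform Tor 0.2.2.20-alpha on Linux x86_64
published 2010-12-27 14:35:27
opt fingerprint 6297 B13A 687B 521A 59C6 BD79 188A 2501 EC03 A065
uptime 445412
bandwidth 14336000 18432000 15905178
"""


def parse_server_descriptor(text):
    """Collect a handful of statistics-related fields from a server descriptor."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "opt":  # "opt" only marks optional keywords
            parts = parts[1:]
        if not parts:
            continue
        keyword, args = parts[0], parts[1:]
        if keyword == "router":
            fields["nickname"], fields["address"] = args[0], args[1]
            fields["or_port"], fields["dir_port"] = int(args[2]), int(args[4])
        elif keyword == "platform":
            fields["platform"] = " ".join(args)
        elif keyword == "fingerprint":
            fields["fingerprint"] = "".join(args)
        elif keyword == "uptime":
            fields["uptime_seconds"] = int(args[0])
        elif keyword == "bandwidth":
            average, burst, observed = (int(value) for value in args[:3])
            fields["bandwidth_average"] = average
            fields["bandwidth_burst"] = burst
            fields["bandwidth_observed"] = observed
    return fields


descriptor = parse_server_descriptor(SAMPLE)
print(descriptor["nickname"], descriptor["bandwidth_observed"])
```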

Every hour, the directory authorities publish a new network status that contains a list of all running relays. The directory authorities confirm reachability of the contained relays and assign flags based on the relays' characteristics. The entries in a network status reference the last published server descriptor of a relay.

The network statuses are relevant for statistical analysis, because they constitute trusted snapshots of the Tor network. Anyone can publish as many server descriptors as they want, but only the directory authorities can confirm that a relay was running at a given time. Most statistics on the Tor network infrastructure rely on network statuses and possibly combine them with the referenced server descriptors. The document below shows the network status entry referencing the server descriptor above. In addition to the reachability information, network statuses contain the following fields that may be relevant for statistical analysis:

Network status entry of relay blutmagie:

r blutmagie YpexOmh7UhpZxr15GIolAewDoGU lFY7WmD/yvVFp9drmZzNeTxZ6dw 2010-12-27 14:35:27 192.251.226.206 443 80
s Exit Fast Guard HSDir Named Running Stable V2Dir Valid
v Tor 0.2.2.20-alpha
w Bandwidth=30800
p reject 25,119,135-139,445,465,563,587,1214,4661-4666,6346-6429,6660-6999
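
A minimal Python sketch of how the r and s lines of this entry could be read. It assumes the second and third arguments of the r line are base64 values with the trailing padding removed, and that the r line always has the nine tokens seen in this example; the function names and dictionary keys are hypothetical. Decoding the identity field yields the same fingerprint that the server descriptor above reports.

```python
import base64
from datetime import datetime


def b64_to_hex(value):
    """Decode an unpadded base64 string and return upper-case hex."""
    padded = value + "=" * (-len(value) % 4)
    return base64.b64decode(padded).hex().upper()


def parse_r_line(line):
    """Split an 'r' line into its fields (layout assumed from the example)."""
    parts = line.split()
    if len(parts) != 9 or parts[0] != "r":
        raise ValueError("unexpected r line: %r" % line)
    return {
        "nickname": parts[1],
        "fingerprint": b64_to_hex(parts[2]),
        "descriptor_digest": b64_to_hex(parts[3]),
        "published": datetime.strptime(parts[4] + " " + parts[5],
                                       "%Y-%m-%d %H:%M:%S"),
        "address": parts[6],
        "or_port": int(parts[7]),
        "dir_port": int(parts[8]),
    }


def parse_s_line(line):
    """Return the set of flags listed on an 's' line."""
    return set(line.split()[1:])


R_LINE = ("r blutmagie YpexOmh7UhpZxr15GIolAewDoGU lFY7WmD/yvVFp9drmZzNeTxZ6dw "
          "2010-12-27 14:35:27 192.251.226.206 443 80")
entry = parse_r_line(R_LINE)
flags = parse_s_line("s Exit Fast Guard HSDir Named Running Stable V2Dir Valid")
print(entry["fingerprint"], sorted(flags))
```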



Sanitized bridge descriptors


Bridges in the Tor network publish server descriptors to the bridge authority which in turn generates a bridge network status. We cannot, however, make the bridge server descriptors and bridge network statuses available for statistical analysis as we do with the relay server descriptors and relay network statuses. The problem is that bridge server descriptors and network statuses contain bridge IP addresses and other sensitive information that shall not be made publicly available. We therefore sanitize bridge descriptors by removing all potentially identifying information and publish sanitized versions of the descriptors. The processing steps for sanitizing bridge descriptors are as follows:
  1. Replace the bridge identity with its SHA1 value: Clients can request a bridge's current descriptor by sending its identity string to the bridge authority. This is a feature to make bridges on dynamic IP addresses useful. Therefore, the original identities (and anything that could be used to derive them) need to be removed from the descriptors. The bridge identity is replaced with its SHA1 hash value. The idea is to have a consistent replacement that remains stable over months or even years (without keeping a secret for a keyed hash function).
  2. Remove all cryptographic keys and signatures: It would be straightforward to learn about the bridge identity from the bridge's public key. Replacing keys by newly generated ones seemed to be unnecessary (and would involve keeping a state over months/years), so that all cryptographic objects have simply been removed.
  3. Replace IP address with IP address hash: Of course, IP addresses need to be sanitized, too.
    • IPv4 addresses are replaced with 10.x.x.x with x.x.x being the 3 byte output of H(IP address | bridge identity | secret)[:3]. The input IP address is the 4-byte long binary representation of the bridge's current IP address. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is a 31-byte long secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:3] operator means that we pick the 3 most significant bytes of the result.
    • IPv6 addresses are replaced with [fd9f:2e19:3bcf::xx:xxxx] with xx:xxxx being the hex-formatted 3 byte output of a similar hash function as described for IPv4 addresses. The only differences are that the input IP address is 16 bytes long and the secret is only 19 bytes long.
  4. Replace contact information: If there is contact information in a descriptor, the contact line is changed to somebody.
  5. Remove pluggable transport addresses and arguments: Bridges may provide transports in addition to the onion-routing protocol and include information about these transports in their extra-info descriptors for BridgeDB. In that case, any IP addresses, TCP ports, or additional arguments are removed, only leaving in the supported transport names.
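
Steps 1 and 3 can be sketched in Python as follows. This is not the code of the metrics data processor. It assumes that "|" in the formula denotes byte concatenation and that the SHA1 value in step 1 is computed over the 20-byte binary fingerprint. The function names, the example address, and the all-zero fingerprint are made up for illustration.

```python
import hashlib
import ipaddress
import os


def sanitize_fingerprint(fingerprint_hex):
    """Step 1: replace the identity with its SHA1 value.

    Assumption: the hash input is the 20-byte binary fingerprint.
    """
    return hashlib.sha1(bytes.fromhex(fingerprint_hex)).hexdigest().upper()


def sanitize_ipv4(address, fingerprint_hex, secret):
    """Step 3 (IPv4): 10.x.x.x from H(IP address | bridge identity | secret)[:3]."""
    ip_bytes = ipaddress.IPv4Address(address).packed  # 4 bytes
    identity = bytes.fromhex(fingerprint_hex)         # 20 bytes
    if len(identity) != 20 or len(secret) != 31:
        raise ValueError("expected 20-byte identity and 31-byte secret")
    digest = hashlib.sha256(ip_bytes + identity + secret).digest()
    return "10." + ".".join(str(byte) for byte in digest[:3])


# Made-up inputs: a documentation-range address and an all-zero fingerprint.
monthly_secret = os.urandom(31)  # would be kept for the whole month
print(sanitize_fingerprint("00" * 20))
print(sanitize_ipv4("198.51.100.7", "00" * 20, monthly_secret))
```

Because the secret is part of the hash input, the same bridge at the same address maps to the same 10.x.x.x address within a month and to a different one once the secret changes.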

Apart from these processing steps, sanitized bridge server descriptors follow the same format as relay server descriptors. The same applies to sanitized bridge extra-info descriptors. Sanitized bridge network statuses are similar to version 2 relay network statuses, but with only a published line in the header and without any lines in the footer.

The two documents below show an example bridge server descriptor that is referenced from a bridge network status. For more details about this process, see the metrics data processor software.

Sanitized bridge server descriptor:

@type bridge-server-descriptor 1.0
router Hawthorne 10.175.105.22 443 0 0
platform Tor 0.2.2.19-alpha (git-1988927edecce4c7) on Linux i686
opt protocols Link 1 2 Circuit 1
published 2010-12-27 18:55:01
opt fingerprint A5FA 7F38 B02A 415E 72FE 614C 64A1 E5A9 2BA9 9BBD
uptime 2347112
bandwidth 5242880 10485760 1016594
opt extra-info-digest E729BCB5E06A5657A73151B55354EB003D2BAE0F
opt hidden-service-dir
contact somebody
reject *:*
router-digest 46DFDBE7B67B7C90A1962B0B5AA4526FAF406979

Sanitized bridge network status:

@type bridge-network-status 1.0
published 2010-12-27 22:07:03
[...status entries...]
r Hawthorne pfp/OLAqQV5y/mFMZKHlqSupm70 Rt/b57Z7fJChlisLWqRSb69AaXk 2010-12-27 18:55:01 10.175.105.22 443 0
s Fast Guard HSDir Running Stable Valid
[...status entries...]



Byte histories


Relays include aggregate statistics in their descriptors that they upload to the directory authorities. These aggregate statistics are contained in extra-info descriptors that are published alongside server descriptors. Extra-info descriptors are not required for clients to build circuits. An extra-info descriptor belonging to a server descriptor is referenced by its SHA1 hash value.

Byte histories were the first statistical data that relays published about their usage. Relays report the number of written and read bytes in 15-minute intervals throughout the last 24 hours. The extra-info descriptor in the document below contains the byte histories in the two lines starting with write-history and read-history. More details about these statistics can be found in the directory protocol specification.

Extra-info descriptor published by relay blutmagie (without cryptographic signature and with long lines being truncated):

extra-info blutmagie 6297B13A687B521A59C6BD79188A2501EC03A065
published 2010-12-27 14:35:27
write-history 2010-12-27 14:34:05 (900 s) 12902389760,12902402048,12859373568,12894131200,[...]
read-history 2010-12-27 14:34:05 (900 s) 12770249728,12833485824,12661140480,12872439808,[...]
dirreq-write-history 2010-12-27 14:26:13 (900 s) 51731456,60808192,56740864,54948864,[...]
dirreq-read-history 2010-12-27 14:26:13 (900 s) 4747264,4767744,4511744,4752384,[...]
dirreq-stats-end 2010-12-27 10:51:09 (86400 s)
dirreq-v3-ips us=2000,de=1344,fr=744,kr=712,[...]
dirreq-v2-ips ??=8,au=8,cn=8,cz=8,[...]
dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800,[...]
dirreq-v2-reqs id=48,??=8,au=8,cn=8,[...]
dirreq-v3-resp ok=12504,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=0,busy=128
dirreq-v2-resp ok=64,unavailable=0,not-found=8,not-modified=0,busy=8
dirreq-v2-share 1.03%
dirreq-v3-share 1.03%
dirreq-v3-direct-dl complete=316,timeout=4,running=0,min=4649,d1=36436,d2=68056,q1=76600,d3=87891,d4=131294,md=173579,d6=229695,d7=294528,q3=332053,d8=376301,d9=530252,max=2129698
dirreq-v2-direct-dl complete=16,timeout=52,running=0,min=9769,d1=9769,d2=9844,q1=9981,d3=9981,d4=27297,md=33640,d6=60814,d7=205884,q3=205884,d8=361137,d9=628256,max=956009
dirreq-v3-tunneled-dl complete=12088,timeout=92,running=4,min=534,d1=31351,d2=49166,q1=58490,d3=70774,d4=88192,md=109778,d6=152389,d7=203435,q3=246377,d8=323837,d9=559237,max=26601000
dirreq-v2-tunneled-dl complete=0,timeout=0,running=0
entry-stats-end 2010-12-27 10:51:09 (86400 s)
entry-ips de=11024,us=10672,ir=5936,fr=5040,[...]
exit-stats-end 2010-12-27 10:51:09 (86400 s)
exit-kibibytes-written 80=6758009,443=498987,4000=227483,5004=1182656,11000=22767,19371=1428809,31551=8212,41500=965584,51413=3772428,56424=1912605,other=175227777
exit-kibibytes-read 80=197075167,443=5954607,4000=1660990,5004=1808563,11000=1893893,19371=130360,31551=7588414,41500=756287,51413=2994144,56424=1646509,other=288412366
exit-streams-opened 80=5095484,443=359256,4000=4508,5004=22288,11000=124,19371=24,31551=40,41500=96,51413=16840,56424=28,other=1970964
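
The write-history and read-history lines in the descriptor above could be read with a sketch like the following. It assumes the layout seen in the example (keyword, timestamp, interval length in seconds in parentheses, comma-separated byte counts). The function name is hypothetical, and the sample line contains only the four values that are not truncated in the example.

```python
import re
from datetime import datetime

HISTORY_LINE = re.compile(
    r"^(\S+) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+) s\) (\S+)$")


def parse_history(line):
    """Parse a byte history line into keyword, timestamp, interval, and counts."""
    match = HISTORY_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a history line: %r" % line)
    keyword, timestamp, interval, values = match.groups()
    return {
        "keyword": keyword,
        "timestamp": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
        "interval_seconds": int(interval),
        "bytes": [int(value) for value in values.split(",")],
    }


# Only the first four values of the example line are used here.
LINE = ("write-history 2010-12-27 14:34:05 (900 s) "
        "12902389760,12902402048,12859373568,12894131200")
history = parse_history(LINE)
# Average bytes per second in each 15-minute interval.
rates = [count / history["interval_seconds"] for count in history["bytes"]]
print(history["keyword"], [round(rate) for rate in rates])
```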



Directory requests


The directory authorities and directory mirrors report statistical data about processed directory requests. Starting with Tor version 0.2.2.15-alpha, all directories report the number of written and read bytes for answering directory requests. The format is similar to the format of byte histories as described in the previous section. The relevant lines are dirreq-write-history and dirreq-read-history in the document listed in the Byte histories section above. These two lines contain the subset of total read and written bytes that the directory mirror spent on responding to any kind of directory request, including network statuses, server descriptors, extra-info descriptors, authority certificates, etc.

The directories further report statistics on answering directory requests for network statuses only. For Tor versions before 0.2.3.x, relay operators had to manually enable these statistics, which is why only a few directories report them. The lines starting with dirreq-v3- all belong to the directory request statistics (the lines starting with dirreq-v2- report similar statistics for version 2 of the directory protocol which is deprecated at the time of writing this report). The following fields may be relevant for statistical analysis:

More details about these statistics can be found in the directory protocol specification.
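
As a small illustration, the dirreq-v3-resp line from the example extra-info descriptor can be turned into per-status shares. The function name is hypothetical; the sketch only assumes the comma-separated key=value layout visible in the example.

```python
def parse_response_counts(line):
    """Return (keyword, {status: count}) for a dirreq-*-resp line."""
    keyword, _, rest = line.strip().partition(" ")
    counts = {}
    for item in rest.split(","):
        status, _, value = item.partition("=")
        counts[status] = int(value)
    return keyword, counts


LINE = ("dirreq-v3-resp ok=12504,not-enough-sigs=0,unavailable=0,"
        "not-found=0,not-modified=0,busy=128")
keyword, counts = parse_response_counts(LINE)
total = sum(counts.values())
for status, count in counts.items():
    if count:
        print("%s: %.1f%% of %d responses" % (status, 100.0 * count / total, total))
```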



Connecting clients


Relays can be configured to report per-country statistics on directly connecting clients. This metric includes clients connecting to a relay in order to build circuits and clients creating a 1-hop circuit to request directory information. In practice, the latter number outweighs the former number. The entry-ips line in the document listed in the Byte histories section above shows the number of unique IP addresses connecting to the relay by country. More details about these statistics can be found in the directory protocol specification.



Bridge users


Bridges report statistics on connecting bridge clients in their extra-info descriptors. The document below shows a bridge extra-info descriptor with the bridge user statistics in the bridge-ips line.

Sanitized bridge extra-info descriptor:

extra-info Unnamed A5FA7F38B02A415E72FE614C64A1E5A92BA99BBD
published 2010-12-27 18:55:01
write-history 2010-12-27 18:43:50 (900 s) 151712768,176698368,180030464,163150848,[...]
read-history 2010-12-27 18:43:50 (900 s) 148109312,172274688,172168192,161094656,[...]
bridge-stats-end 2010-12-27 14:56:29 (86400 s)
bridge-ips sa=48,us=40,de=32,ir=32,[...]

Bridges running Tor version 0.2.2.3-alpha or earlier report bridge users in a similar line starting with geoip-client-origins. The reason for switching to bridge-ips was that the measurement interval in geoip-client-origins had a variable length, whereas the measurement interval in 0.2.2.4-alpha and later is set to exactly 24 hours. In order to clearly distinguish the new measurement intervals from the old ones, the new keywords have been introduced. More details about these statistics can be found in the directory protocol specification.
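
A hedged sketch of how bridge-ips lines from several descriptors might be added up per country. The function names are hypothetical. The first sample line holds the four values visible in the example above, and the second line is invented for illustration.

```python
from collections import Counter


def parse_country_counts(line):
    """Return (keyword, {country_code: count}) for a line such as bridge-ips."""
    keyword, _, rest = line.strip().partition(" ")
    counts = {}
    for item in rest.split(","):
        country, _, value = item.partition("=")
        counts[country] = int(value)
    return keyword, counts


def total_by_country(lines):
    """Sum the bridge-ips counts of several descriptors per country."""
    totals = Counter()
    for line in lines:
        keyword, counts = parse_country_counts(line)
        if keyword == "bridge-ips":
            totals.update(counts)
    return totals


LINES = [
    "bridge-ips sa=48,us=40,de=32,ir=32",  # values shown in the example above
    "bridge-ips us=16,de=8",               # invented second bridge
]
for country, count in total_by_country(LINES).most_common():
    print(country, count)
```

Each bridge counts unique addresses on its own, so a client that uses several bridges would appear more than once in such a sum.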



Cell-queue statistics


Relays can be configured to report aggregate statistics on their cell queues. These statistics include average processed cells, average number of queued cells, and average time that cells spend in circuits. Circuits are split into deciles based on the number of processed cells. The statistics are provided for circuit deciles from loudest to quietest circuits. The document below shows the cell statistics contained in an extra-info descriptor by relay gabelmoo. An early analysis of cell-queue statistics can be found in a tech report (PDF). More details about these statistics can be found in the directory protocol specification.

Cell statistics in extra-info descriptor by relay gabelmoo:

cell-stats-end 2010-12-27 09:59:50 (86400 s)
cell-processed-cells 4563,153,42,15,7,7,6,5,4,2
cell-queued-cells 9.39,0.98,0.09,0.01,0.00,0.00,0.00,0.01,0.00,0.01
cell-time-in-queue 2248,807,277,92,49,22,52,55,81,148
cell-circuits-per-decile 7233
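
The cell statistics above can be arranged per decile with a short sketch. The function name is hypothetical; the sketch assumes that each of the three list-valued lines carries ten comma-separated values, one per circuit decile, ordered from loudest to quietest.

```python
LINES = [
    "cell-processed-cells 4563,153,42,15,7,7,6,5,4,2",
    "cell-queued-cells 9.39,0.98,0.09,0.01,0.00,0.00,0.00,0.01,0.00,0.01",
    "cell-time-in-queue 2248,807,277,92,49,22,52,55,81,148",
    "cell-circuits-per-decile 7233",
]

LIST_KEYWORDS = ("cell-processed-cells", "cell-queued-cells", "cell-time-in-queue")


def parse_cell_stats(lines):
    """Return a dict with three 10-element lists and the circuits per decile."""
    stats = {}
    for line in lines:
        keyword, _, rest = line.strip().partition(" ")
        if keyword in LIST_KEYWORDS:
            # float() tolerates stray whitespace around a value
            stats[keyword] = [float(value) for value in rest.split(",")]
        elif keyword == "cell-circuits-per-decile":
            stats[keyword] = int(rest)
    return stats


stats = parse_cell_stats(LINES)
for decile in range(10):
    print(decile + 1,
          stats["cell-processed-cells"][decile],
          stats["cell-queued-cells"][decile],
          stats["cell-time-in-queue"][decile])
```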



Exit-port statistics


Exit relays running Tor version 0.2.1.1-alpha or higher can be configured to report aggregate statistics on exiting connections. These relays report the number of opened streams, written and read bytes by exiting port. Until version 0.2.2.19-alpha, relays reported all ports exceeding a threshold of 0.01 % of all written and read exit bytes. Starting with version 0.2.2.20-alpha, relays only report the top 10 ports in exit-port statistics in order not to exceed the maximum extra-info descriptor length of 50 KB. The document listed in the Byte histories section above contains exit-port statistics in the lines starting with exit-. More details about these statistics can be found in the directory protocol specification.
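
As an illustration, the exit-kibibytes-read line from the example extra-info descriptor can be converted into per-port shares of read exit traffic. The function names are hypothetical, and the sketch assumes the port=value layout with a final other= entry as seen in the example.

```python
LINE = ("exit-kibibytes-read 80=197075167,443=5954607,4000=1660990,"
        "5004=1808563,11000=1893893,19371=130360,31551=7588414,"
        "41500=756287,51413=2994144,56424=1646509,other=288412366")


def parse_exit_ports(line):
    """Return (keyword, {port_or_other: value}) for an exit-* statistics line."""
    keyword, _, rest = line.strip().partition(" ")
    values = {}
    for item in rest.split(","):
        port, _, value = item.partition("=")
        values[port] = int(value)  # keys stay strings because of "other"
    return keyword, values


def shares(values):
    """Return each entry's fraction of the total."""
    total = sum(values.values())
    return {port: value / total for port, value in values.items()}


keyword, kibibytes = parse_exit_ports(LINE)
for port, share in sorted(shares(kibibytes).items(), key=lambda kv: -kv[1]):
    print("%-6s %5.1f%%" % (port, 100.0 * share))
```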



Bidirectional connection use


Relays running Tor version 0.2.3.1-alpha or higher can be configured to report what fraction of connections is used uni- or bi-directionally. Every 10 seconds, relays determine for every connection whether they read and wrote less than a threshold of 20 KiB. Connections below this threshold are labeled as "Below threshold." For the remaining connections, relays check whether they read at least 10 times as many bytes as they wrote, or wrote at least 10 times as many bytes as they read. If so, they classify the connection as "Mostly reading" or "Mostly writing," respectively. All other connections are classified as "Both reading and writing." After classifying connections, read and write counters are reset for the next 10-second interval. Statistics are aggregated over 24 hours. The document below shows the bidirectional connection use statistics in an extra-info descriptor by relay zweifaltigkeit. The four numbers denote the number of connections "Below threshold," "Mostly reading," "Mostly writing," and "Both reading and writing." More details about these statistics can be found in the directory protocol specification.

Bidirectional connection use statistic in extra-info descriptor by relay zweifaltigkeit:

conn-bi-direct 2010-12-28 15:55:11 (86400 s) 387465,45285,55361,81786
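
The classification rule and the resulting line can be sketched as follows. This is an illustration, not Tor's implementation: it assumes that the 20 KiB threshold is compared against the sum of read and written bytes in the 10-second interval, and the function names are hypothetical.

```python
THRESHOLD_BYTES = 20 * 1024  # 20 KiB per 10-second interval
FACTOR = 10                  # "at least 10 times as many bytes"


def classify(read_bytes, written_bytes):
    """Classify one connection for one 10-second interval."""
    # Assumption: the threshold applies to read and written bytes combined.
    if read_bytes + written_bytes < THRESHOLD_BYTES:
        return "below threshold"
    if read_bytes >= FACTOR * written_bytes:
        return "mostly reading"
    if written_bytes >= FACTOR * read_bytes:
        return "mostly writing"
    return "both reading and writing"


def parse_conn_bi_direct(line):
    """Return the four counters of a conn-bi-direct line as a dict."""
    labels = ("below threshold", "mostly reading",
              "mostly writing", "both reading and writing")
    counters = [int(value) for value in line.split()[-1].split(",")]
    return dict(zip(labels, counters))


LINE = "conn-bi-direct 2010-12-28 15:55:11 (86400 s) 387465,45285,55361,81786"
counts = parse_conn_bi_direct(LINE)
total = sum(counts.values())
for label, count in counts.items():
    print("%-26s %5.1f%%" % (label, 100.0 * count / total))
```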



Torperf output files


Torperf is a little tool that measures Tor's performance as users experience it. Torperf uses a trivial SOCKS client to download files of various sizes over the Tor network and notes how long substeps take. Torperf can be downloaded from the metrics website. A Torperf results file contains a single line per Torperf run with key=value pairs. Such a result line is sufficient to learn about 1) the Tor and Torperf configuration, 2) measurement results, and 3) additional information that might help explain the results. Known keys are explained below.

Torperf .tpf output lines for a single request to download a 50 KiB file (reformatted):

BUILDTIMES=1.16901898384,1.86555600166,2.13295292854
CIRC_ID=9878
CONNECT=1338357901.42
DATACOMPLETE=1338357902.91
DATAPERC10=1338357902.48
DATAPERC20=1338357902.48
DATAPERC30=1338357902.61
DATAPERC40=1338357902.64
DATAPERC50=1338357902.65
DATAPERC60=1338357902.74
DATAPERC70=1338357902.74
DATAPERC80=1338357902.75
DATAPERC90=1338357902.79
DATAREQUEST=1338357901.83
DATARESPONSE=1338357902.25
DIDTIMEOUT=0
FILESIZE=51200
LAUNCH=1338357661.74
NEGOTIATE=1338357901.42
PATH=$980D326017CEF4CBBF4089FBABE767DC83D059AF,$03545609092A24C71CCAD2F4523F5CCC6714F159,$CAC3CF7154AE9C656C4096DC38B4EFA145905654
QUANTILE=0.800000
READBYTES=51442
REQUEST=1338357901.42
RESPONSE=1338357901.83
SOCKET=1338357901.42
SOURCE=torperf
START=1338357901.42
TIMEOUT=5049
USED_AT=1338357902.91
USED_BY=18869
WRITEBYTES=75
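
A result line of this kind can be split into a dictionary with a few lines of Python. The sketch assumes that the key=value pairs of one run are separated by whitespace on a single line; the function name is hypothetical, and the sample line uses only a subset of the keys shown above. Decimal is used so that the subtraction of two large timestamps is exact.

```python
from decimal import Decimal


def parse_tpf_line(line):
    """Split a line of whitespace-separated key=value pairs into a dict."""
    return dict(item.split("=", 1) for item in line.split())


LINE = ("SOURCE=torperf FILESIZE=51200 START=1338357901.42 "
        "DATAREQUEST=1338357901.83 DATARESPONSE=1338357902.25 "
        "DATACOMPLETE=1338357902.91 READBYTES=51442 WRITEBYTES=75 DIDTIMEOUT=0")

result = parse_tpf_line(LINE)
start = Decimal(result["START"])
time_to_first_byte = Decimal(result["DATARESPONSE"]) - start
time_to_complete = Decimal(result["DATACOMPLETE"]) - start
print("first byte after", time_to_first_byte, "s")
print("complete after", time_to_complete, "s")
```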


Torperf can produce two output files: .data and .extradata. The .data file contains timestamps for request substeps and the byte summaries for downloading a test file via Tor. The document below shows an example output of a Torperf run. The timestamps are seconds and microseconds since 1970-01-01 00:00:00.000000. Torperf can be configured to write .extradata files by attaching a Tor controller and writing certain controller events to disk. The format of a .extradata line is similar to the combined format as specified above, except that it can only contain "Additional information" keywords.

Torperf .data and .extradata output lines for a single request to download a 50 KiB file (reformatted and annotated with comments):

# Timestamps and byte summaries contained in .data files:
1338357901 422336 # Connection process started
1338357901 422346 # After socket is created
1338357901 422521 # After socket is connected
1338357901 422604 # After authentication methods are negotiated (SOCKS 5 only)
1338357901 423550 # After SOCKS request is sent
1338357901 839639 # After SOCKS response is received
1338357901 839849 # After HTTP request is written
1338357902 258157 # After first response is received
1338357902 914263 # After payload is complete
75 # Written bytes
51442 # Read bytes
0 # Timeout (optional field)
1338357902 481591 # After 10% of expected bytes are read (optional field)
1338357902 482719 # After 20% of expected bytes are read (optional field)
1338357902 613169 # After 30% of expected bytes are read (optional field)
1338357902 647108 # After 40% of expected bytes are read (optional field)
1338357902 651764 # After 50% of expected bytes are read (optional field)
1338357902 743705 # After 60% of expected bytes are read (optional field)
1338357902 743876 # After 70% of expected bytes are read (optional field)
1338357902 757475 # After 80% of expected bytes are read (optional field)
1338357902 795100 # After 90% of expected bytes are read (optional field)

# Path information contained in .extradata files:
CIRC_ID=9878
LAUNCH=1338357661.74
PATH=$980D326017CEF4CBBF4089FBABE767DC83D059AF,$03545609092A24C71CCAD2F4523F5CCC6714F159,$CAC3CF7154AE9C656C4096DC38B4EFA145905654
BUILDTIMES=1.16901898384,1.86555600166,2.13295292854
USED_AT=1338357902.91
USED_BY=18869
TIMEOUT=5049
QUANTILE=0.800000
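
The .data fields can be combined into microsecond timestamps with a sketch like the one below. It assumes that all fields of a run are whitespace-separated on one line in the order shown above. The field names are hypothetical labels chosen for this example, and the optional percentile fields are left out.

```python
# Hypothetical labels for the nine timestamp pairs, in the order listed above.
TIMESTAMP_FIELDS = ["start", "socket", "connect", "negotiate", "request",
                    "response", "datarequest", "dataresponse", "datacomplete"]


def parse_data_line(line):
    """Return timestamps in microseconds since the epoch plus byte counts."""
    tokens = [int(token) for token in line.split()]
    result = {}
    for index, name in enumerate(TIMESTAMP_FIELDS):
        seconds, microseconds = tokens[2 * index], tokens[2 * index + 1]
        result[name] = seconds * 1000000 + microseconds
    result["writebytes"], result["readbytes"] = tokens[18], tokens[19]
    if len(tokens) > 20:  # the timeout field is optional
        result["didtimeout"] = tokens[20]
    return result


LINE = ("1338357901 422336 1338357901 422346 1338357901 422521 "
        "1338357901 422604 1338357901 423550 1338357901 839639 "
        "1338357901 839849 1338357902 258157 1338357902 914263 "
        "75 51442 0")
run = parse_data_line(LINE)
print("microseconds until payload complete:", run["datacomplete"] - run["start"])
```

Working in integer microseconds avoids the rounding that floating-point timestamps of this magnitude would introduce.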



BridgeDB pool assignment files


BridgeDB is the software that receives bridge network statuses from the bridge authority (these statuses state which bridges are running), assigns these bridges to persistent distribution rings, and hands them out to bridge users. BridgeDB periodically dumps to a local file the list of running bridges, together with the rings, subrings, and file buckets to which they are assigned. Sanitized versions of these lists, which contain SHA-1 hashes of bridge fingerprints instead of the original fingerprints, are available for statistical analysis.

BridgeDB pool assignment file from March 13, 2011:

bridge-pool-assignment 2011-03-13 14:38:03
00b834117566035736fc6bd4ece950eace8e057a unallocated
00e923e7a8d87d28954fee7503e480f3a03ce4ee email port=443 flag=stable
0103bb5b00ad3102b2dbafe9ce709a0a7c1060e4 https ring=2 port=443 flag=stable
[...]

The document above shows a BridgeDB pool assignment file from March 13, 2011. Every such file begins with a line containing the timestamp when BridgeDB wrote this file. Subsequent lines always start with the SHA-1 hash of a bridge fingerprint, followed by ring, subring, and/or file bucket information. There are currently three distributor ring types in BridgeDB:

  1. unallocated: These bridges are not distributed by BridgeDB, but are either reserved for manual distribution or are written to file buckets for distribution via an external tool. If a bridge in the unallocated ring is assigned to a file bucket, this is noted by bucket=$bucketname.
  2. email: These bridges are distributed via an e-mail autoresponder. Bridges can be assigned to subrings by their OR port or relay flag which is defined by port=$port and/or flag=$flag.
  3. https: These bridges are distributed via an HTTPS server. There are multiple https rings to further distribute bridges by IP address ranges, which is denoted by ring=$ring. Bridges in the https ring can also be assigned to subrings by OR port or relay flag which is defined by port=$port and/or flag=$flag.
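
A pool assignment file with this layout could be read as sketched below. The function name and dictionary keys are hypothetical; the sample consists of the lines shown in the example file, without the truncated remainder.

```python
from collections import Counter

SAMPLE = """\
bridge-pool-assignment 2011-03-13 14:38:03
00b834117566035736fc6bd4ece950eace8e057a unallocated
00e923e7a8d87d28954fee7503e480f3a03ce4ee email port=443 flag=stable
0103bb5b00ad3102b2dbafe9ce709a0a7c1060e4 https ring=2 port=443 flag=stable
"""


def parse_pool_assignments(text):
    """Return (timestamp string, list of per-bridge assignment dicts)."""
    lines = text.splitlines()
    header = lines[0].split()
    if header[0] != "bridge-pool-assignment":
        raise ValueError("missing bridge-pool-assignment header")
    written = " ".join(header[1:3])
    assignments = []
    for line in lines[1:]:
        parts = line.split()
        if not parts:
            continue
        assignments.append({
            "fingerprint_hash": parts[0],
            "ring_type": parts[1],
            # e.g. ring=2, port=443, flag=stable, bucket=...
            "params": dict(item.split("=", 1) for item in parts[2:]),
        })
    return written, assignments


written, assignments = parse_pool_assignments(SAMPLE)
print(written, Counter(entry["ring_type"] for entry in assignments))
```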


Tor Check exit lists


TorDNSEL is an implementation of the active testing, DNS-based exit list for Tor exit nodes. Tor Check makes the list of known exits and corresponding exit IP addresses available in a specific format. The document below shows an entry of the exit list written on December 28, 2010 at 15:21:44 UTC. This entry means that the relay with fingerprint 63BA.. which published a descriptor at 07:35:55 and was contained in a version 2 network status from 08:10:11 uses two different IP addresses for exiting. The first address 91.102.152.236 was found in a test performed at 07:10:30. When looking at the corresponding server descriptor, one finds that this is also the IP address on which the relay accepts connections from inside the Tor network. A second test performed at 10:35:30 reveals that the relay also uses IP address 91.102.152.227 for exiting.

Exit list entry written on December 28, 2010 at 15:21:44 UTC:

ExitNode 63BA28370F543D175173E414D5450590D73E22DC
Published 2010-12-28 07:35:55
LastStatus 2010-12-28 08:10:11
ExitAddress 91.102.152.236 2010-12-28 07:10:30
ExitAddress 91.102.152.227 2010-12-28 10:35:30
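
Entries in this format can be grouped per relay with a small sketch. The function name and dictionary keys are hypothetical; the sketch assumes that every entry starts with an ExitNode line and that all timestamps use the format seen in the example.

```python
from datetime import datetime

SAMPLE = """\
ExitNode 63BA28370F543D175173E414D5450590D73E22DC
Published 2010-12-28 07:35:55
LastStatus 2010-12-28 08:10:11
ExitAddress 91.102.152.236 2010-12-28 07:10:30
ExitAddress 91.102.152.227 2010-12-28 10:35:30
"""

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_exit_list(text):
    """Return one dict per ExitNode entry, with all of its exit addresses."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "ExitNode":
            entries.append({"fingerprint": parts[1], "exit_addresses": []})
        elif parts[0] in ("Published", "LastStatus") and entries:
            entries[-1][parts[0]] = datetime.strptime(" ".join(parts[1:3]),
                                                      TIME_FORMAT)
        elif parts[0] == "ExitAddress" and entries:
            tested = datetime.strptime(" ".join(parts[2:4]), TIME_FORMAT)
            entries[-1]["exit_addresses"].append((parts[1], tested))
    return entries


for entry in parse_exit_list(SAMPLE):
    print(entry["fingerprint"], [address for address, _ in entry["exit_addresses"]])
```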

This material is supported in part by the National Science Foundation under Grant No. CNS-0959138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

"Tor" and the "Onion Logo" are registered trademarks of The Tor Project, Inc.

Data on this site is freely available under a CC0 no copyright declaration: To the extent possible under law, the Tor Project has waived all copyright and related or neighboring rights in the data. Graphs are licensed under a Creative Commons Attribution 3.0 United States License.