OntoNotes 4 Statistics

Subcorpus                 Toks  Propositions          Senses                        .onf  .coref .name .parallel .parse .prop .sense .speaker

English                   ~     5433 v, 250 n         2683 v         2194 n
WSJ                       900k  85k (82%) v           38k (93%^†) v  51k (70%^†) n  1728  597    1728  0         1728   1718  1606   0
    nw/wsj (newswire; excludes 584 financial docs/375k tokens with .onf files only)
Broadcast News (TDT-4)    200k  27k (89%) v           28k (93%) v    28k (70%) n    5681  4734   3787  2840      2840   1893  947    0
    bn/{abc,cnn,mnb,nbc,pri,voa}
Broadcast Conversation    200k  30k (95%) v           28k (90%) v    17k (55%) n    177   154    131   108       96     73    50     27
    bc/{cctv,cnn,msnbc,phoenix} (50k EN-ZH, 50k ZH-EN)
English-Chinese Treebank  325k  32k (90%) v           31k (86%) v    48k (70%) n    403   403    403   403       403    403   403    0
    mz/sinorama, nw/xinhua (Xinhua newswire, Sinorama magazine)
P2.5                      145k  14k (87%) v           14k (81%) v    ~              469   0      373   294       469    459   459    202
    {bc,bn,nw,wb}/p2.5_{a2e,c2e} (80k ZH-EN, 65k AR-EN; 35k for each of nw, bn, bc, and wb genres)
Web                       200k  19k (75%) v           19k (73%) v    ~              867   745    641   537       450    328   224    120
    wb/{a2e,c2e,eng} (55k AR-EN, 75k ZH-EN)
Selected Web sentences    85k   2k (56%) v            3k (83%) v     ~              3655  0      0     0         3655   2060  3459   0
    wb/sel

Chinese                   ~     20134 total           763 total
English-Chinese Treebank  254k  40k (90%) v           32k (71%) v    15k (20%) n    403   403    403   403       403    403   401    0
    mz/sinorama, nw/xinhua (100k Xinhua newswire, 154k Sinorama magazine)
Broadcast News (TDT-4)    269k  45k (88%) v           38k (75%) v    12k (16%) n    5071  4249   3104  2463      2788   1967  1146   0
    bn/{cbs,cnr,cts,ctv,vom}
Broadcast Conversation    169k  26k (83%) v           21k (66%) v    4k (13%) n     122   108    94    72        68     54    40     26
    bc/{cctv,cnn,msnbc,phoenix} (GALE; 50k ZH-EN, 55k EN-ZH)
Web                       196k  15k (74%) v           6k (28%) v     ~              140   115    0     161       140    59    0      73
    wb/{cmn,dev_09_cmn,e2c} (40k ZH-EN, 70k EN-ZH, 86k Dev09)
P2.5                      40k   6k (63%) v            3k (33%) v     ~              246   0      0     294       246    186   0      66
    {bc,bn,nw,wb}/p2.5_cmn (nw, bn, bc, and wb genres)

Arabic                    ~     2155 v, 404 n, 623 a  150 v          111 n
An-Nahar                  400k  26k (72%) v           20k (55%) v    22k (17%) n    599   447    446   0         599    598   310    0
    nw/ann (newswire; trees from Penn Arabic Treebank Part 3 v. 3.1)

^† WSJ sense coverage is out of the 300k-token (Year 1) portion that has been annotated for OntoNotes senses.

This table provides a breakdown of the resources available in OntoNotes 4.0. The left portion records counts and coverage statistics drawn primarily from the OntoNotes manual (see the explanations below). The right portion gives counts of files in each subcorpus by filetype, computed with a script as described below. The release directories for each subcorpus are given alongside its row.

Explanations of statistics

Each language row gives counts of proposition frames and sense types. (Some proposition frames do not yet have any corresponding annotations.) Each subcorpus row gives counts of tokens, verb propositions, and verb and noun senses. (Noun proposition annotations are not included in OntoNotes 4. The word sense coverage figures give credit for monosemous words even if they are not explicitly annotated.)

Where the OntoNotes manual was unclear or inconsistent, the .parse files were used to calculate the number of tokens in a subcorpus.
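As a rough illustration of counting tokens from .parse files, here is a minimal sketch. It is not the command originally used for this page. It assumes that the .parse files hold Penn Treebank-style bracketed trees in which every token is a "(POS word)" leaf, and that empty elements tagged -NONE- (traces) should not count as tokens; the function name count_tokens is hypothetical.

```shell
# Count leaf tokens in Penn Treebank-style bracketed .parse files.
# Assumptions (not confirmed by the OntoNotes manual): each token is a
# "(POS word)" leaf, and -NONE- leaves (traces) are excluded from the count.
count_tokens() {
  # Usage: count_tokens FILE...
  cat -- "$@" \
    | grep -oE '\([^() ]+ [^() ]+\)' \
    | grep -vc '^(-NONE- '
}
```

Run from a subcorpus directory, a call such as `count_tokens ??/*.parse` would give one total for that directory. Whether that total matches the manual depends on the tokenization conventions assumed above.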

Computing the file counts

Ontonotes_release_4_0/data/files/data/chinese/annotations/wb/dev_09_cmn$ for e in onf coref name parallel parse prop sense speaker; do N=`ls -1 ??/*.$e 2>/dev/null | wc -l`; echo "$N $e"; done
67 onf
44 coref
0 name
74 parallel
67 parse
0 prop
0 sense
0 speaker
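The same loop can be wrapped to print one row per directory. The sketch below is an illustration rather than the script used for this page: it assumes it is run from a language's annotations directory and that files sit in GENRE/SOURCE/??/ subdirectories, which is the layout suggested by the command above. The function name count_files_by_dir is hypothetical.

```shell
# Print "DIR onf coref name parallel parse prop sense speaker" for each
# GENRE/SOURCE directory below the current directory.
# Assumption: annotation files live in GENRE/SOURCE/??/*.EXT.
count_files_by_dir() {
  for dir in */*/; do
    dir=${dir%/}
    [ -d "$dir" ] || continue
    row=$dir
    for e in onf coref name parallel parse prop sense speaker; do
      # ls fails quietly when no file has this extension, giving a count of 0.
      n=$(ls -1 "$dir"/??/*."$e" 2>/dev/null | wc -l)
      row="$row $((n))"
    done
    echo "$row"
  done
}
```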

Here is the full table of file counts by directory (these are aggregated by subcorpus in the above table):

Directory       .onf  .coref .name .parallel .parse .prop .sense .speaker

English
bc/cctv         6     6      6     6         6      6     6      6
bc/cnn          9     9      9     5         9      9     9      9
bc/msnbc        8     8      8     1         8      8     8      8
bc/p2.5_a2e     37    0      37    0         37     37    37     37
bc/p2.5_c2e     61    0      61    61        61     61    61     61
bc/phoenix      4     4      4     4         4      4     4      4
bn/abc          69    69     69    0         69     69    69     0
bn/cnn          437   437    437   0         437    437   437    0
bn/mnb          25    25     25    0         25     25    25     0
bn/nbc          39    39     39    0         39     39    39     0
bn/p2.5_a2e     38    0      38    0         38     38    38     38
bn/p2.5_c2e     66    0      66    66        66     66    66     66
bn/pri          112   112    112   0         112    111   112    0
bn/voa          265   265    265   0         265    265   265    0
mz/sinorama     78    78     78    78        78     78    78     0
nw/p2.5_a2e     49    0      49    0         49     49    49     0
nw/p2.5_c2e     85    0      0     85        85     85    85     0
nw/wsj          2312  597    1728  0         1728   1718  1606   0
nw/xinhua       325   325    325   325       325    325   325    0
wb/a2e          34    34     34    0         34     34    34     34
wb/c2e          70    52     52    70        70     52    52     68
wb/eng          18    18     18    17        18     18    18     18
wb/p2.5_a2e     51    0      51    0         51     51    51     0
wb/p2.5_c2e     82    0      71    82        82     72    72     0
wb/sel          3655  0      0     0         3655   2060  3459   0

Chinese
bc/cctv         8     8      8     6         8      8     8      8
bc/cnn          5     5      5     5         5      5     5      5
bc/msnbc        1     1      1     1         1      1     1      1
bc/p2.5_cmn     60    0      0     61        60     0     0      5
bc/phoenix      12    12     12    4         12     12    12     12
bn/cbs          181   181    180   0         181    181   181    0
bn/cnr          140   140    136   0         140    140   140    0
bn/cts          303   303    299   0         303    303   303    0
bn/ctv          197   197    26    0         197    197   197    0
bn/p2.5_cmn     61    0      0     66        61     61    0      61
bn/vom          325   324    0     0         325    325   325    0
mz/sinorama     78    78     78    78        78     78    78     0
nw/p2.5_cmn     51    0      0     85        51     51    0      0
nw/xinhua       325   325    325   325       325    325   323    0
wb/cmn          56    54     0     70        56     43    0      56
wb/dev_09_cmn   67    44     0     74        67     0     0      0
wb/e2c          17    17     0     17        17     16    0      17
wb/p2.5_cmn     74    0      0     82        74     74    0      0

Arabic
nw/ann          599   447    446   0         599    598   310    0
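Aggregating these per-directory rows into a subcorpus row is a column-wise sum over the directories that make up the subcorpus. The following is a minimal sketch of that step using awk; the function name sum_counts and the one-row-per-line input format are assumptions made for illustration, not part of the original script.

```shell
# Sum per-directory file counts over the directories named as arguments.
# Input (stdin): lines "DIR onf coref name parallel parse prop sense speaker".
# Output: the eight column totals on one line.
sum_counts() {
  awk -v dirs="$*" '
    BEGIN { n = split(dirs, d, " "); for (i = 1; i <= n; i++) want[d[i]] = 1 }
    ($1 in want) { for (c = 2; c <= 9; c++) total[c] += $c }
    END { for (c = 2; c <= 9; c++) printf "%d%s", total[c], (c < 9 ? " " : "\n") }
  '
}

# Example: the two English directories of the English-Chinese Treebank.
printf '%s\n' \
  'mz/sinorama 78 78 78 78 78 78 78 0' \
  'nw/xinhua 325 325 325 325 325 325 325 0' \
  | sum_counts mz/sinorama nw/xinhua
# prints: 403 403 403 403 403 403 403 0
```

Because English and Chinese both contain directories named mz/sinorama and nw/xinhua, the rows for one language at a time should be fed to the function.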

Nathan Schneider, 2012-08-25. Feel free to contact me with fixes, updates, and additions. Thanks to Martha Palmer, Jinho Choi, and Christian Buck for elucidating some of the nitty-gritty details.