Subcorpus					Toks	Propositions	Senses		<<			##						.onf	.coref	.name	.parallel	.parse	.prop	.sense	.speaker

English						~		5433 v, 250 n	2683 v		2194 n
WSJ							900k	85k (82%) v		38k (93%) v	51k (70%) n	##nw/wsj						1728	597		1728	0			1728	1718	1606	0
| (newswire; excludes 584 financial docs/375k tokens with .onf files only)
Broadcast News (TDT-4)		200k	27k (89%) v		28k (93%) v	28k (70%) n	##bn/{abc,cnn,mnb,nbc,pri,voa}	947	947	947	0	947	946	947	0
Broadcast Conversation		200k	30k (95%) v		28k (90%) v	17k (55%) n	##bc/{cctv,cnn,msnbc,phoenix}	27	27	27	16	27	27	27	27
| (50k EN-ZH, 50k ZH-EN)
English-Chinese Treebank 	325k	32k (90%) v		31k (86%) v	48k (70%) n	##mz/sinorama, nw/xinhua		403	403	403	403	403	403	403	0
| (Xinhua newswire, Sinorama magazine)
P2.5 						145k	14k (87%) v		14k (81%) v	~			##{bc,bn,nw,wb}/p2.5_{a2e,c2e}	469	0	373	294	469	459	459	202
| (80k ZH-EN, 65k AR-EN; 35k for each of nw, bn, bc, and wb genres)
Web 						200k	19k (75%) v		19k (73%) v	~			##wb/{a2e,c2e,eng}				122	104	104	87	122	104	104	120
| (55k AR-EN, 75k ZH-EN)
Selected Web sentences		 85k	 2k (56%) v		 3k (83%) v	~			##wb/sel						3655	0		0		0			3655	2060	3459	0

Chinese						~		20134 total		763 total	<<
English-Chinese Treebank	254k	40k (90%) v		32k (71%) v	15k (20%) n	##mz/sinorama, nw/xinhua		403	403	403	403	403	403	401	0
| (100k Xinhua newswire, 154k Sinorama magazine)
Broadcast News (TDT-4)		269k	45k (88%) v		38k (75%) v	12k (16%) n	##bn/{cbs,cnr,cts,ctv,vom}		1146	1145	641	0	1146	1146	1146	0
Broadcast Conversation		169k	26k (83%) v		21k (66%) v	4k (13%) n	##bc/{cctv,cnn,msnbc,phoenix}	26	26	26	16	26	26	26	26
| (GALE; 50k ZH-EN, 55k EN-ZH)
Web						 	196k	15k (74%) v		 6k (28%) v	~			##wb/{cmn,dev_09_cmn,e2c}		140	115	0	161	140	59	0	73
| (40k ZH-EN, 70k EN-ZH, 86k Dev09)
P2.5						 40k	 6k (63%) v		 3k (33%) v	~			##{bc,bn,nw,wb}/p2.5_cmn		246	0	0	294	246	186	0	66
| (nw, bn, bc, and wb genres)

Arabic						~		2155 v, 404 n, 623 a	150 v		111 n
An-Nahar					400k	26k (72%) v		20k (55%) v	22k (17%) n	##nw/ann						599	447	446	0	599	598	310	0
| (newswire; trees from Penn Arabic Treebank Part 3 v. 3.1)

WSJ sense coverage is out of the 300k-token (Year 1) portion that has been annotated for OntoNotes senses.

This table breaks down the resources available in OntoNotes 4.0. The left portion gives counts and coverage statistics drawn primarily from the OntoNotes manual. The right portion gives the number of files of each type in each subcorpus, computed with a script as described below. The ## column lists the directories in the release that make up each subcorpus.

Explanations of statistics

For each language, the table gives counts of proposition frames and sense types. (Some proposition frames do not yet have any corresponding annotations.) For each subcorpus, it gives counts of tokens, verb propositions, and verb and noun senses. (Noun proposition annotations are not included in OntoNotes 4. The word sense coverage figures give credit for monosemous words even if they are not explicitly annotated.)

Where the OntoNotes manual was unclear or inconsistent, the .parse files were used to calculate the number of tokens in a subcorpus, e.g.:

Ontonotes_release_4_0/data/files/data/arabic/annotations/nw/ann$ W=`cat ??/*.parse | wc -l`; X=`grep 'TOP' ??/*.parse | wc -l`; Y=`grep '\-NONE\-' ??/*.parse | wc -l`; echo "$W-$X-$Y" | bc
402246
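
The one-liner above can be wrapped in a small function so that it runs from any working directory. This is a minimal sketch, not the script used for the table. The name count_tokens is hypothetical, and it keeps the one-liner's assumption that, once lines containing TOP and lines containing -NONE- are discarded, each remaining line of a .parse file corresponds to one token.

```shell
# count_tokens DIR: estimate the number of tokens in one annotation directory.
# (Hypothetical helper name.) Assumes the layout used above: two-character
# subdirectories (??) that contain the .parse files, and
#   tokens = total lines - lines containing TOP - lines containing -NONE-
count_tokens() {
    dir=$1
    # Total number of lines across all .parse files.
    total=$(cat "$dir"/??/*.parse | wc -l)
    # Lines that open a tree.
    top=$(cat "$dir"/??/*.parse | grep -c 'TOP')
    # Lines that hold an empty element (trace); -e keeps the leading
    # dash from being read as an option.
    none=$(cat "$dir"/??/*.parse | grep -c -e '-NONE-')
    echo $((total - top - none))
}

# Usage, from the data directory of the release:
#   count_tokens arabic/annotations/nw/ann
```

Run on the An-Nahar directory, this should repeat the calculation shown above, provided the directory layout matches.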

Computing the file counts

This was done with bash commands like:

Ontonotes_release_4_0/data/files/data/chinese/annotations/wb/dev_09_cmn$ for e in onf coref name parallel parse prop sense speaker; do N=`ls -1 ??/*.$e 2>/dev/null | wc -l`; echo "$N $e"; done
67 onf
44 coref
0 name
74 parallel
67 parse
0 prop
0 sense
0 speaker
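
A subcorpus usually spans several directories, so the same count can be summed over a list of them. The following is a sketch under the same layout assumption (files sit in two-character subdirectories). The function name count_filetypes is hypothetical. It tests each glob match for existence instead of calling ls, so an extension with no files simply counts as 0.

```shell
# count_filetypes DIR...: per-extension file counts summed over one or more
# annotation directories. (Hypothetical helper name.)
count_filetypes() {
    for ext in onf coref name parallel parse prop sense speaker; do
        n=0
        for dir in "$@"; do
            for f in "$dir"/??/*."$ext"; do
                # An unmatched glob stays as a literal pattern, which does
                # not exist as a file, so it is not counted.
                if [ -e "$f" ]; then
                    n=$((n + 1))
                fi
            done
        done
        echo "$n $ext"
    done
}

# Usage, from chinese/annotations in the release:
#   count_filetypes wb/cmn wb/dev_09_cmn wb/e2c
```

The usage line covers the three directories listed for the Chinese Web subcorpus, so its output would be expected to correspond to that row of the table.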

Here is the full table of file counts by directory (these are aggregated by subcorpus in the above table):


~			.onf	.coref	.name	.parallel	.parse	.prop	.sense	.speaker

English
bc/cctv		6		6		6		6			6		6		6		6
bc/cnn		9		9		9		5			9		9		9		9
bc/msnbc	8		8		8		1			8		8		8		8
bc/p2.5_a2e	37		0		37		0			37		37		37		37
bc/p2.5_c2e	61		0		61		61			61		61		61		61
bc/phoenix	4		4		4		4			4		4		4		4
bn/abc		69		69		69		0			69		69		69		0
bn/cnn		437		437		437		0			437		437		437		0
bn/mnb		25		25		25		0			25		25		25		0
bn/nbc		39		39		39		0			39		39		39		0
bn/p2.5_a2e	38		0		38		0			38		38		38		38
bn/p2.5_c2e	66		0		66		66			66		66		66		66
bn/pri		112		112		112		0			112		111		112		0
bn/voa		265		265		265		0			265		265		265		0
mz/sinorama	78		78		78		78			78		78		78		0
nw/p2.5_a2e	49		0		49		0			49		49		49		0
nw/p2.5_c2e	85		0		0		85			85		85		85		0
nw/wsj		2312	597		1728	0			1728	1718	1606	0
nw/xinhua	325		325		325		325			325		325		325		0
wb/a2e		34		34		34		0			34		34		34		34
wb/c2e		70		52		52		70			70		52		52		68
wb/eng		18		18		18		17			18		18		18		18
wb/p2.5_a2e	51		0		51		0			51		51		51		0
wb/p2.5_c2e	82		0		71		82			82		72		72		0
wb/sel		3655	0		0		0			3655	2060	3459	0

Chinese
bc/cctv		8		8		8		6			8		8		8		8
bc/cnn		5		5		5		5			5		5		5		5
bc/msnbc	1		1		1		1			1		1		1		1
bc/p2.5_cmn	60		0		0		61			60		0		0		5
bc/phoenix	12		12		12		4			12		12		12		12
bn/cbs		181		181		180		0			181		181		181		0
bn/cnr		140		140		136		0			140		140		140		0
bn/cts		303		303		299		0			303		303		303		0
bn/ctv		197		197		26		0			197		197		197		0
bn/p2.5_cmn	61		0		0		66			61		61		0		61
bn/vom		325		324		0		0			325		325		325		0
mz/sinorama	78		78		78		78			78		78		78		0
nw/p2.5_cmn	51		0		0		85			51		51		0		0
nw/xinhua	325		325		325		325			325		325		323		0
wb/cmn		56		54		0		70			56		43		0		56
wb/dev_09_cmn	67	44		0		74			67		0		0		0
wb/e2c		17		17		0		17			17		16		0		17
wb/p2.5_cmn	74		0		0		82			74		74		0		0

Arabic
nw/ann		599		447		446		0			599		598		310		0
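
The subcorpus rows of the first table are sums over the per-directory rows above. As a rough sketch of that aggregation, suppose the per-directory table has been saved to a whitespace-separated text file with the language names on lines of their own, as shown here. The function name sum_rows and the file name in the usage line are hypothetical.

```shell
# sum_rows LANGUAGE REGEX FILE: sum the eight count columns over the rows of
# one language section whose directory name matches REGEX.
# (Hypothetical helper name.)
sum_rows() {
    awk -v want="$1" -v re="$2" '
        # A line with a single field is a section header (English, Chinese, Arabic).
        NF == 1 { lang = $1; next }
        # Data rows have a directory name followed by eight counts.
        lang == want && NF == 9 && $1 ~ re {
            for (i = 2; i <= 9; i++) s[i] += $i
            hit = 1
        }
        END {
            if (hit) {
                for (i = 2; i <= 9; i++) printf "%s%d", (i > 2 ? "\t" : ""), s[i]
                print ""
            }
        }
    ' "$3"
}

# Usage:
#   sum_rows English '^bn/(abc|cnn|mnb|nbc|pri|voa)$' counts.txt
```

The language argument is needed because some directory names, such as bc/cctv, appear under more than one language.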

Nathan Schneider, 2012-08-25. Feel free to contact me with fixes, updates, and additions. Thanks to Martha Palmer, Jinho Choi, and Christian Buck for elucidating some of the nitty-gritty details.