Corpora from the Web
COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora created by Felix Bildhauer and Roland Schäfer of the German Grammar Group at Freie Universität Berlin. Roland Schäfer’s work on the COW corpora is currently supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) through the project "Linguistic web characterization and web corpus creation" (SCHA1916/1-1).
We have corpora for major European languages (Dutch, English, French, German, Spanish, Swedish). The third-generation corpora, COW14, are all larger than their predecessors; some contain 10 billion tokens or, in the case of DECOW14A, even 20 billion. Our focus is on corpus quality at every stage (data collection, post-processing, and linguistic annotation), not merely on larger corpus sizes.
Project participants: Dr. Roland Schäfer, Felix Bildhauer
Year: 2015, ongoing
Subject: German Philology