Creator

Mark Davies

Abstract

The complete Corpus of Contemporary American English in linear text format. The linguistic text data originates from spoken word, fiction, magazines, newspapers, academic writing, movie and television subtitles, blogs, and web pages.

This format provides a textID for each text, and then the entire text on the same line. In this format, words are not annotated for part of speech or lemma. In addition, contracted words like can't are separated into two parts (ca n't) and punctuation is separated from words (eye level . As her).
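
As an illustration of the layout described above, the following Python sketch splits one line into its textID and its tokens, and optionally rejoins split contractions. It assumes the textID is the first whitespace-delimited field on the line, which this record does not state explicitly; the function names and the sample line are hypothetical.

```python
def parse_linear_line(line):
    """Split one line into (text_id, tokens).

    Assumption: the textID is the first whitespace-delimited field
    and the rest of the line is the whitespace-tokenized text.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return "", []
    text_id = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    return text_id, body.split()


def rejoin_contractions(tokens):
    """Reattach a separated "n't" to the preceding token (ca n't -> can't)."""
    out = []
    for tok in tokens:
        if tok == "n't" and out:
            out[-1] += tok
        else:
            out.append(tok)
    return out


# Hypothetical sample line; the textID value is made up for illustration.
sample = "1234 I ca n't see at eye level . As her"
text_id, tokens = parse_linear_line(sample)
rejoined = rejoin_contractions(tokens)
```

Only the "n't" case named in the description is handled here; other separated forms, if present in the data, would need their own rules.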

This TAR file includes 8 zipped folders, each containing between 30 and 34 .txt files of data.
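
One way to read the data without unpacking everything to disk is sketched below in Python, using only the standard library. It assumes the zipped folders are stored as .zip members inside the TAR file and that the .txt files can be decoded as UTF-8; neither detail is confirmed by this record, and the function name is hypothetical.

```python
import io
import tarfile
import zipfile


def iter_corpus_lines(tar_path):
    """Yield (zip_name, txt_name, line) for every line of every .txt file.

    Assumptions: each zipped folder is a .zip member of the TAR file,
    and the text files are UTF-8 (undecodable bytes are replaced).
    """
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".zip")):
                continue
            # Each inner archive is read fully into memory before opening.
            zip_bytes = tar.extractfile(member).read()
            with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
                for name in zf.namelist():
                    if not name.endswith(".txt"):
                        continue
                    with zf.open(name) as fh:
                        reader = io.TextIOWrapper(
                            fh, encoding="utf-8", errors="replace"
                        )
                        for raw in reader:
                            yield member.name, name, raw.rstrip("\r\n")
```

Because each inner archive is held in memory while it is read, extracting the TAR file to disk first may be the better choice on machines with limited RAM.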

File Format

.tar

File Size (MB)

1999.1

Creation Date

2-22-2020

Deposit Date

7-11-2024

Comments

This file may take a long time to download due to its size (1.95 GB).

License Restrictions

Corpus data is subject to access and use restrictions, including:

  • Data cannot be distributed outside Gonzaga
  • Access limited to restricted login or password
  • Data cannot be used to create software or products for sale or consumption
  • Data is for research, and substantial portions (50,000 words or more) cannot be made available to undergraduates
  • Any publications or products based on the data should reference the source of the data (see Citation Information)
See the full limitations at Restrictions on use of the corpora.
