Mark Davies


Corpus del Español word, lemma, and part of speech data format.

The zip folder contains 20 .txt files of linguistic data from the United States, split into two categories: General (g) and Blogs (b).

Texts are separated by a line with ## and the textID.
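
As a rough illustration of the separator convention described above, the following Python sketch groups the lines of one file by text. It rests on several assumptions that this record does not confirm: that the separator line begins with `##` and the rest of that line is the textID, that each remaining line holds one token, and that the word, lemma, and part-of-speech columns are tab-separated. The function name `split_texts`, the sample textIDs, and the sample tags are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Token = Tuple[str, ...]  # assumed column order: (word, lemma, part of speech)


def split_texts(lines: Iterable[str]) -> Dict[str, List[Token]]:
    """Group token lines under the textID of the preceding '##' line."""
    texts: Dict[str, List[Token]] = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            # Assumption: everything after '##' on this line is the textID.
            current_id = line[2:].strip()
            texts[current_id] = []
        elif line.strip() and current_id is not None:
            # Assumption: columns are tab-separated.
            texts[current_id].append(tuple(line.split("\t")))
    return texts


# Invented sample lines, not taken from the corpus.
sample = [
    "##1001\n",
    "La\tel\tDET\n",
    "casa\tcasa\tNOUN\n",
    "##1002\n",
    "Hola\thola\tINTJ\n",
]
texts = split_texts(sample)
```

For a real file, the same function could be called on an open file handle, for example `split_texts(open(path, encoding="utf-8"))`. The right encoding and column layout should be checked against the actual data first.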

Note from Corpus del Español:

While the categorization by country is very good overall, there is one exception: the texts from the United States. The problem is that when Google didn't know what country a text (or domain) was from, they then categorized it as the United States (as kind of a "default"). So most of the texts (and domains) that are supposedly from the United States are probably from another country.

License Restrictions

Corpora data is subject to access and use restrictions, including:

  • Data cannot be distributed outside Gonzaga
  • Access limited to restricted login or password
  • Data cannot be used to create software or products for sale or consumption
  • Data is for research use, and substantial portions (50,000 words or more) cannot be made available to undergraduates
  • Any publications or products based on the data should reference the source of the data (see Citation Information)
See the full limitations at Restrictions on use of the corpora.