
Mark Davies


The complete Corpus del Español in word, lemma, and part-of-speech format, with linguistic data from 21 Spanish-speaking countries. The data is provided in vertical format, which makes it possible to import into a database. Within each file, texts are separated by a line containing ## and the textID.
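As a rough illustration of how the vertical format could be read before a database import, the following Python sketch groups tokens by textID. It rests on assumptions not stated in this record: that columns are tab-separated in the order word, lemma, part of speech, and that a separator line begins with ## immediately followed by the textID. The function name `parse_vertical` and the sample tags are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (word, lemma, part of speech); the column order is an assumption.
Token = Tuple[str, str, str]


def parse_vertical(lines: Iterable[str]) -> Dict[str, List[Token]]:
    """Group tokens by textID from vertical-format lines.

    Assumes tab-separated columns and separator lines of the form
    '##<textID>'. Lines that do not fit these assumptions are skipped.
    """
    texts: Dict[str, List[Token]] = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            # A separator line opens a new text.
            current_id = line[2:].strip()
            texts.setdefault(current_id, [])
            continue
        if current_id is None:
            continue  # token seen before any separator line; skip it
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a complete word/lemma/POS row
        word, lemma, pos = fields[:3]
        texts[current_id].append((word, lemma, pos))
    return texts


if __name__ == "__main__":
    # Made-up sample; the textIDs and POS tags here are illustrative only.
    sample = [
        "##1001\n",
        "Los\tel\tART\n",
        "gatos\tgato\tN\n",
        "##1002\n",
        "Hola\thola\tI\n",
    ]
    for text_id, tokens in parse_vertical(sample).items():
        print(text_id, tokens)
```

If the actual files use a different delimiter or carry extra columns, only the `split` call and the field selection would need to change.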

This TAR file includes 21 zip files, each containing the word, lemma, and part-of-speech data for one country in .txt files.
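Given the size of the archive, it may be convenient to read the .txt files without unpacking everything to disk. The sketch below walks a TAR file, opens each zip member in place, and yields the lines of every .txt file inside it, using only the Python standard library. The function name `iter_corpus_lines`, the archive path, and the UTF-8 default encoding are assumptions; the real files may use a different encoding.

```python
import io
import tarfile
import zipfile
from typing import Iterator, Tuple


def iter_corpus_lines(
    tar_path: str, encoding: str = "utf-8"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (zip_name, txt_name, line) for every .txt line in the archive.

    Assumes an uncompressed TAR on a seekable file, so each zip member
    can be opened in place. The encoding is an assumption; undecodable
    bytes are replaced rather than raising an error.
    """
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile() or not member.name.lower().endswith(".zip"):
                continue
            zip_stream = tar.extractfile(member)
            if zip_stream is None:
                continue
            with zipfile.ZipFile(zip_stream) as zf:
                for info in zf.infolist():
                    if not info.filename.lower().endswith(".txt"):
                        continue
                    with zf.open(info) as raw:
                        text = io.TextIOWrapper(
                            raw, encoding=encoding, errors="replace"
                        )
                        for line in text:
                            yield member.name, info.filename, line.rstrip("\r\n")


if __name__ == "__main__":
    import sys

    # Usage: python read_corpus.py path/to/archive.tar
    # Prints the first few lines as a quick sanity check.
    for i, row in enumerate(iter_corpus_lines(sys.argv[1])):
        print(row)
        if i >= 9:
            break
```

Streaming this way trades speed for disk space; extracting the zip files once and reading them directly would likely be faster for repeated passes.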

File Format


File Size (MB)


Creation Date


Deposit Date



Due to the large size of this file (43.2 GB), it may take a long time to download.

License Restrictions

Corpus data is subject to access and use restrictions, including:

  • Data cannot be distributed outside Gonzaga
  • Access is limited by restricted login or password
  • Data cannot be used to create software or products for sale or consumption
  • Data is for research, and substantial portions (50,000 words or more) cannot be made available to undergraduates
  • Any publications or products based on the data should reference the source of the data (see Citation Information)
See the full limitations at Restrictions on use of the corpora.
