Mark Davies


Corpus del Español database format. This format allows for the most robust searches and supports powerful JOINs across the corpus, lexicon, and source tables, but it requires knowledge of SQL. The zip folder contains 20 .txt files of linguistic data from the United States, split into two categories: General (g) and Blogs (b). See Full-text corpus data for more information on how to use the database format.
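As a rough illustration of the kind of JOIN the database format makes possible, the sketch below builds a tiny in-memory SQLite database and counts occurrences of a lemma per category. The table and column names (corpus, lexicon, sources, textID, wordID, lemma, genre) and the toy rows are assumptions made for this example; they are not taken from the distributed files, so check the actual file layout in Full-text corpus data before adapting it.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical three-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sources (textID INTEGER PRIMARY KEY, genre TEXT, country TEXT);
        CREATE TABLE lexicon (wordID INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
        CREATE TABLE corpus  (textID INTEGER, ID INTEGER, wordID INTEGER);
        """
    )
    # Toy rows, invented for illustration only: one General (g) text, one Blogs (b) text.
    conn.executemany(
        "INSERT INTO sources VALUES (?, ?, ?)",
        [(1, "g", "US"), (2, "b", "US")],
    )
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?, ?, ?)",
        [(10, "casa", "casa", "n"), (11, "casas", "casa", "n"), (12, "come", "comer", "v")],
    )
    # One row per token: which text it is in, its running position, and its word form.
    conn.executemany(
        "INSERT INTO corpus VALUES (?, ?, ?)",
        [(1, 1, 10), (1, 2, 12), (1, 3, 11), (2, 4, 10), (2, 5, 12)],
    )
    return conn


def lemma_counts_by_genre(conn, lemma):
    """Count tokens of a lemma per category by joining corpus, lexicon, and sources."""
    rows = conn.execute(
        """
        SELECT s.genre, COUNT(*)
        FROM corpus AS c
        JOIN lexicon AS l ON l.wordID = c.wordID
        JOIN sources AS s ON s.textID = c.textID
        WHERE l.lemma = ?
        GROUP BY s.genre
        ORDER BY s.genre
        """,
        (lemma,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_toy_db()
    print(lemma_counts_by_genre(conn, "casa"))
```

The token table stays narrow (one integer row per token), and word forms and text metadata are looked up only when a query needs them, which is what makes lemma-level or category-level counts a single query rather than a scan over raw text.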

File Format


File Size (MB)


Creation Date


Deposit Date

June 2024


Note from Corpus del Español: While the categorization by country is very good overall, there is one exception: the texts from the United States. The problem is that when Google didn't know what country a text (or domain) was from, they then categorized it as the United States (as kind of a "default"). So most of the texts (and domains) that are supposedly from the United States are probably from another country.

License Restrictions

Corpora data is subject to access and use restrictions, including:

  • Data cannot be distributed outside Gonzaga
  • Access limited to restricted login or password
  • Data cannot be used to create software or products for sale or consumption
  • Data is for research; substantial portions (50,000 words or more) cannot be made available to undergraduates
  • Any publications or products based on the data should reference the source of the data (see Citation Information)
See the full limitations at Restrictions on use of the corpora.