Abstract
Corpus del Español data in word, lemma, and part-of-speech format.
The zip folder contains 20 .txt files of linguistic data from the United States, split into two categories: General (g) and Blogs (b).
Texts are separated by a line containing ## followed by the textID.
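The layout described above can be read with a short script. The following is a minimal sketch in Python, not an official loader. It assumes that each token line holds three tab-separated columns (word, lemma, part of speech) and that a text header is a line beginning with ## followed directly by the textID. The function name split_texts, the sample IDs, and the sample tags are all made up for illustration; check a real file before relying on the column order.

```python
from typing import Dict, Iterable, List, Tuple

Token = Tuple[str, str, str]  # (word, lemma, pos); column order is an assumption


def split_texts(lines: Iterable[str]) -> Dict[str, List[Token]]:
    """Group token lines under the textID of the preceding ## header line."""
    texts: Dict[str, List[Token]] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            # Header line: everything after ## is treated as the textID.
            current = line[2:].strip()
            texts[current] = []
            continue
        if current is None or not line.strip():
            # Skip blank lines and anything that appears before the first header.
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            # Skip lines that do not match the assumed three-column layout.
            continue
        texts[current].append((parts[0], parts[1], parts[2]))
    return texts


# Invented sample data; the IDs and tags do not come from the corpus.
sample = [
    "##12345\n",
    "Los\tel\tl\n",
    "gatos\tgato\tn\n",
    "##67890\n",
    "comen\tcomer\tv\n",
]
parsed = split_texts(sample)
```

To process one of the .txt files, pass an open file handle instead of the sample list, for example split_texts(open(path, encoding="utf-8")). The encoding is also an assumption and may need adjusting.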
File Format
.zip
File Size (MB)
121
Creation Date
11-17-2016
Deposit Date
6-18-2024
Recommended Citation
Davies, Mark. (2016-) Corpus del Español: Web/Dialects. Available online at http://www.corpusdelespanol.org/web-dial/.
License Restrictions
Corpus data are subject to access and use restrictions, including:
- Data cannot be distributed outside Gonzaga
- Access limited to restricted login or password
- Data cannot be used to create software or products for sale or consumption
- Data is for research use; substantial portions (50,000 words or more) cannot be made available to undergraduates
- Any publications or products based on the data should reference the source of the data (see Recommended Citation)
Comments
Note from Corpus del Español:
While the categorization by country is very good overall, there is one exception: the texts from the United States. The problem is that when Google didn't know what country a text (or domain) was from, they then categorized it as the United States (as kind of a "default"). So most of the texts (and domains) that are supposedly from the United States are probably from another country.