Abstract

Corpus del Español data in word, lemma, and part-of-speech format.

The zip archive contains 20 .txt files of linguistic data from the United States, split into two categories: General (g) and Blogs (b).

Texts are separated by a line with ## and the textID.
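
The separator convention above suggests a simple way to split a file back into individual texts. The following is a minimal sketch, not an official reader. It assumes that a separator line starts with ## and that the textID is the first token after the marker. It also assumes that the remaining lines are tab-separated rows. The record does not spell out the column layout, so the function name split_texts and the sample rows are illustrative only.

```python
from typing import Dict, Iterable, List


def split_texts(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Group rows by textID, treating lines that start with '##' as boundaries."""
    texts: Dict[str, List[List[str]]] = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            # Assumption: the textID is the first token after the '##' marker.
            rest = line[2:].split()
            current_id = rest[0] if rest else ""
            texts.setdefault(current_id, [])
        elif current_id is not None:
            # Assumption: rows are tab-separated; the fields are kept as-is.
            texts[current_id].append(line.split("\t"))
    return texts


# Invented sample rows, for illustration only (not taken from the corpus).
sample = [
    "##101\n",
    "los\tel\tTAG1\n",
    "textos\ttexto\tTAG2\n",
    "## 202\n",
    "hola\thola\tTAG3\n",
]
grouped = split_texts(sample)
```

To read one of the .txt files, pass an open file handle to split_texts. The file names and text encoding are not stated in this record, so check them against the downloaded archive first.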

File Format

.zip

File Size (MB)

121

Creation Date

11-17-2016

Deposit Date

6-18-2024

Comments

Note from Corpus del Español:

While the categorization by country is very good overall, there is one exception: the texts from the United States. The problem is that when Google didn't know what country a text (or domain) was from, they then categorized it as the United States (as kind of a "default"). So most of the texts (and domains) that are supposedly from the United States are probably from another country.

Recommended Citation

Davies, Mark. (2016-) Corpus del Español: Web/Dialects. Available online at http://www.corpusdelespanol.org/web-dial/.

License Restrictions

Corpora data is subject to access and use restrictions, including:

Data cannot be distributed outside Gonzaga
Access limited to restricted login or password
Data cannot be used to create software or products for sale or consumption
Data is for research and substantial portions (50,000 words or more) cannot be made available to undergraduates
Any publications or products based on the data should reference the source of the data (see Citation Information)

See the full limitations at Restrictions on use of the corpora.

Corpus del Español

Word, Lemma, and Part of Speech (United States)
