quanteda/index.Rmd at master · pszewach/quanteda · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
---
title: "Home"
output:
  html_document:
    toc: true
---

# Overview

## Introduction

**quanteda** is an R package for managing and analyzing text. This package makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole.  **quanteda** includes tools to make it easy and fast to manuipulate the texts in a corpus, by performing the most common natural language processing tasks simply and quickly, such as tokenizing, stemming, or forming ngrams.  **quanteda**'s functions for tokenizing texts and forming multiple tokenized documents into a *document-feature matrix* are both extremely fast and extremely simple to use.  **quanteda** can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.

Built on the text processing
functions in the **stringi** package,
which is in turn built on C++ implementation of the [ICU](http://www.icu-project.org/) libraries for Unicode text handling, **quanteda** pays special attention to fast and correct implementation of Unicode and the handling of text in any character set, following conversion internally to UTF-8.

**quanteda** is built for efficiency and speed, through its design around three infrastructures: the **stringi** package for text processing, the **data.table** package for indexing large documents efficiently, and the **Matrix** package for sparse matrix objects.  If you can fit it into memory, **quanteda** will handle it quickly.  (And eventually, we will make it possible to process objects even larger than available memory.)

**quanteda** is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features.  The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables.  The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined "thesaurus", and trimming and weighting features based on document frequency, feature frequency, and related measures such as *tf-idf*.

## Features

### Corpus Management Tools

The tools for getting texts into a corpus object include:

* loading texts from directories of individual files
* loading texts ``manually'' by inserting them into a corpus using helper functions
*  managing text encodings and conversions from source files into corpus texts
* attaching variables to each text that can be used for grouping, reorganizing a corpus, or simply recording additional information to supplement quantitative analyses with non-textual data
* recording meta-data about the sources and creation details for the corpus.

The tools for working with a corpus include:

* summarizing the corpus in terms of its language units
* reshaping the corpus into smaller units or more aggregated units
* adding to or extracting subsets of a corpus
* resampling texts of the corpus, for example for use in non-parametric bootstrapping of the texts
* Easy extraction and saving, as a new data frame or corpus, key words in context (KWIC)

### Natural-Language Processing tools

For extracting features from a corpus, **quanteda** provides the following tools:

* extraction of word types
* extraction of word n-grams
* extraction of dictionary entries from user-defined dictionaries
* feature selection through
    - stemming
    - random selection
    - document frequency
    - word frequency
* and a variety of options for cleaning word types, such as capitalization and rules for handling punctuation.

### Document-Feature Matrix analysis tools

For analyzing the resulting *document-feature* matrix created when features are abstracted from a corpus, **quanteda** provides:

* scaling methods, such as correspondence analysis, Wordfish, and Wordscores
* topic models, such as LDA
* classifiers, such as Naive Bayes or k-nearest neighbour
* sentiment analysis, using dictionaries

### Additional and Planned Features

**Additional features** of quanteda include:

*  the ability to explore texts using *key-words-in-context*
*  fast computation of a variety of readability indexes
*  fast computation of a variety of lexical diversity measures
*  quick computation of word or document association measures, for clustering or to compute similarity scores for other purposes
*  a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document

### Working with other packages

**quanteda** is hardly unique in providing facilities for working with text -- the excellent *tm* package already provides many of the features we have described.  **quanteda** is designed to complement those packages, as well to simplify the implementation of the text-to-analysis workflow.  **quanteda** corpus structures are simpler objects than in *tm*s, as are the document-feature matrix objects from **quanteda**, compared to the sparse matrix implementation found in *tm*.  However, there is no need to choose only one package, since we provide translator functions from one matrix or corpus object to the other in **quanteda**.

Once constructed, a **quanteda** dfm can be easily passed to other text-analysis packages for additional analysis of topic models or scaling, such as:

*  topic models (including converters for direct use with the **topicmodels**, **LDA**, and **stm** packages)
*  document scaling using **quanteda**'s own functions for the "wordfish" and "Wordscores" models, and a sparse method for correspondence analysis
*  document classification methods, using (for example) Naive Bayes, k-nearest neighbour, or Support Vector Machines
*  more sophisticated machine learning through a variety of other packages that take matrix or matrix-like inputs.
*  graphical analysis, including word clouds and strip plots for selected themes or words.

## System Requirements

**quanteda** is cross-platform but we recommend MacOS or Linux as an operating system for their better support for Unicode. RAM depends on the size and the structure of the textual data to analyze. Usually, a text file of 100MB on disk takes 500MB to 1GB on memory as a tokens object (short texts require more memory than long texts when the total numbers of words are the same).

| | Minimum | Recommended |
| --- | :---: | :---: |
| OS | Windows/MacOS/Linux | MacOS/Linux |
| CPU | 1 core | 4 cores or more |
| RAM | 2GB | 8GB or more |
| IDE | | R Studio |