Package: text2vec 0.6.6

text2vec: Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph], Qing Wang [aut, cph]

text2vec_0.6.6.tar.gz
text2vec_0.6.6.zip (r-4.7), text2vec_0.6.6.zip (r-4.6), text2vec_0.6.6.zip (r-4.5)
text2vec_0.6.6.tgz (r-4.6-x86_64), text2vec_0.6.6.tgz (r-4.6-arm64), text2vec_0.6.6.tgz (r-4.5-x86_64), text2vec_0.6.6.tgz (r-4.5-arm64)
text2vec_0.6.6.tar.gz (r-4.7-arm64), text2vec_0.6.6.tar.gz (r-4.7-x86_64), text2vec_0.6.6.tar.gz (r-4.6-arm64), text2vec_0.6.6.tar.gz (r-4.6-x86_64)
text2vec_0.6.6.tgz (r-4.6-emscripten)
manual.pdf | manual.html
DESCRIPTION | NEWS
card.svg | card.png
text2vec/json (API)

# Install 'text2vec' in R:
install.packages('text2vec', repos = c('https://dselivanov.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/dselivanov/text2vec/issues

Uses libs:
  • c++: GNU Standard C++ Library v3

glove, latent-dirichlet-allocation, natural-language-processing, text-mining, topic-modeling, vectorization, word-embeddings, word2vec, cpp

13.57 score | 874 stars | 29 packages | 1.7k scripts | 5.6k downloads | 4 mentions | 42 exports | 14 dependencies

Last updated from: 0b31bdd81f. Checks: 13 OK. Indexed: yes.

Target | Result | Time
linux-devel-arm64 | OK | 198
linux-devel-x86_64 | OK | 233
source / vignettes | OK | 256
linux-release-arm64 | OK | 219
linux-release-x86_64 | OK | 205
macos-release-arm64 | OK | 115
macos-release-x86_64 | OK | 289
macos-oldrel-arm64 | OK | 125
macos-oldrel-x86_64 | OK | 228
windows-devel | OK | 186
windows-release | OK | 188
windows-oldrel | OK | 160
wasm-release | OK | 180

Exports: as.lda_c, BNS, char_tokenizer, check_analogy_accuracy, coherence, Collocations, combine_vocabularies, create_dtm, create_tcm, create_vocabulary, dist2, fit, fit_transform, GlobalVectors, GloVe, hash_vectorizer, idir, ifiles, ifiles_parallel, itoken, itoken_parallel, jsPCA_robust, LatentDirichletAllocation, LatentSemanticAnalysis, LDA, LSA, normalize, pdist2, perplexity, postag_lemma_tokenizer, prepare_analogy_questions, prune_vocabulary, psim2, RelaxedWordMoversDistance, RWMD, sim2, space_tokenizer, split_into, TfIdf, vocab_vectorizer, vocabulary, word_tokenizer

Dependencies: data.table, digest, float, lattice, lgr, Matrix, MatrixExtra, mlapi, R6, Rcpp, RcppArmadillo, RhpcBLASctl, rsparse, stringi

Analyzing Texts with the text2vec package
Text analysis pipeline | Vectorization | Vocabulary-based vectorization | Pruning vocabulary | N-grams | Feature hashing | Basic transformations | Normalization | TF-IDF

Last update: 2023-11-13
Started: 2016-01-10
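
The steps named in this vignette outline can be sketched end to end as follows. This is a minimal illustration, not the vignette's own code: it assumes the bundled `movie_review` dataset, and the 500-review slice, the pruning thresholds, and the hash size are arbitrary values chosen to keep the run short.

```r
library(text2vec)

# movie_review ships with text2vec (IMDB reviews); a small slice keeps this quick.
data("movie_review")
reviews <- movie_review[1:500, ]

# Iterator over tokens: lowercase each review, then split it into words.
it <- itoken(reviews$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = reviews$id,
             progressbar = FALSE)

# Vocabulary of unigrams and bigrams, pruned of rare and very common terms.
# The thresholds below are illustrative assumptions, not recommended values.
vocab <- create_vocabulary(it, ngram = c(1L, 2L))
vocab <- prune_vocabulary(vocab, term_count_min = 5L, doc_proportion_max = 0.5)

# Vocabulary-based vectorization into a sparse document-term matrix.
dtm <- create_dtm(it, vocab_vectorizer(vocab))

# Feature hashing alternative: no vocabulary pass is needed.
dtm_hashed <- create_dtm(it, hash_vectorizer(hash_size = 2^14, ngram = c(1L, 2L)))

# Basic transformations: L1 row normalization and TF-IDF weighting.
dtm_l1 <- normalize(dtm, "l1")
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
```

The vocabulary route keeps term names as column names, which helps interpretation; the hashed route trades that for a fixed column count and a single pass over the data.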

GloVe Word Embeddings
Word embeddings | GloVe algorithm | Linguistic regularities

Last update: 2023-11-13
Started: 2016-01-10
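
A rough sketch of fitting GloVe embeddings with the exported functions follows. It assumes the bundled `movie_review` dataset as a stand-in corpus, which is far too small for meaningful embeddings; the rank, window size, iteration count, and the query word "good" are illustrative assumptions only.

```r
library(text2vec)

data("movie_review")

# Token iterator over a small slice of reviews (toy-sized corpus).
it <- itoken(movie_review$review[1:500],
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

# Keep only terms that occur at least 5 times.
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)

# Term-co-occurrence matrix with a symmetric window of 5 tokens.
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# Fit GloVe; GlobalVectors is re-exported from rsparse.
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 10, n_threads = 1)

# Sum main and context vectors to get the final word vectors.
word_vectors <- wv_main + t(glove$components)

# Nearest neighbours of a query word by cosine similarity.
# "good" is assumed to survive pruning in this slice.
query <- word_vectors["good", , drop = FALSE]
sims <- sim2(word_vectors, query, method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 5)
```

Analogy-style queries follow the same pattern: build the query row by adding and subtracting word vectors, then rank all words by cosine similarity with `sim2`.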

Readme and manuals

Help Manual

Help page | Topics
Converts document-term matrix sparse matrix to 'lda_c' format | as.lda_c
BNS | BNS
Checks accuracy of word embeddings on the analogy task | check_analogy_accuracy
Coherence metrics for topic models | coherence
Collocations model. | Collocations
Combines multiple vocabularies into one | combine_vocabularies
Document-term matrix construction | create_dtm, create_dtm.itoken, create_dtm.itoken_parallel
Term-co-occurrence matrix construction | create_tcm, create_tcm.itoken, create_tcm.itoken_parallel
Creates a vocabulary of unique terms | create_vocabulary, create_vocabulary.character, create_vocabulary.itoken, create_vocabulary.itoken_parallel, vocabulary
Pairwise Distance Matrix Computation | dist2, distances, pdist2
re-export rsparse::GloVe | GlobalVectors, GloVe
Creates iterator over text files from the disk | idir, ifiles, ifiles_parallel
Iterators (and parallel iterators) over input objects | itoken, itoken.character, itoken.iterator, itoken.list, itoken_parallel, itoken_parallel.character, itoken_parallel.iterator, itoken_parallel.list
(numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components | jsPCA_robust
Creates Latent Dirichlet Allocation model. | LatentDirichletAllocation, LDA
Latent Semantic Analysis model | LatentSemanticAnalysis, LSA
IMDB movie reviews | movie_review
Matrix normalization | normalize
Perplexity of a topic model | perplexity
Prepares list of analogy questions | prepare_analogy_questions
Printing Vocabulary | print.text2vec_vocabulary
Prune vocabulary | prune_vocabulary
Creates Relaxed Word Movers Distance (RWMD) model | RelaxedWordMoversDistance, RWMD
Pairwise Similarity Matrix Computation | psim2, sim2, similarities
Split a vector for parallel processing | split_into
text2vec | text2vec-package, text2vec
TfIdf | TfIdf
Simple tokenization functions for string splitting | char_tokenizer, postag_lemma_tokenizer, space_tokenizer, tokenizers, word_tokenizer
Vocabulary and hash vectorizers | hash_vectorizer, vectorizers, vocab_vectorizer