I have a large set (around 20 thousand) of short job descriptions in English. For now, my goal is to determine the optimal number of topics. I use an R script that worked decently on a different corpus, but here I get an error I cannot decipher. Please have a look at the reprex at the end of this post. The data can be downloaded from

https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k

and the frequency matrix from

https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V

(in any case it is calculated in the script).

Any suggestion is appreciated.

library(tidyverse)
library(quanteda)
#> Package version: 4.0.2
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
library(seededlda)
#> Loading required package: proxyC
#> 
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#> 
#> dist
#> 
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#> 
#> terms
library(ldatuning)
 
## Download the data from
## https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k
jobs <- readRDS("jobs_in_english.RDS") ## read the data
corp <- corpus(jobs, docid_field = "id",
               text_field = "description") ## create a corpus
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
## generate the frequency matrix
## if you want, you can download it directly from
## https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V
dfmt <- dfm(toks) |>
  dfm_remove(stopwords("en")) |>
  dfm_remove("*@*") |>
  dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 19,989 documents, 91,979 features (99.90% sparse) and 1 docvar.
#>            features
#> docs        panel paint technician colchester essex 7.00am-4.30pm basic p.a
#>   872828466     5     5          4          2     1             1     2   1
#>   857077872     0     0          0          0     0             0     0   0
#>   801801567     0     0          0          0     0             0     0   0
#>   855162927     0     0          0          0     0             0     0   0
#>   767099713     0     0          0          0     0             0     0   0
#>   770142853     0     0          0          0     0             0     0   0
#>            features
#> docs        depending held
#>   872828466         1    1
#>   857077872         0    0
#>   801801567         0    0
#>   855162927         0    0
#>   767099713         0    0
#>   770142853         0    0
#> [ reached max_ndoc ... 19,983 more documents, reached max_nfeat ... 91,969 more features ]
## try to determine the optimal number of topics.
result <- FindTopicsNumber(
 dfmt,
 topics = seq(from = 2, to = 10, by = 1),
 metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
 method = "Gibbs",
 control = list(seed = 77),
 mc.cores = 2L,
 verbose = TRUE
)
#> fit models...
#> Error in checkForRemoteErrors(val): 2 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
### Here the code fails...and I do not understand why
sessionInfo()
#> R version 4.4.1 (2024年06月14日)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C 
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C 
#> [9] LC_ADDRESS=C LC_TELEPHONE=C 
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C 
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base 
#> 
#> other attached packages:
#> [1] ldatuning_1.0.2 seededlda_1.2.1 proxyC_0.4.1 quanteda_4.0.2 
#> [5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 
#> [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 
#> [13] ggplot2_3.5.1 tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#> [1] styler_1.10.3 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 
#> [5] slam_0.1-50 stringi_1.8.4 lattice_0.22-6 hms_1.1.3 
#> [9] digest_0.6.35 magrittr_2.0.3 evaluate_0.23 grid_4.4.1 
#> [13] timechange_0.3.0 fastmap_1.1.1 R.oo_1.26.0 R.cache_0.16.0 
#> [17] Matrix_1.7-0 tm_0.7-13 R.utils_2.12.3 topicmodels_0.2-16
#> [21] stopwords_2.3 fansi_1.0.6 scales_1.3.0 modeltools_0.2-23 
#> [25] cli_3.6.2 rlang_1.1.3 R.methodsS3_1.8.2 munsell_0.5.1 
#> [29] reprex_2.1.0 withr_3.0.0 yaml_2.3.8 parallel_4.4.1 
#> [33] NLP_0.2-1 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0 
#> [37] fastmatch_1.1-4 vctrs_0.6.5 R6_2.5.1 stats4_4.4.1 
#> [41] lifecycle_1.0.4 fs_1.6.4 pkgconfig_2.0.3 pillar_1.9.0 
#> [45] gtable_0.3.5 glue_1.7.0 Rcpp_1.0.12 xfun_0.43 
#> [49] tidyselect_1.2.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26 
#> [53] compiler_4.4.1

Created on 2024年06月18日 with reprex v2.1.0

asked Jun 18, 2024 at 13:14

1 Answer

ldatuning does not use seededlda, so you will need to find the number of topics, k, with a for-loop of your own. The good news is that seededlda v1.3 has functions for this kind of optimization: divergence() and perplexity(). If you set batch_size = 0.01, you can fit many LDA models quickly. For example:

for (k in seq(10, 100, by = 10)) {
  lda <- textmodel_lda(dfmt, k = k, batch_size = 0.01, max_iter = 1000)
  print(divergence(lda, regularize = FALSE)) # higher is better
  print(perplexity(lda)) # lower is better
}
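
As a side note, the "Each row of the input matrix needs to contain at least one non-zero entry" error from FindTopicsNumber() most likely comes from documents that are left with zero features after dfm_trim(): ldatuning fits its models with topicmodels, which refuses empty rows in the document-term matrix. Below is a minimal sketch, assuming the dfmt object from the question, that drops those empty documents first and then stores both seededlda metrics for each candidate k so they can be compared or plotted; the metrics data frame is purely illustrative.

## drop documents that became empty after trimming
## (these are also what FindTopicsNumber() trips over)
dfmt <- dfm_subset(dfmt, ntoken(dfmt) > 0)

## fit one model per candidate k and record both metrics
ks <- seq(10, 100, by = 10)
metrics <- data.frame(k = ks, divergence = NA_real_, perplexity = NA_real_)
for (i in seq_along(ks)) {
  lda <- textmodel_lda(dfmt, k = ks[i], batch_size = 0.01, max_iter = 1000)
  metrics$divergence[i] <- divergence(lda, regularize = FALSE) # higher is better
  metrics$perplexity[i] <- perplexity(lda)                     # lower is better
}
print(metrics)

A peak in divergence and a low perplexity usually point to a sensible range for k rather than a single exact value, so it is worth inspecting the topics for a few neighbouring values of k as well.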
answered Jun 19, 2024 at 9:48