I have a large set (around 20 thousand) of short job descriptions in English. For now, my goal is to determine the optimal number of topics. I use an R script that worked decently on a different corpus, but here I get an error I cannot decipher. Please have a look at the reprex at the end of this post. The data can be downloaded from

https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k

and the frequency matrix from

https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V

(in any case it is calculated in the script).

Any suggestion is appreciated.

library(tidyverse)
library(quanteda)
#> Package version: 4.0.2
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
library(seededlda)
#> Loading required package: proxyC
#> 
#> Attaching package: 'proxyC'
#> The following object is masked from 'package:stats':
#> 
#> dist
#> 
#> Attaching package: 'seededlda'
#> The following object is masked from 'package:stats':
#> 
#> terms
library(ldatuning)
 
## Download the data from
## https://e.pcloud.link/publink/show?code=XZ2oeTZTA9RAUJnpaFbknbl6IL9KSuUq14k
jobs <- readRDS("jobs_in_english.RDS") ## read the data
corp <- corpus(jobs, docid_field = "id",
               text_field = "description") ## create a corpus
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE, remove_url = TRUE)
## generate the frequency matrix
## if you want, you can download it directly from
## https://e.pcloud.link/publink/show?code=XZkynTZfDN89V7AJNRGzvVwyua2qFkQop3V
dfmt <- dfm(toks) |>
  dfm_remove(stopwords("en")) |>
  dfm_remove("*@*") |>
  dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
print(dfmt)
#> Document-feature matrix of: 19,989 documents, 91,979 features (99.90% sparse) and 1 docvar.
#>            features
#> docs        panel paint technician colchester essex 7.00am-4.30pm basic p.a
#>   872828466     5     5          4          2     1             1     2   1
#>   857077872     0     0          0          0     0             0     0   0
#>   801801567     0     0          0          0     0             0     0   0
#>   855162927     0     0          0          0     0             0     0   0
#>   767099713     0     0          0          0     0             0     0   0
#>   770142853     0     0          0          0     0             0     0   0
#>            features
#> docs        depending held
#>   872828466         1    1
#>   857077872         0    0
#>   801801567         0    0
#>   855162927         0    0
#>   767099713         0    0
#>   770142853         0    0
#> [ reached max_ndoc ... 19,983 more documents, reached max_nfeat ... 91,969 more features ]
## try to determine the optimal number of topics.
result <- FindTopicsNumber(
 dfmt,
 topics = seq(from = 2, to = 10, by = 1),
 metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
 method = "Gibbs",
 control = list(seed = 77),
 mc.cores = 2L,
 verbose = TRUE
)
#> fit models...
#> Error in checkForRemoteErrors(val): 2 nodes produced errors; first error: Each row of the input matrix needs to contain at least one non-zero entry
### Here the code fails...and I do not understand why
sessionInfo()
#> R version 4.4.1 (2024年06月14日)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C 
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C 
#> [9] LC_ADDRESS=C LC_TELEPHONE=C 
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C 
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base 
#> 
#> other attached packages:
#> [1] ldatuning_1.0.2 seededlda_1.2.1 proxyC_0.4.1 quanteda_4.0.2 
#> [5] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 
#> [9] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 
#> [13] ggplot2_3.5.1 tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#> [1] styler_1.10.3 utf8_1.2.4 generics_0.1.3 xml2_1.3.6 
#> [5] slam_0.1-50 stringi_1.8.4 lattice_0.22-6 hms_1.1.3 
#> [9] digest_0.6.35 magrittr_2.0.3 evaluate_0.23 grid_4.4.1 
#> [13] timechange_0.3.0 fastmap_1.1.1 R.oo_1.26.0 R.cache_0.16.0 
#> [17] Matrix_1.7-0 tm_0.7-13 R.utils_2.12.3 topicmodels_0.2-16
#> [21] stopwords_2.3 fansi_1.0.6 scales_1.3.0 modeltools_0.2-23 
#> [25] cli_3.6.2 rlang_1.1.3 R.methodsS3_1.8.2 munsell_0.5.1 
#> [29] reprex_2.1.0 withr_3.0.0 yaml_2.3.8 parallel_4.4.1 
#> [33] NLP_0.2-1 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0 
#> [37] fastmatch_1.1-4 vctrs_0.6.5 R6_2.5.1 stats4_4.4.1 
#> [41] lifecycle_1.0.4 fs_1.6.4 pkgconfig_2.0.3 pillar_1.9.0 
#> [45] gtable_0.3.5 glue_1.7.0 Rcpp_1.0.12 xfun_0.43 
#> [49] tidyselect_1.2.1 knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26 
#> [53] compiler_4.4.1

Created on 2024年06月18日 with reprex v2.1.0

asked Jun 18, 2024 at 13:14

1 Answer

ldatuning does not use seededlda, so you will need to find the number of topics, k, with a for-loop of your own. The good news is that seededlda v1.3 has functions for this kind of optimization: divergence() and perplexity(). If you set batch_size = 0.01, you can fit many LDA models quickly. For example:

for (k in seq(10, 100, by = 10)) {
  lda <- textmodel_lda(dfmt, k = k, batch_size = 0.01, max_iter = 1000)
  print(divergence(lda, regularize = FALSE)) # higher is better
  print(perplexity(lda)) # lower is better
}
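
As a side note, the "Each row of the input matrix needs to contain at least one non-zero entry" error from FindTopicsNumber() most likely comes from documents that are left with zero features after dfm_trim(): ldatuning fits its models with topicmodels, which refuses empty rows in the document-term matrix. Below is a minimal sketch, assuming the dfmt object from the question, that drops those empty documents first and then stores both seededlda metrics for each candidate k so they can be compared or plotted; the metrics data frame is purely illustrative.

## drop documents that became empty after trimming
## (these are also what FindTopicsNumber() trips over)
dfmt <- dfm_subset(dfmt, ntoken(dfmt) > 0)

## fit one model per candidate k and record both metrics
ks <- seq(10, 100, by = 10)
metrics <- data.frame(k = ks, divergence = NA_real_, perplexity = NA_real_)
for (i in seq_along(ks)) {
  lda <- textmodel_lda(dfmt, k = ks[i], batch_size = 0.01, max_iter = 1000)
  metrics$divergence[i] <- divergence(lda, regularize = FALSE) # higher is better
  metrics$perplexity[i] <- perplexity(lda)                     # lower is better
}
print(metrics)

A peak in divergence and a low perplexity usually point to a sensible range for k rather than a single exact value, so it is worth inspecting the topics for a few neighbouring values of k as well.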
answered Jun 19, 2024 at 9:48