318 questions
2
votes
1
answer
62
views
Vespa indexing anomaly on `exact`-indexed field with diacritical variants and non-Latin scripts
I’m using the Vespa Python client (pyvespa 0.54.0) to query a Vespa index, and I’m running into an issue where Vespa doesn't find a document it has just returned in a previous query.
I have this field ...
0
votes
1
answer
126
views
geom_smooth() producing a linear fit
I'm using R to model the tone contour (pitch) of words in a language and I have two main questions. Note that I am new to R and don't have a data science background, so any help is really appreciated.
...
-1
votes
1
answer
65
views
Automatic Word Boundary Detection for German
I want to rephrase that: I need a corpus of German words so that I can check if a segment is a word. My solution so far is to take the string, check if it's in the dictionary and if not, delete the ...
1
vote
0
answers
100
views
Query Wikidata via SPARQL to get specific word etymology from Wiktionary
I'm trying to get the specific word etymology from Wikidata.
For example, this query gets the word "exact" from Wikidata, but I wasn't able to get the etymology for this word.
SELECT ...
0
votes
1
answer
957
views
What does "assign A to B" mean?
If I say "assign A to B", does it mean (a) A ← B or
(b) B ← A?
In other words, is it (a) A or (b) B that is being modified?
(a) makes sense because A has responsibility over B, so A is ...
0
votes
1
answer
87
views
Problems with reproducing the training of the spaCy pipeline
I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm.
This pipeline is trained on 2 datasets:
UD_Italian-ISDT for the conllu tasks
WikiNer for NET ...
1
vote
0
answers
157
views
In NLTK, how to generate a sample of sentences from PCFG, respecting the probabilities
NLTK has a generate method which enumerates sentences for a given CFG. It also has a PCFG class for probabilistic context-free grammars. Is it possible to generate a sample of sentences with respect to ...
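The idea behind the question, sampling sentences so that each expansion is chosen according to its rule probability, can be sketched as follows. This is a minimal illustration, not NLTK's API: the `GRAMMAR` dict, its rules and probabilities, and the `sample` function are all hypothetical, and only the standard library is used.

```python
import random

# Hypothetical toy PCFG: nonterminal -> list of (right-hand side, probability).
# The probabilities for each nonterminal sum to 1.
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("N",), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "N": [(("dog",), 0.5), (("cat",), 0.5)],
    "V": [(("sees",), 1.0)],
}


def sample(symbol, rng):
    """Expand `symbol` top-down, picking each rule with its probability."""
    if symbol not in GRAMMAR:
        # Anything that is not a nonterminal is emitted as a word.
        return [symbol]
    expansions = GRAMMAR[symbol]
    rhs = rng.choices(
        [r for r, _ in expansions],
        weights=[p for _, p in expansions],
        k=1,
    )[0]
    words = []
    for child in rhs:
        words.extend(sample(child, rng))
    return words


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(5):
        print(" ".join(sample("S", rng)))
```

Because each rule is drawn independently with its own weight, a whole sentence is produced with probability equal to the product of the rules used in its derivation. A recursive grammar would also need a depth limit, which this sketch leaves out.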
0
votes
1
answer
307
views
Weighted Distance Matrix for QWERTZ Keyboard for Levenshtein Distance Algorithm
I have a weight matrix for a Levenshtein distance algorithm, which looks like this:
int[,] weights = new int[6, 6]
{
{ 0, 1, 2, 1, 1, 2 },
{ 1, 0, 1, 2, 1, 2 },
{ 2, 1, 0,...
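For readers unfamiliar with the technique, a Levenshtein distance with per-pair substitution weights might look roughly like the sketch below. It is written in Python rather than the question's C#, and it does not reproduce the asker's matrix: the `NEIGHBOURS` set, the 0.5 cost, and the function names are assumptions made for illustration.

```python
from typing import Callable


def weighted_levenshtein(
    a: str,
    b: str,
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Edit distance where substituting x by y costs sub_cost(x, y)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,       # delete ca
                curr[j - 1] + indel_cost,   # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb)),
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample of horizontally adjacent keys on a QWERTZ layout.
NEIGHBOURS = {frozenset("qw"), frozenset("we"), frozenset("as"), frozenset("zu")}


def qwertz_cost(x: str, y: str) -> float:
    """Assumed weighting: adjacent keys are cheaper to confuse."""
    return 0.5 if frozenset((x, y)) in NEIGHBOURS else 1.0


if __name__ == "__main__":
    print(weighted_levenshtein("was", "wad", qwertz_cost))
```

With a constant substitution cost of 1 this reduces to the ordinary Levenshtein distance, so the weighting lives entirely in `sub_cost` and can be swapped for a lookup into a full matrix.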
2
votes
1
answer
786
views
How to develop a corpus (corpus analysis)
I am going to build a linguistic corpus, but I don't understand which technologies I should use for it. Is it true that to develop a corpus for any language I necessarily have to use IMS Corpus ...
0
votes
1
answer
138
views
Tool for detecting differences between text passages from two different groups
I have text data from two different groups. In total I have around 4000 text passages with around 300 words.
I am searching for a tool that allows me to analyze the difference between these two groups....
0
votes
0
answers
33
views
R - readtext and list of .xml files
I'm trying to create a corpus and a vcorpus from a batch of .xml files, for quantitative linguistics.
With txt files I usually write
library(tm)
library(stopwords)
library(magrittr)
library(dplyr)
...
2
votes
0
answers
66
views
How can I determine if a word is a part of an English word or is a portmanteau (a word created by combining parts of valid English words)?
I am trying to create a validator that takes in words and tries to determine if the word is one of the following:
It is a valid English word
It is a part of an English word
It is an abbreviation
It ...
1
vote
0
answers
211
views
Customization of Wav2Vec2CTCTokenizer with rules
My goal is to fine-tune an ASR model, WavLM, that relies on the pretrained tokenizer Wav2Vec2CTCTokenizer.
I want to fine-tune this ASR model with another language and to perform the tokenization ...
1
vote
0
answers
340
views
spaCy custom tokenizer to separate word with underscore and also to include the whole word
After referring to the link: How to tokenize word with hyphen in Spacy
I learned how to tokenize by separating words containing a hyphen/underscore, but my requirement is to tokenize by separating it ...
1
vote
1
answer
2k
views
How to do the post hoc test in the linear mixed model if I have three predictors (two factor variables and one numeric variable)
I'm using a linear mixed effects model to analyze the reaction time of learners of English as a second language. I have two factor variables - grammaticality (grammatical vs. ungrammatical) and ...