40 questions
0
votes
1
answer
87
views
Problems with reproducing the training of the spaCy pipeline
I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm.
This pipeline is trained on 2 datasets:
UD_Italian-ISDT for the conllu tasks
WikiNer for NET ...
1
vote
0
answers
855
views
Label Studio: Importing Txt Files as Whole Files & Exporting the Result
I am trying to export the result for a file that I imported into Label Studio. This is my labeling interface:
<View>
<Labels name="label" toName="text">
<Label ...
1
vote
0
answers
32
views
Parsing Italian CONLLU files to remove lemmas
I am working with Italian Universal Dependency data in CONLLU format, like this:
sent_id = VIT-4006
text = "grazie dell'informazione, la metterò nella memoria del mio Macintosh".
1 " ...
4
votes
1
answer
1k
views
Creating a custom dataset based on CoNLL2003
I'm working on a named entity recognition (NER) project and would like to create my own dataset based on the CoNLL2003 dataset (link: https://huggingface.co/datasets/conll2003). I've been looking at ...
1
vote
1
answer
453
views
Convert Prodigy JSONL / spaCy Doc format to CONLL
I have been searching for a while now but haven't found any solution to my problem. For a relation classification task I have annotated several news-like text documents with Prodigy annotation ...
0
votes
1
answer
92
views
Problem with for loop, break statement does not do what I thought it would
This is my first time posting here, so be gentle, please.
I have written the following code:
import pandas as pd
import spacy
df = pd.read_csv('../../../Data/conll2003.dev.conll', sep='\t', ...
1
vote
3
answers
821
views
Convert spaCy `Doc` into CoNLL 2003 sample
I was planning to train a Spark NLP custom NER model, which uses the CoNLL 2003 format to do so (this blog even leaves some training sample data to speed up the follow-up). This "sample data" ...
0
votes
0
answers
469
views
How do I split a CoNLL-format text file into train, validation, and test sets?
I have a text file that contains data for the NER model; the data is in CoNLL format. A CoNLL-format file is a text file with one word per line and sentences separated by an empty line. The first word ...
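A minimal sketch of one way to do such a split, assuming the layout described above (one token per line, a blank line between sentences). The function names `read_conll_sentences` and `split_sentences`, the 80/10/10 ratios, and the toy data are illustrative assumptions, not taken from the question. Splitting on sentence boundaries keeps each sentence whole:

```python
import random

def read_conll_sentences(text):
    """Group non-empty lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences

def split_sentences(sentences, train=0.8, valid=0.1, seed=0):
    """Shuffle whole sentences and cut them into train/valid/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Toy data (hypothetical): ten two-token sentences in "word<TAB>tag" form.
toy_text = "\n\n".join(f"word{i}\tO\n.\tO" for i in range(10))

sentences = read_conll_sentences(toy_text)
train_set, valid_set, test_set = split_sentences(sentences)
print(len(train_set), len(valid_set), len(test_set))
```

To write a split back out, join the lines of each sentence with newlines and separate sentences with a blank line.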
1
vote
0
answers
142
views
NLP in R: working with tokenization in a CONLLU-style dataframe
I am working on a Portuguese Digital Humanities project using R. I created a CONLLU-style dataframe with the corpus data, using the UDPipe library:
textAnnotated <- udpipe::udpipe_annotate(m_port, ...
-1
votes
1
answer
140
views
How to train a model in SageMaker Studio with .train and .test extension dataset files?
I'm trying to implement ML models with Amazon SageMaker Studio. The model that I want to implement is from Hugging Face, and it uses a dataset from the CoNLL corpora.
Following the ...
0
votes
2
answers
632
views
How to convert annotated text in XML to CONLL?
I need to preprocess XML files for a NER task and I am struggling with the conversion of the XML files. I guess there is a nice and easy way to solve the following problem.
Given an annotated text in ...
1
vote
1
answer
2k
views
Converting spaCy NER entity format to CONLL 2003 format
I am working on an NER application where I have data annotated in the following format.
[('The F15 aircraft uses a lot of fuel', {'entities': [(4, 7, 'aircraft')]}),
('did you see the F16 landing?',...
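A rough sketch of how character-offset annotations like these could be turned into token/tag rows, assuming plain whitespace tokenization and a B-/I-/O tagging scheme. The helper name `offsets_to_bio` is hypothetical, and the output has only a token and an entity tag, not the POS and chunk columns that full CoNLL 2003 files carry:

```python
def offsets_to_bio(text, entities):
    """Tag whitespace tokens with B-/I-/O labels from (start, end, label) spans."""
    rows = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token is tagged only if it lies fully inside the entity span.
            if start >= ent_start and end <= ent_end:
                prefix = "B-" if start == ent_start else "I-"
                tag = prefix + label
                break
        rows.append((token, tag))
    return rows

# First example from the excerpt above.
rows = offsets_to_bio(
    "The F15 aircraft uses a lot of fuel",
    [(4, 7, "aircraft")],
)
for token, tag in rows:
    print(f"{token}\t{tag}")
```

Whitespace splitting leaves punctuation attached to words (for example `landing?`), so an entity ending just before punctuation would be missed; a real tokenizer would be needed for that case.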
0
votes
1
answer
94
views
Removing rows from a pandas DataFrame if one of its cells contains a list of all-caps strings
I was working with the conll2003 dataset. It contains articles from various news sources among other things. It contains sentences, part of speech tags for each word in those sentences, chunk ids for those ...
0
votes
1
answer
118
views
Count the number of labels on IOB corpus with Pandas
From my IOB corpus such as:
mention Tag
170
171 467 O
172
173 Vincennes B-LOCATION
174 . O
175
176 Confirmation O
177 des O
178 privilèges O
179 de O
180 la O
181 ville ...
1
vote
1
answer
449
views
AllenNLP BERT SRL input format ("OntoNotes v. 5.0 formatted")
The goal is to train BERT SRL on another data set. According to the configuration, it requires conll-formatted-ontonotes-5.0.
Natively, my data comes in a CoNLL format and I converted it to the conll-...