Loading Check_spelling_dl Pretrained Pipeline crashes in Spark NLP 2.7.2+ #2254

Open
@C-K-Loan

Description

Since Spark NLP 2.7.2+, loading the check_spelling_dl pretrained pipeline crashes.
Building a pipeline with ContextSpellCheckerModel.pretrained() works fine.

Successfully tested on the following versions of Spark NLP:

  • 2.6.2
  • 2.7.0
  • 2.7.1

Crashes on the following versions of Spark NLP (a quick version check is sketched after this list):

  • 2.7.2
  • 2.7.3
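As a sanity check before reproducing, you can confirm which Spark NLP version is active. A minimal sketch, using the standard sparknlp.version() accessor:

import sparknlp

# Print the active Spark NLP version; the crash reportedly appears
# from 2.7.2 onward.
print(sparknlp.version())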

This snippet runs fine:

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline, LightPipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spell = ContextSpellCheckerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, spell])
data = [{"text": "Some text hello world"}]
df = spark.createDataFrame(data)
nlp_pipeline.fit(df).transform(df).show()
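As an aside, the fitted model from this working pipeline can also be wrapped in a LightPipeline (already imported above) for quick, driver-local annotation. A minimal sketch, reusing nlp_pipeline and df from the snippet above:

# Wrap the fitted pipeline in a LightPipeline for driver-local annotation.
# Reuses nlp_pipeline and df from the working snippet above.
model = nlp_pipeline.fit(df)
light = LightPipeline(model)
# annotate() maps each output column name to its annotation results
print(light.annotate("Some text helo world"))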

The following snippet, by contrast, crashes:

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline, LightPipeline

spark = sparknlp.start()

pipeline = PretrainedPipeline('check_spelling_dl', lang='en')
data = [{"text": "Some text hello world"}]
df = spark.createDataFrame(data)
pipeline.transform(df).show()

Colab link for reproduction:

https://colab.research.google.com/drive/1QpV7RYj65DXJQm2xxB1s2o_6J88yB8-n?usp=sharing

It fails with the following error message:

check_spelling_dl download started this may take some time.
Approx size to download 112.1 MB
[OK!]
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-2-f7b1aa24d037> in <module>()
 10 
 11 spark = sparknlp.start()
---> 12 pipeline = PretrainedPipeline('check_spelling_dl', lang='en')
 13 data = [ {"text": 'Some text hello world'}, ]
 14 df = spark.createDataFrame(data)
8 frames
/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
 326 raise Py4JJavaError(
 327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
 329 else:
 330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 12.0 failed 1 times, most recent failure: Lost task 1.0 in stage 12.0 (TID 23, localhost, executor driver): java.lang.ArrayStoreException: java.lang.Byte
	at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
	at scala.Array$.slowcopy(Array.scala:81)
	at scala.Array$.copy(Array.scala:107)
	at scala.collection.mutable.ResizableArray$class.copyToArray(ResizableArray.scala:77)
	at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
	at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at com.johnsnowlabs.nlp.serialization.TransducerFeature.deserializeObject(Feature.scala:281)
	at com.johnsnowlabs.nlp.serialization.Feature.deserialize(Feature.scala:47)
	at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:15)
	at com.johnsnowlabs.nlp.FeaturesReader$$anonfun$load$1.apply(ParamsAndFeaturesReadable.scala:14)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:14)
	at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:379)
	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:373)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadPipeline(ResourceDownloader.scala:479)
	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadPipeline(ResourceDownloader.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayStoreException: java.lang.Byte
	at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
	at scala.Array$.slowcopy(Array.scala:81)
	at scala.Array$.copy(Array.scala:107)
	at scala.collection.mutable.ResizableArray$class.copyToArray(ResizableArray.scala:77)
	at scala.collection.mutable.ArrayBuffer.copyToArray(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
	at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:104)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
	at scala.collection.AbstractTraversable.toArray(Traversable.scala:104)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
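Until this is resolved, pinning Spark NLP to 2.7.1 (the last version listed above that loads the pipeline) is a possible workaround. A minimal sketch, assuming the PyPI package name spark-nlp and a restartable runtime:

import subprocess, sys

# Hedged workaround sketch: downgrade to Spark NLP 2.7.1, which loaded
# check_spelling_dl successfully per the version list above. Restart the
# runtime/kernel afterwards before re-running the reproduction snippet.
subprocess.check_call([sys.executable, "-m", "pip", "install", "spark-nlp==2.7.1"])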
