
I'm reading CSV files and processing them daily so I can append the data to my bronze layer in Databricks using Auto Loader. The code looks like this:

    from pyspark.sql.functions import col as spark_col, current_timestamp, current_date

    def run_autoloader(table_name, checkpoint_path, latest_file_location, new_columns):
        # Configure Auto Loader to ingest parquet data to a Delta table
        (spark.readStream
            .format("cloudFiles")
            #.schema(df_schema)
            .option("cloudFiles.format", "parquet")
            .option("cloudFiles.schemaLocation", checkpoint_path)
            .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
            .load(latest_file_location)
            .toDF(*new_columns)
            .select("*",
                    spark_col("_metadata.file_path").alias("source_file"),
                    current_timestamp().alias("processing_time"),
                    current_date().alias("processing_date"))
            .writeStream
            .option("checkpointLocation", checkpoint_path)
            .trigger(once=True)
            .option("mergeSchema", "true")
            .toTable(table_name))
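
For reference, I call it roughly like this (the table name, paths and column list here are just placeholders):

    # Example invocation (placeholder values only)
    new_columns = ["order_id", "customer_id", "amount", "order_date"]

    run_autoloader(
        table_name="bronze.orders",
        checkpoint_path="/mnt/bronze/orders/_checkpoint",
        latest_file_location="/mnt/landing/orders/",
        new_columns=new_columns,
    )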

Previously this was able to handle evolving schemas, but today, after a new column was introduced in the input CSVs, I got the following error:

 requirement failed: The number of columns doesn't match.

I've read some posts suggesting either editing the schema manually or resetting it by deleting the schema checkpoint path, but the first would require ongoing manual maintenance and the second would mean wiping all our bronze data, so for now neither is an option, especially if it's only a temporary fix.

I don't understand why this suddenly started happening, as handling evolving schemas is specifically what Auto Loader was designed to do.

Any help would be much appreciated.

asked Dec 19, 2024 at 9:42

1 Answer


Can you clarify in your question whether you are attempting to read Parquet or CSV? In the code snippet you provided you specify the format as Parquet: .option("cloudFiles.format", "parquet"). If you are trying to read CSV files with Auto Loader, you should specify the format as csv.

  1. For CSV files, you need to set cloudFiles.inferColumnTypes to true if you want Auto Loader to infer the column data types; it defaults to false, as noted in the documentation linked below.
  2. Double-check that checkpoint_path contains both the inferred schema information and the checkpoint information (see the snippet after this list).
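
For point 2, Auto Loader records the inferred schema (and each evolved version of it) in a _schemas directory under the configured cloudFiles.schemaLocation, so a quick listing like the sketch below (using the same checkpoint_path variable as in your snippet) shows whether schema tracking is actually happening:

    # List the schema versions Auto Loader has stored under the schema location;
    # a new file should appear each time the schema evolves (e.g. a column is added).
    display(dbutils.fs.ls(f"{checkpoint_path}/_schemas"))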

Referencing this documentation:

    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .option("cloudFiles.inferColumnTypes", "true")  # check docs for explanation
        .load(latest_file_location)
        .toDF(*new_columns)
        .select("*",
                spark_col("_metadata.file_path").alias("source_file"),
                current_timestamp().alias("processing_time"),
                current_date().alias("processing_date"))
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .option("mergeSchema", "true")
        .toTable(table_name))
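
Once the format matches and the new column has been picked up (with addNewColumns the stream typically stops once so the schema tracked at the schema location can be updated, then ingests the column on the next run), a quick sanity check that the evolved schema reached the bronze table could be:

    # Confirm the newly added column is now part of the bronze table's schema.
    spark.table(table_name).printSchema()
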
answered Dec 20, 2024 at 5:33