Goal: Writing more readable tests.
I have a couple of functions, written in Scala with Spark, which basically merge and convert two Datasets. Each of these Datasets has a lot of fields. For testing, I create three Datasets: new records, existing records, and the expected result.
The problem is that the tests are long and hard to read. An example:
test("Merging Movies") {
  val newMovies: Dataset[ATMMovie] = Seq(
    ATMMovie(
      id = 123L,
      utc_insert_timestamp = Some(1524522274),
      movie_title = Some("New movie from ATM"),
      censor_rating_id = Some(0),
      release_year = Some(2018),
      release_date = Some(1524522000),
      primary_language_id = Some(0),
      distributor_id = Some(0),
      internal_pos_movie_code = Some("P1"),
      internal_pos_movie_id = Some("ID1"),
      temporary = 0,
      utc_last_modified_timestamp = None,
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 0
    ),
    ATMMovie(
      id = 456L,
      utc_insert_timestamp = Some(34567522274L),
      movie_title = Some("Title updated"),
      censor_rating_id = Some(0),
      release_year = Some(2016),
      release_date = Some(1524522000),
      primary_language_id = Some(0),
      distributor_id = Some(0),
      internal_pos_movie_code = Some("Movie2"),
      internal_pos_movie_id = Some("MovieID2"),
      temporary = 0,
      utc_last_modified_timestamp = None,
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 3
    )
  ).toDS
  val existingMovies: Dataset[ODSMovie] = Seq(
    ODSMovie(
      movie_row_id = 2L,
      movie_source_id = Some("234"),
      movie_entity_id = 7777L,
      utc_insert_timestamp = Some(1524522000),
      movie_title = Some("Old ODS Movie"),
      censor_rating_id = Some(0),
      release_year = Some(2017),
      release_date = Some(1524522987),
      primary_language_id = Some(1),
      distributor_id = Some(5),
      internal_pos_movie_id = Some("Movie 1"),
      temporary = 0,
      utc_Last_modified_timestamp = Some(1524522666),
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 1
    ),
    ODSMovie(
      movie_row_id = 764L,
      movie_entity_id = 658L,
      utc_insert_timestamp = Some(94567522333L),
      movie_title = Some("Old title"),
      censor_rating_id = Some(0),
      release_year = Some(2016),
      release_date = Some(1524522000),
      primary_language_id = Some(0),
      distributor_id = Some(0),
      movie_source_id = Some("Movie2"),
      internal_pos_movie_id = Some("MovieID2-old"),
      temporary = 0,
      utc_Last_modified_timestamp = None,
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 0
    )
  ).toDS
  val expectedODSMovies: Dataset[ODSMovie] = Seq(
    ODSMovie(
      movie_row_id = 765L,
      movie_source_id = Some("P1"),
      movie_entity_id = 123L,
      utc_insert_timestamp = Some(1524522274),
      movie_title = Some("New movie from ATM"),
      censor_rating_id = Some(0),
      release_year = Some(2018),
      release_date = Some(1524522000),
      primary_language_id = Some(0),
      distributor_id = Some(0),
      internal_pos_movie_id = Some("ID1"),
      temporary = 0,
      utc_Last_modified_timestamp = None,
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 0
    ),
    ODSMovie(
      movie_row_id = 764L,
      movie_entity_id = 456L,
      utc_insert_timestamp = Some(34567522274L),
      movie_title = Some("Title updated"),
      censor_rating_id = Some(0),
      release_year = Some(2016),
      release_date = Some(1524522000),
      primary_language_id = Some(0),
      distributor_id = Some(0),
      movie_source_id = Some("Movie2"),
      internal_pos_movie_id = Some("MovieID2"),
      temporary = 0,
      utc_Last_modified_timestamp = None,
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 3
    ),
    ODSMovie( // Movie we had.
      movie_row_id = 2L,
      movie_source_id = Some("234"),
      movie_entity_id = 7777L,
      utc_insert_timestamp = Some(1524522000),
      movie_title = Some("Old ODS Movie"),
      censor_rating_id = Some(0),
      release_year = Some(2017),
      release_date = Some(1524522987),
      primary_language_id = Some(1),
      distributor_id = Some(5),
      internal_pos_movie_id = Some("Movie 1"),
      temporary = 0,
      utc_Last_modified_timestamp = Some(1524522666),
      force_update = 0,
      utc_last_import_attempt_timestamp = None,
      import_attempts = 1
    )
  ).toDS
As you can see, each test is very hard to read and follow. I'm looking for a better way to write these tests.
Update: I don't care about the values of most of the fields. I only want to test the merging logic.
1 Answer
Our best solution to this was to use a generator-like class or function.
val newMovies = DataFrameBuilder().add(1).add(2, movie_id=71)
It generates values for the fields based on the number we provide. If a field's type is numeric, its value becomes the number itself. If it's a string, its value becomes the name of the field plus the number (e.g. MOVIE_TITLE 1). We can also pass custom values for individual fields.
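A minimal sketch of that idea, in plain Scala without Spark so it stays self-contained. The names AtmMovieLite, AtmMovieBuilder and BuilderExample are hypothetical, and the case class is a trimmed-down stand-in for ATMMovie, not the real one. Custom values are passed as a copy function here rather than as named arguments, which is an assumption about how the override could look:

```scala
// Hypothetical, trimmed-down stand-in for ATMMovie; the real class has many more fields.
final case class AtmMovieLite(
  id: Long,
  movie_title: Option[String],
  release_year: Option[Int],
  import_attempts: Int
)

// Immutable builder: each add(n) appends one record whose fields are derived from n.
final case class AtmMovieBuilder(rows: Vector[AtmMovieLite] = Vector.empty) {
  def add(n: Int, customize: AtmMovieLite => AtmMovieLite = identity): AtmMovieBuilder = {
    val generated = AtmMovieLite(
      id = n.toLong,                          // numeric field: the number itself
      movie_title = Some(s"MOVIE_TITLE $n"),  // string field: field name plus the number
      release_year = Some(n),
      import_attempts = n
    )
    // Apply the caller's overrides last so they win over the generated values.
    copy(rows = rows :+ customize(generated))
  }

  def build: Seq[AtmMovieLite] = rows
}

object BuilderExample {
  // Record 1 is fully generated; record 2 overrides only its id.
  val newMovies: Seq[AtmMovieLite] =
    AtmMovieBuilder().add(1).add(2, _.copy(id = 71L)).build
}
```

In a Spark test, the resulting Seq could then be turned into a Dataset with .toDS, as in the question.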
Another solution was to use a function that takes only the fields we care about and fills the rest with constant values. For example, it takes id and utc_insert_timestamp, and fills in everything else. What matters is that movie_title in the input record is the same as movie_title in the expected record; we don't care whether it's Avengers: End game or Auei3894.
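The second approach could be sketched roughly as follows, again in plain Scala without Spark. MovieLite, MovieFixtures and atmMovie are hypothetical names, the field list is a reduced stand-in for the real ATMMovie, and the constants are arbitrary placeholders:

```scala
// Hypothetical, trimmed-down stand-in for ATMMovie; the real class has many more fields.
final case class MovieLite(
  id: Long,
  utc_insert_timestamp: Option[Long],
  movie_title: Option[String],
  censor_rating_id: Option[Int],
  temporary: Int,
  import_attempts: Int
)

object MovieFixtures {
  // Only the fields the merge logic depends on are required parameters.
  // Everything else is a fixed constant, optionally overridable by name.
  def atmMovie(
    id: Long,
    utcInsertTimestamp: Long,
    title: String = "any title",
    importAttempts: Int = 0
  ): MovieLite =
    MovieLite(
      id = id,
      utc_insert_timestamp = Some(utcInsertTimestamp),
      movie_title = Some(title),
      censor_rating_id = Some(0),
      temporary = 0,
      import_attempts = importAttempts
    )
}
```

A test would then read as MovieFixtures.atmMovie(id = 123L, utcInsertTimestamp = 1524522274L), with a matching helper for the expected records, so that the constant fields are guaranteed to agree between input and expected output.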