@@ -21,6 +21,7 @@ revealOptions:
21
21
- 📍 Principal Data Scientist, DSAI, Moderna
22
22
- 🎓 ScD, MIT Biological Engineering.
23
23
- 🧬 Inverse protein, mRNA, and molecule design.
24
+ - 🎉 Accelerated and enriched analysis of data.
24
25
25
26
---
26
27
@@ -43,6 +44,13 @@ If you write automated tests for your work, then:
43
44
44
45
---
45
46
47
+ ## ⭕️ Outline
48
+
49
+ - Testing in Software
50
+ - Testing in Data Science
51
+
52
+ ---
53
+
46
54
## 💻 Testing in Software
47
55
48
56
- 🤔 Why do testing?
@@ -59,6 +67,10 @@ Tests help falsify the hypothesis that our code _works_.
59
67
60
68
----
61
69
70
+ Without tests, the claim that our code works remains an unverified assumption.
71
+
72
+ ----
73
+
62
74
### 🧪 What does a test look like?
63
75
64
76
----
@@ -137,7 +149,9 @@ mamba env update -f environment.yml
With `pytest` installed, use it to run your tests:

```bash
- pytest
+ cd /path/to/my_project
+ conda activate my_project
+ pytest .
```
142
156
143
157
---
@@ -212,6 +226,8 @@ We update the test to establish new expectations.
1. ✅ Guarantees against breaking changes.
2. 🤔 Example-based documentation for your code.
214
228
229
+ > Testing is a contract between yourself (now) and yourself (in the future).
230
+
215
231
---
216
232
217
233
### 👆 What kind of tests exist?
@@ -220,25 +236,55 @@ We update the test to establish new expectations.
220
236
221
237
#### 1️⃣ Unit Test
222
238
- A test that checks that an individual function works correctly.
+ ```python
+ def func1(data):
+     ...
+     return stuff
+
+ def test_func1(data):
+     stuff = func1(data)
+     assert stuff == ...
+ ```

- _Strive to write this type of test!_
+ _A test that checks that an individual function works correctly. Strive to write this type of test!_
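
A minimal concrete sketch of the unit-test pattern, assuming a hypothetical `normalize` function (not from the original slides) in place of `func1`:

```python
def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical example function)."""
    total = sum(values)
    if total == 0:
        raise ValueError("Cannot normalize values that sum to zero.")
    return [v / total for v in values]


def test_normalize():
    # Arrange
    values = [1, 1, 2]

    # Act
    result = normalize(values)

    # Assert: check the actual values, not just that the call succeeded
    assert result == [0.25, 0.25, 0.5]
    assert sum(result) == 1.0


test_normalize()  # pytest would discover and call this automatically
```

The assertion pins down exactly what `normalize` should return for a known input, which is what separates a unit test from a mere execution test.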
226
250
227
251
----
228
252
229
253
#### 2️⃣ Execution Test
230
254
- A test that only checks that a function executes without erroring.
+ ```python
+ def func1(data):
+     ...
+     return stuff

- _Use only in a pinch._
+ def test_func1(data):
+     func1(data)
+ ```
+
+ _A test that only checks that a function executes without erroring. Use only in a pinch._
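
As a rough illustration of why this is a last resort, here is a sketch with a hypothetical `summarize` function (an assumption for illustration, not from the slides):

```python
def summarize(values):
    """Return simple summary statistics (hypothetical example function)."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def test_summarize_executes():
    # Only checks that the call completes without raising.
    # It would still pass if the returned numbers were wrong.
    summarize([1.0, 2.0, 3.0])


test_summarize_executes()
```

This catches crashes such as a `TypeError` or `ZeroDivisionError`, but says nothing about whether the mean is computed correctly.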
234
265
235
266
----
236
267
237
268
#### 3️⃣ Integration Test
238
269
- A test that checks that multiple functions work correctly together.
+ ```python
+ def func1(data):
+     ...
+     return stuff
+
+ def func2(data):
+     ...
+     return stuff
+
+ def pipeline(data):
+     return func2(func1(data))
+
+ def test_pipeline(data):
+     output = pipeline(data)
+     assert output == ...
+ ```

- _Used to check that a system is working properly._
+ _Checks that a system is working properly. Use sparingly if the tests take a long time to run!_
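
A small runnable sketch of the same idea, assuming two hypothetical steps, `drop_missing` and `square`, standing in for `func1` and `func2`:

```python
def drop_missing(values):
    """Remove None entries (hypothetical stand-in for func1)."""
    return [v for v in values if v is not None]


def square(values):
    """Square each entry (hypothetical stand-in for func2)."""
    return [v * v for v in values]


def pipeline(data):
    return square(drop_missing(data))


def test_pipeline():
    # Arrange
    data = [1, None, 3]

    # Act
    output = pipeline(data)

    # Assert: the steps compose correctly end to end
    assert output == [1, 9]


test_pipeline()
```

If `square` were called before `drop_missing`, each function could still pass its own unit test while this test fails on `None * None`, which is the kind of mismatch an integration test is meant to surface.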
242
288
243
289
---
244
290
@@ -273,6 +319,10 @@ Testing your DS code will be good for you!
273
319
274
320
## 😎 Testing in Data Science
275
321
322
+ - Machine Learning Model Code
323
+ - Data
324
+ - Pipelines
325
+
276
326
----
277
327
278
328
### 🧠 Testing Machine Learning Model Code
@@ -305,9 +355,29 @@ of the shape that `model` accepts.
305
355
306
356
#### 🤔 What can we test here?
307
357
- 1. Our model accepts the correct inputs and outputs.
- 2. Our model and datamodules work together.
- 3. Our model does not fail in training loop.
+ 1. ___Unit test:___ `dm` produces correctly-shaped outputs when executed.
+ 2. ___Unit test:___ Given random inputs, `model` produces correctly-shaped outputs.
+ 3. ___Integration test:___ Given `dm` outputs, `model` produces correctly-shaped outputs.
+ 4. ___Execution test:___ `model` does not fail in the training loop with `trainer` and `dm`.
362
+
363
+ ----
364
+
365
+ #### 🟩 DataModule output shapes
366
+
+ ```python
+ def test_datamodule_shapes():
+     # Arrange
+     batch_size = 3
+     input_dims = 4
+     dm = DataModule(batch_size=batch_size)
+
+     # Act
+     x, y = next(iter(dm.train_dataloader()))
+
+     # Assert
+     assert x.shape == (batch_size, input_dims)
+     assert y.shape == (batch_size, 1)
+ ```
311
381
312
382
----
313
383
@@ -317,12 +387,17 @@ of the shape that `model` accepts.
  from jax import random, vmap, numpy as np

  def test_model_shapes():
+     # Arrange
      key = random.PRNGKey(55)
-     num_samples = 7
-     num_input_dims = 211
-     inputs = random.normal(shape=(num_samples, num_input_dims))
-     model = Model(num_input_dims=num_input_dims)
+     batch_size = 3
+     input_dims = 4
+     inputs = random.normal(key, shape=(batch_size, input_dims))
+     model = Model(input_dims=input_dims)
+
+     # Act
      outputs = vmap(model)(inputs)
+
+     # Assert
-     assert outputs.shape == (num_samples, 1)
+     assert outputs.shape == (batch_size, 1)
327
402
```
328
403
@@ -332,11 +407,16 @@ def test_model_shapes():
332
407
```python
  def test_model_datamodule_compatibility():
+     # Arrange
      dm = DataModule()
      model = Model()
      x, y = next(iter(dm.train_dataloader()))
+
+     # Act
      pred = vmap(model)(x)
-     assert x.shape == y.shape
+
+     # Assert
+     assert pred.shape == y.shape
```
341
421
342
422
----
@@ -345,18 +425,21 @@ def test_model_datamodule_compatibility():
345
425
```python
  def test_model():
+     # Arrange
      model = Model()
      dm = DataModule()
      trainer = default_trainer(epochs=2)
+
+     # Act
      trainer.fit(model, dm)
```
353
436
354
- Ensure that model can be trained for at least 2 epochs.
355
-
356
437
---
357
438
358
439
### 📀 Testing Data
359
440
441
+ _a.k.a. Data Validation_
442
+
360
443
----
361
444
362
445
#### 👆 What data guarantees do we need?
@@ -396,7 +479,7 @@ df_schema = pa.DataFrameSchema(
396
479
```python
  def func(df):
-     df_schema.validate(df)
+     df = df_schema.validate(df)
      # The rest of the logic
      ...
```
@@ -415,11 +498,13 @@ Code is much more readable.
415
498
```python
  def pipeline(data):
+     data = df_schema.validate(data)
      d1 = func1(data)
      d2 = func2(d1)
      d3 = func3(d1)
      d4 = func4(d2, d3)
-     return outfunc(d4)
+     output = outfunc(d4)
+     return output_schema.validate(output)
```
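
To make the "validate on the way in and on the way out" idea concrete without any dependencies, here is a pure-Python sketch. `validate_rows` is a hypothetical stand-in for a schema's `validate()` method (it is not pandera's API), and the column names are invented for illustration:

```python
def validate_rows(rows, required_columns):
    """Hypothetical stand-in for schema validation: return rows if valid, else raise."""
    for i, row in enumerate(rows):
        missing = required_columns - set(row)
        if missing:
            raise ValueError(f"Row {i} is missing columns: {sorted(missing)}")
    return rows


def pipeline(rows):
    # Validate inputs before doing any work...
    rows = validate_rows(rows, {"dose", "response"})
    output = [{"ratio": r["response"] / r["dose"]} for r in rows]
    # ...and validate outputs before handing them to the caller.
    return validate_rows(output, {"ratio"})


result = pipeline([{"dose": 2.0, "response": 1.0}])
```

Because the validator returns the data it checked, it can be reassigned in place (`rows = validate_rows(...)`), mirroring the `df = df_schema.validate(df)` pattern in the slides.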
424
509
425
510
----
@@ -440,6 +525,28 @@ def test_func4(data):
440
525
...
441
526
```
442
527
528
+ ----
529
+
530
+ #### 🤝 The whole pipeline can be integration tested
531
+
+ ```python
+ def test_pipeline():
+     # Arrange
+     data = pd.DataFrame(...)
+
+     # Act
+     output = pipeline(data)
+
+     # Assert
+     assert output == ...
+ ```
543
+
544
+ _We assume your pipeline is quick to run._
545
+
546
+ ---
547
+
548
+ ### 🕓 One more thing
549
+
443
550
---
444
551
445
552
### 💰 Mock-up Realistic Fake Data
@@ -509,9 +616,9 @@ _Do unto others what you would have others do unto you._
509
616
510
617
## 😎 Summary
511
618
- 1. ✅ Write tests for your **code**.
- 2. ✅ Write tests for your **data**.
- 3. ✅ Write tests for your **models**.
+ 1. ✅ Write tests for your __code__.
+ 2. ✅ Write tests for your __data__.
+ 3. ✅ Write tests for your __models__.
515
622
516
623
---
517
624