@@ -388,3 +388,194 @@ Home Restaurant - Los Feliz
...
```
+
+ # Writing Clean Data
+ Once we have extracted the data, we want to clean it up, i.e. remove the stray spaces and
+ newlines around each value. A simple piece of logic is enough for that.
+
+ **writing_clean_data.py**
+ ```
+ ...
+ with open(file_path, 'a') as textFile:
+     count = 0
+     for biz in businesses:
+         try:
+             title = biz.find('a', {'class': 'biz-name'}).text
+             # .contents is a list mixing text nodes and tags such as <br>
+             address = biz.find('address').contents
+             phone = biz.find('span', {'class': 'biz-phone'}).text
+             region = biz.find('span', {'class': 'neighborhood-str-list'}).contents
+             count += 1
+             for item in address:
+                 if isinstance(item, str):
+                     # text node: strip surrounding whitespace
+                     print('\n' + item.strip(" \n\r\t"))
+                 else:
+                     # tag: take its text
+                     print(item.get_text())
+             for item in region:
+                 if isinstance(item, str):
+                     print(item.strip(" \n\t\r") + '\n')
+                 else:
+                     print(item.get_text())
+ ...
+ ```
+
+ For each item we take its text if it is a tag such as **br**; otherwise we strip the
+ newlines, carriage returns, tabs, and spaces from the text. Running the file prints:
+
+ ```
+ 800 W Sunset Blvd
+ Echo Park
+
+
+ 4156 Santa Monica Blvd
+ Silver Lake
+
+
+ 8500 Beverly Blvd
+ Beverly Grove
+
+
+ 5484 Wilshire Blvd
+ Mid-Wilshire
+
+
+ 5115 Wilshire Blvd
+ Hancock Park
+
+
+ 126 E 6th St
+ Downtown
+
+
+ 8164 W 3rd St
+ Beverly Grove
+
+
+ 7910 W 3rd St
+ Beverly Grove
+
+
+ 4163 W 5th St
+ Koreatown
+
+
+ 435 N Fairfax Ave
+ Beverly Grove
+
+
+ 1267 W Temple St
+ Echo Park
+
+
+ 429 W 8th St
+ Downtown
+
+
+ 724 S Spring St
+ Downtown
+
+
+ 8450 W 3rd St
+ Beverly Grove
+
+
+ 2308 S Union Ave
+ University Park
+
+
+ 5583 W Pico Blvd
+ Mid-Wilshire
+
+ 'NoneType' object has no attribute 'contents'
+
+ 3413 Cahuenga Blvd W
+ Hollywood Hills
+
+
+ 727 N Broadway
+ Chinatown
+
+
+ 6602 Melrose Ave
+ Hancock Park
+
+
+ 612 E 11th St
+ Downtown
+ ...
+ ```
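
The cleaning step above can be sketched as one small helper. This is only an illustration, not the tutorial's own code: `clean_parts` and `FakeTag` are hypothetical names, and `FakeTag` stands in for a parsed tag so the example runs without any third-party library. With BeautifulSoup, the list would come from `.contents`. The sample strings are taken from the output shown above.

```python
def clean_parts(parts):
    """Return the non-empty, whitespace-stripped text of each part, joined by spaces."""
    cleaned = []
    for part in parts:
        # Text nodes are plain strings; anything else is assumed to expose get_text().
        text = part if isinstance(part, str) else part.get_text()
        text = text.strip(" \n\r\t")
        if text:
            cleaned.append(text)
    return " ".join(cleaned)


class FakeTag:
    """Hypothetical stand-in for a parsed tag such as <br>."""

    def __init__(self, text=""):
        self._text = text

    def get_text(self):
        return self._text


# A list shaped like .contents: raw strings with a tag in between.
parts = ["\n        800 W Sunset Blvd", FakeTag(), "\n   Echo Park \n"]
print(clean_parts(parts))
```

Joining the cleaned pieces with a single space keeps each field on one line, which makes it easier to write one field per line to the output file.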
+
+ The phone number has to be cleaned in the same way:
+
+ ```
+ ...
+             # assumes phone holds the span's .contents here (text nodes and tags)
+             for item in phone:
+                 if isinstance(item, str):
+                     phone_number += item.strip(" \n\t\r") + " "
+                 else:
+                     phone_number += item.get_text() + " "
+ ...
+
+         except Exception as e:
+             print(e)
+             # append the error to a log file; `with` closes it automatically
+             with open('errors.log', 'a') as logs:
+                 logs.write(str(e) + '\n')
+             # fall back to None so one bad listing does not stop the run
+             address = None
+             phone_number = None
+             region = None
+ ```
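
One possible variation on the error handling above is to guard each field separately, so that a missing address does not also wipe out a phone number that parsed fine. The sketch below is an assumption, not part of the original script: `extract_or_none` is a hypothetical helper, and a plain dictionary stands in for a scraped listing.

```python
import os
import tempfile


def extract_or_none(extract, source, log_path):
    """Call extract(source); on any error, log the message and return None."""
    try:
        return extract(source)
    except Exception as error:
        # Append the error to the log, as the tutorial does with errors.log.
        with open(log_path, "a") as logs:
            logs.write(str(error) + "\n")
        return None


# Stand-in for a listing that has a title but no address.
listing = {"title": "Tea Station Express"}
log_path = os.path.join(tempfile.mkdtemp(), "errors.log")

title = extract_or_none(lambda d: d["title"], listing, log_path)
address = extract_or_none(lambda d: d["address"].strip(), listing, log_path)
print(title, address)
```

Here `title` is kept while `address` falls back to `None`, and the failure is still recorded in the log file.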
+
+ Set ``` start = 990 ```, delete all content in ``` yelp-{city}-clean.txt ```, and run the
+ file again. The details of each restaurant will be written to the file.
+
+ **yelp-{city}-clean.txt**
+ ```
+ Tea Station Express
+
+
+
+
+ Bestia
+ 2121 E 7th Pl Downtown
+ (213) 514-5724
+
+
+ République
+ 624 S La Brea Ave Hancock Park
+ (310) 362-6115
+
+
+ The Morrison
+ 3179 Los Feliz Blvd Atwater Village
+ (323) 667-1839
+
+
+ A Food Affair
+ 1513 S Robertson Blvd Pico-Robertson
+ (310) 557-9795
+
+
+ Running Goose
+ 1620 N Cahuenga Blvd Hollywood
+ (323) 469-1080
+
+
+ Howlin’ Ray’s
+
+
+
+
+ Perch
+ 448 S Hill St Downtown
+ (213) 802-1770
+
+
+ Faith & Flower
+ 705 W 9th St Downtown
+ (213) 239-0642
+ ```
+
+ Notice that some of the data is missing: when an error occurs while parsing a listing, we set
+ its values to ``` None ``` instead of letting the code crash.
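
The effect of the ``` None ``` values on the output file can be sketched as follows. The layout (title, address, phone number, then two blank lines) is inferred from the sample file above, and `format_record` is a hypothetical name, so treat this as an assumption about how the records are written rather than the script's actual code.

```python
def format_record(title, address, phone_number):
    """Build one record: three field lines followed by two blank lines."""
    fields = [title, address, phone_number]
    # A field that fell back to None is written as an empty line.
    lines = [field if field is not None else "" for field in fields]
    return "\n".join(lines) + "\n\n\n"


complete = format_record("Bestia", "2121 E 7th Pl Downtown", "(213) 514-5724")
partial = format_record("Tea Station Express", None, None)
print(complete + partial, end="")
```

A listing with missing fields then shows up as a title followed by empty lines, which matches entries such as Tea Station Express in the sample.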