Commit e790365

Browse files
update README
1 parent 8c384cb commit e790365

1 file changed: +191 -0 lines changed

‎README.md‎

Lines changed: 191 additions & 0 deletions
@@ -388,3 +388,194 @@ Home Restaurant - Los Feliz
...
```

# Writing Clean Data
Once we have extracted the data, we want it to look clean, i.e. without stray
spaces and newlines. For that we will use some simple logic.

**writing_clean_data.py**
```
...
with open(file_path, 'a') as textFile:
    count = 0
    for biz in businesses:
        try:
            title = biz.find('a', {'class': 'biz-name'}).text
            # .contents is the list of child nodes (text strings and tags)
            address = biz.find('address').contents
            # print(address)
            phone = biz.find('span', {'class': 'biz-phone'}).text
            region = biz.find('span', {'class': 'neighborhood-str-list'}).contents
            count += 1
            # Items containing "br" go through getText(); all other items are
            # stripped of surrounding spaces, tabs, newlines and carriage returns
            for item in address:
                if "br" in item:
                    print(item.getText())
                else:
                    print('\n' + item.strip(" \n\r\t"))
            for item in region:
                if "br" in item:
                    print(item.getText())
                else:
                    print(item.strip(" \n\t\r") + '\n')
...
```

We simply get the text of the item if there are any **br** tags in it; otherwise we
strip the newlines, carriage returns, tabs and spaces from the text. Running the file prints:

```
800 W Sunset Blvd
Echo Park


4156 Santa Monica Blvd
Silver Lake


8500 Beverly Blvd
Beverly Grove


5484 Wilshire Blvd
Mid-Wilshire


5115 Wilshire Blvd
Hancock Park


126 E 6th St
Downtown


8164 W 3rd St
Beverly Grove


7910 W 3rd St
Beverly Grove


4163 W 5th St
Koreatown


435 N Fairfax Ave
Beverly Grove


1267 W Temple St
Echo Park


429 W 8th St
Downtown


724 S Spring St
Downtown


8450 W 3rd St
Beverly Grove


2308 S Union Ave
University Park


5583 W Pico Blvd
Mid-Wilshire

'NoneType' object has no attribute 'contents'

3413 Cahuenga Blvd W
Hollywood Hills


727 N Broadway
Chinatown


6602 Melrose Ave
Hancock Park


612 E 11th St
Downtown
...
```
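
The stripping step can also be tried on its own, without scraping anything. The following is a minimal sketch, assuming the fragments have already been reduced to plain strings; `clean_fragments` is a hypothetical helper and the sample input is made up, neither is part of `writing_clean_data.py`.

```python
def clean_fragments(fragments):
    """Strip spaces, tabs, newlines and carriage returns from each fragment.

    Fragments that are empty after stripping are dropped.
    """
    cleaned = []
    for fragment in fragments:
        text = fragment.strip(" \n\r\t")
        if text:
            cleaned.append(text)
    return cleaned


# Made-up input that resembles scraped address text (an assumption).
raw_address = ["\n        800 W Sunset Blvd", "\n    "]
print(clean_fragments(raw_address))  # ['800 W Sunset Blvd']
```

Dropping the empty fragments is a design choice of this sketch; the original loop prints every item, which is where the extra blank lines in the output come from.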

In the same way, we have to clean the phone number:

```
...
            for item in phone:
                if "br" in item:
                    phone_number += item.getText() + " "
                else:
                    phone_number += item.strip(" \n\t\r") + " "
...

        except Exception as e:
            print(e)
            # Append the error to the log; the with block closes the file for us
            with open('errors.log', 'a') as logs:
                logs.write(str(e) + '\n')
            # Fall back to None so one bad listing does not stop the run
            address = None
            phone_number = None
            region = None
```
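
The log-and-continue pattern above can be isolated into a small function. This is only a sketch under assumptions: `extract_or_none` and `missing_address` are hypothetical names, and the log is written to a temporary directory instead of the project's `errors.log`.

```python
import os
import tempfile


def extract_or_none(extract, log_path):
    """Run extract(); on any error, append the message to log_path and return None."""
    try:
        return extract()
    except Exception as e:
        # The with block closes the log file even if the write fails.
        with open(log_path, 'a') as logs:
            logs.write(str(e) + '\n')
        return None


def missing_address():
    # Simulates biz.find('address') returning None for a listing without an address.
    tag = None
    return tag.contents


log_path = os.path.join(tempfile.mkdtemp(), 'errors.log')
print(extract_or_none(lambda: '(213) 514-5724', log_path))  # (213) 514-5724
print(extract_or_none(missing_address, log_path))           # None
```

Wrapping each field separately would let one missing field leave the others intact, whereas the single `try` block above resets all three values together.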

Now change the ```start = 990``` value, delete all content in
```yelp-{city}-clean.txt```, and run the file again. All the details of each
restaurant will be written to the file.

**yelp-{city}-clean.txt**
```
Tea Station Express




Bestia
2121 E 7th Pl Downtown
(213) 514-5724


République
624 S La Brea Ave Hancock Park
(310) 362-6115


The Morrison
3179 Los Feliz Blvd Atwater Village
(323) 667-1839


A Food Affair
1513 S Robertson Blvd Pico-Robertson
(310) 557-9795


Running Goose
1620 N Cahuenga Blvd Hollywood
(323) 469-1080


Howlin’ Ray’s




Perch
448 S Hill St Downtown
(213) 802-1770


Faith & Flower
705 W 9th St Downtown
(213) 239-0642
```

Notice that some of the data is missing. When an error occurs, we reduce the risk
of the code crashing by setting the values to ```None```.
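
To see how ```None``` values turn into the empty lines in the file, here is a rough sketch. `format_record` is a hypothetical helper, and the layout (title, address plus region, phone number, two blank lines) is inferred from the sample above rather than taken from the actual script.

```python
def format_record(title, address, region, phone_number):
    """Return the text block for one restaurant; None fields become empty lines."""
    # Join address and region with a space, skipping whichever is missing.
    location = " ".join(part for part in (address, region) if part)
    lines = [title, location, phone_number or "", "", ""]
    return "\n".join(lines) + "\n"


print(format_record("Bestia", "2121 E 7th Pl", "Downtown", "(213) 514-5724"), end="")
print(format_record("Tea Station Express", None, None, None), end="")
```

A record whose fields are all ```None``` is written as the title followed by four empty lines, which matches the entries with missing data shown above.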
