Sunday, September 22, 2019
Useful commands for data cleaning in Machine Learning(ML) training - Part 1
If you are a CLI user , some of these commands may be familiar to you. The hardest part in whole Machine learning(ML) is providing clean data to train. ML is all about data and there is lot of data available these days and huge data is generating every data in the form of text, videos, images etc. All this data is not structured and that is one of the biggest problem. How to make unstructured into a usable format? By using combination of below commands, we can clean raw data by removing noise and make a structured data which further can be used for training a model.
- grep
- pipe(|)
- cat
- awk
- > (redirection)
- wc -l
- more
- head
- tail
- ls -lh
- sed
grep: is a command to find or search for a particular pattern in a given file. you can find more on this.
pipe(|): pipe alone we cant use. But this will be used to join more than one command.
cat: is a command to show contents of a given file in the console.
awk: is very powerful command line utility for text processing.
>(redirection): is the utility to redirect results of any command from console to any other file
wc: This command will give total lines, words and characters from a given file. -l will give only total lines from a given file. This will be very useful to know how big the data file is.
more: this is to see the contents of a given file in pagination format if file content is more than one page. Initially I used to use vi to see the contents of a file, after knowing about more command, I started using more instead of vi and it is very effective as well.
head: to see sample content(first 10 lines) from a given file. you can use -n to specify how many lines you want to show on the console
tail: same as head, but from bottom. To see sample content(last 10 lines) from a given file. you can use -n to specify how many lines you want to show from bottom on the console
ls -lh: Many of us know about ls command. trick here is l and h. l will show file details like permissions and owner etc, h will show the size of the file in human readable format. like if file size is 1424, ls -h will show 1.4K. you can easily tell later one is more readable.
sed: is a stream editor. this command mostly I used for replacing a pattern in the same file.
Happy Learning!!!
Subscribe to:
Post Comments (Atom)
Popular Posts
-
A universally unique identifier ( UUID ) is an identifier standard used in software construction, standardized by the Open...
-
Recently I started working on Japser Studio professional for my new project Cloud to generate the reports. I was very new to all cloud ...
-
Below is C program for AVL Tree implementation. #include<stdio.h> #include<malloc.h> typedef struct bst { int info; int hei...
-
strcmp is another string library function which is used to compare two strings and returns zero if both strings are same , returns +ve valu...
-
One of the complex operation on binary search tree is deleting a node. Insertion is easy by calling recursive insertion. But deletion wont...
-
We have recently faced one tricky issue in AWS cloud while loading S3 file into Redshift using python. It took almost whole day to inde...
-
Object slicing: when a derived class object is assigned to a base class object. only base class data will be copied from derived class and...
-
We have faced lot of weird issues while loading S3 bucke t files into redshift. I will try to explain all issues what we faced. Before go...
-
Below code is to find the cube root of a given integer number with out using pow math library function. Its very simple and brute force...
-
Recently we faced one issue in reading messages from SQS in AWS cloud where we are processing same message multiple times. This issue we...
No comments:
Post a Comment