Technical Stuff: Unix

Monday, September 2, 2019

How to validate that a data file is formatted properly before training an ML model!!

In machine learning (ML), clean training data is the key to a better model.
You must remove noise (unwanted data such as special characters) from the input to get clean data.
One of the first steps is to make sure the format is consistent across the entire input file.
A quick way to check this is the command below. It should return
only one value; if it returns more than one, your data file is not properly formatted.

cat file_filename | awk -F',' '{print NF}' | sort -u

Let's look at this command in detail. It is made up of the following parts:

cat: prints the contents of a text file.
awk: a very powerful text processing tool.
sort: sorts its input; the -u option keeps only unique values by removing duplicates.
NF: the number of fields in the current line. It is a built-in awk variable.
Pipe (|): a very powerful shell operator that chains commands by feeding the output of one command into the next.

Let's take a sample file that contains the name, age and city of some customers, separated by commas, so each line should contain exactly three fields. In the data below, however, the second line contains four comma-separated fields. This is a very common scenario in machine learning training data, and checking the format of a large file by hand is not easy. Using the above command we can verify it easily.

Chandra, 35,Singapore
Sekhar,Chandra, 26,Guntur
Sekhar,35,Singapore

Running the command above on this text returns two values, 3 and 4, because the input data is not consistent: the second line contains one extra comma.

After removing the extra comma from the second line, the command returns only one value, 3. The value itself depends on the number of fields in your file, but every line must contain the same number of fields, which is what makes the command return a single value.

cat prints the contents of the file, and the pipe sends that output to awk. awk splits each line on commas because -F sets the field separator, and NF gives the number of fields after the split. In this case there are three lines: the first and third have three fields, and the second has four. sort orders the counts, and -u keeps only the unique values.

This command also works efficiently on very large files. I tried it on data files of around one million lines and it returned results in seconds.
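
The check can be tried end to end on the three sample lines from this post. The sketch below writes them to a temporary file and runs the same awk and sort pipeline. The second awk command is an extension that is not part of the original command: it assumes the first line has the correct number of fields and prints the line numbers that differ from it.

```shell
# Recreate the three-line sample from this post in a temporary file.
sample=$(mktemp)
printf '%s\n' 'Chandra, 35,Singapore' 'Sekhar,Chandra, 26,Guntur' 'Sekhar,35,Singapore' > "$sample"

# Distinct field counts; more than one value means the rows are inconsistent.
counts=$(awk -F',' '{print NF}' "$sample" | sort -u)
echo "$counts"        # prints 3 and 4, one per line

# Extension (assumption: line 1 has the expected field count):
# print the numbers of the lines whose field count differs from line 1.
bad_lines=$(awk -F',' 'NR == 1 {expected = NF} NF != expected {print NR}' "$sample")
echo "$bad_lines"     # prints 2

rm -f "$sample"
```

Knowing the offending line numbers makes it easier to fix a large file than the field counts alone.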

Happy Learning!!!

Labels: CLI, ML, Unix

Tuesday, March 12, 2019

GREP unix command useful tips!!!

Recently I learned some useful tips for 'grep' from my manager, which help a lot when troubleshooting issues with logs. In this post I will explain the following:

How to search for multiple strings in a file at the same time.
How to display the found string in color.
How to exclude a particular string from the search result.
How to display the total number of lines in the grep result.

Searching for a string in a file using grep: check the linked post for the basic usage of the grep command.

Let's see these tips with an example. All the examples in this post use a file named demo.txt.

How to search for multiple strings in a file:
Here we search for the strings 'demo', 'show' and 'multiple' in the file demo.txt; grep prints every line that contains at least one of them.
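
As a minimal sketch of this search: the demo.txt contents below are made up for illustration, and grep -E with | is only one way to pass several patterns (repeating -e for each pattern is another). The exact command used in the original post is not known.

```shell
# Hypothetical stand-in for demo.txt (contents invented for this sketch).
demo=$(mktemp)
printf '%s\n' \
  'This is a demo file for grep.' \
  'It will show how to search for strings.' \
  'grep can search multiple files too.' \
  'Plain line without any keyword.' > "$demo"

# Print every line that contains at least one of the three strings.
matches=$(grep -E 'demo|show|multiple' "$demo")
printf '%s\n' "$matches"

# Equivalent form: one -e option per pattern.
matches_e=$(grep -e 'demo' -e 'show' -e 'multiple' "$demo")

rm -f "$demo"
```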

How to display the found string in color:
Even though grep finds the patterns, it is difficult to see where in each line they occur. For that you can use grep's 'color' option.

With this option the found strings are easy to identify because they are highlighted in red. It saves a lot of time when you are searching through debug logs while troubleshooting.
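
A small sketch of the color option, assuming GNU or BSD grep (--color is not part of POSIX grep) and a made-up demo.txt. On a terminal a plain --color is enough; --color=always is used here only because the output is captured in a variable instead of being printed directly.

```shell
# Hypothetical stand-in for demo.txt (contents invented for this sketch).
demo=$(mktemp)
printf '%s\n' \
  'This is a demo file for grep.' \
  'It will show how to search for strings.' \
  'Plain line without any keyword.' > "$demo"

# Matches are wrapped in terminal escape sequences that render as color.
colored=$(grep --color=always -E 'demo|show' "$demo")
printf '%s\n' "$colored"

# The same search without highlighting, for comparison.
plain=$(grep --color=never -E 'demo|show' "$demo")

rm -f "$demo"
```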

How to exclude a particular string from the search result: grep supports an option to exclude a particular pattern from the result, so that lines containing that string are not included.

In this example, we first search for the strings 'demo' and 'show', and then exclude the string 'multiple' from the results.

How to display the total number of lines in the grep result: We can use the -c option to get the total number of lines in the grep result. If the pattern occurs at most once per line, that count is also the total number of occurrences of the pattern.

In this example, searching for the patterns 'demo' and 'show' matches two lines, so the -c option reports a count of two. When grep returns many results, this count option is very useful.
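
The last two tips can be sketched together. The demo.txt contents are again invented, and piping into grep -v is one common way to exclude a string; the original post may have used a different form.

```shell
# Hypothetical stand-in for demo.txt (contents invented for this sketch).
demo=$(mktemp)
printf '%s\n' \
  'This is a demo file for grep.' \
  'It will show how to search multiple strings.' \
  'Plain line without any keyword.' > "$demo"

# Search for 'demo' or 'show', then drop the lines that contain 'multiple'.
kept=$(grep -E 'demo|show' "$demo" | grep -v 'multiple')
printf '%s\n' "$kept"

# -c prints the number of matching lines instead of the lines themselves.
total=$(grep -c -E 'demo|show' "$demo")
echo "$total"         # prints 2

rm -f "$demo"
```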

Happy Learning!!!

Labels: grep, Unix

Thursday, October 17, 2013

What is the difference between exit() and _exit() in C and Unix

In the C programming language, the exit() function is used to terminate the process. exit() is also called automatically when the program terminates by returning from main(). exit() cleans up the user-mode data related to the library, and internally calls _exit(), which cleans up kernel-related data before terminating the process.

exit() flushes the I/O buffers before ending the process and then calls _exit(), whereas _exit() just terminates the process without cleaning up or flushing user data. It is not advisable to call _exit() in your programs unless you are very clear about the consequences. There is another function called _Exit(), which works the same as _exit(). We need to include stdlib.h for exit() and _Exit(), and unistd.h for _exit().

Sample code with exit function:

#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
int main()
{
 printf("Hello");
 exit(0);
}

Output:

programs$ ./a.out
Helloprograms$

Explanation:
We got the expected result. There is no \n at the end, so the prompt appeared on the same line.

Sample code with _exit function:

#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
int main()
{
 printf("Hello");
 _exit(0);
}

Output:
It prints nothing.
Explanation:
We called _exit() directly, so the I/O buffers were not flushed. The data written by printf was still in the buffer, so nothing was printed.

Sample code with _exit function with \n:

#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
int main()
{
 printf("Hello\n");
 _exit(0);
}

Output:

programs$ ./a.out
Hello
programs$

Explanation:

We got the output Hello because the '\n' forces a flush. printf() keeps its data in an internal buffer, and when standard output is a terminal that buffer is written out only when it fills up or a '\n' is written. When the output goes to a file or a pipe instead, it is usually fully buffered, and a '\n' alone does not flush it.

Using GDB, we can see the functions that are called when the process terminates. The session below is for this simple C program:
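
The effect of buffering can also be checked without a terminal. The sketch below assumes a C compiler is available as cc; the file names are made up. It builds two small programs that both print "Hello\n" and captures their output through a pipe, where standard output is normally fully buffered, so the newline alone does not flush it.

```shell
workdir=$(mktemp -d)

cat > "$workdir/with_exit.c" <<'EOF'
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    printf("Hello\n");
    exit(0);    /* flushes the stdio buffers, then terminates */
}
EOF

cat > "$workdir/with__exit.c" <<'EOF'
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    printf("Hello\n");
    _exit(0);   /* terminates immediately; stdio buffers are not flushed */
}
EOF

cc -o "$workdir/with_exit" "$workdir/with_exit.c"
cc -o "$workdir/with__exit" "$workdir/with__exit.c"

# Command substitution connects stdout to a pipe rather than a terminal.
out_exit=$("$workdir/with_exit")
out__exit=$("$workdir/with__exit")

echo "exit():  '$out_exit'"
echo "_exit(): '$out__exit'"

rm -rf "$workdir"
```

With exit() the text arrives, while with _exit() the captured output is empty even though the string ends in a newline.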

#include<stdio.h>
int main()
{
 printf("Hello");
}

programs$ gdb a.out
GNU gdb 6.3.50-20050815 (Apple version gdb-1824) (Wed Feb 6 22:51:23 UTC 2013)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done
(gdb) br main
Breakpoint 1 at 0x100000f14: file _exit.c, line 5.
(gdb) r
Starting program: /Users/kh1279/Desktop/practice/Blog/programs/a.out 
Reading symbols for shared libraries +............................. done
Breakpoint 1, main () at _exit.c:5
5		printf("Hello");
(gdb) 
(gdb) n
6	}
(gdb) 
0x00007fff933ab7e1 in start ()
(gdb) 
Single stepping until exit from function start, 
which has no line number information.
0x00007fff933ab808 in dyld_stub_exit ()
(gdb) 
Single stepping until exit from function dyld_stub_exit, 
which has no line number information.
0x00007fff8b4a4f74 in exit ()
(gdb) 
Single stepping until exit from function exit, 
which has no line number information.
Hello0x00007fff8b4eb576 in dyld_stub___exit ()
(gdb) 
Single stepping until exit from function dyld_stub___exit, 
which has no line number information.
0x00007fff8efa3ae0 in _exit ()
(gdb) 
Single stepping until exit from function _exit, 
which has no line number information.
Program exited with code 0377.
(gdb) 
The program is not being run.
(gdb)

Explanation:
Compile your program with gcc's -g option; without it GDB has no source-level debugging information to show. After compilation, launch the debugger with gdb a.out (or your program's binary file). Put a breakpoint on the main function and step forward using the next (or n) command. You can see the calls to exit and _exit in the gdb session above.

Labels: C program, Programming in C, Unix

Thursday, August 15, 2013

The Unix file system!!

The Unix file system is divided into four sequential blocks: the boot block, the super block, the inode block and the data blocks. All blocks except the data blocks are fixed at the time the file system is created; the data blocks change when file content changes.

The Boot Block: This block is at the start of the file system and stores the booting (bootstrapping) code. Each file system has a boot block, but only one is needed to boot the system; if you have more than one file system, the other boot blocks will be empty. When the machine starts, the operating system is loaded by reading the code from the boot block.
The Super Block: This block describes the state of the file system. It provides information such as how much space is available, how large the file system is, the total size of the disk, the number of used blocks, bad blocks etc.
The Inode Block: This block holds information about the files. Each file has one unique inode (commonly expanded as index node) on the disk. The inode contains information such as the owner of the file, the file type, the permissions of the file etc.
The Data Blocks: These blocks start immediately after the inode block and contain the actual data, i.e. the file content. Unix file types include regular files, directory files, FIFOs, character special files, block special files and socket files.

I will post about the inode structure in detail later.
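
The inode of a file can be inspected from the shell. The sketch below uses made-up file names; ls -i prints the inode number, and the hard link is added only to illustrate that a file is identified by its inode rather than by its name, since two names can point to the same inode.

```shell
workdir=$(mktemp -d)
echo 'some data' > "$workdir/original.txt"

# A hard link is a second name for the same inode.
ln "$workdir/original.txt" "$workdir/hardlink.txt"

# ls -i prints the inode number in the first column.
inode_original=$(ls -i "$workdir/original.txt" | awk '{print $1}')
inode_link=$(ls -i "$workdir/hardlink.txt" | awk '{print $1}')

# Long listing with inode numbers: also shows permissions, link count and owner.
ls -li "$workdir"

rm -rf "$workdir"
```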

Labels: Operating system, Unix

Thursday, March 15, 2012

GREP command in UNIX!!

grep is a very powerful search command in Unix. It searches a file line by line for a string, pattern or regular expression and displays the lines that contain the given pattern. grep stands for global regular expression print. The grep family includes grep, egrep and fgrep.

Syntax:

grep [options] pattern filename

grep: the original command; its basic use is to search for a string and select the lines that contain it.
egrep: extended grep; it uses extended regular expressions (which support logical OR in the pattern) to select the lines to process. It is equivalent to grep -E.
fgrep: fixed-string grep (often described as fast grep); it does not use regular expressions and matches string literals instead. It is equivalent to grep -F.

Example for grep:

$grep '^[aeiou]' test.txt
in a file line by
or print. The grep family
includes grep, egrep, fgrep
a son of srking.
a son of srqueen .

The above grep command searches test.txt for lines that start with a lowercase vowel and displays them.

Example for egrep:

$egrep 'jr|sr|super ' test.txt
jrking is
a son of srking.
srking is a son of supersrking.
supersking grand sun is jrking.

$grep 'jr|sr|super ' test.txt
$

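The above egrep command searches test.txt for the strings jr, sr or 'super ' (with a trailing space). The single quotes are mandatory; without them the shell would treat | as a pipe. Plain grep displays nothing because in basic regular expressions | is an ordinary character rather than logical OR, so it looks for that literal text.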

Example for fgrep:

$fgrep king test.txt
jrking is
a son of srking.
srking is a son of supersrking.
supersking grand sun is jrking.

$fgrep '^[aeiou]' test.txt
$

In the first example above, fgrep looks for the string literal 'king' in test.txt and displays the lines that contain it. In the second command, fgrep looks for the literal text ^[aeiou] instead of treating it as a regular expression, and it displays nothing because no line contains that text. If you use grep instead of fgrep, you will get the result shown in the grep example above.

Some useful grep options:

-i: ignore case
-r: search for the pattern recursively (useful for searching in a directory and its sub-directories)
-n: display the matching lines with their line numbers
-c: display the count of matching lines
-l: display only the names of the files that contain the given pattern (useful while searching in sub-directories)
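
The difference between the three family members can be sketched with grep alone, using -E for egrep and -F for fgrep. The sample file below is a small made-up subset of lines in the style of test.txt, not the original file.

```shell
# Hypothetical sample in the style of test.txt.
sample=$(mktemp)
printf '%s\n' \
  'jrking is' \
  'a son of srking.' \
  'srking is a son of supersrking.' \
  'includes grep, egrep, fgrep' > "$sample"

# Basic regular expression: lines starting with a lowercase vowel.
bre_count=$(grep -c '^[aeiou]' "$sample")

# Extended regular expression (same as egrep): | means logical OR.
ere_count=$(grep -c -E 'jr|sr' "$sample")

# In a basic regular expression | is a literal character, so nothing matches.
# grep exits with status 1 when there is no match, hence the "|| true".
bre_or_count=$(grep -c 'jr|sr' "$sample") || true

# Fixed string (same as fgrep): ^[aeiou] is searched for as literal text.
fixed_count=$(grep -c -F '^[aeiou]' "$sample") || true

echo "$bre_count $ere_count $bre_or_count $fixed_count"   # prints 2 3 0 0

rm -f "$sample"
```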
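
A short sketch of these options on a made-up directory tree (the file names and contents are invented). Note that -r is supported by GNU and BSD grep but is not part of POSIX grep.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/sub"
printf '%s\n' 'King of the hill' 'a son of srking.' 'no match here' > "$workdir/test.txt"
printf '%s\n' 'jrking is' > "$workdir/sub/other.txt"

# -c counts matching lines; -i makes the match case-insensitive.
count_sensitive=$(grep -c 'king' "$workdir/test.txt")
count_insensitive=$(grep -c -i 'king' "$workdir/test.txt")

# -n prefixes each matching line with its line number.
numbered=$(grep -n 'king' "$workdir/test.txt")

# -r searches the directory tree; -l prints only the names of matching files.
file_count=$(grep -r -l 'king' "$workdir" | wc -l | tr -d ' ')

echo "$count_sensitive $count_insensitive $file_count"   # prints 1 2 2
echo "$numbered"                                         # prints 2:a son of srking.

rm -rf "$workdir"
```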

Labels: grep, Unix

Wednesday, March 14, 2012

Find command in UNIX !!

The find command is used to locate files on Unix or Linux systems. You can search for files by name, owner, group, type, permission, date etc. The syntax of the command is given below.

find whereToLook criteria WhatToDo

All fields of the find command are optional, and each has a default value: whereToLook defaults to . (the current working directory), criteria defaults to none (which shows all files under the current directory), and WhatToDo defaults to -print, which prints the find results on standard output.

Example:

Unix78:~/> find 
Unix78:~/> find .
Unix78:~/> find . -print
Unix78:~/> find -print

All four of the above commands produce the same result because of the default values (the two forms without a path rely on GNU find; some other implementations require the path).
Output:
.
./nosemicolon.c
./const.cpp
./leakc
./a.out
./struct.cpp
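
The default action can be checked in a scratch directory. The sketch below uses invented file names and compares only the two forms that give an explicit path, because omitting the path is not accepted by every find implementation.

```shell
workdir=$(mktemp -d)
touch "$workdir/a.out" "$workdir/const.cpp"

# find . and find . -print should list exactly the same entries.
out_default=$(cd "$workdir" && find . | sort)
out_print=$(cd "$workdir" && find . -print | sort)

printf '%s\n' "$out_default"

rm -rf "$workdir"
```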

To find the exact file name:

 find / -name test.c

This looks for a file named test.c in the whole system and displays its path. Here / stands for the root of the file system (you can specify any other path instead), -name test.c is the criteria telling find to match on the file name, and the WhatToDo field defaults to -print. If there is no such file in the system or in the specified path, find displays nothing.

To search the whole system, find needs to be run as root. If you don't run it as root, find displays an error message for each directory on which you don't have read permission. To avoid these error messages, we can redirect them to the null device as shown below.

 find / -name test.c 2>/dev/null
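
The same search can be tried safely on a small scratch tree instead of /. The directory and file names below are invented for this sketch.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/src/deep"
touch "$workdir/src/deep/test.c" "$workdir/src/Test.c"

# -name matches the exact, case-sensitive file name;
# 2>/dev/null hides any "Permission denied" messages.
found=$(find "$workdir" -name test.c 2>/dev/null)
expected="$workdir/src/deep/test.c"
echo "$found"

rm -rf "$workdir"
```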

Some of the find criteria:

-name filename: search for the exact name
-iname filename: ignore case, e.g. test, Test and teSt are all treated as the same
-group gname: search based on group name (a numeric group ID is also allowed)
-gid n: the file's numeric group ID is n
-P: never follow symbolic links; this is the default behaviour
-L: follow symbolic links

Note that -P and -L are options rather than criteria, and they are given before the path.
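
A small sketch of -iname and -L on an invented directory tree. It assumes GNU or BSD find, since -iname was not part of older POSIX specifications.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/real"
touch "$workdir/real/Test.c"
ln -s "$workdir/real" "$workdir/link"

# Default behaviour (-P): the symbolic link is not followed, so one match.
iname_count=$(find "$workdir" -iname 'test.c' | wc -l | tr -d ' ')

# With -L the link is followed, so the file is reached through both paths.
follow_count=$(find -L "$workdir" -iname 'test.c' | wc -l | tr -d ' ')

echo "$iname_count $follow_count"   # prints 1 2

rm -rf "$workdir"
```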

Labels: Unix
