Splitting log file into smaller files

Question 1

I had a file that looked like this:

Mar 06 22:00:00 [10.251.132.246] logger: 10.64.69.219 - - [06/Mar/2011:22:.....
Mar 06 22:00:00 [10.251.132.246] logger: 10.98.137.116 - - [06/Mar/2011:22:0....

that I wanted to split into smaller files using the IP address after "logger".

This is what I came up with:

file = ARGV.shift
split_file = {}
pattern = /logger: ([^\s]*)/
File.open(file, 'r') do |f|
    f.each do |l|
        match = l[pattern]
        if match
            list = split_file[$1]
            list = [] if list == nil
            list << l
            split_file[$1] = list
        end
    end
end
split_file.each_pair do |k, v|
    File.open("#{file}.#{k}", "a+") do |f|
        v.each do |l|
            f.print l
        end
    end
end

Suggestions, warnings, improvements are very welcome :)

One thing I noticed is that the new files are created in the same directory as the original file, not at the current working directory (so ./logsplitter.rb ../log.log creates files in the .. directory).

Thank you

[edit: typo]

Question 2

Do you mind me asking why you'd want to split log files this way? It looks like you want to have a log per IP instead of one giant log. Is this for documentation purposes? Do you just want a way of checking all access for a particular IP?

Question 3

I needed to split the log per ip for further analysis and thought ruby would be the easiest way to do it.. for me at least :)

Question 4

@Dale - Ok, yes I got that, but I meant, "what kind of analysis?". If you're just looking to check up on a specific IP, for example, cat log.log | grep 10.64.69.210 is a better approach than a splitting script (if you want a count of how many times they visited, pipe the output of that through wc -l). If you just want a list of unique IPs that visited, then awk '{print $6}' log.log | sort -u might be enough for you (again, pipe wc to taste). I'm asking to see if a ruby script (which you will now need to maintain) is actually the best solution for you.

Question 5

Revised result (heavily cut down) : bitbucket.org/dwijnand/logsplitter/src/999b61673f65/…

Question 6

@Inaimathi sorry, didn't see your comment. It was to try and follow what was going on for specific IPs. Yes, I could have hand-picked a few and grepped them into individual files, but this was simple enough (and a nice exercise) to warrant a ruby script :)

Question 7

First of all, it is a pretty widespread convention in Ruby to use 2 spaces for indentation, not 4. Personally I don't care, but there are some Ruby developers who will complain when seeing code indented with 4 spaces, so you'll have an easier time just going with the stream.

file = ARGV.shift

Unless there is a good reason to mutate ARGV (and there doesn't seem to be one here), I'd recommend not using mutating operations. file = ARGV[0] will work perfectly fine here.
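To make the difference concrete, here is a minimal sketch. It uses a stand-in array called args (a hypothetical name with made-up values) instead of the real ARGV, so it runs without command-line arguments:

```ruby
# Stand-in for ARGV with hypothetical values, so the example runs as-is.
args = ["../log.log", "--verbose"]

by_index = args[0]            # plain read, the array is left untouched
size_after_index = args.size

by_shift = args.shift         # returns the same value but removes it from the array
size_after_shift = args.size

puts by_index                 # ../log.log
puts by_shift                 # ../log.log
puts size_after_index         # 2
puts size_after_shift         # 1
```

Both forms return the same file name; only shift changes what later code sees in the argument list.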

match = l[pattern]
if match
    list = split_file[$1]
    list = [] if list == nil
    list << l
    split_file[$1] = list
end

Next, you should avoid using magic variables such as $1. Using MatchData objects is more robust. As an example, consider this scenario:

Assume that you decide you want to do some processing on the line before storing it in split_file. For this you decide to use gsub. Now your code looks like this:

match = l[pattern]
if match
    list = split_file[$1]
    list = [] if list == nil
    list << l.gsub( /some_regex/, "some replacement")
    split_file[$1] = list
end

However, this code is broken. Since gsub also sets $1, $1 no longer contains what you think it does, and split_file[$1] will not work as expected. This kind of bug can't happen if you use [1] on a MatchData object instead.
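The effect can be reproduced with a small sketch. The sample line is a shortened, hypothetical stand-in for a real log line, and the time-matching regex is only an assumed placeholder for "some processing":

```ruby
pattern = /logger: ([^\s]*)/
# Hypothetical, shortened sample line (the real log lines are longer).
l = "Mar 06 22:00:00 [10.251.132.246] logger: 10.64.69.219 - - sample"

# Magic-variable version: $1 is overwritten by the next regex operation.
l[pattern]
ip_before = $1                        # capture from `pattern`
l.gsub(/(\d\d):\d\d:\d\d/, "TIME")    # stand-in for the processing step
ip_after = $1                         # now reflects the gsub regex, not `pattern`

# MatchData version: the object keeps its own captures.
match = l.match(pattern)
l.gsub(/(\d\d):\d\d:\d\d/, "TIME")
ip_stable = match[1]

puts ip_before == ip_after            # false
puts ip_stable                        # 10.64.69.219
```

The MatchData object is an ordinary local value, so later regex calls on the same line cannot change what match[1] returns.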

Furthermore, the whole code can be simplified by using a very useful feature of Ruby hashes: default blocks. Hashes in Ruby allow you to specify a block which is executed when a key is not found. This way you can create a hash of arrays which you can just append to without having to make sure the array exists.

For this you need to change the initialization of split_file from split_file = {} to split_file = Hash.new {|h,k| h[k] = [] }. Then you can replace the above code with:

match = l.match(pattern)
if match
  split_file[ match[1] ] << l
end
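Put together, the grouping step might look like the following sketch. The lines array is a hypothetical in-memory stand-in for the log file, with shortened sample lines, so the example runs without any input file:

```ruby
# Hash whose default block stores an empty array the first time a key is looked up.
split_file = Hash.new { |h, k| h[k] = [] }
pattern = /logger: ([^\s]*)/

# Hypothetical, shortened sample lines standing in for the log file.
lines = [
  "Mar 06 22:00:00 [10.251.132.246] logger: 10.64.69.219 - - first\n",
  "Mar 06 22:00:00 [10.251.132.246] logger: 10.98.137.116 - - second\n",
  "Mar 06 22:00:01 [10.251.132.246] logger: 10.64.69.219 - - third\n",
  "a line without the marker\n"
]

lines.each do |l|
  match = l.match(pattern)
  # No nil check on the array is needed: the default block creates it on demand.
  split_file[match[1]] << l if match
end

split_file.each { |ip, group| puts "#{ip}: #{group.size} line(s)" }
```

Note that the block assigns h[k] = [] rather than just returning []. Without the assignment, each lookup would return a fresh array that is never stored in the hash.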

One thing I noticed is that the new files are created in the same directory as the original file, not at the current working directory (so ./logsplitter.rb ../log.log creates files in the .. directory).

If you want to avoid that, use File.basename to extract only the name of the file, without the directory, from the given path, and then build the path of the file to be created from that. I.e.:

File.open("#{ File.basename(file) }.#{k}", "a+") do |f|

Speaking of this line: I don't see why you use "a+" instead of just "a" as the opening mode - you never read from it.
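As a rough sketch of both points, the following builds the output name with File.basename and opens it in "a" mode. The path and key are hypothetical values, and a temporary directory stands in for the current working directory so nothing is written next to your real files:

```ruby
require "tmpdir"

file = "../log.log"        # hypothetical path as given on the command line
k = "10.64.69.219"         # hypothetical key from split_file

# basename drops the "../" part, so the name is relative to the working directory.
out_name = "#{File.basename(file)}.#{k}"

written = nil
Dir.mktmpdir do |dir|
  Dir.chdir(dir) do
    # "a" is append-only; "a+" would additionally allow reading, which is not needed here.
    File.open(out_name, "a") { |f| f.print "line one\n" }
    File.open(out_name, "a") { |f| f.print "line two\n" }
    written = File.read(out_name)
  end
end

puts out_name              # log.log.10.64.69.219
puts written
```

Because the file is opened for appending, running the script twice against the same log will add the lines twice; whether that is desirable depends on how you intend to use it.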

Question 8

Thanks sepp2k, great great tips! I thought I remembered something about default blocks in hashes.. there we go :) Thanks again, accepting.

Question 9

Oh, one mistake however. Using the square brackets that way does something different. The following:

match = l[pattern]
split_file[match[1]] << l if match

must be replaced with:

match = l[pattern, 1]
split_file[match] << l if match

Or using Regexp.match(str) instead...
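The difference between these forms can be checked with a short sketch (the sample line is a shortened, hypothetical one):

```ruby
pattern = /logger: ([^\s]*)/
# Hypothetical, shortened sample line.
l = "Mar 06 22:00:00 [10.251.132.246] logger: 10.64.69.219 - - sample"

whole   = l[pattern]            # a String: the entire matched text
wrong   = whole[1]              # indexing that String gives a single character
capture = l[pattern, 1]         # a String: capture group 1 only
via_str = l.match(pattern)[1]   # String#match returns MatchData; [1] is group 1
via_re  = pattern.match(l)[1]   # Regexp#match, same result

puts whole     # logger: 10.64.69.219
puts wrong     # o
puts capture   # 10.64.69.219
puts via_str   # 10.64.69.219
puts via_re    # 10.64.69.219
```

So l[pattern] followed by [1] silently yields a one-character key instead of the IP address, which is why the two-argument form or an explicit match call is needed.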

Question 10

@Dale: True. I somehow did not notice that you were using [] and not match.

sepp2k · Accepted Answer · 2011-03-08 15:15:41Z

