[NTLUG:Discuss] Short script question

Fred James fredjame at fredjame.cnc.net
Tue May 4 18:56:20 CDT 2010


Bobby Wrenn wrote:
(omissions for brevity)
> Pretty close. Only what I want is a new file with all the lines where 
> the regex matches more than one line. I don't want to remove 
> duplicates I want a list of line where the first part of the line is 
> duplicated in more than one line.
>
> Regards,
> Bobby
Bobby Wrenn
Let me spin this one more time, but a little differently ...
    (1) copy the original_file to copy_file
    (2) use sed (or whatever you like) to modify copy_file - keeping 
every line but chipping off the trailing '.*' part of each, so only the 
leading pattern is left
... Now follow this thought ... if the pattern in line 1 of copy_file 
also matches lines 10 and 2000 in original_file, then the pattern in line 
10 of copy_file will also match line 2000 in original_file, and at 
minimum you could wind up with ...
    line 1 once
    line 10 twice
    line 2000 twice
... which I doubt is optimal(?) ... so ...
    (3) sort copy_file > copy_file_0
    (4) uniq copy_file_0 > copy_file
... and then, in a loop (such as in a sh/bash shell script) ...
    (5) for each line in copy_file
    (5.1) grep line original_file > temporary_file
    (5.2) if the line count of temporary_file (wc -l) > 1
    (5.3) then cat temporary_file >> duplicates_file
... once that is complete ...
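    (6) grep -vxFf duplicates_file original_file > new_file
        (-x and -F so the lines in duplicates_file are matched as whole 
literal lines rather than being read as regexes)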
... the result(s) ...
       duplicates_file contains all the lines that matched any of the 
regexes where more than one match was found
       new_file is original_file with all the lines in duplicates_file 
removed
... and you still have original_file in case you have to back up and try 
again
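
For what it is worth, the steps above could be strung together roughly 
like this. It is only a sketch: I am assuming the "first part" of each 
line is its first whitespace-delimited field and that it contains no 
regex special characters, so change the sed and grep expressions to suit 
your real pattern. The function name find_duplicates is made up for 
illustration; the file names are the ones used in the steps.

```shell
#!/bin/sh
# Sketch of steps (1)-(6). Assumption: the key is the first
# whitespace-delimited field and holds no regex metacharacters.
find_duplicates() {
    original_file=$1
    duplicates_file=$2
    new_file=$3
    copy_file=$(mktemp) || return 1
    temporary_file=$(mktemp) || return 1

    # (1)-(4): chip off everything after the key, then sort and dedupe.
    sed 's/[[:space:]].*$//' "$original_file" | sort -u > "$copy_file"

    : > "$duplicates_file"
    # (5): for each key, collect the lines that start with it.
    while IFS= read -r key; do
        [ -n "$key" ] || continue            # skip keys from blank lines
        grep -E "^${key}([[:space:]]|\$)" "$original_file" > "$temporary_file"
        # (5.2)-(5.3): keep them only if more than one line matched.
        if [ $(wc -l < "$temporary_file") -gt 1 ]; then
            cat "$temporary_file" >> "$duplicates_file"
        fi
    done < "$copy_file"

    # (6): original_file minus every whole line listed in duplicates_file.
    # grep exits 1 when nothing is left over, which is not an error here.
    grep -vxFf "$duplicates_file" "$original_file" > "$new_file" || :

    rm -f "$copy_file" "$temporary_file"
}
```

Call it as: find_duplicates original_file duplicates_file new_file
original_file itself is only read, never modified.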

More clear mud?  Anywhere near what you are looking for?
Regards
Fred James



