[NTLUG:Discuss] Short script question

Tue May 4 14:31:39 CDT 2010

Fred James wrote:
> Bobby Wrenn wrote:
>> Bobby Wrenn wrote:
>>> I know this will be trivial to someone who deals with this sort of 
>>> thing every day. However, I do not fall into that category.
>>>
>>> I have been looking on the web for pointers on doing this and have 
>>> come up dry. Usually you want to delete duplicate lines. But I need 
>>> to do the opposite. I need to find lines in a tab delimited file 
>>> which are partial matches and save the matches to a new file 
>>> something like this:
>>>
>>> read a line into a buffer 1
>>> find another line that matches the regex of the line in buffer 1 put 
>>> it in buffer 2
>>> find another line that matches the regex of the line in buffer 1 put 
>>> it in buffer 3
>>> repeat to end of file
>>> append all the buffered lines to another file
>>> clear the buffer
>>> go to the next line and do it again until the end of the file
>>>
>>> The file is tab delimited, and the regex will capture the first word, 
>>> the first tab, the next word, a space, and the first three 
>>> characters/digits of the following word as the search criteria. The 
>>> rest of the line can be any characters. The part to match is 
>>> everything up to and including the first three characters of the 
>>> second word after the first tab.
>>>
>>> Can someone point me in the right direction? Perhaps an online 
>>> tutorial that might cover something like this. I've looked at sed 
>>> and awk but all the examples I can find expect that you want to 
>>> remove duplicates.
>>>
>>> Thanks in advance
>>> Bobby Wrenn
>> Starting to answer my own question. I have the regex that will select 
>> the line
>> ^([A-Z0-9]+\t)([A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
>> So I can search for a match to \1, but then I have to copy the rest of 
>> the line that does not match \2, then append both lines to a file, and 
>> repeat.
> Bobby Wrenn
> 'grep' should do what you want in terms of writing all (complete 
> lines) wherein a match is found ... so ... maybe you could ...
>    (1) read the part(s) of the lines in the original file that you 
> want to match into a "pattern_file"
>    (2) use grep with the -f option to use the pattern_file, and maybe 
> the -n to get line numbers as well
> ???
> Hope this helps - or did I miss the point altogether?
> Regards
> Fred James
>
I'm not sure I am explaining it well.
^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
Everything within the parentheses is what I want to match. The .* 
selects the rest of the line.
What I am trying to do is cull from a list of 53K records all those 
records that have the same data in the first part of the line, then 
append every line whose first part matches another line's (two or 
more lines per match) to another file.
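
One way to approach this, as a rough sketch rather than a tested solution for the real 53K-record file: read the file once, use the parenthesized part of the regex as a key, collect lines under each key, and keep only the keys that occur more than once. The function names (`group_by_key`, `shared_key_lines`, `write_matches`) and the file paths are made up for illustration. The sketch assumes upper-case letters and digits only, as in the regex above, and it writes `[A-Z0-9]` because a `|` inside square brackets is a literal pipe character, not "or".

```python
import re
from collections import defaultdict

# Key = first word, a tab, the next word, a space, then three characters.
# Assumption: fields contain only upper-case letters and digits.
KEY_RE = re.compile(r"^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9]{3})")


def group_by_key(lines):
    """Map each key prefix to the list of full lines that start with it."""
    groups = defaultdict(list)
    for line in lines:
        match = KEY_RE.match(line)
        if match:  # lines that do not fit the pattern are skipped
            groups[match.group(1)].append(line)
    return groups


def shared_key_lines(lines):
    """Return every line whose key prefix appears on two or more lines."""
    result = []
    for members in group_by_key(lines).values():
        if len(members) > 1:
            result.extend(members)
    return result


def write_matches(in_path, out_path):
    """Hypothetical helper: read in_path, append matching lines to out_path."""
    with open(in_path) as src, open(out_path, "a") as dst:
        dst.writelines(shared_key_lines(src))
```

Matching lines come out grouped together by key, in the order each key first appears in the input. A single pass with a dictionary avoids rescanning the file for every line, which should matter at 53K records.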