[NTLUG:Discuss] Short script question
Bobby Wrenn
bobby at wrennest.com
Tue May 4 18:12:07 CDT 2010
Fred James wrote:
> Bobby Wrenn wrote:
>> Fred James wrote:
>>> Bobby Wrenn wrote:
>>>> Bobby Wrenn wrote: (omissions for brevity)
>>>>> read a line into buffer 1
>>>>> find another line that matches the regex of the line in buffer 1
>>>>> put it in buffer 2
>>>>> find another line that matches the regex of the line in buffer 1
>>>>> put it in buffer 3
>>>>> recurse to the end of the file
>>>>> append all the buffered lines to another file
>>>>> clear the buffer
>>>>> go to the next line and do it again until the end of the file
>>>>>
>>>>> The file is tab-delimited, and the regex will take the first word,
>>>>> the first tab, the next word, a space, and the first three
>>>>> characters/numbers of the following word as the search criteria. The
>>>>> rest of the line can be any characters. The part to match will be
>>>>> everything up to the first three characters of the second word
>>>>> after the first tab.
>>>>>
>>>>> Can someone point me in the right direction? Perhaps an online
>>>>> tutorial that covers something like this. I've looked at sed
>>>>> and awk, but all the examples I can find assume that you want to
>>>>> remove duplicates.
>>>>>
>>>>> Thanks in advance
>>>>> Bobby Wrenn
>>>> Starting to answer my own question. I have the regex that will
>>>> select the line:
>>>> ^([A-Z0-9]+\t)([A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
>>>> So I can search for a match to \1, but then I have to copy the rest
>>>> of the line that does not match \2, then append both lines to a
>>>> file, and recurse.
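A literal, untested sketch of that buffer-and-scan idea in plain shell, using
the cleaned-up regex above and made-up file names (original_file,
matched_lines), might look like the following. It rescans the whole file once
per input line, so it will crawl on 53K records, but it mirrors the steps
exactly; it also assumes GNU sed for the \t escape.

    # untested sketch of the buffered scan described above
    while IFS= read -r line; do
        # leading part of this line: first field, tab, second word, space, 3 chars
        key=$(printf '%s\n' "$line" | sed -E 's/^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9]{3}).*/\1/')
        # count the lines that begin with that same leading part ...
        hits=$(awk -v k="$key" 'index($0, k) == 1' original_file | wc -l)
        # ... and keep the whole group when there is more than one
        if [ "$hits" -gt 1 ]; then
            awk -v k="$key" 'index($0, k) == 1' original_file >> matched_lines
        fi
    done < original_file
    sort -u matched_lines > matched_lines.uniq   # each group is appended once per member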
>>> Bobby Wrenn
>>> 'grep' should do what you want in terms of writing out all (complete)
>>> lines wherein a match is found ... so ... maybe you could ...
>>> (1) read the part(s) of the lines in the original file that you
>>> want to match into a "pattern_file"
>>> (2) use grep with the -f option to use the pattern_file, and
>>> maybe the -n to get line numbers as well
>>> ???
>>> Hope this helps - or did I miss the point altogether?
>>> Regards
>>> Fred James
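For what it is worth, that grep -f idea would look roughly like this (made-up
name for the input file); -f reads one pattern per line from pattern_file, and
-n prefixes each hit with its line number:

    # pattern_file holds the leading parts to look for, one per line
    grep -nf pattern_file original_file > matched_lines

If the leading parts could ever turn up later in a line, anchoring each pattern
with a leading ^ would keep the matches honest.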
>> I'm not sure I am explaining it well.
>> ^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
>> Everything within the parentheses is what I want to match. The .*
>> selects the rest of the line.
>> What I am trying to do is cull, from a list of 53K records, all those
>> records which have the same data in the first part of the line, then
>> output/append both (or multiple) lines where the first part matches to
>> another file.
> Bobby Wrenn
> What I am picking up on this time around - sorry if I missed it the
> first time - is "What I am trying to do is cull from a list of 53K
> records all those records which have the same data in the first part
> of the line then output/append both or multiple lines where the first
> part matches to another file." I believe I would use AWK for that,
> but several languages could do the same thing ... something like this ...
> starting with n = 1, increment n with each iteration
>     read line n
>     starting with m = n + 1, increment m with each iteration
>         read line m
>         if regex(m) = regex(n)
>             if m = n + 1
>                 append line n to new_file
>             append line m to new_file
>
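An untested awk rendering of that pairwise scan (holding the whole file in
memory, comparing leading parts with the same regex as above, and writing line
n on its first match rather than only when the match is the very next line)
might be:

    # pairwise scan sketch -- O(n^2), so slow on 53K lines, but literal
    awk '
        {
            line[NR] = $0
            # leading part: first field, tab, second word, space, three characters
            if (match($0, /^[A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]/))
                key[NR] = substr($0, 1, RLENGTH)
        }
        END {
            for (n = 1; n <= NR; n++) {
                printed_n = 0
                for (m = n + 1; m <= NR; m++)
                    if (key[n] != "" && key[m] == key[n]) {
                        if (!printed_n) { print line[n] >> "new_file"; printed_n = 1 }
                        print line[m] >> "new_file"
                    }
            }
        }
    ' original_file
    # a group of three or more lines gets written more than once;
    # sort -u new_file would clean that up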
> Once new_file is complete, use grep -vf to get those lines out of the
> original file ... something like ...
> grep -vf new_file original_file > temporary_file
> ... then rename original_file to hold_file, and rename temporary_file
> to original_file, and test your result ... if something failed, rename
> hold_file to original_file and you are back at square 1
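Spelled out as commands, using the same file names, that recipe would be:

    grep -vf new_file original_file > temporary_file   # drop the matched lines
    mv original_file hold_file                          # keep a fallback copy
    mv temporary_file original_file
    # if something went wrong: mv hold_file original_file

Since new_file holds whole data lines, grep's -F (fixed strings) switch might
be worth adding in case the data ever contains regex characters.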
> OK? Or have I missed it again?
>
> Somehow I think all I may have done is restate your original message,
> which would mean that I may not have answered your question?
> Regards
> Fred James
>
>
Pretty close. Only what I want is a new file with all the lines where
the regex matches more than one line. I don't want to remove duplicates;
I want a list of the lines where the first part of the line is duplicated
in more than one line.
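An untested two-pass gawk sketch of exactly that - print every line whose
leading part shows up on more than one line - might look something like this
(file names made up):

    # pass 1 counts each leading part; pass 2 prints the lines whose
    # leading part was seen more than once
    awk '
        {
            if (match($0, /^[A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]/))
                k = substr($0, 1, RLENGTH)
            else
                k = $0    # a line that does not fit the pattern only matches itself
            if (NR == FNR) { count[k]++; next }
            if (count[k] > 1) print
        }
    ' original_file original_file > duplicated_lines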
Regards,
Bobby