[NTLUG:Discuss] Short script question

Tue May 4 23:19:58 CDT 2010

Bobby Wrenn wrote:
> Fred James wrote:
>> Bobby Wrenn wrote:
>>> Fred James wrote:
>>>> Bobby Wrenn wrote:
>>>>> Bobby Wrenn wrote: (omissions for brevity)
>>>>>> read a line into a buffer 1
>>>>>> find another line that matches the regex of the line in buffer 1 
>>>>>> put it in buffer 2
>>>>>> find another line that matches the regex of the line in buffer 1 
>>>>>> put it in buffer 3
>>>>>> repeat to end of file
>>>>>> append all the buffered lines to another file
>>>>>> clear the buffer
>>>>>> go to the next line and do it again until the end of the file
>>>>>>
>>>>>> The file is tab-delimited. The regex takes the first word, the 
>>>>>> first tab, the next word, a space, and the first three 
>>>>>> characters/digits of the following word as the search criteria. 
>>>>>> The rest of the line can be any characters. The part to match is 
>>>>>> everything up to and including the first three characters of the 
>>>>>> second word after the first tab.
>>>>>>
>>>>>> Can someone point me in the right direction? Perhaps an online 
>>>>>> tutorial that might cover something like this. I've looked at sed 
>>>>>> and awk but all the examples I can find expect that you want to 
>>>>>> remove duplicates.
>>>>>>
>>>>>> Thanks in advance
>>>>>> Bobby Wrenn
>>>>> Starting to answer my own question. I have the regex that will 
>>>>> select the line
>>>>> ^([A-Z|0-9]+\t)([A-Z|0-9]+ [A-Z|0-9][A-Z|0-9][A-Z|0-9]).*
>>>>> So I can search for a match to \1 but then I have to copy the rest 
>>>>> of the line that does not match \2 then append both lines to a 
>>>>> file, and repeat.
>>>> Bobby Wrenn
>>>> 'grep' should do what you want in terms of writing all (complete 
>>>> lines) wherein a match is found ... so ... maybe you could ...
>>>>    (1) read the part(s) of the lines in the original file that you 
>>>> want to match into a "pattern_file"
>>>>    (2) use grep with the -f option to use the pattern_file, and 
>>>> maybe the -n to get line numbers as well
>>>> ???
>>>> Hope this helps - or did I miss the point altogether?
>>>> Regards
>>>> Fred James
>>> I'm not sure I am explaining it well.
>>> ^([A-Z|0-9]+\t[A-Z|0-9]+ [A-Z|0-9][A-Z|0-9][A-Z|0-9]).*
>>> Everything within the parentheses is what I want to match. The .* 
>>> selects the rest of the line.
>>> What I am trying to do is cull from a list of 53K records all those 
>>> records which have the same data in the first part of the line then 
>>> output/append both or multiple lines where the first part matches to 
>>> another file.
>> Bobby Wrenn
>> What I am picking up on this time around - sorry if I missed it the 
>> first time - is "What I am trying to do is cull from a list of 53K 
>> records all those records which have the same data in the first part 
>> of the line then output/append both or multiple lines where the first 
>> part matches to another file."  I believe I would use AWK for that, 
>> but several languages could do the same thing ... something like this 
>> ...
>>    starting with n = 1, increment n with each iteration
>>    read line n
>>       starting with m = n + 1, increment m with each iteration
>>       read line m
>>          if regex(m) = regex(n)
>>             if this is the first match found for line n
>>                append line n to new_file
>>             append line m to new_file
>>
>> Once new_file is complete, use grep -vf to get those lines out of the 
>> original file ... something like ...
>>    grep -vf new_file original_file > temporary_file
>> ... then rename original_file to hold_file, and rename temporary_file 
>> to original_file, and test your result ... if something failed, 
>> rename hold_file to original_file and you are back at square 1
>> OK?  Or have I missed it again?
>>
>> Somehow I think all I may have done is restate your original message, 
>> which would mean that I may not have answered your question?
>> Regards
>> Fred James
>>
>>
>> _______________________________________________
>> http://www.ntlug.org/mailman/listinfo/discuss
>>
>>
> Pretty close. What I want, though, is a new file with all the lines 
> where the regex matches more than one line. I don't want to remove 
> duplicates; I want a list of lines where the first part of the line 
> is duplicated in more than one line.
>
> Regards,
> Bobby
Haven't seen anyone suggest sort as a first step; it makes the script 
simpler.  Read and output lines to a given file while the match part is 
equal.  On a change of match part, close the current file and change to 
a new one.  Continue until EOF.
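
A rough sketch of the sort-first idea, in Python rather than awk, in case it helps. It is only an illustration under some assumptions: the names match_part and duplicated_lines are made up for this example, the regex is Bobby's pattern with the "|" dropped from the character classes (inside [...] it would match a literal pipe), and instead of one output file per group it collects every line whose match part occurs more than once, which seems to be what was asked for.

```python
import re
import sys
from itertools import groupby

# Bobby's pattern from the thread, minus the "|" inside the brackets
# (assumption: a literal pipe was never meant to be part of the key).
KEY_RE = re.compile(r"^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9]{3})")


def match_part(line):
    """Return the leading part of the line used for comparison, or None."""
    m = KEY_RE.match(line)
    return m.group(1) if m else None


def duplicated_lines(lines):
    """Return every line whose match part occurs on more than one line."""
    keyed = [(match_part(line), line) for line in lines]
    # Lines that do not fit the pattern at all are skipped.
    keyed = [(key, line) for key, line in keyed if key is not None]
    # Sort on the match part only; the sort is stable, so lines with the
    # same match part keep their original relative order.
    keyed.sort(key=lambda pair: pair[0])
    out = []
    for _, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [line for _, line in group]
        if len(members) > 1:  # keep only match parts seen more than once
            out.extend(members)
    return out


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python dups.py original_file new_file
    with open(sys.argv[1]) as src, open(sys.argv[2], "a") as dst:
        dst.writelines(duplicated_lines(src))
```

Sorting puts lines with an equal match part next to each other, so one pass is enough; 53K records should be no trouble to hold in memory.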