[NTLUG:Discuss] Short script question

Fred James fredjame at fredjame.cnc.net
Tue May 4 16:42:51 CDT 2010


Bobby Wrenn wrote:
> Fred James wrote:
>> Bobby Wrenn wrote:
>>> Bobby Wrenn wrote: 
>>> (omissions for brevity)
>>>> read a line into a buffer 1
>>>> find another line that matches the regex of the line in buffer 1 
>>>> put it in buffer 2
>>>> find another line that matches the regex of the line in buffer 1 
>>>> put it in buffer 3
>>>> repeat to end of file
>>>> append all the buffered lines to another file
>>>> clear the buffer
>>>> go to the next line and do it again until the end of the file
>>>>
>>>> The file is tab delimited, and the regex will get the first word, the
>>>> first tab, the next word, a space, and the first three
>>>> characters/numbers of the next word as the search criteria. The rest
>>>> of the line will be any character. The part to match will be
>>>> everything up to and including the first three characters of the
>>>> second word after the first tab.
>>>>
>>>> Can someone point me in the right direction? Perhaps an on line 
>>>> tutorial that might cover something like this. I've looked at sed 
>>>> and awk but all the examples I can find expect that you want to 
>>>> remove duplicates.
>>>>
>>>> Thanks in advance
>>>> Bobby Wrenn
>>> Starting to answer my own question. I have the regex that will 
>>> select the line
>>> ^([A-Z0-9]+\t)([A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
>>> So I can search for a match to \1 but then I have to copy the rest 
>>> of the line that does not match \2 then append both lines to a file, 
>>> and repeat.
>> Bobby Wrenn
>> 'grep' should do what you want in terms of writing all (complete 
>> lines) wherein a match is found ... so ... maybe you could ...
>>    (1) read the part(s) of the lines in the original file that you 
>> want to match into a "pattern_file"
>>    (2) use grep with the -f option to use the pattern_file, and maybe 
>> the -n to get line numbers as well
>> ???
>> Hope this helps - or did I miss the point altogether?
>> Regards
>> Fred James
> I'm not sure I am explaining it well.
> ^([A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]).*
> Everything within the parentheses is what I want to match. The .* 
> selects the rest of the line.
> What I am trying to do is cull from a list of 53K records all those 
> records which have the same data in the first part of the line then 
> output/append both or multiple lines where the first part matches to 
> another file.
Bobby Wrenn
What I am picking up on this time around - sorry if I missed it the 
first time - is "What I am trying to do is cull from a list of 53K 
records all those records which have the same data in the first part of 
the line then output/append both or multiple lines where the first part 
matches to another file."  I believe I would use AWK for that, but 
several languages could do the same thing ... something like this ...
    starting with n = 1, increment n with each iteration
    read line n
       skip line n if it is already in new_file
       starting with m = n + 1, increment m with each iteration
       read line m
          if regex(m) = regex(n)
             if this is the first match found for line n
                append line n to new_file
             append line m to new_file
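
The nested loop above compares every line against every later line, which
gets slow on 53K records. A rough sketch of the same idea in awk, done by
counting keys instead of comparing pairs, might look like the following.
The function name dup_lines and the variable key_re are made up for
illustration, and the key pattern is only my reading of the description
(word, tab, word, space, three characters), so adjust it to the real data.

```shell
#!/bin/sh
# Sketch only: dup_lines and key_re are illustrative names, not from the thread.

# Print every line of file $1 whose leading key also starts some other line.
dup_lines() {
    awk '
    BEGIN {
        # Assumed key: word, tab, word, space, first three characters
        # of the following word.
        key_re = "^[A-Z0-9]+\t[A-Z0-9]+ [A-Z0-9][A-Z0-9][A-Z0-9]"
    }
    {
        # Remember each line, and count how often its key occurs.
        line[NR] = $0
        if (match($0, key_re)) {
            key[NR] = substr($0, 1, RLENGTH)
            count[key[NR]]++
        }
    }
    END {
        # Emit only the lines whose key was seen more than once.
        for (i = 1; i <= NR; i++)
            if ((i in key) && count[key[i]] > 1)
                print line[i]
    }' "$1"
}
```

Something like "dup_lines original_file > new_file" should then produce
the file of matching records. The lines come out in their original file
order rather than grouped by key, and the whole file is held in memory,
which should be no problem at this size.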

Once new_file is complete, use grep -vf to get those lines out of the 
original file ... something like ...
    grep -vf new_file original_file > temporary_file
... then rename original_file to hold_file, and rename temporary_file to 
original_file, and test your result ... if something failed, rename 
hold_file to original_file and you are back at square 1
OK?  Or have I missed it again?
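
For what it is worth, the remove-and-rename step might be wrapped up
roughly as below. This is only a sketch: remove_dups is a made-up name,
and the .tmp and .hold suffixes are my own choice. It also adds -x and -F
to the grep, on the assumption that the lines in new_file should be
matched literally and as whole lines; with plain -f each line is treated
as a regular expression and may match part of a longer line.

```shell
#!/bin/sh
# Sketch only: remove_dups is an illustrative name, not from the thread.

# Remove from file $2 every line listed in file $1.
# The untouched original is kept as $2.hold.
remove_dups() {
    new_file=$1
    original_file=$2
    temporary_file="$original_file.tmp"
    hold_file="$original_file.hold"

    # -v invert the match, -x whole lines only, -F fixed strings,
    # -f read the patterns from new_file.
    grep -vxFf "$new_file" "$original_file" > "$temporary_file"
    # grep exits 1 when nothing is left to print; only 2 or more is an error.
    [ $? -le 1 ] || return 1

    mv "$original_file" "$hold_file" &&
    mv "$temporary_file" "$original_file"
}
```

Called as "remove_dups new_file original_file". If the result looks
wrong, "mv original_file.hold original_file" puts things back.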

Somehow I think all I may have done is restate your original message, 
which would mean that I may not have answered your question?
Regards
Fred James



