[NTLUG:Discuss] Discuss Digest, Vol 55, Issue 2

xianfeng gao gaoxianfeng at msn.com
Wed Jul 4 08:51:18 CDT 2007


edit bash file: maillist.sh

#!/bin/sh
# $1.1 collects the bare addresses (lines without "<").
# $1.2 starts as the bracketed (named) lines; each bare address is then
# filtered out of it, so named duplicates of a bare address are dropped.
grep -v "<" "$1" > "$1.1"
grep "<" "$1" > "$1.2"
while read addr
do
        echo "$addr"
        grep -v "$addr" "$1.2" > "$1.2.tmp"
        mv "$1.2.tmp" "$1.2"
done < "$1.1"
cat "$1.1" "$1.2" > "$1.complete"

then chmod +x maillist.sh
./maillist.sh your_maillist.file
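A quick check of the intended behaviour on a small sample (hypothetical addresses, with the filtering logic inlined):

```shell
# Sample list: one bare address, a named duplicate of it, and a new named address.
cat > sample.txt <<'EOF'
jsmith@abc.org
"John Smith" <jsmith@abc.org>
"Ann Lee" <alee@xyz.net>
EOF

grep -v "<" sample.txt > sample.txt.1      # bare addresses
grep "<" sample.txt > sample.txt.2         # bracketed (named) lines
while read addr
do
    # Drop bracketed lines that repeat an address we already have bare.
    grep -v "$addr" sample.txt.2 > sample.txt.2.tmp
    mv sample.txt.2.tmp sample.txt.2
done < sample.txt.1
cat sample.txt.1 sample.txt.2 > sample.txt.complete
cat sample.txt.complete
```

The run should leave two lines in sample.txt.complete: the bare jsmith@abc.org and the "Ann Lee" entry.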

>From: discuss-request at ntlug.org
>Reply-To: discuss at ntlug.org
>To: discuss at ntlug.org
>Subject: Discuss Digest, Vol 55, Issue 2
>Date: Tue, 03 Jul 2007 09:16:12 -0500
>
>Send Discuss mailing list submissions to
>	discuss at ntlug.org
>
>To subscribe or unsubscribe via the World Wide Web, visit
>	http://www.ntlug.org/mailman/listinfo/discuss
>or, via email, send a message with subject or body 'help' to
>	discuss-request at ntlug.org
>
>You can reach the person managing the list at
>	discuss-owner at ntlug.org
>
>When replying, please edit your Subject line so it is more specific
>than "Re: Contents of Discuss digest..."
>
>
>Today's Topics:
>
>    1. eliminating lines with the same information (Lance Simmons)
>    2. Re: eliminating lines with the same information (Stuart Johnston)
>    3. Re: eliminating lines with the same information (Kenneth Loafman)
>    4. Re: eliminating lines with the same information (Fred James)
>    5. Re: eliminating lines with the same information (Carl Haddick)
>    6. Re: eliminating lines with the same information (Chris Cox)
>    7. Re: eliminating lines with the same information (Wayne Walker)
>    8. Re: eliminating lines with the same information (Stuart Johnston)
>    9. Re: eliminating lines with the same information (Wayne Walker)
>   10. All non-US IP list? (. Daniel)
>   11. Re: All non-US IP list? (Kenneth Loafman)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Mon, 2 Jul 2007 12:55:33 -0500
>From: "Lance Simmons" <simmons.lance at gmail.com>
>Subject: [NTLUG:Discuss] eliminating lines with the same information
>To: "NTLUG Discussion List" <discuss at ntlug.org>
>Message-ID:
>	<5149823c0707021055u1013c8f0h14b7fac6ccdd7638 at mail.gmail.com>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>I have a text file with several thousand email addresses, many of
>which are duplicates. I've used "sort" and "uniq" to make the list
>smaller, but there are still almost a thousand left.
>
>But I still have many duplicates.  For example, three lines in the file 
>might be
>
>   jsmith at abc.org
>   "John Smith" <jsmith at abc.org>
>   "Mr. John Smith" <jsmith at abc.org>
>
>Obviously, I'd like to get rid of two of those lines without having to
>manually go through and decide which to keep.  And I don't care about
>keeping names, I'm only interested in addresses.
>
>Also, the duplicates are not all on lines near each other, so even if
>I wanted to do it manually, it would be a huge hassle.
>
>Any suggestions?
>
>--
>Lance Simmons
>
>
>
>------------------------------
>
>Message: 2
>Date: Mon, 02 Jul 2007 13:14:44 -0500
>From: Stuart Johnston <saj at thecommune.net>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <46894094.4070805 at thecommune.net>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Extract just the addresses then re-sort|uniq.  I'd use Perl with
>Email::Address from CPAN.
>
>perl -n -MEmail::Address -e '($a) = Email::Address->parse($_); print
>$a->address, "\n";' < emails | sort | uniq
>
>
>Lance Simmons wrote:
> > I have a text file with several thousand email addresses, many of
> > which are duplicates. I've used "sort" and "uniq" to make the list
> > smaller, but there are still almost a thousand..
> >
> > But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >   jsmith at abc.org
> >   "John Smith" <jsmith at abc.org>
> >   "Mr. John Smith" <jsmith at abc.org>
> >
> > Obviously, I'd like to get rid of two of those lines without having to
> > manually go through and decide which to keep.  And I don't care about
> > keeping names, I'm only interested in addresses.
> >
> > Also, the duplicates are not all on lines near each other, so even if
> > I wanted to do it manually, it would be a huge hassle.
> >
> > Any suggestions?
> >
>
>
>
>------------------------------
>
>Message: 3
>Date: Mon, 02 Jul 2007 13:16:36 -0500
>From: Kenneth Loafman <kenneth at loafman.com>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <46894104.4010702 at loafman.com>
>Content-Type: text/plain; charset=ISO-8859-1
>
>Normalize the file to only email addresses like line 1 (sed or any regex
>capable editor should do the trick), then use:
>
>   $ sort < input | uniq -u > output
>
>to generate the list.
>
>...Ken
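One caveat on the command above: `uniq -u` prints only lines that occur exactly once, so every address that had a duplicate disappears entirely. For a deduplicated list, plain `uniq` (or `sort -u`) is likely what is intended:

```shell
printf 'a\na\nb\n' | sort | uniq -u   # prints just "b": both copies of "a" are dropped
printf 'a\na\nb\n' | sort | uniq      # prints "a" and "b": one copy of each line kept
```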
>
>Lance Simmons wrote:
> > I have a text file with several thousand email addresses, many of
> > which are duplicates. I've used "sort" and "uniq" to make the list
> > smaller, but there are still almost a thousand..
> >
> > But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >   jsmith at abc.org
> >   "John Smith" <jsmith at abc.org>
> >   "Mr. John Smith" <jsmith at abc.org>
> >
> > Obviously, I'd like to get rid of two of those lines without having to
> > manually go through and decide which to keep.  And I don't care about
> > keeping names, I'm only interested in addresses.
> >
> > Also, the duplicates are not all on lines near each other, so even if
> > I wanted to do it manually, it would be a huge hassle.
> >
> > Any suggestions?
> >
>
>
>
>------------------------------
>
>Message: 4
>Date: Mon, 02 Jul 2007 13:42:14 -0500
>From: Fred James <fredjame at fredjame.cnc.net>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <46894706.401 at fredjame.cnc.net>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Lance Simmons wrote:
>
> >I have a text file with several thousand email addresses, many of
> >which are duplicates. I've used "sort" and "uniq" to make the list
> >smaller, but there are still almost a thousand..
> >
> >But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >  jsmith at abc.org
> >  "John Smith" <jsmith at abc.org>
> >  "Mr. John Smith" <jsmith at abc.org>
> >
> >Obviously, I'd like to get rid of two of those lines without having to
> >manually go through and decide which to keep.  And I don't care about
> >keeping names, I'm only interested in addresses.
> >
> >Also, the duplicates are not all on lines near each other, so even if
> >I wanted to do it manually, it would be a huge hassle.
> >
> >Any suggestions?
> >
> >
> >
>Lance Simmons
>(1) the awk script below will extract only lines that contain a '@'
>character (as in email addresses)
>(2) the sed script below will then remove the '<' and '>' characters, if 
>any
>
>An example command line would be ...
>  ># gawk -f sandbox.awk inputfile | sed -f sandbox.sed
>
>An example of the output from your file would be ...
>jsmith at abc.org
>jsmith at abc.org
>jsmith at abc.org
>
>... and then of course sort and uniq would apply very nicely ...
>something like ...
>  ># gawk -f sandbox.awk inputfile | sed -f sandbox.sed | sort | uniq >
>outputfile
>... would do it for you, I should think.
>
>Hope this helps
>Regards
>Fred James
>
>[AWK script]
>{
>         for(i=1;i<=NF;i++) {
>                 if (index($i,"@")) {
>                         print $i
>                 } else {
>                         continue
>                 }
>         }
>}
>
>[SED script]
>s/<//g
>s/>//g
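The two scripts can also be combined into one pipeline without the intermediate files; a rough equivalent (awk body inlined, `tr` standing in for the two sed substitutions, sample addresses made up):

```shell
# Print every field containing "@", strip angle brackets, then dedupe.
printf '%s\n' 'jsmith@abc.org' '"John Smith" <jsmith@abc.org>' |
    awk '{ for (i = 1; i <= NF; i++) if (index($i, "@")) print $i }' |
    tr -d '<>' |
    sort | uniq
```

Both input lines reduce to jsmith@abc.org, so the pipeline emits it once.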
>
>
>
>
>
>------------------------------
>
>Message: 5
>Date: Mon, 2 Jul 2007 14:39:48 -0500
>From: Carl Haddick <sysmail at glade.net>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <20070702193948.GA4897 at glade.net>
>Content-Type: text/plain; charset=us-ascii
>
>Howdy from a lurker,
>
>I rarely post here, mostly due to time constraints, but I appreciate the
>chance to see the discussion.
>
>This is pretty much the same thing I did a few weeks ago.  Being
>somewhat of a nutcase for Python, here's what I used:
>
>#!/usr/bin/env python
>
># Python 3 version (the original relied on the long-removed `sets` module
># and the Python 2 `file()` builtin and print statement).
>import re, sys
>
>addrset = set()
>
>addrre = re.compile(r'([0-9a-z_\-\.]+@[0-9a-z_\-\.]+)', re.I)
>for l in open(sys.argv[1]):
>     m = addrre.search(l)
>     if m:
>         if m.group(1) not in addrset:
>             addrset.add(m.group(1))
>             print(l, end='')
>
>Save that in a file, and run that file with the argument being the name
>of your email list.  It will print out the lines represented by unique
>addresses.
>
>Note that it just finds the first email address on each line, and
>rejects lines with no email addresses at all.  You may need to modify it
>for your purposes, to say the least.
>
>Run it at your risk, of course.
>
>Regards,
>
>Carl
>
>On Mon, Jul 02, 2007 at 12:55:33PM -0500, Lance Simmons wrote:
> > I have a text file with several thousand email addresses, many of
> > which are duplicates. I've used "sort" and "uniq" to make the list
> > smaller, but there are still almost a thousand..
> >
> > But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >   jsmith at abc.org
> >   "John Smith" <jsmith at abc.org>
> >   "Mr. John Smith" <jsmith at abc.org>
> >
> > Obviously, I'd like to get rid of two of those lines without having to
> > manually go through and decide which to keep.  And I don't care about
> > keeping names, I'm only interested in addresses.
> >
> > Also, the duplicates are not all on lines near each other, so even if
> > I wanted to do it manually, it would be a huge hassle.
> >
> > Any suggestions?
> >
> > --
> > Lance Simmons
> >
> > _______________________________________________
> > http://www.ntlug.org/mailman/listinfo/discuss
>
>
>
>------------------------------
>
>Message: 6
>Date: Mon, 02 Jul 2007 15:05:35 -0500
>From: Chris Cox <cjcox at acm.org>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <46895A8F.1050800 at acm.org>
>Content-Type: text/plain; charset=ISO-8859-1
>
>Lance Simmons wrote:
> > I have a text file with several thousand email addresses, many of
> > which are duplicates. I've used "sort" and "uniq" to make the list
> > smaller, but there are still almost a thousand..
> >
> > But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >   jsmith at abc.org
> >   "John Smith" <jsmith at abc.org>
> >   "Mr. John Smith" <jsmith at abc.org>
> >
> > Obviously, I'd like to get rid of two of those lines without having to
> > manually go through and decide which to keep.  And I don't care about
> > keeping names, I'm only interested in addresses.
> >
> > Also, the duplicates are not all on lines near each other, so even if
> > I wanted to do it manually, it would be a huge hassle.
> >
> > Any suggestions?
> >
>
>Since everyone else has one to offer... here's my attempt:
>
>sed -e 's/[^<]*[<]\([^>][^>]*\).*/\1/' email.txt | sort -u
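Worth noting: lines with no angle brackets never match the pattern, so sed passes them through untouched, which is exactly why the bare-address form survives alongside the extracted ones. A quick check on the sample data:

```shell
# The first line has no "<", so it passes through; the second is reduced
# to the captured address between the brackets.
printf '%s\n' 'jsmith@abc.org' '"Mr. John Smith" <jsmith@abc.org>' |
    sed -e 's/[^<]*[<]\([^>][^>]*\).*/\1/' |
    sort -u
```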
>
>
>
>
>------------------------------
>
>Message: 7
>Date: Mon, 2 Jul 2007 17:05:27 -0500
>From: Wayne Walker <waynewalker at bybent.com>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <20070702220527.GC6732 at bybent.com>
>Content-Type: text/plain; charset=us-ascii
>
>cat file_with_names | sed -e 's/.*<//' -e 's/>.*//' | sort -u
>
>This will HOSE you if any of the addresses are in the valid but rarely used 
>form of:
>
>wwalker at bybent.com <Wayne Walker>
>
>If you have any of those, instead use:
>
>cat file_with_names | sed -e 's/^[^@]*<//' -e 's/>[^@]*$//' | sort -u
>
>Wayne
>
>On Mon, Jul 02, 2007 at 12:55:33PM -0500, Lance Simmons wrote:
> > I have a text file with several thousand email addresses, many of
> > which are duplicates. I've used "sort" and "uniq" to make the list
> > smaller, but there are still almost a thousand..
> >
> > But I still have many duplicates.  For example, three lines in the file 
>might be
> >
> >   jsmith at abc.org
> >   "John Smith" <jsmith at abc.org>
> >   "Mr. John Smith" <jsmith at abc.org>
> >
> > Obviously, I'd like to get rid of two of those lines without having to
> > manually go through and decide which to keep.  And I don't care about
> > keeping names, I'm only interested in addresses.
> >
> > Also, the duplicates are not all on lines near each other, so even if
> > I wanted to do it manually, it would be a huge hassle.
> >
> > Any suggestions?
> >
> > --
> > Lance Simmons
> >
> > _______________________________________________
> > http://www.ntlug.org/mailman/listinfo/discuss
>
>
>
>------------------------------
>
>Message: 8
>Date: Mon, 02 Jul 2007 17:09:08 -0500
>From: Stuart Johnston <saj at thecommune.net>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: Wayne Walker <wwalker at bybent.com>, 	NTLUG Discussion List
>	<discuss at ntlug.org>
>Message-ID: <46897784.8050809 at thecommune.net>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Woo Hoo!!  Wayne wins the coveted Useless Use of Cat Award!!  ;)
>
>http://partmaps.org/era/unix/award.html
>
>  > sed -e 's/.*<//' -e 's/>.*//' < file_with_names | sort -u
>
>Wayne Walker wrote:
> > cat file_with_names | sed -e 's/.*<//' -e 's/>.*//' | sort -u
> >
> > This will HOSE you if any of the addresses are in the valid but rarely 
>used form of :
> >
> > wwalker at bybent.com <Wayne Walker>
> >
> > If you have any of those, instead use:
> >
> > cat file_with_names | sed -e 's/^[^@]*<//' -e 's/>[^@]*$//' | sort -u
> >
> > Wayne
> >
> > On Mon, Jul 02, 2007 at 12:55:33PM -0500, Lance Simmons wrote:
> >> I have a text file with several thousand email addresses, many of
> >> which are duplicates. I've used "sort" and "uniq" to make the list
> >> smaller, but there are still almost a thousand..
> >>
> >> But I still have many duplicates.  For example, three lines in the file 
>might be
> >>
> >>   jsmith at abc.org
> >>   "John Smith" <jsmith at abc.org>
> >>   "Mr. John Smith" <jsmith at abc.org>
> >>
> >> Obviously, I'd like to get rid of two of those lines without having to
> >> manually go through and decide which to keep.  And I don't care about
> >> keeping names, I'm only interested in addresses.
> >>
> >> Also, the duplicates are not all on lines near each other, so even if
> >> I wanted to do it manually, it would be a huge hassle.
> >>
> >> Any suggestions?
> >>
> >> --
> >> Lance Simmons
> >>
> >> _______________________________________________
> >> http://www.ntlug.org/mailman/listinfo/discuss
> >
> > _______________________________________________
> > http://www.ntlug.org/mailman/listinfo/discuss
>
>
>
>------------------------------
>
>Message: 9
>Date: Mon, 2 Jul 2007 18:55:07 -0500
>From: Wayne Walker <waynewalker at bybent.com>
>Subject: Re: [NTLUG:Discuss] eliminating lines with the same
>	information
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <20070702235507.GB6934 at bybent.com>
>Content-Type: text/plain; charset=us-ascii
>
>On Mon, Jul 02, 2007 at 05:09:08PM -0500, Stuart Johnston wrote:
> > Woo Hoo!!  Wayne wins the coveted Useless Use of Cat Award!!  ;)
> >
> > http://partmaps.org/era/unix/award.html
> >
> >  > sed -e 's/.*<//' -e 's/>.*//' < file_with_names | sort -u
>
>And Stuart gets the useless use of < award...
>
>sed -e 's/.*<//' -e 's/>.*//' file_with_names | sort -u
>
>Wayne used cat at the front because in a tutorial pure left to right action 
>is clearer.
>   :-)
>	(damn IT Instructors trying to unobfuscate things)
> > Wayne Walker wrote:
> > > cat file_with_names | sed -e 's/.*<//' -e 's/>.*//' | sort -u
>
>
>Thanks for the pointer to that site though.  I personally hand out the 
>Useless use of Kill -9 award almost daily!!!!
>
>Wayne
>
>
>
>
>------------------------------
>
>Message: 10
>Date: Tue, 03 Jul 2007 08:55:12 -0500
>From: ". Daniel" <xdesign at hotmail.com>
>Subject: [NTLUG:Discuss] All non-US IP list?
>To: discuss at ntlug.org
>Message-ID: <BAY105-F2729B7B7171DA54367EF85A50C0 at phx.gbl>
>Content-Type: text/plain; charset=iso-2022-jp; format=flowed
>
>This is something of a follow-up on the previous discussion of blocking all
>Chinese and Korean IPs at the greylist filter.
>
>I have followed the advice of list members here suggesting that I use
>spamassassin and rank the values of emails from certain countries higher.
>And that has certainly helped in one regard: The email is trapped and
>scanned on my MailScanner machine.  But let me tell you, while that is
>certainly effective, it's not enough.
>
>Recently, I have been seeing emails coming from more countries than I can
>list in that particular set of rules.  Further, the sheer amount of email
>coming in and being processed is simply killing my server.  (Yes, I need a
>bigger server... maybe one day but not today.)  At some point, the box
>simply stops sending email on to my exchange server for reasons I have been
>unable to detect.  The sendmail queue just says "sending" and nothing is
>sent.  Rebooting the machine clears it up until the next time it gets
>congested like that.
>
>Previously someone wrote a little perl script for me to parse through some
>IP addresses for China and Korea in a way that is suitable for relaydelay.
>Obviously, this will help but isn't going to fix the larger problem.  Where
>before the majority of such traffic was coming from those two areas, now
>it's coming from all of Europe and South American countries.
>
>I've been googling for lists of non-US IP addresses and there is no
>shortage of discussion on the topic.  (A lot of people offering what a bad
>idea it is and all that but without stating WHY it's a bad idea... not
>offering a scenario where it could be bad.)  In my case, this is a business
>that does business exclusively in Texas and exclusively for schools.  There
>is absolutely no business reason for incoming mail from outside Texas, let
>alone outside of the U.S.
>
>If only I could get a list of non-US IP addresses, I would be a happier 
>man.
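For what it's worth, such lists are usually published as CIDR blocks, and testing whether a connecting address falls inside one is simple integer arithmetic. A sketch in shell, with a made-up block (10.0.0.0/8) purely for illustration:

```shell
# Dotted quad -> 32-bit integer.
ip_to_int() {
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( a * 16777216 + b * 65536 + c * 256 + d ))
}

# in_cidr IP NETWORK PREFIXLEN -> exit status 0 when IP is inside the block.
in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 10.1.2.3 10.0.0.0 8 && echo "inside"
in_cidr 192.0.2.1 10.0.0.0 8 || echo "outside"
```

A real deployment would loop over the published per-country blocks rather than a single hard-coded one.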
>
>
>
>
>
>------------------------------
>
>Message: 11
>Date: Tue, 03 Jul 2007 09:15:50 -0500
>From: Kenneth Loafman <kenneth at loafman.com>
>Subject: Re: [NTLUG:Discuss] All non-US IP list?
>To: NTLUG Discussion List <discuss at ntlug.org>
>Message-ID: <468A5A16.5030408 at loafman.com>
>Content-Type: text/plain; charset=ISO-2022-JP
>
>Try http://blackholes.us and you can find country lists, company lists,
>etc.  It works and I've used it successfully.
>
>...Ken
>
>. Daniel wrote:
> > This is something of a follow-up on the previous discussion of blocking 
>all
> > chinese and korean IPs at the greylist filter.
> >
> > I have followed the advice of list members here suggesting that I use
> > spamassassin and rank the values of emails from certain countries 
>higher.
> > And that has certainly helped in one regard: The email is trapped and
> > scanned on my MailScanner machine.  But let me tell you, while that is
> > certainly effective, it's not enough.
> >
> > Recently, I have been seeing emails coming from more countries than I 
>can
> > list in that particular set of rules.  Further, the sheer amount of 
>email
> > coming in and being processed is simply killing my server.  (Yes, I need 
>a
> > bigger server... maybe one day but not today.)  At some point, the box
> > simply stops sending email on to my exchange server for reasons I have 
>been
> > unable to detect.  The sendmail queue just says "sending" and nothing is
> > sent.  Rebooting the machine clears it up until the next time it gets
> > congested like that.
> >
> > Previously someone wrote a little perl script for me to parse through 
>some
> > IP addresses for china and korea in a way that is suitable for 
>relaydelay.
> > Obviously, this will help but isn't going to fix the larger problem.  
>Where
> > before the majority of such traffic was coming from those two areas, now
> > it's coming from all of Europe and South American countries.
> >
> > I've been googling for lists of non-US IP addresses and there is no
> > shortage of discussion on the topic.  (A lot of people offering what a 
>bad
> > idea it is and all that but without stating WHY it's a bad idea... not
> > offering a scenario where it could be bad.)  In my case, this is a 
>business
> > that does business exclusively in Texas and exclusively for schools.  
>There
> > is absolutely no business reason for incoming mail from outside Texas, 
>let
> > alone outside of the U.S.
> >
> > If only I could get a list of non-US IP addresses, I would be a happier 
>man.
> >
> >
> >
> > _______________________________________________
> > http://www.ntlug.org/mailman/listinfo/discuss
> >
>
>
>
>------------------------------
>
>_______________________________________________
>Discuss mailing list
>Discuss at ntlug.org
>http://www.ntlug.org/mailman/listinfo/discuss
>
>
>End of Discuss Digest, Vol 55, Issue 2
>**************************************




