A Collection of Procmail Posts

Posts made by me, or follow-ups to them, with procmail-related content.

From eli@panix5.panix.com Mon Sep 18 13:14:58 EDT 2000
Article: 46792 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: or'ing procmail conditions
Date: 20 Jul 2000 21:45:46 GMT
Organization: Some absurd concept
Lines: 69
Message-ID: <8l7ru9$nuk$1@news.panix.com>
References: <8l5qud$ngc$1@panix3.panix.com>  <8l77ib$8v4$1@panix2.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 964129546 24532 166.84.0.230 (20 Jul 2000 21:45:46 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 20 Jul 2000 21:45:46 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:46792

In panix.questions, Analysis&Solutions  wrote:
> In  ja@panix.com (ja) writes:
> >:0 B
> >* (not MLM|financial breakthrough)
> >spam
> Yeah.  But, making one long line is harder to read/maintain.

Use scoring. Scoring can do some very nifty things.
You can specify lines having positive or negative scores. If
the net score for the recipe is positive the action at the
end is performed.

	:0 B
	* 1^0 not MLM
	* 1^0 financial breakthrough
	spam

A cheat sheet on scoring:

	* N^M test

If M is zero, N will be added once if the test succeeds.
If M is one,  N will be added each time the test succeeds.
If M is some other value, see the complicated formula in
procmailsc(5). (N and M can be real values.)
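
For readers who find the arithmetic easier to follow outside
procmail, here is a rough Python sketch of the cheat sheet. It assumes
the rule procmailsc(5) gives for repeated matches (the k-th match adds
N * M^(k-1)), uses Python's regexps rather than procmail's, and the
function names are made up for illustration.

```python
import re

def condition_score(n, m, pattern, text):
    """Score contributed by one `* N^M pattern` condition (sketch only)."""
    # procmail matches case-insensitively by default, so mimic that here.
    hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
    # The k-th match adds n * m**(k-1). With m == 0 only the first match
    # counts (0**0 is 1 in Python); with m == 1 every match adds n.
    return sum(n * m ** k for k in range(hits))

def recipe_fires(conditions, text):
    """conditions is a list of (n, m, pattern) tuples, one per `*` line."""
    total = sum(condition_score(n, m, pat, text) for n, m, pat in conditions)
    return total > 0

# The two-line "or" recipe from above.
spam = [(1, 0, "not MLM"), (1, 0, "financial breakthrough")]
print(recipe_fires(spam, "This is NOT mlm, honest"))
```

This only models the net score; it does not model the early exit that
the infinite weights give you.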

If you want the recipe to stop at the first match instead of
evaluating every condition to reach a final score, use an N of
2147483647 or -2147483647, depending on the direction you wish
to sway the results. Internally 2147483647 is infinity to
procmail.

If you want to save a score, copy it out of $= *immediately*
after the recipe. It can then be seeded back into another
recipe, as in this example:

	# Count bytes in the body.
	:0 B
	* 1^1 .
	{
		Bytes=$=
	}

	# For messages over 1000 bytes...
	# (A leading $ on the condition line is needed to tell
	# procmail to expand variables on the line, grrrr.)
	:0 B
	*   -1000^0
	* $ $Bytes^0
	{
		# Allow a maximum of one web address per thousand bytes.
		:0 B
		* $ -$Bytes^0
		*   1000^1        http://
		$junk
	}
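
The same two-stage test, written out as a Python sketch to show what
the seeded scores work out to. The function name and the default
arguments are made up for illustration, and leaving newlines out of the
byte count is an assumption about how "." behaves in the counting
recipe.

```python
def too_many_links(body, needle="http://", per=1000):
    """Rough equivalent of the nested recipes above (sketch only)."""
    # Stand-in for "* 1^1 ." (assumed not to count newlines).
    nbytes = sum(1 for ch in body if ch != "\n")
    # Outer recipe: -1000 + Bytes must come out positive.
    if nbytes - per <= 0:
        return False
    # Inner recipe: start at -Bytes and add 1000 for every web address.
    links = body.lower().count(needle)
    return per * links - nbytes > 0

print(too_many_links("x" * 2000 + " http://a http://b http://c"))
```

A 2027-byte body is allowed two addresses, so the third one in the
example tips the score positive.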

Sidenote:
You might be tempted to use something like this to check if the
mail is less than 1000 bytes, but in my tests (on the Sun and
NetBSDs at panix), it is broken and will always be true. 

	* -2147483647^0 < 1000

(Verbose logging can be *so* helpful.)

Elijah
------
realizes he has never used the '<' and '>' conditions


From eli@panix5.panix.com Mon Sep 18 13:15:17 EDT 2000
Article: 48186 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: or'ing procmail conditions
Date: 18 Sep 2000 17:13:41 GMT
Organization: Some absurd concept
Lines: 61
Message-ID: <8q5ig5$cpt$1@news.panix.com>
References: <8l5qud$ngc$1@panix3.panix.com> <8l77ib$8v4$1@panix2.panix.com> <8l7ru9$nuk$1@news.panix.com> <8q227o$jf1$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 969297221 13117 166.84.0.230 (18 Sep 2000 17:13:41 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 18 Sep 2000 17:13:41 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:48186

In panix.questions, Dallman Ross   wrote:
> On 20 July 2000, B. Elijah Griffin  wrote:
> > A cheat sheet on scoring:
> Really well done, Elijah.  I feel like I'm PDG at procmail,

Thanks.

> having done it since 1994.  But I'd only found the wherewithal
> to tackle the scoring manpages (man procmailsc) about three
> months ago.  Your explication is nicely done, and easier to
> follow.

I hate the procmail manpages. Stephen van den Berg seems to think
that mentioning something once in an offhand kind of way
constitutes full documentation.


[this is broken on panix procmail versions]
> > 	* -2147483647^0 < 1000
> Hmm. I haven't tested with the newer version of procmail that
> was installed last month.  But I do recall some *very* recent
> (last 6 weeks or so) discussion on the procmail list about
> a problem with "infinity" on some platforms, and new rules
> for ver. 3.15, now out.

I haven't followed the procmail list in years, but I thought I
remembered this being brought up once. I tried searching through
the changes file for one recent procmail for fixes that would
affect it but did not turn up anything.

> > Elijah
> > realizes he has never used the '<' and '>' conditions
> I was gonna *say* . . . :)

I have used scoring to count the number of characters (and
words) in messages and checked those with regular expressions.

  # Calculate some sizes for use in recipes

  # "B"ody only. Headers vary too much. Using ":0B" doesn't work, hence
  # this weird syntax. It is so easy to hate procmail for stuff like this.
  # Adding "* 1^0 ^^" would fix an off-by-one in the count, but that does
  # not bother me.
  :0
  * B ?? 1^1 ^.*$
  { }
  Lines     = $=

  :0
  * B ?? 1^1 [^         ]+([    ]+|$)
  { }
  Words    = $=

  :0
  * B ?? 1^1 > 1
  { }
  Chars    = $=
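
For comparison, roughly the same three numbers computed in Python.
This is only a sketch: the function name is invented, the word split
treats any whitespace as a separator (the recipe only names space and
tab), and it makes no attempt to reproduce procmail's off-by-one.

```python
def body_sizes(body):
    """Return (lines, words, chars) for a message body (sketch only)."""
    lines = len(body.splitlines())   # stand-in for "1^1 ^.*$"
    words = len(body.split())        # stand-in for the non-blank runs
    chars = len(body)                # stand-in for "1^1 > 1"
    return lines, words, chars

print(body_sizes("one two\nthree\n"))
```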

Elijah
------
so I lied I have used > in a recipe, but in a rather obscure way


From dattier@panix.com Wed Sep 20 21:06:32 EDT 2000
Article: 48230 of panix.questions
Path: news.panix.com!panix3.panix.com!not-for-mail
From: dattier@panix.com (David W. Tamkin)
Newsgroups: panix.questions
Subject: Re: or'ing procmail conditions
Date: 20 Sep 2000 17:28:05 -0400
Organization: PANIX -- Public Access Networks Corp.
Lines: 19
Message-ID: <8qba55$o89$1@panix3.panix.com>
References: <8l5qud$ngc$1@panix3.panix.com>  <8l77ib$8v4$1@panix2.panix.com> <8l7ru9$nuk$1@news.panix.com>
NNTP-Posting-Host: panix3.panix.com
X-Trace: news.panix.com 969485286 27843 166.84.0.228 (20 Sep 2000 21:28:06 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 20 Sep 2000 21:28:06 GMT
Xref: news.panix.com panix.questions:48230

In article <8l7ru9$nuk$1@news.panix.com>,
B. Elijah Griffin  wrote:

>Use scoring.

Or do the double-reverse DeMorgan:

  :0 condition-related flags
  * ! condition1
  * ! condition2
  * ! condition3
  { }  # this is a no-op
  :0E action-related flags
  action

Note that, like using the supremum weight with scoring, once a condition
passes (and thus its negated form fails), procmail skips right down to
the action without testing the other conditions.
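
The logic of the double-reverse can be checked outside procmail. In
this Python sketch (names invented for illustration) each condition is
a function of the message, and all() stops at the first negated
condition that fails, which mirrors the skipping described above.

```python
def or_via_demorgan(conditions, message):
    """True when at least one condition matches (sketch only)."""
    # First recipe: every negated condition must hold for the no-op to run.
    # all() gives up as soon as one condition passes.
    none_matched = all(not cond(message) for cond in conditions)
    # Second recipe (the E flag): runs only if the first one did not.
    return not none_matched

seen = []

def make(name, result):
    def cond(message):
        seen.append(name)   # record which conditions were actually tested
        return result
    return cond

print(or_via_demorgan([make("a", False), make("b", True), make("c", True)], ""))
print(seen)
```

Condition "c" is never tested, because "b" already passed.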



From eli@panix.com Mon Oct 16 16:06:14 EDT 2000
Article: 46437 of panix.questions
Path: news.panix.com!panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Asian Spam Filter?
Date: 12 May 2000 00:26:14 GMT
Organization: Some absurd concept
Lines: 82
Message-ID: <8ffj36$igk$1@news.panix.com>
References: <8ffe30$h8h$1@news.panix.com>
NNTP-Posting-Host: panix.com
X-Trace: news.panix.com 958091174 18964 166.84.0.226 (12 May 2000 00:26:14 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 12 May 2000 00:26:14 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:46437

In panix.questions,   wrote:
> I made some usenet posts in a Hong Kong movie news group, about three
> years ago, and have been receiving Asian spam ever since--at least two
> messages per week, and often more. I can't read the messages, because they
> use an encoding scheme with high-ASCII characters (above decimal 127).
> Since I can't read the text, I can't remove myself from the lists.
> 
> Is there any pre-existing or easily created email filter to block messages
> that have high-ASCII characters in the body of the mail?

That could be done, but filtering on the body is slow. It also could
have problems with messages that have curly quotes and similar dumb
highbit stuff left in. Chances are that these messages you want to
filter have some sort of charset declaration in the headers that
would be faster/easier to search on.

For instance, offhand, I would expect any mail that had a charset
declaration, but did not have it set to US-ASCII, ISO-8859-1, or
WINDOWS-* or DOS-*, to not be English.

	:0:
	* charset=
	* ! ^Content-Type.*charset=['"]?(us-ascii|iso-8859-1|windows-|dos-)
	$HOME/odd-charset.mbox

Although now that I check my email, I see a few others that
were English, but I subscribe to some mailing lists with
large international readership.

	UTF-8			mail from my dad
	x-UNICODE-2-0-UTF-7	mail from someone at panix
	unknown-8bit		spam
	ks_c_5601-1987		from hananet.net (Korean ISP?)
	iso-2022-jp		from someone in Japan
	iso-8859-2		from zloty.it.com.pl (Poland)
	iso-8859-7		from space.gr (Greece)
	X-UNKNOWN		mailing list followup to previous
	koi8-r			from nat.bg (Bulgaria)
	iso01.f16		carma.isirc.is (Iceland)


And some I got that were not English:

	us-ascii		had German content
	ks_c_5601-1987		from igroup.co.kr (Korea)

The other non-English ones I have are not Asian languages,
and don't have charsets specified. These are typically
German, Portuguese, or Spanish, judging by sending domain.

So based on this further research, I'd let through more unknowns,
not limit ISO-8859-* to just ISO-8859-1, and then do body filtering
on the rest. Something like this:

	:0:
	* ^Content-Type:.*charset=
	* ! ^Content-Type.*charset=['"]?(us-ascii|iso-8859|windows-|dos-|
		utf|(x-)?(unknown|unicode))
	{

	  # If any line has three or more highbit characters, it is odd.
	  :0B:
	  * [^^A-^?].*[^^A-^?].*[^^A-^?]
	  $HOME/odd-charset.mbox
        }

The second Content-Type line should not be folded in the procmailrc,
and the ^A and ^? in the body search stand for literal control
characters (control-A and DEL), so the character class is really:

	[^<control-A>-<DEL>]

Which translates into "any char not in the set between 0x01 and 0x7f
(inclusive)".  To type it in vi, I use ^V (control-V) before the
control characters.

This could knock some mail that has "ascii" art using highbit
characters, though. Best would be to apply the body filter to
an opt-in list of charsets, not an opt-out one.
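
A Python sketch of the same opt-out filter, for anyone who wants to
try the idea on saved mail before committing it to a procmailrc. The
names are invented for illustration, and unlike procmail it does not
unfold a Content-Type header that continues onto a second line.

```python
import re

# Charset prefixes that are let through, following the second condition.
KNOWN_PREFIXES = ("us-ascii", "iso-8859", "windows-", "dos-", "utf",
                  "unknown", "unicode", "x-unknown", "x-unicode")
CHARSET = re.compile(rb"^Content-Type:.*charset=['\"]?([^\s'\";]+)",
                     re.IGNORECASE | re.MULTILINE)
# Three bytes outside 0x01-0x7f on one line ("." does not cross newlines).
THREE_HIGHBIT = re.compile(rb"[^\x01-\x7f].*[^\x01-\x7f].*[^\x01-\x7f]")

def is_odd(header, body):
    """header and body are bytes; True means file under odd-charset."""
    found = CHARSET.search(header)
    if found is None:
        return False              # no charset declared at all
    charset = found.group(1).decode("ascii", "replace").lower()
    if charset.startswith(KNOWN_PREFIXES):
        return False              # one of the charsets we let through
    return THREE_HIGHBIT.search(body) is not None

korean = b'Content-Type: text/plain; charset="ks_c_5601-1987"\n'
print(is_odd(korean, b"\xb0\xa1\xb0\xa2 hi\n"))
```

Turning this into the opt-in version is a matter of replacing
KNOWN_PREFIXES with a list of charsets to inspect and reversing the
startswith test.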

Elijah
------
ascii characters don't have highbits set


From eli@panix5.panix.com Mon Oct 16 16:54:05 EDT 2000
Article: 48821 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Asian Spam Filter?
Date: 16 Oct 2000 20:52:31 GMT
Organization: Some absurd concept
Lines: 116
Message-ID: <8sfpqf$ear$1@news.panix.com>
References: <8ffe30$h8h$1@news.panix.com> <8ffj36$igk$1@news.panix.com> <8sd46q$mie$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 971729551 14683 166.84.0.230 (16 Oct 2000 20:52:31 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 16 Oct 2000 20:52:31 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:48821

In panix.questions, Dallman Ross   wrote:
> [On] 12 May 2000 00:26:14 GMT, eli@panix.com spake thusly:
> > 	:0:
> > 	* ^Content-Type:.*charset=
> > 	* ! ^Content-Type.*charset=['"]?(us-ascii|iso-8859|windows-|dos-|
> > 		utf|(x-)?(unknown|unicode))
> > 	{
> 
> > 	  # If any line has three or more highbit characters, it is odd.
> > 	  :0B:
> > 	  * [^^A-^?].*[^^A-^?].*[^^A-^?]
> > 	  $HOME/odd-charset.mbox
> >         }

[Ooops. Tab damage.]

> > The second Content-Type line should not be folded in the procmailrc,
> Though that could be done with a continuation line, as I'm sure
> Elijah knows:
> 
>  	  * [^^A-^?].*[^^A-^?].*[^^A-^?]\
>  	  $HOME/odd-charset.mbox

Um, yes, but that is not the line I was referring to. I was referring to:

 	* ! ^Content-Type.*charset=['"]?(us-ascii|iso-8859|windows-|dos-|
 		utf|(x-)?(unknown|unicode))

> recently trying to understand more than I presently do about
> high-bit expression in character classes.[1]  But anyway, back
> to the matter that was being discussed: I'm thinking that
> a useful approach here might be simply to look for Chinese
> IP space and *then* do the body egrep for high-bit stuff.
> As an aside, I find it useful to look at the Subject: and
> From: headers for non-Western characters as well - or
> rather, first.  Anyway, filtering on Chinese IP space in
> the Received: headers would seem to obviate the need to
> get so complex with Content-Type header-filtering.

Headers are supposed to be seven-bit only, and any 8bit content
in them is supposed to have a weird encoding (RFC1342). In
practice you'll see some 8bit content, but From: is more likely
to be encoded than Subject. Examples from the RFC:

   From: =?US-ASCII?Q?Keith_Moore?= 
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= 
   CC: =?ISO-8859-1?Q?Andr=E9_?= Pirard 
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

Which translate to:

   From: "Keith Moore" 
   To: "Keld Jørn Simonsen" 
   CC: "André Pirard" 
   Subject: If you can read this you understand the example.
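
The translation can be reproduced with Python's standard email
package. This is a small sketch (decode_words is an invented name) that
decodes each encoded-word and joins the pieces. It is meant for
checking what a header says, not as a procmail recipe.

```python
from email.header import decode_header

def decode_words(value):
    """Decode the encoded-words in one header value (sketch only)."""
    parts = []
    for chunk, charset in decode_header(value):
        # Encoded chunks come back as bytes plus a charset name;
        # a header with no encoded-words comes back as plain str.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", "replace")
        parts.append(chunk)
    return "".join(parts)

print(decode_words("=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?="))
```

Whitespace between two adjacent encoded-words is dropped, which is why
the folded Subject above reads as one sentence.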

> I recently asked on the procmail list about regexes to work
> in procmail for finding non-Western chars.  I use them for
> my test of Subject:/From: headers.  DWT (dattier here at Panix) 
> gave me an ideal answer that has worked well.  I'm thinking it
> would also be useful in the scenario described by the original
> poster, above.  He defines for procmail a set of "nonprinting"
> characters that should be recognized as non-Western when
> found in mail:
> 
>   # Whitespace in the brackets contains a tab and a space,
>   # *in that order*
>   NONPRINTING=[^	 -~]

This is okay for headers (procmail will deal with the whitespace
of multiline headers for you), but you might want to allow ^L,
etc, in the actual body of the message.

>     :0  # find non-Western character sets
>   * $ ^(From|Subject):.*$NONPRINTING
>   | $SPAMSNAG
> 
> The char delimiter should also work well in body egreps.

In the body, procmail won't hide newlines from you (for grep
purposes[1]), and I'm not sure what would happen with DOS
style line endings.

> [1] The below recipe, or something similar to it, was
> published on the procmail list about six months ago.  I
> tried it briefly in my anti-spam arsenal but found it
> gave me false-positives.  I don't understand the char-class
> expression in the condition line, though, which is another
> way of saying that I don't know exactly what this recipe
> tries to do.  Elijah, _et al._, seems like the man to
> tell me.

I am an 'et alia' now?

>    :0 BH
>    * ^Content-transfer-encoding:.*quoted-printable
>    * -40^0
>    * 1^1 =[89A-F][0-9A-F]
>    { #something unrelated to my question will go here }

This looks for quoted-printable messages, sets the initial
score to -40, then adds one for each high-bit QP char. A QP
char is an equals sign followed by two hex digits, and a first
digit of 8 through F means the high bit is set. If the score
becomes positive, the action gets executed.
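
In other words the recipe fires on the 41st high-bit escape. A Python
sketch of the same count follows; the names are invented, and it
searches the whole message as one string, which is a simplification of
the BH flags.

```python
import re

IS_QP = re.compile(r"^Content-transfer-encoding:.*quoted-printable",
                   re.IGNORECASE | re.MULTILINE)
# "=" followed by two hex digits, the first of them 8 or above.
HIGHBIT_QP = re.compile(r"=[89A-F][0-9A-F]", re.IGNORECASE)

def qp_recipe_matches(message, threshold=40):
    """True when the message holds more than `threshold` high-bit escapes."""
    if not IS_QP.search(message):
        return False
    score = -threshold + len(HIGHBIT_QP.findall(message))
    return score > 0

head = "Content-Transfer-Encoding: quoted-printable\n\n"
print(qp_recipe_matches(head + "=E9" * 41))
```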

Elijah
------
just noticed RFC1342 encoded-words MUST NOT be used in an actual email address

> -- 
>    \     .-.     .-.     .-.     .-.     .-.     .-.     .-.
>     \-d-/-m-\-a-/-n-\-@-/-p-\-a-/-n-\-i-/-x-\-.-/-c-\-o-/-m-\
>      '-'     '-'     '-'     '-'     '-'     '-'     '-'     \




From eli@panix5.panix.com Wed Oct 18 13:36:49 EDT 2000
Article: 48866 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Another Procmail question
Date: 18 Oct 2000 17:24:20 GMT
Organization: Some absurd concept
Lines: 47
Message-ID: <8skmc4$n84$1@news.panix.com>
References: <20000214.1815.2000754snz@microvest.demon.co.uk> <889k48$q5k$1@byzantium.nyc.access.net> <88cm9o$2kf$1@news.panix.com> <8sih65$70e$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 971889861 23812 166.84.0.230 (18 Oct 2000 17:24:20 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 18 Oct 2000 17:24:20 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:48866

In panix.questions, Rik Kabel  wrote:
> B. Elijah Griffin  wrote:
> >And here's how to match an empty To: header:
> >  * ^To:[      $]*$
> >(That's a space, tab, and dollarsign in there. The $ really matches a
> >real or implicit newline, it is not a zero-width condition to procmail.)
> Inside character class delimiters the $ matches a literal $. If this

Whoops. Yes.

> Of course, if by an empty To: header you mean one with no syntactically
> recognizable address in it, there are other empty To: header problems.

That is a can of worms that does not need to be opened here.

> For instance, consider:
>             To: (this is an RFC822 comment)
> Given the difficulty of parsing comments, it is difficult to detect any
> but the most basic 'empty' header. For instance, how do you handle group
> names, such as 'undisclosed:;'?

	TO_HEADER=`formail -X To: | formail -r`
	:0
	* TO_HEADER ?? ^^foo@bar^^
	{
	   # condition to act upon a functionally empty To: header
	}

> I have been playing with
>             [ \t]*(\([^()]*\)[ \t]*)*
> to match a comment, where \t is a tab, but that doesn't begin to
> represent what comments can look like.

No kidding. The general case cannot be done with true regular
expressions (jumping through many hoops, perl's regexps can do it,
but you don't want to try). Procmail's regexps are true ones.
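
What a true regular expression lacks is a counter. The Python sketch
below (an invented helper, not a full RFC822 parser) tracks nesting
depth to strip comments. It handles backslash escapes but deliberately
ignores quoted strings, where parentheses are not comments.

```python
def strip_comments(value):
    """Remove (possibly nested) parenthesised comments from a header value."""
    out = []
    depth = 0
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # A backslash quotes the next character; keep the pair only
            # when we are outside every comment.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)

print(repr(strip_comments("(this is an RFC822 comment)")))
```

A To: header whose value is empty after stripping comments and
whitespace is one of the "functionally empty" cases discussed above.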

> To match embedded newlines along with tabs and spaces, use a construct
> like
>             ( |\t|$)
> (again, \t is a tab).

Yes.

Elijah
------
sometimes misses zero-width assertions in procmail regexps


From eli@panix5.panix.com Tue Feb  6 14:40:40 EST 2001
Article: 50310 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: screening for file attachment size
Date: 6 Feb 2001 18:15:43 GMT
Organization: Some absurd concept
Lines: 40
Message-ID: <95pf0f$70d$1@news.panix.com>
References: <95nu6e$5fn$1@panix3.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 981483343 7181 166.84.0.230 (6 Feb 2001 18:15:43 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 6 Feb 2001 18:15:43 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50310

In panix.questions, Andreas Ringstad  wrote:
> anyone have any tricks they recommend for filtering
> mail based on file attachment size?

You are going to need a full fledged MIME parser to figure out
how big attachments are. Figuring out the size of the whole
message is much easier.

> a procmail recipe could do the trick, i guess.  any
> other solutions?

Maybe my imagination is failing me, but I can't figure out
a way to measure the size of an attachment in procmail.
I can think of ways of counting attachments (without the
ability to recurse into nested messages, though) and ways
to measure the size of the whole message.

Read the procmailsc and procmailrc man pages for more
help on scoring in recipes.

  # Count characters in body. From ~eli/procmail/rc.dupes
  :0
  * B ?? 1^1 > 1
  { }
  Chars    = $=


  # Reject messages larger than 100,000 characters. (Untested)
  :0
  * -100000^0  ^^
  * $ $Chars^0 .
  {
    # /usr/include/sysexits.h:
    # define EX_NOPERM      77      /* permission denied */
    EXITCODE=77
  }
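
Outside procmail, a MIME parser gets at per-attachment sizes
directly. Here is a minimal sketch using Python's standard email
package. The function name is invented, "attachment" here means any
part with a Content-Disposition of attachment, and how to wire it into
a recipe is left open.

```python
import email
from email import policy

def attachment_sizes(raw):
    """Return [(filename, decoded size in bytes)] for a raw message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    sizes = []
    for part in msg.walk():          # walk() also descends into nested parts
        if part.get_content_disposition() == "attachment":
            # decode=True undoes base64 or quoted-printable first.
            payload = part.get_payload(decode=True) or b""
            sizes.append((part.get_filename(), len(payload)))
    return sizes
```

The sizes are of the decoded attachments, so they will be smaller than
what the base64 text adds to the message itself.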

Elijah
------
should write a decent procmailrc interfacable MIME tool some day


From eli@panix5.panix.com Tue Feb  6 14:40:52 EST 2001
Article: 50313 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: screening for file attachment size
Date: 6 Feb 2001 19:40:12 GMT
Organization: Some absurd concept
Lines: 53
Message-ID: <95pjur$8jd$1@news.panix.com>
References: <95nu6e$5fn$1@panix3.panix.com> 
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 981488412 8813 166.84.0.230 (6 Feb 2001 19:40:12 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 6 Feb 2001 19:40:12 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50313

In panix.questions, ja  wrote:
> Here is an example of what might work: (untested)
> 
> :0:
> * > 2000000 | gzip -9c >> mobymail

I think you mean:

  :0:
  * > 2000000
  | gzip -9c >> mobymail

But why "mobymail"? And wouldn't it be better to compress
each message to a separate file?

> :0
> * > 2000000 
> { :0 c: 
> 
> | gzip -9c >> $HOME/mobymail 
> 
> :0 h
> 
> | (cat -; echo "You have mobymail" ) >> $DEFAULT
> 
> }

More recent versions of procmail may be able to deal with
those blank lines between the :0 and the action, but v3.11pre3
(on the Sun) will choke.

A good indentation style is very helpful with procmail, since
the syntax is so obtuse.

  :0
  * > 2000000
  {
    :0 c:
    | gzip -9c >> $HOME/mobymail
 
    :0 h:
    | (cat -; echo "You have mobymail" ) >> $DEFAULT
  }

> There is no locking colon in the initial action line for the second
> recipe; I have no idea why...I suppose that the presence of the
> locking colon in the braces is adequate, but I don't know why.

Nope, you need a locking colon there.

Elijah
------
or a LOCKFILE=foo


From eli@panix5.panix.com Tue Feb  6 15:49:12 EST 2001
Article: 50315 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: screening for file attachment size
Date: 6 Feb 2001 20:48:50 GMT
Organization: Some absurd concept
Lines: 36
Message-ID: <95pnvi$9u2$1@news.panix.com>
References: <95nu6e$5fn$1@panix3.panix.com>  <95pjur$8jd$1@news.panix.com> 
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 981492530 10178 166.84.0.230 (6 Feb 2001 20:48:50 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 6 Feb 2001 20:48:50 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50315

In panix.questions, ja  wrote:
> On 6 Feb 2001 19:40:12 GMT, B. Elijah Griffin in wrote:
> >In panix.questions, ja  wrote:
> >> :0:
> >> * > 2000000 | gzip -9c >> mobymail
> >I think you mean:
> >  :0:
> >  * > 2000000
> >  | gzip -9c >> mobymail
> Yep. May I ask why I mean that? Will mail processing break in my
> example?

Yes.

I ran this rc file through procmail:
	VERBOSE=yes

	:0
	* > 200 | gzip -9c >> mobymail

	:0
	normal

And got this verbose log (line lengths adjusted):
	procmail: [28375] Tue Feb  6 15:42:31 2001
	procmail: Skipped "| gzip -9c >> mobymail"
	procmail: Match on "> 200 | gzip -9c >> mobymail"
	procmail: Assigning "LASTFOLDER=:0"
	procmail: Opening ":0"
	From test@example.com  Mon May  1 00:50:18 2000
	 Subject: test message with > 200 bytes in headers and in body
	  Folder: :0                                                        430

Elijah
------
not all whitespace is created equal


From eli@panix5.panix.com Wed Feb 14 15:13:12 EST 2001
Article: 50441 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Stripping HTML from email on the panix box?
Date: 14 Feb 2001 20:13:00 GMT
Organization: Some absurd concept
Lines: 42
Message-ID: <96eosc$jth$1@news.panix.com>
References: <96ed4r$q1l$1@panix6.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982181580 20401 166.84.0.230 (14 Feb 2001 20:13:00 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 14 Feb 2001 20:13:00 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50441

In panix.questions, Robert Cutler  wrote:
> I have telnet access only to panix and read email with elm. An increasing 
> amount of email, especially from certain e-lists, comes already formatted 
> in HTML. This is no problem for people with PPP and mailreaders that 
> display HTML on their home machine, if they never want to do string-
> searches. I doubt that I am the only person who does not fall into that 
> category.
> 
> Is there a way automatically to strip HTML coding from email, operating 
> from the panix command line? I've done 'apropos' and looked at a number of 
> man pages, and there is nothing self-evident. HTML::FormatText might be 
> what I'm looking for but I have no effective or useful knowledge of perl. 
> (Yes I'd love to learn it. I've been loving to learn it for the last few 
> years. My every-day incentive structure has not yet allowed me to do so.)

Given only the most basic testing, this seems okay.

	:0
	* ^Content-Type:\<*text/html
	{
	  :0hfw
	  | perl -wpe 's,^(Content-Type:\s*text)/html,$1/plain,i'

	  :0bfw
	  | perl -MHTML::TreeBuilder -MHTML::FormatText \
		-we 'my $tree = HTML::TreeBuilder->new->parse_file("-"); \
		     my $format = HTML::FormatText->new(leftmargin => 0, \
			rightmargin => 50); \
		     print $format->format($tree);'
	}

Join the \ lines together (without the backslashes) or procmail may not
like the multiline script.

This won't work on oldsun, since it does not seem to have the modules.
(You could install them in your home directory and tweak the search
path, but you may not be up to that.) I don't know if it will work on
the NetBSDs, I didn't check for the modules there.
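
Where the perl modules are missing, a much cruder stand-in can be
built from Python's standard library alone. This sketch swaps in a
different technique from the recipe above: it only drops tags, script
and style content, with no margins or real layout. The class and
function names are invented for illustration.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text, skipping tags and script/style content."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "li", "tr"):
            self.chunks.append("\n")     # crude line breaks

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()                       # flush any buffered text
    return "".join(parser.chunks)

print(html_to_text("<p>Hello <b>world</b> &amp; all</p>"))
```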

Elijah
------
/usr/local/contrib/rc anyone?


From eli@panix5.panix.com Wed Feb 14 17:24:36 EST 2001
Article: 50445 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 14 Feb 2001 22:23:07 GMT
Organization: Some absurd concept
Lines: 58
Message-ID: <96f0ga$md0$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982189387 22944 166.84.0.230 (14 Feb 2001 22:23:07 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 14 Feb 2001 22:23:07 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50445

In panix.questions, alice faber  wrote:
> Is there any command I can add to some of my procmail recipes that would
> write to the log file which specific recipe dev nulled a piece of mail? 
> Most of my recipes send things to a junk file (which hasn't gotten any false
> positives in a long time), but one of my father's massively forwarded
> jokes ended up being totally trashed. I've reverted the last recipes I
> edited (futile attempts to /dev/null anything in the big5 character set),
> but, should this problem arise again, I'd like to be able to identify the
> offending recipe, and I couldn't find any likely candidates in the 
> procmail and procmailex man pages.

Without verbose logging, procmail won't tell you what recipe produced
any particular result. Trouble is, verbose logging is really verbose.

You can code your procmailrc around this, two examples below, but
it does not help with the panix shared filters.

Logging in each recipe:

	NL='
	'

	:0
	* some spam rule
	{
	  LOG="some spam rule triggered$NL"
	  :0:
	  $TRASH
	}

Or recipes that only record a reason, with the logging done once at the end:

	NL='
	'
	Reason=

	:0
	* some spam rule
	{
	  Reason=$Reason'some spam rule;'
	}

...

	:0
	* Reason ?? .+
	{
	  LOG="Found junk: $Reason$NL"
	  :0:
	  $TRASH
	}

I prefer the second approach, because it lets you know sooner if
more than one recipe would have caught a message.
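
The shape of the second approach, as a Python sketch with invented
names: every rule gets a chance to add its reason, and the log line and
the delivery decision come once at the end.

```python
def sort_mail(message, rules, log):
    """rules is a list of (reason, test) pairs; log is a list of lines."""
    # Run every rule, so overlapping rules all show up in the log.
    reasons = [reason for reason, test in rules if test(message)]
    if reasons:
        log.append("Found junk: " + ";".join(reasons))
        return "trash"
    return "inbox"

rules = [("caps subject", lambda m: m.isupper()),
         ("mentions MLM", lambda m: "MLM" in m)]
log = []
print(sort_mail("BUY MLM NOW", rules, log), log)
```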

Elijah
------
currently at about 16% getting doublely damned


From dman+news@wecontrolthevertical.com Thu Feb 15 12:06:03 EST 2001
Article: 50464 of panix.questions
Path: news.panix.com!not-for-mail
From: Dallman Ross 
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 15 Feb 2001 13:29:43 GMT
Organization: Res Ipsa Loquitur
Lines: 557
Message-ID: <96glk7$8jn$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96f0ga$md0$1@news.panix.com> 
Reply-To: dman+news@wecontrolthevertical.com
NNTP-Posting-Host: panix2.panix.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.panix.com 982243783 8823 166.84.0.227 (15 Feb 2001 13:29:43 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 15 Feb 2001 13:29:43 GMT
X-Opinion: Lex axilla justitiae
User-Agent: tin/1.4.4-20000803 ("Vet for the Insane") (UNIX) (NetBSD/1.4.3 (i386))
Xref: news.panix.com panix.questions:50464

In pertinent part in , Cliff Heller
 spake thusly:

> I came up with a solution to identify the panix shared filters, but
> it is complex and prone to error if panix ever changes the way they
> implement the shared filters.

> [snip]

> The advantage is that I set a variable that indicates 1) whether or
> not the panix rules caught the spam, and 2) which rule caught it.
> Strictly speaking it only indicates the last rule that caught it.  I'm
> thinking about a simple mod to not fire the rule if the SPAM variable
> already has a value.

Here's what I do (but not on panix currently/yet).  This is most of
the spam section of a .procmailrc that handles 16 domains (of mine).
Most or all of the variables seen are defined up-top in the rc.
Okay, here goes.  Other comments follow after the big insert:

-------------- 474 lines from my .procmailrc follow until "^-----"

  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Now we're "really" in the spam section ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##

  ##  Spam is identified in sections based on header keywords. It then
  ##  is marked for logging and trapping in the SPAMSNAG folder.  We
  ##  will have excluded listmail earlier up-top (this has not yet
  ##  been implemented, so is now still an ugly duplicated effort in
  ##  each applicable recipe).  Sections proceed hence:
  ##
  ##    Section "A" initializes its testing heuristics on `From_'
  ##    Section "B" initializes testing heuristics on `Received:'
  ##    Section "C" initiates on `Message-ID:'
  ##    Section "D" initiates on `From:'
  ##    Section "E" initiates on `Subject:'
  ##    Section "F" initiates on `To:'
  ##    Section "H" initiates on other headers
  ##
  ##
  ##  We do NOT *initiate* spam searches based on message-body; nor do
  ##  we generally invoke lists (whether long or short) for grepping
  ##  of blacked-out I(S)Ps,[1] domain blocks, RBL nominations, etc.
  ##  While some body grepping ultimately is necessary, we try to
  ##  minimize it based on "smart" heuristics that proceed in stages
  ##  from least- to most-"invasive."  We thus tend to rely, first,
  ##  on suspicious headers (though "suspicion" may sometimes only be
  ##  discerned when headers are cross-compared in a rather evolved
  ##  fashion). Minimized impact on server resources and thoughtful,
  ##  effective, organized, well-commented, and *elegant* coding
  ##  is the goal, if not a completely realistic assessment of the
  ##  result.
  ##
  ##  The structure for our approach here continues from the sections
  ##  outlined above to sub-regions expressed in a decimal notational
  ##  system.  Thus, "A-14a-BC" represents Recipe No. 14 (subpart
  ##  `a') in Section A (based on `From_'), with the trailing
  ##  letters indicating condition lines based on `Received:' and
  ##  `Message-ID:'.  This recipe is one degree deep.  A three-degree
  ##  recipe initialized on `Subject:', with `Message-ID' testing
  ##  in the second level and body testing at level 3, might read
  ##  "F-04.C.X", with the `X' indicating a body egrep.  Note, for
  ##  the first degree, that no extra condition elements (as would
  ##  be represented by upper-case letters) are stated: in this
  ##  example, no further explication at that degree was indicated for
  ##  Recipe F-04.  Overall, the schema helps us develop additional
  ##  heuristics in a forthright fashion, build on previous models and
  ##  code, and quickly identify and categorize the "type" of spam
  ##  found as it is expressed in the appended label or log entry.
  ##
   #
   #  [1] The one exception is that we do i.d. based on China
   #  Telecom's IP block.  Whitelist "demurers" can be called from
   #  an INCLUDERC if desired.  We also sometimes test, based on
   #  country codes (such as `jp', `ru') found in the `Received:' or
   #  other headers, for putative transient delivery servers that are
   #  suspiciously "high-noise."  But such a high-S/N- or "suspect"
   #  class is never, alone, a sufficient earmark such that we would
   #  feel comfortable labeling mail definitively as spam.  IOW,
   #  suspect-class servers must exist in the headers along with
   #  *other compelling evidence* before we are prepared to find that
   #  the "clear and persuasive bar" has been reached such that we may
   #  mark the mail as spam.


  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section A - Bogus `From_' Header       ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`A' $NL"

     :0  # hotmail address in FROM but no hotmail.com server in Received:
     * FROM ?? hotmail
     * ! $ ^Received:.*[$WHITESPACE]hotmail\.com
     { RECIPE = "${RECIPE:+$RECIPE }UBE_A-01-B" }


     :0  # real AOL mail will have the Message-ID
     * FROM ?? @aol\.com
     * ! $ ^Message-ID:[$WHITESPACE]+\<[a-z0-9.]+@aol\.com\>$
     { RECIPE = "${RECIPE:+$RECIPE }UBE_A-02-C" }


     :0  # if not from common TLD . . .
     * $ ! FROM ?? \.(com|net|org|de|co\.uk)\>?($WHITESPACE|$)
     * ! (^TO|^Sender:.*)(list|track\.)
     {
        :0 B  # egrep the body for typical phrasing
        * $ $SPAMISH
        { RECIPE = "${RECIPE:+$RECIPE }UBE_A-03.X" }
     }


     #  [Others snipped here]
  }


  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section B - Bogus `Received:' Headers  ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`B' $NL"

     :0  # 50+ alphabet chars in a bogus "word" in Received:
     * ^Received:.*[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]\
                   [a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]\
                   [a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]\
                   [a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]\
                   [a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]
     { RECIPE = "${RECIPE:+$RECIPE }UBE_B-01" }


     # [snippage]


     :0  # high-S/N country-code in Received: . . .
     * ^Received:.*\.\/(jp|ru)
     * ! ^Precedence:.*bulk
     * ! ^Sender:.*list
     {
        nonFavoredNATION = $MATCH

        :0  # . . . AND *inconsistent* Message-ID: with no typical TLD
        * ! $ ^Message-ID:.*@.*\.($nonFavoredNATION|com|net|org)>
        { RECIPE = "${RECIPE:+$RECIPE }UBE_B-03a.C" }


        :0 B  # . . . or AND suspect phrasing in body
        * ([!$][!$]|mail.*remove|remove.*mail)
        { RECIPE = "${RECIPE:+$RECIPE }UBE_B-03b.X" }
     }


         # Last edited 8-Oct-00: opt-out for Raymond or Judith
     :0  # Chinese IP space.  `h' is implicit with nested braces
     * ! ^Sender:.*list
     * ! ^TO_track\.
     * ! LOCALNAME ?? ()(raymond|judith)
     * ^(Received|Message-ID):.*(202\.(9[6-9]|10[0-9]|11[0-1])|\
                                  61\.(12[8-9]|13[0-9]|14[0-9]|15[0-9]))\.
     { RECIPE = "${RECIPE:+$RECIPE }UBE_B-04-C" }


     # [Lots more snippage]

     :0  # munged "IP space" that is really the spammer's server name
     * $ ^Received:.*[${WHITESPACE}_]by[${WHITESPACE}_]\
                   .*[${WHITESPACE}_]by[${WHITESPACE}_]
     * $ ^Received:.*[${WHITESPACE}_]\[[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+]
     { RECIPE = "${RECIPE:+$RECIPE }UBE_B-06-B" }


     # [snip, snip]
  }


  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section C - Suspect/Bogus `Message-ID:'##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`C' $NL"

         # Last edited 14-Oct-00: removed test for `.' from first condition
     :0  # empty or malformed Message ID is bogus
     * ! ^Message-ID:[^@]+@?>?$
     * ! ^Message-ID:.*\.([0-9][0-9]+)?>$
     * ! ^Message-ID:[^0-9@]+$
     * ! ^Message-ID:[^.]+$
     * ! ^Message-ID:.*mail\.mydomain\.com>$
     { }
           # Last edited 11-Oct-00: worked on body egrep
     :0 E  # (reverse-DeMorgan with ELSE is just coding slickness)
     * ! ^Sender:.*list
     * ! ^TO_track\.
     * $ -$INFINITY^0 ^TODallman
     *   -150^0
     *    100^1 ! SUBJ ?? RE:
     *    100^0  ^Content-Type:.*html
     *     75^0   B ?? (earn|profit|remov)
     { RECIPE = "${RECIPE:+$RECIPE }UBE_C-01.UFEHXS" }
  }



  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section D - Problematic `From:'        ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`D' $NL"

     :0  # find non-Western character sets; scoring finesses OR conditions
     * $ $INFINITY^0 ^From:.*$NONPRINTING
     * $ $INFINITY^0 SUBJ ?? ()$NONPRINTING
     { RECIPE = "${RECIPE:+$RECIPE }UBE_D-01-ES" }


     # [more snippage]
  }



  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section E - Questionable `Subject:'    ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`E' $NL"

     :0 D  # (`D' means case-sensitive) SUBJ contains keywords . . .
     * SUBJ ?? FREE
     {
        :0  # . . . AND body contains suspect phrasing
        * $ B ?? $SPAMISH
        { RECIPE = "${RECIPE:+$RECIPE }UBE_E-01a" }


        :0  # . . . or AND body contains other suspect phrasing
        * B ?? free report
        { RECIPE = "${RECIPE:+$RECIPE }UBE_E-01b-X" }


        :0  # . . . or AND wasn't addressed to my domain
        * ! $ ^TO_$DOM
        { RECIPE = "${RECIPE:+$RECIPE }UBE_E-01c-F" }
     }


     # (very large snip)


     :0  # if NO listed subject, egrep body for typical phrasing
     * SUBJ ?? ^^\[subject was (blank|missing)]^^
     * $ B ?? $SPAMISH
     { RECIPE = "${RECIPE:+$RECIPE }UBE_E-03-X" }


     :0 D  # all-caps in 3+ words of SUBJ . . .
     *   -250^0
     * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][a-z]+([$WHITESPACE]|$)
     {
        :0  # . . . AND SPAMISH body phrase is suspect
        * ! ^Sender:.*(list|postmaster)
        * ! ^TO_track\.
        * ! FROM ?? abuse
        * $ B ?? $SPAMISH
        { RECIPE = "${RECIPE:+$RECIPE }UBE_E-04a.X" }


            # Last edited 28-Sep-00: excluded `track.'
        :0  # . . . or AND `!' or `$' in SUBJ . . .
        * ! ^Sender:.*(list|postmaster)
        * ! ^TO_track\.
        * SUBJ ?? [!$]
        {
           :0 B # . . . AND 3+ body exclamation marks or dollar signs
           * -200^0
           *  100^1 (\!|\$)
           { RECIPE = "${RECIPE:+$RECIPE }UBE_E-04b.E.FX" }
        }
     }


     # [snip, snip, snip, snip]


           # Last edited 11-Oct-00: added numbers to recognized string
     :0 D  # 10+ spaces + 5 lower-case letters or 4 numbers in SUBJ
     * $ SUBJ ?? ()$SPACE$SPACE$SPACE$SPACE$SPACE\
                   $SPACE$SPACE$SPACE$SPACE$SPACE\
                   ([a-z][a-z][a-z][a-z][a-z]|[0-9][0-9][0-9][0-9])^^
     { RECIPE = "${RECIPE:+$RECIPE }UBE_E-07" }


     # [snip-o]


     :0  # sloppy spaces after SUBJ text and SPAMISH
     * $ SUBJ ?? ()[$WHITESPACE]+$
     * $ B ?? $SPAMISH
     { RECIPE = "${RECIPE:+$RECIPE }UBE_E-10-X" }
  }


  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section F - Questionable `To:' Headers ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`F' $NL"

     :0  # if not directly to domain I use . . .
     * ! ^(From|Return-Path|Sender):?.*(abuse|list|nomotek)
     * ! $ ^To:.*($DOM|Dallman)
     {
        :0  # . . .  look for suspicious high priority
        * ^X-MSMail-Priority: High
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-01a.H" }


        :0 B  # . . . egrep body for typical phrasing
        * $ $SPAMISH
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-01b.X" }
     }


     :0  # if no To: or Cc: at *all* . . .
     * ! ^To:
     * ! ^Cc:
     {
        :0 B  # THEN egrep body for `Remov' and 3+ `$' or `!'
        * Remov
        * -200^0
        *  100^1 [$!]
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-02a-H.XS" }


        :0  # . . . OR look for largish AND non-plaintext AND "referred"
        * > 5000
        * ^Content-Type:.*(multipart|html)
        * B ?? (thi|3)rd party thought (that )?you would be interested
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-02b-H.HHX" }
     }


     :0  # Popular spam signature
     * ^To: undisclosed-recipients:\;
     * ^Content-Type:.*html
     * ! FROM ?? root
     { RECIPE = "${RECIPE:+$RECIPE }UBE_F-03-A" }


     :0  # not to one of my domains and no listmail . . .
     * ! ^(Received|Return-Path|Sender):.*(list|newsletter)
     * ! ^Precedence: (bulk|junk)
     * $ ! ^TO_($DOM|D(all)?man)
     * ! ^FROM_DAEMON
     * ! ^From:.*Dallman
     {
            # Last edited 8-Oct-00: added Received: to condition
        :0  # . . . AND Asian time zone
        * ^(Date|Received):.* \+0[89]00$
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-04a.H" }


        :0 B  # . . . or AND suspect body phrasing
        * $ $SPAMISH
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-04b.X" }


        :0 DB  # . . . or AND all-caps in 3+ words of body and `$'
        * $ ^[^a-z]*[A-Z]+[$WHITESPACE]+[A-Z]+[$WHITESPACE]+[A-Z]+
        * \\$
        {
           :0 B  # . . . AND no `remove' in body
           * remove
           { RECIPE = "${RECIPE:+$RECIPE }UBE_F-04c.XX.X" }
        }


        :0  # . . . or AND private IP space in Received:
        * ^Received:.*\[192\.168\.[0-9.]+]
        { RECIPE = "${RECIPE:+$RECIPE }UBE_F-04d.B" }
     }


     # [snip]
  }


  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  ## Section H - Other Suspect Headers      ##
  ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
  #
  :0
  {
     LOG = "  ::: WE'VE ENTERED SPAMSNAG SECTION \`H' $NL"

     :0  # check time zone in Date:
     * ! ^Sender:.*list
     * ! ^FROM_DAEMON
     * $ ^Date:.*:[0-5][0-9][$WHITESPACE]+\/.*
     {
        TZ = $MATCH

              # Last edited 18-Oct-00: added `BST' option
        :0 D  # if TZ doesn't match acceptable timezone string . . .
        * ! TZ ?? ^^([+-][01][0-9]00)? ?\(?(([BCEGIMPW][DEMS]S?|U)T)?\)?^^
        { RECIPE = "${RECIPE:+$RECIPE }UBE_H-01.H" }
     }


     :0  # check year conformity in Date:
     * ^Date:\/[^:]+
     * ! MATCH ?? ()\/20[0-9][0-9]
     { RECIPE = "${RECIPE:+$RECIPE }UBE_H-02" }


     :0  # blank Reply-To: AND `$' in body
     * ^Reply-To:[^a-z]+$
     * B ?? \\$
     { RECIPE = "${RECIPE:+$RECIPE }UBE_H-03-X" }


     :0  # typical spammer ploy - empty Return-Path: . . .
     * $ ^Return-Path:[$WHITESPACE]*<>$
     {
        :0  # . . . and missing To: OR missing or empty From:
        *   ^To:
        * $ ^From:.*[^$WHITESPACE<>]
        { }
        :0 E
        { RECIPE = "${RECIPE:+$RECIPE }UBE_H-04.FD" }
     }


     :0  # filter spam on known bogus header
     * ^Comments: Authenticated sender is
     { RECIPE = "${RECIPE:+$RECIPE }UBE_H-05" }


     # [a big snip of one that needs work]


     :0   # Asian time offset in Date:/Received: AND SPAMISH
     * !      RECIPE ?? UBE_F-04a\.H
     *        ^(Date|Received):.* \+0[89]00$
     *   B ?? remove request
     { RECIPE = "${RECIPE:+$RECIPE }UBE_H-07.X" }


     :0  # we're running a reverse DeMorgan here for convenience of grouping
     * $ ! ^X-Mailer:.*[^$WHITESPACE)([]a-z0-9._-]+$
     * $ ! ^X-Mailer:[$WHITESPACE]*NetMailer
     { }
     :0 E  # if the header has non-Western chars or reads `NetMailer', brand it
     * ! ^X-BeenThere:.*list
     { RECIPE = "${RECIPE:+$RECIPE }UBE_H-08-U" }
  }


# [my anti-virus major section comes next, cut here]
# [next is where we "brand" the messages and store them away from inbox]



  #  #  #  #  #  #  #  #  #  #  #  #  #  #  #
  #  Let's log, mark, and file our catch    #
  :0  # h implicit in nested braces
  * $ RECIPE ?? ^^(VIR|UBE)_([A-Z]-)?[A-Z0-9.-]+
  {
     LOG = "  ::: RECIPE-ID: >$RECIPE< $NL"

     :0 fhiw
     | formail -A"X-Recipe-ID: $RECIPE"

     :0:
     * RECIPE ?? VIR
     virus!

     :0 E:  # Else . . . (N.B.: if SPAMSNAG = /dev/null, remove the lock)
     $SPAMSNAG
  }

-------------------------

Okay, to tell you the truth, the above - which is part of Major Revision 4
of my .procmailrc, with version 1 having started in 1993 or 1994 -
has already grown long of tooth in my eyes, and I have had in mind another
major heuristical/algorithmic shift for about five months, but have
been too engrossed in other major projects to get to it.  The quest
is always for elegance and some sort of attempt at expressing
my epistemological understanding vis-à-vis Occam's Razor. :-)
(No, I'm never even close to satisfied that I've achieved any
such thing.)

Every spam recipe is "branded" and, if hit, gets reported in highly
visible/greppable format in my log.


> . . . .  The only way to really curtail spam is to restrict huge
> netblocks and not accept any bcc's.  Even then, a few things slip
> through.
> These potentially result in false positives.
>  
> I have a whitelist system implemented as well, but I don't currently
> accept mail only from whitelisted addresses as I don't consider it
> fair to make my friends jump through hoops just to send me email.

As you can see, I don't quite accept that premise.  I try to train
my .procmailrc to "see" what I see when I look as a human.  There
are some scoring recipes that I left out because I found recently
that I had a flawed understanding of the heuristic and they don't
quite work right; and I haven't gotten around to the gruntwork
I'll need to fix them.  I'm sort of saving that for Rev. 5, anyway,
which, as it is developing in my head, ends up dividing the mail
along programmatically simpler genera.  I should have written down
some of my middle-of-the-night brainstorming and elucidation on
this from some months ago, because I've since forgotten some of
my inspiration of that time for how Rev. 5 will look.  Oh, well.
I do know that one point was to have defined sets of suspect
patterns from each header section that I could combine in
complex mixtures at ease.

Let's talk about results.  I have about a 97% capture rate
right now, and I save the others to analyze and be able to
add improvements to the recipes.  I have very, very few
false-positives.  If I do get them, I find out why and fix
things, and the rate goes down further.

If there is a false-positive, the only thing that happens
is that the mail goes in my $SPAMSNAG file, but I will
see it anyway.  I don't /dev/null anything.  False-positives
are about one or two a week right now, and I get 100-150 emails
a day.

I get about 30 spams a day (and report every one).

Even with what I have now, I can grep the hits out of my log
(or look at the spam, since I brand it via formail in the headers)
and decide which recipes aren't getting hit often enough to be
worthwhile, etc.

HTH.

-- 
|) /\ |_ |_ |\/| /\ |\|   "My other .sig's a 5-liner"


From eli@panix5.panix.com Thu Feb 15 12:55:23 EST 2001
Article: 50467 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 15 Feb 2001 17:00:39 GMT
Organization: Some absurd concept
Lines: 76
Message-ID: <96h1vn$ce9$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96f0ga$md0$1@news.panix.com> 
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982256439 12745 166.84.0.230 (15 Feb 2001 17:00:39 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 15 Feb 2001 17:00:39 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50467

In panix.questions, Cliff Heller   wrote:
> I came up with a solution to identify the panix shared filters, but it is
> complex and prone to error if panix ever changes the way they implement 
> the shared filters.
> 
> It is also probably inefficient.

Not too much more than non-spam mail that goes through the filters.

> What I do is to mutate each panix filter from
> :0
> * condition
> $TRASH
> 
> to 
> 
> :0
> * condition1
> {
> SPAM=panix001
> }

Thing is, this will tell you only the last recipe that caught a piece
of mail.

> The advantage is that I set a variable that indicates 1) whether or not the 
> panix rules caught the spam, and 2) which rule caught it.

Yup. Also you can then just add a header to the mail indicating it
was caught rather than filtering.

> Strictly speaking it only indicates the last rule that caught it.  I'm
> thinking about a simple mod to not fire the rule if the SPAM variable
> already has a value.

	:0
	* ! SPAM ?? .
	* condition1
	{
	  SPAM=panix001
	}


I prefer the append-to-the-variable method:

	:0
	* condition1
	{
	  SPAM=$SPAM"panix001;"
	}
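
Once SPAM holds the name of every recipe that fired, tagging the mail
instead of filing it might look like the sketch below. The header name
X-Spam-Rules is made up for illustration, and the recipe assumes that
formail is on the PATH and that the earlier recipes append to SPAM as
shown above.

```procmail
# Hypothetical follow-up recipe: runs after all the SPAM=... recipes.
# `SPAM ?? .' succeeds only if SPAM contains at least one character.
:0 fhw
* SPAM ?? .
| formail -A "X-Spam-Rules: $SPAM"
```

Because this is a filter recipe (`f') on the header only (`h'), the
message carries on through the rest of the rcfile with the extra header
attached.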

> I accomplish this in a fairly creative manner.
> Early on in my personal rule set, I run the following rule:
> :0 w
> {
> LOG=`${PMDIR}/filterload.pl`"$NL"
> }
> 
> This simply executes filterload.pl and prints the result in the log.

You could have filterload print to STDERR to log stuff.

  IGNORE=`${PMDIR}/filterload.pl >&2`


I've considered doing something like your approach, but never felt
enough interest in trying panix's recipes.

Personally I feel that restricting whole netblocks is not needed,
but that spam filters do need to be customized to the mail one
should be getting. (Eg, being forgiving if email includes your
name or other personal info, and strict if it begins 'Dear Friend'.)

Elijah
------
seldom whitelists people


From eli@panix5.panix.com Thu Feb 15 12:57:12 EST 2001
Article: 50468 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 15 Feb 2001 17:45:47 GMT
Organization: Some absurd concept
Lines: 119
Message-ID: <96h4kb$d7k$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96f0ga$md0$1@news.panix.com>  <96glk7$8jn$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982259147 13556 166.84.0.230 (15 Feb 2001 17:45:47 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 15 Feb 2001 17:45:47 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50468

In panix.questions, Dallman Ross   wrote:
>   ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
>   ## Now we're "really" in the spam section ##
>   ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##

And a very impressive set of recipes, too.

>      :0  # hotmail address in FROM but no hotmail.com server in Received:
>      * FROM ?? hotmail
>      * ! $ ^Received:.*[$WHITESPACE]hotmail\.com
>      { RECIPE = "${RECIPE:+$RECIPE }UBE_A-01-B" }

I would never do something like that, since I have known enough
people that use Hotmail, or the like, for reading email but not
for sending it. (When you send mail through Hotmail, it adds
a header with your IP address, which upsets people who use it
for privacy.)

>      :0  # if not from common TLD . . .
>      * $ ! FROM ?? \.(com|net|org|de|co\.uk)\>?($WHITESPACE|$)
>      * ! (^TO|^Sender:.*)(list|track\.)
>      {
>         :0 B  # egrep the body for typical phrasing
>         * $ $SPAMISH

What is $SPAMISH set to?

>      :0  # high-S/N country-code in Received: . . .
>      * ^Received:.*\.\/(jp|ru)

What about this or the like?

	Received: from mail2.jpmorgan.com
		by localhost with POP3 (fetchmail-5.0.0)
		for user@host (single-drop);
		Wed, 03 Jan 2001 11:36:43 -0800 (PST)

It is times like this that a true zero-width assertion like
perl's \b or lookahead would be useful.

>          # Last edited 14-Oct-00: removed test for `.' from first condition
>      :0  # empty or malformed Message ID is bogus

These are tricky to count on.

>      * ! ^Message-ID:[^@]+@?>?$

I see one real message I got the last three months that breaks that.

>      * ! ^Message-ID:.*\.([0-9][0-9]+)?>$

I've got a couple of real ones that match that.

>      * ! ^Message-ID:[^0-9@]+$

That looks safe.

>      * ! ^Message-ID:[^.]+$

I've gotten plenty of real mail that breaks that.

>      :0 D  # all-caps in 3+ words of SUBJ . . .
>      *   -250^0
>      * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][a-z]+([$WHITESPACE]|$)

If you use [A-Z][a-z]+, then how does it catch all-caps?
Also, do you realize that if you had this:

       * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][A-Z]+([$WHITESPACE]|$)

Then a subject like:

	Subject: AAA BBB CCC DDD EEE

Would not trigger it, but

	Subject: AAA BBB CCC DDD EEE FFF

would. Since each match for scoring starts where the previous left
off (and ^^ won't match there[*]), so for ' AAA ' the trailing
whitespace that would have been the leading whitespace for BBB has
been eaten.

* With the version of procmail I'm using. This may be a bug that has
  been fixed in some other version.

>      :0  # typical spammer ploy - empty Return-Path: . . .
>      * $ ^Return-Path:[$WHITESPACE]*<>$

That could be a bounce, too.

>      {
>         :0  # . . . and missing To: OR missing or empty From:
>         *   ^To:
>         * $ ^From:.*[^$WHITESPACE<>]

Not likely for a bounce, but...

>      :0  # filter spam on known bogus header
>      * ^Comments: Authenticated sender is
>      { RECIPE = "${RECIPE:+$RECIPE }UBE_H-05" }

There is a real mail program that adds that. It also adds a

	X-mailer: Pegasus [version string]

header, though.

> Okay, to tell you the truth, the above - which is part of Major Revision 4
> of my .procmailrc, with version 1 having started in 1993 or 1994 -
> has already grown long of tooth in my eyes, and I have had in mind another
> major heuristical/algorithmic shift for about five months, but have
> been too engrossed in other major projects to get to it.

When you do it, please post it. Should make for good reading.

Elijah
------
now keeping a collection of procmail related posts in ~eli/procmail/posts


From eli@panix5.panix.com Mon Feb 19 14:38:23 EST 2001
Article: 50507 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 19 Feb 2001 19:35:12 GMT
Organization: Some absurd concept
Lines: 34
Message-ID: <96rshf$e13$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96k0b2$aah$1@news.panix.com> <96ku6j$sfi$1@panix6.panix.com> <96mik2$8t1$1@panix1.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982611312 14371 166.84.0.230 (19 Feb 2001 19:35:12 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 19 Feb 2001 19:35:12 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50507

In panix.questions, DWT  wrote:
> Third one: if there are no further rcfiles (or no rcfiles at all) named on
> procmail's command line, then instead of dropping the messages into /dev/null
> or a side door thereto, do something like this:
> 
>  :0 any requisite condition-related flags
>  * conditions, however many
>  { LOG="Reason
> " HOST }
> 
> When there are no more rcfiles on procmail's command line, unsetting HOST is
> a more efficient way to drop a message than filing it to /dev/null.

A couple of comments.

I use NL to hold a new line, so that I can just append a $NL to a log
line to terminate it without having the quotes end on the next line.
Much nicer for indentation reasons.
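
A minimal sketch of that convention follows. The Subject test and the
log text are placeholders; only the NL assignment is the point.

```procmail
# NL holds a single newline.  This is the one place where the
# closing quote has to sit on the following line.
NL="
"

# Any later LOG line can then end with $NL and stay on one line.
:0
* ^Subject:.*test
{ LOG="matched test subject$NL" }
```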

Second, that HOST variable idea is (in my opinion) an abuse of a rather
obscure feature of procmail. Better to set the DELIVERED variable to
'yes'. 

(The HOST variable was intended so that you could invoke procmail
with several RC files, each intended for a different machine. Assigning
to the variable causes a 'if this string matches the current value of
HOST, continue processing this file, otherwise move on to the next RC
file' action. Useful, eg, if you want to set up your procmailrc to
work differently on panix5 versus other panix machines for 
.forwardonintel testing.)
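
As a rough sketch of that intended use, a machine-specific rcfile could
begin as below. The file name rc.panix5, the hostname string, and the
folder name are assumptions for illustration; the assigned value has to
equal whatever procmail considers the machine's hostname.

```procmail
# rc.panix5 (hypothetical file name): meant to run only on panix5.
# If the assigned value is not this machine's hostname, procmail
# stops reading this rcfile here and moves on to the next rcfile
# named on its command line, if there is one.
HOST=panix5.panix.com

# Recipes below this point are only reached on the matching machine.
:0:
* ^Subject:.*test
testbox
```

With a second, general rcfile named after this one on the command line,
every other machine skips straight to that second file.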

Elijah
------
knows what the 0 of :0 means, too


From dattier@panix.com Mon Feb 19 18:51:39 EST 2001
Article: 50509 of panix.questions
Path: news.panix.com!panix1.panix.com!not-for-mail
From: dattier@panix.com (DWT)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 19 Feb 2001 17:07:56 -0600
Organization: Pan's Alternative Nightlife In Xanadu
Lines: 36
Message-ID: <96s90c$df4$1@panix1.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96ku6j$sfi$1@panix6.panix.com> <96mik2$8t1$1@panix1.panix.com> <96rshf$e13$1@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 982624076 19122 166.84.0.226 (19 Feb 2001 23:07:56 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 19 Feb 2001 23:07:56 GMT
Xref: news.panix.com panix.questions:50509

eli@panix5.panix.com (B. Elijah Griffin) wrote in
<96rshf$e13$1@news.panix.com>:

| I use NL to hold a new line, so that I can just append a $NL to a log
| line to terminate it without having the quotes end on the next line.

Yes, that's already been illustrated.  I prefer the literal newline.

| Second, that HOST variable idea is (in my opinion) an abuse of a rather
| obscure feature of procmail. Better to set the DELIVERED variable to 'yes'. 

How can something that doesn't accomplish the job be "better"?  Setting
DELIVERED to `yes' tells procmail to fake an orgasm to the MTA to pretend
that all is well so that the MTA will not return a bounce even if EXITCODE
is non-zero, but it does not in and of itself prevent delivery.

Did you actually try it?  I did.

As long as there are no more rcfiles on the command line unsetting HOST
does lose the message, and it loses it more efficiently than writing it to
/dev/null.

| The HOST variable was intended so that you could invoke procmail
| with several RC files, each intended for a different machine.

And yet, if you misset or unset HOST during the last rcfile on the command
line, procmail drops the message undelivered and exits.  There must have been
a reason for that as well.  Stephen could have designed procmail to write the
message to $ORGMAIL when that happens, but he chose otherwise.  It's part of
the way the HOST variable was designed to work from the beginning.  I would
not call using the feature "abuse."

| knows what the 0 of :0 means, too

So do lots of us.



From eli@panix5.panix.com Wed Feb 21 13:53:01 EST 2001
Article: 50524 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail Q: spam w/o "To:" ?
Date: 21 Feb 2001 18:52:15 GMT
Organization: Some absurd concept
Lines: 19
Message-ID: <9712ov$fj3$1@news.panix.com>
References: <96udu9$d80$1@news.panix.com> <96ugjb$e3n$1@news.panix.com> <96v0nt$in4$3@news.panix.com> <96v847$428$1@panix1.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 982781535 15971 166.84.0.230 (21 Feb 2001 18:52:15 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 21 Feb 2001 18:52:15 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50524

In panix.questions, Thor Lancelot Simon  wrote:
> Regular expressions in Procmail have an implicit terminal '$', don't they?

Does grep/egrep?

Why should procmail?

Greedy matching is much slower than non-greedy (in the author's tests,
at least):

     Because this speeds up the search by an order of  magnitude,
     the procmail internal egrep will always search for the left-
     most shortest match, unless it is determining what to assign
     to  MATCH,  in  which  case it searches the leftmost longest
     match.
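
To make the point concrete, here is a small sketch of the difference
between an unanchored condition and one with an explicit `$'. It
assumes NL has been set to a newline elsewhere in the rcfile, and the
log strings are placeholders.

```procmail
# Unanchored: succeeds for any message that has a To: header at all,
# whatever follows the colon.
:0
* ^To:
{ LOG="has a To: header$NL" }

# Anchored on purpose: a To: header with nothing but blanks after it.
# The bracket holds one space and one tab.
:0
* ^To:[ 	]*$
{ LOG="To: header is empty$NL" }
```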

Elijah
------
thinks that message is in the wrong manpage (procmailsc instead of procmailrc)


From dman+news@wecontrolthevertical.com Fri Mar  2 16:18:56 EST 2001
Article: 50679 of panix.questions
Path: news.panix.com!not-for-mail
From: Dallman Ross 
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 2 Mar 2001 11:04:29 GMT
Organization: Res Ipsa Loquitur
Lines: 198
Message-ID: <97nunt$2sq$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96f0ga$md0$1@news.panix.com>  <96glk7$8jn$1@news.panix.com> <96h4kb$d7k$1@news.panix.com>
Reply-To: dman+news@wecontrolthevertical.com
NNTP-Posting-Host: panix2.panix.com
X-Trace: news.panix.com 983531069 2970 166.84.0.227 (2 Mar 2001 11:04:29 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 2 Mar 2001 11:04:29 GMT
X-Opinion: Lex axilla justitiae
User-Agent: tin/1.4.4-20000803 ("Vet for the Insane") (UNIX) (NetBSD/1.4.3 (i386))
Xref: news.panix.com panix.questions:50679

B. Elijah Griffin  spake thusly:

> In panix.questions, Dallman Ross   wrote:

>>   ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
>>   ## Now we're "really" in the spam section ##
>>   ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##

> And a very impressive set of recipes, too.

Thanks, Eli.  I've been ignoring your response for as long as I could.
That wasn't because I didn't appreciate it!  Rather, it was because
it was going to take a bit of grunt work to respond as I wanted to,
and I had neither the time nor the extra attention span to spare for
about a month.  Luckily, we have loooong expires here in the local
panix groups, so I just kept marking this unread.


>>      :0  # hotmail address in FROM but no hotmail.com server in Received:
>>      * FROM ?? hotmail
>>      * ! $ ^Received:.*[$WHITESPACE]hotmail\.com
>>      { RECIPE = "${RECIPE:+$RECIPE }UBE_A-01-B" }

> I would never do something like that, since I have known enough
> people that use Hotmail, or the like, for reading email but not
> for sending it. (When you send mail through Hotmail, it adds
> a header with your IP address, which upsets people who use it
> for privacy.)

Okay, I'll keep that in mind for my next revision.  It hadn't
occurred to me that someone might actually want to use hotmail
as a primary address while having the ability to send out from
another SMTP.  So far, I've not had anybody send me mail that
way, and I get an awful lot of mail.  But it's definitely a
thought, a possible problem, and I'll keep it in mind.
I may have to use the hotmail test as a score that pushes
in the direction of deciding the mail is spam, but not actually
use this one alone for the decisive conclusion.

>>      :0  # if not from common TLD . . .
>>      * $ ! FROM ?? \.(com|net|org|de|co\.uk)\>?($WHITESPACE|$)
>>      * ! (^TO|^Sender:.*)(list|track\.)
>>      {
>>         :0 B  # egrep the body for typical phrasing
>>         * $ $SPAMISH

> What is $SPAMISH set to?

It's a bit hard even for me to parse anymore, as it's grown
into a monster over many months.  I think it's probably time
to rebuild it modularly from smaller bites of variable
assignments.  But right now it looks like this:

 SPAMISH="((^|(To be|Please) )(.*mailto:.*)?removed?\\
           (.*>| .*([@:]|(from|please|send))|$)|\\
           (cannot be considered spam|This (ad|message) \\
            is (being )?sent (to you )?in compliance))"       # for B egrep

I've been trying to find the right balance between tight enough
to avoid false positives and loose enough to be useful at
catching lots of things.


>>      :0  # high-S/N country-code in Received: . . .
>>      * ^Received:.*\.\/(jp|ru)

> What about this or the like?

> 	Received: from mail2.jpmorgan.com
> 		by localhost with POP3 (fetchmail-5.0.0)
> 		for user@host (single-drop);
> 		Wed, 03 Jan 2001 11:36:43 -0800 (PST)

> It is times like this that a true zero-width assertion like
> perl's \b or lookahead would be useful.

Good point.  I could put a "\>" word-end marker after the expression.
However, all that happens if mail is ID'd here is that it gets
subjected to some further probes.  So the jpmorgan.com server's mail
to me would be (perhaps needlessly) body-egrepped.  It won't
in general cause gobs of excess CPU cycles or do any damage in
my use of it.  But I would rather have it be more exact, so will
insert a fix based on your observation.  Thanks.

>>          # Last edited 14-Oct-00: removed test for `.' from first condition
>>      :0  # empty or malformed Message ID is bogus

> These are tricky to count on.

>>      * ! ^Message-ID:[^@]+@?>?$
> I see one real message I got the last three months that breaks that.
>>      * ! ^Message-ID:.*\.([0-9][0-9]+)?>$
> I've got a couple of real ones that match that.
>>      * ! ^Message-ID:[^0-9@]+$
> That looks safe.
>>      * ! ^Message-ID:[^.]+$
> I've gotten plenty of real mail that breaks that.

I'd really enjoy it if you could find time to send me some actual
mail samples that violate these.  I haven't seen any false positives
from that condition set in many months.


>>      :0 D  # all-caps in 3+ words of SUBJ . . .
>>      *   -250^0
>>      * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][a-z]+([$WHITESPACE]|$)

> If you use [A-Z][a-z]+, then how does it catch all-caps?

I must have copied the comment from another recipe and forgotten
to change it.  I think it's supposed to say "leading" where it
says "all."


> Also, do you realize that if you had this:

>        * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][A-Z]+([$WHITESPACE]|$)

> Then a subject like:

> 	Subject: AAA BBB CCC DDD EEE

> Would not trigger it, but

> 	Subject: AAA BBB CCC DDD EEE FFF

> would. Since each match for scoring starts where the previous left
> off (and ^^ won't match there[*]), so for ' AAA ' the trailing
> whitespace that would have been the leading whitespace for BBB has
> been eaten.

> * With the version of procmail I'm using. This may be a bug that has
>   been fixed in some other version.

Whoops, let me pause and go refill my coffee cup, as the pani(c|x)
alarm just went off telling me we're out of range of my current
ability to concentrate and think this through. :-)  Back in a
sec. . . .

  No, actually, I don't see it.
Better said, I see your point about leading whitespace being
eaten, and that does help me understand why a couple of my
scoring recipes end up working the way they do (or why a
couple of my beta scoring recipes don't work the way I
thought they should).  So that was helpful.  But in the
case you cite, won't the `$' marker match at the end of ` EEE'?


>>      :0  # typical spammer ploy - empty Return-Path: . . .
>>      * $ ^Return-Path:[$WHITESPACE]*<>$

> That could be a bounce, too.

Up above in my rc, in a section I didn't quote, I've tried to
identify and pull out already any legit bounces.  The mail
getting sifted through in the spam section has already passed
muster up there.  So in my algorithm it shouldn't be a bounce;
though it's a useful observation in general in case anyone
is attempting to employ some of these recipes.


>>      {
>>         :0  # . . . and missing To: OR missing or empty From:
>>         *   ^To:
>>         * $ ^From:.*[^$WHITESPACE<>]

> Not likely for a bounce, but...

>>      :0  # filter spam on known bogus header
>>      * ^Comments: Authenticated sender is
>>      { RECIPE = "${RECIPE:+$RECIPE }UBE_H-05" }

> There is a real mail program that adds that. It also adds a

> 	X-mailer: Pegasus [version string]

> header, though.

Okay, I'll add a test for Pegasus (which is a nice MUA, imho).
But would a legit mail coming through Pegasus be likely to have
either no To: or a missing/empty From: header?

>> Okay, to tell you the truth, the above - which is part of Major Revision 4
>> of my .procmailrc, with version 1 having started in 1993 or 1994 -
>> has already grown long of tooth in my eyes, and I have had in mind another
>> major heuristical/algorithmic shift for about five months, but have
>> been too engrossed in other major projects to get to it.

> When you do it, please post it. Should make for good reading.

Will do.

> now keeping a collection of procmail related posts in ~eli/procmail/posts

Good idea.  Thanks again for the comments.

-- 
dman (whose "fixit" temp file has grown to 800K of sample mail now.) :-(


From eli@panix5.panix.com Fri Mar  2 17:30:13 EST 2001
Article: 50707 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail log file
Date: 2 Mar 2001 22:29:35 GMT
Organization: Some absurd concept
Lines: 92
Message-ID: <97p6se$mok$1@news.panix.com>
References: <96eth9$6dg$1@panix3.panix.com> <96glk7$8jn$1@news.panix.com> <96h4kb$d7k$1@news.panix.com> <97nunt$2sq$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 983572175 23316 166.84.0.230 (2 Mar 2001 22:29:35 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 2 Mar 2001 22:29:35 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50707

In panix.questions, Dallman Ross   wrote:
> B. Elijah Griffin  spake thusly:
> Thanks, Eli.  I've been ignoring your response for as long as I could.
> That wasn't because I didn't appreciate it!  Rather, it was because

I fully understand.

> >>      :0  # high-S/N country-code in Received: . . .
> >>      * ^Received:.*\.\/(jp|ru)
> > What about this or the like?
> > 	Received: from mail2.jpmorgan.com
> But I would rather have it be more exact, so will
> insert a fix based on your observation.  Thanks.

Whenever I look at procmail recipes I look for that sort
of error. Often times it is made by people who don't understand
regular expressions. Sometimes it is made by people who are
lazy or don't care. (Tom Christiansen blocked 'usa.net' in a
way that prevented me from emailing him when I had a 'netusa.net'
address. Near as I can tell, he fell into the 'don't care'
category.) Sometimes, and I think you fall in this category, it
is just a matter of not thinking carefully about accidental matches.

[message ID header checks that I thought bad]
> I'd really enjoy it if you could find time to send me some actual
> mail samples that violate these.  I haven't seen any false positives
> from that condition set in many months.

hgrep is a good tool for finding these things. I've pulled a
bunch from a moderated mailing list I subscribe to (BUGTRAQ)
and will send them to you.

> > Also, do you realize that if you had this:
> >        * $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][A-Z]+([$WHITESPACE]|$)
> > Then a subject like:
> > 	Subject: AAA BBB CCC DDD EEE
> > Would not trigger it, but
> > 	Subject: AAA BBB CCC DDD EEE FFF
> > would. Since each match for scoring starts where the previous left
> > off (and ^^ won't match there[*]), so for ' AAA ' the trailing
> > whitespace that would have been the leading whitespace for BBB has
> > been eaten.
>   No, actually, I don't see it.
> Better said, I see your point about leading whitespace being
> eaten, and that does help me understand why a couple of my
> scoring recipes end up working the way they do (or why a
> couple of my beta scoring recipes don't work the way I
> thought they should).  So that was helpful.  But in the
> case you cite, won't the `$' marker match at the end of ` EEE'?

'$' matches \n in procmail. (procmail has *NO* zero width assertions,
internally a \n is added to the beginning and end of the message so
that '^' and '$' will work there.)

I guess I should mention that I am assuming the SUBJ does not
have that final \n in it, because you probably captured it like
this:
	:0
	* ^Subject:\/.*
	{
	  SUBJ=$MATCH
	}

So, for
	* $  100^1 SUBJ ?? (^^|[$WHITESPACE])[A-Z][A-Z]+([$WHITESPACE]|$)
Applied to
	Subject: AAA BBB CCC DDD EEE

It will be consumed like this:
	         AAA BBB CCC DDD EEE
	        11111   22222   3333
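
The same effect shows up with any matcher that resumes where the
previous match ended.  A sketch with grep -o, which is only an
analogy for procmail's scorer (an assumption), and which sidesteps
the ^^ and $ question by padding a made-up subject with a space on
each side.  count_caps is a hypothetical helper.

```shell
# Count non-overlapping " WORD " matches.  Each match consumes both of
# its spaces, so the space between two words can serve only one of them.
count_caps() {
    printf '%s\n' "$1" | grep -oE ' [A-Z][A-Z]+ ' | wc -l | tr -d ' '
}

count_caps ' AAA BBB CCC DDD '       # single spaces: every other word is missed
count_caps ' AAA  BBB  CCC  DDD '    # doubled spaces: every word has its own pair
```

With single spaces only AAA and CCC are counted; with doubled spaces
all four words are.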

> Okay, I'll add a test for Pegasus (which is a nice MUA, imho).
> But would a legit mail coming through Pegasus be likely to have
> either no To: or a missing/empty From: header?

Empty To violates spec, as I recall, but some programs will allow
it (with addresses in CC only). Don't know if Pegasus is one of those.

> > now keeping a collection of procmail related posts in ~eli/procmail/posts
> Good idea.  Thanks again for the comments.

Your post was added to it. Maybe I'll put them in a web archive...

> dman (whose "fixit" temp file has grown to 800K of sample mail now.) :-(

I'm just going to send you the headers, not the whole messages,
but my samples total about 18k (uncompressed).

Elijah
------
hopes you can deal with uuencoded gzip ar archives


From eli@panix5.panix.com Tue Mar  6 13:00:00 EST 2001
Article: 50745 of panix.questions
Path: news.panix.com!panix5.panix.com!eli!not-for-mail
From: eli@panix5.panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: Procmail question
Date: 6 Mar 2001 17:59:38 GMT
Organization: Some absurd concept
Lines: 30
Message-ID: <9838ia$aj7$1@news.panix.com>
References: <9837rj$abi$1@news.panix.com>
NNTP-Posting-Host: panix5.panix.com
X-Trace: news.panix.com 983901578 10855 166.84.0.230 (6 Mar 2001 17:59:38 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 6 Mar 2001 17:59:38 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:50745

In panix.questions, Michael Roach  wrote:
> :0:
> * ^From:.*techrepublic*>
> TECH

Why that instead of just this?

  :0:
  * ^From:.*techrepublic
  TECH

The "ic*>" bit in yours will try to match an 'i' followed by zero
or more 'c's followed by a '>'. You are not going to find that in
a header like this:

> From: Perl Tips at TechRepublic.com 

I suspect you were trying to enter one of these:

  * ^From:.*techrepublic.*>
or
  * ^From:.*techrepublic\>

But I don't think there is an advantage to either of those. (".*>"
will match zero or more of any non-newline chars followed by a '>';
"\>" is the same as perl's \W, match a non-alphanumeric.)

Elijah
------
perl's \W *without locale settings* that is
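
The difference between the posted condition and the simplified one
can be checked with grep -Ei as a rough stand-in for procmail's
matcher (an assumption; the "\>" form is left out because grep treats
it differently).  The header and its address are made up, and
matches() is a hypothetical helper.

```shell
# Hypothetical helper: succeed if header line $2 matches ERE $1, ignoring case.
matches() { printf '%s\n' "$2" | grep -Eiq -- "$1"; }

hdr='From: Perl Tips at TechRepublic.com <tips@example.com>'

for re in '^From:.*techrepublic*>' '^From:.*techrepublic' '^From:.*techrepublic.*>'; do
    if matches "$re" "$hdr"; then r='match'; else r='no match'; fi
    printf '%-28s %s\n' "$re" "$r"
done
```

Only the first pattern fails: after "techrepubli" and any run of
"c", the next character is "." rather than ">".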


From eli@panix.com Thu May  3 20:05:24 EDT 2001
Article: 10162 of panix.upgrade
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.upgrade
Subject: Re: Grid.net
Date: 2 May 2001 20:02:46 GMT
Organization: Some absurd concept
Lines: 17
Message-ID: <9cpp56$rnm$1@news.panix.com>
References: 
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 988833766 28406 166.84.0.226 (2 May 2001 20:02:46 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 2 May 2001 20:02:46 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10162

In panix.upgrade, ja  wrote:
> #Panix staff addresses received at least 82 in two days from
> #grid.net in 4.01; all IPs are from UU.net's dialup pools.
> #not all servers insert brackets; hence the question mark to
> #ensure that a match occurs.
> :0:
> * ^Received:.*\[?63\.49\.
> $TRASH

The others looked okay, but since this one only matches two numbers
the scope for mismatching is huge.

	Received: from mail.internal.foo.net [10.163.49.94] ...

Elijah
------
admits he found no non-grid.net match in his own year 2001 mail for that RE


From eli@panix.com Thu May  3 20:05:38 EDT 2001
Article: 10165 of panix.upgrade
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.upgrade
Subject: Re: Grid.net
Date: 2 May 2001 22:00:44 GMT
Organization: Some absurd concept
Lines: 33
Message-ID: <9cq02c$ol$1@news.panix.com>
References:  <9cpp56$rnm$1@news.panix.com>  <9cpuec$d29$1@panix2.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 988840844 789 166.84.0.226 (2 May 2001 22:00:44 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 2 May 2001 22:00:44 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10165

In panix.upgrade, Ayana Craven  wrote:
> ja  wrote:
>> B. Elijah Griffin wrote:
>>> ja  wrote:
>>>> * ^Received:.*\[?63\.49\.
>>> The others looked okay, but since this one only matches two numbers
>>> the scope for mismatching is huge.
>> Am I missing a subtlety here?
> 
> What's specified is (beginning of line)Received:(zero or more
> characters)(maybe a [)63.49.

Yes.

> which would, I think, match
> Received: from mail.internal.foo.net [###.63.49.##] 
> Received: from mail.internal.foo.net [###.163.49.##] 
> Received: from mail.internal.foo.net [163.49.##.##] 
> Received: from mail.internal.foo.net [63.49.##.##] 

And more.

> The last one is the one that we *want* to catch.
> 
> I think what's needed is to specify that the "63" is preceded by
> either a "[" or " ".  Then again, I could be wrong -- I'm hardly one

That sounds like a good start. Maybe "(" too. I'm not sure what ja is
seeing before the IPs, but it should definitely be [^a-z0-9.-] or tighter.
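
A sketch of what that tightening buys, using grep -Ei as a stand-in
for procmail's matcher (an assumption).  matches() is a hypothetical
helper and the headers are made up from the examples quoted above.

```shell
# Hypothetical helper: succeed if header line $2 matches ERE $1, ignoring case.
matches() { printf '%s\n' "$2" | grep -Eiq -- "$1"; }

loose='^Received:.*\[?63\.49\.'
# The character before "63" must not be one that could be part of a
# longer number or hostname.
tight='^Received:.*[^a-z0-9.-]63\.49\.'

for ip in 63.49.1.2 10.163.49.94 10.63.49.12 163.49.1.2; do
    hdr="Received: from mail.internal.foo.net [$ip]"
    if matches "$loose" "$hdr"; then l='match'; else l='no match'; fi
    if matches "$tight" "$hdr"; then t='match'; else t='no match'; fi
    printf '%-14s loose: %-9s tight: %s\n' "$ip" "$l" "$t"
done
```

The loose pattern matches all four addresses; the tight one matches
only 63.49.1.2.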

Elijah
------
still examining the charset/language filter issue


From eli@panix.com Thu May  3 20:05:48 EDT 2001
Article: 10169 of panix.upgrade
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.upgrade
Subject: Re: Grid.net
Date: 2 May 2001 22:26:02 GMT
Organization: Some absurd concept
Lines: 40
Message-ID: <9cq1hq$1ae$1@news.panix.com>
References:   <9cpuec$d29$1@panix2.panix.com> 
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 988842362 1358 166.84.0.226 (2 May 2001 22:26:02 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 2 May 2001 22:26:02 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10169

In panix.upgrade, ja  wrote:
> From the man page:
> 
> Extended regular expressions
> 
> a?        Either zero or one a.
> 
> In the case under discussion:
> 
> [?	  Either zero or one [.
> 
> Restated:
> 
> Either zero or one and only one [.
...
> But the question mark causes the procmail internal egrep to look at the
> immediately preceding character only. Characters after the question
> mark aren't considered by the procmail internal egrep.

The question mark causes the procmail internal egrep to match one
instance of the immediately preceding character, and only if it needs
to. Unless you
are capturing (used a "\/" sequence) procmail's egrep is non-greedy
since that is much faster. 

Since you have a ".*" followed by a "[?", procmail will expand that
"." as needed until it can get the later constraints ("[?" and
"63\.49\.") to match. But since the ? is non-greedy, and since the
"." can match "[", your regexp would be functionally identical if
you omitted the "[?".
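
That equivalence is easy to spot-check.  A sketch with grep -Ei
standing in for procmail's matcher (an assumption; whether a line
matches at all does not depend on greediness).  matches() is a
hypothetical helper and the sample headers are made up.

```shell
# Hypothetical helper: succeed if header line $2 matches ERE $1, ignoring case.
matches() { printf '%s\n' "$2" | grep -Eiq -- "$1"; }

with='^Received:.*\[?63\.49\.'
without='^Received:.*63\.49\.'

same=yes
for hdr in \
    'Received: from a.example.net [63.49.1.2]' \
    'Received: from b.example.net (63.49.1.2)' \
    'Received: from c.example.net [10.163.49.94]' \
    'Received: from d.example.net [10.1.2.3]'
do
    if matches "$with" "$hdr"; then a=1; else a=0; fi
    if matches "$without" "$hdr"; then b=1; else b=0; fi
    [ "$a" -eq "$b" ] || same=no
    printf 'with=%s without=%s  %s\n' "$a" "$b" "$hdr"
done
echo "same results on every line: $same"
```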


> > I think what's needed is to specify that the "63" is preceded by
> > either a "[" or " ".
> Which is exactly what I was trying to do.

Regexps benefit from extra precision.

Elijah
------
procmail REs are odd in their own ways


From eli@panix.com Thu May  3 20:05:53 EDT 2001
Article: 10174 of panix.upgrade
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.upgrade
Subject: Re: Grid.net
Date: 3 May 2001 23:07:24 GMT
Organization: Some absurd concept
Lines: 87
Message-ID: <9csobc$pc8$1@news.panix.com>
References:   <9cq1hq$1ae$1@news.panix.com> 
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 988931244 25992 166.84.0.226 (3 May 2001 23:07:24 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 3 May 2001 23:07:24 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10174

In panix.upgrade, ja  wrote:
> On 2 May 2001 22:26:02 GMT, B. Elijah Griffin wrote:
> > But since the ? is non-greedy, and since the
> > "." can match "[", your regexp would be functionally identical if
> > you omitted the "[?".
> It is just that I thought that the * only referred to characters
> preceding the *; with something like an \ or any other character
> ending the regex engine's attempt to match.

Yes, you are right about *. At issue is what takes precedence when
there are multiple maybe-matches in a row. Since the ".*" in
your RE could match a string like 'foo.bar [123.', the \[? is
not useful. It will just match '' and the RE engine goes on.

> Now I will spend my weekend figuring out why. Maybe this problem is
> simpler than I thought; or it may be that the egrep implementation in
> procmail is different from the one discussed in the O'Reilly
> book. Most likely, however, it is just a faulty comprehension on my
> part.

The procmail egrep implementation is compatible only with very old
egreps. It matches in a leftmost-shortest way for speed.

About leftmost-shortest:

The 'leftmost' bit means that when there is alternation that can match
multiple ways like:

	(.*foo|.*bar|.*qux)

Applied to:

	Received: from foobar.qux.net [209.15.15.132]

The leftmost part will match.

The 'shortest' bit means that for each 'match zero or more', 'match
one or more', 'match zero or one' metacharacter encountered, the RE
will try matching the shortest possible way that allows it to move
on. 

Example:

RE:
	^Received:.*\[?63\.49\.

Text:
	Received: from notquite.gridnet.co [10.63.49.12]

For that '.*' the RE engine will iteratively try '', ' ', ' f', ' fr',
' fro', ' from', ... until the rest of the line matches. Since it
will find a match for .* == ' from notquite.gridnet.co [10.' and
\[? == '', the line will match.
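
The iteration just described can be mimicked in plain shell.  This is
a sketch of the idea only, not of procmail's actual engine:
shortest_prefix is a hypothetical function that grows the '.*' part
one character at a time and stops as soon as the rest of the line
begins with an optional "[" followed by "63.49.".

```shell
# Print the shortest text that '.*' has to cover in
#   ^Received:.*\[?63\.49\.
# for the header line given as $1, or fail if the line cannot match.
shortest_prefix() {
    case $1 in
        Received:*) ;;
        *) return 1 ;;
    esac
    rest=${1#Received:}
    prefix=''
    while :; do
        case $rest in
            63.49.*|'['63.49.*) printf '%s\n' "$prefix"; return 0 ;;
            '') return 1 ;;
        esac
        first=${rest%"${rest#?}"}    # first character of what is left
        prefix=$prefix$first
        rest=${rest#?}
    done
}

shortest_prefix 'Received: from notquite.gridnet.co [10.63.49.12]'
```

For the example header this prints " from notquite.gridnet.co [10.",
the same split as in the walk-through, with \[? left matching nothing.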

> Now: is the .*\[0\.0\.0\.0 a source of error in any possible case?

In the case of matching strings, possibly. In the case of matching
to-spec Received: headers, exceedingly unlikely. But RFC822 does
allow comment blocks in Received: headers, so there are no guarantees.
True REs cannot cope with RFC822 comment rules, so procmail's RE
engine cannot ignore comments.

> > procmail REs are odd in their own ways
> Jeez, that sounds frightful.

In general, procmail does not have the concept of a zero-width
match. ^ and $ (and ^^) all match actual characters, sometimes ones
which are put there just so matching will work right.

Also procmail lacks useful constructs like the {n,m} matching
limitation.

The leftmost-shortest rule could have odd effects on scoring recipes,
eg:

	:0 B
	* 1^1 .+
	{
	  Score=$=
	}

That will count non-newline bytes in the body. If the '+' were greedy
it would count non-empty lines in the body.
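
The two counts can be compared with grep -o, whose matching is
greedy.  A sketch, not procmail itself: matching a single '.' stands
in for the shortest possible '.+' match, and the three-line body is
made up.  count_chars and count_runs are hypothetical helpers.

```shell
body='ab

cde'

# One match per non-newline character, which is what a shortest-match
# '.+' amounts to.
count_chars() { printf '%s\n' "$1" | grep -o '.' | wc -l | tr -d ' '; }

# One greedy match per non-empty line.
count_runs() { printf '%s\n' "$1" | grep -oE '.+' | wc -l | tr -d ' '; }

echo "shortest-style count: $(count_chars "$body")"
echo "greedy count:         $(count_runs "$body")"
```

For this body the first count is 5 and the second is 2.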

Elijah
------
regular expressions are surprisingly difficult to grok


From eli@panix.com Wed May  9 14:27:54 EDT 2001
Article: 52188 of panix.questions
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 9 May 2001 16:42:26 GMT
Organization: Some absurd concept
Lines: 19
Message-ID: <9dbs1h$ec1$1@news.panix.com>
References:   <9da2t0$6c1$1@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989426546 14721 166.84.0.226 (9 May 2001 16:42:26 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 9 May 2001 16:42:26 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:52188

In panix.questions, Tom Betz  wrote:
> Quoth Rich Alderson  in
> :
> |[1] I've not figured out whether using procmail will preclude my using the
> |    automagic "+boxname" form of addressing which I consider one of the best
> |    things about moving to Panix; I think it will and I don't want to screw
> |    with it.
> It won't, unless you make a procmail recipe that specifically looks for and 
> does something special with plussed e-mail.

As far as I have been able to tell, the only way to reliably figure
out that mail was sent to +boxname with the Panix mail set up is to
have a .forward+boxname file that passes that info to procmail. This
is a tad awkward, since it requires a bunch of extra files if you
want to use it for multiple filtering purposes.

Elijah
------
should ask the postfix list if there is a trick around that


From eli@panix.com Wed May  9 16:45:47 EDT 2001
Article: 52210 of panix.questions
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 9 May 2001 20:45:14 GMT
Organization: Some absurd concept
Lines: 28
Message-ID: <9dca8q$j7m$1@news.panix.com>
References:  <9dbs1h$ec1$1@news.panix.com> <9dc2n6$lmj$1@panix1.panix.com> <9dc5r6$hbp$3@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989441114 19702 166.84.0.226 (9 May 2001 20:45:14 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 9 May 2001 20:45:14 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:52210

In panix.questions, Brett Frankenberger  wrote:
> (205) panix3:rbf [/net/u/1/r/rbf] > cat .forward+netcom
> "|IFS=' ';exec $HOME/.procmail/procmail.sh rbf+netcom || exit 75 #rbf"
> 
> (206) panix3:rbf [/net/u/1/r/rbf] > cat $HOME/.procmail/procmail.sh
> IFS=' '
> exec /usr/local/bin/procmail -t RCPT=$1 || exit 75

Why use two files? I presume you are reusing $HOME/.procmail/procmail.sh
for a bunch of .forward+* files, but each one of those could simply
set the -t and be done with it.
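
A sketch of the single-file form, assuming procmail lives at
/usr/local/bin/procmail as in the quoted script and that the rc file
reads RCPT.  forward_line is a hypothetical helper that only prints
the line to put in ~/.forward+BOX; it writes no files.

```shell
# Print a one-line .forward+BOX body that calls procmail directly.
# $1 = login name, $2 = box name (both supplied by the caller).
forward_line() {
    fmt="\"|IFS=' ' && exec /usr/local/bin/procmail -t RCPT=%s+%s || exit 75 #%s\"\n"
    printf "$fmt" "$1" "$2" "$1"
}

forward_line rbf netcom
```

Each .forward+BOX file still has to exist; this only removes the
intermediate script.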

> Of course, it's not as cool as: 
>    "|exec /usr/local/bin/procmail -a $0"
> which could be implemented as a single .forward and a bunch of links to

Yup. That would be nice.

> It would be nice if Panix would set an environment variable (when
> running programs from .forward) to the name of the mailbox to which
> delivery is being attempted ... (hint, hint)

Then you could use "procmail -p" and not have to worry about reading
$1 in procmail....

Elijah
------
-a arg comes into procmail as $1, but procmail won't use it in REs


From eli@panix.com Wed May  9 19:36:40 EDT 2001
Article: 52223 of panix.questions
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 9 May 2001 23:19:03 GMT
Organization: Some absurd concept
Lines: 18
Message-ID: <9dcj97$lvu$1@news.panix.com>
References:  <9dc5r6$hbp$3@news.panix.com> <9dca8q$j7m$1@news.panix.com> <9dci17$lgo$1@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989450343 22526 166.84.0.226 (9 May 2001 23:19:03 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 9 May 2001 23:19:03 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:52223

In panix.questions, Brett Frankenberger  wrote:
> B. Elijah Griffin  wrote:
[boxname in environment]
> >Then you could use "procmail -p" and not have to worry about reading
> >$1 in procmail....
[snip]
> >-a arg comes into procmail as $1, but procmail won't use it in REs
> 
> Presumably you could then do
>    SOMEVARIABLENAME=$1
> and you would then be able to use SOMEVARIABLENAME in an RE.

Yes. And that extra step is why the box in the environment and then
using "procmail -p" would be cleaner than "procmail -a boxname".

Elijah
------
procmail apparently can't take -a multiple times


From tbetz@pobox.com Thu May 10 17:21:14 EDT 2001
Article: 52232 of panix.questions
Path: news.panix.com!not-for-mail
From: tbetz@panix.com (Tom Betz)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 10 May 2001 02:24:38 GMT
Organization: Society for the Elimination of Junk Unsolicited Bulk Email
Lines: 206
Message-ID: <9dcu56$p14$1@news.panix.com>
References:  <9d8ols$m3u$1@news.panix.com> 
Reply-To: tbetz@pobox.com
NNTP-Posting-Host: panix3.panix.com
X-Trace: news.panix.com 989461478 25636 166.84.0.228 (10 May 2001 02:24:38 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 10 May 2001 02:24:38 GMT
X-No-Productlinks: yes
X-Fight-Junk-Email-URL: Join the fight against Junk EMail at .
X-Newsreader: trn 4.0-test74 (May 26, 2000)
Originator: tbetz@panix.com (Tom Betz)
Xref: news.panix.com panix.questions:52232


Quoth dmeyers@panix.com in :
|tbetz@panix.com (Tom Betz) writes:
|
|> |in RSS, DUL, RBL, or all three.
|> 
|> I'll take all three, thank you; but if I must be limited to
|> oneandonlyone, make it DUL, because I can implement (and have
|> already implemented) the rest in procmail, whereas I haven't figured
|> out how to implement DUL reliably in procmail.
|
|Care to elaborate?  Or, perhaps, post a recipe?

The key is this (stolen) perl code:
----- cut here -----
#!/usr/local/bin/perl -w
#
# rblchk
#
# Rblchk is a mail filter intended to catch spam by checking "Received"
# headers for invalid IP addresses or IP addresses recognized as sources
# of spam by inclusion in the MAPS Realtime Blackhole List.
#
# If rblchk thinks a message is spam, it will add a user-specified header
# line after the first defective "Received" header.  Otherwise it will
# add nothing.  In either case the message is passed through to standard
# output.
#
# This is based on a fairly detailed specification by Anne Bennett.
#
# Author: Michael Assels 
# Date:   v1.0  December  4, 1997
# Date:   v1.1  December 12, 1997
#
# See ChangeLog file in distribution for modification history.
#
# COPYRIGHT
#    Copyright (c) 1997 Concordia University.  All rights
#    reserved.  This program is free software; you may
#    redistribute it and/or modify it under the same terms as
#    Perl itself.
#

$MAPS          = '.blackholes.mail-abuse.org';
$OK            = '';
$InvalidIP     = '1 Invalid IP address ';
$RcvBlackHole  = '2 Received from RBL-registered spam site ';
$RlyBlackHole  = '3 Relayed through RBL-registered spam site ';

# *I* think rblchk's a nice name, but I can be anyone you like.
($myname = $0) =~ s#.*/##;

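# NOTE: the usage text and argument handling did not survive in this copy.
# Everything from here to the first read of <> is a reconstruction, inferred
# from the recipe below ("rblchk X-Antispam 10") and from how $spamheader,
# $timeout and $rcvCount are used later; it is not the author's original text.
$USAGE = <<EOF;
Usage: $myname header-name timeout-seconds
EOF

@ARGV == 2 || die $USAGE;
$spamheader = shift;                  # e.g. X-Antispam
$timeout    = shift;                  # seconds to wait for the DNS lookup
$rcvCount   = 0;                      # Received headers examined so far
$SIG{ALRM}  = sub { die "alarm\n" };  # so the eval in checkit can see a timeout

$_ = <>;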
defined($_) || exit;  # Don't bomb on empty input.
LOOP: {
   if ( /^$/ ) {      # A blank line means end of headers.
      print;
      last LOOP;
   }
   # Gather a complete header line with its continuation lines.
   local($header) = $_; 
   while ( <> ) {
      /^[ \t]/ || last;
      $header .= $_;
   }
   # Note: $_ now contains the line *after* $header
   print $header;
   if ( $header =~ /^Received:/i ) {  # Test any Received headers.
      local($tag) = &checkit($rcvCount,$header);
      if ( $tag ) {
         #
         # It's spam.  Tag it and get out of loop.
         #
         print "$spamheader: $myname: $tag\n";
         print;
         last LOOP;
      }
      $rcvCount++;  # Any further Received lines won't be the first.
   }
   last LOOP unless defined $_;
   redo LOOP;
}
#
# Pass everything else through.
#
print while <>;
exit;

#
# checkit: $relay is false on the first call, true on all others.
#          $rcvd is a "Received:" header.
#          Returns OK or an error code.
#
sub checkit {
   local($relay,$rcvd) = @_;
   local($IP,@IP) = $rcvd =~ /\[((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))\]/;
   local($name,$x);
   #
   # We can't complain if there's no IP address in this Received header.
   #
   return ($OK) unless defined $IP;
   #
   # Outer limits lose
   #
   return ($InvalidIP.$IP) if $IP eq '0.0.0.0';
   return ($InvalidIP.$IP) if $IP eq '255.255.255.255';
   #
   # All @IP components must be >= 0 and <= 255
   #
   foreach $x ( @IP ) {
      return ($InvalidIP.$IP) if $x > 255;
      return ($InvalidIP.$IP) if $x =~ /^0\d/;    # no leading zeroes allowed
   }
   #
   # Wrap the gethostbyname call with eval in case it times out.
   #
   eval {
      alarm($timeout);
      ($name) = gethostbyname(join('.',reverse @IP) . $MAPS);
      alarm(0);
   };
   return($OK) if $@ =~ /^alarm/;  # Timed out.  Let it through.
   return($OK) unless $name;       # If it's ok with MAPS, it's OK with us.
   return($relay ? $RlyBlackHole.$IP : $RcvBlackHole.$IP);
}
----- cut here -----

Change the value of $MAPS to check a different blacklist.
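
For reference, the name that checkit looks up is the IP's octets in
reverse order with the list's zone appended.  A sketch of just that
step in shell; dnsbl_name is a hypothetical helper, the address is a
documentation example, and no DNS query is made.

```shell
# Build the DNS name queried for a dotted-quad IP.
# $1 = IP address, $2 = zone with its leading dot (like $MAPS above).
dnsbl_name() {
    printf '%s\n' "$1" |
        awk -F. -v zone="$2" '{ print $4 "." $3 "." $2 "." $1 zone }'
}

dnsbl_name 192.0.2.1 .blackholes.mail-abuse.org
```

This prints 1.2.0.192.blackholes.mail-abuse.org; swapping the second
argument corresponds to changing $MAPS.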

The recipe calling this code is:

----- cut here -----
# Block RBLed domains
# Special return code
#   
# MAPS RBL check.
# Tag spam with "X-Antispam: rblchk: N reason"
# Timeout MAPS DNS queries in 10 seconds
:0 f
| rblchk X-Antispam 10

# Now junk any messages tagged as spam
:0 H
* ^X-Antispam:
{
   EXITCODE=65
   MSG="5.7.1 You are listed in the MAPS RBL -- PERMISSION DENIED"
   FILEID=$EMAILID
   :0 c
   $PMDIR/$FILEID.rbl.tmp

# Wrap the spam in explanatory text (RBLtop on top, .rbl.signature on the bottom)
# Bounce with RBL Error

   :0 c
   | ( $FORMAIL -rt -i"Subject: PERMISSION DENIED. Realtime Black List Domain IP In Header." \
   -i"Reply-To: tbetz+reply@panix.com" -A"X-Loop: Xenu Loves Me"  ;\
cat $PMDIR/RBLtop $PMDIR/$FILEID.rbl.tmp $PMDIR/bouncewarn $PMDIR/.rbl.signature ) | $SENDMAIL -oi -t ; rm -f $PMDIR/$FILEID.rbl.tmp

   :0
    rbl_junk.log
}
----- cut here -----

|Thanks very much!
|
|--d (thinking about just killing everything with anywhere in
|     asia or canada in a received header...)
|
|-- 
|dmeyers@panix.com
|
|Please don't use HTML in e-mail.  Here's how not to:
|http://www.geocities.com/CapitolHill/1236/nomime.html


-- 
|I always wanted to be someone,|            Tom Betz, Generalist               |
|but now I think I should have | Want to send me email? FIRST, READ THIS PAGE: |
|been a wee bit more specific. |  |
| "Fuck NANAE." -- Paul Vixie  | YO! MY EMAIL ADDRESS IS HEAVILY SPAM-ARMORED! |


From dattier@panix.com Thu May 10 17:25:02 EDT 2001
Article: 52239 of panix.questions
Path: news.panix.com!panix1.panix.com!not-for-mail
From: dattier@panix.com (DWT)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 10 May 2001 09:51:19 -0500
Organization: Customer of Panix
Lines: 27
Message-ID: <9de9t7$8r9$1@panix1.panix.com>
References:  <9dca8q$j7m$1@news.panix.com> <9dci17$lgo$1@news.panix.com> <9dcj97$lvu$1@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989506280 18269 166.84.0.226 (10 May 2001 14:51:20 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 10 May 2001 14:51:20 GMT
Xref: news.panix.com panix.questions:52239

eli@panix.com (B. Elijah Griffin) wrote in <9dcj97$lvu$1@news.panix.com>:

| Yes. And that extra step is why the box in the environment and then
| using "procmail -p" would be cleaner than "procmail -a boxname".

Then one runs the risk of telling procmail to trust the environment on a
message that originates locally.  Rather, we should do it like this.  Say
that the suffix is saved in environmental variable SUFFIX:

 "|IFS=' ' && exec /usr/local/bin/procmail SUFFIX=${SUFFIX-unset}"

to set the .procmailrc's $SUFFIX to the MTA's $SUFFIX, or

 "|IFS=' ' && exec /usr/local/bin/procmail -a \"${SUFFIX=unset}\""

to set the .procmailrc's $1 to the MTA's $SUFFIX, each way preserving null
values (although there would be no distinction between mail to
username+unset@panix.com and mail to user@panix.com) in case of mail to
username+@panix.com (alternatively, one can take care of mail to
username+@panix.com with instructions in ~/.forward+).
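
The "preserving null values" part comes from using the ${VAR-word}
form without a colon, which substitutes only when the variable is not
set at all.  A sketch in plain POSIX shell; suffix_arg is a
hypothetical helper and SUFFIX is the same assumed variable name as
above.

```shell
# Print the argument that would be handed to procmail.
suffix_arg() { printf 'SUFFIX=%s\n' "${SUFFIX-unset}"; }

unset SUFFIX;   suffix_arg    # SUFFIX=unset   (variable not set at all)
SUFFIX='';      suffix_arg    # SUFFIX=        (null value kept as null)
SUFFIX=netcom;  suffix_arg    # SUFFIX=netcom
```

${SUFFIX=unset} makes the same set-versus-null distinction but also
assigns the default to SUFFIX; the colon forms (${SUFFIX:-unset})
would replace a null value as well.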

| procmail apparently can't take -a multiple times

Stephen was talking about implementing that.  One can set more than one
positional parameter in conjunction with the -m option.  That does unset
ORGMAIL, which in turn has both good and bad effects to deal with.



From eli@panix.com Thu May 10 18:56:07 EDT 2001
Article: 52244 of panix.questions
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 10 May 2001 21:48:33 GMT
Organization: Some absurd concept
Lines: 46
Message-ID: <9df2bh$q0p$1@news.panix.com>
References:  <9dci17$lgo$1@news.panix.com> <9dcj97$lvu$1@news.panix.com> <9de9t7$8r9$1@panix1.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989531313 26649 166.84.0.226 (10 May 2001 21:48:33 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 10 May 2001 21:48:33 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:52244

In panix.questions, DWT  wrote:
> eli@panix.com (B. Elijah Griffin) wrote in <9dcj97$lvu$1@news.panix.com>:
> | Yes. And that extra step is why the box in the environment and then
> | using "procmail -p" would be cleaner than "procmail -a boxname".
> Then one runs the risks of telling procmail to trust the environment on a
> message that originates locally.  Rather, we should do it like this.  Say
> that the suffix is saved in environmental variable SUFFIX:

When is local mail not going to go through the MTA?

Sending mail to myself on panix1 without using a hostname (that's
what you mean by "originates locally" isn't it?) I see four environment
variables not added by procmail:

	USER=eli
	HOME=/net/u/3/e/eli
	PATH=/bin:/usr/bin:/usr/ucb
	SHELL=/usr/local/bin/ksh

I've got this forward file, show me that you can set a new variable
in it, if you can:

:r! cat ~/.forward+aaa
"|(/bin/cat - ; /usr/bin/env ; /bin/echo ) > /users/eli/aaa.mbox"

I've made aaa.mbox world readable, so you can check your results.

>  "|IFS=' ' && exec /usr/local/bin/procmail SUFFIX=${SUFFIX-unset}"
...
> values (although there would be no distinction between mail to
> username+unset@panix.com and mail to user@panix.com) in case of mail to
> username+@panix.com (alternatively, one can take care of mail to

Clever use of an "unset" value could probably save you. I don't
know what postfix allows, but I seem to recall that sendmail would
not like an '@' in a mbox name.

> Stephen was talking about implementing that.  One can set more than one
> positional parameter in conjunction with the -m option.  That does unset
> ORGMAIL, which in turn has both good and bad effects to deal with.

I run procmail with -m regularly.

Elijah
------
uses -m to avoid procmail processing /etc/procmailrc


From br@panix.com Thu May 10 18:56:09 EDT 2001
Article: 52246 of panix.questions
Path: news.panix.com!br
From: br@panix.com (Ben Rosengart)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 10 May 2001 21:57:37 GMT
Organization: PANIX Public Access Internet and UNIX, NYC
Lines: 11
Message-ID: 
References:  <9dci17$lgo$1@news.panix.com> <9dcj97$lvu$1@news.panix.com> <9de9t7$8r9$1@panix1.panix.com> <9df2bh$q0p$1@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989531857 26839 166.84.0.226 (10 May 2001 21:57:37 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 10 May 2001 21:57:37 GMT
User-Agent: slrn/0.9.7.0 (NetBSD)
Xref: news.panix.com panix.questions:52246

In article <9df2bh$q0p$1@news.panix.com>, B. Elijah Griffin wrote:
> 
> When is local mail not going to go through the MTA?

As I understand it, never.

-- 
Ben Rosengart
(212) 741-4400 x215

"My hair is my password.  Verify me."


From eli@panix.com Fri May 11 14:19:25 EDT 2001
Article: 52315 of panix.questions
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.questions
Subject: Re: some musings on spam
Date: 11 May 2001 18:16:58 GMT
Organization: Some absurd concept
Lines: 33
Message-ID: <9dhaaq$gbs$1@news.panix.com>
References:  <9da2t0$6c1$1@news.panix.com> <9dbs1h$ec1$1@news.panix.com> <9dh637$ecm$2@news.panix.com>
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 989605018 16764 166.84.0.226 (11 May 2001 18:16:58 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 11 May 2001 18:16:58 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.questions:52315

In panix.questions, Wotan  wrote:
> B. Elijah Griffin  posted:
> >As far as I have been able to tell, the only way to reliably figure
                                                      *^^^^^^^^*
> >out that mail was sent to +boxname with the Panix mail set up is to
> >have a .forward+boxname file that passes that info to procmail. This
> >is a tad awkward, since it requires a bunch of extra files if you
> >want to use it for multiple filtering purposes.
> 
> Procmail can find the + addressing without anything special.  

You failed to take into account the reliability aspect.

> :0:
> * $ LOGNAME\+\/something@
> $MATCH
> 
> Would look for the + address anywhere in the headers, then save such a 
> message to $MATCH == something

$ sendmail 'wotan+foo@panix.com','eli@panix.com'
To: Wotan 
CC: Wotan 
Subject: wotan+boxname@panix test message
Message-ID: 

Care to get procmail to figure out who this was for?
.
$
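
The point of that test message can be sketched outside procmail. The
following is a rough Python analogue, not anything from the thread: the
header addresses (example.com) and both function names are made up for
illustration, and the regexp is a generalized stand-in for the
LOGNAME\+\/something@ condition. Only the Subject line and the two
envelope recipients come from the sendmail command above.

```python
import re
from email import message_from_string

# Hypothetical stand-in for the test message. The To/Cc addresses are
# invented; the Subject and envelope recipients are from the post.
RAW_MESSAGE = (
    "To: Wotan <wotan@example.com>\n"
    "Cc: Wotan <wotan@example.com>\n"
    "Subject: wotan+boxname@panix test message\n"
    "\n"
    "Care to get procmail to figure out who this was for?\n"
)
ENVELOPE_RECIPIENTS = ["wotan+foo@panix.com", "eli@panix.com"]


def plus_suffix_from_headers(raw_message, logname):
    """Return the first +suffix for logname found anywhere in the headers."""
    msg = message_from_string(raw_message)
    headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
    found = re.search(re.escape(logname) + r"\+([^@\s]+)@", headers, re.IGNORECASE)
    return found.group(1) if found else None


def plus_suffix_from_envelope(recipients, logname):
    """Return the +suffix from the envelope address the MTA delivered to."""
    for address in recipients:
        found = re.fullmatch(re.escape(logname) + r"\+([^@]*)@.*", address, re.IGNORECASE)
        if found:
            return found.group(1)
    return None


print(plus_suffix_from_headers(RAW_MESSAGE, "wotan"))           # boxname
print(plus_suffix_from_envelope(ENVELOPE_RECIPIENTS, "wotan"))  # foo
```

Scanning the headers picks up "boxname" from the Subject line, while the
envelope says "foo". The headers never carry the envelope recipient, so
only something that sees the envelope (the MTA, via .forward+boxname)
can pass the right value along.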

Elijah
------
didn't actually send that message


From ja@panix.com Thu May 31 12:07:59 EDT 2001
Article: 52688 of panix.questions
Path: news.panix.com!ja
From: ja@panix.com (ja)
Newsgroups: panix.questions
Subject: Re: Why is the Seven Dwarfs virus eMail not filtered?
Date: 30 May 2001 20:51:45 GMT
Organization: PANIX Public Access Internet and UNIX, NYC
Lines: 39
Message-ID: 
References: <9f34kt$f3b$1@news.panix.com>
Reply-To: ja@panix.com
NNTP-Posting-Host: panix6.panix.com
X-Trace: news.panix.com 991255905 22223 166.84.0.231 (30 May 2001 20:51:45 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 30 May 2001 20:51:45 GMT
User-Agent: slrn/0.9.7.0 (NetBSD)
Xref: news.panix.com panix.questions:52688

In <9f34kt$f3b$1@news.panix.com>, Bob DeYoung said:
> I am using the standard Panix procmail spam filters and today I rec'd
> the infamous Seven Dwarfs eMail.  Since I always read my mail in shell,
> it's really no big deal since I can easily delete it before it could
> ever get to my hard drive, but I was just curious as to why this is not
> caught by the Panix procmail spam filters?
> 
> Thanks!
> 
> -Bob

This is in rc.body:

:0 B
* Today, Snowhite was turning 18. The (7|[Ss]even) Dwarfs
* Content-Type: APPLICATION/OCTET-STREAM;
NAME="(dwarf4you\.exe|midgets\.scr|joke\.exe|sexy.virgin\.scr)"
* Content-Disposition: ATTACHMENT;
FILENAME="(dwarf4you\.exe|midgets\.scr|joke\.exe|sexy.virgin\.scr)"
$TRASH

The condition lines are wrapped.

Most likely, the Content-Type: and/or Content-Disposition: lines
changed somehow.

If you are using the rc.body recipes, which it seems that you are,
then please send me the sample you received and I will take a look at
it.

Thanks,

ja

-- 
J. Altman		Radar Love? Too Loud? That's oxymoronic.
Panix Staff		Poetry and Wonder
(212) 741-4400
panix.com


From dman@walkerect.com Thu May 31 12:08:03 EDT 2001
Article: 52699 of panix.questions
Path: news.panix.com!panix.com!dman
From: Dallman Ross 
Newsgroups: panix.questions
Subject: Re: Why is the Seven Dwarfs virus eMail not filtered?
Date: 31 May 2001 09:39:24 GMT
Organization: Res Ipsa Loquitur
Lines: 32
Message-ID: <9f53gc$ej0$1@news.panix.com>
References: <9f34kt$f3b$1@news.panix.com> 
NNTP-Posting-Host: panix3.panix.com
X-Trace: news.panix.com 991301964 14944 166.84.0.228 (31 May 2001 09:39:24 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 31 May 2001 09:39:24 GMT
User-Agent: tin/1.5.9-20010312 ("Blue Water") (UNIX) (NetBSD/1.4.3 (i386))
Xref: news.panix.com panix.questions:52699

In pertinent part in , ja  spake thusly:

> This is in rc.body:

> :0 B
> * Today, Snowhite was turning 18. The (7|[Ss]even) Dwarfs
> * Content-Type: APPLICATION/OCTET-STREAM;
> NAME="(dwarf4you\.exe|midgets\.scr|joke\.exe|sexy.virgin\.scr)"
> * Content-Disposition: ATTACHMENT;
> FILENAME="(dwarf4you\.exe|midgets\.scr|joke\.exe|sexy.virgin\.scr)"
> $TRASH

> The condition lines are wrapped.

> Most likely, the Content-Type: and/or Content-Disposition: lines
> changed somehow.

> If you are using the rc.body recipes, which it seems that you are,
> then please send me the sample you received and I will take a look at
> it.

The following generic recipe has found quite a few Win viruses
for me, including the latest Homepage one.

  :0  # conditions here originated with Philip Guenther
  * 9876543210^0 ^Content-[-a-z0-9_]+:.*="?[^"]*\.(vb[se]|ws[fh]|hta|shs)
  * 9876543210^0 B ?? ^Content-[-a-z0-9_]+:.*($[        ].*)*=[  ]*\
                       ($[      ]+)*"?[^"]*\.(vb[se]|ws[fh]|hta|shs)
  virus
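
For anyone who wants to try the header condition against sample header
lines without running procmail, here is a minimal Python sketch of the
first condition only. It assumes Python's re behaves closely enough to
procmail's regexps for this pattern; the function name and the sample
header lines in the usage are hypothetical. The second condition (the
B ?? body scan with folded continuation lines) is not reproduced.

```python
import re

# procmail matches case-insensitively unless the D flag is given, hence
# re.IGNORECASE. re.MULTILINE lets ^ anchor at the start of each header line.
RISKY_ATTACHMENT = re.compile(
    r'^Content-[-a-z0-9_]+:.*="?[^"]*\.(vb[se]|ws[fh]|hta|shs)',
    re.IGNORECASE | re.MULTILINE,
)


def names_risky_attachment(headers):
    """True if any Content-* header names a file with a listed extension."""
    return RISKY_ATTACHMENT.search(headers) is not None


# Hypothetical sample headers, for illustration only.
print(names_risky_attachment(
    'Content-Disposition: attachment; filename="homepage.HTML.vbs"'))
print(names_risky_attachment(
    'Content-Type: text/plain; charset="us-ascii"'))
```

Note that the pattern is not anchored after the extension, so a name
that merely continues past ".vbs" or ".shs" with more characters would
also match.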

-- 
dman


From eli@panix.com Fri Jul 27 14:14:58 EDT 2001
Article: 10515 of panix.upgrade
Path: news.panix.com!eli!not-for-mail
From: eli@panix.com (B. Elijah Griffin)
Newsgroups: panix.upgrade
Subject: Re: Exec Recipe
Date: 27 Jul 2001 01:11:35 GMT
Organization: Some absurd concept
Lines: 100
Message-ID: <9jqf47$5bk$1@news.panix.com>
References:  <9jq1qi$p0s$1@news.panix.com> 
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 996196296 5492 166.84.0.226 (27 Jul 2001 01:11:35 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 27 Jul 2001 01:11:35 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10515

In panix.upgrade, J. Altman  wrote:
> In <9jq1qi$p0s$1@news.panix.com>, B. Elijah Griffin said:
> > As I understand SirCam, it is prepended to a random document found on
> > the victim's computer. It is certainly possible to identify JPEGs,
> > GIFs, etc, in base64 attachments through procmail matching the
> > magic numbers. Perhaps SirCam could be found like this, too.
> 
> Well, possibly. But while I am studying the problem, asking on the
> procmail list, going back to panix.questions....lather, rinse,
> repeat...the thing is still spreading like nobody's business.

Using a different signature for the virus ("^@^Y^@^@^@^A| SCam32^@^P~CMI")
and different base64 offsets, I have this RE to catch it:

:0BD
* AAAAGgU0NhbTMyABCDTUlN|AAAAAaBTQ2FtMzIAEINNSU1F|ABkAAAABoFNDYW0zMgAQg01J
sircamvirus.box

> No, I don't think so. Terseness may not be appropriate, but none of
> the greater-than-ten in the abuse@panix.com box and nearly ten in the
> spam-sites box had anything other than what I am filtering on...I
> simply don't see a hook in your example. If there are magic number
> like objects, I don't know how to find them....I can run strings
> against the saved attachment, but that is a long way from a procmail
> recipe.

I started with 'strings', found something not likely to be in other
binary attachments, base64 encoded it, then fiddled for possible
offset differences. All three of these are base64 fragments that
can be decoded to see the source text they match against:

	AAAAGgU0NhbTMyABCDTUlN
	AAAAAaBTQ2FtMzIAEINNSU1F
	ABkAAAABoFNDYW0zMgAQg01J

(One has to be careful selecting start and stop bytes for the
decoding to work as well as the matching.)
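
The offset fiddling can be done mechanically. What follows is a sketch
under my own assumptions, not the procedure actually used for the
fragments above: the function name is made up, and it trims start and
stop characters conservatively, so its output for a given signature will
overlap with hand-picked fragments without being identical to them.
base64 turns each 3 input bytes into 4 output characters, so a signature
can land at one of three alignments in the encoded stream, and any
character that shares bits with a neighbouring (unknown) byte has to be
dropped.

```python
import base64


def base64_fragments(signature):
    """Return one alignment-independent base64 fragment per byte offset (0, 1, 2)."""
    fragments = []
    for offset in range(3):
        # Filler bytes stand in for whatever precedes the signature.
        padded = b"\0" * offset + signature
        encoded = base64.b64encode(padded).decode("ascii").rstrip("=")
        # Leading characters that carry filler bits: none, two, or three.
        lead = (0, 2, 3)[offset]
        # If the last 3-byte group is incomplete, its final character would
        # also carry bits of the byte that follows the signature.
        tail = 1 if len(padded) % 3 else 0
        fragments.append(encoded[lead:len(encoded) - tail])
    return fragments


print(base64_fragments(b"SCam32"))
# ['U0NhbTMy', 'NDYW0zM', 'TQ2FtMz']
```

Joining the three results with | gives the alternation for a B-flag
condition. Any + or / in a fragment would need escaping in the regexp,
and a line break falling inside the encoded signature will still defeat
the match.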

I don't have a large enough sample of wild SirCams to test to
see if this works reliably: it could be line breaks will get
in the way. That would make the RE more complicated, but still
doable.

> Here is what *seems* to be at the beginning of each attachment:
> (I have not examined each sample, and some of my statements are based
> on the observations of others on various lists)
> 
> This program must be run under Win32
> CODE
> `DATA
> .idata

This is just generic program stuff. Yes, you could use it to catch
programs in the mail, but that might be too much. Particularly
for something just billed as spam filtering.

> Finally, the users' document is appended.

The 'SCam32' signature is very nearly at the end of the virus:
136416 bytes into the 137216 byte sample I first looked at. 

> Can you predict these signatures? I did post in panix.questions asking
> for just this type of advice. All of the samples I have seen have a
> double extension; but virii can, do, and will be written to,
> mutate. That's what virus authors do. How do I predict what the next
> one will do?

And scripting or word doc or $newfileformat viruses won't be caught by
looking for executables. If you want to do general virus scanning, do
that. If you want to stop a particular virus, get a signature for just
that virus.

I get .xls and .doc files by mail sometimes that are not viruses. I
would not want my spam filter junking them. And if I wanted to match
attachments based on filename extension, I would use a much tighter
RE than:

* ^Content-[-a-z0-9_]+:.*=[     ]*"?[^"]*\.(vb[se]|ws[fhe]|hta|shs|exe|bat|
pif|dll|scr|com|xls|doc)


Try this, for example:

EXT="(vb[se]|ws[fhe]|hta|shs|exe|bat|pif|dll|scr|com|xls|doc)"

# Either:
# Content-Type:.*\


References:   <9jqf47$5bk$1@news.panix.com> 
NNTP-Posting-Host: panix1.panix.com
X-Trace: news.panix.com 996257659 16947 166.84.0.226 (27 Jul 2001 18:14:19 GMT)
X-Complaints-To: abuse@panix.com
NNTP-Posting-Date: 27 Jul 2001 18:14:19 GMT
Encrypted: double rot-13
X-Newsreader: Sony Playstation 5.0MIPS
Xref: news.panix.com panix.upgrade:10522

In panix.upgrade, J. Altman  wrote:
> In <9jqf47$5bk$1@news.panix.com>, B. Elijah Griffin said:
> > Using a different signature for the virus ("^@^Y^@^@^@^A| SCam32^@^P~CMI")
> > and different base64 offsets, I have this RE to catch it:
> > 
> >:0BD
> > * AAAAGgU0NhbTMyABCDTUlN|AAAAAaBTQ2FtMzIAEINNSU1F|ABkAAAABoFNDYW0zMgAQg01J
> > sircamvirus.box
> Boggle. But okay. I am not clear on exactly what I am seeing here: is
> procmail actually "peeking" into the binary attachment? 

It is regexp matching a base64 encoded keyword from the virus, so I
guess that is a good explanation.

> > EXT="(vb[se]|ws[fhe]|hta|shs|exe|bat|pif|dll|scr|com|xls|doc)"
> > 
> > # Either:
> > # Content-Type:.*\
> > # Content-Disposition:.*\
> > HEADER_NAME="Content-(Type:.*\<|Disposition:.*\
> > 
> >:0
> > * $ ^$HEADER_NAME[ 	]*\"?[^\"]*\.$EXT([\" 	]|$)
> > matched
> 
> Wow. I think I get it, actually.

I thought the variables would make it clearer.

Elijah
------
keeps meaning to put his collected procmail posts on the web