Creating new spam checks

This document covers how to create a new spam check, which essentially turns out to be a guide on findspam.py.

1. The B(are|ear) Necessities

The very first thing you need to do before creating a new spam check is test if it’s actually necessary.

2. Write a Check

…preferably addressed to me, with plenty of money involved. (Ha! Spelling jokes!)

If you’ve determined that a new check is necessary, now you need to actually write it. The preferred way of doing this is a regex — these are (usually) simpler and easier to maintain than the alternatives. Writing a regex check is pretty easy: just write the regex. You can use regex-checking websites like regex101 to check if you’ve written it right — doing this is encouraged, because regex is not an easy language to speak.
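If you'd rather try a pattern locally before pasting it anywhere, a minimal sketch with Python's built-in re module could look like the following. The pattern, the sample strings and the case-insensitive flag are assumptions made for illustration; they are not taken from findspam.py.

```python
import re

# Hypothetical keyword pattern: \W? allows at most one non-word
# character (space, hyphen, ...) between the two words.
pattern = re.compile(r"male\W?enhancement", re.IGNORECASE)

samples = [
    "Best male enhancement pills",     # single space between the words
    "MALE-ENHANCEMENT for you",        # hyphen, upper case
    "maleenhancement",                 # no separator at all
    "male  enhancement",               # two spaces: \W? only allows one
    "How do I enhance a male voice?",  # unrelated text
]

# True where the pattern is found anywhere in the sample.
results = [bool(pattern.search(s)) for s in samples]

for sample, matched in zip(samples, results):
    print("{!r}: {}".format(sample, "match" if matched else "no match"))
```

Including a few strings that should not match is a cheap way to catch a pattern that is too greedy.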

The alternative, which should only be used if a regex doesn’t do the job, is to write a check method. This is done by writing a new method in findspam.py, which takes two parameters: s and site. s is a string of stuff that you need to check (like the title, username, or post body), and site is the site that the post is on. Give your method a descriptive name, so that its purpose can be judged at a glance.

Your method should return a pair of values. The first is a boolean, indicating whether or not you think the post is spam. The second is a string, the why data for the post. It should be a short descriptive text that describes why you think it’s spam, e.g. Contains keyword *male-enhancement*. See the text box under the reasons list of a metasmoke record for an example.

Here’s an example check method. This method will say that any s longer than 3 characters is spam.

def ridiculous_spam_check(s, site):
    if len(s) > 3:
        return True, "Length is greater than 3 characters"
    else:
        return False, ""
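
For something slightly less ridiculous, here is a sketch of a check method that builds its why data from what it actually found. The check itself, its thresholds and the site value passed in are all hypothetical; only the (s, site) parameters and the (boolean, why) return shape come from the description above.

```python
def mostly_uppercase_check(s, site):
    """Hypothetical check: flag text in which most letters are upper case."""
    letters = [c for c in s if c.isalpha()]
    # Too little text to judge; report "not spam" with empty why data.
    if len(letters) < 10:
        return False, ""
    ratio = sum(c.isupper() for c in letters) / len(letters)
    if ratio > 0.8:
        return True, "{:.0%} of letters are upper case".format(ratio)
    return False, ""


# The site argument is unused in this sketch; the value is a placeholder.
is_spam, why = mostly_uppercase_check("BUY CHEAP WATCHES NOW", "stackoverflow.com")
print(is_spam, why)
```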

If you need to combine multiple pieces of information from a post (e.g. Username similar to website, where you need both the post body and the username), see section 4 below.

3. Endless Lists

Checks are our ammunition against spam; now you need a gun to fire it from. In our case, it’s a GLoCK — a Giant List of Checks and Keywords.

Read the code of the existing checks in findspam.py. There are two kinds of them: regex checks, and function-based checks defined with a decorator.

Master Rule Creator

The prototype of the master Rule Creator™ function is

create_rule(reason, regex=None, **kwargs)

The accepted keyword arguments and their default values are:

create_rule(reason,
    all=True,
    sites=[],
    title=True,
    body=True,
    username=False,
    body_summary=False,
    max_rep=1,
    max_score=0,
    question=True,
    answer=True,
    stripcodeblocks=False,
    whole_post=False,
    disabled=False
)

Implementation details: create_rule takes an optional regex argument. If it’s provided, create_rule creates a regular expression check. If no regex is provided, it returns a decorator that can be used to decorate a function-based check. These are the two ways of defining a rule.
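
To make the regex-or-decorator behaviour concrete, here is a simplified sketch of how such a function could be structured. It is not the real create_rule: the RULES list, the inner regex_check helper and the "Matched: ..." why text are all made up for illustration, and the real function handles many more options.

```python
import re

RULES = []  # hypothetical registry; the real bookkeeping may differ


def create_rule(reason, regex=None, **kwargs):
    """Sketch: register a regex rule directly, or return a decorator."""
    if regex is not None:
        compiled = re.compile(regex)

        def regex_check(s, site):
            match = compiled.search(s)
            if match:
                return True, "Matched: " + match.group(0)
            return False, ""

        RULES.append((reason, regex_check, kwargs))
        return regex_check

    # No regex given: hand back a decorator for a function-based check.
    def decorator(func):
        RULES.append((reason, func, kwargs))
        return func

    return decorator


# Way 1: a regex rule, created in a single call.
create_rule("bad keyword in {}", r"male\W?enhancement", username=True)


# Way 2: a function-based rule, created by decorating a check method.
@create_rule("text too long", max_rep=10)
def long_text_check(s, site):
    if len(s) > 3:
        return True, "Length is greater than 3 characters"
    return False, ""
```

In this sketch the decorator returns the function unchanged, so the check method can still be called directly, which is handy when testing it.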

4. Other neat things

Partial reason

If a check is designed to check multiple aspects of a post individually, you may want separate reasons for the aspects checked. For example, bad keyword in title, bad keyword in body and bad keyword in username are three reasons.

To keep things minimal, you can use a pair of curly brackets in the reason, which will be replaced by one of title, body and username, if the corresponding part of the post is caught as spam. So you can write this as a reason:

create_rule("bad keyword in {}", r"male\W?enhancement")

Note: We’re currently using str.replace() to insert the post aspect instead of using .format(), so only bare braces are supported.
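
A short sketch of what that substitution does, using plain str.replace(). The variable names are illustrative only.

```python
reason = "bad keyword in {}"

# One expanded reason per part of the post that could be caught.
reasons = [reason.replace("{}", part) for part in ("title", "body", "username")]
print(reasons)

# Only the bare "{}" is substituted; a named placeholder is left untouched.
named = "bad keyword in {part}".replace("{}", "title")
print(named)
```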

whole_post check

If you need multiple pieces of information from a post, instead of checking title, body and username separately, you can define a method like this (example):

@create_rule("this is a reason", whole_post=True)
def another_ridiculous_check(post):
    if post.user_name in post.body and post.user_name in post.title:
        return False, True, False, "Username in both title and body"
    return False, False, False, ""

The method should take one post object. You can find its available properties starting from here. Then you can perform the check with the information provided in the object.

The return value is also a bit different from a standard check. A whole_post check should return 4 values, the format of which would be

return title_is_spam, username_is_spam, body_is_spam, why

The first three boolean values indicate whether you think the title, the username or the post body is spam. The last value is why data, which is identical to the why data mentioned above.

When defining the entry for your check, you need to add whole_post=True to the rule creator, so that the check dispatcher can provide the post object to your method, instead of s, site.
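
The difference between the two calling conventions can be sketched as below. Everything here is an assumption made for illustration: the Post class is a stand-in with only the three properties used in the example above plus a site field, and run_check is a hypothetical dispatcher, not the one in findspam.py.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """Hypothetical stand-in for the real post object."""
    title: str
    body: str
    user_name: str
    site: str


def username_in_post(post):
    # whole_post check: receives the post object, returns 4 values.
    if post.user_name in post.body and post.user_name in post.title:
        return False, True, False, "Username in both title and body"
    return False, False, False, ""


def keyword_check(s, site):
    # Standard check: receives one string and the site, returns 2 values.
    if "enhancement" in s.lower():
        return True, "Contains keyword *enhancement*"
    return False, ""


def run_check(check, post, whole_post=False):
    """Return (title_is_spam, username_is_spam, body_is_spam, why)."""
    if whole_post:
        return check(post)
    # Otherwise run the check once per part, in the same order as above.
    results = [check(text, post.site)
               for text in (post.title, post.user_name, post.body)]
    flags = tuple(is_spam for is_spam, _ in results)
    why = "; ".join(text for _, text in results if text)
    return flags + (why,)


post = Post(title="Pills by acme", body="Visit acme for enhancement",
            user_name="acme", site="example.com")
print(run_check(username_in_post, post, whole_post=True))
print(run_check(keyword_check, post))
```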