This document covers how to create a new spam check, which essentially turns out to be a guide on findspam.py.
B(are|ear) Necessities
The very first thing you need to do before creating a new spam check is test if it’s actually necessary. Use the !!/test commands to check this.
…preferably addressed to me, with plenty of money involved. (Ha! Spelling jokes!)
If you’ve determined that a new check is necessary, now you need to actually write it. The preferred way of doing this is a regex — these are (usually) simpler and easier to maintain than the alternatives. Writing a regex check is pretty easy: just write the regex. You can use regex-checking websites like regex101 to check if you’ve written it right — doing this is encouraged, because regex is not an easy language to speak.
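Before wiring a pattern into a check, it can help to try it against a few sample strings locally. The sketch below uses Python's standard re module with the male\W?enhancement pattern that appears later in this guide. The sample strings are made up, and the IGNORECASE flag is an assumption for this sketch, not a statement about how findspam.py compiles its patterns.

```python
import re

# Pattern taken from the create_rule example in this guide.
# re.IGNORECASE is an assumption made for this sketch only.
PATTERN = re.compile(r"male\W?enhancement", re.IGNORECASE)

# Made-up sample strings: the first three should match, the last two should not.
samples = [
    "Best male enhancement pills",
    "MALE-ENHANCEMENT offer",
    "maleenhancement",
    "male  enhancement",  # two separators: \W? allows at most one
    "How do I enhance a male voice recording?",
]


def matches(text):
    """Return True if the pattern is found anywhere in text."""
    return PATTERN.search(text) is not None


results = [matches(text) for text in samples]

for text, hit in zip(samples, results):
    print(hit, "|", text)
```

Keeping a few strings that should not match in the sample list is a cheap way to spot a pattern that is too greedy.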
The alternative, which should only be used if a regex doesn’t do the job, is to write a check method. This is done by writing a new method in findspam.py that takes two parameters, s and site. The s parameter is a string of stuff that you need to check (like the title, username, or post body), and site is the site that the post is on. Give your method a descriptive name, so that its purpose can be judged at a glance.
Your method should return a pair of values. The first is a boolean, indicating whether or not you think the post is spam. The second is a string, the why data for the post. It should be a short descriptive text that describes why you think it’s spam, e.g. Contains keyword *male-enhancement*. See the text box under the reasons list of a metasmoke record for an example.
Here’s an example check method. This method will say that any s longer than 3 characters is spam.
def ridiculous_spam_check(s, site):
    if len(s) > 3:
        return True, "Length is greater than 3 characters"
    else:
        return False, ""
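For a slightly less ridiculous illustration of the same contract, here is a sketch of a check that flags text dominated by a single repeated word. The function name and both thresholds are hypothetical and chosen only for illustration; nothing here is taken from findspam.py.

```python
from collections import Counter


def repeated_word_check(s, site):
    """Hypothetical check: flag text where one word dominates.

    Follows the contract described above and returns (is_spam, why).
    The thresholds (at least 10 words, more than 50% share) are
    arbitrary values chosen for illustration. The site argument is
    accepted to match the expected signature but is not used here.
    """
    words = s.lower().split()
    if len(words) < 10:
        return False, ""
    word, count = Counter(words).most_common(1)[0]
    if count / len(words) > 0.5:
        why = "Word *{}* makes up {} of {} words".format(word, count, len(words))
        return True, why
    return False, ""


# Made-up input: "buy" repeated 8 times followed by three other words.
spammy = "buy " * 8 + "cheap pills now"
print(repeated_word_check(spammy, "stackoverflow.com"))
print(repeated_word_check("short text", "stackoverflow.com"))
```

Returning an empty why string for the non-spam case mirrors the example above.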
If you need to combine multiple pieces of information from a post (e.g. Username similar to website, where you need both the post body and the username), see section 4 below.
Checks are our ammunition against spam; now you need a gun to fire it from. In our case, it’s a GLoCK — a Giant List of Checks and Keywords.
Read the code of the existing checks in findspam.py. There are two kinds of them:
RegEx-based checks are created and registered into the GLoCK with a call to create_rule:
create_rule("bad keyword in post", r"male\W?enhancement", max_rep=5, max_score=1)
Functional checks are created by using create_rule as a decorator (PEP 318):
@create_rule("spam answer", question=False)
def ridiculous_check(s, site):
    return True, "All answers are spam"
This way, the function ridiculous_check is wrapped into a Rule object and registered to the GLoCK.
The prototype of the master Rule Creator™ function is
create_rule(reason, regex=None, **kwargs)
Accepted keyword arguments are:
title, username, body, body_summary: These are the different parts of a post. Set them to True if your check should be performed against these aspects.
all and sites: all defines whether the check should apply on all sites, and sites defines the exception sites. If all=True, then posts on every site except those in sites will be checked (the listed sites are whitelisted). If all=False, then only posts on sites will be checked.
max_rep, max_score: Upper limits of owner reputation and post score. Generally, if the owner has more reputation or the post has a high score, it’s unlikely that it will be spam.
question, answer: Whether the check applies to questions and to answers, respectively. For example, you probably want to specify answer=False for a “bad question” check. Note that they’re in the singular form.
stripcodeblocks: Whether code blocks should be stripped before running the check.
whole_post: See below.
disabled: Just a neat way to create a rule without putting it into production 😃. If truthy, then the Rule is not added to those used by FindSpam.
elapsed_time_reporting: Allows you to specify the type and minimum elapsed time to report as an issue for that specific test. If not supplied, then the default values will be used. This can be used in development to log how long your new test takes to process. In production, if you are not going to use the default values, then the times should be set such that logging and reporting into chat are actually exceptional cases which indicate an issue that should be addressed.
rule_id: Every test must have a rule_id. This value is used to uniquely identify the test, as multiple tests can have the same reason. However, for the first test, and only the first, that uses a specific reason, the rule_id defaults to a value identical to the reason if it is not explicitly specified. Thus, no rule_id needs to be explicitly specified for the first test that uses a particular reason.
skip_creation_sanity_check: If truthy, it prevents running Rule.sanity_check() upon creation of the Rule. This is used on the main blacklist and watchlist rules, because the regular expression text for them doesn’t exist at the time they are created, and Rule.sanity_check() validates that the test has either a function to call or regular expression text.
The default values for those options are:
create_rule(reason,
            all=True,
            sites=[],
            title=True,
            body=True,
            username=False,
            body_summary=False,
            max_rep=1,
            max_score=0,
            question=True,
            answer=True,
            stripcodeblocks=False,
            whole_post=False,
            disabled=False
)
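To make the all and sites semantics concrete, here is a small sketch of the site filter as the option list describes it. The helper name site_is_checked is hypothetical and is not taken from findspam.py, and the parameter is called check_all here only to avoid shadowing Python's built-in all.

```python
def site_is_checked(site, check_all=True, sites=()):
    """Hypothetical helper mirroring the documented all/sites behaviour.

    check_all stands in for the all option:
      check_all=True  -> every site is checked except those listed in sites.
      check_all=False -> only the sites listed in sites are checked.
    """
    if check_all:
        return site not in sites
    return site in sites


# With the defaults (all=True, sites=[]), every site is checked.
print(site_is_checked("stackoverflow.com"))

# all=True with an exception list: the listed site is skipped.
print(site_is_checked("math.stackexchange.com", True, ["math.stackexchange.com"]))

# all=False: only the listed sites are checked.
print(site_is_checked("stackoverflow.com", False, ["superuser.com"]))
```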
Implementation details: create_rule takes an optional regex argument. If it’s provided, then it creates a regular expression check. If no regex is provided, it returns a decorator that can be used to decorate a function-based check. This is how the same function works in both ways.
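The dual behaviour described above can be sketched with a toy function. This is not the real create_rule: the names make_rule and RULES are hypothetical, the options are stored without being interpreted, and the decorated function is returned unchanged rather than wrapped in a Rule object.

```python
import re

RULES = []  # hypothetical stand-in for the real list of registered rules


def make_rule(reason, regex=None, **options):
    """Toy dual-mode rule creator (a sketch, not the real create_rule)."""
    if regex is not None:
        # Regex mode: build a check function from the pattern and register it.
        compiled = re.compile(regex)

        def regex_check(s, site):
            match = compiled.search(s)
            if match:
                return True, "Matched {!r}".format(match.group(0))
            return False, ""

        RULES.append((reason, regex_check, options))
        return regex_check

    # Decorator mode: return a function that registers whatever it decorates.
    def decorator(func):
        RULES.append((reason, func, options))
        return func

    return decorator


# Regex mode: called like an ordinary function.
make_rule("bad keyword in post", r"male\W?enhancement", max_rep=5)


# Decorator mode: no regex given, so the return value decorates the function.
@make_rule("too long", question=False)
def too_long(s, site):
    if len(s) > 100:
        return True, "Longer than 100 characters"
    return False, ""


print([reason for reason, _, _ in RULES])
```

Checking whether regex was supplied is what lets one entry point serve both call styles.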
If a check is designed to check multiple aspects of a post individually, you may want separate reasons for the aspects checked. For example, bad keyword in title, bad keyword in body and bad keyword in username are three reasons.
To keep things minimal, you can use a pair of curly brackets in the reason, which will be replaced by one of title, body and username, if the corresponding part of the post is caught as spam. So you can write this as a reason:
create_rule("bad keyword in {}", r"male\W?enhancement")
Note: We’re currently using str.replace() to insert the post aspect instead of using .format(), so only bare braces are supported.
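The practical difference between the two approaches can be seen with plain strings. The exact call used inside SmokeDetector may differ from this sketch, which only uses the standard str.replace() and str.format() methods.

```python
reason_template = "bad keyword in {}"

# Bare braces are substituted with a plain string replacement.
title_reason = reason_template.replace("{}", "title")
print(title_reason)

# An indexed field works with str.format(), but str.replace("{}", ...)
# leaves it untouched because the literal text "{}" never occurs in it.
indexed_template = "bad keyword in {0}"
unchanged = indexed_template.replace("{}", "title")
formatted = indexed_template.format("title")
print(unchanged)
print(formatted)
```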
whole_post check
If you need multiple pieces of information from a post, instead of checking title, body and username separately, you can define a method like this (example):
@create_rule("this is a reason", whole_post=True)
def another_ridiculous_check(post):
    if post.user_name in post.body and post.user_name in post.title:
        return False, True, False, "Username in both title and body"
    return False, False, False, ""
The method should take one post object. You can find its available properties starting from here. Then you can perform the check with the information provided in the object.
The return value is also a bit different from a standard check. A whole_post check should return 4 values, the format of which would be:
return title_is_spam, username_is_spam, body_is_spam, why
The first three boolean values indicate whether you think the title, the username or the post body is spam. The last value is why data, which is identical to the why data mentioned above.
When defining the entry for your check, you need to add whole_post=True to the rule creator, so that the check dispatcher can provide the post object to your method, instead of s, site.
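To try a whole_post check in isolation, you can call it with a stand-in object. The FakePost class below is hypothetical and models only the three attributes used in the example above (title, body, user_name); the real post object has more properties. The check is written without the decorator so that the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakePost:
    """Hypothetical stand-in for the real post object."""
    title: str
    body: str
    user_name: str


def username_in_title_and_body(post):
    """Same logic as the example check above, without the decorator.

    Returns (title_is_spam, username_is_spam, body_is_spam, why).
    """
    if post.user_name in post.title and post.user_name in post.body:
        return False, True, False, "Username in both title and body"
    return False, False, False, ""


# Made-up posts: the first has the username in both places, the second does not.
flagged = username_in_title_and_body(
    FakePost(title="Visit acme now", body="acme sells things", user_name="acme")
)
clean = username_in_title_and_body(
    FakePost(title="Hello", body="acme sells things", user_name="acme")
)
print(flagged)
print(clean)
```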