Only allow specific HTML code

Question by Progger99 | 29/12/2011 at 22:00

I have a form on my website so that visitors can comment on the articles in my blog. I would now like to give users the ability to use certain formatting and links.

However, only certain HTML tags should be allowed, since I do not want users to break the design of my site, nor do I want anyone to inject scripts or malicious code this way.

So what I need is a function that keeps tags like <a>, <br> or <p> but filters out everything else, such as <script> and similar tags. I have considered a function based on regular expressions, but I am stuck there. Can someone help me?



The magic word is strip_tags. Have a look at php.net/manual/en/function.strip-tags.php. There you will find everything important on this topic!
30/12/2011 at 18:23

To explain it in a bit more detail: The function strip_tags() expects a string as a parameter and optionally the allowed tags. Example:

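// assumed input, reconstructed from the outputs and the explanation below
$s = '<p>word</p><br>';
echo strip_tags($s);
// output 1: 'word'
echo strip_tags($s, '<p><a>');
// output 2: '<p>word</p>'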

For output 1, no tags are allowed, so only 'word' is printed. The second output is different, because the HTML tags <p> and <a> are allowed there. Since the string $s contains no <a> but does contain <p>, all <p> tags are kept. The line break <br> is deleted, as this tag is not allowed. If you wrote strip_tags($s, '<p><a><br>') instead, the line break <br> would be kept as well.
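
The last variant mentioned above can be tried out like this. This is only a small sketch; the content of $s is an assumption chosen to match the outputs described above.

```php
<?php
// Assumed sample input: a paragraph followed by a line break.
$s = '<p>word</p><br>';

// Allow <p>, <a> and <br>; every other tag would be removed.
$withBr = strip_tags($s, '<p><a><br>');

echo $withBr; // <p>word</p><br>
```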
30/12/2011 at 20:54

Stefan Trost

Attention! You should not rely solely on strip_tags! Within a permitted HTML tag, malicious users can still place code in an onmouseover attribute or a similar event handler. strip_tags alone does not remove this.
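
To make the problem visible, here is a minimal sketch. The variable names and the sample comment are made up for illustration.

```php
<?php
// Hypothetical user comment with an event handler inside an allowed tag.
$comment = '<p onmouseover="alert(1);">Hello</p>';

// <p> is allowed, so the whole opening tag is kept, attributes included.
$filtered = strip_tags($comment, '<p>');

echo $filtered; // <p onmouseover="alert(1);">Hello</p>
```

The string comes back unchanged, so the handler would still run in the visitor's browser.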

You can overcome this problem like this:

// string with tags and malicious code
$txt = '<p class="x" onmouseover="alert(1);">Text Text Text <strong>Text</strong></p>';
// clean up: first remove every tag except <p>
$txt = strip_tags($txt, '<p>');
// then drop all attributes from the remaining tags
$regex = '#<(/?\w+)\s+[^>]*>#is';
$txt = preg_replace($regex, '<${1}>', $txt);
// output
echo $txt; // '<p>Text Text Text Text</p>'

First, this code uses strip_tags() to delete all tags except <p> from the string. Then, a regular expression is used to delete all attributes from the remaining tags. Thus, the onmouseover handler disappears from the p tag, and so do class and any other potentially unwanted attributes. If you wish to keep certain attributes, you can adapt the regular expression accordingly.
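
As one possible way to keep a single attribute, here is a sketch that preserves href on <a> so that links still work. The function name clean_comment, the tag list and the rule "only double-quoted http(s) URLs" are my own assumptions, not something fixed by strip_tags. The pattern is also slightly looser than the one above: it rewrites every tag, so attributes that follow the tag name without whitespace are dropped as well.

```php
<?php
// Hypothetical helper: the name and the whitelist are illustrative only.
function clean_comment(string $html): string
{
    // Step 1: remove every tag except the allowed ones.
    $html = strip_tags($html, '<p><a><br>');

    // Step 2: rebuild each remaining tag without its attributes.
    $result = preg_replace_callback(
        '#<(/?\w+)([^>]*)>#',
        function (array $m): string {
            // Assumption: keep href on an opening <a> only if it is a
            // double-quoted http or https URL.
            if (strtolower($m[1]) === 'a'
                && preg_match('#(?:^|\s)href="(https?://[^"]*)"#i', $m[2], $h)
            ) {
                // Re-escape the URL without double-encoding existing entities.
                return '<a href="' . htmlspecialchars($h[1], ENT_QUOTES, 'UTF-8', false) . '">';
            }
            // Every other tag is emitted bare.
            return '<' . $m[1] . '>';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error.
    return $result ?? '';
}

$in = '<p class="x" onmouseover="alert(1);">Hi '
    . '<a href="https://example.com/" onclick="evil()">link</a> '
    . '<strong>bold</strong></p>';

echo clean_comment($in);
// <p>Hi <a href="https://example.com/">link</a> bold</p>
```

A link such as <a href="javascript:alert(1)"> does not match the http(s) rule and is therefore reduced to a bare <a>.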
31/12/2011 at 20:25

