Develop a blog - Part 4: HTML and Markdown parsers

In the previous articles we made sure that we can create a BlogPost and that we can publish or schedule it. Today we will focus on writing the blog post itself. In the requirements from part 1 we stated that the blog post author wants to choose between Markdown and plain HTML to write their post. We expected that more parsers would be required in the future and we accommodated that in our design.

As a refresher, here is the UML diagram of our design:

[Image: UML diagram of the design]

We introduce a Parser interface which will be responsible for parsing the introduction and content of a BlogPost and providing a parsed result. Every BlogPost will have a reference to its own parser, allowing the Author to decide per BlogPost which parser / technique to use.

This article is part of a series. Read the other parts first if you have not read them yet.

The source code for the full series and the changes I made during this article are available on GitHub.

Introducing the Parser interface

We start at the BlogPost again. There are a few design decisions that we need to make:

  1. How do we provide the correct parser to the BlogPost?
  2. How does the user of the BlogPost class get the parsed results?

Question number 1 we can answer in two ways:

  1. We provide a parser in the constructor
  2. We provide a setter where we can provide the parser

In this case the constructor would be the most logical place to provide the parser. The choice of parser needs to be known before the user is going to write an introduction and content. As we are providing those to the constructor when we create a BlogPost, we also need to provide the correct parser at the same time.

Question number 2 is easier to answer. We already have two methods to get to the introduction and content: getIntroduction and getContent. These methods can return the parsed introduction and content respectively.

Now that we made all necessary design decisions we can write the first test:

protected function createBlogPost(
    string $title = 'My first blog post',
    string $introduction = 'A short introduction to the BlogPost',
    string $content = 'The content of the full article',
    ?Parser $parser = null
): BlogPost {
    // Default to a stub that returns its input unchanged, so tests that
    // do not care about parsing are not affected by the parser.
    if ($parser === null) {
        $parser = $this->createStub(Parser::class);
        $parser->method('parse')
            ->willReturnArgument(0);
    }

    return new BlogPost(
        author: new Author('Mark'),
        category: new Category('PHP'),
        title: $title,
        introduction: $introduction,
        content: $content,
        parser: $parser,
    );
}

/** @test */
public function the_introduction_is_parsed()
{
    $parser = $this->createStub(Parser::class);
    $parser->method('parse')
        ->willReturn('Parsed introduction');

    $blogPost = $this->createBlogPost(parser: $parser);

    $this->assertEquals('Parsed introduction', $blogPost->getIntroduction());
}

/** @test */
public function the_content_is_parsed()
{
    $parser = $this->createStub(Parser::class);
    $parser->method('parse')
        ->willReturn('Parsed content');

    $blogPost = $this->createBlogPost(
        parser: $parser,
    );
    $this->assertEquals('Parsed content', $blogPost->getContent());
}

We want to focus on the interface of a Parser, not the concrete implementations. We will test those later, individually, when we develop them. To still be able to test how BlogPost uses the Parser interface, we use a stub. In the parser-specific tests I created a stub that returns a specific string. In all other cases we want a stub that just returns its argument and does nothing else.

public function __construct(
    Parser $parser,
    Author $author,
    Category $category,
    string $title,
    string $introduction,
    string $content,
) {
    $this->validate($title, $introduction, $content);

    $this->author = $author;
    $this->category = $category;
    $this->title = $title;
    $this->introduction = $introduction;
    $this->content = $content;

    $this->status = new Draft();
    $this->parser = $parser;
}

public function getIntroduction() : string
{
    return $this->parser->parse($this->introduction);
}

public function getContent() : string
{
    return $this->parser->parse($this->content);
}

The way the test is set up allows us to focus only on the way we interact with the Parser interface. This way we can create a separate test class to focus on the functionality of each individual parser.
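The Parser interface itself is not listed in this article. Judging from how BlogPost calls it and how the concrete parsers implement it, a minimal sketch could look like the one below. The namespace is left out for brevity, and both `UppercaseParser` and the trimmed-down `Post` class are hypothetical: they only exist to make the example run on its own, and the real BlogPost takes more constructor arguments.

```php
<?php

// Assumed shape of the interface, inferred from the parse(string): string
// calls made by BlogPost and the parsers shown in this article.
interface Parser
{
    public function parse(string $string): string;
}

// Hypothetical parser, only here to show that any implementation fits.
final class UppercaseParser implements Parser
{
    public function parse(string $string): string
    {
        return strtoupper($string);
    }
}

// Hypothetical, trimmed-down stand-in for BlogPost.
final class Post
{
    public function __construct(
        private Parser $parser,
        private string $introduction,
        private string $content,
    ) {
    }

    public function getIntroduction(): string
    {
        // The raw text is stored; parsing happens when it is read.
        return $this->parser->parse($this->introduction);
    }

    public function getContent(): string
    {
        return $this->parser->parse($this->content);
    }
}

$post = new Post(new UppercaseParser(), 'Short intro', 'Full content');

echo $post->getIntroduction(), PHP_EOL;
echo $post->getContent(), PHP_EOL;
```

Because the class only depends on the interface, swapping in another parser does not require any change to the getters.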

Markdown parser

Let's start with the Markdown parser. This parser will be easier than the HTML parser, because we have less to worry about regarding malicious HTML and cross-site scripting (XSS).

The Markdown parser has one responsibility: it takes a Markdown string and gives back the parsed HTML string. How we are going to achieve the parsing isn't relevant for the test:

<?php

namespace Tests\Parsers;

use PHPUnit\Framework\TestCase;
use Webdevils\Blog\Parsers\MarkdownParser;

class MarkdownParserTest extends TestCase
{
    /** @test */
    public function it_converts_markdown_to_html()
    {
        $parser = new MarkdownParser();

        $this->assertEquals(
            "<h1>Heading 1</h1>\n" .
            "<p>And a paragraph</p>\n" .
            "<p>Some <strong>bold</strong> and some <em>italic</em> text</p>\n" .
            "<ol>\n" .
            "<li>Short</li>\n" .
            "<li>List</li>\n" .
            "</ol>\n" .
            "<ul>\n" .
            "<li>Another</li>\n" .
            "<li>List</li>\n" .
            "</ul>\n" .
            "<p>And an image:</p>\n" .
            "<p><img src=\"random.png\" alt=\"Not found\" /></p>\n" .
            "<p>And a nice link to <a href=\"https://webdevils.nl\">Webdevils</a></p>\n",
            $parser->parse('
# Heading 1
And a paragraph
            
Some **bold** and some *italic* text
            
1. Short
2. List

- Another
- List

And an image:

![Not found](random.png)

And a nice link to [Webdevils](https://webdevils.nl)
            ')
        );
    }
}

For the implementation of the parser it is relevant how we are going to achieve this. And the answer is: in the most boring way possible. I'm a lazy person and not very good at creating parsers. Other people are much better at it and a quick Google search will give you multiple viable open-source libraries that will do the Markdown parsing for us. That is also the reason why I created one fairly elaborate test. The goal is to verify that typical Markdown can be parsed, but we aren't trying to retest the full library that we are using.

After some Googling I came across the CommonMark package from The PHP League. This package seems to be decent, well documented and also well maintained. The last commit was 21 hours ago when I was writing this article, it has 1.9k stars on GitHub and people are actively replying to issues and PRs.

We can use composer to add a dependency to our project:

composer require league/commonmark

The CommonMark parser supports many different extensions that can be useful for more advanced use cases. For a simple blog those features aren't really necessary so I won't use them. You can read the CommonMark documentation to learn how to enable extensions and add the code to the MarkdownParser object we will create.

I will provide some configuration to the CommonMark parser. The documentation states that by default HTML input is allowed, unsafe links are enabled and the max nesting level is PHP_INT_MAX. If you know who is providing the Markdown this should be fine, but in the case of our blog the authors might be non-technical. We don't want them to just copy some HTML or Markdown from the web and create a security vulnerability in our Blog.

<?php


namespace Webdevils\Blog\Parsers;

use League\CommonMark\CommonMarkConverter;

class MarkdownParser implements Parser
{
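    public function parse(string $string): string
    {
        $converter = new CommonMarkConverter([
            // Escape raw HTML instead of passing it through.
            'html_input' => 'escape',
            // Remove risky link and image URLs.
            'allow_unsafe_links' => false,
            // Limit how deeply blocks may be nested.
            'max_nesting_level' => 5,
        ]);

        // Cast to string in case the installed library version returns
        // a stringable object instead of a plain string.
        return (string) $converter->convertToHtml($string);
    }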
}

Nothing special here. We initialise the CommonMarkConverter with the security-aware config and parse the provided string.

HTML Parser

Next up is the HTML parser. We have to be careful here. We want to give authors enough freedom to express themselves, but we don't want them to shoot themselves in the foot with broken and insecure HTML. Initially you would think that is easy: just remove any script tags and you should be fine. Unfortunately there are plenty of other ways to inject executable code into an HTML document. We could use event attributes like onclick or onload, for example, or just an anchor tag with href="javascript:". And there are plenty more ways to inject code into HTML.
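To make that concrete, here is a small sketch of the "just remove the script tags" approach. The `stripScriptTags` function is hypothetical and deliberately naive; it is not something we will use in the blog, it only shows what such a filter leaves behind.

```php
<?php

// Hypothetical, deliberately naive sanitizer: it only removes <script> elements.
function stripScriptTags(string $html): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

$input = '<script>alert("xss")</script>'
    . '<p onclick="alert(1)">Click me</p>'
    . '<a href="javascript:alert(2)">Link</a>';

$output = stripScriptTags($input);

// The script element has been removed from the output.
var_dump(str_contains($output, '<script'));

// The event handler attribute and the javascript: URL are still present,
// so the "cleaned" HTML can still execute code in the browser.
var_dump(str_contains($output, 'onclick='));
var_dump(str_contains($output, 'javascript:'));
```

Every additional vector would need its own rule, which is why a blocklist of known-bad patterns is hard to get right.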

Before we worry about all possible ways an Author can break our Blog we can create a simple test that shows some of these problems:

<?php

namespace Tests\Parsers;

use PHPUnit\Framework\TestCase;
use Webdevils\Blog\Parsers\HTMLParser;

class HTMLParserTest extends TestCase
{
    /** @test */
    public function it_cleans_html()
    {
        $parser = new HTMLParser();

        $this->assertEquals(
            "<h1>Header</h1>\n" .
            "<p>Next line is a paragraph</p>\n" .
            "<p><a>Test</a> A link to <a href=\"https://webdevils.nl\">Webdevils.nl</a>\n" .
            "</p><ul>\n" .
            "<li>Item 1</li>" .
            "</ul>",
            $parser->parse(
                "<h1>Header</h1>\n" .
                "<script>alert('xss')</script>" .
                "<p>Next line is a paragraph</p>\n" .
                "<p><a href=\"javascript:alert('boe!')\">Test</a> A link to <a href=\"https://webdevils.nl\">Webdevils.nl</a>\n" .
                "<ul>\n" .
                "<li>Item 1"
            )
        );
    }
}

Luckily, we don't have to come up with all the ways HTML can be abused ourselves. Other people have already done that work for us. There is a PHP library called HTMLPurifier. HTMLPurifier uses a whitelist of allowed tags, attributes and values and strips away all other HTML.

We can install HTMLPurifier as a dependency on our project:

composer require ezyang/htmlpurifier

Basic usage of HTMLPurifier is straightforward. We create a default configuration, create an HTMLPurifier object and purify our HTML.

<?php


namespace Webdevils\Blog\Parsers;

class HTMLParser implements Parser
{
    public function parse(string $string): string
    {
        $config = \HTMLPurifier_Config::createDefault();
        $purifier = new \HTMLPurifier($config);

        return $purifier->purify($string);
    }
}

For this project the default settings are good enough. For your project you can decide to make the config as complicated as you want. You can create the object and the config in the parse method, or even move it to the constructor and request the config from the user of the class. It is all up to you. Read the documentation of HTMLPurifier to see all its possibilities.
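As an illustration of the constructor variant mentioned above, the sketch below builds the purifier once and lets the caller optionally pass a config. It assumes ezyang/htmlpurifier is installed and Composer's autoloader is loaded. The namespace is left out for brevity, and the Parser interface is repeated (inferred from the parsers in this article) only so the sketch is complete. Treat it as one possible design, not the way the project does it.

```php
<?php

// Assumed interface, inferred from the parsers in this article.
interface Parser
{
    public function parse(string $string): string;
}

class HTMLParser implements Parser
{
    private \HTMLPurifier $purifier;

    // The optional $config parameter is an assumption of this sketch:
    // callers that pass nothing get the library defaults.
    public function __construct(?\HTMLPurifier_Config $config = null)
    {
        $this->purifier = new \HTMLPurifier(
            $config ?? \HTMLPurifier_Config::createDefault()
        );
    }

    public function parse(string $string): string
    {
        // Reuses the purifier built in the constructor on every call.
        return $this->purifier->purify($string);
    }
}
```

A caller could then create a config with `HTMLPurifier_Config::createDefault()`, adjust it through its `set()` method and hand it to `new HTMLParser($config)`, while existing code that calls `new HTMLParser()` keeps working.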

Next steps and source code

In the next article I want to focus on slugs. Slugs are human-readable URL parts that we can use to identify a Category or a BlogPost. We will need these to allow our readers to easily identify a BlogPost. Also, we need to ensure that we implement at least basic search engine optimisation (SEO).

The source code of the Blog project and the changes made in this article are available on GitHub.

Author

Mark Kazemier

Hi, my name is Mark. I'm the founder of webdevils.nl and love developing websites and other web applications. Through Webdevils.nl I want to spread my enthusiasm about the web and PHP. In my professional life I'm a security expert specialised in security monitoring.