My Attempts to Automate Content Writing

Oct 09 2020

As I dabbled in SEO and content marketing, I quickly ran into a problem: I had far more content ideas than I had time to write about.

An article about a not-too-complicated topic would take me 2 hours to write, plus research and editing. But some of these topics weren't all that interesting, and I didn't want to spend all my time writing.

I needed a way to scale this.

Outsourcing

My first thought was to hire freelance writers. That way, I could focus on finding and outlining topics and have others flesh them out into complete articles. This is a common practice among content marketers.

I hired three writers from different countries on Upwork to write about different topics. The cost came down to about $10 for a 700-word article. The turnaround was fast too. But the drafts weren't publishable as delivered. I would still have to spend a lot of time editing them myself, on top of the overhead of working with the writers.

At this point, the hacker in me said, "that's it, I'm turning my attention to automation."

Generating Content with GPT-2

My first automation approach was to prompt a generative language model. OpenAI's GPT-3 has been making waves with its huge model size and cool demos, so it seemed like a good place to start.

At the time of writing, I didn't have access to GPT-3 (by the way, if anyone can hook me up that'd be great). Instead, I used Hugging Face's implementation of GPT-2, its predecessor, which works similarly but has far fewer parameters.

I would give it an article prompt, say the title or first sentence I had in mind, and GPT-2 would spit out a plausible-sounding paragraph. The output read like human writing, which was good. But it was too free-form and off-topic for my use case. Playing with the model parameters that I could tweak didn't help with this either.
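For reference, here is a minimal sketch of this kind of prompt-to-paragraph setup using the text-generation pipeline from the Hugging Face transformers library. The exact script and settings used for this post aren't given, so the helper names and the sampling values (temperature, top_k, top_p, max_new_tokens) are assumptions for illustration only.

```python
def sampling_options(temperature=0.7, top_k=50, top_p=0.95, max_new_tokens=80):
    """Collect the sampling knobs passed to the generator (values are illustrative)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return {
        "do_sample": True,           # sample instead of greedy decoding
        "temperature": temperature,  # lower values give more conservative text
        "top_k": top_k,              # only sample from the k most likely tokens
        "top_p": top_p,              # nucleus sampling cutoff
        "max_new_tokens": max_new_tokens,
    }


def generate_drafts(prompt, n=3, **overrides):
    """Return n GPT-2 continuations of the prompt as plain strings."""
    # Imported lazily: needs `pip install transformers torch`
    # and downloads the model weights on first use.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    outputs = generator(prompt, num_return_sequences=n, **sampling_options(**overrides))
    return [item["generated_text"] for item in outputs]


# Example call (hypothetical prompt):
# drafts = generate_drafts("How to choose a standing desk", temperature=0.5)
```

Knobs like these change how adventurous the wording is, but none of them pins the output to a topic, which is consistent with the problem described above.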

I needed more control over the output topics.

Trying Production Services

Then I thought, maybe there are more production-ready services I could use. Sure enough, I came across Contentyze on an IndieHackers podcast episode and was excited to try it out.

It promised a lot on the landing page, and it had an interface that was easy to use. But the outputs just weren't usable. It seemed like a wrapper on top of the language models I had already tried.

Paraphrasing a Paragraph

I thought of a different approach. What if I took existing content, automatically paraphrased it, and added my unique angle at the end?

I could take the content that ranked on the first page of Google for my topic and repurpose it as a starting point for my draft. And to simplify the problem, I could start with paraphrasing one paragraph (or sentence) at a time.

And if I could crack this, this would be immediately useful for content writers and college students alike.

This direction seemed promising!
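As a rough sketch of that plan: split a source paragraph into sentences, paraphrase each one, and append my own angle at the end. The function names are hypothetical, the sentence splitter is deliberately naive (it will trip on abbreviations like "e.g."), and `paraphrase` is a placeholder for whichever model ends up doing the rewriting.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def draft_from_source(paragraph, paraphrase, my_angle):
    """Paraphrase each sentence separately, then append the author's own angle."""
    rewritten = [paraphrase(sentence) for sentence in split_sentences(paragraph)]
    return " ".join(rewritten + [my_angle])
```

Passing the paraphraser in as a plain function makes it easy to swap models without touching the rest of the pipeline.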

Syntax-guided Controlled Generation of Paraphrases (SCGP)

I skimmed the latest natural language processing (NLP) research on automatic paraphrasing. The first paper that caught my attention was the Syntax-guided Controlled Generation of Paraphrases (SCGP) paper (2020) by researchers at the Indian Institute of Science, Microsoft Research, and Google.

It worked like this: given an input sentence (the sentence to be paraphrased) and an exemplar sentence, the model would "generate syntax conforming sentences [conforming to the exemplar] while not compromising on relevance". They even included the code and a pre-trained model.

Image from the Syntax-guided Controlled Generation of Paraphrases paper

Big promise from a strong team. Let's give it a try!

And... the output was disappointing. I took the pre-trained model and ran it on the test data they provided (that is, not even using my own content). The results weren't even grammatical sentences. Here's one example:

Input Sentence: why do some people like cats more than dogs ?

Exemplar Sentence: why do some people develop food allergies later in life ?

Output Paraphrase: why do some people prefer dog puppies more than dogs in ?

Fancy approach on paper, but this wouldn't fit my use case.

Word Embedding Attention Network (WEAN)

What about a different approach for automatic paraphrasing? I found a 2018 paper titled "Word Embedding Attention Network" that also provided code and pre-trained models.

The model would use word embeddings (vector representations) to capture the meanings of the words and generate new words by decoding those embeddings. That sounded great in theory.

Image from the Word Embedding Attention Network paper

In practice though, the outputs still weren't good enough. Using their pre-trained model, the outputs resembled correct sentences, but only for very simple phrases. For more meaningful sentences, often the model would not paraphrase at all. That is, the output would look exactly the same as the input. That was a deal-breaker for this use case since I couldn't just plagiarize other people's content.
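One simple way to catch these non-paraphrases automatically is to measure how much of the output overlaps with the input and reject anything too close. This is only a sketch using Python's standard difflib module. The 0.8 threshold and the whitespace tokenizer are assumptions, not something tuned for this post.

```python
from difflib import SequenceMatcher


def tokenize(sentence):
    """Lowercase and split on whitespace (good enough for pre-tokenized text)."""
    return sentence.lower().split()


def similarity(source, candidate):
    """Word-level similarity in [0, 1]; 1.0 means the token sequences are identical."""
    return SequenceMatcher(None, tokenize(source), tokenize(candidate)).ratio()


def is_real_paraphrase(source, candidate, max_similarity=0.8):
    """Reject candidates that mostly copy the source (threshold is an assumption)."""
    return similarity(source, candidate) < max_similarity
```

A check like this only flags copying. It says nothing about whether the output is grammatical or keeps the original meaning.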

Conclusion

This was my first foray into delegating my writing. Many of the approaches I tried sounded great in theory, but the outputs just weren't good enough at the time of writing. Still, I think the direction is promising as automated techniques get more sophisticated.

To this end, there are a few more ideas I can try:

Run the paraphrase model outputs through a grammar-checking API. At least then I would eliminate the grammatical errors.

Train models some more with my custom data, or tweak the model architecture to apply more constraints to the text generation.

Try GPT-3 when I get access to it.

Generate content from underlying structured data.
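To make the last idea concrete, here is a minimal sketch of template-based generation from structured data, using Python's standard string.Template. The template wording, field names, and sample row are all hypothetical. A real setup would need several template variants to avoid repetitive text.

```python
from string import Template

# Hypothetical sentence template; each $field must exist in every data row.
PRODUCT_TEMPLATE = Template(
    "$name is a $category for $audience, starting at $price USD per month."
)


def render_rows(rows, template=PRODUCT_TEMPLATE):
    """Turn each structured record into one sentence.

    Template.substitute raises KeyError if a row is missing a field,
    so incomplete records fail loudly instead of producing broken text.
    """
    return [template.substitute(row) for row in rows]
```

For example, `render_rows([{"name": "Acme Notes", "category": "note-taking app", "audience": "students", "price": 5}])` yields one sentence describing that made-up product.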

Thanks for reading this far.

Hope you enjoyed this post. Let's stay in touch.

Follow @JHLYeung on Twitter