To publish online and remain anonymous, boffins from Bulgaria and Qatar advise being mediocre. And if you can't manage that on your own, they have a technique to make your prose less scintillating.
Distinctive writing tends to point to a specific author. That's what stylometry, the study of linguistic patterns, aims to reveal.
Once the domain of literature professors and forensic experts, stylometry has become a competency of computers, which turn out to be adept at digesting text samples and analyzing them for specific characteristics.
Since 2011, at an annual conference called PAN, researchers have been assessing author identification techniques alongside author obfuscation techniques, an evaluation that echoes the back-and-forth pattern seen in cybersecurity research.
A handful of these number crunchers – Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, and Ivan Koychev from Sofia University and Preslav Nakov from Qatar Computing Research Institute – have released a paper, to be presented at PAN @ CLEF 2017 in September, describing improvements to techniques put forth last year.
People posting anonymously online expect to remain anonymous, but that's generally unrealistic, the researchers observe, because there are so many ways to track users online. Even without obvious ways to link anonymous posts to data associated with a known user – IP addresses, user names, and the like – written text often contains clues to an author's identity.
Countering stylometric analysis has its own set of challenges, however.
"Unlike authorship attribution or author profiling, this is not a simple text classification problem but rather a complex text generation task, where not only the author’s style has to be hidden, but the text needs to remain grammatically correct and the original meaning has to be preserved as much as possible," the paper states.
The researchers note that techniques like cycling text through a series of machine translations, from one language to another and then back to the original, tend to produce nonsensical word salad. They're not too keen on selective word substitution either.
The authors' previous attempt to obscure the authorship of sample texts performed well in terms of safety – protecting the author from forensic analysis – but lagged in sensibility – not calling attention to itself as an attempt to conceal authorship.
One of the original passages:
I am proud. Though I carry my love with me to the tomb, he shall never, never know it.
PAN 2016 text:
myself ’m proud in them, and though myself carry my beloved with me to the tomb he shall ever ever know it.
As can be seen from the example above, last year's transformation stands out as odd.
The researchers' revised approach reads better. This particular sample may feature only minor variations on the original text, but if it can defy stylometric analysis, it has accomplished its job.
PAN 2017 text:
I’m proud of them; and though I carry my beloved with me to the tomb he shall ever ever know it.
The revised approach, which introduces a way to limit the magnitude of text changes, aims for more mediocrity. Described in more detail in the paper, the technique "pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative."
That is, the transformed text scores closer to the average on metrics such as the number of nouns and the punctuation-to-word ratio.
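To make "closer to the average" concrete, here is a minimal Python sketch. It is not the researchers' code: the three features, the function names, and above all the "average" values are assumptions made up for illustration, and a real system would measure its averages over a large corpus. The sketch scores a text on a few crude stylometric measures, then picks whichever candidate rewrite sits nearest the assumed average.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few crude style measures (illustrative, not the paper's feature set)."""
    # Words: runs of ASCII letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Punctuation: every ASCII punctuation character, apostrophes included.
    punctuation = [ch for ch in text if ch in string.punctuation]
    # Sentences: non-empty chunks between runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    return {
        "punct_per_word": len(punctuation) / n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }


def distance_from_average(features: dict[str, float],
                          average: dict[str, float]) -> float:
    """Sum of relative deviations from the (assumed) average profile."""
    return sum(abs(features[k] - average[k]) / average[k] for k in average)


def closest_to_average(candidates: list[str],
                       average: dict[str, float]) -> str:
    """Return the candidate rewrite whose features are nearest the average."""
    return min(
        candidates,
        key=lambda t: distance_from_average(stylometric_features(t), average),
    )


# ASSUMPTION: invented averages, purely for demonstration.
ASSUMED_AVERAGE = {
    "punct_per_word": 0.15,
    "avg_word_len": 4.5,
    "words_per_sentence": 18.0,
}

if __name__ == "__main__":
    original = ("I am proud. Though I carry my love with me to the tomb, "
                "he shall never, never know it.")
    score = distance_from_average(stylometric_features(original), ASSUMED_AVERAGE)
    print(f"distance from assumed average: {score:.3f}")
```

A lower distance means the text looks more like everyone else's on these measures, which is the sense in which the characteristics become "less discriminative."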
"Overall, we can conclude that our method with transformation magnitude is promising and performs well (better than the three systems that participated in the PAN-2016 Author Obfuscation task) in terms of sensibility and soundness," they explain. "In future work, we need to study how it performs in terms of safety." ®