Why Did LLMs Steal Our Em-Dashes?

The em-dash, or "—", is a writing tool which allows for a clearer expression of complex thoughts, and AI seems to think so too. As articulate students, researchers, and other writers attempt to navigate this dashless existence, two questions arise. When are we getting them back? Will we ever?

Upon ChatGPT's release in 2022, I realized that I wrote like AI. My sentences were long, my writing patterns were predictable, and my use of em-dashes was frequent. Initially, I was not concerned: if models are being trained to write like me, I must be doing a good job, right?

Then, however, came the "AI detectors" used by teachers and reviewers. These AI detectors are designed to spot AI-generated text. They measure the predictability of a written work's wording ("perplexity"), the variability of the sentence structure ("burstiness"), and other markers. Essentially, AIs are being used to spot content created by other AIs. At this point, I began to change my writing style. No more back-to-back 20+ word sentences. No more dash-filled phrases, semicolons, or groups of threes; I was not willing to risk being flagged.
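To make those two markers concrete, here is a minimal sketch of how they could be measured. It is an illustration under loud assumptions, not how any real detector works: `burstiness` is taken here to mean the spread of sentence lengths relative to their average, and `unigram_perplexity` scores text against simple word counts from a reference sample, whereas commercial detectors rely on large language models. Both function names are hypothetical.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def sentence_lengths(text):
    """Split on ., ! and ? and return the word count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Assumed definition: standard deviation of sentence length divided by the mean."""
    lengths = sentence_lengths(text)
    if not lengths:
        return 0.0
    return pstdev(lengths) / mean(lengths)


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed word-count model of `reference`.

    Lower values mean the wording is more predictable given the reference.
    """
    ref_words = reference.lower().split()
    counts = Counter(ref_words)
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    words = text.lower().split()
    if not words:
        raise ValueError("text must contain at least one word")
    log_prob = sum(
        math.log((counts[w] + 1) / (len(ref_words) + vocab)) for w in words
    )
    return math.exp(-log_prob / len(words))
```

Under these definitions, a passage whose sentences are all the same length scores a burstiness of zero, and wording that closely matches the reference sample scores a lower perplexity than wording the reference has never seen.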

Returning to the point of the em-dash (and other snubbed marks), there are two key reasons why they, along with words like "delve" and "underscores", are so frequent in AI-generated writing: the training data, and the assessment of said data.

First, let's look at the data collection process for LLMs (large language models). Over 60% of the training data used in early models like GPT-3 came from web crawls, which collect publicly available text off the internet. After the data is collected, it is used to train models to predict language structures and patterns. Most LLMs are trained to predict the next token in a sequence, internalizing patterns in grammar and style along the way. If a particular structure (like the em-dash) appears often enough and isn't adjusted later on, it can become a characteristic aspect of the model's output.
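The idea that a frequent pattern becomes a default can be sketched with a toy predictor. This is a deliberate simplification: real LLMs are neural networks trained on enormous token sequences, while the hypothetical `train_bigrams` and `predict_next` below only count which token follows which and pick the most common one.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count, for every token, how often each other token comes right after it."""
    following = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, token):
    """Return the most frequently seen follower of `token`, or None if unseen."""
    candidates = following.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Toy training text: "idea" is followed by an em-dash twice and a comma once.
EM_DASH = "\u2014"
tokens = ["idea", EM_DASH, "x", "idea", EM_DASH, "y", "idea", ",", "z"]
model = train_bigrams(tokens)
prediction = predict_next(model, "idea")  # the em-dash, since it was seen most often
```

Because the em-dash follows "idea" more often than the comma does in this toy text, the predictor reaches for it every time, which mirrors how an uncorrected pattern can turn into a habit.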

As a result of this pattern-based learning, and the fact that these patterns aren't always corrected, models can take on specific stylistic habits that become hard-wired as their "instinct". As Medium writer Brent Csutoras demonstrated in his failed attempts to prompt it away, the em-dash has become embedded into the output style of today's LLMs.

To be clear, you are not imagining this em-dash overuse. According to Freeburg, an independent researcher, LLMs use em-dashes much more frequently than human writers, including GPT-4.1 in standard essays. Similarly to Csutoras' conclusion, they found that em-dashes were almost entirely resistant to prompt manipulations and user restrictions.

Now, how did no one realize that AIs were learning to use em-dashes so frequently? Some journalists, including The Economist's Alex Hern, believe that the human regulation of model outputs is a key factor. African English uses words like "delve" much more frequently than the internet at large, which may affect the regulators' choices. However, the work of these regulators mostly ties to removing sexist, racist, and other harmful content, not directly altering the linguistic choices of the models.

Initially, I hypothesized that the explanation was tied to the datasets being used to train LLMs. However, after comparing word frequencies in COCA, a text dataset of popular modern media (think Star Trek), and OpenWebText, a set which mimics AI training data, I found that while OpenWebText often "won out" in terms of frequency, the gap wasn't significant.

The em-dash frequency of OpenWebText was so high (1,621.88 uses per million) that I had to remove it from this chart. I have no em-dash reference for COCA and am only drawing conclusions based on words.
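For readers curious how a "uses per million" figure is produced, here is a minimal sketch. It assumes the rate is counted per million words and uses a rough regular-expression word splitter. The helper names are hypothetical, and this is not the exact procedure behind the numbers above.

```python
import re

EM_DASH = "\u2014"  # the em-dash character


def per_million(count, total_words):
    """Scale a raw count to occurrences per million words."""
    if total_words == 0:
        return 0.0
    return count / total_words * 1_000_000


def em_dash_rate(text):
    """Em-dashes per million words, using a rough \\w+ word count."""
    total_words = len(re.findall(r"\w+", text))
    return per_million(text.count(EM_DASH), total_words)


def word_rate(text, word):
    """Occurrences of `word` (case-insensitive) per million words."""
    words = re.findall(r"\w+", text.lower())
    return per_million(words.count(word.lower()), len(words))
```

Normalizing to a common base like this is what lets two corpora of very different sizes be placed on the same chart.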

I then turned to another potential argument: implicit bias, or the internal perceptions and judgements of individuals. Before em-dashes rose back into fame, they were mostly used in prose and other writing spaces that encourage wide vocabularies and creative structuring. Many people didn't know what they were before LLMs began splicing them into their sentences, and given our more regular reliance on short-form content like texts and emails, they didn't need to. In contrast, LLM training involves text where em-dashes are more common than in the average person's consumed media. Bias explains why em-dashes feel so out of place, but not why em-dashes are actually being used more frequently than normal.

The generally accepted hypothesis to explain this overuse ties back to LLMs' training and reinforcement processes. As models learn to predict language patterns, they reproduce the patterns they have seen most often. However, this isn't the only factor determining which patterns get used more often. Models like Claude and ChatGPT have an additional goal with their responses: to provide users with clarity. Em-dashes, which allow for explanatory pauses and the breaking down of complex ideas, serve that goal. As such, LLMs are not only introduced to more em-dashes, but their training also reinforces their usage. This results in em-dashes appearing more frequently than in typical human writing.

So what does this mean in the long term? Personally, I believe that these models will soon reduce their use of em-dashes. Individuals are currently avoiding em-dashes and other AI "red flags", so their overall usage is decreasing. LLMs are trained to replicate the styles of human writers, and as LLMs get more frequent content updates, the decreased use of these writing tools should have an influence on their responses.

The only question is whether we, as writers, will ever go back. This fear of being "caught" has begun to overtake what writing once was: freedom of expression. There are now countless lists of AI "tells", flagging everything from empty questions to the use of writing structures that people were once taught to use. To write "humanly", we have to write less creatively.


Lia Erisson is a second year (U2) Computer Science & Economics student minoring in Physiology. She loves exploring the intersection of technology, wellbeing, and the human experience.

Part of the OSS mandate is to foster science communication and critical thinking in our students and the public. We hope you enjoy these pieces from our Student Contributors and welcome any feedback you may have!
