
A Model of the Language Process

Brandon Duderstadt & Hayden Helm
With generous support from The Cosmos Institute and Helivan Corp.

Philosophy is a battle against the bewitchment of our intelligence by means of language.

Ludwig Wittgenstein

Introduction

Language is a process. As we use language to communicate, new vocabulary emerges, word meanings shift, and narratives progress. This evolution endows language with an inherent temporal structure. Traditional Large Language Models (LLMs) do not explicitly account for this temporal structure. Instead, they treat all documents in their training data as if they all occurred at once.

In this report, we remedy this by introducing the Temporal Language Model 1 (TLM-1), a BERT style transformer that directly models the language process by jointly learning to predict document contents and classify document dates. We train TLM-1 on a general-purpose monitor corpus of contemporary American English, enabling us to probe it for temporal trends in language relevant to the United States from 1990 to 2019.

To query TLM-1, we introduce a Bayesian framework that disentangles its temporal dynamics from several sources of temporal bias, including base model anachronism and query anachronism. We demonstrate that our framework can recover temporally sensitive relationships that are otherwise hidden when naively evaluating the model's likelihood function. Using our query framework, we investigate several macroscopic trends in the evolution of contemporary American English.

We also perform a geometric analysis of TLM-1's learned time embeddings. TLM-1 frames the document dating problem as multiclass classification, which does not impose any a priori ordinal structure on its time embeddings. Despite this, we show empirically that TLM-1's learned time embeddings recover a 1D curve indexed by time, which we call the temporal control curve. We argue that the existence of the temporal control curve provides evidence that TLM-1 is able to effectively reconstruct temporal language dynamics.

We conclude with a discussion of the implications of our work and possible next steps.

Background

Language Modeling

Early large-scale encoder models like BERT and RoBERTa established the transformer architecture and the masked language modeling (MLM) objective as the dominant encoder pre-training paradigm. More recent variants, such as Etin400M, provide contemporary encoder base models with improved performance and efficient scaling.

Another recent improvement to the encoder training stack is SpanBERT, which introduced span-level masking to better capture multi-token dependencies. This feature is critical for handling words that may fragment into multiple subword units at tokenization time.

One similarity between all of these models is that they treat language as atemporal, modeling their training corpora without regard to when their constituent documents were written.

Temporal Modeling

The literature on temporal language modeling overwhelmingly focuses on the task of detecting changes in word senses. This task, known as diachronic semantic change detection, has been an active area of study for over a decade. Kutuzov et al. provide the most recent survey paper on the field, which details the progression from classical N-Gram models to more sophisticated neural word embedding models. The approaches outlined in the survey are mostly post-hoc; they primarily align and analyze word embeddings after training rather than building temporal sensitivity directly into the training objective.

TempoBERT was the first serious attempt to adapt the self-supervised transformer paradigm to explicitly include document time information. It introduced the idea of time tokens and combined a high rate of time token masking with a conventional masked language modeling objective. In the TempoBERT paper, the authors train several highly specific TempoBERT models for targeted tasks. In one instance, they train a TempoBERT to predict diachronic semantic change in Reddit comments about LiverpoolFC. In another, they train a TempoBERT to date New York Times articles. While the TempoBERT approach is promising, the narrow training data of each TempoBERT instance makes it unsuitable for generalized temporal language modeling.

Training Procedure

TLM-1 is the logical next step in the progression of temporal language modeling. TLM-1 draws heavily on the training procedure of TempoBERT, but opts to train on a general-purpose monitor corpus of American English rather than narrow, domain-specific corpora. This approach preserves the scalability and generality of BERT-style models while explicitly integrating temporal information into the pretraining loss.

Unlike previous approaches, TLM-1 does not assume time sensitivity must be engineered for a narrow domain. Instead, it treats temporal modeling as a first-class, general-purpose model capability.

Loss Function

The goal of TLM-1 is to create a general-purpose model for temporal language tasks by jointly modeling document contents and dates. For this purpose, we begin by considering a general form of the joint temporal-content loss function:

$$\mathcal{L}_{TC} = \lambda_1 \mathcal{L}_{T} + \lambda_2 \mathcal{L}_{C}$$

where $\mathcal{L}_{T}$ is a temporal modeling loss, $\mathcal{L}_{C}$ is a content modeling loss, and $\lambda_1$ and $\lambda_2$ are hyperparameters.

$\mathcal{L}_{TC}$ generalizes several common encoder losses. Let $S$ be the length of the sequence being encoded, $\text{MLM}_{p}$ be the masked language modeling objective with token mask rate $p$, and $\text{SpanMLM}_{p;l}$ be the span masked language modeling objective as presented in SpanBERT with truncated geometric distribution parameters $p$ and $l$. Under these definitions, the RoBERTa, SpanBERT, and TempoBERT loss functions are all instances of $\mathcal{L}_{TC}$, as shown in the table below:

| Name | $\lambda_1$ | $\mathcal{L}_{T}$ | $\lambda_2$ | $\mathcal{L}_{C}$ |
| --- | --- | --- | --- | --- |
| RoBERTa | $0.0$ | N/A | $1.0$ | $\text{MLM}_{p=0.15}$ |
| SpanBERT | $0.0$ | N/A | $1.0$ | $\text{SpanMLM}_{p=0.2;\, l=10}$ |
| TempoBERT | $\frac{1}{S}$ | $\text{MLM}_{p=0.9}$ | $\frac{S-1}{S}$ | $\text{MLM}_{p=0.15}$ |

The TLM-1 loss is also a variant of $\mathcal{L}_{TC}$, and is written:

$$\mathcal{L}_{\text{TLM-1}} = 0.5 \cdot \text{MLM}_{p=0.9} + \text{SpanMLM*}_{p=0.2;\, l=4}$$

Looking at every term of the TLM-1 loss individually will help us gain an intuition for what it endeavors to learn. Let's start with the temporal loss term.

TLM-1's temporal loss $\mathcal{L}_T = \text{MLM}_{p=0.9}$ is the same as the TempoBERT temporal loss. Time tokens are added to the model's vocabulary and masked out at a rate of $0.9$. TLM-1 sets $\lambda_1=0.5$, which significantly upweights the loss from temporal modeling compared to TempoBERT. Empirically, we found that this was necessary for the model to perform sufficiently well on the document dating task.

TLM-1's content loss $\mathcal{L}_C = \text{SpanMLM*}_{p=0.2;\, l=4}$ is similar to the SpanBERT content loss. The decision to use a span-based loss for TLM-1, as opposed to the MLM loss in TempoBERT, is purely mechanical. TempoBERT goes to great lengths to avoid splitting words that are of interest at query time, even going so far as to add all relevant words to its tokenizer before training begins.

The TempoBERT setup is unrealistic if we hope to build a model that is useful for general-purpose historical linguistics, as we don't know a priori what words will be of interest at query time. By using a span-based objective, we provide a training setup where TLM-1 regularly sees short sequences of masked tokens. This is similar to what the model will see if a word of interest gets split into multiple mask tokens at query time.

We make two further modifications to the SpanMLM loss relative to SpanBERT. First, we eliminate the span boundary objective present in SpanBERT for the sake of simplicity. We write $\text{SpanMLM*}$ to denote the SpanBERT loss without the span boundary term. Second, we reduce the maximum of the span length distribution from $10$ to $4$, which we believe more closely aligns with our goal of preparing TLM-1 to cope with words of interest that may be split into multiple tokens.
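
To make the combined objective concrete, below is a minimal PyTorch-style sketch of how the two terms might be combined once the time-token masking (rate $0.9$) and span masking ($p=0.2$, max span $4$) have already been applied by the data pipeline. The function and argument names are illustrative and are not taken from the TLM-1 codebase.

```python
import torch
import torch.nn.functional as F

def tlm1_loss(logits, time_labels, span_labels, lambda_t=0.5, lambda_c=1.0):
    """Sketch of L_TLM-1 = 0.5 * MLM_{p=0.9} + SpanMLM*_{p=0.2, l=4}.

    logits:       (batch, seq_len, vocab) encoder outputs.
    time_labels:  true time-token ids at masked time positions, -100 elsewhere.
    span_labels:  true content-token ids at span-masked positions, -100 elsewhere.
    Positions labeled -100 are ignored by cross entropy, so each term only
    scores the tokens masked under its own scheme.
    """
    flat_logits = logits.view(-1, logits.size(-1))
    # Temporal loss: recover the masked time token (TempoBERT-style, p = 0.9).
    loss_t = F.cross_entropy(flat_logits, time_labels.view(-1), ignore_index=-100)
    # Content loss: recover span-masked content tokens (p = 0.2, max span 4),
    # with SpanBERT's span boundary objective dropped (the * in SpanMLM*).
    loss_c = F.cross_entropy(flat_logits, span_labels.view(-1), ignore_index=-100)
    return lambda_t * loss_t + lambda_c * loss_c
```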

Dataset

We train TLM-1 on the Corpus of Contemporary American English (COCA). COCA is an English monitor corpus that contains 1 billion words of text in dated sequences written between 1990 and 2019. The corpus is composed of eight genres: spoken words, fiction, magazines, newspapers, academic texts, television and movie subtitles, blogs, and other web pages.

There are several important considerations we need to make when modeling the COCA corpus. First, all of the articles in the blog and other web page genres are sequences from 2012, resulting in a large temporal and topical imbalance. We remove these genres from the TLM-1 train set to avoid this imbalance.

Second, every copy of COCA has a "fingerprint" where a small percentage of words in the corpus are replaced with a sequence of 10 @ signs. The goal of this fingerprint is to enable the corpus author to track pirated versions of the corpus back to their original purchaser. We can model this quite naturally by tokenizing the sequences of 10 @ signs as a special [MASK_NOLOSS] token that acts like a mask token but does not receive gradients.

Finally, sequences in COCA are timestamped with yearly granularity. As a result, TLM-1 adds 30 time tokens to its vocabulary, one for each year in COCA. Every sequence in COCA is prepended with its corresponding time token during training.
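
A minimal sketch of this preprocessing is shown below. It assumes a Hugging Face tokenizer; the special-token strings ([MASK_NOLOSS], [1990], ..., [2019]) and the checkpoint name are illustrative placeholders rather than the exact identifiers used for TLM-1.

```python
from transformers import AutoTokenizer

FINGERPRINT = "@" * 10          # COCA fingerprint: a run of ten @ signs
MASK_NOLOSS = "[MASK_NOLOSS]"   # behaves like a mask token but is excluded from the loss
TIME_TOKENS = [f"[{year}]" for year in range(1990, 2020)]  # 30 yearly time tokens

tokenizer = AutoTokenizer.from_pretrained("etin-400m-base")  # placeholder checkpoint name
tokenizer.add_special_tokens(
    {"additional_special_tokens": [MASK_NOLOSS] + TIME_TOKENS}
)

def preprocess(text: str, year: int) -> str:
    """Replace fingerprint runs and prepend the sequence's yearly time token."""
    return f"[{year}] " + text.replace(FINGERPRINT, MASK_NOLOSS)

# At loss time, positions holding MASK_NOLOSS are assigned the label -100
# (PyTorch's ignore_index), so they act like mask tokens but receive no gradient.
```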

Other Optimization Details

After removing the web and blog genres, the COCA corpus contains about 750 million words. This is too little data to train a reasonably sized encoder model from a random initialization; for comparison, the original BERT was trained on about 3.3 billion words. As a result, we use a similar approach to TempoBERT and fine tune a base model using our dataset. We opt for Etin400M, a contemporary encoder architecture with strong performance and efficient scaling, as our base model.

When expanding the Etin400M vocabulary to include time tokens, we found that the default initialization for new tokens invoked by Hugging Face's resize_token_embeddings led to training instability. We were able to remedy this by initializing all time tokens to have the same embedding as the space token.
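
The sketch below shows one way to apply this fix with the Hugging Face API. The checkpoint name and time-token strings are placeholders, and the way the space token is looked up may need to be adapted to the base tokenizer.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("etin-400m-base")   # placeholder checkpoint name
model = AutoModelForMaskedLM.from_pretrained("etin-400m-base")

time_tokens = [f"[{year}]" for year in range(1990, 2020)]
tokenizer.add_special_tokens({"additional_special_tokens": time_tokens})
model.resize_token_embeddings(len(tokenizer))  # default init of the new rows caused instability

# Overwrite each new time-token row with the embedding of the space token.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    space_id = tokenizer.encode(" ", add_special_tokens=False)[0]
    for tok in time_tokens:
        emb[tokenizer.convert_tokens_to_ids(tok)] = emb[space_id]
```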

We optimize our model for 2 epochs using the AdamW optimizer with the default parameters of $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=10^{-8}$. We train our model on a single H100 in bf16 precision to accommodate a batch size of $64$. We use a gradient accumulation of $8$ to reduce gradient variance. We use a linear learning rate schedule that warms up to $10^{-4}$ over 5k steps before linearly decaying back to $0$.
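
For reference, a sketch of this optimization setup using PyTorch and the transformers scheduler helper is below; `model` and `total_steps` are assumed to come from the surrounding training script.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` and `total_steps` are assumed from the surrounding training script.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8
)
# Linear warmup to 1e-4 over 5k steps, then linear decay back to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=5_000, num_training_steps=total_steps
)
# Gradients are accumulated over 8 batches of size 64 before each optimizer step.
```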

Over the course of training, our model loss drops from a peak of ~$11.8$ to a minimum of ~$2.8$. Our final model achieves ~$54\%$ top-1 content infill accuracy and ~$70\%$ top-1 time token infill accuracy. While we believe these metrics can be improved significantly given additional data and compute, our empirical investigations show they are sufficient for practical use of TLM-1.

Crucially, we do not claim the optimality of any part of our training procedure. Extensive ablations would be required to make such claims, and we can only hope to one day have the computing power required to run such ablations.

Query Framework

The traditional query methodology for masked language models involves evaluating the probability of a fill $F$ when given a context $C$, or $P(F|C)$. As a result, it is tempting to query TLM-1 by directly evaluating the probability of a fill $F$ when given a context $C$ and a time $T$, or $P(F|C;T)$. However, this method of querying TLM-1 is vulnerable to several sources of anachronism. To understand why, first apply Bayes' Rule:

$$P(F|C;T) = \frac{P(T|F;C)}{P(T|C)}\, P(F|C)$$

From Bayes' Rule, we see that $P(F|C;T)$ depends heavily on $P(F|C)$, or the prior probability of a fill given a particular context, independent of time. Moreover, temporally imbalanced training datasets or temporally insensitive base models will have a large effect on $P(F|C)$, thereby complicating temporal analysis. For TLM-1 specifically, the anachronism introduced by the temporally insensitive Etin400M base model has a distorting effect on $P(F|C)$. We can correct for this by replacing $P(F|C)$ with a more appropriate prior.

There are two main considerations when choosing a prior. First, we want to ensure that anachronism in the base model or imbalance in the training set do not overwhelm the temporal information in the fill. Second, we do not want to admit fills that have strong temporal relevance but are nonsensical in the provided context. To achieve both of these goals, we adopt the uniform nucleus prior:

$$\tilde P(F|C) = \begin{cases} 1/|\mathcal{F}| & F \in \mathcal{F} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{F}$ denotes a set of feasible fills. In practice, the set of feasible fills can be selected using the top-k fills surfaced by the empirical $P(F|C)$ distribution, or can be set manually if the user is investigating a particular phenomenon.

The figure below shows the direct evaluation and posterior fill probabilities for the context "President [MASK] made a speech today." The naive likelihood function (left) always predicts Obama as the fill, presumably due to training-set imbalance or anachronism introduced by the Etin base model. In contrast, the Bayesian posterior (right) with a uniform nucleus prior over the top $4$ fills recovers the term of each president.

Each term in our Bayesian query framework has a natural interpretation. As previously discussed, $P(F|C)$ models the prior probability of a fill given a context, independent of temporal information. The fraction $P(T|F;C)\,/\,P(T|C)$ is known as the Bayes Factor, and models how we should update our prior distribution on fill probabilities when provided with temporal information.

The numerator of the Bayes Factor, $P(T|F;C)$, models the document date distribution given the complete content of the document. The denominator of the Bayes Factor, $P(T|C)$, models the document date distribution when given context alone. $P(T|C)$ acts as a normalizing factor, enabling users of TLM-1 to query the model for different fill probabilities without worrying about anachronism introduced in the wording of the context itself.
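
The sketch below illustrates one way to implement this query with a Hugging Face masked LM. The checkpoint name, the [MASK] placeholder convention, and the [YYYY] time-token format are assumptions, and the sketch only handles single-token fills. Because the prior is uniform over the nucleus and $P(T|C)$ is constant across fills, the posterior reduces to the normalized $P(T|F;C)$ scores over the feasible fills.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("tlm-1")              # placeholder checkpoint name
mlm = AutoModelForMaskedLM.from_pretrained("tlm-1").eval()

@torch.no_grad()
def posterior_fills(context: str, year: int, k: int = 10) -> dict:
    """P(F|C;T) under a top-k uniform nucleus prior (single-token fills only)."""
    # Prepend a masked time token and mask the fill slot.
    text = tok.mask_token + " " + context.replace("[MASK]", tok.mask_token)
    ids = tok(text, return_tensors="pt")["input_ids"]
    mask_pos = (ids[0] == tok.mask_token_id).nonzero().flatten().tolist()
    time_pos, fill_pos = mask_pos[0], mask_pos[-1]

    # 1) Nucleus: top-k of P(F|C), computed with the time token masked out.
    fill_logits = mlm(input_ids=ids).logits[0, fill_pos]
    nucleus = fill_logits.topk(k).indices.tolist()

    # 2) Bayes Factor numerator P(T|F;C) for each feasible fill; the denominator
    #    P(T|C) is identical for every fill and cancels during normalization.
    year_id = tok.convert_tokens_to_ids(f"[{year}]")      # assumed time-token format
    scores = {}
    for fid in nucleus:
        filled = ids.clone()
        filled[0, fill_pos] = fid
        time_probs = mlm(input_ids=filled).logits[0, time_pos].softmax(-1)
        scores[tok.decode([fid]).strip()] = time_probs[year_id].item()

    total = sum(scores.values())
    return {fill: score / total for fill, score in scores.items()}

# Example: posterior_fills("President [MASK] made a speech today.", 1995, k=4)
```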

With this query methodology at hand, we can now progress to using TLM-1 to quantitatively investigate the language process.

Investigating the Language Process

When using TLM-1 to quantify the language process, we must remember that all linguistics is corpus linguistics. By this, we mean that the language process captured by TLM-1 is fundamentally an artifact of the corpus it was trained on; therefore, we must be careful about extrapolating TLM-1 results to populations whose language may not be captured in COCA. Despite this, we believe that COCA is a sufficiently broad and complete monitor corpus for enabling TLM-1 to capture several trends of interest to the broader computational linguistics community.

The Long Arc

A natural place to begin our investigation of the language process is surfacing posteriors that have an approximately monotonic trend over the entire time period covered by the corpus. We refer to this style of investigation as "Long Arc" investigation, as it captures the most slow-moving but persistent trends in the corpus.

Take, for example, the phrase "I am generally [MASK] about the future." We compute a posterior over its fills using a top-25 uniform nucleus prior for each year and compute the correlation between each fill's posterior share and time. The figure below visualizes the fills with the 5 highest absolute correlations with time:

From this figure, we see that the probability that attitudes about the future are described as "happy," "good," or "comfortable" decreases starkly over time. Conversely, the probability that attitudes about the future are described as "concerned" or "cautious" increases starkly over time. Based on this, we conclude that the general sentiment toward the future in the language process captured by the COCA corpus is becoming more negative over time.
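
As a sketch of this Long Arc procedure, the snippet below reuses the hypothetical posterior_fills helper from the query-framework sketch, computes a yearly posterior for the phrase, and ranks fills by the absolute Pearson correlation (one reasonable choice of correlation measure) between their posterior share and the year.

```python
import numpy as np
from scipy.stats import pearsonr

# Assumes posterior_fills() from the query-framework sketch above.
context = "I am generally [MASK] about the future."
years = list(range(1990, 2020))
posteriors = [posterior_fills(context, year, k=25) for year in years]

fills = sorted({fill for p in posteriors for fill in p})
correlations = {}
for fill in fills:
    shares = [p.get(fill, 0.0) for p in posteriors]
    correlations[fill], _ = pearsonr(shares, years)

# The five fills whose posterior share trends most strongly with time.
top_5 = sorted(correlations, key=lambda f: abs(correlations[f]), reverse=True)[:5]
print({fill: round(correlations[fill], 3) for fill in top_5})
```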

TLM-1 also enables us to query how the language used to describe entities of interest changes over time. Consider the same procedure as above, but applied to the phrase "The United States is a [MASK] country."

We see several key narratives reflected in this posterior. First, we see a stark decrease in the probability that the United States is described as a "young" country, perhaps reflecting the fact that the United States is facing a rapidly aging population.

Second, we see a stark increase in the probability that the United States is described as a "federal" country. This could indicate a broad trend toward centralized and expanded federal power, perhaps through the increasing scope of the executive order.

Third, we see a shift away from describing the United States as "rich" and toward describing it as "powerful." This could indicate that the source of US power is shifting away from economic dominance and toward force projection. We speculate that a potential cause of this could be the increasingly precarious U.S. federal debt situation.

Overall, we believe that these investigations demonstrate how TLM-1 can be used to perform "Long Arc" analysis on temporal corpora.

Diachronic Semantic Change

Another application of TLM-1 is the diachronic semantic change task, which involves detecting if a word changes its meaning over time. We approach this task through the lens of Saussure's linguistic paradigm. In structural linguistics, a paradigm refers to a set of words that can be substituted for a target word in a given phrase. Historically, paradigms were used to reason about how the meaning of a sentence changes as different words within the same paradigm are substituted for the target word.

In our work, we concern ourselves with determining if the target word's paradigm distribution is changing over time. We posit that a change in the paradigm distribution of a word is a sufficient condition for the semantic change of that word.

As an example, we can consider the target word cell, which is known to have undergone a semantic change from 1990 to 2019. Before the introduction of the cell phone, the word cell primarily occurred in biological and physical contexts (e.g., the mitochondria is the powerhouse of the cell; Jean Valjean was locked away in a jail cell). As a result, the pre-2000s paradigms for the word cell will contain primarily biological and physical words (e.g., the mitochondria is the powerhouse of the body; Jean Valjean was locked away in a jail house). After the introduction of the cell phone, the word cell acquired a new sense (e.g., give me a call on my cell phone). The word cell's paradigm distribution shifted as a result of this new sense, and words from its newly acquired paradigm became feasible alternative fills (e.g., give me a call on my mobile phone).

We can computationally model a word's paradigm distribution by tracking the frequency of its alternative fills using TLM-1. To do this, we begin by mining a uniform random sample of uses of a target word from the COCA corpus. Then, for each mined use, we mask the target word and compute a posterior over alternative fills using TLM-1 and a uniform nucleus prior. Next, we aggregate these posteriors into time buckets by using a uniform distribution over documents within a time bucket.

We can organize the results of this process into a matrix $M \in [0, 1]^{|B| \times |F|}$, where $|B|$ is the number of time buckets, and $|F|$ is the size of the set of all feasible alternative fills.

Formally, let $d=(c, t)$ be a particular document containing a context $c$ at time $t$. Let $b_i$ denote the $i$th time bucket, and $P(D=d|B=b_i) = 1/|b_i|$ denote the uniform probability of selecting a particular document $d$ from time bucket $b_i$. Let $f_j$ denote a particular fill in $F$. Then:

$$M_{ij} = \sum_{d = (c, t) \in b_i} P(D=d|B=b_i)\, P(F=f_j|C=c, T=t)$$

$M$ can be interpreted as a weighted bipartite graph $G$, where one set of nodes corresponds to the set of time buckets $B$, another set of nodes corresponds to the set of fills $F$, and the weight of an edge $ij$ corresponds to the probability that a fill $f_j$ is substituted for our target word in time period $b_i$. This interpretation enables us to transform questions about the change in the paradigm distribution of a target word into questions about the community structure of $G$. Intuitively, if $b_i$ and $b_j$ reside in different communities in $G$, then there is evidence that the alternative fill distribution is sensitive to the choice of time bucket. We call $G$ a paradigm graph, and write $G_{\text{target}}$ to indicate the paradigm graph for a particular target word.
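
A minimal sketch of this construction with networkx is below, assuming the matrix M and the bucket and fill labels have already been computed as described above.

```python
import networkx as nx
import numpy as np

def paradigm_graph(M: np.ndarray, buckets: list, fills: list) -> nx.Graph:
    """Build the weighted bipartite paradigm graph G from the |B| x |F| matrix M."""
    G = nx.Graph()
    G.add_nodes_from(buckets, bipartite=0)   # time-bucket nodes, e.g. "cell:1990-1995"
    G.add_nodes_from(fills, bipartite=1)     # alternative-fill nodes
    for i, bucket in enumerate(buckets):
        for j, fill in enumerate(fills):
            if M[i, j] > 0:                  # edge weight = substitution probability
                G.add_edge(bucket, fill, weight=float(M[i, j]))
    return G
```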

We can visualize the structure of a paradigm graph using a force-directed layout. In each of these visualizations, we begin by constructing a paradigm graph for a target word with a sample size of $1000$ and a uniform nucleus prior over the $k=10$ most likely substitutions for each document. We then apply a force-directed layout to the adjacency matrix of $G$ to create a visualization. We limit the visualization to the $15$ most common alternative fills in each time bucket to reduce crowding. We also impose a minimum distance between nodes in the visualization to improve readability. We call the resulting visualization a paradigm map.

The paradigm map for $G_{\text{cell}}$ is shown below:

In the paradigm map above, notice how there are two distinct visual communities; one roughly centered around $\text{cell:1990-1995}$, and another roughly centered around $\text{cell:2015-2020}$. The community centered around $\text{cell:1990-1995}$ contains alternative fills like window, wall, building, blood, bone, and biochemical. From these alternative fills, we can infer that cell was frequently used in paradigms relating to containment and biology from 1990-1995. Contrast this to the community centered around $\text{cell:2015-2020}$, which contains fills like camera, flip, smart, home, tissue, and germ. From these alternative fills, we can infer that cell was frequently used in paradigms relating to technology and biology from 2015-2020. Overall, we can conclude that the paradigm distribution for the word cell has shifted toward more frequent usage in the technological paradigm between 1990 and 2020.

Compare the paradigm map for cell to the paradigm map for the word seven, which contains only a single visual community:

The existence of only one community in the paradigm map for seven indicates that the distribution of paradigms in which seven is used has not changed between 1990 and 2020.

We can formalize this notion of "visual communities" by computing the modularity of our paradigm graphs. Paradigm graphs with a high modularity will exhibit a more prominent community structure, indicating that their target word has undergone a change in its paradigm distribution.

Formally, let $Q: G \to [-0.5, 1)$ be the Clauset-Newman-Moore greedy modularity function. Then, $Q(G_{\text{cell}})=0.1562$ and $Q(G_{\text{seven}})=0.0254$. We interpret these scores as indicating that the word cell has gone through a greater degree of paradigm distribution shift than the word seven from 1990 to 2020.
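
The modularity scores above can be reproduced in spirit with networkx's Clauset-Newman-Moore implementation; the sketch below assumes a paradigm graph G built as in the earlier paradigm_graph sketch.

```python
from networkx.algorithms.community import greedy_modularity_communities, modularity

def paradigm_modularity(G) -> float:
    """Q(G): greedy (Clauset-Newman-Moore) modularity of a weighted paradigm graph."""
    communities = greedy_modularity_communities(G, weight="weight")
    return modularity(G, communities, weight="weight")

# e.g. compare paradigm_modularity(G_cell) against paradigm_modularity(G_seven)
```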

Evaluating Semantic Change Detection

Before drawing conclusions about diachronic semantic change with TLM-1, we first need to establish whether the model can perform the semantic change detection task at all. Unfortunately, as noted by prior work [1] [2] [3], there are no strong, standardized benchmark datasets for this task. This lack of consensus complicates evaluation and makes model comparison difficult.

To address this gap, we manually curate a benchmark dataset tailored to our period of interest (1990–2020). The dataset contains 70 words: 50 positive examples, which we identify as having undergone a sense shift, and 20 negative examples, which we consider semantically stable over the same interval.

We source the list of 50 positive words from a combination of the Oxford English Dictionary and the Collins Online Dictionary. We also provide a short description of each word's sense change, which you can view by accessing this file.

For our negative examples, we exploit the fact that low-limit number words are consistently observed to be semantically stable across long time periods. As such, we use the numerical words "one" through "twenty" as our negative list.

The figure below reports the modularity scores for the paradigm graphs corresponding to each word in our benchmark.

As shown in the figure, the known semantically changing words achieve systematically higher modularities than the numeric words. We interpret this as evidence that TLM-1 is able to perform the semantic change detection task.

Overall, though, we want to reiterate the need for more extensive semantic change detection benchmarks. Based on our early results, we believe that TLM-1 could support this effort by supplying soft labels for a much larger annotated semantic change benchmark.

The Temporal Control Curve

To increase our understanding of TLM-1's learned time representations, we perform an analysis of its learned time token embeddings. Recall that the time token embeddings were all identically initialized, and that the TLM-1 objective treats document dating as a 30-way classification problem. This means that the time token embeddings are not endowed with any a priori geometric structure relating to the linear progression of time.

We start our investigation by building a matrix from the rows corresponding to time tokens in TLM-1's vocabulary matrix. We call this matrix the time token embedding matrix. We use PCA to project the time token embedding matrix into 3D and visualize it in the figure below. Much to our surprise, TLM-1 seems to have learned a geometry that organizes time tokens on a curve according to their actual ordinal temporal order.
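
A sketch of this projection is below; it assumes the fine-tuned model and tokenizer (with the 30 time tokens added) from the earlier sketches and the same illustrative [YYYY] token format.

```python
import numpy as np
from sklearn.decomposition import PCA

# Gather the rows of the input embedding matrix corresponding to the time tokens.
time_ids = [tokenizer.convert_tokens_to_ids(f"[{year}]") for year in range(1990, 2020)]
E = model.get_input_embeddings().weight.detach().cpu().numpy()[time_ids]  # 30 x d

coords_3d = PCA(n_components=3).fit_transform(E)  # used for the 3D visualization
```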

To test whether TLM-1's learned time token geometry does in fact recover an ordinal progression of time, we use Isomap to project the time token embedding matrix to 1D and perform a Kendall-Tau test to measure the ordinal association between the Isomap coordinates and the actual temporal ordering of the time tokens. In particular, we are testing:

  • $H_0$: There is no ordinal association between a time token's Isomap 1D coordinate and its actual temporal ordering.
  • $H_A$: There is an ordinal association between a time token's Isomap 1D coordinate and its actual temporal ordering.

We reject $H_0$ with a p-value of $3 \times 10^{-28}$, and conclude there is overwhelming evidence of an ordinal association between a time token's Isomap 1D coordinate and its actual temporal ordering. This evidence becomes even more apparent when visualizing the Isomap 1D coordinates versus the actual temporal ordering of the time tokens, as shown below:
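
A sketch of this test is shown below, reusing the time token embedding matrix E from the PCA sketch; the Isomap neighborhood size is an illustrative default.

```python
from scipy.stats import kendalltau
from sklearn.manifold import Isomap

# Project the 30 x d time token embedding matrix E to 1D.
coords_1d = Isomap(n_components=1, n_neighbors=5).fit_transform(E).ravel()

# Kendall-Tau test of ordinal association with the true year ordering.
tau, p_value = kendalltau(coords_1d, list(range(1990, 2020)))
print(tau, p_value)  # a small p-value rejects H_0
```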

Overall, we conclude based on our hypothesis test and our qualitative investigation that TLM-1's time tokens are recovering a curve whose geometry recovers the ordinal progression of time. We call this curve the "temporal control curve." Perhaps the most exciting implication of the temporal control curve is that it gestures at a natural procedure for forecasting with TLMs. First, one would fit a principal curve to the time tokens learned by the TLM. Then, this principal curve could be extrapolated to create soft tokens for future time bins. Finally, these soft tokens could be supplied to the TLM as prompts to obtain forecasted posteriors over future language.

The implications of a successful language forecasting procedure are wide reaching; a scaled up TLM could exhibit capabilities including the prediction of election outcomes, black swan events, and geopolitical narratives.

Conclusion

In this report, we introduced TLM-1, a model of the language process that learns by jointly predicting document contents and classifying document dates. We train TLM-1 on a general purpose monitor corpus of American English, and provide a query methodology that enables us to probe TLM-1 for temporal trends in language relevant to the United States from 1990 to 2019. We find that TLM-1 accurately reflects several "long arc" trends in contemporary American English and effectively surfaces semantic changes in word meanings.

Furthermore, an interpretability analysis of TLM-1's time token embeddings reveals that they learn a curve whose geometry recovers the ordinal progression of time. We conjecture that TLMs can be used to forecast the likelihood of future language by extrapolating soft tokens from a fit to this curve.

Despite these promising characteristics, TLM-1 has several drawbacks. TLM-1 is trained on only 750 million words from a single monitor corpus, which prevents it from taking advantage of the blessings of scale. Moreover, TLM-1 is reliant on an anachronistic base model, which complicates temporal analysis and necessitates our Bayesian Query Framework.

We conjecture that these problems can be addressed with a theoretical TLM-2 model. The training set of TLM-2 could minimally include COCA, COHA, NOW, and Wikipedia, which would increase its size from 750 million words to 30+ billion words. This would be enough data to train a GPT-2-sized TLM-2. Further, if we were to date the sequences in the RedPajama V2 dataset, we would have enough data to train a GPT-3-sized TLM-2 model.

Beyond scaling, there are several concrete procedural, architectural, and data composition questions that TLM-1 does not address. Procedurally, the rate of time masking seems to be a particularly important parameter that is worth studying via ablation. We conjecture that the time token masking rate controls a tradeoff where a high masking rate improves the model's document dating ability while a low masking rate improves the model's ability to perform naive temporally sensitive fills.

Architecturally, there may be utility in exploring decoder-only variants of TLM-1. One challenge of a decoder variant, in particular, is how to integrate time masking. If the time token remains as the first token in the sequence, a decoder model's causal attention structure would prevent context information from being used to fill it when masked. Further, if the time token is not the first token in the sequence, then tokens preceding it will not receive any temporal information in their fills. As a result of these challenges, the task of integrating time tokens into a decoder variant of TLM will require careful consideration and experimentation.

From a data composition standpoint, the challenges of creating an improved dataset for TLM-2 are similar to the challenges associated with creating any monitor corpus. A corpus of sufficient size for a GPT-2- or GPT-3-class model will necessarily require heterogeneous data sources, which could lead to topical skew if the sources are not integrated correctly. Moreover, there is inherent temporal bias in digital monitor corpora because of the increasing rate of digital data production. This creates a recency bias, as a model that spends more capacity modeling recent phenomena will perform well on the outsized share of recent data in the dataset. Combating both of these biases remains an open challenge in both the language modeling and monitor corpus construction communities.

We believe that the capabilities of an idealized TLM are well worth the effort of solving these challenges. An idealized TLM would be an incredibly valuable resource for the historical linguistics community, as it would provide a unified interface for interacting with monitor corpora.

Furthermore, we believe that an idealized TLM will have nontrivial prediction capabilities, and conjecture that these capabilities can be accessed by forecasting the Temporal Control Curve. Since language models are multitask learners, we conjecture that an ideal TLM must learn to model several underlying temporal processes that lead to the production of the language in the corpus. We believe that an idealized TLM could forecast the likelihood of various statements arising in the future by inherently forecasting the dynamics of this ensemble of underlying processes. As a result, an idealized TLM could forecast the probability of future events, including election outcomes, black swan events, and geopolitical narratives.

Overall, we believe that TLM-1 makes an important first step towards an idealized TLM that can realize this vast array of capabilities.