
On N-dimensional Rotary Positional Embeddings

July 26th, 2025 · Jerry Xiong

Fig. 1. Left: golden gate RoPE, right: axial RoPE. Cosine similarities between a fixed query and rotations of that query over varying positions.

RoPE in one dimension

One of the simplest ways of encoding relative positional information in attention is to add a scalar to each of the attention logits, with a value somehow depending on the distance between the corresponding query and key (e.g. learned values in T5, fixed values decreasing with distance in ALiBi).

However, this makes it difficult for a query to attend to any specific (key, relative position) pair. In particular, the query must have a component pointing in the direction of the desired key, but this component increases the attention score for every token with that key, regardless of its position.

The rotary positional embedding (RoPE, Su et al. 2021) solves this problem by rotating the query and key vectors for every token by an angle proportional to the token's 1d coordinate position.

To be specific, each attention head has its \(D\) channel dimensions divided into \(D / 2\) dimension pairs. For a given query or key input vector \(x \in \mathbb R^D\) located at position \(t\), the \(i\)th dimension pair \((x_{2i - 1}, x_{2i})\) is rotated about the origin by an angle \(\omega_i \cdot t\), where \(\omega_i\) is the angular frequency corresponding to \(i\):

\[ \text{RoPE1d} (x, t) = {\small \begin{pmatrix} \cos (\omega_1 t) & -\sin (\omega_1 t) & 0 & 0 & \cdots & 0 & 0 \\ \sin (\omega_1 t) & \cos (\omega_1 t) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos (\omega_2 t) & -\sin (\omega_2 t) & \cdots & 0 & 0 \\ 0 & 0 & \sin (\omega_2 t) & \cos (\omega_2 t) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos (\omega_{D / 2} t) & -\sin (\omega_{D / 2} t) \\ 0 & 0 & 0 & 0 & \cdots & \sin (\omega_{D / 2} t) & \cos (\omega_{D / 2} t) \\ \end{pmatrix} } x. \]

Note that the composition of independent rotations on orthogonal 2d planes is just a higher-dimensional rotation (see this article or this post).

Typically, a range of log-spaced frequencies is selected, where the \(i\)th frequency (for \(i = 1, \ldots, D/2\)) is \[\omega_i = \omega_\text{min} \cdot (\omega_\text{max} / \omega_\text{min}) ^{\frac{i - 1}{D/2 - 1}}.\]
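To make this concrete, here is a minimal sketch of 1d RoPE with log-spaced frequencies (the function name and the interleaved pair layout are illustrative choices of mine, not the code used for the experiments below):

import torch


def rope_1d(x: torch.Tensor, t: torch.Tensor, w_min: float, w_max: float) -> torch.Tensor:
    """Apply 1d RoPE to x of shape (..., D) at positions t of shape (...)."""
    D = x.shape[-1]
    assert D % 2 == 0
    # log-spaced frequencies omega_1, ..., omega_{D/2}
    omega = w_min * (w_max / w_min) ** torch.linspace(0, 1, D // 2)
    theta = t[..., None] * omega                       # (..., D/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]                # the i-th pair is (x_{2i-1}, x_{2i})
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Whether the pairs are interleaved (as in the matrix above) or split into two halves (as in the reference implementations at the end of this post) makes no difference, as long as queries and keys use the same convention.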

Larger frequencies enforce the prior that queries should be more specific about position, which can improve expressivity but hurt generalizability. Similarly, smaller frequencies lead to more invariance w.r.t position, improving generalizability but hurting expressivity. Including a wide range of frequencies (\(\omega_\text{min} \ll \omega_\text{max}\)) enables a wide range of specificities, but reduces the number of query/key dimensions available for use at any particular level of specificity.

Fig. 2. Cosine similarities between a fixed query and the 1d-RoPE-rotated version of itself, over varying positions.

As more frequencies are added, their periodic oscillations cancel each other out, resulting in an attention map concentrated at a specific 1d coordinate position. [1]

Extensions to 2 or more dimensions

The most common extension of RoPE to 2 dimensions used in vision transformer (ViT) implementations today is called axial RoPE, which applies 1d RoPE twice: rotating the first \(D/2\) dimensions of the query/key according to x-position, and the remaining \(D/2\) dimensions according to y-position.
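In terms of the rope_1d sketch above, axial RoPE looks roughly like this (again, an illustration rather than any particular library's implementation):

def axial_rope_2d(x: torch.Tensor, pos: torch.Tensor, w_min: float, w_max: float) -> torch.Tensor:
    """x: (..., D) query/key vectors; pos: (..., 2) holding (x, y) patch coordinates."""
    D = x.shape[-1]
    first_half, second_half = x[..., : D // 2], x[..., D // 2 :]
    # first half rotated according to x-position, second half according to y-position
    out_x = rope_1d(first_half, pos[..., 0], w_min, w_max)
    out_y = rope_1d(second_half, pos[..., 1], w_min, w_max)
    return torch.cat((out_x, out_y), dim=-1)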

Similar to 1d RoPE, 2d axial RoPE also encodes purely relative positional information. However, it does not enable attending to specific (key, relative position) pairs. The first half of a query, which rotates according to x-position, contributes the same amount to the attention score for a key regardless of that key's y-position; likewise, the second half contributes the same amount regardless of x-position. Roughly speaking, attending to a token necessarily means also attending, with at least half the cosine alignment, to any token with a similar key located in the same row or column.

The main insight is that, rather than rotating any particular dimension pair based on only x-position or based on only y-position, the rotations can instead be based on the tokens' positions measured along arbitrary 2d directions.

Last year, Heo et al. in their paper titled "Rotary position embedding for vision transformer" described this idea as part of their proposed 'mixed RoPE' method. However, it seems like many of the various recent works which cite Heo et al., such as SAM 2 and Qwen-Image, are actually still using axial RoPE, based on their official implementations. Maybe this is because it isn't obvious from only reading the abstract of Heo et al. that the paper introduces a novel approach that differs from axial RoPE.

At initialization, unit directions \(\{\mathbf{u}_i\}_{i=1}^{D/2}\), \(\mathbf{u}_i \in \mathbb{R}^2\), \(\|\mathbf{u}_i\|_2 = 1\) are selected for each of the \(D/2\) dimension pairs. During inference, the angle of rotation for the \(i\)th pair is given by the frequency magnitude \(\omega_i\) times the dot product of the token's 2d position with \(\mathbf{u}_i\) (whereas for 1d RoPE, this angle was just \(\omega_i\) times the token's 1d coordinate position).

To be precise, the general N-dimensional RoPE for positions \(\mathbf{t} \in \mathbb R^N\) can be written as

\[ \text{RoPE} (x, \mathbf{t}) = {\small \begin{pmatrix} \cos (\omega_1 \langle \mathbf{u}_1, \mathbf{t} \rangle) & -\sin (\omega_1 \langle \mathbf{u}_1, \mathbf{t} \rangle) & \cdots & 0 & 0 \\ \sin (\omega_1 \langle \mathbf{u}_1, \mathbf{t} \rangle) & \cos (\omega_1 \langle \mathbf{u}_1, \mathbf{t} \rangle) & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \cos (\omega_{D / 2} \langle \mathbf{u}_{D / 2}, \mathbf{t} \rangle) & -\sin (\omega_{D / 2} \langle \mathbf{u}_{D / 2}, \mathbf{t} \rangle) \\ 0 & 0 & \cdots & \sin (\omega_{D / 2} \langle \mathbf{u}_{D / 2}, \mathbf{t} \rangle) & \cos (\omega_{D / 2} \langle \mathbf{u}_{D / 2}, \mathbf{t} \rangle) \\ \end{pmatrix} } x. \]

This includes axial RoPE as a special case where each \(\mathbf{u}_i\) is in the standard basis.

In other words, the \(i\)th dimension pair is rotated by an angle proportional to "the position of the token measured according to the direction \(\mathbf{u}_i\)". By selecting \(\{\mathbf{u}_i\}_{i=1}^{D / 2}\) uniformly from the unit circle rather than constraining them to axis-aligned directions, it turns out that RoPE can also produce concentrated attention maps in 2d!
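A minimal sketch of this general form (the function name and the layout of the frequency matrix are mine; the reference implementations at the end of the post are the practical versions):

def rope_nd(x: torch.Tensor, pos: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """x: (..., D); pos: (..., P); freqs: (D/2, P), where row i is omega_i * u_i."""
    theta = pos @ freqs.T                              # (..., D/2): omega_i * <u_i, t>
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Axial RoPE corresponds to choosing half of the \(\mathbf{u}_i\) to be \((1, 0)\) and the other half to be \((0, 1)\).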

Fig. 3. Cosine similarities between a fixed query and the 2d-RoPE-rotated version of itself over varying positions. Here, the similarities are evaluated on a prefix of the dimension pairs. The attention map becomes gradually more concentrated as the number of frequencies increases.

Selecting frequency directions

Mixed RoPE (Heo et al. 2024) initializes \(\{\mathbf{u}_i\}_{i=1}^{D / 2}\) by sampling uniformly random vectors from the unit circle, then treating the frequency vectors \(\mathbf{f}_i = \omega_i \mathbf{u}_i\) as learnable parameters. However, it's unclear a priori whether learnable frequency vectors are actually beneficial, especially if the frequencies are selected in such a way that it's already possible to query unique positions at initialization.

Intuitively, typical neural network parameters, e.g. the weights of a linear layer, have the property that a gradient step on some training input will generally only induce a large change in the layer's outputs for inputs similar to that training input. In contrast, changes to the frequency vectors of RoPE will nontrivially modify the attention scores between almost all query-key pairs, which could make these frequencies less amenable to gradient optimization.

If these frequency vectors are kept frozen instead, then ideally we would want to initialize them in a more principled, deterministic fashion. In the initial discussion on the EleutherAI discord, Kevin Yin suggested arranging the frequency vectors for each head in order of increasing magnitude, and rotating the \(i\)th vector to an angle of \(i\cdot 2\pi / \varphi\) where \(\varphi = (1 + \sqrt{5}) / 2\) is the golden ratio. This is the approach used in the experiments below, though, as Kevin later pointed out, rotating by \(i \cdot \pi / \varphi\) is likely to perform better. (edit: For a proper explanation about why this works well, check out Kevin's blog post.)
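A small sketch of this direction schedule, mirroring the phi_hF computation in the 2d reference implementation at the end of the post (here the spacing defaults to \(\pi / \varphi\)):

import math


def golden_directions_2d(
    n_heads: int, n_freqs: int, spacing: float = math.pi * (math.sqrt(5) - 1) / 2
) -> torch.Tensor:
    """Unit directions for each (head, frequency) pair, advanced by a fixed golden-ratio angle."""
    phi = torch.arange(n_heads * n_freqs).reshape(n_heads, n_freqs) * spacing
    return torch.stack((torch.cos(phi), torch.sin(phi)), dim=-1)  # (n_heads, n_freqs, 2)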

For embedding positions in more than 2 dimensions, one option that is easy to implement and works well enough is to take samples from \(U(0, 1)\) quasi-randomly via the generalized golden ratio, mapping them to Gaussian samples using the inverse CDF, and then normalizing to length one. An example implementation is provided here. (edit: nor wrote a great blog post discussing better ways to select frequency directions in >2 dimensions.)

NOTE: since the sequence lengths relevant for language modeling are typically much longer than the side length of image inputs for vision transformers, the minimum and maximum frequency magnitudes should be adjusted to compensate. I recommend using a coordinate system with positions normalized to [-1.0, 1.0] and using \(\omega_\text{min}\) between 0.2 and 1.0 and \(\omega_\text{max}\) between 20 and 100.

ViT experiments

Here, I compare learned absolute position embeddings (APE), fixed sinusoidal embeddings (SinCos), axial RoPE, mixed RoPE, LieRE, and golden gate RoPE.

CIFAR10

I trained some small (7M parameter) ViTs with 4x4 patches for 200 epochs on CIFAR10. I searched over \(\omega_\text{min} \in \{0.5, 1.0\}\) with \(\omega_\text{max} = 100 \cdot \omega_\text{min}\). Below, I'm reporting the best validation negative log-likelihood (NLL) and accuracy, mean ± std over 2 seeds, for each approach. See hyperparameters here and code here.

The official implementation of mixed RoPE uses \(\omega_\text{min}\) = 0.65, \(\omega_\text{max}\) = 6.5. (Well, more precisely, they used frequencies from 0.1 to 1.0 when working with integer row/column positions ranging from 0 to 13, but here I'm expressing the frequencies w.r.t positions normalized to [-1.0, 1.0].)
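Concretely, integer positions 0 through 13 span 13 units while positions normalized to [-1.0, 1.0] span 2 units, so the equivalent frequencies in normalized coordinates are larger by a factor of \(13 / 2 = 6.5\): \[ \omega_\text{normalized} = 6.5 \cdot \omega_\text{integer}, \qquad 0.1 \mapsto 0.65, \qquad 1.0 \mapsto 6.5. \]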

CIFAR10 ViT(dim=384, mlp_dim=768, depth=6) / patch 4 @ 200 epochs

| Method | Learned | \(\omega_\text{min}\) | \(\omega_\text{max}\) | Valid NLL (↓) | Valid Accuracy (%) (↑) |
|---|---|---|---|---|---|
| APE | Yes | N/A | N/A | 0.4287 ± 0.0031 | 89.70 ± 0.03 |
| SinCos | No | 1.00 | 100.0 | 0.4144 ± 0.0124 | 89.93 ± 0.45 |
| Axial RoPE | No | 0.50 | 50.0 | 0.3535 ± 0.0018 | 91.95 ± 0.05 |
| Mixed RoPE, original freqs | Yes | 0.65 | 6.5 | 0.3550 ± 0.0072 | 91.63 ± 0.32 |
| Mixed RoPE, adjusted freqs | Yes | 1.00 | 100.0 | 0.3394 ± 0.0015 | 92.43 ± 0.07 |
| LieRE | Yes | N/A | N/A | 0.3461 ± 0.0023 | 91.95 ± 0.04 |
| Golden gate RoPE | No | 1.00 | 100.0 | 0.3292 ± 0.0023 | 92.43 ± 0.05 |

Both absolute position embedding methods performed poorly. Mixed RoPE underperformed when using the frequencies from the official implementation, but after adjusting the frequencies, it outperformed axial RoPE and did about as well as golden gate RoPE. During hyperparameter tuning, axial RoPE was the only approach that benefited from a frequency range of (0.5-50) rather than (1.0-100.0) on CIFAR10. LieRE performed slightly better than axial RoPE when its skew-symmetric parameters were not shared between layers or heads, though this resulted in high memory usage in my implementation compared to the other approaches, and it still underperformed mixed and golden gate RoPE. Golden gate RoPE had the best NLL.

ImageNet-1K

I trained ViT B/16 sized models (86M parameters) on ImageNet-1K for 90 epochs. I broadly used the same data augmentation and preprocessing scheme as Beyer et al. 2022, training at 224x224 with inception cropping and a small amount of RandAugment and MixUp, though with the same architecture and optimizer setup as the CIFAR10 experiments above. Due to resource limits, I only compared the fixed sinusoidal positional embedding, axial RoPE, mixed RoPE, and golden gate RoPE. See hyperparameters here and code here.

ImageNet-1K ViT B/16 @ 90 epochs

| Method | Learned | \(\omega_\text{min}\) | \(\omega_\text{max}\) | Zero freqs | Valid NLL (↓) | Valid Accuracy (%) (↑) |
|---|---|---|---|---|---|---|
| SinCos | No | 1.0 | 100.0 | N/A | 0.8444 | 78.71 |
| Axial RoPE | No | 0.2 | 20.0 | 0 / 16 | 0.8034 | 79.58 |
| " | " | " | " | 8 / 16 | 0.8055 | 79.61 |
| Mixed RoPE | Yes | 0.2 | 20.0 | N/A | 0.8025 | 79.73 |
| Golden gate RoPE | No | 0.2 | 20.0 | 0 / 32 | 0.8064 | 79.67 |
| " | " | " | " | 8 / 32 | 0.7979 | 79.78 |
| " | " | " | " | 16 / 32 | 0.8002 | 79.68 |

I searched over \(\omega_\text{min} \in \{0.2, 0.5, 1.0\}\), and, in contrast to CIFAR10, I found that lower frequency magnitudes (0.2-20.0) performed better for the RoPE approaches on ImageNet. In my initial runs, axial RoPE, mixed RoPE, and golden gate RoPE performed about the same, with mixed RoPE at a slight advantage. From analyzing the learned frequencies of mixed RoPE, I found that a good portion of the learned frequencies decreased to almost zero during training. Based on modded-nanogpt and Barbero et al. 2024, I set 8 / 32 of the frequencies of golden gate RoPE to zero and it outperformed mixed RoPE by a slim margin. (Here, 8 / 32 means that 8 of the 32 unique frequency magnitudes were set to zero; axial RoPE repeats the frequency magnitudes for x and y, so there were only 16 unique frequency magnitudes total.)

Here, I evaluated how well each method generalized to different resolutions at inference time. The models were trained at 224x224 only, and to evaluate at a higher resolution (e.g. 384x384), the positions of each patch were scaled to still span [-1.0, 1.0] (so adjacent patches end up with coordinates that are closer together than before). I also tried scaling the temperature of the softmax to account for the increased token count (as in e.g. here), using a temperature of \(\log(\text{new\_res}^2 / p^2) / \log(\text{old\_res}^2 / p^2)\), where \(p\) is the patch size.
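As a minimal sketch of how such a temperature can be applied (the exact placement in my code may differ; here the factor simply multiplies the attention logits via the scale argument of PyTorch's scaled_dot_product_attention):

import math

import torch.nn.functional as F


def attention_with_temperature(q, k, v, old_res: int, new_res: int, patch_size: int):
    """q, k, v: (N, n_heads, L, head_dim); logits scaled by log(new token count) / log(old token count)."""
    temp = math.log(new_res**2 / patch_size**2) / math.log(old_res**2 / patch_size**2)
    scale = temp / math.sqrt(q.shape[-1])  # the default SDPA scale is 1 / sqrt(head_dim)
    return F.scaled_dot_product_attention(q, k, v, scale=scale)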

ViT B/16 resolution generalization: validation accuracies (%) (vs in-dist @ 224x224)

| Method | Learned | \(\omega_\text{min}\) | \(\omega_\text{max}\) | Zero freqs | 224x224 (in-dist) | 384x384 | 384x384 w/ temp | 512x512 w/ temp |
|---|---|---|---|---|---|---|---|---|
| SinCos | No | 1.0 | 100.0 | N/A | 78.71 | 75.12 (-3.59) | 77.29 (-1.42) | 74.95 (-3.76) |
| Axial RoPE | No | 0.2 | 20.0 | 0 / 16 | 79.58 | 78.31 (-1.27) | 79.57 (-0.01) | 77.18 (-2.40) |
| Mixed RoPE | Yes | 0.2 | 20.0 | N/A | 79.73 | 78.61 (-1.12) | 79.83 (+0.10) | 77.55 (-2.18) |
| Golden gate RoPE | No | 0.2 | 20.0 | 8 / 32 | 79.78 | 79.19 (-0.59) | 80.41 (+0.63) | 79.15 (-0.63) |

Axial RoPE generalized better than the fixed sinusoidal positional embedding, and mixed RoPE generalized better than axial RoPE. Surprisingly, golden gate RoPE generalized better than mixed RoPE, and both mixed RoPE and golden gate RoPE actually had higher validation accuracy at 384x384 when combined with temperature scaling than they had at the training resolution of 224x224.

As a sidenote, I found that thinner vision transformers (dim=384, mlp_dim=768, 15M params) with smaller 8x8 patches and the same depth=12 (due to increased # of patches, approximately equivalent in FLOPs to B/16) resulted in improved performance across the board:

ImageNet-1K ViT(dim=384, mlp_dim=768, depth=12) / patch 8 @ 90 epochs

| Method | Learned | \(\omega_\text{min}\) | \(\omega_\text{max}\) | Valid NLL (↓) | Valid Accuracy (%) (↑) |
|---|---|---|---|---|---|
| SinCos | No | 1.0 | 100.0 | 0.7834 | 79.46 |
| Axial RoPE | No | 0.2 | 20.0 | 0.7393 | 80.48 |
| Mixed RoPE | Yes | 0.2 | 20.0 | 0.7455 | 80.26 |
| Golden gate RoPE | No | 0.2 | 20.0 | 0.7456 | 80.42 |

(These runs used nonzero initialization frequencies only, though I'd expect better performance with some frequencies set to zero.)

Discussion

Overall, it seems like golden gate RoPE was consistently among the best approaches when the frequency magnitudes were properly tuned. Mixed RoPE, with learnable frequencies, also performed well. However, mixed RoPE, despite being learnable, was still sensitive to the initialization magnitudes of the frequencies, and demonstrated poorer generalization to different resolutions at inference time. I would recommend defaulting to golden gate RoPE, and performing at least a small amount of tuning on the RoPE frequency magnitudes regardless of the approach used.

Reference implementations

Golden gate RoPE, 2d (PyTorch)

import math

import torch
from torch import nn


class GoldenGateRoPE2d(nn.Module):
    def __init__(
        self,
        image_size: tuple[int, int],
        n_heads: int,
        head_dim: int,
        min_freq: float,
        max_freq: float,
        p_zero_freqs: float = 0.0,
        direction_spacing: float = math.pi * (math.sqrt(5) - 1) / 2,
    ):
        """
        Args:
            image_size: expected height and width of (patchified) input
            n_heads: number of attention heads
            head_dim: attention head dimensionality
            min_freq, max_freq: lowest and highest nonzero frequency magnitudes
            p_zero_freqs: proportion of frequencies set to 0
            direction_spacing: difference in radians between adjacent directions along
                which position is measured
        
        Dimension key:
            N: batch size
            H: image_size[0]
            W: image_size[1]
            h: n_heads
            d: head_dim
            F: num_freqs == d // 2
        """
        super().__init__()
        assert head_dim % 2 == 0
        assert 0 <= p_zero_freqs <= 1
        n_freqs = head_dim // 2
        n_zero_freqs = round(p_zero_freqs * n_freqs)
        # frequency magnitudes: n_zero_freqs zeros, then log-spaced values from min_freq to max_freq
        omega_F = torch.cat(
            (
                torch.zeros(n_zero_freqs),
                min_freq
                * (max_freq / min_freq) ** torch.linspace(0, 1, n_freqs - n_zero_freqs),
            )
        )
        # one rotation direction per (head, frequency), advanced by a fixed golden-ratio angle
        phi_hF = (
            torch.arange(n_heads * n_freqs).reshape(n_heads, n_freqs)
            * direction_spacing
        )
        directions_hF2 = torch.stack((torch.cos(phi_hF), torch.sin(phi_hF)), dim=-1)
        freqs_hF2 = omega_F.unsqueeze(-1) * directions_hF2

        # patch coordinates scaled to preserve aspect ratio (square inputs span [-1, 1] on both axes)
        H, W = image_size
        xlim, ylim = math.sqrt(W / H), math.sqrt(H / W)
        x_HW = torch.linspace(-xlim, xlim, W).reshape(1, W).expand(H, W)
        y_HW = torch.linspace(-ylim, ylim, H).reshape(H, 1).expand(H, W)
        positions_HW112 = torch.stack((x_HW, y_HW), dim=-1).reshape(H, W, 1, 1, 2)

        theta_HWhF = (freqs_hF2 * positions_HW112).sum(dim=-1)
        self.register_buffer("cos_HWhF", torch.cos(theta_HWhF))
        self.register_buffer("sin_HWhF", torch.sin(theta_HWhF))

    def forward(self, input_NHWhd: torch.Tensor) -> torch.Tensor:
        x_NHWhF, y_NHWhF = input_NHWhd.float().chunk(2, dim=-1)
        x_out_NHWhF = x_NHWhF * self.cos_HWhF - y_NHWhF * self.sin_HWhF
        y_out_NHWhF = x_NHWhF * self.sin_HWhF + y_NHWhF * self.cos_HWhF
        output_NHWhd = torch.cat((x_out_NHWhF, y_out_NHWhF), dim=-1)
        return output_NHWhd.type_as(input_NHWhd)
This implementation assumes a fixed input height/width and precomputes the rotation angles (their cosines and sines).

direction_spacing has a default value of \(\pi / \varphi \approx 1.9416\) here, which should be the best performing value, but note that the experiments above mistakenly used a slightly suboptimal value of \(2\pi / \varphi \approx 3.8832\) instead.

direction_spacing can be set to \(\pi / 2\) to (approximately) emulate axial RoPE.
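A hypothetical usage sketch (shapes follow the dimension key above; the configuration roughly mirrors the ImageNet ViT B/16 setting, but the surrounding attention code is omitted):

rope = GoldenGateRoPE2d(
    image_size=(14, 14),  # 224x224 input with 16x16 patches
    n_heads=12,
    head_dim=64,
    min_freq=0.2,
    max_freq=20.0,
    p_zero_freqs=0.25,    # 8 of the 32 frequency magnitudes set to zero
)
q_NHWhd = torch.randn(2, 14, 14, 12, 64)
k_NHWhd = torch.randn(2, 14, 14, 12, 64)
q_rot, k_rot = rope(q_NHWhd), rope(k_NHWhd)  # rotate queries and keys; values are left unrotated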

Golden gate RoPE, Nd (PyTorch)

import torch
from torch import nn


def _phi(m: int) -> float:
    # generalized golden ratio: the positive root of x ** (m + 1) = x + 1
    # (m = 1 gives the usual golden ratio, 1.618...)
    x = 2.0
    for _ in range(10):
        x = (1 + x) ** (1.0 / (m + 1.0))
    return x


def make_directions(n: int, d: int) -> torch.Tensor:
    """Generate n approximately uniform unit directions in R^d."""
    g = _phi(d)
    # quasi-random samples in [0, 1)^d from the generalized golden ratio (low-discrepancy sequence)
    alpha = (1.0 / g) ** torch.arange(1, d + 1, dtype=torch.float64)
    i = torch.arange(1, n + 1, dtype=torch.float64).unsqueeze(1)
    z = torch.fmod(i * alpha, 1.0)
    # map to Gaussian samples via the inverse CDF (up to a constant factor), then normalize to unit length
    directions = torch.erfinv(2.0 * z - 1.0)
    directions = directions / directions.norm(dim=1, keepdim=True)
    return directions.float()


class GoldenGateRoPENd(nn.Module):
    def __init__(
        self,
        pos_dim: int,
        n_heads: int,
        head_dim: int,
        min_freq: float,
        max_freq: float,
        p_zero_freqs: float = 0.0,
    ):
        """
        Args:
            pos_dim: dimensionality of the token positions
            n_heads: number of attention heads
            head_dim: attention head dimensionality
            min_freq, max_freq: lowest and highest nonzero frequency magnitudes
            p_zero_freqs: proportion of frequencies set to 0

        Dimension key:
            N: batch size
            L: number of tokens per sample
            P: pos_dim
            h: n_heads
            d: head_dim
            F: num_freqs == head_dim // 2
        """
        super().__init__()
        n_freqs = head_dim // 2
        n_zero_freqs = round(p_zero_freqs * n_freqs)
        omega_F = torch.cat(
            (
                torch.zeros(n_zero_freqs),
                min_freq
                * (max_freq / min_freq) ** torch.linspace(0, 1, n_freqs - n_zero_freqs),
            )
        )

        directions_hFP = make_directions(n_heads * n_freqs, pos_dim).reshape(
            n_heads, n_freqs, pos_dim
        )
        self.register_buffer("freqs_hFP", directions_hFP * omega_F.reshape(n_freqs, 1))

    def forward(self, input_NLhd: torch.Tensor, pos_NLP: torch.Tensor) -> torch.Tensor:
        x_NLhF, y_NLhF = input_NLhd.float().chunk(2, dim=-1)
        theta_NLhF = (self.freqs_hFP * pos_NLP[..., None, None, :].float()).sum(dim=-1)
        cos_NLhF = torch.cos(theta_NLhF)
        sin_NLhF = torch.sin(theta_NLhF)
        x_out_NLhF = x_NLhF * cos_NLhF - y_NLhF * sin_NLhF
        y_out_NLhF = x_NLhF * sin_NLhF + y_NLhF * cos_NLhF
        output_NLhd = torch.cat((x_out_NLhF, y_out_NLhF), dim=-1)
        return output_NLhd.type_as(input_NLhd)
This implementation takes the token positions as an additional input to forward().

I recommend decreasing the ratio max_freq / min_freq as the number of positional dimensions increases.
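A hypothetical usage sketch with 2d positions, building a normalized 14x14 patch grid to pass to forward() (values are illustrative):

rope = GoldenGateRoPENd(
    pos_dim=2, n_heads=12, head_dim=64, min_freq=0.2, max_freq=20.0, p_zero_freqs=0.25
)
ys, xs = torch.meshgrid(
    torch.linspace(-1.0, 1.0, 14), torch.linspace(-1.0, 1.0, 14), indexing="ij"
)
pos_NLP = torch.stack((xs, ys), dim=-1).reshape(1, 14 * 14, 2)  # positions normalized to [-1, 1]
q_NLhd = torch.randn(1, 14 * 14, 12, 64)
q_rot_NLhd = rope(q_NLhd, pos_NLP)  # apply the same call to the keys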

Hyperparameters

CIFAR10

Steps 10,000 (200 epochs)
Batch size 1,000
Muon LR 0.06
Muon momentum 0.95
Muon weight decay 0.01
AdamW LR 0.003
AdamW betas (0.9, 0.95)
AdamW weight decay 0.01
LR cooldown start 7,500
Label smoothing 0.1
Patch size 4
dim 384
MLP dim 768
depth 6

ImageNet-1K

Steps 187,650 (90 epochs)
Global batch size 1,024
Muon LR 0.03
Muon momentum 0.95
Muon weight decay 0.01
AdamW LR 0.001
AdamW betas (0.9, 0.95)
AdamW weight decay 0.01
LR cooldown start 150,120 (70 epochs)
Patch size 16
dim 768
MLP dim 3072
depth 12

Acknowledgements

Kevin Yin for suggesting fixed rotations based on the golden ratio, as well as general feedback. Stephen Huan for general feedback.

How to cite

@misc{xiong2025ndrope,
    author = {Jerry Xiong},
    title = {On n-dimensional rotary positional embeddings},
    year = {2025},
    url = {https://jerryxio.ng/posts/nd-rope/}
}
  1. Note that Fig. 2 is only centered about 0 because it shows cosine similarities between a query and rotations of itself, i.e. \(\langle q, \text{RoPE1d}(q, t)\rangle\) over varying \(t\). Creating a plot centered at some arbitrary position \(\tilde{t}\) is as simple as plotting \(\langle q, \text{RoPE1d}(\tilde{q}, t)\rangle\) instead, where \(\tilde{q} = \text{RoPE1d}(q, -\tilde{t})\). If the plot showed cosine similarities between a query and rotations of some arbitrary \(k \in \mathbb R^D\) instead, i.e. \(\langle q, \text{RoPE1d}(k, t)\rangle\) over varying \(t\), then \(k\) could be selected to create a plot centered at any arbitrary position, or even with an arbitrary number of modes. For example, by selecting \(k\) to be the average of multiple rotations of \(q\) at distinct positions, the resulting cosine-similarity curve would be the linear combination of the corresponding curves for each position. This is all to say that RoPE doesn't intrinsically bias attention towards short-horizon temporal interactions; rather, the figures in this post describe how the contribution of a query component pointing in the direction corresponding to some (key, relative position) pair decays as the relative position is shifted away from that value, i.e. how selective the average query is w.r.t. position.
  2. In an earlier version of this post, the suggested approach (selecting frequency directions rotated based on the golden ratio) was called uniform RoPE. Fluffy on the EleutherAI discord suggested the name 'golden gate RoPE'.