This is a viewer only at the moment see the article on how this works.
To update the preview hit Ctrl-Alt-R (or ⌘-Alt-R on Mac) or Enter to refresh. The Save icon lets you save the markdown file to disk
This is a preview from the server running through my markdig pipeline
Sunday, 09 November 2025
If you've been following this blog, you know I'm a bit obsessed with machine translation. I've written about using EasyNMT, building background translation services, and even improving EasyNMT. But I realized I've never actually explained how neural machine translation works under the hood.
NOTE: This is part of my experiments with AI / a way to spend $1000 Calude Code Web credits. I've fed this a BUNCH of papers, my understanding, questions I had to generate this article. It's fun and fills a gap I haven't seen filled anywhere else.
So let's fix that! This post will take you from "translation is magic" to "translation is clever math" by explaining the key concepts behind neural machine translation in plain English (with lots of diagrams and some C# code to make it concrete).
By the end of this post, you'll understand:
Don't worry if you're not a math person - I'll keep this as practical and visual as possible. Let's dive in!
Here's the first mind-bending concept: neural machine translation treats translation as a purely mathematical process. You feed in a sentence, the system converts it to numbers, performs millions of mathematical operations, and out pops a translation.
graph LR
A[Source Sentence] --> B[Convert to Numbers]
B --> C[Neural Network Processing]
C --> D[Convert to Words]
D --> E[Translated Sentence]
But wait - how do you convert words to numbers? And how does math "understand" language? Let's start with the building blocks.
Before we can understand neural machine translation, we need to understand artificial neural networks. Despite the fancy name, they're actually pretty simple at their core.
An artificial neuron is a mathematical function that:
Here's a visual representation:
graph LR
I1[Input 1] -->|weight: 0.8| N[Neuron]
I2[Input 2] -->|weight: -0.3| N
I3[Input 3] -->|weight: 0.5| N
N --> O[Output]
style N stroke-width:4px
Let's make this concrete with some C# code:
public class Neuron
{
private double[] weights;
private double bias;
public double Activate(double[] inputs)
{
// Step 1: Multiply each input by its weight and sum them
double sum = bias;
for (int i = 0; i < inputs.Length; i++)
{
sum += inputs[i] * weights[i];
}
// Step 2: Apply activation function (tanh keeps values between -1 and 1)
return Math.Tanh(sum);
}
}
The activation function (in this case Math.Tanh) is what makes neural networks interesting. It introduces non-linearity, allowing the network to learn complex patterns. Without it, no matter how many neurons you stacked together, you'd just have a fancy linear function.
The magic happens when you connect thousands or millions of these neurons together in layers:
graph LR
subgraph Input Layer
I1[Input 1]
I2[Input 2]
I3[Input 3]
end
subgraph Hidden Layer 1
H1[Neuron 1]
H2[Neuron 2]
H3[Neuron 3]
H4[Neuron 4]
end
subgraph Hidden Layer 2
H5[Neuron 5]
H6[Neuron 6]
H7[Neuron 7]
end
subgraph Output Layer
O1[Output 1]
O2[Output 2]
end
I1 --> H1 & H2 & H3 & H4
I2 --> H1 & H2 & H3 & H4
I3 --> H1 & H2 & H3 & H4
H1 --> H5 & H6 & H7
H2 --> H5 & H6 & H7
H3 --> H5 & H6 & H7
H4 --> H5 & H6 & H7
H5 --> O1 & O2
H6 --> O1 & O2
H7 --> O1 & O2
Each layer "refines" the representation of the input, extracting increasingly abstract patterns. In translation:
Here's the clever part: we don't manually set all those weights. The network learns them from examples!
graph TD
A[Start with Random Weights] --> B[Feed in Training Example]
B --> C[Network Makes Prediction]
C --> D{Is Prediction Correct?}
D -->|No| E[Calculate Error]
E --> F[Adjust Weights Slightly]
F --> B
D -->|Yes| G[Try Next Example]
G --> B
style E stroke-width:2px
style F stroke-width:2px
This process is called gradient descent. For each wrong answer, the network figures out "which weights were most responsible for this error?" and tweaks them a tiny bit. After seeing millions of examples, the weights settle into values that work well.
Here's a simplified version in C#:
public class SimpleNeuralNetwork
{
private double learningRate = 0.01;
private Neuron[] neurons;
public void Train(TrainingExample[] examples, int epochs)
{
for (int epoch = 0; epoch < epochs; epoch++)
{
foreach (var example in examples)
{
// Forward pass: make a prediction
double prediction = Predict(example.Input);
// Calculate error
double error = example.ExpectedOutput - prediction;
// Backward pass: adjust weights
// (simplified - real networks use backpropagation)
AdjustWeights(error * learningRate);
}
}
}
}
Real neural networks use a more sophisticated algorithm called backpropagation that efficiently calculates how to adjust weights throughout the entire network, but the principle is the same: learn from mistakes.
Now we hit our first major challenge: how do we feed words into a neural network? We need to convert text into numbers.
The naive approach is "one-hot encoding" - assign each word a unique number:
// Don't do this!
var wordToNumber = new Dictionary<string, int>
{
{"cat", 1},
{"dog", 2},
{"king", 3},
{"queen", 4},
{"man", 5},
{"woman", 6}
};
But this has a huge problem: the numbers are arbitrary. The network can't tell that "cat" and "dog" are more similar than "cat" and "queen". The numbers 1 and 2 don't mean anything.
Instead, we represent each word as a vector of numbers, typically 300-1000 dimensions. Similar words get similar vectors:
public class WordEmbedding
{
// Each word is represented by a vector of floats
private Dictionary<string, float[]> embeddings;
public float[] GetEmbedding(string word)
{
return embeddings[word]; // e.g., [0.2, -0.5, 0.8, 0.1, ...]
}
// Calculate similarity between two words
public double Similarity(string word1, string word2)
{
var vec1 = GetEmbedding(word1);
var vec2 = GetEmbedding(word2);
// Cosine similarity: how "aligned" are the vectors?
return CosineSimilarity(vec1, vec2);
}
private double CosineSimilarity(float[] a, float[] b)
{
double dot = 0, magA = 0, magB = 0;
for (int i = 0; i < a.Length; i++)
{
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}
}
Here's what makes embeddings magical - they capture semantic relationships:
graph TD
subgraph "2D Projection of 300D Space"
King[King<br/>0.8, 0.6]
Queen[Queen<br/>0.75, 0.55]
Man[Man<br/>0.3, 0.2]
Woman[Woman<br/>0.25, 0.15]
Dog[Dog<br/>-0.6, 0.4]
Cat[Cat<br/>-0.65, 0.38]
end
King -.similar.-> Queen
Man -.similar.-> Woman
Dog -.similar.-> Cat
King -.same vector.-> Man
Queen -.same vector.-> Woman
The famous example: king - man + woman ≈ queen
public float[] AnalogicalReasoning(string a, string b, string c)
{
// king - man + woman = ?
var vecA = GetEmbedding(a); // king
var vecB = GetEmbedding(b); // man
var vecC = GetEmbedding(c); // woman
var result = new float[vecA.Length];
for (int i = 0; i < vecA.Length; i++)
{
result[i] = vecA[i] - vecB[i] + vecC[i];
}
// Find the word closest to this vector
return FindClosestWord(result); // Should return "queen"
}
Embeddings are learned by training a neural network on a simple task: "predict the surrounding words". The idea is that words appearing in similar contexts should have similar meanings.
graph LR
A["The cat sat on the mat"] --> B[Train Neural Network]
B --> C["cat → [0.1, 0.5, -0.3, ...]"]
B --> D["sat → [0.2, -0.1, 0.4, ...]"]
B --> E["mat → [0.15, 0.45, -0.25, ...]"]
After training on billions of sentences, words that appear in similar contexts end up with similar embeddings. "Cat" and "dog" both appear near "pet", "feed", "cute", so their embeddings end up close together.
Here's a critical insight: when translating "The cat sat on the mat" to French, different source words matter for different target words:
graph TD
subgraph English
E1[The]
E2[cat]
E3[sat]
E4[on]
E5[the]
E6[mat]
end
subgraph French
F1[Le]
F2[chat]
F3[s'est assis]
F4[sur]
F5[le]
F6[tapis]
end
E2 -. focus 90% .-> F2
E3 -. focus 70% .-> F3
E4 -. focus 80% .-> F4
E6 -. focus 85% .-> F6
E1 -. focus 30% .-> F1
Attention is a mechanism that lets the network "focus" on different parts of the input when generating each output word.
Think of attention as asking: "To translate this word, which source words should I pay attention to?"
public class AttentionMechanism
{
// Calculate attention weights for each source word
public double[] CalculateAttention(
float[] currentTargetState, // Where we are in translation
float[][] sourceWordStates) // All source words
{
int sourceLength = sourceWordStates.Length;
double[] scores = new double[sourceLength];
// Step 1: Calculate relevance scores
for (int i = 0; i < sourceLength; i++)
{
scores[i] = DotProduct(currentTargetState, sourceWordStates[i]);
}
// Step 2: Convert to probabilities (softmax)
return Softmax(scores);
}
public float[] ApplyAttention(
double[] attentionWeights,
float[][] sourceWordStates)
{
// Create weighted average of source words
int dim = sourceWordStates[0].Length;
float[] result = new float[dim];
for (int i = 0; i < sourceWordStates.Length; i++)
{
for (int j = 0; j < dim; j++)
{
result[j] += (float)(attentionWeights[i] * sourceWordStates[i][j]);
}
}
return result;
}
private double[] Softmax(double[] scores)
{
double[] result = new double[scores.Length];
double sum = 0;
for (int i = 0; i < scores.Length; i++)
{
result[i] = Math.Exp(scores[i]);
sum += result[i];
}
for (int i = 0; i < scores.Length; i++)
{
result[i] /= sum;
}
return result;
}
private double DotProduct(float[] a, float[] b)
{
double sum = 0;
for (int i = 0; i < a.Length; i++)
{
sum += a[i] * b[i];
}
return sum;
}
}
The attention mechanism:
When translating "The agreement on the European Economic Area was signed in August 1992" to German, the attention looks like this:
graph TD
subgraph "English (Source)"
E1[The]
E2[agreement]
E3[on]
E4[the]
E5[European]
E6[Economic]
E7[Area]
end
subgraph "German (Target)"
G1[Das]
G2[Abkommen]
G3[über]
G4[den]
G5[Europäischen]
end
E1 -->|0.8| G1
E2 -->|0.9| G2
E3 -->|0.6| G3
E4 -->|0.3| G3
E5 -->|0.85| G5
E6 -->|0.7| G5
When generating "Abkommen" (agreement), the network pays 90% attention to "agreement", 5% to "the", and 5% distributed among other words.
Modern transformers use self-attention: words in the same sentence attend to each other to build better representations.
graph LR
subgraph "Self-Attention for 'bank'"
B[bank]
R[river]
W[water]
F[fish]
end
B -.high attention.-> R
B -.high attention.-> W
B -.medium attention.-> F
In "The fish swam near the river bank", the word "bank" attends strongly to "river", "fish", and "water", helping the network understand it means "riverbank" not "financial institution".
Now we can put it all together! Neural machine translation uses an encoder-decoder architecture:
graph LR
subgraph Encoder
E1[Word Embeddings] --> E2[Self-Attention Layer 1]
E2 --> E3[Self-Attention Layer 2]
E3 --> E4[Self-Attention Layer N]
E4 --> E5[Contextual Representations]
end
subgraph Decoder
D1[Previous Translations] --> D2[Self-Attention]
D2 --> D3[Cross-Attention to Encoder]
D3 --> D4[Feed Forward]
D4 --> D5[Next Word Prediction]
end
E5 -.provides context.-> D3
The encoder's job is to read the source sentence and build rich representations:
public class TransformerEncoder
{
private WordEmbedding wordEmbedding;
private SelfAttentionLayer[] layers;
public float[][] Encode(string[] sourceWords)
{
// Step 1: Convert words to embeddings
float[][] embeddings = sourceWords
.Select(w => wordEmbedding.GetEmbedding(w))
.ToArray();
// Step 2: Add positional encoding (so network knows word order)
float[][] withPositions = AddPositionalEncoding(embeddings);
// Step 3: Apply multiple self-attention layers
float[][] representations = withPositions;
foreach (var layer in layers)
{
representations = layer.Forward(representations);
}
return representations; // Contextual representations for each word
}
private float[][] AddPositionalEncoding(float[][] embeddings)
{
// Add position-specific patterns so network knows word order
// (transformers don't naturally understand sequence order)
int sequenceLength = embeddings.Length;
int embeddingDim = embeddings[0].Length;
for (int pos = 0; pos < sequenceLength; pos++)
{
for (int i = 0; i < embeddingDim; i++)
{
double angle = pos / Math.Pow(10000, (2.0 * i) / embeddingDim);
// Use sine for even dimensions, cosine for odd
embeddings[pos][i] += (float)(i % 2 == 0 ? Math.Sin(angle) : Math.Cos(angle));
}
}
return embeddings;
}
}
After encoding, each source word has a representation that captures:
The decoder generates the translation one word at a time:
public class TransformerDecoder
{
private WordEmbedding targetEmbedding;
private SelfAttentionLayer[] selfAttentionLayers;
private CrossAttentionLayer[] crossAttentionLayers;
private FeedForwardLayer[] feedForwardLayers;
public string[] Decode(float[][] encodedSource, int maxLength)
{
List<string> translation = new List<string>();
translation.Add("<START>"); // Special token to begin
while (translation.Count < maxLength)
{
// Get next word
string nextWord = GenerateNextWord(encodedSource, translation.ToArray());
if (nextWord == "<END>") break; // Stop token
translation.Add(nextWord);
}
return translation.Skip(1).ToArray(); // Remove <START> token
}
private string GenerateNextWord(float[][] encodedSource, string[] partialTranslation)
{
// Step 1: Embed the partial translation
float[][] targetEmbeddings = partialTranslation
.Select(w => targetEmbedding.GetEmbedding(w))
.ToArray();
// Step 2: Self-attention on target words
float[][] selfAttended = ApplySelfAttention(targetEmbeddings);
// Step 3: Cross-attention to source (this is where translation happens!)
float[][] crossAttended = ApplyCrossAttention(selfAttended, encodedSource);
// Step 4: Feed forward
float[] finalState = ApplyFeedForward(crossAttended[^1]); // Last position
// Step 5: Predict next word
return PredictWord(finalState);
}
private string PredictWord(float[] state)
{
// Convert state to probability distribution over all possible words
Dictionary<string, double> wordProbabilities = CalculateWordProbabilities(state);
// Return most likely word (or sample from distribution)
return wordProbabilities.OrderByDescending(kv => kv.Value).First().Key;
}
}
The complete translation process:
sequenceDiagram
participant Input as Source Sentence
participant Encoder
participant Decoder
participant Output as Translation
Input->>Encoder: "The cat sat"
Encoder->>Encoder: Build representations
Encoder->>Decoder: Encoded states
Decoder->>Decoder: Generate [START]
Decoder->>Output: Emit START token
Decoder->>Decoder: Attend to "The" → Generate "Le"
Decoder->>Output: "Le"
Decoder->>Decoder: Attend to "cat" → Generate "chat"
Decoder->>Output: "chat"
Decoder->>Decoder: Attend to "sat" → Generate "s'est assis"
Decoder->>Output: "s'est assis"
Decoder->>Decoder: Generate [END]
Output->>Output: "Le chat s'est assis"
Training a translation model requires three things:
public class TranslationTrainer
{
private TransformerEncoder encoder;
private TransformerDecoder decoder;
private double learningRate = 0.0001;
public void Train(ParallelCorpus corpus, int epochs)
{
foreach (var epoch in Enumerable.Range(0, epochs))
{
double totalLoss = 0;
int batchCount = 0;
foreach (var batch in corpus.GetBatches(batchSize: 32))
{
// Forward pass
var predictions = new List<string[]>();
var losses = new List<double>();
foreach (var pair in batch)
{
// Encode source
var encoded = encoder.Encode(pair.Source);
// Try to decode target
var predicted = decoder.Decode(encoded, pair.Target.Length);
// Calculate loss (how different is prediction from target?)
double loss = CalculateLoss(predicted, pair.Target);
losses.Add(loss);
}
// Backward pass: adjust weights
double avgLoss = losses.Average();
UpdateWeights(avgLoss);
totalLoss += avgLoss;
batchCount++;
}
Console.WriteLine($"Epoch {epoch}: Average Loss = {totalLoss / batchCount}");
}
}
private double CalculateLoss(string[] predicted, string[] target)
{
// Cross-entropy loss: how far off were our word predictions?
double loss = 0;
for (int i = 0; i < Math.Min(predicted.Length, target.Length); i++)
{
if (predicted[i] != target[i])
{
loss += 1.0; // Simplified - real loss is more nuanced
}
}
return loss / target.Length;
}
private void UpdateWeights(double loss)
{
// Backpropagation: adjust all weights in encoder and decoder
// to reduce the loss (simplified here)
// Real implementation uses automatic differentiation
}
}
Training takes enormous amounts of data and computation:
graph TD
A[10M Sentence Pairs] --> B[Initial Random Weights]
B --> C[Epoch 1: Loss = 5.2]
C --> D[Epoch 2: Loss = 3.8]
D --> E[Epoch 3: Loss = 2.1]
E --> F[Epoch 10: Loss = 0.8]
F --> G[Epoch 20: Loss = 0.3]
G --> H[Trained Model!]
style A stroke-width:2px
style H stroke-width:4px
For a state-of-the-art model:
Let's trace through translating "The cat sat on the mat" to French:
graph TD
subgraph "1. Encoding"
A1[The] --> E1[emb: 0.2, -0.1, ...]
A2[cat] --> E2[emb: 0.5, 0.3, ...]
A3[sat] --> E3[emb: -0.1, 0.4, ...]
A4[on] --> E4[emb: 0.1, -0.2, ...]
A5[the] --> E5[emb: 0.2, -0.1, ...]
A6[mat] --> E6[emb: 0.4, 0.2, ...]
end
subgraph "2. Self-Attention in Encoder"
E1 --> SA[Self-Attention]
E2 --> SA
E3 --> SA
E4 --> SA
E5 --> SA
E6 --> SA
SA --> C1[ctx: 0.3, 0.1, ...]
SA --> C2[ctx: 0.6, 0.4, ...]
SA --> C3[ctx: -0.2, 0.5, ...]
end
subgraph "3. Decoding"
D1[START] --> G1[Le]
C2 -.attend.-> G1
G1 --> G2[chat]
C2 -.attend.-> G2
G2 --> G3[s'est assis]
C3 -.attend.-> G3
end
Step 1: Encode "The"
var theEmbedding = encoder.GetEmbedding("The");
// [0.2, -0.1, 0.3, 0.05, ..., 0.1] (300 dimensions)
Step 2: Encode "cat"
var catEmbedding = encoder.GetEmbedding("cat");
// [0.5, 0.3, -0.2, 0.4, ..., 0.15] (300 dimensions)
Step 3: Self-Attention
// "cat" attends to other words
var catAttention = attention.CalculateAttention(catEmbedding, allWordEmbeddings);
// [0.1, 0.3, 0.2, 0.05, 0.1, 0.25]
// High attention to "sat" (0.3) and "mat" (0.25)
var catContextual = attention.ApplyAttention(catAttention, allWordEmbeddings);
// Weighted average incorporating context
Step 4: Decoder starts with
var decoderState = decoder.InitialState();
Step 5: Generate "Le"
// Attend to source
var sourceAttention = crossAttention.Calculate(decoderState, encodedSource);
// [0.8, 0.05, 0.05, 0.02, 0.05, 0.03]
// Strong focus on "The" (0.8)
var nextWordProbs = decoder.PredictNextWord(decoderState, sourceAttention);
// {"Le": 0.85, "La": 0.08, "Les": 0.04, ...}
var firstWord = "Le";
Step 6: Generate "chat"
decoderState = decoder.UpdateState(decoderState, "Le");
var sourceAttention = crossAttention.Calculate(decoderState, encodedSource);
// [0.05, 0.9, 0.02, 0.01, 0.01, 0.01]
// Strong focus on "cat" (0.9)
var nextWordProbs = decoder.PredictNextWord(decoderState, sourceAttention);
// {"chat": 0.92, "chien": 0.03, ...}
var secondWord = "chat";
Step 7: Continue until
Final translation: "Le chat s'est assis sur le tapis"
The transformer architecture (the "T" in ChatGPT!) became dominant because:
graph LR
subgraph "Old: Recurrent Neural Networks"
R1[Word 1] --> R2[Word 2]
R2 --> R3[Word 3]
R3 --> R4[Word 4]
end
subgraph "New: Transformers"
T1[Word 1] -.attend.-> T2[Word 2]
T1 -.attend.-> T3[Word 3]
T1 -.attend.-> T4[Word 4]
T2 -.attend.-> T3
T2 -.attend.-> T4
T3 -.attend.-> T4
end
Recurrent networks process sequentially (slow!), while transformers process all words at once (fast!).
Now that you understand how it works, here's how you'd actually use it in a .NET application:
public class TranslationService
{
private readonly HttpClient _httpClient;
private readonly string _nmtServiceUrl;
public TranslationService(HttpClient httpClient, IConfiguration config)
{
_httpClient = httpClient;
_nmtServiceUrl = config["NMT:ServiceUrl"];
}
public async Task<TranslationResult> TranslateAsync(
string text,
string sourceLang,
string targetLang)
{
var request = new TranslationRequest
{
Text = new[] { text },
SourceLang = sourceLang,
TargetLang = targetLang
};
var response = await _httpClient.PostAsJsonAsync(
$"{_nmtServiceUrl}/translate",
request);
response.EnsureSuccessStatusCode();
var result = await response.Content
.ReadFromJsonAsync<TranslationResponse>();
return new TranslationResult
{
Original = text,
Translated = result.Translated[0],
SourceLanguage = sourceLang,
TargetLanguage = targetLang,
TranslationTime = result.TranslationTime
};
}
}
public class TranslationRequest
{
[JsonPropertyName("text")]
public string[] Text { get; set; }
[JsonPropertyName("source_lang")]
public string SourceLang { get; set; }
[JsonPropertyName("target_lang")]
public string TargetLang { get; set; }
}
public class TranslationResponse
{
[JsonPropertyName("translated")]
public string[] Translated { get; set; }
[JsonPropertyName("translation_time")]
public double TranslationTime { get; set; }
}
Usage:
// In your controller or service
public class BlogPostController : ControllerBase
{
private readonly TranslationService _translator;
public async Task<IActionResult> TranslatePost(int postId, string targetLang)
{
var post = await _blogService.GetPostAsync(postId);
var translatedTitle = await _translator.TranslateAsync(
post.Title,
"en",
targetLang);
var translatedContent = await _translator.TranslateAsync(
post.Content,
"en",
targetLang);
return Ok(new
{
Title = translatedTitle.Translated,
Content = translatedContent.Translated,
OriginalLanguage = "en",
TargetLanguage = targetLang
});
}
}
NMT models have input length limits (typically 512-1024 tokens). Solution: chunking!
public async Task<string> TranslateLongText(string longText, string targetLang)
{
const int maxChunkSize = 500; // characters
// Split on paragraph boundaries
var paragraphs = longText.Split(new[] { "\n\n", "\r\n\r\n" },
StringSplitOptions.RemoveEmptyEntries);
var translatedParagraphs = new List<string>();
foreach (var paragraph in paragraphs)
{
if (paragraph.Length <= maxChunkSize)
{
var result = await _translator.TranslateAsync(paragraph, "en", targetLang);
translatedParagraphs.Add(result.Translated);
}
else
{
// Split long paragraph into sentences
var sentences = SplitIntoSentences(paragraph);
var translatedSentences = new List<string>();
foreach (var sentence in sentences)
{
var result = await _translator.TranslateAsync(sentence, "en", targetLang);
translatedSentences.Add(result.Translated);
}
translatedParagraphs.Add(string.Join(" ", translatedSentences));
}
}
return string.Join("\n\n", translatedParagraphs);
}
Markdown, HTML, and other markup can confuse NMT models:
public async Task<string> TranslateMarkdown(string markdown, string targetLang)
{
// Extract text from markdown while preserving structure
var doc = Markdig.Markdown.Parse(markdown);
var textSegments = new List<(string text, int position)>();
// Walk the AST and extract translatable text
foreach (var node in doc.Descendants())
{
if (node is LiteralInline literal)
{
var text = literal.Content.ToString();
if (!string.IsNullOrWhiteSpace(text) && !IsImagePath(text))
{
textSegments.Add((text, literal.Span.Start));
}
}
}
// Translate all segments
var translations = await Task.WhenAll(
textSegments.Select(async seg => new
{
seg.position,
translated = (await _translator.TranslateAsync(seg.text, "en", targetLang)).Translated
}));
// Reconstruct markdown with translations
var result = markdown;
foreach (var translation in translations.OrderByDescending(t => t.position))
{
result = result.Remove(translation.position, textSegments
.First(s => s.position == translation.position).text.Length)
.Insert(translation.position, translation.translated);
}
return result;
}
Translating one sentence at a time is slow. Batch them!
public async Task<Dictionary<string, string>> TranslateBatch(
IEnumerable<string> texts,
string targetLang)
{
const int batchSize = 32;
var results = new Dictionary<string, string>();
foreach (var batch in texts.Chunk(batchSize))
{
var request = new TranslationRequest
{
Text = batch.ToArray(),
SourceLang = "en",
TargetLang = targetLang
};
var response = await _httpClient.PostAsJsonAsync(
$"{_nmtServiceUrl}/translate",
request);
var result = await response.Content
.ReadFromJsonAsync<TranslationResponse>();
for (int i = 0; i < batch.Length; i++)
{
results[batch[i]] = result.Translated[i];
}
}
return results;
}
Understanding the computational costs helps you architect better solutions:
graph TD
A[Translation Request] --> B{Model Size}
B -->|Small 100M params| C[Fast: 50-100ms]
B -->|Medium 500M params| D[Medium: 200-500ms]
B -->|Large 1B+ params| E[Slow: 1-3 seconds]
A --> F{Hardware}
F -->|CPU| G[Slow: 2-5x slower]
F -->|GPU| H[Fast: Baseline]
F -->|TPU/Special AI chips| I[Very Fast: 2-3x faster]
style C stroke-width:2px
style E stroke-width:2px
Typical performance (per sentence on GPU):
CPU vs GPU: GPU is 2-10x faster for NMT
Batch processing: Can achieve 100+ sentences/second with batching
Neural machine translation might seem like magic, but it's actually an elegant combination of several clever ideas:
The amazing thing is that modern NMT systems don't have any explicit rules about grammar, syntax, or translation equivalents. They learn everything from examples!
If you're working with NMT in production (as I do for this blog), understanding these principles helps you:
The field is still evolving rapidly. Models get bigger, faster, and better every year. But the core concepts - embeddings, attention, encoder-decoder - remain fundamental.
Now when you hit "translate" on your blog posts, you'll know exactly what's happening under the hood! 🚀
If you want to dive deeper:
© 2025 Scott Galloway — Unlicense — All content and source code on this site is free to use, copy, modify, and sell.