Summary: In natural language processing (NLP), raw text cannot be directly processed by neural networks. We first apply tokenization (breaking text into tokens) and then vectorization (converting tokens into numeric embeddings). This blog walks through the main approaches with code examples.
1. Tokenization
Tokenization breaks text into smaller units (words, subwords, or characters) and maps them to integer IDs.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
texts = ["I love AI", "AI loves Python"]
# Create tokenizer and fit on texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
    Output: [[2, 3, 1], [1, 4, 5]]. Each word is replaced by its vocabulary index; Keras assigns indices by frequency, so "ai" (the most frequent word in these texts) gets index 1.
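In general the sequences have different lengths, so before feeding them to a model they are usually padded to a common length with pad_sequences (already imported above). A minimal sketch, assuming a maximum length of 10 to match the embedding model in the next section:
# Pad every sequence to length 10 (zeros are added at the front by default)
padded = pad_sequences(sequences, maxlen=10)
print(padded.shape)
    Output: (2, 10), a rectangular integer matrix that can be batched; index 0 is used only for padding.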
2. Vectorization Approaches
(a) Embedding Layer (Deep Learning Standard)
Keras Embedding layers learn dense vector representations of words while training the model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D
# vocab_size=1000, embedding_dim=16
model = Sequential([
    Embedding(input_dim=1000, output_dim=16, input_length=10),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')
])
model.summary()
    The embedding layer maps word IDs to trainable vectors in a 16-dimensional space.
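To see what the layer actually produces, you can call it directly on the padded integer batch from the tokenization example (a quick sketch: padded is the 2×10 array built above, and the vectors are still random because the model has not been trained yet):
# Look up a 16-dimensional vector for every token ID in the batch
token_vectors = model.layers[0](padded)
print(token_vectors.shape)
    Output: (2, 10, 16): batch size × sequence length × embedding dimension.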
(b) Pre-trained Word Embeddings (Word2Vec, GloVe)
Instead of learning embeddings from scratch, you can load pre-trained vectors.
import gensim.downloader as api
# Download (on first use) and load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")
print("Vector for 'AI':", glove["ai"])  # a 50-dimensional numpy array
print("Similarity (AI vs ML):", glove.similarity("ai", "ml"))
    These embeddings capture semantic meaning: words used in similar contexts have similar vectors.
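You can also ask for the nearest neighbours of a word in the embedding space, which makes "similar contexts, similar vectors" concrete (a small illustration; the exact neighbours returned depend on the GloVe model loaded above):
# Three words whose GloVe vectors are closest to "computer" (by cosine similarity)
print(glove.most_similar("computer", topn=3))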
(c) Transformers (Contextual Embeddings)
Models like BERT generate embeddings that depend on the sentence context.
from transformers import AutoTokenizer, AutoModel
import torch
# Load the pre-trained BERT tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
sentence = "AI is transforming business"
# Tokenize the sentence into PyTorch tensors and run a forward pass (no gradients needed)
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Average embeddings of all tokens
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
    Here, each token embedding is contextual: the vector for "AI" differs depending on the sentence.
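To check that the embeddings really are contextual, compare the vector BERT assigns to the same word in two different sentences. A minimal sketch reusing the tokenizer and model loaded above (the helper function and example sentences are illustrative):
def token_vector(sentence, word):
    # Contextual embedding of the (first sub-token of the) given word within the sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    target_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    position = enc["input_ids"][0].tolist().index(target_id)
    return hidden[position]

v1 = token_vector("AI is transforming business", "AI")
v2 = token_vector("The chess AI resigned after thirty moves", "AI")
print(torch.cosine_similarity(v1, v2, dim=0).item())
    The similarity is noticeably below 1.0: the same word gets a different vector in each context.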
3. Why It Matters in Deep Learning
- Classification: sentiment analysis, intent detection
- Sequence modeling: machine translation, summarization
- Semantic search: vector similarity, clustering (see the sketch below)
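As a tiny illustration of the semantic-search idea, sentence vectors like the one computed in section (c) can be compared with cosine similarity (the query and document sentences below are made-up examples, reusing the BERT tokenizer and model from above):
def sentence_vector(text):
    # Mean-pooled BERT embedding, exactly as in the transformer example
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state.mean(dim=1)

query = sentence_vector("How is AI changing companies?")
doc = sentence_vector("AI is transforming business")
print(torch.cosine_similarity(query, doc).item())  # higher score = more related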
🚀 Final Thoughts
Tokenization and vectorization are the foundations of modern NLP. From simple embeddings to powerful transformer-based contextual vectors, these techniques enable deep learning models to truly "understand" human language.