Summary: In natural language processing (NLP), raw text cannot be directly processed by neural networks. We first apply tokenization (breaking text into tokens) and then vectorization (converting tokens into numeric embeddings). This blog walks through the main approaches with code examples.
1. Tokenization
Tokenization breaks text into smaller units (words, subwords, or characters) and maps them to integer IDs.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
texts = ["I love AI", "AI loves Python"]
# Create tokenizer and fit on texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)
print(sequences)
    Output: [[2, 3, 1], [1, 4, 5]]. Each word is replaced by its vocabulary index; Keras assigns indices by frequency, so "ai" (the most frequent word in these texts) gets index 1.
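In general the sequences have different lengths, so before feeding them to a model they are usually padded to a common length with pad_sequences (already imported above). A minimal sketch, assuming a maximum length of 10 to match the embedding model in the next section:
# Pad every sequence to length 10 (zeros are added at the front by default)
padded = pad_sequences(sequences, maxlen=10)
print(padded.shape)
    Output: (2, 10), a rectangular integer matrix that can be batched; index 0 is used only for padding.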
2. Vectorization Approaches
(a) Embedding Layer (Deep Learning Standard)
Keras Embedding layers learn dense vector representations of words while training the model.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D
# vocab_size=1000, embedding_dim=16
model = Sequential([
    Embedding(input_dim=1000, output_dim=16, input_length=10),
    GlobalAveragePooling1D(),
    Dense(1, activation='sigmoid')
])
model.summary()
    The embedding layer maps word IDs to trainable vectors in a 16-dimensional space.
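To see what the layer actually produces, you can call it directly on the padded integer batch from the tokenization example (a quick sketch: padded is the 2×10 array built above, and the vectors are still random because the model has not been trained yet):
# Look up a 16-dimensional vector for every token ID in the batch
token_vectors = model.layers[0](padded)
print(token_vectors.shape)
    Output: (2, 10, 16): batch size × sequence length × embedding dimension.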
(b) Pre-trained Word Embeddings (Word2Vec, GloVe)
Instead of learning embeddings from scratch, you can load pre-trained vectors.
import gensim.downloader as api
# Download (on first use) and load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")
print("Vector for 'AI':", glove["ai"])  # a 50-dimensional numpy array
print("Similarity (AI vs ML):", glove.similarity("ai", "ml"))
    These embeddings capture semantic meaning: words used in similar contexts have similar vectors.
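You can also ask for the nearest neighbours of a word in the embedding space, which makes "similar contexts, similar vectors" concrete (a small illustration; the exact neighbours returned depend on the GloVe model loaded above):
# Three words whose GloVe vectors are closest to "computer" (by cosine similarity)
print(glove.most_similar("computer", topn=3))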
(c) Transformers (Contextual Embeddings)
Models like BERT generate embeddings that depend on the sentence context.
from transformers import AutoTokenizer, AutoModel
import torch
# Load the pre-trained BERT tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
sentence = "AI is transforming business"
# Tokenize the sentence into PyTorch tensors and run a forward pass (no gradients needed)
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Average embeddings of all tokens
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
    Here, each token embedding is contextual: the vector for "AI" differs depending on the sentence.
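To check that the embeddings really are contextual, compare the vector BERT assigns to the same word in two different sentences. A minimal sketch reusing the tokenizer and model loaded above (the helper function and example sentences are illustrative):
def token_vector(sentence, word):
    # Contextual embedding of the (first sub-token of the) given word within the sentence
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    target_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    position = enc["input_ids"][0].tolist().index(target_id)
    return hidden[position]

v1 = token_vector("AI is transforming business", "AI")
v2 = token_vector("The chess AI resigned after thirty moves", "AI")
print(torch.cosine_similarity(v1, v2, dim=0).item())
    The similarity is noticeably below 1.0: the same word gets a different vector in each context.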
3. Why It Matters in Deep Learning
- Classification: sentiment analysis, intent detection
- Sequence modeling: machine translation, summarization
- Semantic search: vector similarity, clustering (see the sketch below)
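As a tiny illustration of the semantic-search idea, sentence vectors like the one computed in section (c) can be compared with cosine similarity (the query and document sentences below are made-up examples, reusing the BERT tokenizer and model from above):
def sentence_vector(text):
    # Mean-pooled BERT embedding, exactly as in the transformer example
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state.mean(dim=1)

query = sentence_vector("How is AI changing companies?")
doc = sentence_vector("AI is transforming business")
print(torch.cosine_similarity(query, doc).item())  # higher score = more related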
🚀 Final Thoughts
Tokenization and vectorization are the foundations of modern NLP. From simple embeddings to powerful transformer-based contextual vectors, these techniques enable deep learning models to truly "understand" human language.