Consider the following two documents:
Document 1: ML and DL are part of AI.
Document 2: DL is a subset of ML.
Implement all four steps of the Bag of Words (BoW) model to create a document vector table. Depict the outcome of each step.
Generated by claude-sonnet-4-6 · 2026-06-21 03:19 · grounding rag
Model Answer
Step 1: Text Processing (Pre-processing)
Remove stop words (and, are, is, a, of) and punctuation, and convert all words to lowercase.
- Doc 1: [ml, dl, part, ai]
- Doc 2: [dl, subset, ml]
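
The pre-processing step can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names, and the stop-word list is assumed to contain only the five words dropped in this example.

```python
import re

# Assumed stop-word list: only the words dropped in this example.
STOP_WORDS = {"and", "are", "is", "a", "of"}

def preprocess(text):
    """Lowercase the text, drop punctuation, then remove stop words."""
    # findall keeps runs of letters/digits, so the trailing "." is discarded.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

doc1 = preprocess("ML and DL are part of AI.")
doc2 = preprocess("DL is a subset of ML.")
print(doc1)  # ['ml', 'dl', 'part', 'ai']
print(doc2)  # ['dl', 'subset', 'ml']
```
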
Step 2: Create a Dictionary (Vocabulary)
List all unique words from both documents:
| ml | dl | part | ai | subset |
|----|----|------|----|--------|
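
One possible way to build this dictionary in code is shown below. The sketch assumes the token lists from Step 1 as input and keeps words in order of first appearance so the columns match the table above; `build_vocabulary` is a hypothetical helper name.

```python
# Token lists from Step 1 (typed in directly so the sketch is self-contained).
docs = [["ml", "dl", "part", "ai"], ["dl", "subset", "ml"]]

def build_vocabulary(tokenized_docs):
    """Collect each unique word once, in order of first appearance."""
    vocab = []
    seen = set()
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

vocabulary = build_vocabulary(docs)
print(vocabulary)  # ['ml', 'dl', 'part', 'ai', 'subset']
```
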
Step 3: Create Document Vector for Doc 1
| ml | dl | part | ai | subset |
|----|----|------|----|--------|
| 1 | 1 | 1 | 1 | 0 |
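
The vector for Doc 1 can be computed by counting how often each dictionary word occurs in the document. The following is a rough sketch using `collections.Counter` from the Python standard library; `vectorize` is a hypothetical helper name.

```python
from collections import Counter

vocabulary = ["ml", "dl", "part", "ai", "subset"]
doc1_tokens = ["ml", "dl", "part", "ai"]

def vectorize(tokens, vocab):
    """Return the count of each vocabulary word in the token list."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur, e.g. "subset" in Doc 1.
    return [counts[word] for word in vocab]

doc1_vector = vectorize(doc1_tokens, vocabulary)
print(doc1_vector)  # [1, 1, 1, 1, 0]
```
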
Step 4: Create Document Vectors for All Documents
| Document | ml | dl | part | ai | subset |
|----------|----|----|------|----|--------|
| Doc 1 | 1 | 1 | 1 | 1 | 0 |
| Doc 2 | 1 | 1 | 0 | 0 | 1 |
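
Repeating the count for every document gives the full document vector table. The sketch below assumes the dictionary from Step 2 and the token lists from Step 1; the `tokenized` and `table` names are illustrative only. It prints one labelled row per document.

```python
from collections import Counter

vocabulary = ["ml", "dl", "part", "ai", "subset"]
tokenized = {
    "Doc 1": ["ml", "dl", "part", "ai"],
    "Doc 2": ["dl", "subset", "ml"],
}

# One row per document: the frequency of every vocabulary word in that document.
table = {}
for name, tokens in tokenized.items():
    counts = Counter(tokens)
    table[name] = [counts[word] for word in vocabulary]

# Print the table with a header row and labelled document rows.
print("| Document | " + " | ".join(vocabulary) + " |")
for name, row in table.items():
    print("| " + name + " | " + " | ".join(str(c) for c in row) + " |")
```
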
Source: Chapter 6, Section 6.5 – Bag of Words
---
Explanation
- Examiners award 1 mark per step, so label each step clearly.
- Step 1 must show stop-word removal ("and", "are", "is", "a", "of") and lowercasing, since these are the pre-processing actions visible to the examiner.
- Step 2: the dictionary contains only unique words (5 words here).
- Steps 3 & 4 must show the actual frequency table; since no word repeats within a single document here, all values are 0 or 1.
- Always label the rows as Doc 1 / Doc 2; the labels show that you understand "document vectors."