Web Mining and Applied NLP

This project implements a structured EVTAL pipeline to extract, validate, transform, analyze, and load text data from HTML web pages.

The pipeline is applied to the arXiv abstract page for Attention Is All You Need (Vaswani et al., 2017) — the paper that introduced the Transformer architecture.

Pipeline Stages

  • Extract — fetch raw HTML from the arXiv abstract page and save it locally
  • Validate — confirm required HTML elements (title, authors, abstract, subjects, dateline) are present
  • Transform — extract fields, clean and normalize text using spaCy, engineer NLP features
  • Analyze — compute token frequency distributions and generate word cloud and bar chart visualizations
  • Load — write the analysis-ready DataFrame to a CSV file
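The Validate stage can be sketched with the standard library alone. The class names below reflect the arXiv abstract-page markup the pipeline expects (title, authors, abstract, subjects, dateline); the helper names and the inline HTML snippet are illustrative, not the project's actual code.

```python
from html.parser import HTMLParser

# Required arXiv abstract-page elements, identified by CSS class.
# These class names are an assumption about the arXiv page layout.
REQUIRED_CLASSES = {"title", "authors", "abstract", "subjects", "dateline"}

class ClassCollector(HTMLParser):
    """Collect every CSS class that appears anywhere in the document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())

def validate(html: str) -> set:
    """Return the set of required classes missing from the HTML."""
    parser = ClassCollector()
    parser.feed(html)
    return REQUIRED_CLASSES - parser.classes

# Tiny illustrative snippet standing in for the saved arXiv page
sample = (
    '<h1 class="title">Attention Is All You Need</h1>'
    '<div class="authors">Vaswani et al.</div>'
    '<blockquote class="abstract">...</blockquote>'
)
print(sorted(validate(sample)))  # → ['dateline', 'subjects']
```

An empty result set means every required element is present and the pipeline can proceed to the Transform stage.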

Paper Analyzed

Field      Value
Title      Attention Is All You Need
Authors    Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Subject    Computer Science > Computation and Language
arXiv ID   1706.03762

Key Results

Metric                        Value
Raw abstract word count       166
Token count (after cleaning)  99
Unique token count            79
Type-token ratio              0.798

The cleaning step removes stopwords and punctuation and lowercases the text, reducing the abstract by ~40% (166 words down to 99 tokens). The most frequent tokens (transformer, attention, translation, models, bleu) confirm the paper's focus on a new attention-based architecture benchmarked on machine translation tasks.
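The cleaning and frequency analysis can be approximated with the standard library (the actual pipeline uses spaCy for tokenization and its full stopword list; the tiny stopword set, sample sentence, and helper names below are illustrative only):

```python
import string
from collections import Counter

# A minimal stand-in for spaCy's cleaning: lowercase, strip punctuation,
# drop stopwords. This STOPWORDS set is a tiny illustrative subset.
STOPWORDS = {"the", "a", "an", "is", "on", "of", "to", "and", "we", "in"}

def clean(text: str) -> list:
    """Lowercase, remove punctuation, and filter out stopwords."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in stripped.split() if tok not in STOPWORDS]

# Illustrative sentence, not the actual abstract
abstract = "The Transformer is based solely on attention. Attention replaces recurrence."
tokens = clean(abstract)
freq = Counter(tokens)

# Type-token ratio = unique tokens / total tokens
# (0.798 in the table above comes from 79 / 99)
ttr = len(set(tokens)) / len(tokens)
print(freq.most_common(2), round(ttr, 3))
```

`Counter.most_common` drives both the bar chart (top-N frequencies) and the word cloud weights in the Analyze stage.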

Project Documentation Pages

  • Home - this documentation landing page
  • NLP Evolution - a concise overview of how NLP techniques have evolved
  • Glossary - project terms and concepts