
NLP Portfolio — Venkat Teja Nallamothu

Web Mining and Applied Natural Language Processing

Northwest Missouri State University · 2026


1. NLP Techniques Implemented

Across six modules, I implemented a full progression of NLP techniques, from environment setup and word clouds to multi-stage EVTAL (Extract, Validate, Transform, Analyze, Load) pipelines with frequency analysis and lexical feature engineering.

| Technique | Module(s) | Implementation Detail |
| --- | --- | --- |
| Environment setup & tooling | nlp-01 | Configured spaCy en_core_web_sm, virtual environment, and Jupyter notebooks |
| Word cloud generation | nlp-01, nlp-02 | Frequency-weighted visual output using the wordcloud library |
| Tokenization | nlp-02, nlp-03, nlp-06 | Word-level splitting via spaCy tokenizer and str.split() |
| Text normalization | nlp-02, nlp-06 | Lowercasing (str.lower()), punctuation removal (str.maketrans()), whitespace collapse (re.sub(r'\s+', ' ', ...)) |
| Stopword removal | nlp-02, nlp-06 | Filtered using spaCy token.is_stop; the full cleaning pass cut the nlp-06 abstracts by about 32–40% |
| Frequency analysis | nlp-03, nlp-06 | Unigram counts via collections.Counter; top-20 rankings logged and visualized |
| Co-occurrence / bigram analysis | nlp-03 | Context-window co-occurrence and bigram frequency across a structured multi-category corpus |
| Corpus exploration | nlp-03 | Token comparisons across categories (dog, cat, truck, car); global vs. per-category token rankings |
| JSON API extraction | nlp-04 | EVTL pipeline against jsonplaceholder.typicode.com/posts; raw JSON → validated → structured CSV |
| HTML web scraping | nlp-05, nlp-06 | BeautifulSoup tag selectors (h1.title, div.authors, blockquote.abstract, div.dateline) on arXiv pages |
| Metadata extraction | nlp-05 | Extracted sentence_count, avg_word_length, author_count, PDF URL, version count from arXiv HTML |
| Type-token ratio (TTR) | nlp-06 | unique_tokens / total_tokens; 0.798 for Attention Is All You Need, 0.917 for Agents of Chaos |
| Feature engineering | nlp-05, nlp-06 | Derived abstract_word_count, token_count, unique_token_count, type_token_ratio, author_count |
| Visualization | nlp-01, nlp-06 | Horizontal bar charts (matplotlib) and word clouds (viridis colormap, 800×400px) |
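
The normalization and stopword steps listed in the table can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the module code itself: the small STOPWORDS set is a hypothetical stand-in for spaCy's token.is_stop, and the function name clean_tokens is an assumption.

```python
import re
import string

# Hypothetical stand-in for spaCy's token.is_stop; the real stopword list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on", "we"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, then drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    return [tok for tok in collapsed.split(" ") if tok and tok not in STOPWORDS]


sample = "The Transformer is based   solely on attention mechanisms."
print(clean_tokens(sample))
# ['transformer', 'based', 'solely', 'attention', 'mechanisms']
```

Removing punctuation before splitting keeps tokens such as "mechanisms." from being counted separately from "mechanisms".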

2. Systems and Data Sources

| Module | Source | Format | What Was Analyzed |
| --- | --- | --- | --- |
| nlp-01 | Web content | HTML | General web text; first spaCy word cloud |
| nlp-02 | Local text files | Plain text | Text records in data/; preprocessing pipeline |
| nlp-03 | Structured corpus | Plain text | Multi-category corpus (dog, cat, truck, car) for comparative token analysis |
| nlp-04 | JSONPlaceholder API (/posts) | JSON | 100 synthetic post objects; validated field structure before CSV export |
| nlp-05 | arXiv: Disentangling cosmic distance tensions (2604.08530) | HTML | Academic abstract; sentence count, avg word length, metadata fields |
| nlp-06 | arXiv: Attention Is All You Need (1706.03762) and Agents of Chaos (2602.20021) | HTML | Full EVTAL pipeline; token frequency, TTR, visualizations |

Handling variable structure:

- JSON APIs required null-safe key traversal for optional fields
- HTML pages required structural validation before extraction to prevent silent field corruption (missing div.authors or blockquote.abstract)
- Plain text required whitespace normalization to remove HTML-encoded artifacts before tokenization
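
As one illustration of the null-safe key traversal mentioned above, the sketch below walks nested dictionaries and falls back to a default when a level is missing or null. The helper name get_nested and the sample record are assumptions made for this example, not code from the modules.

```python
from typing import Any


def get_nested(record: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts key by key; return default if a level is missing or None."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
        if current is None:
            return default
    return current


# Hypothetical record with a null optional field and a missing branch.
post = {"id": 1, "title": "hello", "author": {"name": None}}

print(get_nested(post, "title"))                       # hello
print(get_nested(post, "author", "name", default=""))  # (empty string)
print(get_nested(post, "meta", "tags", default=[]))    # []
```

Returning a default instead of raising KeyError lets the transform stage build rows with a consistent set of columns even when optional fields are absent.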


3. Pipeline Structure (EVTL and EVTAL)

Every module from nlp-04 onward followed an explicit EVTL or EVTAL architecture. The nlp-06 pipeline is the most complete implementation:

Extract → Validate → Transform → Analyze → Load

| Stage | File | Source → Sink |
| --- | --- | --- |
| Extract | stage01_extract.py | HTTP GET with custom User-Agent headers → data/raw/teja_raw.html |
| Validate | stage02_validate_teja.py | Raw HTML → BeautifulSoup; checks h1.title, div.authors, blockquote.abstract, div.subheader, div.dateline |
| Transform | stage03_transform_teja.py | Validated soup → Pandas DataFrame; raw extraction (3.1), text cleaning (3.2), feature engineering (3.3) |
| Analyze | stage04_analyze_teja.py | DataFrame → teja_top_tokens.png, teja_wordcloud.png; frequency table to project.log |
| Load | stage05_load.py | DataFrame → data/processed/teja_processed.csv (13 columns) |

Configuration is separated into config_case.py and config_teja.py — each defines PAGE_URL, request headers, and output paths — so the same pipeline logic runs against different sources without code changes.
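
The config separation described above might look roughly like the sketch below. PipelineConfig, plan_stages, the User-Agent string, and the example file paths are all hypothetical names chosen for illustration, and the source does not state which paper each real config file targets, so the pairing here is only an example. The code prints a stage plan rather than making any network request.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical stand-in for the values a config module would define."""
    page_url: str
    raw_path: Path
    processed_path: Path
    headers: dict = field(
        default_factory=lambda: {"User-Agent": "nlp-portfolio-example/0.1"}
    )


def plan_stages(config: PipelineConfig) -> list[str]:
    """Describe each stage's source and sink without touching the network."""
    return [
        f"extract: GET {config.page_url} -> {config.raw_path}",
        f"validate: {config.raw_path} -> parsed HTML",
        "transform: parsed HTML -> DataFrame",
        "analyze: DataFrame -> charts and frequency table",
        f"load: DataFrame -> {config.processed_path}",
    ]


# Two illustrative configs; only the values differ, the stage logic is shared.
CONFIG_A = PipelineConfig(
    page_url="https://arxiv.org/abs/1706.03762",
    raw_path=Path("data/raw/example_a_raw.html"),
    processed_path=Path("data/processed/example_a_processed.csv"),
)
CONFIG_B = PipelineConfig(
    page_url="https://arxiv.org/abs/2602.20021",
    raw_path=Path("data/raw/example_b_raw.html"),
    processed_path=Path("data/processed/example_b_processed.csv"),
)

for config in (CONFIG_A, CONFIG_B):
    for line in plan_stages(config):
        print(line)
```

Because the stage functions only read from the config object, switching sources means passing a different config rather than editing stage code.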

Earlier pipeline evolution:

| Module | Pipeline Type | Key Addition |
| --- | --- | --- |
| nlp-04 | EVTL (JSON) | First structured pipeline; JSON API → validated fields → CSV |
| nlp-05 | EVTL (HTML) | HTML scraping added; richer metadata extraction |
| nlp-06 | EVTAL (HTML) | Analyze stage added; spaCy NLP features + visualizations |

4. Signals and Analysis Methods

Word Frequency (Unigram)

Unigram counts were computed with collections.Counter on the cleaned token lists. For Attention Is All You Need, the top tokens were transformer, attention, translation, models, and bleu, which reflect the paper's contribution from the abstract text alone.
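
A minimal sketch of the unigram ranking step, assuming the tokens have already been cleaned. The helper name top_tokens and the toy token list are illustrative and are not taken from the actual abstract.

```python
from collections import Counter


def top_tokens(tokens: list[str], n: int = 20) -> list[tuple[str, int]]:
    """Return the n most frequent tokens with their counts, most frequent first."""
    return Counter(tokens).most_common(n)


# Toy token list for illustration only.
tokens = ["attention", "transformer", "attention", "models", "transformer", "attention"]

for token, count in top_tokens(tokens, n=2):
    print(f"{token:<12} {count}")
# attention    3
# transformer  2
```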

Type-Token Ratio (TTR)

TTR measures lexical diversity as unique_tokens / total_tokens, computed here on the cleaned tokens.

| Paper | Raw Words | Clean Tokens | Unique Tokens | TTR |
| --- | --- | --- | --- | --- |
| Attention Is All You Need | 166 | 99 | 79 | 0.798 |
| Agents of Chaos | 177 | 121 | 111 | 0.917 |

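The higher TTR for Agents of Chaos (the red-teaming paper) is consistent with its broader vocabulary spanning AI safety, agent behavior, and evaluation methodology. Because TTR tends to fall as texts get longer, and Agents of Chaos has the longer cleaned abstract (121 vs. 99 tokens), length alone does not explain the gap.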
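
The ratio can be reproduced from the counts in the table. The sketch below is a simple illustration; the function names are assumptions, and the second helper exists only so the reported counts can be checked without the original token lists.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Unique tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def ttr_from_counts(unique_tokens: int, total_tokens: int) -> float:
    """Same ratio, computed from precomputed counts."""
    return unique_tokens / total_tokens if total_tokens else 0.0


# Counts reported in the table above.
print(round(ttr_from_counts(79, 99), 3))    # 0.798
print(round(ttr_from_counts(111, 121), 3))  # 0.917

# Direct computation on a toy token list: 3 unique / 4 total.
print(type_token_ratio(["a", "b", "a", "c"]))  # 0.75
```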

Token Reduction Rate

The cleaning pipeline (lowercase → punctuation removal → stopword filter) reduced raw word counts by about 40% for Attention Is All You Need (166 → 99) and about 32% for Agents of Chaos (177 → 121), isolating content-bearing tokens.
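
Using the raw and clean counts from the TTR table, the reduction rate can be checked directly. This is a small arithmetic sketch; the function name reduction_rate is an assumption.

```python
def reduction_rate(raw_count: int, clean_count: int) -> float:
    """Fraction of raw words removed by cleaning."""
    if raw_count == 0:
        return 0.0
    return (raw_count - clean_count) / raw_count


# Raw and clean counts as reported in the TTR table.
print(round(reduction_rate(166, 99), 3))   # 0.404
print(round(reduction_rate(177, 121), 3))  # 0.316
```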

Co-occurrence and Bigrams (nlp-03)

Context-window analysis identified which tokens appeared together most frequently within the multi-category corpus, producing category-level association patterns beyond simple frequency ranking.
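
A rough sketch of bigram counting and context-window co-occurrence is shown below. The window definition (each token paired with the next `window` tokens, pairs stored unordered) is an assumption, since the exact definition used in nlp-03 is not stated; the function names and toy tokens are illustrative.

```python
from collections import Counter


def bigram_counts(tokens: list[str]) -> Counter:
    """Count adjacent ordered token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def cooccurrence_counts(tokens: list[str], window: int = 2) -> Counter:
    """Count unordered pairs of distinct tokens at most `window` positions apart."""
    counts: Counter = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


tokens = ["dog", "chases", "cat", "dog", "barks"]

print(bigram_counts(tokens)[("dog", "chases")])        # 1
print(cooccurrence_counts(tokens)[("cat", "dog")])     # 2
print(cooccurrence_counts(tokens)[("chases", "dog")])  # 2
```

Bigrams keep word order and adjacency, while the windowed count also picks up pairs separated by an intervening token, which is why the two views can rank associations differently.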

Metadata Signals (nlp-05)

avg_word_length, sentence_count, and author_count were engineered as structured document-level features alongside text content, enabling comparison across papers without reading the full text.
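
These document-level features could be derived along the lines of the sketch below. The regex-based sentence splitting and the comma-separated author string are simplifying assumptions for illustration; the actual nlp-05 extraction may differ.

```python
import re


def sentence_count(text: str) -> int:
    """Naive count: split after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def avg_word_length(text: str) -> float:
    """Mean length of alphanumeric word runs; 0.0 when there are no words."""
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    return sum(len(word) for word in words) / len(words) if words else 0.0


def author_count(authors: str) -> int:
    """Count non-empty names in a comma-separated author string."""
    return len([name for name in authors.split(",") if name.strip()])


# Toy inputs for illustration only.
abstract = "We propose a model. It works well."
print(sentence_count(abstract))                                # 2
print(round(avg_word_length(abstract), 3))                     # 3.714
print(author_count("Ada Lovelace, Alan Turing, Grace Hopper")) # 3
```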


5. Visualizations

Word Cloud — nlp-01: First spaCy Word Cloud

[Figure: Word Cloud, nlp-01]

First word cloud generated from web-sourced text using spaCy en_core_web_sm and the wordcloud library. Established the baseline visualization workflow used in all later modules.


Word Cloud — nlp-02: Text Preprocessing Output

[Figure: Word Cloud, nlp-02]

Word cloud produced after the full preprocessing pipeline (tokenization → lowercasing → punctuation removal → stopword filtering). Visually confirms that cleaning removes grammatical noise and surfaces content-bearing tokens.


Word Cloud — nlp-03: Corpus Exploration

[Figure: Word Cloud, nlp-03]

Word cloud from the multi-category corpus (dog, cat, truck, car). Token size reflects frequency across the full corpus; category-level analysis revealed per-domain signals hidden in the global view.


Token Frequency — Attention Is All You Need (arXiv 1706.03762)

[Figure: Top Tokens, Attention Is All You Need]

Top tokens confirm the paper's focus: transformer, attention, translation, models, and bleu dominate after stopword removal.


Word Cloud — Attention Is All You Need

[Figure: Word Cloud, Attention Is All You Need]

Frequency-weighted word cloud generated from the cleaned abstract using viridis colormap (800×400px).


Token Frequency — Agents of Chaos (arXiv 2602.20021)

[Figure: Top Tokens, Agents of Chaos]

Top tokens reflect the red-teaming and AI safety focus: agents, llm, attack, safety, autonomous.


Word Cloud — Agents of Chaos

[Figure: Word Cloud, Agents of Chaos]

The broader, more varied vocabulary (TTR 0.917) shows up in the word cloud as a wider spread of similarly sized terms than in the Attention Is All You Need cloud.


6. Insights

Cleaning reveals signal, not noise. Reducing Attention Is All You Need's abstract from 166 raw words to 99 clean tokens surfaced transformer and attention as top terms — no labels needed.

TTR distinguishes domain breadth. A TTR of 0.917 for Agents of Chaos vs. 0.798 for Attention Is All You Need is consistent with the red-teaming paper's wider scope. A single ratio summarizes vocabulary diversity across two very different research areas.

Metadata encodes structure. Agents of Chaos has 38 authors vs. 8 for Attention. Author count as a structured feature captures collaboration scale without parsing text.

Validation prevents silent failures. The validate stage in nlp-06 caught edge cases in whitespace encoding and tag nesting that would have silently corrupted extracted fields. Structural checks are not optional in real pipelines.
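
A structural check of this kind can be sketched as below. To keep the example runnable without extra installs, it uses the standard library's html.parser rather than BeautifulSoup, so it is a swapped-in technique and not the nlp-06 validate stage itself. The REQUIRED list reuses the tag and class pairs named earlier; the class and function names are assumptions.

```python
from html.parser import HTMLParser

# Tag/class pairs the page is expected to contain (taken from the selectors listed earlier).
REQUIRED = [
    ("h1", "title"),
    ("div", "authors"),
    ("blockquote", "abstract"),
    ("div", "dateline"),
]


class ElementCollector(HTMLParser):
    """Record every (tag, class) pair seen in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: set[tuple[str, str]] = set()

    def handle_starttag(self, tag, attrs):
        # An element may carry several classes, e.g. class="title mathjax".
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            self.seen.add((tag, cls))


def missing_elements(html: str, required=REQUIRED) -> list[str]:
    """Return selectors such as 'div.authors' that are absent from the HTML."""
    collector = ElementCollector()
    collector.feed(html)
    return [f"{tag}.{cls}" for tag, cls in required if (tag, cls) not in collector.seen]


partial_page = '<h1 class="title mathjax">T</h1><div class="authors">A</div>'
print(missing_elements(partial_page))
# ['blockquote.abstract', 'div.dateline']
```

Failing fast on a non-empty result stops the transform stage from writing empty or misaligned fields into the output.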

Pipelines are reusable. Separating config_case.py from pipeline_web_html.py meant running the same pipeline against two different arXiv papers required only a config file swap — no code changes.

Corpus structure shapes frequency results. In nlp-03, per-category token rankings (dog, cat, truck, car) showed that a global frequency ranking can obscure domain-specific signals, which only become visible once the corpus is segmented.


7. Representative Work — All Modules

nlp-01: Environment Setup & First Word Cloud

Configured the Python NLP environment with spaCy (en_core_web_sm), virtual environment, and Jupyter notebooks. Produced the first word cloud visualization from web-sourced text. Foundational to all subsequent modules.

nlp-02: Text Preprocessing Pipeline

Built a tokenization and normalization pipeline: lowercasing, punctuation removal, whitespace normalization, and spaCy-based stopword filtering applied to local text files. First structured preprocessing workflow.

nlp-03: Corpus Exploration & Bigram Analysis

Applied frequency analysis, context-window co-occurrence, and bigram ranking to a structured multi-category corpus. Demonstrated how corpus segmentation reveals signals that global ranking hides.

nlp-04: EVTL Pipeline — JSON API

First full EVTL pipeline: fetched 100 posts from a public JSON API, validated field structure, transformed into a Pandas DataFrame, and exported to CSV. Established the repeatable pipeline pattern used in later modules.
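
The validate and load steps of such a pipeline might be sketched as follows. To stay self-contained, the example uses the standard library's csv module and an in-memory buffer instead of an HTTP request and a Pandas DataFrame. The field names userId, id, title, and body are the ones JSONPlaceholder returns for /posts; the function names and sample records are assumptions.

```python
import csv
import io

REQUIRED_FIELDS = ("userId", "id", "title", "body")


def validate_posts(posts: list) -> list[dict]:
    """Keep only dict records where every required field is present and not None."""
    return [
        post
        for post in posts
        if isinstance(post, dict)
        and all(post.get(name) is not None for name in REQUIRED_FIELDS)
    ]


def posts_to_csv(posts: list[dict]) -> str:
    """Serialize validated posts to CSV text, ignoring any extra keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
    return buffer.getvalue()


# Hypothetical records standing in for the API response.
raw_posts = [
    {"userId": 1, "id": 1, "title": "first post", "body": "some text"},
    {"userId": 1, "id": 2, "title": None, "body": "missing title"},
    "not a record",
]

valid = validate_posts(raw_posts)
print(len(valid))  # 1
print(posts_to_csv(valid))
```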

nlp-05: HTML Scraping & Metadata Extraction

Extended the EVTL pipeline to HTML: scraped an arXiv abstract page, extracted 13 structured fields (including sentence_count, avg_word_length, author_count, PDF URL, and version count), and exported to CSV. First HTML-based pipeline.

nlp-06: Full EVTAL Pipeline with NLP Feature Engineering

The most complete implementation: a five-stage EVTAL pipeline applied to two arXiv papers. It adds spaCy-based NLP, type-token ratio, token frequency bar charts, and word cloud visualizations, and demonstrates modular, config-separated, multi-source pipeline design.


8. Skills

Python data processing: Built structured pipelines with Pandas DataFrames; used collections.Counter for frequency analysis; engineered derived features (TTR, token_count, author_count) from raw text.

spaCy NLP processing: Applied en_core_web_sm for tokenization, stopword filtering, and linguistic annotation across multiple modules.

Web scraping and HTML extraction: Fetched HTML with requests using custom headers; navigated tag hierarchies with BeautifulSoup; validated structural expectations before extraction.

JSON API integration: Consumed public REST APIs; handled null-safe key traversal and variable field presence; structured output into reproducible CSV format.

Corpus and frequency analysis: Computed unigram frequency, bigrams, co-occurrence windows, and type-token ratio; interpreted results across different domain vocabularies.

Handling messy or inconsistent inputs: Normalized HTML-encoded whitespace, multi-author strings, and variable field presence using regex and BeautifulSoup fallbacks.

Repeatable pipeline design: Separated configuration from stage logic to enable multi-source execution; logged every stage with explicit source → process → sink documentation.

Communicating results with Markdown and visuals: Produced matplotlib bar charts and word cloud PNGs as deliverable artifacts; documented pipelines in docs/index.md, docs/glossary.md, and docs/nlp-evolution.md.


Repository: vnallam09/nlp-06 · Northwest Missouri State University · 2026