Taming LLMs - Chapter 3: Managing Input Data
One home run is much better than two doubles.
Chapter 3: “Managing Input Data” of the book Taming LLMs is now available for review.
Visit the GitHub repo to access the chapter in the following formats:
web,
python notebook, and
pdf.
The pdf format is recommended, as it contains the highest quality copy.
Please share feedback by opening an issue in the book's GitHub repo.
We cover the following in this chapter:
Data parsing
We will explore useful open source tools such as Docling and MarkItDown that help transform data into LLM-compatible formats, demonstrating their impact through a case study on structured information extraction from complex PDFs. In a second case study, we will introduce chunking strategies that help LLMs process long inputs and implement a particular technique, Chunking with Contextual Linking, that enables contextually relevant chunk processing.
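As a taste of the parsing tools, here is a minimal sketch of converting a PDF with both libraries; `report.pdf` is a placeholder path, and the snippet assumes docling and markitdown are installed:

```python
from docling.document_converter import DocumentConverter
from markitdown import MarkItDown

# Docling: convert a PDF and export LLM-friendly Markdown.
converter = DocumentConverter()
docling_result = converter.convert("report.pdf")  # placeholder path
print(docling_result.document.export_to_markdown())

# MarkItDown: same goal, different library.
md = MarkItDown()
markitdown_result = md.convert("report.pdf")  # placeholder path
print(markitdown_result.text_content)
```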
Retrieval augmentation
We will explore how to enhance LLMs with semantic search capabilities, incorporating external context through RAG (Retrieval-Augmented Generation) with vector databases such as ChromaDB.
We also discuss whether RAG will really be needed in the future given the rise of long-context language models.
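To make the retrieval step concrete, here is a minimal RAG sketch with ChromaDB; the collection name, chunk texts, and question are illustrative, and embedding is left to ChromaDB's default model:

```python
import chromadb

# In-memory client; ChromaDB embeds the documents with its default model.
client = chromadb.Client()
collection = client.create_collection(name="chapter_chunks")  # illustrative name

# Index a few chunks (ids and texts are illustrative).
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Docling converts complex PDFs into Markdown that LLMs can consume.",
        "Chunking with Contextual Linking prepends document context to each chunk.",
    ],
)

# Retrieve the chunk most similar to the question,
# then splice it into the LLM prompt as external context.
question = "How can I parse a PDF for an LLM?"
results = collection.query(query_texts=[question], n_results=1)
context = "\n".join(results["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```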
Long-context windows
We will extract insights from a large knowledge base without the need for complex retrieval systems.
We build a quiz generator from open books available on Project Gutenberg.
We will also explore prompt caching and response verification through citations using “Corpus-in-Context” (CIC) prompting.
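The core idea of CIC prompting is to number every passage placed in the context window so the model can cite its sources by ID. A minimal sketch, with illustrative passages standing in for a Project Gutenberg book:

```python
# Illustrative passages; in the chapter these would come from a
# Project Gutenberg book split into sections.
passages = [
    "Call me Ishmael. Some years ago, never mind how long precisely...",
    "Whenever I find myself growing grim about the mouth...",
]

# CIC prompting: prefix each passage with an ID the model can cite.
corpus = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))

prompt = (
    "You are given a corpus of numbered passages:\n"
    f"{corpus}\n\n"
    "Generate one quiz question grounded in the corpus, then cite the "
    "supporting passage IDs in brackets, e.g. [2]."
)
print(prompt)
```

The cited IDs make the response verifiable: each answer can be checked against the exact passages it claims to rely on.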
I hope you enjoy reading it as much as I enjoyed writing it. Please share comments, particularly regarding clarity, relevance, and scope. I still have time to make fixes before the official release on February 2nd.
All reviewers will be acknowledged in the book.
—
Tharsis



