Taming LLMs - Chapter 3: Managing Input Data
One home run is much better than two doubles.
Chapter 3: “Managing Input Data” of the book Taming LLMs is now available for review.
Visit the GitHub repo to access the chapter in the following formats:
web,
python notebook, and
pdf.
The pdf format is recommended, as it contains the highest quality copy.
Please share feedback by opening an issue in the book's GitHub repo.
We cover the following in this chapter:
Data parsing
We will explore useful open source tools such as Docling and MarkItDown that help transform data into LLM-compatible formats, demonstrating their impact through a case study on structured information extraction from complex PDFs. In a second case study, we will introduce chunking strategies that help LLMs process long inputs and implement a particular technique, Chunking with Contextual Linking, that enables contextually relevant chunk processing.
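As a taste of the parsing tools, here is a minimal sketch of converting a PDF with both libraries; `report.pdf` is a placeholder path, and the snippet assumes docling and markitdown are installed:

```python
from docling.document_converter import DocumentConverter
from markitdown import MarkItDown

# Docling: convert a PDF and export LLM-friendly Markdown.
converter = DocumentConverter()
docling_result = converter.convert("report.pdf")  # placeholder path
print(docling_result.document.export_to_markdown())

# MarkItDown: same goal, different library.
md = MarkItDown()
markitdown_result = md.convert("report.pdf")  # placeholder path
print(markitdown_result.text_content)
```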
Retrieval augmentation
We will explore how to enhance LLMs with semantic search capabilities, incorporating external context through RAG (Retrieval-Augmented Generation) with vector databases such as ChromaDB.
We also discuss whether RAG will really be needed in the future given the rise of long-context language models.
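To make the retrieval step concrete, here is a minimal RAG sketch with ChromaDB; the collection name, chunk texts, and question are illustrative, and embedding is left to ChromaDB's default model:

```python
import chromadb

# In-memory client; ChromaDB embeds the documents with its default model.
client = chromadb.Client()
collection = client.create_collection(name="chapter_chunks")  # illustrative name

# Index a few chunks (ids and texts are illustrative).
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Docling converts complex PDFs into Markdown that LLMs can consume.",
        "Chunking with Contextual Linking prepends document context to each chunk.",
    ],
)

# Retrieve the chunk most similar to the question,
# then splice it into the LLM prompt as external context.
question = "How can I parse a PDF for an LLM?"
results = collection.query(query_texts=[question], n_results=1)
context = "\n".join(results["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```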
Long-context windows
We will extract insights from a large knowledge base without the need for complex retrieval systems.
We build a quiz generator from open books available on Project Gutenberg.
We will also explore prompt caching and response verification through citations using “Corpus-in-Context” (CIC) prompting.
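The core idea of CIC prompting is to number every passage placed in the context window so the model can cite its sources by ID. A minimal sketch, with illustrative passages standing in for a Project Gutenberg book:

```python
# Illustrative passages; in the chapter these would come from a
# Project Gutenberg book split into sections.
passages = [
    "Call me Ishmael. Some years ago, never mind how long precisely...",
    "Whenever I find myself growing grim about the mouth...",
]

# CIC prompting: prefix each passage with an ID the model can cite.
corpus = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))

prompt = (
    "You are given a corpus of numbered passages:\n"
    f"{corpus}\n\n"
    "Generate one quiz question grounded in the corpus, then cite the "
    "supporting passage IDs in brackets, e.g. [2]."
)
print(prompt)
```

The cited IDs make the response verifiable: each answer can be checked against the exact passages it claims to rely on.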
I hope you enjoy reading it as much as I enjoyed writing it. Please share comments, particularly regarding clarity, relevance, and scope. I still have time to make fixes before the official release on February 2nd.
All reviewers will be acknowledged in the book.
—
Tharsis



