In a previous post about my reading history, I noted that one interesting aspect of my reading journey is the genres I explore. These are not the conventional genres defined by Goodreads, but rather the bookshelves I have created for my books, resulting in some unique labels. While this manual classification system has served me well, it relies entirely on my subjective categorization at the moment of reading.
Inspired by this post by Joel Lehman, I thought about exploring a different way to organize my readings based on “the vibe” or semantic feeling of the text.
To carry out this small project, I exported my most recent Goodreads dataset of books. However, instead of importing the CSV into an SQLite database for aggregation as I did previously, I processed the data using Python and vector embeddings. This method converts the title, author, my custom shelves, and my personal reviews into vectors, so I can query my library using abstract concepts rather than traditional genre tags.
Approach
To execute this analysis, I installed the sentence-transformers library, a framework that facilitates the generation of dense vector representations for textual data.
The initial phase involved importing the Goodreads export file using the pandas library. An important step in this process was cleaning the bookshelves column: the raw data contains administrative tags generated by the platform, such as “to-read” and “currently-reading”, which don’t reflect the thematic content or “vibe” of a book. So, I implemented a filtering function to strip these labels, ensuring that only my custom genre tags (e.g. “history-bio”, “life-hack”) remained in the dataset.
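The cleaning step can be sketched like this. The column name `Bookshelves` matches the Goodreads CSV export; the set of administrative tags and the sample rows are illustrative assumptions, not my exact code.

```python
import pandas as pd

# Platform-generated shelf tags that say nothing about a book's "vibe".
# (Assumed set; the Goodreads export uses comma-separated shelf names.)
ADMIN_TAGS = {"to-read", "currently-reading", "read"}

def clean_shelves(raw: str) -> str:
    """Strip administrative tags, keeping only custom genre shelves."""
    if pd.isna(raw):
        return ""
    tags = [t.strip() for t in raw.split(",")]
    return ", ".join(t for t in tags if t and t not in ADMIN_TAGS)

# Toy rows mimicking the export format.
df = pd.DataFrame({
    "Title": ["Zero to One", "Dune"],
    "Bookshelves": ["to-read, business", "currently-reading, sci-fi"],
})
df["clean_shelves"] = df["Bookshelves"].apply(clean_shelves)
print(df["clean_shelves"].tolist())  # ['business', 'sci-fi']
```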
Then, I built a composite “rich text” field for each entry by concatenating the title, author, cleaned tags, and my personal rating. This consolidated string served as the input for the all-MiniLM-L6-v2 transformer model. This neural network converted the textual data into high-dimensional numerical vectors, or “embeddings.” To retrieve results, the system calculates the cosine similarity between the vector of a search query and the vectors of the books in the library, ranking them by geometric proximity.
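The retrieval mechanics can be sketched with plain NumPy. The three-dimensional vectors below are toy stand-ins for the 384-dimensional embeddings that all-MiniLM-L6-v2 actually produces (via `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` in the real pipeline); the titles and the pretend query vector are illustrative only.

```python
import numpy as np

# Toy stand-ins for sentence embeddings of each book's "rich text" field.
book_titles = ["Zero to One", "Dune", "Why Buddhism Is True"]
book_vecs = np.array([
    [0.9, 0.1, 0.0],   # leans toward a "startup" direction
    [0.1, 0.9, 0.1],   # leans toward a "sci-fi" direction
    [0.0, 0.2, 0.9],   # leans toward a "philosophy" direction
])
query_vec = np.array([0.8, 0.2, 0.1])  # pretend embedding of "books about startups"

def cosine_rank(query, matrix):
    """Rank rows of `matrix` by cosine similarity to `query`, best first."""
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims), sims

order, sims = cosine_rank(query_vec, book_vecs)
ranked = [book_titles[i] for i in order]
print(ranked)  # ['Zero to One', 'Dune', 'Why Buddhism Is True']
```

Because cosine similarity only measures the angle between vectors, any text whose embedding happens to point in a similar direction ranks highly, which foreshadows the lexical confusions discussed below.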
Thematic Alignment
The initial results suggest that the model successfully captures the thematic essence of my reading list when the query is conceptually distinct. For example, when searching for “philosophical books that make me question reality,” the algorithm retrieved titles that align closely with my own perception of the genre.
- Thoughts of a Philosophical Fighter Pilot
- Why Buddhism Is True
- The Mind’s I: Fantasies and Reflections on Self and Soul
These results indicate that the semantic search is capable of synthesizing the custom tags I created, such as “philosophy” and “mind,” with the semantic content of the titles. The presence of The Mind’s I is particularly validating, as it is a text explicitly concerned with the nature of consciousness, fitting the query’s intent precisely.
Similarly, a query for “books about startups” yielded a highly accurate list, initiated by Zero to One and The $100 Startup. These works are tagged in my “business” shelf, and the model correctly identified their specific focus on new ventures rather than general economics.
The Limits of Semantic Interpretation
However, the analysis also reveals the limitations of this computational approach. While the model excels at identifying explicit topics, it struggles with idiomatic or cultural concepts. This discrepancy becomes evident when querying for “beach reads.”
In common language, a “beach read” implies a light, engaging fiction or non-fiction book suitable for a vacation. The model, in contrast, appears to interpret the query literally, searching for semantic connections to the physical characteristics of a beach or a shore. Consequently, the results included:
- Kafka on the Shore
- The Liberator: One World War II Soldier’s 500-Day Odyssey from the Beaches of Sicily to the Gates of Dachau
- Dune
While Kafka on the Shore might arguably fit a vacation mood, The Liberator is a historical account of World War II. Its inclusion appears to be driven solely by the phrase “Beaches of Sicily” in the subtitle. Similarly, Dune likely appears due to its setting on a desert planet, despite being a dense work of science fiction that, by pure chance, I also highlighted in my previous post for its complexity.
Lexical Overlap vs. Thematic Vibe
This pattern of lexical confusion persisted in other queries. When searching for a “fast-paced thriller with high stakes,” the model returned Thinking, Fast and Slow by Daniel Kahneman.
Although this book deals with the mechanics of thought, it is a dense non-fiction work on psychology and behavioral economics, not a thriller. The algorithm likely prioritized the word “Fast” in the title over the contextual meaning of “fast-paced” as a genre descriptor.
A similar misalignment occurred with the query “cozy aesthetic with happy ending.” The results included Marketing Aesthetics. Once again, the model latched onto the word “Aesthetics” in the title, disregarding the fact that a book on strategic brand management offers little in the way of a “cozy” narrative experience.
Final Thoughts
This experiment demonstrates interesting outcomes of applying NLP to my reading history and interests. The method succeeded in clustering books with clear subject matter, such as philosophy and business, reinforcing the utility of my original custom shelves. However, it faltered when presented with abstract vibes or idioms.
In any case, I will keep improving this tool to analyze my library and to generate recommendations from my “to-read” shelf in 2026, as my personal project for the coming year will be to clean up, as much as I can, that “backlog” of 369 books.