Audrey Miller
January 30, 2024

Thoughts on AI – Frameworks

Tapestry's Audrey Miller presents her curated shortlist of the most useful frameworks to better understand the growing world of LLMs, plus some bonus resources to explore the topic further.

The first public lecture to mention computer intelligence was given by Alan Turing in London in 1947, where he discussed the “possibility of letting the machine alter its own instructions,” an idea introduced in his unpublished paper “Intelligent Machinery” [1]. Since then we’ve seen the invention of ELIZA (1964), Watson (2004–2011), and an explosion of new language models beginning in 2018 with OpenAI’s GPT-1. I’m no history buff, but the internet is a wonderful place where one can find a timeline of AI model development and a compiled list of models sorted by lab, parameter size, and date announced (no gatekeeping!).

Though development has rapidly accelerated, limitations inherent to the technology still slow broader adoption. Hazy memory, constrained ability to operate in external environments, and sensitivity to inputs limit use cases to situations where imperfect outputs can be tolerated. In response, many projects have emerged to confront these obstacles to implementation.

In an effort to better understand these emerging tools, I set out to explore frameworks already proposed for mapping the growing world of LLM tooling and deployment. I originally intended to propose my own, perhaps a combination of the inputs I ingested, but found the existing ones quite robust. Rather than reinvent the wheel, I present here my curated shortlist of the best frameworks I have found and my views on them, the areas I’m excited about, plus the most interesting other resources I’ve come across that have informed my thinking.

Years ago, I developed an internal framework for categorizing companies depending on where they sit in the value chain. I called it my “four-layer company cake”. In this framework, every company can be described as sitting in one of four layers: the data layer, the “highway” layer (think APIs), the orchestration layer, or the consumption layer. Later, I realized this was oddly similar to (if an overly simplified version of) the OSI model – a conceptual model for describing levels of abstraction between communicating systems. The idea of using a framework as shorthand for categorizing emerging technologies isn’t novel, and various frameworks have been proposed for the surge of AI companies in the market.

Similar to my four-layer cake, Tomasz Tunguz of Theory Ventures proposes a four-part mental model for exploring the LLM stack: (1) data layer, (2) model layer, (3) deployment layer, (4) interface layer. I like many things about this framework. In particular, I find the delineation of the four categories within the “deployment layer” helpful. To my mind, however, the categorization of his “model layer” ignores circularities inherent in the training, routing, and fine-tuning process. Few applications, categorized in this framework as the “interface layer”, can run linearly with a model integrated directly into a user interface. There is an interdependence among these layers that is lost in an overly linear framework – helpful for simplification, but it can actually make the stack harder to visualize for lay people like me!

Heather Miller of Two Sigma recently published a Guide to Large Language Model Abstractions that digs even deeper into the proposed “third layer” above. The article provides two frameworks for dissecting this layer. The first is the Language Model System Interface Model (LMSI), a seven-layer framework for thinking about LLM abstraction. Presented in increasing order of abstraction, the layers start with neural networks that directly access LLM architecture, then move up through layers focused on prompt input, rules, circular loops, optimizations, and lastly applications. The final layer, “user”, describes the application level at which humans perform tasks.

Miller also proposes a secondary framework that categorizes projects not by theoretical levels of abstraction but by functionality. Divided into five groups, the framework outlines (1) Controlled Generation - defining output constraints, (2) Schema-Driven Generation - user-based type-level output, (3) Compilation - automatically-generated, high-quality prompt chains, (4) Prompt Engineering Tools with Pre-Packaged Modules - tools for generating prompts for more meaningful LLM interactions, and (5) Open-Ended Agent or Multi-Agents - orchestrating LLMs for general purpose problem solving.

Naomi Pilosof of Menlo Ventures defined the building blocks of the modern AI stack across four layers in her State of Generative AI in the Enterprise report last year. Pilosof’s first layer, compute and foundation, groups foundation models with GPU providers and training and deployment tools – spanning the first three layers of Tunguz’s approach, and, like Tunguz, placing training and fine-tuning infrastructure alongside the models themselves. The data layer sits above the model layer in this framework and is split into pre-processing, databases, and pipelines. While I agree that this layer is an interesting space to spend time in, I disagree with the placement, as I find it more intuitive to think of the data layer as sitting below the model layer. Another great read is Menlo’s updated “The Modern AI Stack: Design Principles for the Future of Enterprise AI Architectures”.

Other frameworks I like include Andre Retterath’s at Earlybird VC, though I found his earlier piece on value accrual in AI even more interesting. Felix Becker of Heartcore Capital also outlines the blueprint of a modern AI app, which neatly describes the building blocks of AI applications. This framework looks very similar to the one proposed by Matt Bornstein and Rajko Radovanovic of a16z in their “Emerging Architectures for LLM Applications” piece, which highlights the circularity of the stack: it shows app hosting tools (which in the earlier frameworks would sit between the deployment and interface layers) acting as both inputs and outputs of orchestration tools. However, it is less “full stack” than Tunguz’s framework, focusing only on the “middle layers” of model and deployment.

Lastly, Sonya Huang and Pat Grady at Sequoia have been prolific in their effort to position themselves at the helm of the AI narrative. The frameworks they’ve published are robust and simple. I particularly liked the simplicity of their landscape map, organized by modality and split into model vs. application layer. Their proposed gen AI infrastructure stack includes non-AI observability tools at the top and mixes data labeling in alongside the model layer. They also include synthetic data as its own category which, though small today, is an interesting hint at how they may see that space evolving. Last but not least, I appreciate the simplicity of Index’s proposed three-layer stack (models, infra, applications). At the end of the day, that’s what it boils down to.

Rather than propose my own framework, which would sit somewhere in the middle of all these, I’d rather share a few questions I’ve been thinking through and, for those who make it to the end, my favorite resources for playing in the space that have helped refine my thinking.

  • What is the value of all the “middleware”? While development frameworks like LangChain have made building at the application layer easier, recent criticisms suggest this approach may create more complexity than it removes. Issues like lack of reusability, the intensive effort required for prompt tuning and data sanitization, and data orchestration inefficiencies point to emerging skepticism about what the optimal stack should be. Are companies at the development layer fundamentally limited by the current state of the underlying models? How might advancements in LLMs affect the viability of intermediary frameworks?
  • Customizing LLMs: RAG vs. fine-tuning? RAG (retrieval-augmented generation) and fine-tuning are two distinct methods for customizing large language models. RAG combines the power of LLMs with external knowledge retrieval, which is particularly useful for tasks requiring up-to-date or specific information not contained in the model’s training data. Fine-tuning trains the LLM on a specific dataset to tailor its responses to a particular domain or style, making it more specialized in certain contexts. How will the choice between RAG and fine-tuning impact the accuracy and relevance of LLM outputs in specialized applications? Could combining both methods offer more robust solutions?
  • Is bigger always better? While large, general-purpose models are great for many use cases, they require expensive compute and storage to develop, fine-tune, and use. As businesses continue to adopt AI, they will need to weigh costs against effectiveness on domain-specific tasks. Smaller, more focused models may be more economically viable and perform better on specialized tasks. Are there diminishing returns on parameter counts? Will this lead to more competition and less oligopolistic dynamics in the foundation model landscape? How will this change the world of the “connectors” and the deployment layer? Will integrations into various models be favored, or will tools verticalize around the smaller models that serve specialized use cases?
  • abcdefGPU? GPU shortages continue to delay progress and companies’ ability to ship new products. We need less reliance on traditional GPUs and their monopolistic market dynamics – which means either more efficient models that free up compute resources, or more efficient hardware! Traditionally, model efficiency has come at a trade-off in accuracy (due to quantization), but will we soon see cheaper, faster, more efficient, and equally accurate software or hardware?
  • When all else fails, where’s my insurance? There’s endless talk about guardrails and safety. As the AI regulatory landscape evolves, companies will begin facing the consequences of new legal failings, many perhaps unintentional. Going back to my fintech roots: where there are probabilistic outcomes, there’s insurance. We have cyber insurance already, but will those policies cover AI breaches? Who will be the first to offer AI insurance for harmful hallucinations or damaging outputs? To come full circle, maybe AI risk models will even be used to calculate AI insurance premiums!
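For the curious, the RAG pattern in the second question above can be sketched in a few lines. This is a toy illustration, not any particular library’s API: retrieval here is simple word overlap over a made-up two-document corpus, and the assembled prompt is what would be sent to a model.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus):
    """Return the document with the largest word overlap with the query."""
    q = tokenize(query)
    return max(corpus, key=lambda doc: len(q & tokenize(doc)))

def build_prompt(query, corpus):
    """Retrieve context and prepend it to the question for the model."""
    context = retrieve(query, corpus)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini-corpus standing in for an external knowledge base.
corpus = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is closed on public holidays.",
]

prompt = build_prompt("How many days do I have to request a refund?", corpus)
print(prompt)
```

Real systems swap the word-overlap step for embedding similarity over a vector database, but the shape is the same: retrieve, augment, then generate.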
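The accuracy trade-off from quantization mentioned in the GPU question can also be seen in a toy example: mapping float weights to 8-bit integers and back introduces a small rounding error. This is a simplified symmetric scheme for illustration only, not how any specific framework implements it.

```python
# Toy symmetric int8 quantization: scale floats into [-127, 127],
# round to integers, then map back and measure the rounding error.

def quantize(weights):
    """Return int8-range codes and the scale used to produce them."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.13, -0.58, 0.92, -0.07]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

# The restored weights are close to, but not exactly, the originals:
# storage drops 4x (int8 vs float32) at the cost of this rounding error.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_err)
```

The error is bounded by half the scale step, which is why aggressive quantization (fewer bits, coarser steps) historically traded accuracy for speed and memory.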

Lastly, as promised, my favorite fun resources:

  1. A lesson in prompt injection and tricking LLMs into revealing secrets. Email me if you can get past level 8…:
  2. Comparing results from open source LLMs:
  3. A reminder of how far we’ve come – a 2007 game that used “artificial intelligence” to guess what you were thinking: