Converging Approaches in LLMs
About 6 months ago, I felt very uncertain about the direction practical LLM use would take. It was definitely going to happen fast, but I wasn't quite sure what set of techniques would be needed for applied use cases across most companies. I've been reading the literature and playing around with LLMs a lot, both for work and as a hobby. Over the last few weeks, I finally feel like I see some convergence on what the approaches will be for companies practically using LLMs in their organization. I am happy people are making foundational improvements, but I am personally interested in how we can apply LLMs to create value for customers and stakeholders. And with that, I do think we finally have a good idea of what it will look like.
At a high level, I think we can expect:
- Foundation models are consolidated and not a space for startups. I expect almost no company should try to touch this.
- RAG (retrieval augmented generation) is likely the space for the primary effort.
- Fine-tuning will be commonly used, as it really helps improve RAG outputs.
- Chained prompts will become more commonly used.
- Human-in-the-loop prompts will dramatically outperform fully autonomous agents in most workflows.
Foundation Models will consolidate. I think we will see a few more startups try here, but between AWS offering foundation models (Amazon Bedrock and Amazon's recent big investment in Anthropic), Microsoft's heavy investment (both in OpenAI and their own models), OpenAI's current dominance at the top of the field (GPT-4), Google's existential fight to keep up (Gemini), and Facebook trying to commoditize the rest (with Llama), it is hard to see how startups can compete long term. Foundation models primarily require big compute and big data, and right now I don't see how smaller players keep up.
So I think most people will pick the latest foundation model based on price/performance characteristics. Small open-sourced ones for local-first, cheap inference. Finetuned GPT-3.5-like models for good performance at cost/latency. GPT-4 for good overall performance at higher prices. We will see what a finetuned GPT-4 looks like, but I expect it will make sense for the most important use cases.
RAG (Retrieval Augmented Generation) is honestly not my favorite term, but it effectively just means "providing your LLM context in the prompt." For most real-world use cases of LLMs, you don't want them relying solely on what was trained into a foundation model. I expect RAG to be used in effectively every real-world application.
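To make "providing your LLM context in the prompt" concrete, here is a minimal sketch in Python. The function name `build_rag_prompt`, the prompt wording, and the sample passages are all made up for illustration; the retrieval step that produces the passages and the call to an actual LLM are left out on purpose.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Place retrieved passages ahead of the question in a single prompt."""
    # Number each passage so the model (and a human reviewer) can refer to it.
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical passages; in practice these come from your retrieval step.
passages = [
    "Cancellation policy: subscriptions can be cancelled at any time from the billing page.",
    "Refunds are issued within 10 business days.",
]
prompt = build_rag_prompt("What is your cancellation policy?", passages)
print(prompt)
```

The resulting string is what you would send to whichever model you picked. Everything interesting about RAG quality lives in how the passages are chosen, which is the hard part.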
Now, the tricky part I have found about RAG is the focus on embedding models. In my experience, it is very hard to get embeddings to work reliably with plain foundation models. It requires a lot of fine-tuning of the embedded set if you are working with documents. For example, I have found that taking a set of docs and asking "What is your cancellation policy?" can often produce hallucinated answers, even when there is an exact keyword match in the embedded documents. It does depend on the use case, but dumping documents into a chunker-embedder has not been a reliable way of getting the LLM to reply with accurate data. I expect most companies will find the solutions to these problems to be the following:
Larger context windows. This increases the cost, but expanding the context window allows you to overcome most weaknesses of RAG. It is basically a brute-force approach, but it works! Anthropic's 100k context window would really change the game for most use cases (I have not tried it yet). Even OpenAI's 32k context window makes a big difference. It does feel like context windows today are a bit like RAM through the 2000s. When I got my first computer, the sales rep said I would have more RAM than I would ever need: 16MB. Context windows seem like the kind of thing we would find easily useful up to 1M tokens and beyond, with diminishing returns beyond 10M tokens (outside some specific use cases, such as the law, which may require loading huge volumes of case law). Not that you would use all of this in most applications, just that it would be useful!
Finetuning. Finetuning has been very impressive IMO. In my experience it doesn't do a great job without RAG, but it can help tweak models to make better use of RAG, and it seems to embed a small amount of information into the model itself. Not enough to fully teach it something new it wasn't trained on, but enough to make a meaningful difference. I like what I am seeing with finetuning, and with OpenAI it makes practical sense to use a finetuned model over plain 3.5 in my opinion. Plain 3.5 is just not capable enough for the zero-shot use cases I have come across.
Improved embeddings for RAG. I do think hybrid search, rather than relying solely on embeddings, is important, but improvements to the embedding process for RAG would also make a big difference. I do think we will see more embedding pipelines that have intermediary LLMs summarize the data and make intelligent decisions about how to chunk it before it goes into the embedding model. I have seen a few people do this, and it seems to make sense to me. You pass your document into a large-context-window model so it can "see" the whole thing. You ask it to remove extraneous content, and then you embed the content using a different model. I could certainly see this all occurring in a single step with a new embedding model that replaces what we currently have.
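As a rough illustration of hybrid search, the sketch below blends a plain keyword-overlap score with an embedding similarity score. It is one simple way to do it, not a description of any particular product. The function names, the `alpha` weight, the sample documents, and the similarity numbers are all hypothetical; the vector scores are assumed to have been computed already by whatever embedding model you use.

```python
import re


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    d_terms = set(re.findall(r"\w+", doc.lower()))
    if not q_terms:
        return 0.0
    return len(q_terms & d_terms) / len(q_terms)


def hybrid_rank(query, docs, vector_scores, alpha=0.5):
    """Rank docs by alpha * vector score + (1 - alpha) * keyword score.

    vector_scores are assumed to be similarities in [0, 1], one per doc,
    produced elsewhere by an embedding model.
    """
    scored = [
        (alpha * v + (1 - alpha) * keyword_score(query, d), d)
        for d, v in zip(docs, vector_scores)
    ]
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = [
    "Our cancellation policy allows cancelling any time.",
    "Pricing starts at a monthly fee.",
]
# Made-up similarities where the embedding model slightly prefers the wrong doc.
vector_scores = [0.40, 0.55]

ranked = hybrid_rank("What is your cancellation policy?", docs, vector_scores)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

In this toy case the exact keyword match pulls the cancellation document back to the top even though the embedding score alone would have ranked it second, which is the failure mode described above.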
Prompt Chaining + human-in-the-loop workflows. I think chaining prompts together and including humans in the loop will be critical to most real business use cases. AutoGPT got a lot of buzz early in 2023, but I have played with it enough to find no practical use case. There is probably a more rigorous way to explain the issues, but my impression is that the fundamental issue is that the probability of failure is too high at each step. If it is "only" 20%, then about 90% of AutoGPT runs that need 10 steps will fail, since only 0.8^10, or roughly 11%, get through every step. In my experience, for most tasks even the "successes" are "close but not fully accurate", so it is likely even worse than that.
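The compounding-failure arithmetic is easy to check. This small sketch assumes every step fails independently with the same probability, which is a simplification, and the function name is my own.

```python
def chain_success_probability(step_failure_rate: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step fails independently with the same probability.
    """
    return (1.0 - step_failure_rate) ** steps


for steps in (1, 5, 10, 20):
    p_ok = chain_success_probability(0.20, steps)
    print(f"{steps:>2} steps: {p_ok:.1%} succeed, {1 - p_ok:.1%} fail")
```

At a 20% per-step failure rate, a 10-step run succeeds only about 11% of the time. A human checkpoint between steps effectively resets the chain, which is why I expect human-in-the-loop workflows to hold up so much better.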
I suspect each of these 3 things will play a part in improving LLM query results, and I think they are going to be the most exciting things to watch over the next year.
Applying West Point Leadership Principles to Engineering Management
When it comes to leadership, one might be surprised to find out that the principles guiding an engineering manager at a startup aren't too different from those learned at West Point, the United States Military Academy. The principles I learned as a cadet, and later applied leading soldiers preparing for deployment to Iraq, have proven timeless and are as relevant in the tech startup world as they are on the battlefield.
Every cadet has to learn these principles, and while they are designed for soldiering, I have come to like them as a well-proven set of leadership principles that always apply.
1. Know Yourself and Seek Self-Improvement
Understanding yourself, your management style, and your strengths and weaknesses is vital. Self-awareness and a continuous pursuit of self-improvement set the tone for the entire organization, inspiring others to do the same.
2. Be Technically and Tactically Proficient
Having technical expertise in engineering leadership is essential. Staying up-to-date with technological advancements isn't just a luxury, it's a necessity. Regularly engaging with your craft and being able to discuss the latest developments demonstrates dedication and can earn you respect within your team. Now, in our case, "tactically" proficient basically means being good at the little things of your job. For example, making sure you are good at manager's scheduling.
3. Seek Responsibility and Take Responsibility for Your Actions
The ability to take charge and own up to mistakes is a mark of a strong leader. It's important to give credit to others when things go well and to take responsibility when they don't. This encourages a culture of accountability within the team.
4. Make Sound and Timely Decisions
Effective decision-making in both speed and quality is critical. The OODA (observe, orient, decide, act) loop is an effective framework to guide this process. It promotes analytical thinking and swift problem-solving.
5. Set the Example
Leadership isn't about implementing protocols — it's about setting a positive example that your team will follow. Your actions will inspire others more than any written policy can.
6. Know Your Team and Look Out for Their Well-Being
Truly caring about your team members is essential. This doesn't mean you need to become close friends with everyone, but it does mean knowing enough about them to understand how to manage them effectively.
7. Keep Your Team Informed
Transparency and open communication are cornerstones of effective management. Sharing your vision and updates with your team not only fosters a sense of inclusion but also enables them to make micro-decisions aligned with the company's goals.
8. Develop a Sense of Responsibility in Your Team
Empowering your team and encouraging ownership and accountability can help cultivate a sense of responsibility. This goes a long way in fostering a motivated and self-starting team.
9. Ensure that the Task is Understood, Supervised, and Accomplished
Clarity in task assignment, monitoring progress, and providing feedback are critical for ensuring tasks are effectively completed. Using a single task list that outlines all priorities can greatly simplify this process.
10. Build the Team
Building your team doesn't stop at hiring. It's about nurturing talent, weeding out those not contributing positively, and fostering individual growth to create a team that's more than the sum of its parts.
11. Use Your Team Wisely (Employ Your Team in Accordance with Its Capabilities)
Every team member has unique strengths. Leveraging these strengths appropriately ensures the team operates at its highest potential.
The leadership principles from West Point have stood the test of time, proving valuable both on and off the battlefield. As an engineering manager, you have the opportunity to implement these principles to create a robust, efficient, and highly motivated team.
Development Process Experimentation: A Leadership Necessity
Creating software doesn't come with a one-size-fits-all guide. There are many ways to succeed, as shown by the wide range of methods used in the industry. For leaders in early-stage startups, it's important to know when and how to change up the development process.
Try New Things, Not Just in Your Product
A leader's job isn't just about overseeing projects. Just like a product manager might try out new things to make the user experience better, a leader should be ready to test and tweak the development process. What worked in your last job might not work in a new environment, so be ready to experiment.
A Real-World Example: Shopify's Six-Week Cycles
Shopify's engineering team gives us a great example. In 2019, they started using six-week development cycles. The goal was to make workloads easier to handle, improve how they decided what to work on first, and make it easier to show off high-quality work. It started with one team, but it worked so well they started using it across the whole company.
Just like traffic lights might slow down a driver for a bit but make the overall flow of traffic better, a well-planned development process helps the whole team.
Keys to Experimenting Successfully
If you're a leader in an early-stage startup, where people often work across different areas and there's a lot of uncertainty, here are some tips for experimenting:
- Try out and evaluate different development processes.
- Keep a record of what didn't work, so you can possibly give it another shot in the future. Context is key.
- Stick to your opinions but be ready to change them if necessary. Be open to change.
Issues with Serverless Products
Serverless technologies have ushered in a new era of scalable and cost-effective cloud computing solutions. While they offer numerous advantages, including easy scalability and reduced operational burden, they can have issues that make them impossible to use, especially when it comes to cost management and observability. The following exploration discusses some specific examples that highlight these issues, serving as cautionary tales for early-stage startups.
Example: MongoDB's Problematic Billing Units
MongoDB, a popular NoSQL database service, released a serverless option in 2022. This serverless version bills primarily by Read Processing Units (RPUs), a unit that, in practice, bears little connection to actual database read operations.
In our experience, interpreting the usage of RPUs was challenging, with MongoDB unable to provide any insight into which queries were consuming RPUs. Furthermore, there was no way to preview how RPUs were used.
In tests, a workload that was cost-effective on a $300/month standard instance ballooned to $3,000/month on the serverless option, despite the serverless instance being used only 1% of the time. This cost disparity, with no means to track it or understand its origins, made MongoDB's serverless option practically unusable for our purposes.
In such a case, a more trackable billing unit, such as actual read operations, could have helped better understand and manage the costs.
Example: OpenTok's Complete Lack of Tooling (Experience Composer)
Another case involves OpenTok's Experience Composer, a video composition product under the Vonage umbrella. Despite having a clear billing metric (minutes of usage), the product suffered from insufficient controls to understand how usage was billed.
We found the billing intervals didn't align with our internal controls. In essence, we were confronted with seemingly random charges and lacked the necessary controls to understand the billing from OpenTok's product.
Lessons for Startups
These experiences serve as crucial lessons for startups exploring serverless options:
- Understand Your Usage: Before adopting a serverless product, ensure it provides the necessary controls to understand and manage usage.
- Decipher the Billing Units: Billing units should be measurable and relatable to your product's usage. Avoid 'made-up' units like MongoDB's RPUs that don't directly tie to discernible operations.
In the early stages of startups, engineering teams often need to work across different domains under considerable uncertainties. When resources are limited, and every penny counts, understanding and controlling costs is crucial. As such, the choice of tools, including serverless products, needs careful consideration of their cost-management and observability features.
Early Stage SOC2 Strategies
Startups, particularly in the early stages, often find themselves juggling various priorities as they strive to grow and scale. While the focus primarily leans towards product development, sales, and customer acquisition, there's an increasingly pivotal area that needs attention—security compliance. For those in the B2B space, especially targeting enterprise clients, security becomes a key determinant in sealing larger deals.
The Security Factor in Sales
Surprisingly, small to medium-sized businesses (SMBs) may not be overly concerned about the security aspect of your product or service. However, when it comes to large enterprises, security can be a deal-breaker.
A study by G2 reveals that 86% of buyers require a security assessment prior to purchase [1], an impactful statistic that underlines the gravity of having your security in order. However, only 24% involve a security stakeholder during the research phase, implying that early security assurances can streamline the sales process by minimizing the involvement of security personnel. This often results in a smoother, more efficient sales journey.
The Value of SOC-2
Among various security certifications, SOC-2 stands out as the most common and relevant for SaaS companies. Achieving SOC-2 compliance sends a powerful message to potential enterprise clients about your commitment to security and data protection, a prerequisite to closing substantial deals.
Preparing for SOC-2
Preparing for SOC-2 certification involves considerable work and financial investment, usually within the range of $5,000 to $10,000 per year. However, when balanced against the potential revenue from closing large enterprise deals, this investment becomes justifiable.
Even if your budget does not permit immediate SOC-2 certification, it's advisable to plan ahead. Engage with vendors, understand the requirements, and get the initial paperwork in order. By laying the groundwork, you can expedite the certification process when the time comes, thus keeping any associated scramble to a minimum.
Implementing Best Practices
While preparing for SOC-2 certification, it's crucial to implement robust security practices within your organization. The certification process will scrutinize your security framework, so establishing these measures is not merely a box-ticking exercise but a necessity to ensure data protection and instill confidence in your clients.
Conclusion
Investing in security compliance, particularly SOC-2 certification, is an investment in your startup's growth. This not only fortifies your product's security but also bolsters your sales efforts, particularly when targeting enterprise clients. Start preparing early to reap the benefits as you scale beyond the early stages.
The sales team, assuredly, will thank you for this.
References:
1. https://research.g2.com/hubfs/Buyer-Behavior-Report-2023.pdf