Imagine writing a piece of software that could understand, assist, and even generate code, similar to how a seasoned developer would.
Well, that’s possible with LangChain. Leveraging components such as VectorStores, the ConversationalRetrievalChain, and LLMs, LangChain takes us to a new level of code understanding and generation.
In this guide, we will reverse engineer Twitter’s recommendation algorithm to understand the code base better and provide insights to craft better content. We’ll use OpenAI’s embedding technology and Activeloop’s Deep Lake to index the code and make it searchable, and an LLM hosted on DeepInfra called Dolly to converse with the code.
When we’re done, we’ll be able to shortcut the difficult work it will take to understand the algorithm by asking an AI to answer our most pressing questions rather than spending weeks sifting through it ourselves. Let’s begin.
A Conceptual Outline for Code Understanding With LangChain
LangChain is a very helpful tool that can analyze code repositories on GitHub. It brings together three important parts: VectorStores, the ConversationalRetrievalChain, and an LLM (Large Language Model) to assist you with understanding code, answering questions about it in context, and even generating new code within GitHub repositories.
The ConversationalRetrievalChain helps find and retrieve useful information from a VectorStore. It uses smart techniques like context-aware filtering and ranking to determine which code snippets and information are most relevant to your specific question or query. What sets it apart is that it considers the conversation’s history and the context in which the question is asked. This means it can provide you with high-quality and relevant results that specifically address your needs. In simpler terms, it’s like having a smart assistant that understands the context of your questions and gives you the best possible answers based on that context.
Now, let’s look into the LangChain workflow and see how it works at a high level:
Index the Code Base
The first step is to clone the target repository you want to analyze. Load all the files within the repository, break them into smaller chunks, and initiate the indexing process. You can skip this step if you already have an indexed dataset.
Embedding and Code Store
To make the code snippets more easily understandable, LangChain employs a code-aware embedding model. This model helps in capturing the essence of the code and stores the embedded snippets in a VectorStore, making them readily accessible for future queries.
In simpler terms, LangChain uses a special technique called code-aware embedding to make code snippets easier to understand. It has a model that can analyze the code and capture its important features. Then, it stores these analyzed code snippets in a VectorStore, which is like a storage place for easy access. This way, the code snippets are organized and ready to be quickly retrieved when you have queries or questions in the future.
Query Understanding
This is where your LLM comes into play. You can use a model like databricks/dolly-v2-12b to process your queries. The model analyzes your queries and understands their meaning by considering the context and extracting important information. By doing this, the model helps LangChain accurately interpret your queries and provide you with precise and relevant results.
Construct the Retriever
Once your question or query is clear, the ConversationalRetrievalChain comes into play. It goes through the VectorStore, where the code snippets are stored, and finds the ones that are most relevant to your query. This search process is very flexible and can be customized to fit your requirements. You can adjust the settings and apply filters that are specific to your needs, ensuring that you get the most accurate and useful results for your query.
Build the Conversational Chain
Once you have set up the retriever, it’s time to build the Conversational Chain. This step involves adjusting the settings of the retriever to suit your needs better and applying any additional filters that might be required. By doing this, you can narrow down the search and ensure you receive the most precise, accurate, and relevant results for your queries. Essentially, it allows you to fine-tune the retrieval process to obtain the information that is most useful to you.
Ask Questions: Now Comes the Exciting Part!
You can ask questions about the codebase using the ConversationalRetrievalChain. It will generate comprehensive and context-aware answers for you. Your LLM, being part of the Conversational Chain, takes into account the retrieved code snippets and the conversation history to provide you with detailed and accurate answers.
Following this workflow, you can effectively use LangChain to gain a deeper understanding of code, get context-aware answers to your questions, and even generate code snippets within GitHub repositories. Now, let’s see it in action, step by step.
Step-by-Step Guide
Let’s dive into the actual implementation.
Acquiring the Keys
To get started, you must register at the respective websites and obtain the API keys for Activeloop, DeepInfra, and OpenAI.
Setting up the Indexer.py File
Create a Python file, e.g., indexer.py, to index the data. Import the necessary modules and set the API keys as environment variables:
import os
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
os.environ['OPENAI_API_KEY'] = 'YOUR KEY HERE'
os.environ['ACTIVELOOP_TOKEN'] = 'YOUR KEY HERE'
embeddings = OpenAIEmbeddings(disallowed_special=())
Embeddings, in plain English, are representations of text that capture the meaning and relatedness of different text strings. They are numerical vectors, or lists of numbers, used to measure the similarity or distance between different text inputs.
Embeddings are commonly used for various tasks such as search, clustering, recommendations, anomaly detection, diversity measurement, and classification. In search, embeddings help rank the relevance of search results to a query. In clustering, embeddings group similar text strings together.
Recommendations leverage embeddings to suggest items with related text strings. Anomaly detection uses embeddings to identify outliers with little relatedness. Diversity measurement involves analyzing the distribution of similarities among text strings. Classification utilizes embeddings to assign text strings to their most similar label.
The distance between two embedding vectors indicates how related or similar the corresponding text strings are. Smaller distances suggest high relatedness, while larger distances indicate low relatedness.
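To see this in practice, here is a small, hypothetical sanity check you could run after the setup above. It uses the embeddings object we just created; the example string is purely illustrative, and the exact vector length depends on the embedding model (the default OpenAI model returns 1,536-dimensional vectors):

# Hypothetical check: an embedding is just a list of numbers
vector = embeddings.embed_query("def rank_tweets(candidates): ...")  # illustrative string
print(type(vector))  # <class 'list'>
print(len(vector))   # e.g. 1536 for OpenAI's text-embedding-ada-002
print(vector[:5])    # the first few floats of the vector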
Cloning and Indexing the Target Repository
Next, we’ll clone the Twitter algorithm repository and then load, split, and index its documents. Clone the twitter/the-algorithm repository from GitHub into the same directory as indexer.py so that it is available at ./the-algorithm.
root_dir = './the-algorithm'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            pass
This code traverses through a directory and its subdirectories (os.walk(root_dir)). For each file it encounters (filenames), it attempts to perform the following steps:
- It creates a TextLoader object, specifying the path of the file it is currently processing (os.path.join(dirpath, file)) and setting the encoding to UTF-8.
- It then calls the load_and_split() method of the TextLoader object, which reads the contents of the file, splits it into smaller documents, and returns the resulting text data.
- The obtained text data is then added to an existing list called docs using the extend() method.
- If any exception occurs during this process, it is caught by the try-except block and simply ignored (`pass`).
This code snippet is recursively walking through a directory, loading and splitting text data from files, and adding the resulting data to a list called docs.
Embedding Code Snippets
Next, we use OpenAI embeddings to embed the code snippets. These embeddings are then stored in a VectorStore, which will allow us to perform an efficient similarity search:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
username = "mikelabs" # replace with your username from app.activeloop.ai
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding_function=embeddings, public=True)  # dataset will be publicly available
db.add_documents(texts)
print("done")
This code imports the CharacterTextSplitter class and initializes an instance of it with a chunk size of 1000 characters and no overlap. It then splits the provided docs into smaller text chunks using the split_documents method and stores them in the texts variable.
Next, it sets the username (the one you used to sign up for Activeloop!) and creates a DeepLake instance called db, with a dataset path pointing to a publicly available dataset hosted on app.activeloop.ai under that username. The embedding_function argument tells DeepLake to use the OpenAI embeddings we defined earlier.
Finally, it adds the texts to the db using the add_documents method, presumably for storage or further processing purposes.
Run the file, then wait a few minutes (it may appear to hang for a bit… usually no more than 5 minutes). Then, on to the next step.
Utilizing dolly-v2-12b to Process and Understand User Queries
Now we set up another Python file, question.py, to use dolly-v2-12b, a language model hosted on the DeepInfra platform, to process and understand user queries.
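If you’re following along, question.py needs its own imports and environment setup. Here’s a minimal sketch of the top of the file, assuming the LangChain DeepInfra integration reads its API key from the DEEPINFRA_API_TOKEN environment variable:

import os

from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import DeepInfra
from langchain.vectorstores import DeepLake

# Same keys as in indexer.py, plus the DeepInfra token for the Dolly model
os.environ['OPENAI_API_KEY'] = 'YOUR KEY HERE'
os.environ['ACTIVELOOP_TOKEN'] = 'YOUR KEY HERE'
os.environ['DEEPINFRA_API_TOKEN'] = 'YOUR KEY HERE'

# Reuse the same embedding function so queries land in the same vector space
embeddings = OpenAIEmbeddings(disallowed_special=())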
Constructing the Retriever
We construct a retriever using the VectorStore we created earlier.
db = DeepLake(dataset_path="hub://mikelabs/twitter-algorithm", read_only=True, embedding_function=embeddings) #use your username
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['fetch_k'] = 100
retriever.search_kwargs['maximal_marginal_relevance'] = True
retriever.search_kwargs['k'] = 10
Here’s a breakdown of what the code is doing:
The code initializes a DeepLake object called db. It reads the dataset from the path specified as “hub://mikelabs/twitter-algorithm.” It’s worth noting that you need to replace “mikelabs” with your own username!
The db object is then transformed into a retriever using the as_retriever() method. This step allows us to perform search operations on the data stored in the VectorStore.
Several search options are customized by modifying the retriever.search_kwargs dictionary:

The distance_metric is set to 'cos', indicating that cosine similarity will be used to measure the similarity between text inputs. Imagine you have two vectors representing different pieces of text, such as sentences or documents. Cosine similarity is a way to measure how similar or related these two pieces of text are.
We look at the angle between the two vectors to calculate cosine similarity. If the vectors are pointing in the same direction or are very close to each other, the cosine similarity will be close to 1. This means that the text pieces are very similar to each other.
On the other hand, if the vectors are pointing in opposite directions or are far apart, the cosine similarity will be close to -1. This indicates that the text pieces are very different or dissimilar.
A cosine similarity of 0 means that the vectors are perpendicular or at a 90-degree angle to each other. In this case, there is no similarity between the text pieces.
In the code above, cosine similarity is used as a measure to compare the similarity between text inputs. It helps determine how closely related two text pieces are. Using cosine similarity, the code can rank and retrieve the top matches most similar to a given query.
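As a toy illustration, here is a quick, hypothetical calculation with made-up two-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between two vectors
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim([1.0, 0.0], [0.9, 0.1]))   # ~0.99: nearly the same direction, very similar
print(cos_sim([1.0, 0.0], [0.0, 1.0]))   # 0.0: perpendicular, no similarity
print(cos_sim([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite directions, dissimilar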
The fetch_k parameter is set to 100, meaning that the retriever will first fetch the 100 closest matches (by cosine similarity) as candidates.

The maximal_marginal_relevance flag is set to True, suggesting that the retriever will prioritize diverse results rather than returning highly similar matches.

The k parameter is set to 10, indicating that the retriever will return the ten best results for each query.
Building the Conversational Chain
We use the ConversationalRetrievalChain to link the retriever and the language model. This enables our system to process user queries and generate context-aware responses:
model = DeepInfra(model_id="databricks/dolly-v2-12b")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
The ConversationalRetrievalChain acts as a connection between the retriever and the language model. This connection allows the system to handle user queries and generate responses that are aware of the context.
Asking Questions
We can now ask questions about the Twitter algorithm codebase. The answers provided by the ConversationalRetrievalChain are context-aware and directly based on the codebase.
questions = ["What does favCountParams do?", ...]
chat_history = [] for question in questions: result = qa({"question": question, "chat_history": chat_history}) chat_history.append((question, result['answer'])) print(f"-> **Question**: {question} n")
print(f"**Answer**: {result['answer']} n")
Here are some example questions taken from the LangChain docs:
questions = [ "What does favCountParams do?", "is it Likes + Bookmarks, or not clear from the code?", "What are the major negative modifiers that lower your linear ranking parameters?", "How do you get assigned to SimClusters?", "What is needed to migrate from one SimClusters to another SimClusters?", "How much do I get boosted within my cluster?", "How does Heavy ranker work. what are it’s main inputs?", "How can one influence Heavy ranker?", "why threads and long tweets do so well on the platform?", "Are thread and long tweet creators building a following that reacts to only threads?", "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?", "Content meta data and how it impacts virality (e.g. ALT in images).", "What are some unexpected fingerprints for spam factors?", "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
And here’s a sample answer that I got back:
**Question**: What does favCountParams do?

**Answer**: FavCountParams helps count your favorite videos in a way that is friendlier to the video hosting service (i.e., TikTok). For example, it skips counting duplicates and doesn't show recommendations that may not be relevant to you.
Conclusion
Throughout this guide, we explored reverse engineering Twitter’s recommendation algorithm using LangChain. By leveraging AI capabilities, we save valuable time and effort, replacing manual code examination with automated query responses.
LangChain is a powerful tool that revolutionizes code understanding and generation. Using components like VectorStores, the ConversationalRetrievalChain, and an LLM hosted on a service like DeepInfra, LangChain empowers developers to efficiently analyze code repositories, provide context-aware answers, and generate new code.
LangChain’s workflow involves indexing the code base, embedding code snippets, processing user queries with language models, and utilizing the ConversationalRetrievalChain to retrieve relevant code snippets. By customizing the retriever and building the Conversational Chain, developers can fine-tune the retrieval process for precise results.
Following the step-by-step guide, you can leverage LangChain to enhance your code comprehension, obtain context-aware answers, and even generate code snippets within GitHub repositories. LangChain opens up new possibilities for productivity and understanding. What will you build with it? Thanks for reading!