The Python code described in this series of blog posts is available on my GitHub account.
Large Language Models (LLMs, an AI technology based on artificial neural networks) are well known for their ability to generate text. They can answer questions, summarize or translate texts, and even write whole books.
In this blog post I will present an LLM-based solution (in the form of an open-source [1] Python library) which can be used to quickly test a hypothesis in the social sciences. The solution exploits the fact that LLMs are statistical models which can be (ab)used to calculate the likelihood of sentences in natural language. The idea is easiest to explain with an example:
The sentence "She came home and painted her nails" is much more likely to occur in a text (e.g. a book) than the sentence "He came home and painted his nails" [2]. LLMs are able to calculate such likelihoods. Of course there are several limitations; more about this later. But my experiments show that the calculated likelihoods describe well-known social relationships surprisingly well. It might therefore be possible to use LLMs for fast theory construction. In the example above, because the likelihood of the first sentence is much higher than that of the second, we can conclude that men only rarely paint their nails.
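To make this concrete, here is a minimal sketch of how such a sentence likelihood can be computed. It uses the widely known Hugging Face transformers library purely for illustration (the library presented in this series builds on a different framework, described below), and the model name is just an example:

```python
# A minimal sketch (not the SAILSS API): score a sentence by the sum of its
# token log-probabilities under a causal language model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # example; any causal LM from the hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sentence_logprob(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits              # shape: (1, seq_len, vocab_size)
    # log-probability of each token, given all tokens before it
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    return logprobs[torch.arange(targets.numel()), targets].sum().item()

print(sentence_logprob("She came home and painted her nails."))
print(sentence_logprob("He came home and painted his nails."))  # typically lower
```

Summing the token log-probabilities is one simple way to score a whole sentence; the technical background follows below.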
LLMs are trained on huge amounts of internet data (ranging from Wikipedia and classic literature to Facebook posts and even worse). During this training process, in order to be able to generate novel text, the LLM constructs a language-based world model (the extent to which this actually happens is currently being debated). This world model is a compressed description of the world as we describe it in all the texts we have ever published. It contains an approximate statistical model of our language (which, of course, reflects our behavior and culture).
To understand the idea from a more technical perspective, we need to know a bit more about LLMs:
Most of them are currently autoregressive language models: given an input text, they generate the next word. To be more precise, they usually generate a sub-word string called a "token" (i.e. some words consist of more than one token). This token can then be appended to the text, another token can be generated, and so on.
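As a sketch, this loop looks roughly as follows (again using transformers for illustration, with greedy token selection for simplicity):

```python
# Sketch of the autoregressive loop: predict a token, append it, repeat
# (greedy selection for simplicity; model name is just an example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("The teenager ate his", return_tensors="pt").input_ids
for _ in range(10):                             # generate 10 tokens
    with torch.no_grad():
        logits = model(ids).logits              # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()            # greedy: the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```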
Now what does it mean when an LLM "generates" a token? The output of an LLM is a list of probabilities: for every element in its token dictionary (which can easily contain 150,000 different tokens), the probability is calculated that this element/token comes next in the text.
So, for instance, if I input the text ("prompt") "The teenager ate his", the LLM will suggest various tokens describing food ("soup") with high probability (e.g. 0.23), while the token "chair" will be assigned an extremely low probability (e.g. 0.0000012, because even teenagers only very rarely eat chairs!).
Typically, one of the tokens with the highest probability is chosen, and this becomes the generated token.
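Here is a minimal sketch of how one could peek at this distribution (the probabilities above are made-up numbers, and we assume, purely for simplicity, that each candidate word maps to a single token):

```python
# Sketch: inspect the next-token distribution for a prompt. We assume here
# that each candidate word is a single token (not guaranteed in general).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("The teenager ate his", return_tensors="pt").input_ids
with torch.no_grad():
    probs = F.softmax(model(ids).logits[0, -1], dim=-1)  # one probability per vocab entry

for word in [" soup", " chair"]:  # leading space: tokenizers treat " soup" and "soup" differently
    tok = tokenizer(word, add_special_tokens=False).input_ids[0]
    print(f"P({word!r}) = {probs[tok].item():.8f}")
```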
But LLMs do not only calculate these probabilities for the next token to be generated; they also calculate them for every token of the input text (the prompt). As this information is not needed in normal generative applications, it is usually discarded internally. For the same reason, it is unfortunately not exposed by the APIs of the most powerful commercial LLMs (like ChatGPT). It is therefore necessary to run the LLM locally, and the selection of models is limited to publicly available open-source LLMs (like Llama 3 8B).
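A sketch of how these prompt-token probabilities can be extracted from a locally running model (illustrative transformers setup as before; this is exactly the kind of information the library presented below makes accessible):

```python
# Sketch: the same forward pass also yields a probability for every token of
# the prompt itself, i.e. the information discussed above (illustrative setup).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("He came home and painted his nails.", return_tensors="pt").input_ids
with torch.no_grad():
    logprobs = F.log_softmax(model(ids).logits[0, :-1], dim=-1)

for pos, tok in enumerate(ids[0, 1:].tolist()):
    p = logprobs[pos, tok].exp().item()     # probability of this prompt token
    print(f"{tokenizer.decode([tok])!r}: {p:.6f}")
```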
I therefore modified a proven LLM library (which runs fast on a decent MacBook with Apple silicon, i.e. M1-M4 processors, and can use many open-source LLMs from the huge Hugging Face hub) to output these probabilities too. The result is a simple library named SAILSS (Scouting AI Library for Social Sciences), which could, I believe, be very helpful in the process of theory construction in the social sciences. It could allow hypotheses to be tested very quickly, which would enable scientists to explore a large number of cases in a very short time (without the need for surveys etc.).
SAILSS currently consists of the following components:
- A modified LLM framework (I chose the MLX framework from Apple: GitHub ml-explore/mlx-examples). It runs with high performance on standard Apple Silicon computers (like a MacBook Pro/Air) with a sufficient amount of RAM (from around 16 GB, depending on the LLM chosen).
- Some extensions of the framework which provide the special functionality needed to run meaningful experiments and to generate output formats well suited for the intended application.
- Some example scripts which demonstrate the use of the library (basic and more advanced cases), including data visualization etc.
The Python code for all three components is available on my GitHub account.
Please note that the current implementation is rather minimalistic, and using it currently requires solid familiarity with computer science (Python etc.). Let me know in any case if you are interested in this project (even if you don’t feel qualified to use it in its present state): if several people are interested in using it, I will be motivated to improve and extend it.
I will describe the library and its applications in a series of blog posts (to be published in the next 2-3 weeks):
- Overview (this post!)
- A bit of theory and some examples of problems which can be explored using SAILSS; challenges and solutions.
- How to use the library (basic python knowledge required) and advanced examples
- Implementation details (for experts who want to modify the library)
The next blog post is already available!
[1] MIT License
[2] In this example, the Meta Llama 3 8B LLM might suggest the word "masterpiece" after the token "his"!
Image: DALL-E
Follow me on X to get informed about new content on this blog.