JuliaHub recently added a ChatGPT integration that allows users to ask questions about the Julia language, including documentation, package information, and code examples. You can try our AskAI feature right now by first signing up for JuliaHub for free. To learn more about large language models in Julia, follow this walkthrough with the Transformers.jl package.
You can access the original notebook here:
https://nbviewer.org/gist/aviks/2f525e31c7c2787228cf8e871311d33c
To Start the Walkthrough
Start by adding the following packages:
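The notebook uses Transformers.jl together with CUDA.jl for GPU support; a minimal environment setup might look like this (the exact package list is in the linked notebook):

```julia
using Pkg
Pkg.add(["Transformers", "CUDA"])

using Transformers, CUDA
```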
After loading the packages, we need to set up the GPU. Currently, multi-GPU setups are not supported. If your machine has multiple GPU devices, you can use CUDA.devices() to get the list of all devices and CUDA.device!(device_number) to specify the device you want to run the model on.
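For example, selecting a device might look like this (the device number here is illustrative):

```julia
using CUDA

# List the available GPU devices.
CUDA.devices()

# Select the device the model should run on (0 is the first device).
CUDA.device!(0)
```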
For demonstration, we disable scalar indexing on the GPU so that we can make sure all GPU calls are handled without performance issues. By calling enable_gpu, we get a todevice function provided by Transformers.jl that will move the data/model to the GPU device.
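A minimal sketch of this setup, using the standard CUDA.jl and Transformers.jl calls:

```julia
using CUDA, Transformers

# Disallow scalar indexing so any accidental element-wise GPU access raises an error
# instead of silently running slowly.
CUDA.allowscalar(false)

# Enable GPU support in Transformers.jl; after this, `todevice` moves data and
# models to the selected GPU device.
enable_gpu(true)
```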
In this tutorial, we show how to use dolly-v2-12b in Julia. Dolly is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. It's based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, databricks-dolly-15k, crowdsourced among Databricks employees. Three model sizes are provided: dolly-v2-3b, dolly-v2-7b, and dolly-v2-12b. More information can be found in Databricks' blog post. The process should also work for other causal LM-based models.

With Transformers.jl, we can get the tokenizer and model by using the hgf"" string macro or the HuggingFace.load_tokenizer/HuggingFace.load_model functions. The required files, such as the model weights, are downloaded and managed automatically.
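A sketch of the loading step, assuming the hgf"" macro takes a "<model-name>:<item>" string (check the linked notebook or the Transformers.jl documentation for the exact invocation):

```julia
using Transformers
using Transformers.HuggingFace

# Download (and cache) the tokenizer and the causal LM weights from the Hugging Face
# Hub, then move the model to the GPU with `todevice`.
textenc = hgf"databricks/dolly-v2-12b:tokenizer"
model = todevice(hgf"databricks/dolly-v2-12b:ForCausalLM")
```

Note that dolly-v2-12b is large; the smaller dolly-v2-3b checkpoint is a reasonable substitute if GPU memory is tight.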
The main generation loop is defined as follows:
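The full implementation is in the linked notebook; the sketch below follows the steps described next, using greedy decoding. Details such as the .onehots field, the end-token lookup, and the exact model-call convention are assumptions based on the common Transformers.jl text-generation pattern:

```julia
using Transformers
using Transformers.TextEncoders

function generate_text(textenc, model, prompt; max_length = 512)
    encoded = encode(textenc, prompt).token          # one-hot representation of the context tokens
    ids = encoded.onehots                            # assumed: the underlying vector of token one-hots
    end_id = lookup(textenc.vocab, textenc.endsym)   # id of the end token
    for _ in 1:max_length
        input = todevice((; token = encoded))        # copy the context tokens to the GPU
        outputs = model(input)                       # NamedTuple; `.logit` holds the predictions
        logits = outputs.logit[:, end, 1]            # logits for the last position
        new_id = argmax(logits)                      # greedy decoding: pick the most likely token
        push!(ids, new_id)                           # append the new token to the context
        new_id == end_id && break                    # stop at the end token
    end
    return decode(textenc, encoded)                  # one-hot -> strings, with post-processing
end
```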
The prompt is first preprocessed and encoded with the tokenizer textenc. The encode function returns a NamedTuple whose .token field is the one-hot representation of our context tokens.

At each iteration, we copy the tokens to the GPU and feed them into the model. The model also returns a NamedTuple, whose .logit field holds the model's predictions. We then apply the greedy decoding scheme to pick the next token, which is appended to the end of the context tokens. The iterations stop once we exceed the maximum generation length or the predicted token is an end token.

After the loop, we decode the one-hot encoding back to text tokens. The decode function converts the one-hot vectors to text and also performs some post-processing to produce the final list of strings.
We use the same prompt format as Dolly, defined in instruct_pipeline.py.
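The helper below reproduces that prompt format in Julia (the wording follows the instruct_pipeline.py template in the Databricks dolly repository; the helper name and the example instruction are our own):

```julia
# Build a Dolly-style prompt around a single instruction.
function dolly_prompt(instruction)
    return """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    $instruction

    ### Response:
    """
end

# Example usage with the generation sketch above:
# generate_text(textenc, model, dolly_prompt("Explain the difference between a Vector and a Tuple in Julia."))
```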
JuliaHub is a unified platform for modeling, simulation, and user-built applications with the Julia language. Follow along with us to learn more about the Julia language and ecosystem, and try our AskAI for free by signing up today.