JuliaHub recently added a ChatGPT integration that allows users to ask questions about the Julia language, including documentation, package information, and code examples. You can try our AskAI feature right now by first signing up for JuliaHub for free. To learn more about large language models in Julia, follow this walkthrough with the Transformers.jl package.
You can access the original notebook here:
https://nbviewer.org/gist/aviks/2f525e31c7c2787228cf8e871311d33c
To Start the Walkthrough
Start by adding the following packages:
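The notebook uses Transformers.jl together with CUDA.jl for GPU support; a minimal environment setup might look like this (the exact package list is in the linked notebook):

```julia
using Pkg
Pkg.add(["Transformers", "CUDA"])

using Transformers, CUDA
```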
After loading the packages, we need to set up the GPU. Currently, multi-GPU setups are not supported. If your machine has multiple GPU devices, you can use CUDA.devices() to get the list of all devices and CUDA.device!(device_number) to specify the device you want to run the model on.
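For example, selecting a device might look like this (the device number here is illustrative):

```julia
using CUDA

# List the available GPU devices.
CUDA.devices()

# Select the device the model should run on (0 is the first device).
CUDA.device!(0)
```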
For demonstration, we disable scalar indexing on the GPU so that we can make sure all GPU calls are handled without performance issues. By calling enable_gpu, we get a todevice function provided by Transformers.jl that will move the data/model to the GPU device.
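A minimal sketch of this setup, using the standard CUDA.jl and Transformers.jl calls:

```julia
using CUDA, Transformers

# Disallow scalar indexing so any accidental element-wise GPU access raises an error
# instead of silently running slowly.
CUDA.allowscalar(false)

# Enable GPU support in Transformers.jl; after this, `todevice` moves data and
# models to the selected GPU device.
enable_gpu(true)
```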
In this tutorial, we show how to use dolly-v2-12b in Julia. Dolly is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. It's based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, databricks-dolly-15k, crowdsourced among Databricks employees. Three model sizes are provided: dolly-v2-3b, dolly-v2-7b, and dolly-v2-12b. More information can be found in Databricks' blog post. The process should also work for other causal LM-based models.

With Transformers.jl, we can get the tokenizer and model by using the hgf"" string macro or the HuggingFace.load_tokenizer/HuggingFace.load_model functions. The required files, such as the model weights, are downloaded and managed automatically.
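A sketch of the loading step, assuming the hgf"" macro takes a "<model-name>:<item>" string (check the linked notebook or the Transformers.jl documentation for the exact invocation):

```julia
using Transformers
using Transformers.HuggingFace

# Download (and cache) the tokenizer and the causal LM weights from the Hugging Face
# Hub, then move the model to the GPU with `todevice`.
textenc = hgf"databricks/dolly-v2-12b:tokenizer"
model = todevice(hgf"databricks/dolly-v2-12b:ForCausalLM")
```

Note that dolly-v2-12b is large; the smaller dolly-v2-3b checkpoint is a reasonable substitute if GPU memory is tight.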
The main generation loop is defined as follows:
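The full implementation is in the linked notebook; the sketch below follows the steps described next, using greedy decoding. Details such as the .onehots field, the end-token lookup, and the exact model-call convention are assumptions based on the common Transformers.jl text-generation pattern:

```julia
using Transformers
using Transformers.TextEncoders

function generate_text(textenc, model, prompt; max_length = 512)
    encoded = encode(textenc, prompt).token          # one-hot representation of the context tokens
    ids = encoded.onehots                            # assumed: the underlying vector of token one-hots
    end_id = lookup(textenc.vocab, textenc.endsym)   # id of the end token
    for _ in 1:max_length
        input = todevice((; token = encoded))        # copy the context tokens to the GPU
        outputs = model(input)                       # NamedTuple; `.logit` holds the predictions
        logits = outputs.logit[:, end, 1]            # logits for the last position
        new_id = argmax(logits)                      # greedy decoding: pick the most likely token
        push!(ids, new_id)                           # append the new token to the context
        new_id == end_id && break                    # stop at the end token
    end
    return decode(textenc, encoded)                  # one-hot -> strings, with post-processing
end
```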
The prompt is first preprocessed and encoded with the tokenizer textenc. The encode function returns a NamedTuple whose .token field is the one-hot representation of our context tokens.

At each iteration, we copy the tokens to the GPU and feed them into the model. The model also returns a NamedTuple, whose .logit field holds the model's predictions. We then apply the greedy decoding scheme to pick the next token, which is appended to the end of the context tokens. The iterations stop once we exceed the maximum generation length or the predicted token is an end token.

After the loop, we decode the one-hot encoding back to text tokens. The decode function converts the one-hot vectors to text and also performs some post-processing to produce the final list of strings.
We use the same prompt format as Dolly, defined in instruct_pipeline.py.
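The helper below reproduces that prompt format in Julia (the wording follows the instruct_pipeline.py template in the Databricks dolly repository; the helper name and the example instruction are our own):

```julia
# Build a Dolly-style prompt around a single instruction.
function dolly_prompt(instruction)
    return """
    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    $instruction

    ### Response:
    """
end

# Example usage with the generation sketch above:
# generate_text(textenc, model, dolly_prompt("Explain the difference between a Vector and a Tuple in Julia."))
```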
JuliaHub is a unified platform for modeling, simulation, and user-built applications with the Julia language. Follow along with us to learn more about the Julia language and ecosystem, and try our AskAI for free by signing up today.