Find Your Perfect Local AI Model

Compare system requirements, performance benchmarks, and get personalized recommendations based on your hardware.
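The fit question every card below answers is the same: does the model's RAM figure fit in your system memory, and does its VRAM figure fit on your GPU? A minimal sketch of that check in Python, using a few entries from this page's table (the hardware numbers in the example are placeholders for your own):

```python
# Figures (disk GB, RAM GB, VRAM GB) copied from the catalog below.
MODELS = {
    "llama3.1:8b": (4.7, 8, 6),
    "qwen2.5-coder:7b": (4.4, 8, 6),
    "llama3.3": (43, 48, 44),
}

def fits(model, ram_gb, vram_gb):
    """Return (can_run, can_gpu_offload): system RAM must cover the model's
    RAM figure; full GPU acceleration needs VRAM to cover its VRAM figure."""
    _, need_ram, need_vram = MODELS[model]
    return ram_gb >= need_ram, vram_gb >= need_vram

# Example machine: 32GB RAM, 12GB VRAM.
for name in MODELS:
    ram_ok, gpu_ok = fits(name, ram_gb=32, vram_gb=12)
    print(name, "fits" if ram_ok else "too big",
          "(GPU offload)" if gpu_ok else "(CPU only)")
```

Models that exceed VRAM but fit in RAM will still run, just at the much slower CPU rates shown in each card's performance row.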

Showing 57 of 57 models

All-MiniLM

Sentence Transformers · 23M

Compact and fast embedding model. Good balance of quality and speed for semantic search.

Category: embedding · Tags: embedding, fast, lightweight
Size: 0.05GB · RAM: 0.5GB · VRAM: 0.25GB · Context: 0.25K
Performance (tokens/sec): CPU 800 · Mid GPU 3000 · High GPU 8000
ollama pull all-minilm
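Embedding models like this one map text to fixed-length vectors; semantic search then ranks documents by cosine similarity to the query vector. A minimal sketch with made-up 4-dimensional vectors (a real model such as All-MiniLM produces 384 dimensions, e.g. via Ollama's embeddings API):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings; in practice these come from the model.
query = [0.9, 0.1, 0.0, 0.2]
docs = {
    "cats": [0.8, 0.2, 0.1, 0.3],
    "stocks": [0.0, 0.9, 0.4, 0.1],
}
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # the document whose vector points the same way as the query
```

The same ranking loop works unchanged for any embedding model on this page; only the vector dimension changes.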

BakLLaVA

SkunkworksAI · 7B

Mistral-based vision model. Fast multimodal inference with good image understanding.

Category: vision · Tags: vision, multimodal, fast
Size: 4.5GB · RAM: 8GB · VRAM: 6GB · Context: 4K
Performance (tokens/sec): CPU 15 · Mid GPU 58 · High GPU 110
ollama pull bakllava

BGE-M3

BAAI · 568M

Multilingual embedding model supporting 100+ languages. Great for cross-lingual retrieval.

Category: embedding · Tags: embedding, multilingual, retrieval
Size: 1.1GB · RAM: 2GB · VRAM: 1.5GB · Context: 8K
Performance (tokens/sec): CPU 250 · Mid GPU 1200 · High GPU 3000
ollama pull bge-m3

Code Llama

Meta · 34B

Meta's specialized coding model based on Llama 2. Excellent for code completion and generation.

Category: coding · Tags: coding, programming, infill
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 16K
Performance (tokens/sec): CPU 5 · Mid GPU 28 · High GPU 65
ollama pull codellama

CodeGemma

Google · 7B

Google's coding variant of Gemma. Optimized for code completion and generation.

Category: coding · Tags: coding, google, efficient
Size: 4.3GB · RAM: 8GB · VRAM: 6GB · Context: 8K
Performance (tokens/sec): CPU 18 · Mid GPU 72 · High GPU 135
ollama pull codegemma

Command R

Cohere · 35B

Cohere's model optimized for RAG and tool use. Excellent at following complex instructions.

Category: general · Tags: rag, tool-use, enterprise
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 128K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 60
ollama pull command-r

Command R+

Cohere · 104B

Cohere's largest model. Optimized for RAG, tool use, and complex multi-step tasks.

Category: general · Tags: rag, tool-use, enterprise
Size: 60GB · RAM: 68GB · VRAM: 64GB · Context: 128K
Performance (tokens/sec): CPU 2 · Mid GPU 10 · High GPU 30
ollama pull command-r-plus

DeepSeek Coder V2

DeepSeek · 16B

Advanced coding model with MoE architecture. Excellent for complex programming tasks.

Category: coding · Tags: coding, programming, moe
Size: 9.4GB · RAM: 14GB · VRAM: 12GB · Context: 128K
Performance (tokens/sec): CPU 10 · Mid GPU 45 · High GPU 95
ollama pull deepseek-coder-v2

DeepSeek R1

DeepSeek · 70B

State-of-the-art reasoning model with chain-of-thought capabilities. Excels at math, coding, and complex reasoning.

Category: reasoning · Tags: reasoning, math, coding
Size: 43GB · RAM: 48GB · VRAM: 44GB · Context: 64K
Performance (tokens/sec): CPU 2.5 · Mid GPU 14 · High GPU 42
ollama pull deepseek-r1

DeepSeek R1 14B

DeepSeek · 14B

Mid-sized distilled R1. Excellent reasoning capabilities for its size.

Category: reasoning · Tags: reasoning, math, distilled
Size: 8.5GB · RAM: 12GB · VRAM: 10GB · Context: 64K
Performance (tokens/sec): CPU 12 · Mid GPU 48 · High GPU 95
ollama pull deepseek-r1:14b

DeepSeek R1 7B

DeepSeek · 7B

Compact distilled version of R1. Strong reasoning in an efficient package.

Category: reasoning · Tags: reasoning, math, distilled
Size: 4.7GB · RAM: 8GB · VRAM: 6GB · Context: 64K
Performance (tokens/sec): CPU 18 · Mid GPU 72 · High GPU 135
ollama pull deepseek-r1:7b

DeepSeek R1 Distill (Qwen)

DeepSeek · 32B

Distilled version of R1 based on Qwen. Great reasoning in a smaller package.

Category: reasoning · Tags: reasoning, distilled, efficient
Size: 19GB · RAM: 24GB · VRAM: 20GB · Context: 32K
Performance (tokens/sec): CPU 6 · Mid GPU 28 · High GPU 65
ollama pull deepseek-r1-distill-qwen

Dolphin Llama 3

Cognitive Computations · 8B

Uncensored Llama 3 fine-tune. General purpose assistant without content restrictions.

Category: general · Tags: uncensored, helpful, chat
Size: 4.7GB · RAM: 8GB · VRAM: 6GB · Context: 8K
Performance (tokens/sec): CPU 18 · Mid GPU 72 · High GPU 135
ollama pull dolphin-llama3

Dolphin Mixtral

Cognitive Computations · 8x7B (47B)

Uncensored Mixtral fine-tune. Helpful, harmless, and honest without refusals.

Category: general · Tags: uncensored, moe, helpful
Size: 26GB · RAM: 32GB · VRAM: 28GB · Context: 32K
Performance (tokens/sec): CPU 5 · Mid GPU 24 · High GPU 55
ollama pull dolphin-mixtral

EverythingLM

Community · 13B

Extended context Llama model. Capable of handling very long documents.

Category: general · Tags: long-context, chat
Size: 7.9GB · RAM: 12GB · VRAM: 10GB · Context: 16K
Performance (tokens/sec): CPU 12 · Mid GPU 48 · High GPU 95
ollama pull everythinglm

Gemma 2

Google · 27B

Google's open model built from Gemini research. Available in 2B, 9B, and 27B sizes.

Category: general · Tags: google, efficient, instruct
Size: 16GB · RAM: 20GB · VRAM: 18GB · Context: 8K
Performance (tokens/sec): CPU 7 · Mid GPU 32 · High GPU 75
ollama pull gemma2

Gemma 2 2B

Google · 2B

Google's smallest Gemma. Fast and efficient for basic tasks.

Category: small · Tags: tiny, fast, google
Size: 1.6GB · RAM: 4GB · VRAM: 2.5GB · Context: 8K
Performance (tokens/sec): CPU 40 · Mid GPU 140 · High GPU 260
ollama pull gemma2:2b

Gemma 2 9B

Google · 9B

Google's efficient 9B model. Great performance in a compact size.

Category: general · Tags: google, efficient, instruct
Size: 5.4GB · RAM: 10GB · VRAM: 8GB · Context: 8K
Performance (tokens/sec): CPU 16 · Mid GPU 62 · High GPU 120
ollama pull gemma2:9b

Granite Code

IBM · 34B

IBM's enterprise-focused coding model. Trained on 116 programming languages.

Category: coding · Tags: coding, enterprise, ibm
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 8K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 58
ollama pull granite-code

Llama 3.1 70B

Meta · 70B

Meta's powerful 70B model. Excellent reasoning and instruction following with 128K context.

Category: general · Tags: chat, instruct, multilingual
Size: 40GB · RAM: 48GB · VRAM: 44GB · Context: 128K
Performance (tokens/sec): CPU 3 · Mid GPU 15 · High GPU 45
ollama pull llama3.1:70b

Llama 3.1 8B

Meta · 8B

Meta's versatile 8B model with 128K context. Great balance of capability and efficiency.

Category: general · Tags: chat, instruct, multilingual
Size: 4.7GB · RAM: 8GB · VRAM: 6GB · Context: 128K
Performance (tokens/sec): CPU 18 · Mid GPU 75 · High GPU 140
ollama pull llama3.1:8b
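A 128K window is not free: the KV cache grows linearly with context length, on top of the model weights. A rough estimate, assuming a Llama-3.1-8B-style geometry (32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 cache; these figures are assumptions, check the model card):

```python
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_val=2):
    """Memory for the K and V caches: 2 tensors x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_val / 2**30

# Assumed Llama-3.1-8B-like geometry at the full 128K window.
full = kv_cache_gib(layers=32, kv_heads=8, head_dim=128, context=128 * 1024)
print(round(full, 1))  # 16.0 GiB of cache on top of the ~4.7GB weights
```

This is why runtimes typically default to a much smaller context (Ollama defaults to a few thousand tokens) and why the VRAM figures above assume modest windows; quantized KV caches shrink this proportionally.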

Llama 3.2

Meta · 3B

Efficient smaller models from Meta, perfect for on-device deployment. Available in 1B and 3B sizes.

Category: small · Tags: lightweight, fast, mobile
Size: 2GB · RAM: 4GB · VRAM: 3GB · Context: 128K
Performance (tokens/sec): CPU 35 · Mid GPU 120 · High GPU 200
ollama pull llama3.2

Llama 3.2 1B

Meta · 1B

Ultra-lightweight model from Meta. Perfect for edge devices and fast inference.

Category: small · Tags: tiny, fast, edge
Size: 0.75GB · RAM: 2GB · VRAM: 1.5GB · Context: 128K
Performance (tokens/sec): CPU 55 · Mid GPU 180 · High GPU 320
ollama pull llama3.2:1b

Llama 3.2 Vision

Meta · 11B

Multimodal model that can understand images and text. Available in 11B and 90B variants.

Category: vision · Tags: multimodal, image-understanding, vision
Size: 6.5GB · RAM: 10GB · VRAM: 8GB · Context: 128K
Performance (tokens/sec): CPU 12 · Mid GPU 45 · High GPU 90
ollama pull llama3.2-vision

Llama 3.2 Vision 90B

Meta · 90B

Meta's largest vision model. State-of-the-art multimodal understanding.

Category: vision · Tags: vision, multimodal, large
Size: 52GB · RAM: 60GB · VRAM: 56GB · Context: 128K
Performance (tokens/sec): CPU 2 · Mid GPU 10 · High GPU 28
ollama pull llama3.2-vision:90b

Llama 3.3

Meta · 70B

Meta's latest and most capable open model. Excellent for general tasks, coding, and reasoning with 128K context.

Category: general · Tags: chat, instruct, multilingual
Size: 43GB · RAM: 48GB · VRAM: 44GB · Context: 128K
Performance (tokens/sec): CPU 3 · Mid GPU 15 · High GPU 45
ollama pull llama3.3

LLaVA

LLaVA Team · 13B

Visual instruction-tuned model combining CLIP vision with Llama. Great for image understanding.

Category: vision · Tags: vision, multimodal, image-understanding
Size: 8GB · RAM: 12GB · VRAM: 10GB · Context: 4K
Performance (tokens/sec): CPU 10 · Mid GPU 40 · High GPU 85
ollama pull llava

LLaVA 34B

LLaVA Team · 34B

Large vision-language model. Excellent image understanding and reasoning.

Category: vision · Tags: vision, multimodal, large
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 4K
Performance (tokens/sec): CPU 4 · Mid GPU 20 · High GPU 48
ollama pull llava:34b

Magicoder

ise-uiuc · 7B

OSS-Instruct trained coding model. Excels at generating clean, documented code.

Category: coding · Tags: coding, clean-code, documented
Size: 4.1GB · RAM: 8GB · VRAM: 6GB · Context: 16K
Performance (tokens/sec): CPU 18 · Mid GPU 72 · High GPU 135
ollama pull magicoder

Mistral 7B

Mistral AI · 7B

Highly efficient 7B model that punches above its weight. Great balance of speed and capability.

Category: general · Tags: efficient, fast, instruct
Size: 4.1GB · RAM: 8GB · VRAM: 6GB · Context: 32K
Performance (tokens/sec): CPU 20 · Mid GPU 80 · High GPU 150
ollama pull mistral

Mistral Large

Mistral AI · 123B

Mistral's most capable model. Excellent for complex reasoning and enterprise use.

Category: general · Tags: enterprise, reasoning, large
Size: 72GB · RAM: 80GB · VRAM: 76GB · Context: 128K
Performance (tokens/sec): CPU 1.5 · Mid GPU 8 · High GPU 25
ollama pull mistral-large

Mistral Small

Mistral AI · 24B

Latest Mistral model optimized for efficiency. Enterprise-grade quality in a compact size.

Category: general · Tags: enterprise, efficient, function-calling
Size: 14GB · RAM: 18GB · VRAM: 16GB · Context: 32K
Performance (tokens/sec): CPU 8 · Mid GPU 35 · High GPU 80
ollama pull mistral-small

Mixtral 8x7B

Mistral AI · 8x7B (47B)

Mixture of Experts model that activates only 2 experts per token. Fast inference with high quality.

Category: general · Tags: moe, efficient, multilingual
Size: 26GB · RAM: 32GB · VRAM: 28GB · Context: 32K
Performance (tokens/sec): CPU 5 · Mid GPU 25 · High GPU 60
ollama pull mixtral
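A back-of-the-envelope calculation shows why an MoE like Mixtral generates faster than its 47B total suggests: per layer, only 2 of the 8 expert feed-forward blocks run for each token, while attention and embeddings are shared. A sketch, where the ~1.7B shared-parameter figure is an assumption for illustration, not an official number:

```python
def moe_active_params(total_b, n_experts, n_active, shared_b):
    """Parameters touched per token: shared weights plus the active experts' share
    of the remaining (expert) parameters. All values in billions."""
    expert_b = (total_b - shared_b) / n_experts
    return shared_b + n_active * expert_b

# Mixtral 8x7B: 47B total, 2 of 8 experts active, ~1.7B assumed shared.
active = moe_active_params(total_b=47, n_experts=8, n_active=2, shared_b=1.7)
print(round(active, 1))  # roughly 13B active per token
```

Note that memory does not shrink the same way: all 47B parameters must still be resident, which is why the RAM/VRAM figures above match a dense ~47B model even though throughput resembles a ~13B one.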

Moondream

vikhyatk · 1.8B

Tiny but capable vision model. Perfect for edge devices and fast image analysis.

Category: vision · Tags: vision, tiny, fast
Size: 1.1GB · RAM: 2GB · VRAM: 1.5GB · Context: 2K
Performance (tokens/sec): CPU 35 · Mid GPU 120 · High GPU 220
ollama pull moondream

Mxbai Embed Large

Mixedbread AI · 335M

State-of-the-art embedding model. Top performance on MTEB benchmarks.

Category: embedding · Tags: embedding, rag, retrieval
Size: 0.67GB · RAM: 2GB · VRAM: 1GB · Context: 1K
Performance (tokens/sec): CPU 300 · Mid GPU 1500 · High GPU 4000
ollama pull mxbai-embed-large

Neural Chat

Intel · 7B

Intel's fine-tuned Mistral for conversational AI. Optimized for dialogue.

Category: general · Tags: chat, conversational, intel
Size: 4.1GB · RAM: 8GB · VRAM: 6GB · Context: 8K
Performance (tokens/sec): CPU 20 · Mid GPU 78 · High GPU 145
ollama pull neural-chat

Nomic Embed Text

Nomic AI · 137M

High-quality text embedding model. Perfect for RAG, semantic search, and similarity matching.

Category: embedding · Tags: embedding, rag, search
Size: 0.27GB · RAM: 1GB · VRAM: 0.5GB · Context: 8K
Performance (tokens/sec): CPU 500 · Mid GPU 2000 · High GPU 5000
ollama pull nomic-embed-text

Nous Hermes 2

Nous Research · 34B

Nous Research's flagship model. Excellent for roleplay, creative writing, and reasoning.

Category: general · Tags: roleplay, creative, reasoning
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 4K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 58
ollama pull nous-hermes2

OpenHermes 2.5

Teknium · 7B

Powerful Mistral fine-tune trained on a large synthetic dataset. Great instruction following.

Category: general · Tags: instruct, chat, synthetic
Size: 4.1GB · RAM: 8GB · VRAM: 6GB · Context: 8K
Performance (tokens/sec): CPU 20 · Mid GPU 78 · High GPU 145
ollama pull openhermes

Orca Mini

Pankaj Mathur · 7B

Compact model with strong reasoning. Good balance between size and capability.

Category: small · Tags: reasoning, efficient, instruct
Size: 4GB · RAM: 8GB · VRAM: 6GB · Context: 4K
Performance (tokens/sec): CPU 22 · Mid GPU 85 · High GPU 160
ollama pull orca-mini

Phi-3

Microsoft · 14B

Microsoft's small language model. Surprisingly capable for its size, great for resource-constrained environments.

Category: small · Tags: efficient, small, reasoning
Size: 8.2GB · RAM: 12GB · VRAM: 10GB · Context: 128K
Performance (tokens/sec): CPU 15 · Mid GPU 55 · High GPU 110
ollama pull phi3

Phi-4

Microsoft · 14B

Microsoft's latest small model with exceptional reasoning. Trained on high-quality synthetic data.

Category: reasoning · Tags: reasoning, math, efficient
Size: 8.4GB · RAM: 12GB · VRAM: 10GB · Context: 16K
Performance (tokens/sec): CPU 14 · Mid GPU 52 · High GPU 105
ollama pull phi4

Qwen 2.5

Alibaba · 72B

Alibaba's flagship model with excellent multilingual support. Available from 0.5B to 72B.

Category: general · Tags: multilingual, chinese, chat
Size: 44GB · RAM: 50GB · VRAM: 46GB · Context: 128K
Performance (tokens/sec): CPU 2.8 · Mid GPU 14 · High GPU 40
ollama pull qwen2.5

Qwen 2.5 7B

Alibaba · 7B

Efficient 7B version of Alibaba's Qwen 2.5. Great multilingual support and reasoning.

Category: general · Tags: multilingual, chinese, efficient
Size: 4.4GB · RAM: 8GB · VRAM: 6GB · Context: 128K
Performance (tokens/sec): CPU 20 · Mid GPU 78 · High GPU 145
ollama pull qwen2.5:7b

Qwen 2.5 Coder

Alibaba · 32B

Specialized coding model with excellent code completion and generation. Supports 92 programming languages.

Category: coding · Tags: coding, programming, code-completion
Size: 19GB · RAM: 24GB · VRAM: 20GB · Context: 128K
Performance (tokens/sec): CPU 6 · Mid GPU 30 · High GPU 70
ollama pull qwen2.5-coder

Qwen 2.5 Coder 7B

Alibaba · 7B

Efficient coding model supporting 92 languages. Great for code completion on consumer hardware.

Category: coding · Tags: coding, programming, efficient
Size: 4.4GB · RAM: 8GB · VRAM: 6GB · Context: 128K
Performance (tokens/sec): CPU 20 · Mid GPU 80 · High GPU 150
ollama pull qwen2.5-coder:7b

SmolLM2

Hugging Face · 1.7B

Hugging Face's tiny but capable model. Perfect for on-device inference and testing.

Category: small · Tags: tiny, fast, efficient
Size: 1GB · RAM: 2GB · VRAM: 1.5GB · Context: 8K
Performance (tokens/sec): CPU 50 · Mid GPU 170 · High GPU 300
ollama pull smollm2

Snowflake Arctic Embed

Snowflake · 335M

High-performance embedding model. Top results on MTEB retrieval benchmarks.

Category: embedding · Tags: embedding, retrieval, high-quality
Size: 0.67GB · RAM: 2GB · VRAM: 1GB · Context: 1K
Performance (tokens/sec): CPU 300 · Mid GPU 1400 · High GPU 3500
ollama pull snowflake-arctic-embed

Solar

Upstage · 10.7B

Upstage's depth-upscaled model. Strong performance from its novel depth up-scaling training approach.

Category: general · Tags: chat, instruct, upscaled
Size: 6.3GB · RAM: 10GB · VRAM: 8GB · Context: 4K
Performance (tokens/sec): CPU 14 · Mid GPU 55 · High GPU 108
ollama pull solar

Stable Code

Stability AI · 3B

Stability AI's coding model. Good for code completion and generation tasks.

Category: coding · Tags: coding, small, fast
Size: 1.8GB · RAM: 4GB · VRAM: 3GB · Context: 16K
Performance (tokens/sec): CPU 35 · Mid GPU 130 · High GPU 240
ollama pull stable-code

StarCoder 2

BigCode · 15B

Code LLM trained on The Stack v2. Excellent for code completion across 600+ programming languages.

Category: coding · Tags: coding, code-completion, programming
Size: 9GB · RAM: 14GB · VRAM: 11GB · Context: 16K
Performance (tokens/sec): CPU 12 · Mid GPU 50 · High GPU 100
ollama pull starcoder2

TinyLlama

TinyLlama · 1.1B

Ultra-compact 1.1B model. Perfect for testing, edge devices, and resource-limited environments.

Category: small · Tags: tiny, fast, edge
Size: 0.64GB · RAM: 2GB · VRAM: 1GB · Context: 2K
Performance (tokens/sec): CPU 60 · Mid GPU 200 · High GPU 350
ollama pull tinyllama

Wizard Vicuna Uncensored

Eric Hartford · 13B

Uncensored model combining Wizard and Vicuna. Good for creative and unrestricted tasks.

Category: general · Tags: uncensored, creative, chat
Size: 7.9GB · RAM: 12GB · VRAM: 10GB · Context: 4K
Performance (tokens/sec): CPU 12 · Mid GPU 48 · High GPU 95
ollama pull wizard-vicuna-uncensored

WizardCoder

WizardLM · 33B

Evol-Instruct trained coding model. Strong on complex programming tasks.

Category: coding · Tags: coding, complex-tasks, evol-instruct
Size: 19GB · RAM: 24GB · VRAM: 21GB · Context: 8K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 58
ollama pull wizardcoder

Yi 1.5

01.AI · 34B

01.AI's bilingual model excelling in English and Chinese. Strong reasoning capabilities.

Category: general · Tags: bilingual, chinese, reasoning
Size: 20GB · RAM: 26GB · VRAM: 22GB · Context: 32K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 58
ollama pull yi

Yi Coder

01.AI · 9B

01.AI's specialized coding model. Strong performance on code generation benchmarks.

Category: coding · Tags: coding, programming, bilingual
Size: 5.4GB · RAM: 10GB · VRAM: 8GB · Context: 128K
Performance (tokens/sec): CPU 16 · Mid GPU 60 · High GPU 115
ollama pull yi-coder

Zephyr

Hugging Face · 7B

Hugging Face's DPO-aligned Mistral. Helpful and aligned assistant.

Category: general · Tags: chat, aligned, helpful
Size: 4.1GB · RAM: 8GB · VRAM: 6GB · Context: 32K
Performance (tokens/sec): CPU 20 · Mid GPU 78 · High GPU 145
ollama pull zephyr
馃 OllamaModels.com

Find the best local AI models for your hardware. Not affiliated with Ollama.