---
license: apache-2.0
language:
- en
pipeline_tag: question-answering
---

# litert-community/Gecko-110m-en

This model provides several variants of the embedding model published in the [Gecko paper](https://arxiv.org/abs/2403.20327) that are ready for deployment on Android or iOS using the [LiteRT stack](https://ai.google.dev/edge/litert) or the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).

## Use the models

### Android

* Try out the Gecko embedding model in the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag). You can find the SDK on [GitHub](https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents/rag) or follow our [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android) to install it directly from Maven. We have also published a [sample app](https://github.com/google-ai-edge/ai-edge-apis/tree/main/examples/rag).
* Use the SentencePiece model as the tokenizer for the Gecko embedding model, as sketched below.
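
A minimal Kotlin sketch of wiring the Gecko model and its SentencePiece tokenizer into the RAG SDK, following the Android guide linked above. The class and package names are taken from that guide and may differ across SDK versions; the on-device file paths and the GPU flag are placeholder assumptions.

```kotlin
import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import java.util.Optional

// Placeholder paths: push the LiteRT model and the SentencePiece tokenizer
// to the device first (e.g. with `adb push`).
const val GECKO_MODEL_PATH = "/data/local/tmp/gecko.tflite"
const val TOKENIZER_MODEL_PATH = "/data/local/tmp/sentencepiece.model"
const val USE_GPU_FOR_EMBEDDINGS = true

// GeckoEmbeddingModel pairs the embedding model with its tokenizer; the last
// argument selects the GPU backend (false falls back to CPU).
val embedder: Embedder<String> = GeckoEmbeddingModel(
    GECKO_MODEL_PATH,
    Optional.of(TOKENIZER_MODEL_PATH),
    USE_GPU_FOR_EMBEDDINGS,
)
```

The resulting `embedder` can then be handed to the SDK's retrieval components, as shown in the sample app.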
## Performance

### Android

Note that all benchmark stats are from a Samsung S23 Ultra.

<table border="1">
  <tr>
    <th>Quantization</th>
    <th>Backend</th>
    <th>Max sequence length</th>
    <th>Init time (ms)</th>
    <th>Inference time (ms)</th>
    <th>Memory (RSS in MB)</th>
    <th>Model size (MB)</th>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">GPU</p></td>
    <td><p style="text-align: right">256</p></td>
    <td><p style="text-align: right">1306.06</p></td>
    <td><p style="text-align: right">76.2</p></td>
    <td><p style="text-align: right">604.5</p></td>
    <td><p style="text-align: right">114</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">GPU</p></td>
    <td><p style="text-align: right">512</p></td>
    <td><p style="text-align: right">1363.38</p></td>
    <td><p style="text-align: right">173.2</p></td>
    <td><p style="text-align: right">604.6</p></td>
    <td><p style="text-align: right">120</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">GPU</p></td>
    <td><p style="text-align: right">1024</p></td>
    <td><p style="text-align: right">1419.87</p></td>
    <td><p style="text-align: right">397</p></td>
    <td><p style="text-align: right">871.1</p></td>
    <td><p style="text-align: right">145</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">CPU</p></td>
    <td><p style="text-align: right">256</p></td>
    <td><p style="text-align: right">11.03</p></td>
    <td><p style="text-align: right">147.6</p></td>
    <td><p style="text-align: right">126.3</p></td>
    <td><p style="text-align: right">114</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">CPU</p></td>
    <td><p style="text-align: right">512</p></td>
    <td><p style="text-align: right">30.04</p></td>
    <td><p style="text-align: right">353.1</p></td>
    <td><p style="text-align: right">225.6</p></td>
    <td><p style="text-align: right">120</p></td>
  </tr>
  <tr>
    <td><p style="text-align: right">dynamic_int8</p></td>
    <td><p style="text-align: right">CPU</p></td>
    <td><p style="text-align: right">1024</p></td>
    <td><p style="text-align: right">79.17</p></td>
    <td><p style="text-align: right">954</p></td>
    <td><p style="text-align: right">619.5</p></td>
    <td><p style="text-align: right">145</p></td>
  </tr>
</table>

* Model size: measured as the size of the .tflite flatbuffer (the serialization format for LiteRT models).
* Memory: an indicator of peak RAM usage.
* Inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads (see the sketch after this list).
* Inference on GPU is accelerated via the LiteRT GPU delegate.
* Benchmarks assume the XNNPACK cache is enabled.
* dynamic_int8: a quantized model with int8 weights and float activations.
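
Outside the RAG SDK, the model can also be run directly with the LiteRT interpreter. The Kotlin sketch below mirrors the CPU benchmark configuration above (4 threads; XNNPACK is the default CPU path in recent LiteRT releases). The tensor shapes and the 768-dimensional output are illustrative assumptions, not a documented contract; inspect the interpreter's input and output tensors for the variant you download.

```kotlin
import org.tensorflow.lite.Interpreter
import java.io.File

// Runs one embedding pass on CPU with 4 threads, matching the CPU rows in
// the benchmark table above.
fun embed(modelPath: String, tokenIds: IntArray): FloatArray {
    val options = Interpreter.Options().setNumThreads(4)
    Interpreter(File(modelPath), options).use { interpreter ->
        // Assumed I/O layout for illustration: a [1, seq_len] int32 input of
        // token ids (produced by the SentencePiece tokenizer) and a [1, 768]
        // float output; verify with interpreter.getInputTensor(0) and
        // interpreter.getOutputTensor(0) on your model file.
        val input = arrayOf(tokenIds)
        val output = Array(1) { FloatArray(768) }
        interpreter.run(input, output)
        return output[0]
    }
}
```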