Welcome to the performance showcase of Bielik, a state-of-the-art Polish language model. This leaderboard presents Bielik's capabilities compared to other models in the European LLM Leaderboard (Thellmann, K., Stadler, B., Fromm, M., Schulze Buschhoff, J., Jude, A., Barth, F., Leveling, J., Flores-Herr, N., Köhler, J., Jäkel, R., & Ali, M. (2024). Towards Multilingual LLM Evaluation for European Languages).
Bielik is designed specifically for Polish language understanding and generation, demonstrating strong performance across various natural language processing tasks.
Bielik-11B-v2.3-Instruct performs strongly across the leaderboard's language understanding and reasoning tasks, achieving an average score of 0.66. This places it third among the evaluated models, behind only Gemma-2-27b-Instruct and Meta-Llama-3.1-70B-Instruct, while outperforming larger models such as Mixtral-8x7B.
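For readers who want to reproduce this kind of multi-task average locally, the sketch below uses the open-source lm-evaluation-harness (`pip install lm_eval`). It is a minimal illustration only: the exact task set, prompts, few-shot settings, and harness version behind the European LLM Leaderboard may differ, and the task names listed here are illustrative rather than the leaderboard's own.

```python
# Minimal sketch: per-task scores and a cross-task average with lm-evaluation-harness.
# Assumption: the task names below are illustrative, not the leaderboard's exact task set.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=speakleash/Bielik-11B-v2.3-Instruct,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "gsm8k"],
    batch_size=8,
)

scores = []
for task, metrics in results["results"].items():
    # Take the first numeric metric reported for the task (metric names vary by task),
    # skipping standard-error entries.
    value = next(v for k, v in metrics.items()
                 if "stderr" not in k and isinstance(v, (int, float)))
    scores.append(value)
    print(f"{task}: {value:.3f}")

print(f"average: {sum(scores) / len(scores):.3f}")
```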
Key highlights of Bielik's performance:
In German language evaluation, Bielik-11B-v2.3-Instruct shows good performance with an average score of 0.62, positioning it in the middle range of evaluated models. Despite being primarily trained on Polish and English data, the model demonstrates reasonable cross-lingual transfer capabilities:
These results are particularly noteworthy as they demonstrate Bielik's ability to generalize to German despite not being explicitly trained on German language data. While it doesn't match the performance of top models like Meta-Llama-3.1-70B-Instruct (0.71) or Gemma-2-27b-Instruct (0.71), it maintains competitive performance against similarly sized models and shows promising cross-lingual capabilities.
In Czech language evaluation, Bielik-11B-v2.3-Instruct achieves a solid average score of 0.60, again showing notable cross-lingual transfer from its Polish- and English-focused training data. This result places it in the top tier of evaluated models, with clear differences in strength across tasks:
While the top performers Meta-Llama-3.1-70B-Instruct (0.71) and Gemma-2-27b-Instruct (0.70) keep their lead, Bielik's result stands out given that the model was not explicitly trained on Czech data. It shows robust zero-shot cross-lingual transfer, performing comparably to or better than several larger models, including some variants of Mixtral-8x7B and the Meta-Llama-3 series, especially on structured reasoning tasks such as GSM8K.
In the FLORES200 translation benchmark for Polish language, Bielik-11B-v2.3-Instruct demonstrates competitive performance with an average BLEU score of 13.515, positioning it in the middle range of evaluated models. What's particularly interesting is the model's asymmetric performance in translation directions:
These results are notable given Bielik's specialized training focus on Polish and English. While larger models such as EuroLLM-9B-Instruct (20.65) and Meta-Llama-3.1-70B-Instruct (19.52) achieve higher overall scores, Bielik's strong performance when translating into Polish aligns with its design goals and demonstrates effective specialization for Polish language generation.
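BLEU scores of this kind are commonly computed with sacrebleu, comparing a model's translations against reference sentences and aggregating over the whole test set. The sketch below shows the basic scoring call; the sentences are placeholders rather than FLORES200 data, and the exact tokenization and evaluation settings used by the leaderboard may differ.

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (pip install sacrebleu).
# The sentences below are illustrative placeholders, not FLORES200 data.
from sacrebleu.metrics import BLEU

# System outputs for an en->pl direction, one hypothesis per source sentence.
hypotheses = [
    "Kot siedzi na macie.",
    "Jutro pojedziemy do Warszawy.",
]
# Reference translations; sacrebleu expects a list of reference streams.
references = [[
    "Kot siedzi na macie.",
    "Jutro jedziemy do Warszawy.",
]]

bleu = BLEU()
score = bleu.corpus_score(hypotheses, references)
print(score)        # full signature with precisions and brevity penalty
print(score.score)  # the single corpus-level BLEU value
```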
Detailed analysis of Bielik-11B-v2.3-Instruct's performance in the FLORES200 translation benchmark reveals distinct patterns across the language pairs involving Polish. The model's capabilities are asymmetric between translation directions, consistent with its training focus on Polish and English:
These results show that Bielik excels in its primary training languages (Polish and English) and transfers well to closely related or widely spoken European languages. The model generally performs better when translating into Polish (average BLEU 15.31) than when translating from Polish (average BLEU 11.36), suggesting stronger generation capabilities in its primary training language. Performance drops significantly for less closely related languages, particularly Baltic and Finno-Ugric ones, indicating limits on cross-lingual transfer to more distant language families.
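The directional averages quoted above can be obtained by grouping per-pair BLEU scores by whether Polish is the target or the source language and averaging each group. A minimal sketch, using purely illustrative placeholder scores rather than the leaderboard values:

```python
# Minimal sketch: averaging per-pair BLEU scores by translation direction.
# The numbers below are illustrative placeholders, not the leaderboard results.
from statistics import mean

bleu_scores = {
    ("eng_Latn", "pol_Latn"): 18.0,  # English -> Polish (placeholder)
    ("pol_Latn", "eng_Latn"): 17.0,  # Polish -> English (placeholder)
    ("deu_Latn", "pol_Latn"): 14.5,
    ("pol_Latn", "deu_Latn"): 11.0,
    ("fin_Latn", "pol_Latn"): 9.0,
    ("pol_Latn", "fin_Latn"): 6.5,
}

into_pl = [s for (src, tgt), s in bleu_scores.items() if tgt == "pol_Latn"]
from_pl = [s for (src, tgt), s in bleu_scores.items() if src == "pol_Latn"]

print(f"average BLEU into Polish: {mean(into_pl):.2f}")
print(f"average BLEU from Polish: {mean(from_pl):.2f}")
```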