Generated by All in One SEO v4.9.5.1, this is an llms.txt file, used by LLMs to index the site.

# Sketch Engine language corpus management and query system

## Sitemaps

- [XML Sitemap](https://www.sketchengine.eu/sitemap.xml): Contains all public & indexable URLs for this website.

## Posts

- [Chinese New Year brings new Chinese corpora](https://www.sketchengine.eu/news/2026-03-02/) - In February, we published both Chinese corpora and sent a reminder about Lexicom in Palermo.
- [It's freezing outside, but we're still publishing corpora!](https://www.sketchengine.eu/news/2026-02-02/) - In January 2026, Sketch Engine published Latvian and Urdu corpora.
- [Happy New Year from Sketch Engine!](https://www.sketchengine.eu/news/2026-9-1/) - In December 2025, we launched an Advent Calendar featuring a useful tip or hidden feature.
- [Advent with Sketch Engine: Fresh Insights & a Major Trends Update](https://www.sketchengine.eu/news/2025-12-1/) - Follow Sketch Engine on social media! We're launching an Advent Calendar featuring a useful tip or hidden feature every day until Christmas.
- [Discover improved text analysis!](https://www.sketchengine.eu/news/2025-10-01/) - In September 2025, we improved text analysis.
- [Open registration for Lexicom 2026 and new corpora!](https://www.sketchengine.eu/news/2025-11-03/) - In October 2025, we opened registration for Lexicom 2026 and introduced new corpora.
- [Sketch Engine news: enhanced tools, new corpora, and Lexicom 2024 in Spain](https://www.sketchengine.eu/news/sketch-engine-news-2023-11-01/) - Sketch Engine news: enhanced tools, new corpora, and Lexicom 2024 in Spain.
- [Introducing the lecturer for Lexicom 2024 and new corpora](https://www.sketchengine.eu/news/2024-03-01/) - A regular Sketch Engine newsletter introduces the lecturer for Lexicom 2024 and the new corpora published in February 2024.
- [Term extraction from non-aligned docs, Lexicom 2024, and the largest corpus!](https://www.sketchengine.eu/news/2024-05-01/) - In April 2024, we introduced term extraction from non-aligned documents, the new Arabic corpus, the largest English corpus, and an invitation to Lexicom 2024.
- [Summer update: new ParlaTalk and Trends corpora](https://www.sketchengine.eu/news/2025_09_01/) - In summer 2025, we updated the ParlaTalk corpora and introduced new Trends corpora.
- [A new Russian corpus, Sketch Engine use case and DMLex.](https://www.sketchengine.eu/news/2025-07-01/) - In June 2025, we introduced a new Russian corpus and the implementation of DMLex in Lexonomy.
- [BNC audio](https://www.sketchengine.eu/news/bnc-audio/) - The British National Corpus with audio is now available. Log in to Sketch Engine and listen to the audio recordings in the BNC corpus.
- [Sketch Engine news: corpora and trends, Trados plugin, and Spanish term extraction](https://www.sketchengine.eu/news/sketch-engine-news-2023-12-01/) - Sketch Engine supports monolingual and bilingual term extraction. Read more about linguistic tools for term extraction on our blog.
- [A bunch of new corpora and workshop in Kuwait](https://www.sketchengine.eu/news/sketch-engine-news-2024-01-01/) - Sketch Engine is home to 700+ corpora in 100+ languages. Choose the right corpus from the list of available corpora. You can also create your own corpus from the web or from your texts.
- [Advance your skills, explore updated corpora, and apply for the Kilgarriff Prize.](https://www.sketchengine.eu/news/2024-6-1/) - In May 2024, Sketch Engine reminds you how to advance your skills, apply for the Kilgarriff Prize, and explore updated corpora.
- [Enhance your text analysis skills with new Corpora and Tools!](https://www.sketchengine.eu/news/2024-08-01/)
- [Lexicom 2025, new corpora and features!](https://www.sketchengine.eu/news/2024-11-01/) - In October 2024, we reminded you about Lexicom 2025 and introduced a new Vietnamese corpus and DANTE, a lexical database for English, in Lexonomy.
- [New Lexonomy with a new guide!](https://www.sketchengine.eu/news/2025-04-01/) - In March 2025, we released a new version of Lexonomy with a guide. Malay Corpus and Italian in OCT.
- [Chinese localization and upcoming Lexicom](https://www.sketchengine.eu/news/2025-06-02/) - In May 2025, Sketch Engine became available in Chinese.
- [Automated word sense identification, multi-word term extraction for more languages, and new corpora](https://www.sketchengine.eu/news/sketch-engine-news-2024-02-01/) - Automated word sense identification, multi-word term extraction for more languages, and new corpora.
- [New Year, new data: Maldivian corpus, NLP opportunities, Lexicom](https://www.sketchengine.eu/news/2025-02-03/) - In January 2025, Sketch Engine unveils fresh corpus insights, innovative analysis tools, and expanded learning opportunities.
- [Expand your linguistic research with new corpora!](https://www.sketchengine.eu/news/2025-03-03/) - In February 2025, we introduced new German, Norwegian and Belarusian corpora.
- [Open applications for AK Prize, new corpora and better term extraction](https://www.sketchengine.eu/news/2024-09-03/) - In August 2024: open applications for the AK Prize, new corpora and better term extraction.
- [Happy New Year with a bunch of new corpora!](https://www.sketchengine.eu/news/2025_01_02/) - In December 2024, we introduced new corpora for Lithuanian, Finnish and Swedish.
- [Discover the new Timeline and other features.](https://www.sketchengine.eu/news/2024-07-01/) - In June 2024, discover new features in Sketch Engine.
- [New corpora: Arabic parallel corpora, Icelandic Gigaword Corpus, and Czech Trends](https://www.sketchengine.eu/news/2024-04-01/) - In March 2024, Sketch Engine introduced new corpora for Icelandic, Czech, and parallel corpora for Arabic. Explore these new corpora and their tools.
- [New corpora, tips and improvement.](https://www.sketchengine.eu/news/2024-12-02/) - In November 2024, Sketch Engine introduced new corpora for Spanish, improved Polish, and added a new overview panel for parallel corpora in corpus info.
- [Japanese interface, Portuguese corpus 2023 and open registration for Lexicom 2025](https://www.sketchengine.eu/news/2024-10-01/) - In September 2024, we published a Japanese translation of the interface and the new Portuguese corpus 2023, and opened registration for Lexicom 2025.
- [No "business as usual" with Russia anymore](https://www.sketchengine.eu/news/no-business-as-usual-with-russia-anymore/) - This is a public statement by Miloš Jakubíček, CEO of Lexical Computing, on Sketch Engine unavailability in Russia and Belarus.
- [Free Sketch Engine for Learner Corpus Association members](https://www.sketchengine.eu/news/free-sketch-engine-for-learner-corpus-association-members/)
- [PhD studentship at the University of Brighton](https://www.sketchengine.eu/news/phd-studentship-at-the-university-of-brighton/)
- [The best term extraction](https://www.sketchengine.eu/blog/the-best-term-extraction/) - Term extraction or terminology extraction is an automatic method of analyzing text. Find terms in your text with the OneClick Terms tool.
- [Your data are safe with us.](https://www.sketchengine.eu/news/secure-data-officially/)
- [POS tags](https://www.sketchengine.eu/blog/pos-tags/) - A POS tag (part-of-speech tag) is a label showing the part of speech of each token (word) in a text corpus. POS tags are assigned automatically by a POS tagger.
- [XLIFF support for multilingual files](https://www.sketchengine.eu/news/xliff-support-for-multilingual-files/)
- [Early English Books Online corpus](https://www.sketchengine.eu/news/early-english-books-online-corpus/) - Early English Books Online corpus
- [Find good examples in German with Sketch Engine](https://www.sketchengine.eu/news/german-examples-with-sketch-engine/) - Find good German examples in context, explore German collocations or use the German thesaurus for German synonyms. Our tool deSkELL, a free simplified interface of Sketch Engine, is the right choice for these types of tasks.
- [Topics and genres in corpora](https://www.sketchengine.eu/blog/topics-and-genres-in-corpora/) - Topics and genres are text types (metadata) that enrich the corpus with information about the subject of the texts or the writing styles.
- [Automatic word sense induction](https://www.sketchengine.eu/news/automatic-word-sense-induction/) - Introducing a new functionality of the word sketch tool that identifies word senses automatically. Try this function, which can categorize collocations into groups.
- [Boot Camp - a face-to-face course in using Sketch Engine](https://www.sketchengine.eu/news/boot-camp-a-face-to-face-cours-in-using-sketch-engine/)
- [A new corpus of Tibetan](https://www.sketchengine.eu/news/a-new-corpus-of-tibetan/) - A new corpus of the Classical Tibetan language with a size of 80 million words has been added to Sketch Engine.
- [Build a corpus from the web](https://www.sketchengine.eu/blog/build-a-corpus-from-the-web/) - Building a linguistically valuable corpus from the web requires techniques to avoid the inclusion of textual content which could make the corpus useless.
- [Most frequent or most typical collocations?](https://www.sketchengine.eu/blog/most-frequent-or-most-typical-collocations/) - Which collocations are more useful - the most frequent collocations or the most typical collocations?
- [New Italian word sketches](https://www.sketchengine.eu/news/new-italian-word-sketches/) - The collocation search for Italian corpora was improved in Sketch Engine. Search Italian collocations with the word sketch function.
- [API examples in Java, Python, R and Bash](https://www.sketchengine.eu/news/api-examples-in-java-python-r-and-bash/)
- [New French word sketches](https://www.sketchengine.eu/news/new-french-word-sketches/) - The collocation search for French corpora was improved in Sketch Engine. Search French collocations with the word sketch function.
- [Better tools for Portuguese corpora](https://www.sketchengine.eu/news/better-tools-for-portuguese-corpora/) - Better tools are now available for Portuguese corpora in Sketch Engine. The new tools can recognise Brazilian and Portuguese language varieties. We also improved the search for Portuguese collocations.
- [Sketch Engine for terminologists](https://www.sketchengine.eu/news/sketch-engine-for-terminologists/)
- [Automatic thesaurus](https://www.sketchengine.eu/blog/automatic-thesaurus-synonyms-for-all-words/) - Read about how the automatic thesaurus works. Can computations identify synonyms and similar words in a text automatically?
- [October calendar page with word sketches](https://www.sketchengine.eu/news/calendar-october-2018/) - October's calendar page shows the main Sketch Engine feature, the word sketch: a tool giving a one-page summary of a word's collocational behaviour.
- [Corpus annotation and structures](https://www.sketchengine.eu/blog/corpus-annotation-and-structures/) - How to annotate a corpus with metadata and divide a corpus into smaller parts using corpus structures.
- [From parallel corpora to bilingual terminology: a hybrid approach](https://www.sketchengine.eu/sketch-engine-events/from-parallel-corpora-to-bilingual-terminology-a-hybrid-approach/) - Miloš Jakubíček will give a talk at the conference Translating and the Computer 37 on Thursday 26th November.
- [languages of user corpora infographics](https://www.sketchengine.eu/news/languages-of-user-corpora-infographics/)
- [Timestamped corpus in 18 languages](https://www.sketchengine.eu/news/timestamped-corpus-in-18-languages/) - Search the timestamped diachronic corpus created from the Jozef Stefan Institute Newsfeed in 18 languages.
- [Case sensitive and insensitive corpus analysis](https://www.sketchengine.eu/blog/case-sensitive-and-insensitive-corpus-analysis/) - Learn to use case insensitive and case sensitive mode when searching and analysing corpora. The blog also explains the function of the lowercase attribute.
- [Build a parallel corpus](https://www.sketchengine.eu/news/build-a-parallel-corpus/) - Learn to build a parallel corpus from the web or your own data.
- [Words, tags, lemmas, lemposes, lowercase](https://www.sketchengine.eu/blog/words-tags-lemmas-lemposes-lowercase/) - This blog post explains words, tags, lemmas, lemposes, lowercase and other attributes found in a text corpus and used in corpus searching and analysis.
- [Bigger and up-to-date Timestamped JSI web corpora](https://www.sketchengine.eu/news/bigger-and-up-to-date-timestamped-jsi-web-corpora/) - Timestamped JSI web corpora in 18 languages, now with data up to the present. Use diachronic analysis to discover new words!
- [More concordance context](https://www.sketchengine.eu/news/more-concordance-context/) - The new switch displays more left and right context in a KWIC concordance. It may require horizontal scrolling.
- [Improved support for Thai](https://www.sketchengine.eu/news/improved-support-for-thai/)
- [Spoken British National Corpus 2014](https://www.sketchengine.eu/news/spoken-british-national-corpus-2014/) - The 11-million-word Spoken British National Corpus 2014 is now available in Sketch Engine.
- [CQL builder](https://www.sketchengine.eu/news/cql-builder/)
- [Fit more text on the screen](https://www.sketchengine.eu/news/fit-more-text-on-the-screen/)
- [Display and hide statistics and counts.](https://www.sketchengine.eu/news/display-and-hide-statistics-and-counts/) - Word frequency, statistics and scores can be displayed or hidden. The setting can be different for each corpus.
- [Parallel corpus - how to search](https://www.sketchengine.eu/news/parallel-corpus-how-to-search/)
- [Searching for hyphenated, non-hyphenated and space-separated words in one step](https://www.sketchengine.eu/news/searching-for-hyphenated-non-hyphenated-and-space-separated-words-in-one-step/)
- [Search words and phrases in one step](https://www.sketchengine.eu/news/search-words-and-phrases-in-one-step/) - Use the pipe to search for words and phrases in one step.
- [Old interface closes down](https://www.sketchengine.eu/news/old-interface-closes-down/)
- [Jozef Stefan Institute Timestamped Corpus](https://www.sketchengine.eu/news/jsi-newsfeed-corpus/) - A unique diachronic corpus of English newsfeeds, the Jozef Stefan Institute Timestamped web Corpus, has been added to Sketch Engine.
- [Adam Kilgarriff Prize: announcement of winner](https://www.sketchengine.eu/news/adam-kilgarriff-prize-announcement-of-winner/)
- [Sketch Engine calendar 2019 – March](https://www.sketchengine.eu/news/sketch-engine-calendar-2019-march/) - The Sketch Engine calendar for March 2019 explains the basics of regular expressions and their purpose.
- [A few photos from SkEW and eLex2015](https://www.sketchengine.eu/news/a-few-photos-from-skew-and-elex2015/)
- [Sketch Engine calendar 2019 – January](https://www.sketchengine.eu/news/sketch-engine-calendar-2019-january/)
- [Sketch Engine calendar 2018 – September](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-september/) - Learn how to use character classes in regular expressions. See an example using regular expressions to search for words starting with a capital letter.
- [Sketch Engine calendar 2018 – August](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-august/) - Learn how to label tokens in your CQL query when using regular expressions. The Sketch Engine calendar for August 2018 explains labelling tokens.
- [Sketch Engine Masterclass, Kazan, Russia](https://www.sketchengine.eu/news/sketch-engine-masterclass-kazan-russia/)
- [Sketch Engine calendar 2018 – July](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-july/) - Learn how to search only for tokens consisting of alphanumeric characters using the regular expression metacharacter \w.
- [Sketch Engine workshop at Europhras 2017](https://www.sketchengine.eu/news/sketch-engine-workshop-at-europhras-2017/)
- [N'ko corpus](https://www.sketchengine.eu/news/nko-corpus-news/)
- [Complex concordance searches are now easier](https://www.sketchengine.eu/news/complex-concordance-searches-are-now-easier/)
- [An update to Discovering English with Sketch Engine 2nd edition](https://www.sketchengine.eu/news/an-update-to-discovering-english-with-sketch-engine-2nd-edition/) - The updated 2nd edition of the book Discovering English with Sketch Engine by James Thomas has been published.
- [New Chinese Word Sketches](https://www.sketchengine.eu/news/new-chinese-word-sketches/)
- [Spanish: NEW rich collocations and NEW clitics handling](https://www.sketchengine.eu/news/spanish-new-rich-collocations-and-new-clitics-handling/) - Spanish Word Sketches now provide rich collocation information. We have improved Spanish clitics handling too.
- [English Preposition Corpus](https://www.sketchengine.eu/news/english-preposition-corpus-1/) - The English Prepositions Corpus uncovers the senses and usage of prepositions in English.
- [Meet Sketch Engine in Madrid](https://www.sketchengine.eu/news/meet-sketch-engine-in-madrid/)
- [EUR-Lex Judgements Corpus](https://www.sketchengine.eu/news/eur-lex-judgements-corpus/) - A multilingual corpus of the judgments of the EU Court of Justice.
- [Brexit Corpus](https://www.sketchengine.eu/news/brexit-corpus-referendum/) - A new Brexit corpus is available to search. It is a collection of texts about the UK referendum on the withdrawal from the EU.
- [Extended corpus of English broadsheets](https://www.sketchengine.eu/news/extended-corpus-of-english-broadsheets/)
- [New corpus from the environment domain](https://www.sketchengine.eu/news/new-corpus-from-environment-domain/) - The EcoLexicon corpus is a collection of environmental texts that was developed by the LexiCon Research Group at the University of Granada.
- [New academic English corpus](https://www.sketchengine.eu/news/new-academic-english-corpus/) - A new corpus of academic English is now available in Sketch Engine. The corpus was collected from the database of open access journals DOAJ.
- [Discover trending words in English newspapers](https://www.sketchengine.eu/news/discover-trending-words-in-english-broadsheets/)
- [Lemmatization & tagging for Greek](https://www.sketchengine.eu/news/lemmatization-tagging-for-greek/) - Sketch Engine now provides lemmatization and part-of-speech tagging for Greek text corpora. Try it for free.
- [Better Danish](https://www.sketchengine.eu/news/better-danish/) - We improved tools for processing text corpora in the Danish language.
- [Lexicom 2018, Jesus College, Cambridge, UK](https://www.sketchengine.eu/news/lexicom-2018-jesus-college-cambridge-uk/) - The next Lexicom, a workshop in lexicography & lexical computing, takes place at Jesus College, Cambridge, UK, 11–15 Sep 2018.
- [Bigger ACL Anthology Reference Corpus](https://www.sketchengine.eu/news/acl-corpus-news/) - The ACL Anthology Reference Corpus has been extended to 62 million words. The corpus consists of papers on computational linguistics.
- [A new Amharic corpus](https://www.sketchengine.eu/news/a-new-amharic-corpus/) - A new 25-million-word Amharic corpus has been added to Sketch Engine.
- [New English corpus from the Web](https://www.sketchengine.eu/news/15-billion-word-english-corpus/) - See our new 15-billion-word English corpus (enTenTen) comprised of texts from the Web until the end of 2015. Work with new data processed by our most up-to-date text tools.
- [Sketch Engine calendar 2018 – March](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-march/) - This year we again prepared a Sketch Engine calendar with useful CQL examples. See or download the calendar page for March with handy examples of an optional character and repetitions.
- [A new Belarusian corpus (beTenTen)](https://www.sketchengine.eu/news/new-belarusian-corpus/) - beTenTen, a new 63-million-word Belarusian corpus from the web, has been added to Sketch Engine.
- [New version of Danish corpus from the web](https://www.sketchengine.eu/news/new-version-of-danish-corpus-from-the-web/) - Search the new version of the Danish corpus from the web, a 2-billion-word Danish corpus from 2017. Search current texts with the most recent Danish terms and collocations.
- [Sketch Engine calendar 2018 – April](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-april/) - Search the CQL query: April Fools' day. Find out how to find punctuation in corpora in Sketch Engine. Use the CQL (Corpus Query Language) feature to build more advanced queries.
- [Sketch Engine calendar 2018 – June](https://www.sketchengine.eu/news/sketch-engine-calendar-2018-june/) - Check our Sketch Engine calendar 2018. The June page teaches you how to search inside a corpus structure using the Corpus Query Language operator within.
- [Sketch Engine workshop, University of Surrey, Guildford, UK](https://www.sketchengine.eu/news/sketch-engine-workshop-university-of-surrey-guildford-uk/)
- [Happy New Year!](https://www.sketchengine.eu/news/257/)
- [SDL Trados Studio plugin](https://www.sketchengine.eu/news/sdl-trados-studio-plugin/)
- [Photos from Lexicom trip](https://www.sketchengine.eu/news/photos-from-lexicom-trip/)
- [Adam's blog](https://www.sketchengine.eu/news/255/)
- [Discovering English with Sketch Engine, 2nd edition](https://www.sketchengine.eu/news/discovering-english-with-sketch-engine-2nd-edition/)
- [Lexicom 2015 in Telč, Czech Republic](https://www.sketchengine.eu/news/365/)
- [Webinar "Sketch Engine for translation and terminology"](https://www.sketchengine.eu/news/webinar-sketch-engine-for-translation-and-terminology/)
- [Lexicom courses in Europe and USA](https://www.sketchengine.eu/sketch-engine-events/lexicom-courses-in-europe-and-usa/) - In 2016, two Lexicom courses will be held, one in the USA and one in Europe. The first U.S. course will take place at the University of Colorado Boulder from 6th to 10th June 2016. The second European course will take place at the Austrian Academy of Sciences in Vienna from 11th to 15th July 2016. Lexicom is
- [Upgrade your lexicography and lexical computing skills](https://www.sketchengine.eu/news/upgrade-your-lexicography-and-lexical-computing-skills/)
- [Calendar 2017](https://www.sketchengine.eu/news/calendar-2017/)
- [webcorpora.org to adopt Sketch Engine technology](https://www.sketchengine.eu/news/webcorpora-org-to-adopt-sketch-engine-technology/)
- [6th International Sketch Engine Workshop (SKEW6)](https://www.sketchengine.eu/news/6th-international-sketch-engine-workshop-skew6/)
- [Discovering English with Sketch Engine](https://www.sketchengine.eu/news/discovering-english-with-sketch-engine/)
- [Future directions and research agenda of Sketch Engine, LDA’15](https://www.sketchengine.eu/news/future-directions-and-research-agenda-of-sketch-engine-lda15/)
- [Updated programme of SkEW-6](https://www.sketchengine.eu/news/2214/)
- [Sketch Engine in Arabic](https://www.sketchengine.eu/news/sketch-engine-in-arabic/)
- [SKEW-6, the 6th International Sketch Engine Workshop](https://www.sketchengine.eu/news/253/)
- [Adam Kilgarriff passed away](https://www.sketchengine.eu/news/16th-may/)
- [Your lecture slides = an academic corpus](https://www.sketchengine.eu/news/4-4-2015/)
- [Extract terminology from your lectures](https://www.sketchengine.eu/news/extract-terminology-from-your-lectures/)
- [Use Sketch Engine as a terminology tool](https://www.sketchengine.eu/news/use-sketch-engine-as-a-terminology-tool/)
- [Spanish interface](https://www.sketchengine.eu/news/spanish-interface/)
- [Big dictionary publishers use Sketch Engine](https://www.sketchengine.eu/news/big-dictionary-publishers-use-sketch-engine/)
- [Sketch Engine in videos](https://www.sketchengine.eu/news/sketch-engine-in-videos/)
- [French localization](https://www.sketchengine.eu/news/french-localization/)
- [Update of Service Level Agreement](https://www.sketchengine.eu/news/update-of-service-level-agreement/)
- [Better Bulgarian!](https://www.sketchengine.eu/news/better-bulgarian/)
- [Adam Kilgarriff Prize](https://www.sketchengine.eu/news/adam-kilgarriff-prize/)
- [New multilingual resource in 24 languages is available](https://www.sketchengine.eu/news/new-multilingual-resource-in-24-languages-is-available/)
- [7th Sketch Engine Workshop at LREC2016](https://www.sketchengine.eu/sketch-engine-events/7th-sketch-engine-workshop-at-lrec2016/) - 7th Sketch Engine Workshop will be at LREC 2016 in Portorož, Slovenia. We are looking forward to seeing you there!
- [Lexicom workshops in 2016: USA and Europe](https://www.sketchengine.eu/sketch-engine-events/lexicom-workshops-in-2016-usa-and-europe/) - Do you want to become a professional lexicographer or to learn about new trends in lexicography or meet other enthusiastic lexicographers from all over the world or all of these? Then join us in our masterclass workshop Lexicom, this year both in USA (June) and Europe (July). Register now at www.lexmasterclass.com!

## Pages

- [Home Page - Create and search a text corpus](https://www.sketchengine.eu/) - Sketch Engine is the ultimate corpus tool to create and search text corpora in 100+ languages. Try a 30-day free trial.
- [Academic and non-academic subscriptions](https://www.sketchengine.eu/academic-and-non-academic-subscriptions/) - Academic for universities and schools who use Sketch Engine for their research and teaching process. Non-academic for lexicography, translation, terminology, etc.
- [Request subscription to OneClick Terms](https://www.sketchengine.eu/oneclick-terms-support/request-subscription/)
- [Elsevier OA CC-BY Corpus](https://www.sketchengine.eu/elsevier-oa-cc-by-corpus/) - Search the Elsevier OA CC-BY Corpus, the 187-million-word English corpus of 40,000 Scientific Research Papers from Elsevier Open Access Journals.
- [Dot corpus](https://www.sketchengine.eu/dot-corpus/)
- [Statistics used in Sketch Engine](https://www.sketchengine.eu/statistics-used-in-sketch-engine/) - See in detail the statistical methods and formulas used in Sketch Engine to calculate collocations for word sketches, thesaurus, and other results.
- [Corpus info page](https://www.sketchengine.eu/guide/corpus-info-page/)
- [Corpus Configuration File: All Features](https://www.sketchengine.eu/documentation/corpus-configuration-file-all-features/) - A list of all features and their attributes contained in the corpus configuration file in Sketch Engine.
- [Maltese Reference Corpus](https://www.sketchengine.eu/maltese-reference-corpus/) - Search the Maltese Reference corpus. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized.
- [Maltese Trends corpus](https://www.sketchengine.eu/maltese-trends-corpus/) - Search the Maltese Trends corpus, a Maltese monitor corpus of news articles gathered from RSS feeds. The corpus is updated weekly with 100 thousand words.
- [Arabic Trends corpus](https://www.sketchengine.eu/arabic-trends-corpus/) - Search the Arabic Trends corpus, an Arabic monitor corpus made up of news articles gathered from RSS feeds. The corpus is updated daily with 1–2 million words.
- [Frequently Asked Questions (FAQ)](https://www.sketchengine.eu/frequently-asked-questions/)
- [Corpora by language](https://www.sketchengine.eu/corpora-and-languages/) - Languages supported by the corpus building tool in the Sketch Engine corpus system.
- [Quick Start Guide](https://www.sketchengine.eu/quick-start-guide/) - A brief course on how to use Sketch Engine, a corpus management system. It includes lessons on term extraction, concordance and collocations.
- [Price List](https://www.sketchengine.eu/price-list/) - Users can start with a 30-day free trial and then subscribe to Sketch Engine depending on the purpose of using Sketch Engine.
- [Lexicographers](https://www.sketchengine.eu/user-guide/lexicographers/) - Retrieve language data for dictionaries and other lexicographic projects. Use the automated dictionary drafting and dictionary entry building.
- [Russian corpus for SKELL](https://www.sketchengine.eu/russian-skell-corpus/) - Search Russian Corpus for SKELL, the 1-billion-word Russian corpus behind the ruSKELL interface, which is used for learning Russian. This corpus contains sentences sorted by their quality.
- [Documentation](https://www.sketchengine.eu/documentation/) - Documentation for expert users related to corpus building, text analysis API, corpus querying and customising Sketch Engine.
- [English Corpus for SkELL](https://www.sketchengine.eu/english-skell-corpus/) - English Corpus for SKELL is the text corpus behind the English SKELL interface, containing only sentences, sorted by their quality.
- [Czech Corpus for SKELL](https://www.sketchengine.eu/cskell/) - Search csSKELL corpus, the 1.4-billion-word Czech corpus behind csSKELL, a free online tool for learning Czech from real-life texts.
- [etSKELL – Estonian corpus for SKELL](https://www.sketchengine.eu/etskell-estonian-corpus/) - Search etSKELL corpus, the 280-million-word Estonian corpus behind the etSKELL interface, which is used for learning Estonian. This corpus contains sentences sorted by their quality.
- [Maltese part-of-speech tagset v3.0](https://www.sketchengine.eu/maltese-part-of-speech-tagset-v3/) - Maltese POS tagset v3.0 is a list of POS tags used to indicate grammatical categories for Maltese corpora in Sketch Engine.
- [Referencing Sketch Engine and bibliography](https://www.sketchengine.eu/bibliography-of-sketch-engine/)
- [Manual annotation – Skema](https://www.sketchengine.eu/guide/manual-annotation-skema/) - Skema is a manual annotation tool that allows users to label, annotate or categorize concordance lines.
- [CQL - Practical examples](https://www.sketchengine.eu/documentation/cql-practical-examples/)
- [Polish Parliamentary Corpus (PPC)](https://www.sketchengine.eu/polish-parliamentary-corpus/) - Search the Polish Parliamentary Corpus, a 550-million-word Polish corpus of documents of the Polish Parliament, Sejm and Senate covering the period 1919–2019.
- [Malay Trends corpus](https://www.sketchengine.eu/malay-trends-corpus/) - Search the Malay Trends corpus, a monitor corpus made up of news articles gathered from RSS feeds. This timestamped corpus is updated daily with 70 thousand words.
- [GDEX - Good Dictionary Examples](https://www.sketchengine.eu/guide/gdex/) - GDEX, or Good Dictionary Examples, is a system of evaluating sentences with regard to their suitability to serve as dictionary or teaching examples.
- [Word list quote](https://www.sketchengine.eu/word-list-quote/)
- [Account limitations](https://www.sketchengine.eu/guide/account-limitations/)
- [Annotating corpus text](https://www.sketchengine.eu/guide/annotating-corpus-text/) - How to add metadata or annotate a corpus via structures, attributes and values which Sketch Engine can process into text types.
- [Create a corpus by uploading files](https://www.sketchengine.eu/guide/create-corpus-from-files/) - Create a multi-million-word corpus within minutes. Fully automatic lemmatization and tagging in 30+ languages. Register for a free 30-day trial!
- [Trends - diachronic analysis](https://www.sketchengine.eu/guide/trends/) - Neologisms and diachronic analysis of word usage for detecting new words in a language. Find new words in your time-annotated text corpora.
- [User Guide](https://www.sketchengine.eu/guide/) - User guide and manual showing how to search, analyze and build text corpora in Sketch Engine.
- [Fine-tune your corpus](https://www.sketchengine.eu/documentation/fine-tune-your-corpus/) - Fine-tune your corpus to make it easier to use and more user-friendly. Improve the configuration file of your corpus to make working with it easier.
- [teTenTen – Telugu Corpus from the Web](https://www.sketchengine.eu/tetenten-telugu-corpus/) - Search teTenTen, a 101-million-word Telugu corpus of web texts. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized.
- [English Trends corpus](https://www.sketchengine.eu/english-trends-corpus/) - Search the English Trends corpus, the largest English monitor corpus of news articles gathered from RSS feeds. Updated weekly with 100 million new words.
- [SoNaR corpus](https://www.sketchengine.eu/sonar-corpus/) - A 500-million-token Dutch reference corpus of texts from all parts of the Netherlands from 1954 to 2011.
- [Australian Legislative Corpus 2023](https://www.sketchengine.eu/australian-legislative-corpus/) - Search the Australian Legislative Corpus 2023 and explore the acts, regulations, and rules across the nine major jurisdictions in Australia.
- [Subscriptions, invoices, storage space](https://www.sketchengine.eu/guide/subscriptions-invoices-storage-space/) - To access your subscriptions, invoices, and storage space info, click the person icon (top right corner) → My account → SUBSCRIPTIONS & INVOICES.
- [ESLORA corpus](https://www.sketchengine.eu/eslora-corpus/) - Explore the spoken variant of the Spanish language via the ESLORA corpus.
- [Setting up parallel and multilingual corpora](https://www.sketchengine.eu/guide/setting-up-parallel-corpora/) - Create, upload, and set up parallel and multilingual corpora in Sketch Engine.
- [EUR-Lex parallel corpus](https://www.sketchengine.eu/eur-lex-parallel-corpus/) - Search EUR-Lex parallel corpus, the multilingual corpora of documents in 24 languages. Use the largest parallel corpus built from European language resources.
- [OPUS parallel corpora](https://www.sketchengine.eu/opus-parallel-corpora/) - Search the OPUS parallel corpora, the multilingual corpora in 40 languages. Make a concordance or generate n-grams, word lists, collocations and more.
- [Kannada web corpus](https://www.sketchengine.eu/kannada-web-corpus/) - Search knWaC, an 11-million-word Kannada corpus of text from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized.
- [Telugu web corpus](https://www.sketchengine.eu/teluguwac-corpus/) - Telugu web as corpus is a 3-million-word collection with part-of-speech tagging. Search the corpus with Sketch Engine.
- [Hindi Corpus (HindiWaC)](https://www.sketchengine.eu/hindiwac-hindi-corpus/) - Search HindiWaC, the 107-million-word Hindi corpus of texts collected from the Hindi web. Texts were cleaned and deduplicated.
- [Yoruba corpus (yoWaC)](https://www.sketchengine.eu/yowac-yoruba-corpus/) - Search yoWaC, the 2.8-million-word Yoruba corpus of texts from the Yoruba national domain. Texts were cleaned and deduplicated.
- [Indonesian web corpus](https://www.sketchengine.eu/indonesianwac-corpus/) - Indonesian web corpus (IndonesianWaC) is a 100-million-word text corpus of texts collected from the Internet. Generate concordance, n-grams, collocations.
- [Penn Corpus of Historical English](https://www.sketchengine.eu/penn-corpus-of-historical-english/) - The Penn Parsed Corpus of Historical English (PennHistEn) is a corpus of Middle English and Modern British English (from mid 12th to early 20th century).
- [Maltese part-of-speech tagset v2.0](https://www.sketchengine.eu/maltese-part-of-speech-tagset/) - Maltese MLSS POS tagset is a list of POS tags used to indicate grammatical categories for Maltese corpora in Sketch Engine.
- [Regular expressions](https://www.sketchengine.eu/guide/regular-expressions/) - Regular expressions are special characters used instead of unspecified letters and numbers in searches. Practise them in the regex course.
- [CQL - within & containing](https://www.sketchengine.eu/documentation/cql-within-containing/) - [CQL - global conditions](https://www.sketchengine.eu/documentation/cql-global-conditions/) - [Maltese part-of-speech tagsets](https://www.sketchengine.eu/maltese-part-of-speech-tagsets/) - Sketch Engine provides various part-of-speech tagsets that indicate grammatical categories in Maltese corpora. - [Hindi Trends corpus](https://www.sketchengine.eu/hindi-trends-corpus/) - Search the Hindi Trends corpus, a Hindi monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 2 million words. - [MULTEXT-East Slovenian part-of-speech tagset (version 5 and 6)](https://www.sketchengine.eu/slovene-tagset-multext-east-v5/) - MULTEXT-East Morphosyntactic Slovenian specification v5 is a list of POS tags for Slovene corpora in Sketch Engine. - [MULTEXT-East Slovenian part-of-speech tagset (version 4)](https://www.sketchengine.eu/slovene-tagset-multext-east-v4/) - MULTEXT-East Morphosyntactic Slovenian specification v4 is a list of POS tags for Slovene corpora in Sketch Engine. - [lvTenTen – Latvian corpus from the web](https://www.sketchengine.eu/lvtenten-latvian-corpus/) - Search lvTenTen, the 1.1-billion-word Latvian corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [Prices for Academic Teams and Institutions](https://www.sketchengine.eu/prices-for-academic-teams-and-institutions/) - Prices of Sketch Engine subscriptions for Academic Teams and Institutions. - [Boot Camp Brno](https://www.sketchengine.eu/bootcamp/boot-camp-brno/) - [Tibetan part-of-speech tagset](https://www.sketchengine.eu/tibetan-part-of-speech-tagset/) - Tagset for Tibetan is a list of POS tags used to indicate grammatical categories in Tibetan corpora of Sketch Engine. 
- [Marathi part-of-speech tagset](https://www.sketchengine.eu/marathi-iiit-h-tagset/) - Tagset for Indian languages (Bengali, Hindi, Telugu, ...) is a list of POS tags used to indicate grammatical categories in Indian corpora. - [Oxford Corpus of Academic English](https://www.sketchengine.eu/oxford-corpus-of-academic-english/) - Search OCAE, the 85-million-word Oxford Corpus of Academic English which consists of undergraduate textbooks and journals about various sciences and humanities. - [Additional storage for academic accounts](https://www.sketchengine.eu/additional-storage-academic/) - [Privacy Policy](https://www.sketchengine.eu/gdpr-privacy-consent/) - This text describes how we treat your personal information in accordance with Regulation (EU) 2016/679 GDPR (General Data Protection Regulation). - [Create a corpus from the web](https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/) - Create a multi-million-word corpus from the web within minutes. Fully automatic corpus building, lemmatization and tagging in 50+ languages. - [Wordlist - frequency lists](https://www.sketchengine.eu/guide/wordlist-frequency-lists/) - The word list tool uses a text corpus to generate frequency lists of words, lemmas, nouns, verbs and other parts of speech. Regular expressions can be used for detailed criteria. - [Prices for non-academic personal accounts](https://www.sketchengine.eu/prices-for-non-academic-personal-accounts/) - Prices for non-academic personal accounts, intended for freelance translators, interpreters, and SEO or marketing professionals who sell their services to multiple clients or companies. - [Prices for Academic Personal Accounts](https://www.sketchengine.eu/prices-for-academic-personal-accounts-2/) - Check the prices for academic personal accounts in Sketch Engine. 
- [Tagsets](https://www.sketchengine.eu/tagsets/) - A tagset is a list of labels used to indicate the part of speech in text corpora. This page lists all POS tagsets used in Sketch Engine. - [Khmer part-of-speech tagset by RDRPOSTagger](https://www.sketchengine.eu/khmer-part-of-speech-tagset-by-rdrpostagger/) - This Khmer part-of-speech tagset is available in Khmer corpora tagged by RDRPOSTagger. - [itSKELL – Italian corpus for SkELL](https://www.sketchengine.eu/itskell-italian-corpus/) - Search itSKELL corpus, the 300-million-word Italian corpus behind itSKELL, which is a free online interface for language learners and teachers to learn and teach Italian. - [deSKELL – German corpus for SKELL](https://www.sketchengine.eu/deskell-german-corpus/) - Search deSKELL corpus, the 770-million-word German corpus behind deSKELL which is a free online tool for learning German in real-life texts. - [LatinISE corpus](https://www.sketchengine.eu/latinise-corpus/) - Search the LatinISE corpus, the 11-million-word Latin corpus made of the LacusCurtius, Intratext and Musisque Deoque websites. Texts were lemmatized and part-of-speech tagged. - [Institutional login - Single Sign On SSO](https://www.sketchengine.eu/guide/single-sign-on-sso/) - How to log in to Sketch Engine via your institution using single sign-on (SSO) with your university username and password. - [English Preposition Corpus](https://www.sketchengine.eu/english-preposition-corpus/) - The Preposition Corpus (TPP) based on the Pattern Dictionary of English Prepositions (PDEP) is designed to describe the behavior of prepositions. - [ukWaC – British Web corpus](https://www.sketchengine.eu/ukwac-british-web-corpus/) - Search ukWaC – British Web corpus, the 1.3-billion-word British Web corpus of texts from the United Kingdom national domain .uk. Texts were cleaned and deduplicated. 
- [UKWaCsst corpus](https://www.sketchengine.eu/ukwacsst-corpus/) - UKWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006). Attributes include the Penn TreeBank tags and SuperSenseTagger WordNet labels and Named Entity Labels. There are also named entity and MWE structures produced in the vertical from the sst output. The Sketch Grammar is undergoing research and development. Changelog v1.0 (March 2012) 370,023,634 tokens - [Chinese Gigaword corpus](https://www.sketchengine.eu/chinese-gigaword/) - Chinese Gigaword consists of newswire data with POS tagging. It enables searching for collocations and n-grams in Chinese journalism. - [Adam Kilgarriff: Structured bibliography](https://www.sketchengine.eu/adam-kilgarriff-structured-bibliography/) - Adam Kilgarriff, founder of Sketch Engine, and his structured bibliography. - [mrTenTen – Marathi corpus from the web](https://www.sketchengine.eu/mrtenten-marathi-corpus/) - Search mrTenTen, the 342-million-word Marathi corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [Transhistorical Corpus of Written English](https://www.sketchengine.eu/transhistorical-corpus-of-written-english/) - Search the Transhistorical Corpus of Written English, a diachronic text corpus containing data from the fifteenth to the twenty-first century. - [urTenTen – Urdu corpus from the web](https://www.sketchengine.eu/urtenten-urdu-corpus/) - Search urTenTen, the 328-million-word Urdu corpus of texts from the web. Make Urdu concordances, generate Urdu word lists or n-grams. - [Create subcorpora to share with other users](https://www.sketchengine.eu/documentation/create-subcorpora-to-share-with-other-users/) - A subcorpus is a smaller part of a text corpus based on specific attributes; Sketch Engine lets you share it with other users. - [Corpus hosting](https://www.sketchengine.eu/corpus-hosting/) - Let your corpus shine with our state-of-the-art corpus query tools. 
Sketch Engine corpus hosting is a safe and permanent place to display your corpus. - [deTenTen - German corpus from the web](https://www.sketchengine.eu/detenten-german-corpus/) - Search deTenTen, the 16-billion-word German corpus of texts collected from the web. Texts were part-of-speech tagged and lemmatized. Genre and topic classification available. - [enTenTen — English corpus from the web](https://www.sketchengine.eu/ententen-english-corpus/) - Search enTenTen, the 36-billion-word English corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [zhTenTen – Chinese corpus from the web](https://www.sketchengine.eu/zhtenten-chinese-corpus/) - Search zhTenTen 2021, the 14.9-billion-word Chinese corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged, lemmatized. - [Ask a question](https://www.sketchengine.eu/ask-a-question/) - Ask a question about Sketch Engine. Would this tool be right for you? Check whether Sketch Engine has the features you need. - [Request support](https://www.sketchengine.eu/request-support/) - Request support - Are you not sure how to use a feature? Is something not working? We will listen and will get back to you. - [Contact Us](https://www.sketchengine.eu/contact-us/) - Contact us with support queries, feedback, or general questions asking for information about Sketch Engine. - [Italian TreeTagger part-of-speech tagset using Marco Baroni’s parameter file](https://www.sketchengine.eu/italian-treetagger-part-of-speech-tagset/) - Italian TreeTagger POS tagset is a list POS tags used to indicate grammatical categories for Italian corpora in Sketch Engine with Baroni’s parameter file. - [bgTenTen – Bulgarian corpus from the web](https://www.sketchengine.eu/bgtenten-bulgarian-corpus/) - Search bgTenTen, the 4.6-billion-word Bulgarian corpus 2021 of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. 
- [arTenTen – Arabic corpus from the web](https://www.sketchengine.eu/artenten-arabic-corpus/) - Search arTenTen, the 6.5-billion-word Arabic corpus of texts from the web. Texts were part-of-speech tagged and lemmatized. - [esTenTen – Spanish corpus from the web](https://www.sketchengine.eu/estenten-spanish-corpus/) - Search esTenTen, the 28-billion-word Spanish corpus of texts from the web. It was cleaned and deduplicated, and contains part-of-speech tagging and lemmatization. - [frTenTen – French corpus from the web](https://www.sketchengine.eu/frtenten-french-corpus/) - Search frTenTen, the 23-billion-word French corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. Genre and topic classification available. - [noTenTen – Norwegian corpus from the web](https://www.sketchengine.eu/notenten-norwegian-corpus/) - Search noTenTen, the 2.4-billion-word Norwegian corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [ruTenTen – Russian corpus from the web](https://www.sketchengine.eu/rutenten-russian-corpus/) - Search ruTenTen, the 19-billion-word Russian corpus of texts from the web. Texts were cleaned and deduplicated, part-of-speech tagged and lemmatized. - [TenTen Corpus Family](https://www.sketchengine.eu/documentation/tenten-corpora/) - Search TenTen corpora, the billion-word corpus family of 50+ languages with a target size of 10 billion words. Each TenTen corpus provides a list of features. - [beTenTen – Belarusian corpus from the web](https://www.sketchengine.eu/betenten-belarusian-corpus/) - Search beTenTen, the 51-million-word Belarusian corpus of texts from the web. Generate Belarusian concordances, word lists or n-grams. - [igTenTen – Igbo corpus from the web](https://www.sketchengine.eu/igtenten-igbo-corpus/) - [msTenTen — Malay corpus from the web](https://www.sketchengine.eu/mstenten-malay-corpus/) - Search msTenTen, the 805-million-word Malay corpus of texts from the web. 
The texts include part-of-speech tagging and lemmatization. - [idTenTen – Indonesian corpus from the web](https://www.sketchengine.eu/idtenten-indonesian-corpus/) - Search idTenTen, the 7.1-billion-word Indonesian corpus of texts from the web. Texts were cleaned and deduplicated, part-of-speech tagged and lemmatized. - [dvTenTen – Maldivian corpus from the web](https://www.sketchengine.eu/dvtenten-maldivian-corpus/) - Search dvTenTen, the 20-million-word Maldivian corpus of texts from the web. This corpus of Dhivehi contains genre and topic classification. - [ltTenTen – Lithuanian corpus from the web](https://www.sketchengine.eu/lttenten-lithuanian-corpus/) - Search ltTenTen, the 2.3-billion-word Lithuanian corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized, and classified by genres and topics. - [ptTenTen – Portuguese corpus from the web](https://www.sketchengine.eu/pttenten-portuguese-corpus/) - Search ptTenTen, the 16-billion-word Portuguese corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized, and genre and topic classified. - [Czech Trends corpus](https://www.sketchengine.eu/czech-trends-corpus/) - Search the Czech Trends corpus, a Czech monitor corpus made up of news articles gained from their RSS feeds. This timestamped corpus is updated daily by 1–2 million words. - [Italian Trends corpus](https://www.sketchengine.eu/italian-trends-corpus/) - Search the Italian Trends corpus, an Italian monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 3–4 million words. - [Slovak Trends corpus](https://www.sketchengine.eu/slovak-trends-corpus/) - Search the Slovak Trends corpus, a Slovak monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 100,000 words. 
- [kaTenTen — Georgian corpus from the web](https://www.sketchengine.eu/katenten-georgian-corpus/) - Search kaTenTen, the 1-billion-token Georgian corpus of texts from the web. Texts were cleaned and deduplicated. - [ELEXIS corpora](https://www.sketchengine.eu/elexis-corpora/) - Search the ELEXIS corpora or their subset containing semantically annotated corpora with word sense disambiguation (WSD). - [nlTenTen – Dutch corpus from the web](https://www.sketchengine.eu/nltenten-dutch-corpus/) - Search nlTenTen, the 5.9-billion-word Dutch corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. Genre and topic classification available. - [ukTenTen – Ukrainian corpus from the web](https://www.sketchengine.eu/uktenten-ukrainian-corpus/) - Search ukTenTen, the 7.5-billion-word Ukrainian corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged, and lemmatized. - [Chinese Web 2005 corpus (Internet-ZH)](https://www.sketchengine.eu/internet-zh-corpus/) - Search Internet-ZH, the 198-million-word Chinese corpus prepared by Serge Sharoff in 2005. The corpus is part-of-speech tagged. - [Guangwai-Lancaster Chinese Learner Corpus](https://www.sketchengine.eu/guangwai-lancaster-chinese-learner-corpus/) - Search Guangwai-Lancaster Chinese Learner Corpus with Sketch Engine. Generate collocations, n-grams, concordances. - [sqTenTen — Albanian corpus from the web](https://www.sketchengine.eu/sqtenten-albanian-corpus/) - Search sqTenTen, the 500-million-word Albanian corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [taTenTen – Tamil corpus from the web](https://www.sketchengine.eu/tatenten-tamil-corpus/) - Search taTenTen, the 823-million-word Tamil corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and stemmed. 
- [guTenTen – Gujarati corpus from the web](https://www.sketchengine.eu/gutenten-gujarati-corpus/) - Search guTenTen, the 88-million-word Gujarati corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and stemmed. - [pnbTenTen – Western Punjabi corpus from web](https://www.sketchengine.eu/pnbtenten-western-punjabi-corpus/) - [afTenTen – Afrikaans corpus from the Web](https://www.sketchengine.eu/aftenten-afrikaans-corpus/) - Search afTenTen, the 142-million-word Afrikaans corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [elTenTen – Greek corpus from the web](https://www.sketchengine.eu/eltenten-greek-corpus/) - Search elTenTen, the 2.3-billion-word Greek corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged, lemmatized. - [Irish corpus from the web](https://www.sketchengine.eu/gatenten-irish-corpus/) - Search gaTenTen, the 125-million-word Irish corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [bnTenTen - Bengali corpus from the web](https://www.sketchengine.eu/bntenten-bengali-corpus/) - Search bnTenTen, the 470-million-word Bengali corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [Tibetan corpus](https://www.sketchengine.eu/tibetan-corpus/) - Search ACTib, the 170-million-word Tibetan corpus of Classical Tibetan. Texts were lemmatized and part-of-speech tagged. - [NoSketch Engine](https://www.sketchengine.eu/nosketch-engine/) - This page presents main difference in features between Sketch Engine (commercial version) and NoSketch Engine (open source version). - [United Nations Parallel Corpus (UNPC)](https://www.sketchengine.eu/united-nations-parallel-corpus-unpc/) - Search the United Nations Parallel Corpus (UNPC), the multi-million parallel corpora in six languages: Arabic, Chinese, English, French, Russian and Spanish. 
- [Polish NKJP part-of-speech tagset](https://www.sketchengine.eu/polish-nkjp-part-of-speech-tagset/) - Polish NKJP POS tagset is a list of POS tags used to indicate grammatical categories for Polish corpora in Sketch Engine. - [Training by the Sketch Engine team](https://www.sketchengine.eu/user-guide/sketch-engine-training/) - Workshops and training events in lexicography, corpus querying, NLP, corpus linguistics and lexical computing. - [Santa Barbara Corpus of Spoken American English](https://www.sketchengine.eu/santa-barbara-corpus-of-spoken-american-english/) - Explore the English spoken corpus, which consists of authentic recordings of people from all over the US. Metadata and recordings are part of the corpus. - [Sketch Engine team](https://www.sketchengine.eu/sketch-engine-team/) - [Word sketch – collocations and word combinations](https://www.sketchengine.eu/guide/word-sketch-collocations-and-word-combinations/) - The word sketch shows the most typical collocations and word combinations of each word in the language identified in a text corpus. - [ParlaTalk corpora of parliamentary debates](https://www.sketchengine.eu/parlatalk-corpora/) - The ParlaTalk corpora are a set of 22 parliamentary debate corpora in 20 languages, covering the parliaments of 22 EU member states, totaling 3 billion words. - [Limerick Corpus of Irish English](https://www.sketchengine.eu/limerick-corpus-of-irish-english/) - Explore the spoken variant of Irish English in the Limerick Corpus of Irish English (LCIE). - [ELTeC — European Literary Text Collection Corpora](https://www.sketchengine.eu/eltec-european-literary-text-collection-corpora/) - ELTeC is a collection of corpora consisting of novels, suitable for linguists or anyone interested in researching literary works. - [Norwegian Nynorsk Trends corpus](https://www.sketchengine.eu/norwegian-nynorsk-trends-corpus/) - Search the Norwegian Nynorsk Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. 
The corpus is updated weekly by 100 thousand words. - [Norwegian Bokmål Trends corpus](https://www.sketchengine.eu/norwegian-bokmal-trends-corpus/) - Search the Norwegian Bokmål Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 1 million words. - [Word Sketch lesson](https://www.sketchengine.eu/quick-start-guide/word-sketch-lesson/) - Word Sketch is a tool gathering information from millions of examples of use and provides a one-page summary of categorised collocations with links to examples. - [Tamil Trends corpus](https://www.sketchengine.eu/tamil-trends-corpus/) - Search the Tamil Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 1 million words. - [Hungarian Trends corpus](https://www.sketchengine.eu/hungarian-trends-corpus/) - Search the Hungarian Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 5 million words. - [Greek Trends corpus](https://www.sketchengine.eu/greek-trends-corpus/) - Search the Greek Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 10 million words. - [Hebrew Trends corpus](https://www.sketchengine.eu/hebrew-trends-corpus/) - Search the Hebrew Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 3 million words. - [Catalan Trends corpus](https://www.sketchengine.eu/catalan-trends-corpus/) - Search the Catalan Trends corpus, a monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 1 million words. - [Chinese Trends corpus](https://www.sketchengine.eu/chinese-trends-corpus/) - Search the Chinese Trends corpus, a Chinese monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated weekly by 1 million words. 
- [Sketch Engine Localisation](https://www.sketchengine.eu/documentation/sketch-engine-localisation/) - The Sketch Engine interface can be translated into any other language. To do this we need to translate all the interface strings into the particular language. Sketch Engine uses the simple and popular gettext translation system. The strings are stored in a file ske.po (only example file, for the up-to-date version of this file, request it - [British National Corpus (BNC)](https://www.sketchengine.eu/british-national-corpus-bnc/) - Search the British National Corpus (BNC), a 100-million-word text corpus of written and spoken language (incl. audio records). Texts are PoS tagged, lemmatized. - [Terms of Service corpus](https://www.sketchengine.eu/terms-of-service-corpus/) - A legal corpus of English terms and conditions for translation, analysis, and training. Explore legal phraseology via Sketch Engine. - [German Political Speeches Corpus](https://www.sketchengine.eu/german-political-speeches-corpus/) - Search the 11-million-word German Political Speeches Corpus. Texts were cleaned, part-of-speech tagged and lemmatized. - [Šolar: Slovenian Learner corpus of school essays](https://www.sketchengine.eu/solar-slovenian-learner-corpus-of-school-essays/) - Corpus Šolar is an error-annotated Slovenian corpus of authentic texts written by students in Slovene primary and secondary schools. The corpus contains 1 million words. - [German RFTagger part-of-speech tagset](https://www.sketchengine.eu/german-rftagger-part-of-speech-tagset/) - German RFTagger POS tagset is a list POS tags used to indicate grammatical categories for German corpora in Sketch Engine. - [GerManC. A Historical Corpus of German Newspapers 1650–1800](https://www.sketchengine.eu/germanc-corpus/) - GerManC: A Historical Corpus of German Newspapers 1650–1800 search with Sketch Engine. 
- [MULTEXT-East Croatian part-of-speech tagset (version 5)](https://www.sketchengine.eu/multext-east-croatian-part-of-speech-tagset/) - MULTEXT-East Morphosyntactic Croatian specification is a list POS tags for Croatian corpora in Sketch Engine. - [Word sketch difference - compare words](https://www.sketchengine.eu/guide/word-sketch-difference-compare-words/) - The word sketch difference compares the use, meaning and connotations of two words. It uses their collocation for the comparison. - [Product naming ideas](https://www.sketchengine.eu/user-guide/product-naming-ideas/) - Get ideas for product names, trademarks and brands, check their connotation the contexts in which they appear. Use text analysis with materials from the competition. Sign up for a free trial! - [Open Cambridge Learner Corpus (Uncoded)](https://www.sketchengine.eu/cambridge-learner-corpus/) - Search OpenCLC corpus, the English balanced corpus collecting exams of learners of English. Explore 2.9 million words from 10,000 student responses. - [Students](https://www.sketchengine.eu/user-guide/students/) - [English TreeTagger PoS tagset with modifications](https://www.sketchengine.eu/english-treetagger-pipeline-2/) - See a list of part-of-speech tags used to indicate grammatical categories in English corpora in Sketch Engine. - [English Penn Treebank part-of-speech Tagset](https://www.sketchengine.eu/penn-treebank-tagset/) - See a list of part-of-speech tags included in the English Penn Treebank tagset used in English text corpora within Sketch Engine. - [Building word sketches from parsed corpora](https://www.sketchengine.eu/documentation/building-sketches-from-parsed-corpora/) - Building word sketches from parsed corpora in the CoNLL or sCoNLL format. Creating a user corpus from the CONLL format in the interface. 
- [English Penn Treebank Tagset – ukWaC version](https://www.sketchengine.eu/penn-treebank-tagset-2/) - English Penn Treebank Tagset II is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [Build a corpus from the web - lesson](https://www.sketchengine.eu/build-a-corpus-from-the-web-lesson/) - Learn to build a corpus in a few steps or watch a 5-minute video lesson. Sketch Engine enables you to create your own corpus from the web or your own documents. - [Choose the right corpus](https://www.sketchengine.eu/quick-start-guide/choose-corpus-lesson/) - [Command Line Tools](https://www.sketchengine.eu/documentation/command-line-tools/) - This page lists command line tools available for local installations of Sketch Engine. - [SiBol corpus of English broadsheets](https://www.sketchengine.eu/sibol-corpus/) - Search SiBol corpus, the 850-million-word English corpus of articles collected from English broadsheet newspapers from 1993 to 2021. - [Leaning concordance](https://www.sketchengine.eu/concordance-new-feature/) - [etTenTen – Estonian corpus from the web](https://www.sketchengine.eu/ettenten-estonian-corpus/) - Search etTenTen23, the 1.5-billion-word Estonian corpus of texts from the web. Texts are part-of-speech tagged, lemmatized including genre and topic annotation. - [Word sense induction](https://www.sketchengine.eu/guide/word-sense-induction/) - Word sense induction is a function in the Word Sketch tool in Sketch Engine. The tool automatically identifies senses of the word by grouping collocations. - [Concordance - lesson](https://www.sketchengine.eu/quick-start-guide/concordance-lesson/) - The concordance is used to find examples of a word, lemma, phrase, tag or even a complex grammatical or lexical structure. 
- [Definitions of the “lc” and “lemma_lc” attributes](https://www.sketchengine.eu/documentation/lc-attributes/) - Modifying a corpus configuration file with the "lc" and "lemma_lc" attributes can significantly increase the speed of various query operations. - [Preparing a Text Corpus for Sketch Engine: Overview](https://www.sketchengine.eu/documentation/preparing-a-text-corpus-for-the-sketch-engine-overview/) - This page describes how to prepare a text corpus for indexation by the Manatee corpus management system used as the underlying database backend in Sketch Engine. Text corpus from a technical point of view The informal definition of a text corpus usually boils down to something close to "any collection of texts in electronic form". From - [Annotate a corpus](https://www.sketchengine.eu/guide/annotate-a-corpus/) - To annotate a corpus means to add information about texts in the corpus. This can relate to documents, paragraphs, sentences, words or tokens. - [Text Types, Headers and Subcorpora](https://www.sketchengine.eu/documentation/text-types-headers-and-subcorpora/) - [Writing a Sketch Grammar](https://www.sketchengine.eu/documentation/writing-sketch-grammar/) - [Simple maths](https://www.sketchengine.eu/documentation/simple-maths/) - Simple maths is a method for computing the keyness score to identify keywords and terms in a text or corpus. It includes options to focus on higher or lower frequency words. - [British Academic Spoken English Corpus](https://www.sketchengine.eu/british-academic-spoken-english-corpus/) - Search BASE corpus, the British Academic Spoken English corpus of 160 lectures from the University of Warwick and the University of Reading. - [zsmWaC – Malaysian corpus from the web](https://www.sketchengine.eu/malaysianwac-malaysian-corpus/) - Search the 230-million-word Malaysian corpus (Malay language used in Malaysia) of texts collected from the web in 2010. Texts were cleaned and deduplicated. 
- [Word Sketches definition files](https://www.sketchengine.eu/word-sketches-definition-files/) - The following files can be used for building word sketches in Corpus Builder assuming the texts were POS-tagged and lemmatized using the ​TreeTagger (the tagger used by Corpus Builder). Download English (Penn tagset)​ French (Penn tagset)​ Spanish - [Word Sketch Index Format](https://www.sketchengine.eu/word-sketch-index-format/) - This page is a brief overview of the development of the word sketch indices over time. version 4 (released 2015) fixed issues in indexing very small corpora changes to score computations (documented on the page Statistics used in Sketch Engine) version 3 (released 2014; internal use only) added commonest match indices uses more compact index - [Word lists](https://www.sketchengine.eu/word-lists/) - Download frequency word lists of common words in English, Spanish, French, German. Frequency word lists of nouns, verbs, adjectives and adverbs also available. - [WelshWaC corpus](https://www.sketchengine.eu/welshwac-corpus/) - The corpus is prepared by Corpus factory method by Anil in October 2013. Full details are described in ​Kilgarriff et al. at LREC 2010. - [WaC text corpora](https://www.sketchengine.eu/wac-corpora/) - Web as Corpus (WaC) corpora are available in Sketch Engine. Search concordance, n-grams, collocations. - [Vietnamese Tagset](https://www.sketchengine.eu/vietnamese-tagset/) - Vietnamese POS tagset of vnTagger is a list POS tags used to indicate grammatical categories for Vietnamese corpora in Sketch Engine. - [Video lessons](https://www.sketchengine.eu/video-lessons/) - These video lessons explain how to use Sketch Engine to build, search, analyse and manage text corpora. The videos are available from our YouTube playlist. - [Varieties of Learner English (VOLE) corpus](https://www.sketchengine.eu/vole-corpus/) - VOLE (Varieties of Learner English) is a corpus gathered in 2010 to explore different varieties of English. 
The crawl used the BootCaT process. We generated a set of triples of mid-frequency English words, and then sent each of them to a search engine seven times over, with seven different geographic constraints; for UK, US, Canada, - [Variation in hit counts](https://www.sketchengine.eu/variation-in-hit-counts/) - It often seems like you have got a different hit count for the same search when, for example, you compare hits for a concordance with hits in a frequency list. The usual reason is that the searches were not identical, and even if they were nearly identical, in a large corpus the unusual events where - [Compare corpora](https://www.sketchengine.eu/guide/compare-corpora/) - The comparison tool compares two or more corpora in the same language and computes a similarity score for each pair of corpora. - [uzWaC – Uzbek corpus](https://www.sketchengine.eu/uzwac-uzbek-corpus/) - Search uzWaC, the 18-million-word Uzbek corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Uzbek word lists and n-grams. - [User Administration](https://www.sketchengine.eu/guide/user-administration/) - User administration enables administrators to maintain user accounts in their group account, e.g. adding new accounts, deleting them, changing storage space, ... - [Thesaurus - synonyms, antonyms, similar words](https://www.sketchengine.eu/guide/thesaurus-synonyms-antonyms-similar-words/) - The automatically generated thesaurus contains synonyms, antonyms and similar words for any word in the language. - [Select a corpus](https://www.sketchengine.eu/guide/select-a-corpus/) - [Parallel concordance - searching translations](https://www.sketchengine.eu/guide/parallel-concordance-searching-translations/) - Parallel multilingual concordance is used to search documents and their translations to analyse the translation or to find translation equivalents. 
- [N-grams - multiword expressions, lexical bundles](https://www.sketchengine.eu/guide/n-grams-multiword-expressions/) - The N-gram tool uses a text corpus to generate frequency lists of multiword expressions (MWEs), lexical bundles or sequences of tokens. - [My account, password and settings](https://www.sketchengine.eu/guide/my-account-password-settings/) - Update your personal details, email address or change your password. - [Glossary](https://www.sketchengine.eu/guide/glossary/) - [Download a corpus](https://www.sketchengine.eu/guide/download-a-corpus/) - Download a text corpus in plain text or vertical file format. Upload your texts and download them with POS tags and lemmas. - [Document annotation tool](https://www.sketchengine.eu/guide/document-annotation-tool/) - The corpus annotation tool allows adding and editing metadata to documents in the corpus. - [Data format for a parallel corpus](https://www.sketchengine.eu/guide/data-format-for-a-parallel-corpus/) - [Create a subcorpus](https://www.sketchengine.eu/guide/create-a-subcorpus/) - A language corpus can be divided into smaller parts called subcorpora based on text type, date, medium source or any other metadata the corpus contains. - [Corpus dashboard](https://www.sketchengine.eu/guide/corpus-dashboard-2/) - The corpus dashboard is the main Sketch Engine screen divided into these corpus tools and history (recently used corpora). - [Concordance - a tool to search a corpus](https://www.sketchengine.eu/guide/concordance-a-tool-to-search-a-corpus/) - The concordance shows words and phrases in contexts and offers a variety of tools to analyse the result further. - [Compiling a corpus](https://www.sketchengine.eu/guide/compile-a-corpus/) - [Annotate a vertical file](https://www.sketchengine.eu/guide/annotating-tokens/) - Tokens in text corpora can be annotated at the level of tokens. See token annotation examples on Sketch Engine website. 
- [Concordance search](https://www.sketchengine.eu/user-guide/concordance-search/) - The basic search of words in corpora. It describes the search with context, in text type and usage of advanced options. - [Videos by James Thomas](https://www.sketchengine.eu/user-guide/sketch-engine-in-videos/) - Tutorial videos explaining how to use the main functions in Sketch Engine. - [Terminologists – term extraction](https://www.sketchengine.eu/user-guide/terminologists-terminology-extraction/) - Extract terms using term extraction and bilingual terminology extraction from subject-specific corpora or your own texts. - [Teachers](https://www.sketchengine.eu/user-guide/teachers/) - [Sketch Engine Video Tutorials](https://www.sketchengine.eu/user-guide/sketch-engine-video-tutorials/) - All videos are also accessible on our YouTube channel. Please note that some parts of the user interface have been updated since these tutorials were created and look slightly different now. Basics: making concordances. This video demonstrates the concordancer function in Sketch Engine. Basics: thinning your concordance. In this tutorial, you will learn how to thin down your - [User guide](https://www.sketchengine.eu/user-guide/) - Description of all features in Sketch Engine. - [Urdu part-of-speech tagset](https://www.sketchengine.eu/urdu-part-of-speech-tagset/) - Urdu POS tagset is a list of POS tags used to indicate grammatical categories for Urdu corpora in Sketch Engine. - [Turkish part-of-speech tagset](https://www.sketchengine.eu/turkish-part-of-speech-tagset/) - Turkish POS tagset is a list of POS tags used to indicate grammatical categories for Turkish corpora in Sketch Engine. - [Turkish corpus (trWaC)](https://www.sketchengine.eu/trwac-turkish-corpus/) - Search trWaC, the 32-million-word Turkish corpus of texts from the web. Texts were cleaned and deduplicated. 
- [TRMorph – Turkish part-of-speech tagset](https://www.sketchengine.eu/turkish-trmorph-part-of-speech-tagset/) - The Turkish part-of-speech tagset is available in corpora annotated by TRMorph, a free morphological analyzer for Turkish. - [tkWaC – Turkmen corpus](https://www.sketchengine.eu/tkwac-turkmen-corpus/) - Search tkWaC, the 2-million-word Turkmen corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Turkmen word lists and n-grams. - [Timestamped Arabic corpus](https://www.sketchengine.eu/timestamped-arabic-corpus/) - Search the 4.7-billion-word Timestamped Arabic corpus updated with new data daily. Carry out the diachronic analysis of words. - [The New Corpus for Ireland | Nua-Chorpas na hÉireann](https://www.sketchengine.eu/the-new-corpus-for-ireland/) - The New Corpus for Ireland – user’s guide. Welcome to the New Corpus for Ireland, a corpus created as part of the New English-Irish Dictionary project in Foras na Gaeilge. The New Corpus for Ireland is a large collection of texts in Irish with approximately 30 million words. It contains a wide range of - [ThaiWaC corpus](https://www.sketchengine.eu/thaiwac/) - The corpus was prepared using the Corpus Factory method. Full details are described in Kilgarriff et al. at LREC 2010. The corpus is tokenised using the Swath Word Segmentation tool downloadable at http://www.cs.cmu.edu/~paisarn/software.html - [Terms of Use](https://www.sketchengine.eu/terms-of-use/) - [Tatar part-of-speech tagset](https://www.sketchengine.eu/tatar-part-of-speech-tagset/) - Tatar POS tagset is a list of POS tags used to indicate grammatical categories for Tatar corpora in Sketch Engine. - [Tajik part-of-speech tagset](https://www.sketchengine.eu/tajik-part-of-speech-tagset/) - Tajik POS tagset is a list of POS tags used to indicate grammatical categories for Tajik corpora in Sketch Engine. 
- [TaiwanWaC – Chinese corpus from the web](https://www.sketchengine.eu/taiwanwac-chinese-corpus/) - Search TaiwanWaC, the 260-million-word Chinese corpus of Taiwan texts collected from the web. Texts were cleaned and deduplicated. - [Slovenian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/slovenian-part-of-speech-tagset/) - Slovenian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Spanish part-of-speech tagsets](https://www.sketchengine.eu/tagsets/spanish-part-of-speech-tagset/) - Spanish corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Portuguese part-of-speech tagset](https://www.sketchengine.eu/tagsets/portuguese-part-of-speech-tagset/) - Portuguese corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Norwegian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/norwegian-part-of-speech-tagsets/) - Norwegian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Lithuanian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/lithuanian-part-of-speech-tagset/) - Lithuanian part-of-speech tagsets indicate grammatical categories in Lithuanian text corpora in Sketch Engine. - [Hungarian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/hungarian-part-of-speech-tagset/) - Hungarian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Japanese part-of-speech tagset](https://www.sketchengine.eu/tagsets/japanese-part-of-speech-tagset/) - Japanese corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Italian part-of-speech tagset](https://www.sketchengine.eu/tagsets/italian-part-of-speech-tagset/) - Italian corpora in Sketch Engine can be POS tagged. 
A tagset is a list of part-of-speech tags (POS tags for short). - [Greek part-of-speech tagsets](https://www.sketchengine.eu/tagsets/greek-part-of-speech-tagset/) - Greek text corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [German part-of-speech tagsets](https://www.sketchengine.eu/tagsets/german-part-of-speech-tagsets/) - German corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [French part-of-speech tagset](https://www.sketchengine.eu/tagsets/french-part-of-speech-tagset/) - French corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Finnish part-of-speech tagsets](https://www.sketchengine.eu/tagsets/finnish-part-of-speech-tagset/) - Finnish corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Estonian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/estonian-part-of-speech-tagsets/) - Estonian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [English part-of-speech tagsets](https://www.sketchengine.eu/tagsets/english-part-of-speech-tagset/) - Sketch Engine provides various part-of-speech tagsets to indicate grammatical categories in English corpora. - [Dutch part-of-speech-tagsets](https://www.sketchengine.eu/tagsets/dutch-part-of-speech-tagset/) - Dutch corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Danish part-of-speech tagsets](https://www.sketchengine.eu/tagsets/danish-part-of-speech-tagset/) - See a list of Danish PoS tagsets. Danish corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). 
- [Chinese part-of-speech tagset](https://www.sketchengine.eu/tagsets/chinese-part-of-speech-tagset/) - See a list of Chinese PoS tagsets used to indicate grammatical categories for Chinese corpora in Sketch Engine. - [Bulgarian part-of-speech tagsets](https://www.sketchengine.eu/tagsets/bulgarian-part-of-speech-tagsets/) - See a list of Bulgarian part-of-speech tagsets used to indicate grammatical categories for Bulgarian corpora in Sketch Engine. - [Arabic part-of-speech tagsets](https://www.sketchengine.eu/tagsets/arabic-part-of-speech-tagset/) - Arabic corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [Tagset for Indian Languages](https://www.sketchengine.eu/tagset-indian-languages/) - Tagset for Indian languages (Bengali, Hindi, Telugu, ...) is a list of POS tags used to indicate grammatical categories in Indian corpora. - [SwedishWaC corpus](https://www.sketchengine.eu/swedishwac-corpus/) - The corpus was prepared using the Corpus Factory method. Full details are described in Kilgarriff et al. at LREC 2010. An important aim in creating this corpus was to get a corpus that was comparable to the PAROLE corpus of the Swedish Department of Gothenburg University. In order to achieve that, the corpus was gathered by - [SWATWOL - Swahili part-of-speech tagset](https://www.sketchengine.eu/swatwol-swahili-part-of-speech-tagset/) - SWATWOL Swahili POS tagset is a list of POS tags used to indicate grammatical categories for Swahili (Kiswahili) corpora in Sketch Engine. - [Susanne corpus part-of-speech tagset](https://www.sketchengine.eu/susanne-corpus-part-of-speech-tagset/) - Susanne corpus POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. 
- [Susanne corpus](https://www.sketchengine.eu/susanne-corpus/) - Search Susanne corpus, the subset of the Brown Corpus of American English annotated using the special annotation scheme which represents all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation. - [Stanford Arabic part-of-speech tagset](https://www.sketchengine.eu/stanford-arabic-part-of-speech-tagset/) - Stanford Arabic POS tagset is a list of POS tags used to indicate grammatical categories for Arabic corpora in Sketch Engine. - [Stanford Arabic parser tagset](https://www.sketchengine.eu/stanford-arabic-parser-tagset/) - Stanford Arabic parser tagset is a list of POS tags used to indicate grammatical categories for Arabic corpora in Sketch Engine. - [srWaC – Serbian corpus from the web](https://www.sketchengine.eu/srwac-serbian-corpus/) - Search srWaC, the 476-million-word Serbian corpus of texts collected from the web. Texts were cleaned and deduplicated. - [SPOOK morphosyntactic specifications for English](https://www.sketchengine.eu/spook-morphosyntactic-specifications-english/) - The English tagset of SPOOK morphosyntactic specifications developed by Tomaž Erjavec is available in the English-Montenegrin parallel corpus in Sketch Engine. - [SpanishWaC corpus](https://www.sketchengine.eu/spanishwac-corpus/) - This corpus was gathered using a list of URLs provided by Serge Sharoff at the University of Leeds using the method described here, designed to produce a general language resource. There has been little checking of the content. 
It was part-of-speech tagged and lemmatised using TreeTagger, a leading part-of-speech tagger which has been trained for - [Spanish word lists](https://www.sketchengine.eu/spanish-word-list/) - Download the list of the most frequent Spanish words, nouns, adjectives and verbs - [Spanish TreeTagger part-of-speech tagset](https://www.sketchengine.eu/spanish-treetagger-part-of-speech-tagset/) - Spanish TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for Spanish corpora in Sketch Engine. - [Spanish FreeLing part-of-speech tagset](https://www.sketchengine.eu/spanish-freeling-part-of-speech-tagset/) - Spanish FreeLing POS tagset is a list of POS tags used to indicate grammatical categories for Spanish corpora in Sketch Engine. - [Slovenski nabor oznak (verzija 3)](https://www.sketchengine.eu/slovene-tagset-sl/) - Slovenian tagset - [Slovak part-of-speech tagsets](https://www.sketchengine.eu/slovak-part-of-speech-tagsets/) - A list of Slovak part-of-speech tagsets available in Slovak corpora in Sketch Engine. A part-of-speech tagset indicates grammatical categories. - [Slovak part-of-speech tagset (Slovak National corpus)](https://www.sketchengine.eu/slovak-part-of-speech-tagset/) - Slovak POS tagset of the Slovak National corpus is a list of POS tags used to indicate grammatical categories for Slovak corpora in Sketch Engine. - [skTenTen – Slovak corpus from the web](https://www.sketchengine.eu/sktenten-slovak-corpus/) - Search the 900-million-word Slovak corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized and annotated with topics and genres. 
- [SkEW-7: 7th International Sketch Engine Workshop](https://www.sketchengine.eu/skew-7/) - Monday, 23rd May 2016, afternoon as LREC tutorial T6, Portorož, Slovenia Sponsors: Lexical Computing It follows on from successful events in Herstmonceux, UK (2015), Bolzano, Italy (2014), Tallinn, Estonia (2013), Brno, Czech Republic (2012), Brighton, United Kingdom (2011) and Ljubljana, Slovenia (2010). This year's workshop is organized as a tutorial attached to the LREC conference. For more - [SkEW-6: 6th International Sketch Engine Workshop](https://www.sketchengine.eu/6th-international-sketch-engine-workshop/) - Monday, 10th August 2015, 14.00–17.30 Herstmonceux castle, United Kingdom Sponsors: Lexical Computing Ltd. It follows on from successful events in Bolzano, Italy (2014), Tallinn, Estonia (2013), Brno, Czech Republic (2012), Brighton, United Kingdom (2011) and Ljubljana, Slovenia (2010). The workshop is immediately before eLex 2015 and collocates with the SIGWAC workshop, and on the same site. There - [Sketch Engine workshops (SkEW series)](https://www.sketchengine.eu/past-skew-international-sketch-engine-workshops/) - [Sketch Engine funded by ELEXIS: gaining access](https://www.sketchengine.eu/elexis-gaining-access/) - [Sketch Engine access funded by ELEXIS: terms](https://www.sketchengine.eu/elexis-terms/) - [Sketch Engine access funded by ELEXIS: technical requirements](https://www.sketchengine.eu/elexis-technical-requirements/) - [Sketch Engine access funded by ELEXIS: in detail](https://www.sketchengine.eu/elexis-in-detail/) - [SKELL – examples and collocations for learners of English](https://www.sketchengine.eu/skell/) - SKELL is a tool for language learners and teachers to see how phrases and words are used by real speakers. Use concordance and collocation examples or thesaurus. 
- [Shallow tagging](https://www.sketchengine.eu/shallow-tagging/) - Shallow tagging is a simple notation based on frequency properties of tokens in text corpora without part-of-speech tagsets. - [SetswanaWaC corpus](https://www.sketchengine.eu/setswanawac-corpus/) - (version 2) The corpus was prepared using the Corpus Factory method. Full details are described in Kilgarriff et al. at LREC 2010. Prepared from data provided by Thapelo Otlogetswe. - [Service Level Agreement](https://www.sketchengine.eu/service-level-agreement/) - [SemCor – sense-tagged English corpus](https://www.sketchengine.eu/semcor-annotated-corpus/) - Try SemCor, the sense-tagged English corpus extracted from Brown corpus. Semantic analysis was performed manually with WordNet senses. - [SDeWaC corpus](https://www.sketchengine.eu/sdewac-corpus/) - SDeWaC is a subset of DeWaC. The creation of sDeWaC is described in detail here. Corpus release announcement can be viewed here. We thank Janina Kopp and Niels Ott for parsing sDeWaC (TBExpanded. Details about pos-tagger, morphological analyser, dependency parser and computing resources used). Word Sketches are extracted from the Dependency Parsed sDeWaC using the - [Scripts for adding header fields](https://www.sketchengine.eu/scripts-for-adding-header-fields/) - Creating text corpus section. Here you will find scripts for adding header fields. Adding attributes is based on mapping existing structure attributes... - [Scottish Gaelic Wiki corpus](https://www.sketchengine.eu/scottish-gaelic-wiki-corpus/) - Scottish Gaelic Wikipedia corpus. Downloaded in February 2015. Processed by WikiExtractor.py, tokenised by unitok. - [ScienceBlog corpus](https://www.sketchengine.eu/scienceblog-corpus/) - Search ScienceBlogs, the 100-million-word English corpus of posts from ScienceBlogs.com (2006–2014). Texts were part-of-speech tagged and lemmatized. 
- [Russian word list](https://www.sketchengine.eu/russian-word-list/) - Download a word list of the most common and frequent Russian words, nouns, verbs and adjectives for free - [Russian Web Corpus](https://www.sketchengine.eu/russian-web-corpus/) - Russian web corpus (ruWaC) is a text corpus collected from the Internet. The corpus is part-of-speech tagged with word sketches. - [Russian part-of-speech tagset](https://www.sketchengine.eu/russian-part-of-speech-tagset/) - Russian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [ruSKELL - examples and collocations for learners of Russian](https://www.sketchengine.eu/ruskell-examples-and-collocations-for-learners-of-russian/) - Use ruSKELL, the Russian version of the SKELL interface for Russian language learning based on a corpus. Search Russian concordance, collocations, or generate thesaurus. - [Riznica: The Croatian Language corpus](https://www.sketchengine.eu/riznica-croatian-corpus/) - The Croatian language corpus (CLC) is a POS tagged collection of various texts: online articles, printed books, transcripts of recordings. - [Research Agenda](https://www.sketchengine.eu/research-agenda/) - Lexical Computing's research interests lie at the intersection of corpus and computational linguistics. - [Quranic Arabic Corpus with annotation](https://www.sketchengine.eu/quran-annotated-corpus/) - Search Quranic Arabic corpus, the Arabic corpus of texts from the Quran and the QurAna anaphoric coreference database of the Quran. - [Quick Start Guide - français](https://www.sketchengine.eu/quick-start-guide-francais/) - A brief course on using Sketch Engine, a corpus management system. It includes lessons on term extraction, concordance, etc. 
- [Word Sketch Difference lesson](https://www.sketchengine.eu/quick-start-guide/word-sketch-difference-lesson/) - Word Sketch Difference is an extension of the word sketch and generates Word Sketches for two words and compares them to show differences in use. - [Word list - lesson](https://www.sketchengine.eu/quick-start-guide/word-list-lesson/) - Download frequency word lists of all words in many languages or generate lists of the most frequent nouns, verbs, adjectives, adverbs and other parts of speech. - [Thesaurus](https://www.sketchengine.eu/quick-start-guide/thesaurus/) - Lost for words? Thesaurus! The automatically generated thesaurus in Sketch Engine can identify synonyms and similar words even for less common words. - [Parallel concordance lesson](https://www.sketchengine.eu/quick-start-guide/parallel-concordance-lesson/) - [Keywords and terms - lesson](https://www.sketchengine.eu/quick-start-guide/keywords-and-terms-lesson/) - [POS tag set for Modern Standard Arabic](https://www.sketchengine.eu/pos-tag-set-for-modern-standard-arabic/) - POS tagset for Modern Standard Arabic is a list of POS tags used to indicate grammatical categories for Arabic corpora in Sketch Engine. - [Portuguese VISL part-of-speech tagset](https://www.sketchengine.eu/portuguese-tagset/) - The Portuguese VISL symbol set is a list of POS tags used to indicate grammatical categories for Portuguese corpora in Sketch Engine. - [Portuguese FreeLing part-of-speech tagset](https://www.sketchengine.eu/portuguese-freeling-part-of-speech-tagset/) - Portuguese FreeLing POS tagset is a list of POS tags used to indicate grammatical categories for Portuguese corpora in Sketch Engine. - [Portuguese corpus](https://www.sketchengine.eu/portuguese-corpus/) - [Polish-Swahili Bible parallel corpora](https://www.sketchengine.eu/polish-swahili-bible-parallel-corpora/) - Polish-Swahili Bible corpus is a parallel text corpus collected from the Polish and Swahili Bible during 2010. 
- [Polish Web Corpus (PolishWaC)](https://www.sketchengine.eu/polish-web-corpus/) - Polish web corpus with 103 million words. It is based on queries to Google with the most frequent Polish words. Tagged by Morfeusz and TaKIPI. - [plTenTen – Polish corpus from the web](https://www.sketchengine.eu/pltenten-polish-corpus/) - Search plTenTen, the 4-billion-word Polish corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [Persian part-of-speech tagset](https://www.sketchengine.eu/persian-part-of-speech-tagset/) - Persian TreeBank POS tagset is a list of POS tags used to indicate grammatical categories for Persian corpora in Sketch Engine. - [Patakis corpus](https://www.sketchengine.eu/patakis-corpus/) - Patakis is a 100-million-word Greek corpus of POS-tagged texts mostly downloaded from the Internet. - [Parallel Corpora Registry Info](https://www.sketchengine.eu/parallel-corpora-registry-info/) - General Attribute Set ATTRIBUTE word STRUCTURE s{ ATTRIBUTE id } STRUCTURE p STRUCTURE chapter{ ATTRIBUTE id } STRUCTURE speaker{ ATTRIBUTE id ATTRIBUTE name ATTRIBUTE language ATTRIBUTE affiliation } STRUCTURE doc{ ATTRIBUTE stype ATTRIBUTE type } STRUCTURE g { DISPLAYTAG 0 DISPLAYBEGIN "_EMPTY_" } STRUCTURE align English, French, German, Spanish, Italian, Dutch, Portuguese ATTRIBUTE tag ATTRIBUTE - [Oxford English part-of-speech Tagset](https://www.sketchengine.eu/oxford-english-corpus-tagset/) - Tagset used in the Oxford English Corpus. All part-of-speech tags used in the corpus in Sketch Engine are listed here. - [Oromo part-of-speech tagset](https://www.sketchengine.eu/oromo-part-of-speech-tagset/) - [Norwegian Oslo-Bergen part-of-speech tagset](https://www.sketchengine.eu/norwegian-oslo-bergen-part-of-speech-tagset/) - Norwegian Oslo-Bergen POS tagset is a list of POS tags used to indicate grammatical categories for Norwegian corpora in Sketch Engine. 
- [Norwegian Nynorsk part-of-speech tagset](https://www.sketchengine.eu/norwegian-nynorsk-part-of-speech-tagset/) - Norwegian Nynorsk POS tagset is a list of POS tags used to indicate grammatical categories for Norwegian corpora in Sketch Engine. - [Nineteenthcentury corpus](https://www.sketchengine.eu/nineteenthcentury-corpus/) - The 19th century corpus is only available to Osnabrück University users. - [New Model Corpus](https://www.sketchengine.eu/new-model-corpus/) - The New Model Corpus is a ~100-million-word domain corpus built from web data in 2008. For more information, see the attachments (below). Text types. Genres (genre: # documents): blog 13,957; news 12,388; general 10,216; business 1,433; speech (subtitles) 1,088; medical 516; law 451; fiction 123. Web top level domains (TLD: # documents): com - [NepaliWaC corpus](https://www.sketchengine.eu/nepaliwac-corpus/) - Nepali web corpus downloaded by LCL on Dec 10, 2014. ~1200 docs obtained using the Corpus Factory method; ~110 docs obtained using wget from a newspaper website. A very small corpus just for a base word list and a test of possibilities for obtaining Nepalese. - [Nepali part-of-speech tagset](https://www.sketchengine.eu/nepali-tagset/) - Nepali Nelralec POS tagset is a list of POS tags used to indicate grammatical categories for Nepali corpora in Sketch Engine. - [Nepali National Corpus](https://www.sketchengine.eu/nepali-national-corpus/) - Search the National Nepali corpus, the 13-million-word corpus of texts from books, newspapers, websites etc. The corpus is part-of-speech tagged and lemmatized. - [N'ko corpus](https://www.sketchengine.eu/nko-corpus/) - The N'ko corpus is a text corpus provided by - [MULTEXT-East Slovenian part-of-speech tagset (version 3)](https://www.sketchengine.eu/slovene-tagset-multext-east-v3/) - MULTEXT-East Morphosyntactic Slovenian specification v3 is a list of POS tags for Slovene corpora in Sketch Engine. 
- [MULTEXT-EAST Serbo-Croatian part-of-speech tagset](https://www.sketchengine.eu/multext-east-serbo-croatian-part-of-speech-tagset/) - MULTEXT-East Serbo-Croatian part-of-speech tagset used in Bosnian corpora. - [MULTEXT-East Serbian part-of-speech tagset (version 5)](https://www.sketchengine.eu/multext-east-serbian-part-of-speech-tagset/) - MULTEXT-East Morphosyntactic Serbian specification is a list of POS tags for Serbian corpora in Sketch Engine. - [MULTEXT-East Romanian part-of-speech tagset](https://www.sketchengine.eu/romanian-tagset/) - MULTEXT-East Morphosyntactic Romanian specification is a list of POS tags for Romanian text corpora in Sketch Engine. - [MULTEXT-East Bosnian part-of-speech tagset](https://www.sketchengine.eu/multext-east-bosnian-part-of-speech-tagset/) - MULTEXT-East Morphosyntactic Bosnian specification is a list of POS tags for Bosnian corpora in Sketch Engine. - [mtWaC – Maltese corpus](https://www.sketchengine.eu/mtwac-maltese-corpus/) - Search mtWaC, the 139-million-word Maltese corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances and collocations or generate Maltese word lists and n-grams. - [Mongolian web corpus](https://www.sketchengine.eu/mongolian-wac-corpus/) - Mongolian web corpus is a 6-million-word corpus in Sketch Engine which was created in March 2016. - [Morphological annotation Quranic Arabic corpus](https://www.sketchengine.eu/morphological-annotation-quranic-arabic-corpus/) - Morphological annotation in Quranic Arabic corpus is a list of POS tags used to indicate grammatical categories for Arabic corpora in Sketch Engine. - [mlWaC – Malayalam corpus](https://www.sketchengine.eu/mlwac-malayalam-corpus/) - Search mlWaC, the 16-million-word Malayalam corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Malayalam word lists and n-grams. 
- [miTenTen – Māori corpus from the web](https://www.sketchengine.eu/mitenten-maori-corpus/) - Search miTenTen, the 11-million-word Māori corpus of texts from the web. Make Māori concordances, generate Māori word lists or Māori n-grams. - [MGNN Tagalog part of speech tagset](https://www.sketchengine.eu/mgnn-tagalog-part-of-speech-tagset/) - See a list of part-of-speech tags used to indicate grammatical categories in Tagalog corpora in Sketch Engine. - [Mapping file for English CLAWS tagset, version 8 to version 7](https://www.sketchengine.eu/english-mapping-claws-part-of-speech-tagset/) - Mapping file for English CLAWS POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [Malaysian and Indonesian tagset](https://www.sketchengine.eu/malaysian-indonesian-tagset/) - Malay and Indonesian corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). - [loTenTen – Lao corpus from the web](https://www.sketchengine.eu/lotenten-lao-corpus/) - Search loTenTen, the 105-million-word Lao corpus of texts from the web. Texts were cleaned, deduplicated and part-of-speech tagged. - [Lithuanian web corpus](https://www.sketchengine.eu/lithuanian-wac/) - The LithuanianWaC is a text corpus with part-of-speech tagging available in Sketch Engine. Generate collocations, n-grams, concordances. - [Lithuanian part-of-speech tagset for MATAS](https://www.sketchengine.eu/matas-lithuanian-part-of-speech-tagset/) - Lithuanian annotated corpus was manually annotated and uses the part-of-speech tagset mentioned on this page. - [Lithuanian part-of-speech tagset for LAC](https://www.sketchengine.eu/lac-lithuanian-part-of-speech-tagset/) - Lithuanian part-of-speech tagset is a list of morphological tags indicating grammatical categories in Lithuanian text corpora. 
- [Liste de mots - leçon](https://www.sketchengine.eu/liste-de-mots-lecon/) - Download word frequency lists in many languages or generate lists of the most frequent nouns, verbs and other parts of speech. - [LEXMCI](https://www.sketchengine.eu/lexmci/) - The 1.7 billion word LEXMCI corpus of English was created by the Lexicography MasterClass in 2008 as a source of lexicographic information for the lexicographers compiling the Dante database. Bibliography (about Dante) Kilgarriff, Adam (2010). DANTE: A Detailed, Accurate, Extensive, Available English Lexical Database in Proceedings of a meeting of the North American Association for Computational - [Lexicom 2015: 15th Workshop in Lexicography](https://www.sketchengine.eu/lexicom-2015-15th-workshop-in-lexicography/) - Workshop in Lexicography, Corpus Linguistics, and Lexical Computing Telč, Czech Republic, June 8th–12th 2015 Telč Lexicom is a five-day intensive workshop in lexicography, corpus linguistics and lexical computing, created by the Lexicography MasterClass. Seminars on theoretical issues alternate with hands-on work at the computer. Working in small groups or individually, you will learn how to create corpora, analyse - [Leçon sur les mots-clés et les expressions](https://www.sketchengine.eu/lecon-sur-les-mots-cles-et-les-expressions/) - Extract keywords and terms from your texts. They can be used to define or identify the main topic of the corpus. For extraction, Sketch Engine combines statistical calculations with linguistic criteria. - [Leçon sur la différence de profils lexicaux](https://www.sketchengine.eu/lecon-sur-la-difference-de-profils-lexicaux/) - Compare two words to observe their differences in usage. 
Convient à des - [Leçon sur le concordancier](https://www.sketchengine.eu/lecon-sur-le-concordancier/) - Apprenez comment trouver des exemples d'un mot, d'un lemme, d'une expression ou d'une étiquette ou même d'une structure grammaticale ou lexicale complexe. - [Leçon sur le concordancier parallèle](https://www.sketchengine.eu/lecon-sur-le-concordancier-parallele/) - Travailler avec des textes bi- ou multilingues (corpus parallèles) pour chercher un mot ou une expression et pour voir la traduction en contexte. Faire une requête parallèle pour obtenir une concordance parallèle. - [Latvian part-of-speech tagset](https://www.sketchengine.eu/latvian-part-of-speech-tagset/) - Latvian part-of-speech tagset is a list POS tags for Latvian corpora in Sketch Engine. - [Latin part-of-speech tagset](https://www.sketchengine.eu/latin-part-of-speech-tagset/) - Latin Lamap POS tagset is a list POS tags used to indicate grammatical categories for Latin corpora in Sketch Engine. - [Lao part-of-speech tagset](https://www.sketchengine.eu/lao-part-of-speech-tagset/) - Lao POS tagset is a list of part-of-speech tags used to indicate grammatical categories for Lao corpora in Sketch Engine. - [kyWaC – Kyrgyz corpus](https://www.sketchengine.eu/kywac-kyrgyz-corpus/) - Search kyWaC, the 19-million-word Kyrgyz corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Kirghiz word lists and n-grams. - [koTenTen – Korean corpus from the web](https://www.sketchengine.eu/kotenten-korean-corpus/) - Search koTenTen, the 460-million-word Korean corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [Korean part-of-speech tagsets](https://www.sketchengine.eu/korean-part-of-speech-tagset/) - Korean corpora in Sketch Engine can be POS tagged. A tagset is a list of part-of-speech tags (POS tags for short). 
- [Korean HanNanum part-of-speech tagset simplified](https://www.sketchengine.eu/hannanum-korean-tagset-simplified/) - Korean text corpora in Sketch Engine contain part-of-speech tagging, which makes it possible to search for collocations and n-grams. - [kkWaC – Kazakh corpus](https://www.sketchengine.eu/kkwac-kazakh-corpus/) - Search kkWaC, the 139-million-word Kazakh corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Kazakh word lists and n-grams. - [Institutions with ELEXIS-funded access to Sketch Engine](https://www.sketchengine.eu/list-of-elexis-universities-and-institutions/) - An up-to-date list of institutions with free access to Sketch Engine funded by the ELEXIS project. - [Historical English Penn TreeBank tagset](https://www.sketchengine.eu/historical-penn-treebank-tagset/) - English TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [hebWaC – Hebrew corpus from the web](https://www.sketchengine.eu/hebwac-hebrew-corpus/) - Search hebWaC, the 47-million-word Hebrew corpus of texts collected from the web. Texts were cleaned and deduplicated. This Hebrew corpus consists of newspaper pages, blog posts, etc. - [HebrewGC – Hebrew General Corpus](https://www.sketchengine.eu/hebrewgc-hebrew-general-corpus/) - Search HebrewGC, the 157-million-word Hebrew General Corpus, which consists mostly of newspaper texts. Texts were not duplicated. The corpus was created at the Hebrew University in Jerusalem. - [Hebrew web corpora](https://www.sketchengine.eu/hebrew-web-corpora/) - The Hebrew web corpus was collected from the Internet and includes mostly newspaper material. It has more than 150 million words. - [Hebrew part-of-speech tagset by Meni Adler’s tool](https://www.sketchengine.eu/hebrew-meni-adler-part-of-speech-tagset/) - This Hebrew part-of-speech tagset is the output of the Hebrew tagger developed by Meni Adler. 
- [Gujarati web corpus (guWaC)](https://www.sketchengine.eu/gujarathiwac-corpus/) - guWaC is a web corpus of Gujarati, an Indo-Aryan language belonging to the Indo-European language family, crawled in 2013. It contains almost 18 million words and is encoded in UTF-8 without tagging. - [Gujarati Web corpus](https://www.sketchengine.eu/gujarati-web-corpus/) - Search guWaC, the 18-million-word Gujarati corpus of texts from the web. Texts were cleaned and deduplicated. - [Greek NeuroLingo part-of-speech tagset](https://www.sketchengine.eu/neurolingo-tagset-summary/) - Greek NeuroLingo POS tagset is a list of POS tags used to indicate grammatical categories for Greek corpora in Sketch Engine. - [Greek INTERA part-of-speech tagset](https://www.sketchengine.eu/greek-intera-part-of-speech-tagset/) - Greek INTERA POS tagset is a list of part-of-speech tags used to indicate grammatical categories for Greek corpora in Sketch Engine. - [German word list](https://www.sketchengine.eu/german-word-list/) - Download a word list of the most common and frequent German words, nouns, verbs and adjectives. - [German STTS part-of-speech tagset](https://www.sketchengine.eu/german-stts-part-of-speech-tagset/) - German STTS POS tagset is a list of POS tags used to indicate grammatical categories for German corpora in Sketch Engine. - [Georgian Wikipedia corpus](https://www.sketchengine.eu/georgian-wikipedia-corpus/) - Search Georgian Wikipedia corpus with Sketch Engine. Generate frequency word lists, n-grams, or work with concordances. - [General instructions on corpus data directory structure](https://www.sketchengine.eu/general-instructions-on-corpus-data-directory-structure/) - The aim of these instructions is to ensure that, for every corpus, it is obvious what its data sources are, what its configuration is and what the related compiled indices are, altogether guaranteeing integrity and reproducibility of the results achieved. 
The preferred directory hierarchy is as follows: one directory containing three subdirectories for corpus vertical files, - [Fryske Akademy Parallel Corpus](https://www.sketchengine.eu/fryske-akademy-parallel-corpus/) - Frisian and Dutch not POS tagged aligned sentences Dutch TreeTagger and WS grammar used for Frisian - [FinnTreeBank2 part-of-speech tagset](https://www.sketchengine.eu/finntreebank/) - FinnTreeBank POS tagset is a list of POS tags used to indicate grammatical categories for Finnish corpora in Sketch Engine. - [FAQ: Negative verb forms](https://www.sketchengine.eu/faq-negative-verb-forms/) - Negative verb forms are split during pre-processing of corpora in Sketch Engine. For this reason, search for their parts separately, e.g. "cannot" as "can not". - [Fair Use Policy](https://www.sketchengine.eu/fair-use-policy/) - [Estonian RSS Feed Corpus](https://www.sketchengine.eu/estonian-rss-feed-corpus/) - [Estonian Filosoft part-of-speech tagset](https://www.sketchengine.eu/estonian-filosoft-part-of-speech-tagset/) - See the Estonian Filosoft POS tagset, the list of part-of-speech tags used to indicate grammatical categories for Estonian corpora in Sketch Engine. - [English corpus from Wikipedia](https://www.sketchengine.eu/english-wikipedia-corpus/) - Search the Wikipedia corpus, the 1.3-billion-word English corpus built from the whole content of English Wikipedia using the Wikipedia dump. - [English CLAWS part-of-speech tagset, version 5](https://www.sketchengine.eu/english-claws5-part-of-speech-tagset/) - English CLAWS 5 POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [English CLAWS part-of-speech tagset, version 2](https://www.sketchengine.eu/english-claws2-part-of-speech-tagset/) - English CLAWS 2 POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. 
- [e-flux corpus](https://www.sketchengine.eu/eflux-corpus/) - The e-flux corpus is a web corpus of English art news digests. The corpus consists of 9538 art announcements released from March 1998 to May 2012 collected from e-flux. It was collected in collaboration with David Levine and Alexander Provan, as the basis of their research article, 'International Art English' presented in the online journal Triple - [Dynamic Functions](https://www.sketchengine.eu/dynamic-functions/) - Please read first about what dynamic attributes are and how they are set up in the corpus configuration file documentation. Internal dynamic functions The following table gives an overview of existing built-in dynamic functions together with examples of usage: - striplastn (str, n) - returns str stripped of its last n characters - lowercase (str, locale) - - [Dutch Web Corpus](https://www.sketchengine.eu/dutch-web-corpus/) - This corpus was created within the Corpus Factory project as described in the paper below. Bibliography A corpus factory for many languages. Adam Kilgarriff, Siva Reddy, Jan Pomikálek, Avinesh PVS. In The Seventh International Conference on Language Resources and Evaluation (LREC), Malta, May 2010. - [Dutch Trends corpus](https://www.sketchengine.eu/dutch-trends-corpus/) - Search the Dutch Trends corpus, a Dutch monitor corpus of news articles collected from their RSS feeds. The corpus is updated daily with 2 million words. - [Dutch ANW part-of-speech tagset](https://www.sketchengine.eu/dutch-anw-part-of-speech-tagset/) - Dutch ANW POS tagset is a list of POS tags used to indicate grammatical categories for Dutch corpora in Sketch Engine. - [Domain Specific Corpora](https://www.sketchengine.eu/domain-specific-corpora/) - These corpora are prepared from specific domains, e.g. science, art etc. Thanks to that, you can study the specifics of a certain domain. Domain-specific corpora are built using WebBootCat and the Dante lexical database. 
List of corpora: CAJA (academic journal articles) COMPAS (newspaper dailies related to immigration) Environment (restricted access) Medical Web Corpus (medical) ScienceBlog (science) TECU (geodetics, development) e-flux (art) - [Virtual Corpus](https://www.sketchengine.eu/documentation/virtual-corpus/) - [Sketch Engine changelog - FinLib](https://www.sketchengine.eu/documentation/finlib-changelog/) - Sketch Engine is a powerful tool for searching and building text corpora. It offers hundreds of corpora in 100+ languages. - [Sketch Engine API for IntelliWebSearch](https://www.sketchengine.eu/documentation/ske-api-for-intelliwebsearch/) - Sketch Engine is a corpus manager tool offering many corpus linguistics tools. From concordancing, various context statistics, distributional thesaurus, word sketches to keyword and term extraction, building your own corpora, ... IntelliWebSearch is a Windows tool which copies highlighted text from a Windows application (MS Word, CAT systems, etc.) and puts it into a search - [Renaming Sketch Grammar relations](https://www.sketchengine.eu/documentation/renaming-sketch-grammar-relations/) - CD to directory which contains the compiled corpus files. cd `corpinfo -p mycorpus` Create a file with a list of the names of the sketch grammar relations (each name on a separate line). tr '#post_excerpt' '\n' < `corpinfo -g WSBASE mycorpus`.lex > /tmp/mycorpus_sgrel_names Edit the names of relations in the /tmp/mycorpus_sgrel_names file. 
Do not change - [Preparing Corpus Text](https://www.sketchengine.eu/documentation/preparing-corpus-text/) - [Methods and attributes in Sketch Engine API](https://www.sketchengine.eu/documentation/methods-documentation/) - This is a list of all methods and attributes used in Sketch Engine. The output of these methods is on the JSON API documentation page. - [MAPTO directive](https://www.sketchengine.eu/documentation/mapto-directive/) - Adding MAPTO directives MAPTO directive in attribute definitions serves for defining a map between the attribute and another attribute. It is computed from vertical data using tool mknormattr. First you need to define which attribute should be mapped to what. This is done in the corpus configuration file. It is allowed to use multiple values - [m:n mapping with parallel corpora](https://www.sketchengine.eu/documentation/mn-mapping-helper-scripts/) - [Full Administration](https://www.sketchengine.eu/documentation/full-administration/) - Full administration is available only for local installations with user management system (UMS). It allows managing users and corpora. - [JSON API - creating query](https://www.sketchengine.eu/documentation/json-api-query/) - A description of how to create a query via the JSON API. - [CQL - thesaurus](https://www.sketchengine.eu/documentation/cql-thesaurus/) - [CQL - search structures](https://www.sketchengine.eu/documentation/cql-search-structures/) - Corpus Query Language (CQL) enables users to search words within specific structures such as paragraphs, documents, etc. 
- [CQL - meet & union](https://www.sketchengine.eu/documentation/cql-meet-union/) - [CQL & Word Sketches](https://www.sketchengine.eu/documentation/cql-word-sketches/) - [CQL - Corpus Query Language](https://www.sketchengine.eu/documentation/corpus-querying/) - [CQL - basics](https://www.sketchengine.eu/documentation/cql-basics/) - [Corpus Configuration File: Overview](https://www.sketchengine.eu/documentation/the-corpus-configuration-file/) - Corpus Configuration File Overview describes which structures and attributes are contained in the corpus configuration file. - [Corpus configuration example](https://www.sketchengine.eu/documentation/corpus-configuration-example/) - Corpus configuration example describes how to configure a simple vertical file in Sketch Engine. - [Corpcheck](https://www.sketchengine.eu/documentation/corpcheck/) - [Common corpus structures](https://www.sketchengine.eu/documentation/common-corpus-structures/) - Description of common corpus structures in Sketch Engine. - [Clustering](https://www.sketchengine.eu/documentation/clustering-neighbours-documentation/) - Clustering can be performed in Sketch Engine on the similar words in a Thesaurus and on the collocates in a Word sketch. If the clustering option is selected, the similar words from the thesaurus are clustered according to their distributional similarity scores. The distributional similarity score is provided in section 3 of our documentation Statistics used - [API Documentation](https://www.sketchengine.eu/documentation/api-documentation/) - Instructions on how to work with the Sketch Engine JSON API and a description of all available methods and attributes. - [Average Reduced Frequency](https://www.sketchengine.eu/documentation/average-reduced-frequency/) - Average Reduced Frequency (ARF) is a variant on a frequency list that 'discounts' multiple occurrences of a word that occur close to each other. 
- [Dissertation topics](https://www.sketchengine.eu/dissertation-topics/) - We invite all PhD students interested in natural language processing to cooperate with the Sketch Engine team in discovering new ways of processing lexical data. - [daTenTen – Danish corpus from the web](https://www.sketchengine.eu/datenten-danish-corpus/) - Search daTenTen, the 3.4-billion-word Danish corpus of texts from the web. Texts were cleaned, part-of-speech tagged and lemmatized. - [danishWaC corpus](https://www.sketchengine.eu/danishwac-corpus/) - The corpus was prepared using the Corpus Factory method. It has 288 million words, is encoded in UTF-8 and is not tagged yet. - [Danish Gigaword Corpus](https://www.sketchengine.eu/danish-gigaword-corpus/) - Search DAGW, the 964-million-word Danish corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [Danish ePOS part-of-speech tagset](https://www.sketchengine.eu/danish-epos-part-of-speech-tagset/) - Danish ePOS tagset is a list of part-of-speech tags used to indicate grammatical categories for Danish corpora. - [Danish CST’s TaggerXML part-of-speech tagset](https://www.sketchengine.eu/danish-taggerxml-part-of-speech-tagset/) - Danish TaggerXML tagset is a list of POS tags used to indicate grammatical categories for Danish corpora in Sketch Engine. - [czes corpus](https://www.sketchengine.eu/czes-corpus/) - CZES is a Czech corpus consisting of newspaper articles and magazine articles from the years 1995–1998 and 2002. The data was downloaded from trafika.cz and newspapers' home sites: Lidové noviny, Mladá fronta, Českomoravský profit, Právo and others. Some data (articles, books) was taken from many small websites (students' work). 
Another part was obtained from CD archives - [N'Ko text corpora](https://www.sketchengine.eu/corpora-and-languages/nko-text-corpora/) - [Nepali text corpora](https://www.sketchengine.eu/corpora-and-languages/nepali-text-corpora/) - [Ndebele text corpora](https://www.sketchengine.eu/corpora-and-languages/ndebele-text-corpora/) - [Montenegrin text corpora](https://www.sketchengine.eu/corpora-and-languages/montenegrin-text-corpora/) - [Mongolian text corpora](https://www.sketchengine.eu/corpora-and-languages/mongolian-text-corpora/) - [Mirning text corpora](https://www.sketchengine.eu/corpora-and-languages/mirning-text-corpora/) - [Marlpa text corpora](https://www.sketchengine.eu/corpora-and-languages/marlpa-text-corpora/) - [Marathi text corpora](https://www.sketchengine.eu/corpora-and-languages/marathi-text-corpora/) - [Maori text corpora](https://www.sketchengine.eu/corpora-and-languages/maori-text-corpora/) - [Manyjiljar text corpora](https://www.sketchengine.eu/corpora-and-languages/manyjiljar-text-corpora/) - [Mankulatjarra text corpora](https://www.sketchengine.eu/corpora-and-languages/mankulatjarra-text-corpora/) - [Maltese text corpora](https://www.sketchengine.eu/corpora-and-languages/maltese-text-corpora/) - [Maldivian text corpora](https://www.sketchengine.eu/corpora-and-languages/maldivian-text-corpora/) - [Malayalam text corpora](https://www.sketchengine.eu/corpora-and-languages/malayalam-text-corpora/) - [Malay text corpora](https://www.sketchengine.eu/corpora-and-languages/malay-text-corpora/) - [Macedonian text corpora](https://www.sketchengine.eu/corpora-and-languages/macedonian-text-corpora/) - [Lithuanian text corpora](https://www.sketchengine.eu/corpora-and-languages/lithuanian-text-corpora/) - [Limburgish text corpora](https://www.sketchengine.eu/corpora-and-languages/limburgish-text-corpora/) - [Latvian text corpora](https://www.sketchengine.eu/corpora-and-languages/latvian-text-corpora/) - [Sanskrit part-of-speech 
tagset](https://www.sketchengine.eu/sanskrit-part-of-speech-tagset/) - [Jozef Stefan Institute Timestamped web Corpus](https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/) - Search Timestamped corpora, the time-annotated corpora with daily updated texts collected by Jozef Stefan Institute Newsfeed. Search billions of words in 18 languages. - [Japanese MeCab part-of-speech tagset](https://www.sketchengine.eu/tagset-jp-mecab/) - Japanese text corpora in Sketch Engine contain part-of-speech tagging, which makes it possible to search for collocations and n-grams. - [Japanese corpus (jpWaC)](https://www.sketchengine.eu/jpwac-japanese-corpus/) - Search jpWaC, the 330-million-word Japanese corpus of texts from the Japanese national domain. Texts were cleaned and deduplicated. - [Japanese ChaSen part-of-speech tagset - English translation](https://www.sketchengine.eu/japanese-tagset/) - Japanese ChaSen POS tagset is a list of POS tags used to indicate grammatical categories for Japanese corpora in Sketch Engine. - [itTenTen – Italian corpus from the web](https://www.sketchengine.eu/ittenten-italian-corpus/) - Search itTenTen, the 5-billion-word Italian corpus of texts from the web. Texts were cleaned and deduplicated. - [Italian word list](https://www.sketchengine.eu/italian-word-list/) - Download a word list of the most common and frequent Italian words, nouns, verbs and adjectives. - [Italian corpus (itWaC)](https://www.sketchengine.eu/itwac-italian-corpus/) - Search itWaC, the 1.5-billion-word Italian corpus of texts from the Italian national domain. Texts were cleaned and deduplicated. - [Irish (Gaeilge) part-of-speech tagset](https://www.sketchengine.eu/gaeilge-tagset/) - This is a full description of the Irish part-of-speech (morpho-syntactical) tagset used in the New Corpus for Ireland. 
- [Igbo part-of-speech tagset (IgboNLP)](https://www.sketchengine.eu/igbo-part-of-speech-tagset/) - See the Igbo POS tagset, the list of part-of-speech tags used to indicate grammatical categories for Igbo corpora in Sketch Engine. - [Icelandic sample corpus](https://www.sketchengine.eu/icelandic-sample-corpus/) - This is a small corpus of Icelandic texts prepared for the Sketch Engine as a sample. Kindly provided by Þórdís Úlfarsdóttir. The size of the corpus is 9 million tokens. The texts included are of three different genres. The first one is fiction (novels), published in the year 2000 or later. The second category is - [HunMorph – Hungarian part-of-speech tagset](https://www.sketchengine.eu/hunmorph-part-of-speech-tagset/) - Search Hungarian grammatical categories using the HunMorph code system which is a Hungarian POS tagset, the list of POS tags used to indicate grammatical categories for Hungarian corpora in Sketch Engine. - [Hungarian MSD part-of-speech tagset](https://www.sketchengine.eu/hungarian-msd-code/) - Search grammatical categories with the MSD code system which is a Hungarian POS tagset, the list of POS tags used to indicate grammatical categories for Hungarian corpora in Sketch Engine. - [Hungarian emMorph-based part-of-speech tagset](https://www.sketchengine.eu/hungarian-emmorph-based-part-of-speech-tagset/) - Search Hungarian grammatical categories using the emMorph code system which is a Hungarian POS tagset, the list of POS tags used to indicate grammatical categories for Hungarian corpora in Sketch Engine. - [frWaC - French corpus from the .fr domain](https://www.sketchengine.eu/frwac-french-corpus/) - Search the 1.3-billion-word French frWaC corpus of texts from the French .fr domain. Texts were cleaned and deduplicated. Create an account now! 
- [French word list](https://www.sketchengine.eu/french-word-list/) - Download a word list of the most common and frequent French words, nouns, verbs and adjectives. - [French TreeTagger part-of-speech tagset](https://www.sketchengine.eu/french-treetagger-part-of-speech-tagset/) - French TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for French corpora in Sketch Engine. - [French FreeLing part-of-speech tagset](https://www.sketchengine.eu/french-freeling-part-of-speech-tagset/) - French FreeLing POS tagset is a list of POS tags used to indicate grammatical categories for French corpora in Sketch Engine. - [Frantext: French literature of the 18th–20th century](https://www.sketchengine.eu/frantext-corpus-of-french-literature/) - Search Frantext, the 15-million-word corpus of French literature of the 18th–20th century. The corpus has part-of-speech tagging and lemmatization. - [Finnish TreeTagger part-of-speech tagset](https://www.sketchengine.eu/finnish-treetagger-part-of-speech-tagset/) - Finnish TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for Finnish corpora in Sketch Engine. - [FidaPLUS corpus of Slovenian](https://www.sketchengine.eu/fida-plus-corpus/) - FidaPLUS is a 600-million-word text corpus of the years 1990–2006. POS tagged texts are categorized according to variety and types. - [Feed Corpus Project](https://www.sketchengine.eu/feed-corpus-project/) - FCP corpus aims to be a million-word-per-day collection of POS-tagged texts downloaded from the Internet by following a set of RSS feeds. Maintained by Milos Husak of Masaryk University, Brno, for Lexical Computing Ltd. The first version was obtained by FCP software developed by Jakub Zgolinski. Now it is being downloaded using - [Estonian National Corpus](https://www.sketchengine.eu/estonian-national-corpus/) - The Estonian National Corpus is a morphologically annotated Estonian corpus. 
It contains written texts from the web, Estonian Wikipedia, DOAJ corpus. - [Error corpus from English Wikipedia](https://www.sketchengine.eu/error-corpus-from-english-wikipedia/) - Search the Error corpus, the learner corpus made up of texts collected from English Wikipedia. Study common mistakes in English texts, such as misspelling, lexico-semantic, typographical etc. - [English word lists](https://www.sketchengine.eu/english-word-list/) - Download a word list of the most common and frequent English words, nouns, verbs and adjectives. - [English Modified Penn Treebank part-of-speech tagset](https://www.sketchengine.eu/modified-penn-treebank-tagset/) - English modified Penn Treebank POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [Czech part-of-speech tagset](https://www.sketchengine.eu/tagset-reference-for-czech/) - See a list of part-of-speech tags used to indicate grammatical categories for Czech corpora in Sketch Engine. - [Cundeelee Wangka Stories](https://www.sketchengine.eu/cundeelee-wangka-stories/) - Cundeelee Wangka Stories corpus is a parallel corpus of Cundeelee Wangka (aboriginal language) and English. The corpus is part-of-speech tagged. - [Cundeelee Wangka POS tagset](https://www.sketchengine.eu/cundeelee-wangka-pos-tagset/) - Cundeelee Wangka POS tagset is a set of part-of-speech tags available in Cundeelee Wangka corpora annotated by linguists from Goldfields Aboriginal Language Centre in Kalgoorlie, Australia. - [Corpus of Academic Journal Articles (CAJA)](https://www.sketchengine.eu/corpus-of-academic-journal-articles-caja/) - Search CAJA corpus, the English balanced corpus of Academic Journal Articles created by Iztok Kosem in 2010. The corpus includes 13 thousand articles from 28 fields. - [Corpus Factory Method](https://www.sketchengine.eu/corpus-factory-method/) - A method for developing large general language corpora which can be applied to many languages. 
- [Xhosa text corpora](https://www.sketchengine.eu/corpora-and-languages/xhosa-text-corpora/) - Xhosa is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tibetan text corpora](https://www.sketchengine.eu/corpora-and-languages/tibetan-text-corpora/) - Tibetan is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Nyakinyaki text corpora](https://www.sketchengine.eu/corpora-and-languages/nyakinyaki-text-corpora/) - [Norwegian Nynorsk text corpora](https://www.sketchengine.eu/corpora-and-languages/norwegian-nynorsk-text-corpora/) - [Norwegian Bokmål text corpora](https://www.sketchengine.eu/corpora-and-languages/norwegian-bokmal-text-corpora/) - [Norwegian (Mixed) text corpora](https://www.sketchengine.eu/corpora-and-languages/norwegian-mixed-text-corpora/) - [Northern Sotho](https://www.sketchengine.eu/corpora-and-languages/northern-sotho/) - [Nganta text corpora](https://www.sketchengine.eu/corpora-and-languages/nganta-text-corpora/) - [Ngalia text corpora](https://www.sketchengine.eu/corpora-and-languages/ngalia-text-corpora/) - [Ngaju text corpora](https://www.sketchengine.eu/corpora-and-languages/ngaju-text-corpora/) - [Ngaanyatjarra text corpora](https://www.sketchengine.eu/corpora-and-languages/ngaanyatjarra-text-corpora/) - [Newspeak text corpora](https://www.sketchengine.eu/corpora-and-languages/newspeak-text-corpora/) - [Latin text corpora](https://www.sketchengine.eu/corpora-and-languages/latin-text-corpora/) - [Lao text corpora](https://www.sketchengine.eu/corpora-and-languages/lao-text-corpora/) - [Kyrgyz text corpora](https://www.sketchengine.eu/corpora-and-languages/kyrgyz-text-corpora/) - [Kuwarra text corpora](https://www.sketchengine.eu/corpora-and-languages/kuwarra-text-corpora/) - [Kurdish (Sorani) text corpora](https://www.sketchengine.eu/corpora-and-languages/kurdish-sorani-text-corpora/) - [Kurdish 
(Kurmanji) text corpora](https://www.sketchengine.eu/corpora-and-languages/kurdish-kurmanji-text-corpora/) - [Korean text corpora](https://www.sketchengine.eu/corpora-and-languages/korean-text-corpora/) - [Khmer text corpora](https://www.sketchengine.eu/corpora-and-languages/khmer-text-corpora/) - [Kazakh text corpora](https://www.sketchengine.eu/corpora-and-languages/kazakh-text-corpora/) - [Kannada text corpora](https://www.sketchengine.eu/corpora-and-languages/kannada-text-corpora/) - [Kalaamaya text corpora](https://www.sketchengine.eu/corpora-and-languages/kalaamaya-text-corpora/) - [Japanese text corpora](https://www.sketchengine.eu/corpora-and-languages/japanese-text-corpora/) - [Italian text corpora](https://www.sketchengine.eu/corpora-and-languages/italian-text-corpora/) - [Irish text corpora](https://www.sketchengine.eu/corpora-and-languages/irish-text-corpora/) - [Indonesian text corpora](https://www.sketchengine.eu/corpora-and-languages/indonesian-text-corpora/) - [Igbo text corpora](https://www.sketchengine.eu/corpora-and-languages/igbo-text-corpora/) - [Icelandic text corpora](https://www.sketchengine.eu/corpora-and-languages/icelandic-text-corpora/) - [Hungarian text corpora](https://www.sketchengine.eu/corpora-and-languages/hungarian-text-corpora/) - [Hindi text corpora](https://www.sketchengine.eu/corpora-and-languages/hindi-text-corpora/) - [Hebrew text corpora](https://www.sketchengine.eu/corpora-and-languages/hebrew-text-corpora/) - [Cebuano text corpora](https://www.sketchengine.eu/corpora-and-languages/cebuano-text-corpora/) - [Catalan text corpora](https://www.sketchengine.eu/corpora-and-languages/catalan-text-corpora/) - [Cantonese text corpora](https://www.sketchengine.eu/corpora-and-languages/cantonese-text-corpora/) - [Burmese text corpora](https://www.sketchengine.eu/corpora-and-languages/burmese-text-corpora/) - [Bulgarian text corpora](https://www.sketchengine.eu/corpora-and-languages/bulgarian-text-corpora/) - [Breton text 
corpora](https://www.sketchengine.eu/corpora-and-languages/breton-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Breton such as generating frequency word lists, concordances, n-grams, ... - [Bosnian text corpora](https://www.sketchengine.eu/corpora-and-languages/bosnian-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Bosnian such as generating frequency word lists, collocations, n-grams, ... - [Bengali text corpora](https://www.sketchengine.eu/corpora-and-languages/bengali-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Bengali such as generating frequency word lists, concordances, n-grams, ... - [Belarusian text corpora](https://www.sketchengine.eu/corpora-and-languages/belarusian-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Belarusian such as generating frequency word lists, concordances, n-grams, ... - [Basque text corpora](https://www.sketchengine.eu/corpora-and-languages/basque-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Basque such as generating frequency word lists, collocations, n-grams, ... - [Azerbaijani text corpora](https://www.sketchengine.eu/corpora-and-languages/azerbaijani-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Azerbaijani such as generating frequency word lists, collocations, n-grams, ... - [Assamese text corpora](https://www.sketchengine.eu/corpora-and-languages/assamese-text-corpora/) - Assamese is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Armenian text corpora](https://www.sketchengine.eu/corpora-and-languages/armenian-text-corpora/) - [Arabic text corpora](https://www.sketchengine.eu/corpora-and-languages/arabic-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Arabic such as generating frequency word lists, collocations, n-grams, ... - [Ancient Greek text corpora](https://www.sketchengine.eu/corpora-and-languages/ancient-greek-text-corpora/) - [Amharic text corpora](https://www.sketchengine.eu/corpora-and-languages/amharic-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Amharic such as generating frequency word lists, collocations, n-grams, ... - [Amazigh text corpora](https://www.sketchengine.eu/corpora-and-languages/amazigh-text-corpora/) - [Albanian text corpora](https://www.sketchengine.eu/corpora-and-languages/albanian-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Albanian such as generating frequency word lists, collocations, n-grams, ... - [CoPEP – Corpus of Portuguese from Academic Journals](https://www.sketchengine.eu/copep-corpus-of-portuguese-from-academic-journals/) - Search CoPEP, the 40-million-word Portuguese corpus of texts from academic journals from Brazil and Portugal. Texts were published 2000–2016. - [Cambridge Academic English Corpus (CAEC)](https://www.sketchengine.eu/cambridge-academic-english-corpus/) - Search CAEC, the 3-million-word sample of the Cambridge Academic English Corpus collected from the academic English texts provided by US and UK institutions. - [Cantonese corpus from the web](https://www.sketchengine.eu/cantonesewac-corpus/) - Search CantoneseWaC, the 42-million-word Chinese corpus of Cantonese texts from the Web. Texts were cleaned and deduplicated. 
- [Bulgarian TreeBank part-of-speech tagset](https://www.sketchengine.eu/bulgarian-treebank-part-of-speech-tagset/) - Bulgarian TreeBank POS tagset is a list of POS tags used to indicate grammatical categories for Bulgarian corpora in Sketch Engine. - [Bulgarian tagset](https://www.sketchengine.eu/bulgarian-tagset/) - Bulgarian part-of-speech tagset is a list of tags used in Bulgarian corpora in Sketch Engine. - [bsWaC – Bosnian corpus from the web](https://www.sketchengine.eu/bswac-bosnian-corpus/) - Search bsWaC, the 248-million-word Bosnian corpus of texts collected from the web. Texts were cleaned and deduplicated. - [Brexit corpus](https://www.sketchengine.eu/brexit-corpus/) - Brexit text corpus is a database of articles relating to the UK leaving the EU. Study English n-grams, collocations, concordances, etc. - [Brasileiro part-of-speech tagset](https://www.sketchengine.eu/brasileiro-part-of-speech-tagset/) - Sketch Engine offers a Brazilian Portuguese corpus with part-of-speech tags. Search concordances, collocations, n-grams. - [BLaRC: British Law Reference Corpus](https://www.sketchengine.eu/blarc-british-law-reference-corpus/) - Search BLaRC corpus, the 8-million-word British English corpus of legal text which consists of judicial decisions issued by British courts and tribunals in 2008–2010. Explore legal English. - [Bengali part-of-speech tagset](https://www.sketchengine.eu/bengali-part-of-speech-tagset/) - Bengali text corpora with part-of-speech tagging are available in Sketch Engine with a free trial. Search collocations, n-grams, thesaurus. - [Basque part-of-speech tagset](https://www.sketchengine.eu/basque-part-of-speech-tagset/) - Basque part-of-speech tagset is a list of POS tags used to indicate grammatical categories for Basque corpora in Sketch Engine. - [Arabic MADA system tagset](https://www.sketchengine.eu/arabic-mada-system-tagset/) - MADA system is an Arabic part-of-speech tagset.
See the list of POS tags used to indicate grammatical categories for Arabic corpora. - [Arabic corpus (arWaC)](https://www.sketchengine.eu/arabic-web-corpus-wac/) - Search arWaC, the 174-million-word Arabic corpus of texts from the web. Texts were cleaned and deduplicated. - [Araneum Slovacum Maius part-of-speech tagset](https://www.sketchengine.eu/araneum-slovacum-tagset/) - Tagset for Araneum Slovacum Maius Corpus in Sketch Engine. - [azWaC – Azerbaijani corpus](https://www.sketchengine.eu/azwac-azerbaijani-corpus/) - Search azWaC, the 94-million-word Azerbaijani corpus of texts from the web. Texts were cleaned and deduplicated. Search concordances or generate Azeri word lists and n-grams. - [Amharic part-of-speech tagset](https://www.sketchengine.eu/amharic-part-of-speech-tagset/) - Amharic POS tagset is available in Amharic text corpora in Sketch Engine, enabling searches for concordances, n-grams, and collocations. - [Algemeen Nederlands Woordenboek (ANW) corpus](https://www.sketchengine.eu/anw-corpus/) - Search Algemeen Nederlands Woordenboek (ANW), the 100-million-word Dutch corpus of various texts – literature, websites, newspapers. - [OPUS MontenegrinSubs parallel corpora](https://www.sketchengine.eu/opus-montenegrinsubs-parallel-corpora/) - Search the OPUS MontenegrinSubs parallel corpora, the bilingual collection of corpora in two languages: Montenegrin and English. - [CHILDES - child language corpus](https://www.sketchengine.eu/childes-corpora/) - Discover CHILDES corpora, multilingual text corpora built up from transcripts of child language. Study the language of learners in 24 languages. Generate concordances, n-grams or collocations. - [Shared Resources](https://www.sketchengine.eu/user-guide/shared-resources/) - [Norwegian Universal dependencies tagset](https://www.sketchengine.eu/norwegian-universal-dependencies-tagset/) - Norwegian Universal dependencies tagset is a list of part-of-speech tags used to indicate the part of speech in Norwegian.
Both language variants are included: Bokmål and Nynorsk. - [Universal Word Sketch grammar with shallow tags](https://www.sketchengine.eu/universal-word-sketch-grammar-with-shallow-tags/) - [Universal Word Sketch grammar](https://www.sketchengine.eu/universal-word-sketch-grammar/) - [Timeline – language use over time](https://www.sketchengine.eu/timeline-language-use-over-time/) - The timeline function displays the changing frequency of a word or phrase over time. It provides a detailed graph with information about word frequency changes. - [Compiling a corpus on local installation](https://www.sketchengine.eu/documentation/compiling-corpus/) - Manual for compiling a corpus on a local installation, with a description of the compilecorp tool. - [88milSMS – French text-message corpus](https://www.sketchengine.eu/88milsms-french-text-message-corpus/) - Search 88milSMS, the French corpus of 88,000 text messages. Texts were part-of-speech tagged and lemmatized. Corpus contains emoji symbols. - [Tickbox Lexicography quote request](https://www.sketchengine.eu/tickbox-lexicography-quote-request/) - [Corpus hosting quote request](https://www.sketchengine.eu/corpus-hosting-quote-request/) - [non-academic subscription quote](https://www.sketchengine.eu/non-academic-subscription-quote/) - [Chinese Traditional text corpora](https://www.sketchengine.eu/corpora-and-languages/chinese-traditional-text-corpora/) - A list of Chinese Traditional text corpora designed for linguists, lexicologists, lexicographers, researchers, translators, terminologists, teachers and students working with Chinese Traditional. - [Lithuanian MULTEXT-EAST part-of-speech tagset](https://www.sketchengine.eu/lithuanian-multext-east-part-of-speech-tagset/) - Lithuanian part-of-speech tagset is a list of morphological tags indicating grammatical categories in Lithuanian text corpora.
- [svTenTen – Swedish corpus from the web](https://www.sketchengine.eu/svtenten-swedish-corpus/) - Search svTenTen, the 2.3-billion-word Swedish corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [MDPI Open Peer Review Corpus 2](https://www.sketchengine.eu/mdpi-open-peer-review-corpus-2/) - Explore the MDPI Corpus. It includes over 135,000 peer reviews with full texts, metadata, and author responses, showcasing transparent peer review practices. - [Text type analysis](https://www.sketchengine.eu/guide/text-type-analysis/) - Text type analysis displays statistics of metadata of a corpus or subcorpus. The sizes can be shown as a number of structures, words or tokens. - [The Sketch Engine changelog – Bonito](https://www.sketchengine.eu/documentation/bonito-changelog/) - Sketch Engine is a powerful tool for searching and building text corpora. It offers hundreds of corpora in 100+ languages. - [Sketch Engine changelog - Manatee](https://www.sketchengine.eu/documentation/manatee-changelog/) - Sketch Engine is a powerful tool for searching and building text corpora. It offers hundreds of corpora in 100+ languages. - [fiTenTen – Finnish corpus from the web](https://www.sketchengine.eu/fitenten-finnish-corpus/) - Search fiTenTen, the 4.4-billion-word Finnish corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [Boot Camp Online](https://www.sketchengine.eu/bootcamp/boot-camp-online/) - [Maduwongga text corpora](https://www.sketchengine.eu/corpora-and-languages/maduwongga-text-corpora/) - Maduwongga is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Northern Iroquoian text corpora](https://www.sketchengine.eu/corpora-and-languages/northern-iroquoian-text-corpora/) - Northern Iroquoian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Punjabi (Shahmukhi) text corpora](https://www.sketchengine.eu/corpora-and-languages/punjabi-shahmukhi-text-corpora-2/) - Punjabi (Shahmukhi) is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [huTenTen – Hungarian corpus from the web](https://www.sketchengine.eu/hutenten-hungarian-corpus/) - Search huTenTen, the 5.1-billion-word Hungarian corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [Parallel corpora](https://www.sketchengine.eu/corpora-and-languages/parallel-corpora/) - Parallel corpora in Sketch Engine are monolingual corpora linked to each other so that users can search for results in more languages at once. - [Wangkatja text corpora](https://www.sketchengine.eu/corpora-and-languages/wangkatja-text-corpora/) - Wangkatja is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Warlpiri text corpora](https://www.sketchengine.eu/corpora-and-languages/warlpiri-text-corpora/) - Warlpiri is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Venda text corpora](https://www.sketchengine.eu/corpora-and-languages/venda-text-corpora/) - Venda is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Esperanto text corpora](https://www.sketchengine.eu/corpora-and-languages/esperanto-text-corpora/) - [Portuguese text corpora](https://www.sketchengine.eu/corpora-and-languages/portuguese-text-corpora/) - Portuguese is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Polish text corpora](https://www.sketchengine.eu/corpora-and-languages/polish-text-corpora/) - Polish is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works.
- [Pitjantjatjara text corpora](https://www.sketchengine.eu/corpora-and-languages/pitjantjatjara-text-corpora/) - Pitjantjatjara is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Pintupi text corpora](https://www.sketchengine.eu/corpora-and-languages/pintupi-text-corpora/) - Pintupi is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Persian text corpora](https://www.sketchengine.eu/corpora-and-languages/persian-text-corpora/) - Persian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Pashto text corpora](https://www.sketchengine.eu/corpora-and-languages/pashto-text-corpora/) - Pashto is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Oromo text corpora](https://www.sketchengine.eu/corpora-and-languages/oromo-text-corpora/) - [Quechua text corpora](https://www.sketchengine.eu/corpora-and-languages/quechua-text-corpora/) - Quechua is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Romanian text corpora](https://www.sketchengine.eu/corpora-and-languages/romanian-text-corpora/) - Romanian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Russian text corpora](https://www.sketchengine.eu/corpora-and-languages/russian-text-corpora/) - Russian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Punjabi (Gurmukhi) text corpora](https://www.sketchengine.eu/corpora-and-languages/punjabi-shahmukhi-text-corpora/) - Punjabi (Gurmukhi) is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Cundeelee Wangka text corpora](https://www.sketchengine.eu/corpora-and-languages/cundeelee-wangka-text-corpora/) - Cundeelee Wangka is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Crimean Tatar text corpora](https://www.sketchengine.eu/corpora-and-languages/crimean-tatar-text-corpora/) - Crimean Tatar is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Sanskrit (romanised) text corpora](https://www.sketchengine.eu/corpora-and-languages/sanskrit-romanised-text-corpora/) - Sanskrit (romanised) is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Scottish Gaelic text corpora](https://www.sketchengine.eu/corpora-and-languages/scottish-gaelic-text-corpora/) - Scottish Gaelic is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Serbian (Latin) text corpora](https://www.sketchengine.eu/corpora-and-languages/serbian-latin-text-corpora/) - Serbian (Latin) is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Serbian text corpora](https://www.sketchengine.eu/corpora-and-languages/serbian-text-corpora/) - Serbian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Samoan text corpora](https://www.sketchengine.eu/corpora-and-languages/samoan-text-corpora/) - Samoan is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Swedish text corpora](https://www.sketchengine.eu/corpora-and-languages/swedish-text-corpora/) - Swedish is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works.
- [Swazi text corpora](https://www.sketchengine.eu/corpora-and-languages/swazi-text-corpora/) - Swazi is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Swahili text corpora](https://www.sketchengine.eu/corpora-and-languages/swahili-text-corpora/) - Swahili is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Spanish text corpora](https://www.sketchengine.eu/corpora-and-languages/spanish-text-corpora/) - Spanish is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Somali text corpora](https://www.sketchengine.eu/corpora-and-languages/somali-text-corpora/) - Somali is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Slovenian text corpora](https://www.sketchengine.eu/corpora-and-languages/slovenian-text-corpora/) - Slovenian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Sesotho text corpora](https://www.sketchengine.eu/corpora-and-languages/sesotho-text-corpora/) - Sesotho is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Sinhalese text corpora](https://www.sketchengine.eu/corpora-and-languages/sinhalese-text-corpora/) - Sinhalese is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Slovak text corpora](https://www.sketchengine.eu/corpora-and-languages/slovak-text-corpora/) - Slovak is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Setswana text corpora](https://www.sketchengine.eu/corpora-and-languages/setswana-text-corpora/) - Setswana is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Syriac text corpora](https://www.sketchengine.eu/corpora-and-languages/syriac-text-corpora/) - Syriac is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tigrinya text corpora](https://www.sketchengine.eu/corpora-and-languages/tigrinya-text-corpora/) - Tigrinya is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Thai text corpora](https://www.sketchengine.eu/corpora-and-languages/thai-text-corpora/) - Thai is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Telugu text corpora](https://www.sketchengine.eu/corpora-and-languages/telugu-text-corpora/) - Telugu is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tatar text corpora](https://www.sketchengine.eu/corpora-and-languages/tatar-text-corpora/) - Tatar is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tamil text corpora](https://www.sketchengine.eu/corpora-and-languages/tamil-text-corpora/) - Tamil is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Talysh text corpora](https://www.sketchengine.eu/corpora-and-languages/talysh-text-corpora/) - Talysh is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Tajik text corpora](https://www.sketchengine.eu/corpora-and-languages/tajik-text-corpora/) - Tajik is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tagalog text corpora](https://www.sketchengine.eu/corpora-and-languages/tagalog-text-corpora/) - Tagalog is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tjalkatjarra text corpora](https://www.sketchengine.eu/corpora-and-languages/tjalkatjarra-text-corpora/) - Tjalkatjarra is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Hausa (Boko) text corpora](https://www.sketchengine.eu/corpora-and-languages/hausa-boko-text-corpora/) - [Tjupan text corpora](https://www.sketchengine.eu/corpora-and-languages/tjupan-text-corpora/) - Tjupan is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Tsonga text corpora](https://www.sketchengine.eu/corpora-and-languages/tsonga-text-corpora/) - Tsonga is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Turkish text corpora](https://www.sketchengine.eu/corpora-and-languages/turkish-text-corpora/) - Turkish is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Turkmen text corpora](https://www.sketchengine.eu/corpora-and-languages/turkmen-text-corpora/) - Turkmen is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Bashkir text corpora](https://www.sketchengine.eu/corpora-and-languages/bashkir-text-corpora/) - Bashkir is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Gujarati text corpora](https://www.sketchengine.eu/corpora-and-languages/gujarati-text-corpora/) - [Greek text corpora](https://www.sketchengine.eu/corpora-and-languages/greek-text-corpora/) - [German text corpora](https://www.sketchengine.eu/corpora-and-languages/german-text-corpora/) - [Georgian text corpora](https://www.sketchengine.eu/corpora-and-languages/georgian-text-corpora/) - [Vietnamese text corpora](https://www.sketchengine.eu/corpora-and-languages/vietnamese-text-corpora/) - Vietnamese is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Uzbek text corpora](https://www.sketchengine.eu/corpora-and-languages/uzbek-text-corpora/) - Uzbek is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Urdu text corpora](https://www.sketchengine.eu/corpora-and-languages/urdu-text-corpora/) - Urdu is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Ukrainian text corpora](https://www.sketchengine.eu/corpora-and-languages/ukrainian-text-corpora/) - Ukrainian is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works.
- [Galician text corpora](https://www.sketchengine.eu/corpora-and-languages/galician-text-corpora/) - [Frisian text corpora](https://www.sketchengine.eu/corpora-and-languages/frisian-text-corpora/) - [French text corpora](https://www.sketchengine.eu/corpora-and-languages/french-text-corpora/) - [Finnish text corpora](https://www.sketchengine.eu/corpora-and-languages/finnish-text-corpora/) - [Yankunytjatjara text corpora](https://www.sketchengine.eu/corpora-and-languages/yankunytjatjara-text-corpora/) - Yankunytjatjara is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Yiddish text corpora](https://www.sketchengine.eu/corpora-and-languages/yiddish-text-corpora/) - Yiddish is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Wudjaarri text corpora](https://www.sketchengine.eu/corpora-and-languages/wudjaarri-text-corpora/) - Wudjaarri is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Welsh text corpora](https://www.sketchengine.eu/corpora-and-languages/welsh-text-corpora/) - Welsh is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Yoruba text corpora](https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/) - Yoruba is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. - [Zulu text corpora](https://www.sketchengine.eu/corpora-and-languages/zulu-text-corpora/) - Zulu is one of the many languages whose text corpora are included in Sketch Engine, a tool for discovering how language works. 
- [Filipino text corpora](https://www.sketchengine.eu/corpora-and-languages/filipino-text-corpora/) - [Estonian text corpora](https://www.sketchengine.eu/corpora-and-languages/estonian-text-corpora/) - [English text corpora](https://www.sketchengine.eu/corpora-and-languages/english-text-corpora/) - [Dutch text corpora](https://www.sketchengine.eu/corpora-and-languages/dutch-text-corpora/) - [Danish text corpora](https://www.sketchengine.eu/corpora-and-languages/danish-text-corpora/) - [Czech text corpora](https://www.sketchengine.eu/corpora-and-languages/czech-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Czech such as generating frequency word lists, collocations, n-grams, ... - [Croatian text corpora](https://www.sketchengine.eu/corpora-and-languages/croatian-text-corpora/) - [Chinese Simplified text corpora](https://www.sketchengine.eu/corpora-and-languages/chinese-simplified-text-corpora/) - [Afrikaans text corpora](https://www.sketchengine.eu/corpora-and-languages/afrikaans-text-corpora/) - The online search interface offers a variety of corpus analytic tools for Afrikaans text corpora such as generating frequency word lists, collocations, n-grams, ... - [Reference corpora](https://www.sketchengine.eu/corpora-and-languages/reference-corpus/) - A list of reference corpora for keyword and term extraction in the supported languages. - [List of corpora](https://www.sketchengine.eu/corpora-and-languages/corpus-list/) - A list of text corpora in all languages available in Sketch Engine. - [Corpus types](https://www.sketchengine.eu/corpora-and-languages/corpus-types/) - See definitions of corpus types: monolingual corpus, parallel corpus, multilingual corpus, diachronic corpus, learner corpus, multimedia corpus, comparable corpus. - [Timestamped English corpus](https://www.sketchengine.eu/timestamped-english-corpus/) - Search the Timestamped English corpus with 70 billion words. 
Texts are deduplicated, lemmatized, and part-of-speech tagged. - [The Oxford English Corpus](https://www.sketchengine.eu/oxford-english-corpus/) - The Oxford English Corpus (OEC) contains all types of English including novels, everyday newspapers, blogs, emails and social media. - [Lektor: Slovenian Learner corpus of proofreading and translations](https://www.sketchengine.eu/lektor-slovenian-learner-corpus-of-proofreading-and-translations/) - Corpus Lektor is an error-annotated Slovenian corpus of the author's corrections of texts and translations. The texts were manually tagged and classified. - [OpenSubtitles parallel corpora](https://www.sketchengine.eu/opensubtitles-parallel-corpora/) - The OpenSubtitles parallel corpora are a collection of 60 corpora in 58 languages made up of translated movie subtitles in the OpenSubtitles database. - [viTenTen – Vietnamese corpus from the web](https://www.sketchengine.eu/vitenten-vietnamese-corpus/) - Search viTenTen, the 6-billion-word Vietnamese corpus of texts from the web. Texts were POS-tagged, cleaned, deduplicated and foreign language filtered. - [New words in English](https://www.sketchengine.eu/new-words-in-english/) - New Words in English is a valuable subscription service for publishers, lexicographers, and content developers. Create engaging content for your readers or users. - [CQL for geeks](https://www.sketchengine.eu/cql-for-geeks/) - This CQL functionality is primarily meant for development and testing. Use Corpus Query Language (CQL) for advanced searches in your corpora. - [MaCoCu Corpora from the web](https://www.sketchengine.eu/macocu-corpora-from-the-web/) - MaCoCu corpora are a collection of web corpora with hundreds of millions of words with a focus on under-resourced languages, facilitating language preservation. - [Ukrainian Trends corpus](https://www.sketchengine.eu/ukrainian-trends-corpus/) - Search the Ukrainian Trends corpus, a Ukrainian monitor corpus made up of news articles gained from their RSS feeds.
The corpus is PoS tagged and lemmatized. - [Spanish Trends corpus](https://www.sketchengine.eu/spanish-trends-corpus/) - Search the Spanish Trends corpus, a Spanish monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 2 million words. - [Slovene Trends corpus](https://www.sketchengine.eu/slovene-trends-corpus/) - Search the Slovene Trends corpus, a Slovene monitor corpus of news articles gained from their RSS feeds. The corpus is updated daily with a million words. - [Russian Trends corpus](https://www.sketchengine.eu/russian-trends-corpus/) - Search the Russian Trends corpus, a Russian monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 3-4 million words. - [Portuguese Trends corpus](https://www.sketchengine.eu/portuguese-trends-corpus/) - Search the Portuguese Trends corpus, a Portuguese monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 1-2 million words. - [Polish Trends corpus](https://www.sketchengine.eu/polish-trends-corpus/) - Search the Polish Trends corpus, a Polish monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by about 1 million words. - [Persian Trends corpus](https://www.sketchengine.eu/persian-trends-corpus/) - Search the Persian Trends corpus, a Persian monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 5 million words. - [Irish Trends corpus](https://www.sketchengine.eu/irish-trends-corpus/) - Search the Irish Trends corpus, an Irish monitor corpus of news articles gained from their RSS feeds. The corpus is updated daily with 35,000 words. - [German Trends corpus](https://www.sketchengine.eu/german-trends-corpus/) - Search the German Trends corpus, a German monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 2 million words.
- [French Trends corpus](https://www.sketchengine.eu/french-trends-corpus/) - Search the French Trends corpus, a French monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 1 million words. - [Estonian Trends corpus](https://www.sketchengine.eu/estonian-trends-corpus/) - Search the Estonian Trends corpus, an Estonian monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by about 200,000 words. - [Danish Trends corpus](https://www.sketchengine.eu/danish-trends-corpus/) - Search the Danish Trends corpus, a Danish monitor corpus made up of news articles gained from their RSS feeds. The corpus is updated daily by 100-200,000 words. - [Estonian Coursebook Corpus 2018](https://www.sketchengine.eu/estonian-coursebook-corpus/) - Search Estonian Coursebook Corpus, the 120-thousand-word corpus containing complete sentences from Estonian language textbooks. - [Igbo corpus (igWaC)](https://www.sketchengine.eu/igbowac-igbo-corpus/) - Search igWaC, the 600,000-word Igbo corpus of texts from the web. Texts were cleaned and deduplicated. The corpus is part-of-speech tagged and lemmatized. - [MAGPIE: a sense-annotated corpus](https://www.sketchengine.eu/magpie-sense-annotated-corpus/) - [Turkic corpora](https://www.sketchengine.eu/turkic-web-corpora/) - Search Turkic corpora, a set of corpora for the Azerbaijani, Kazakh, Kyrgyz, Turkish, Turkmen and Uzbek languages. The total size is over 3.5 billion words. Make concordances or generate word lists and collocations. - [Timestamped Spanish corpus](https://www.sketchengine.eu/timestamped-spanish-corpus/) - Search the 16-billion-word Timestamped Spanish corpus updated with new data daily. Carry out the diachronic analysis of words from many Spanish varieties.
- [Swedish Parole corpus](https://www.sketchengine.eu/swedish-parole-corpus/) - Swedish Parole is a 21-million-word morphologically and syntactically annotated corpus, created as part of the EU project PAROLE. - [slWaC – Slovenian corpus from the web](https://www.sketchengine.eu/slwac-slovenian-corpus-from-the-web/) - Search slWaC, the 750-million-word Slovenian corpus of texts collected from the web. Texts were cleaned and deduplicated. - [Chinese Penn Treebank part-of-speech tagset](https://www.sketchengine.eu/chinese-penn-treebank-part-of-speech-tagset/) - Chinese Penn Treebank is a list of part-of-speech tags used to indicate grammatical categories in Chinese corpora. - [COMPAS corpus](https://www.sketchengine.eu/compas-corpus/) - Search COMPAS, the 260-million-word English corpus from UK's newspaper articles related to immigration from the period 1985–2015. - [German corpus (deWaC)](https://www.sketchengine.eu/dewac-german-corpus/) - Search deWaC, the 1.3-billion-word German corpus of texts from the German national domain. Texts were cleaned and deduplicated. - [OANC: Open American National Corpus](https://www.sketchengine.eu/oanc_masc-corpus/) - Search OANC corpus, the 11-million-word Open American National Corpus. Use lemmatization and part-of-speech tagging to find collocations or generate n-grams. The corpus is merged with the Manually Annotated subcorpus. - [Merlin Learner Corpus](https://www.sketchengine.eu/merlin-learner-corpus/) - MERLIN is an error-annotated written learner corpus for German, Italian and Czech. - [jaTenTen – Japanese corpus from the web](https://www.sketchengine.eu/jatenten-japanese-corpus/) - Search jaTenTen, the 8-billion-word Japanese corpus of texts from the Japanese Web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. 
- [Gutenberg Corpora 2020](https://www.sketchengine.eu/gutenberg-corpora-2020/) - Search Gutenberg corpora 2020, a collection of 29 corpora of books created from the Project Gutenberg database in April 2020. - [Europarl spoken parallel corpus](https://www.sketchengine.eu/europarl-spoken-parallel-corpus/) - Search Europarl spoken parallel corpus, the multilingual corpus in 21 languages built up from the European Parliament Proceedings. - [Eur-Lex judgments parallel corpus](https://www.sketchengine.eu/eur-lex-judgments-parallel-corpus/) - Search EUR-Lex judgments parallel corpus, the multilingual corpora made up of the Court of Justice judgments in 24 languages. - [Directory of Open Access Journals – DOAJ corpora](https://www.sketchengine.eu/doaj-corpora/) - Search DOAJ corpora, corpora of the Directory of Open Access Journals, covering journals from all areas of science in dozens of languages. - [csTenTen - Czech corpus from the web](https://www.sketchengine.eu/cstenten-czech-corpus/) - Search the 5.7-billion-word Czech corpus of texts from the web 2023. Texts were cleaned, part-of-speech tagged, lemmatized, and classified by genre and topic. - [ParlaMint corpora of parliamentary debates](https://www.sketchengine.eu/parlamint-corpora-of-parliamentary-debates/) - Search ParlaMint corpora 2.1, a collection of 17 multilingual comparable corpora consisting of parliamentary debates of 17 European parliaments in 16 languages. - [Hebrew YAP part-of-speech tagset](https://www.sketchengine.eu/hebrew-yap-part-of-speech-tagset/) - Hebrew YAP part-of-speech tagset is used for corpora annotated by the Yet Another (natural language) Parser (abbreviated as YAP). YAP is part of the ONLP lab tool kit. - [A Corpus of English Dialogues 1560–1760](https://www.sketchengine.eu/corpus-english-dialogues/) - Search the corpus of English Dialogues 1560–1760 (CED), the 1.2-million-word English corpus of Early Modern English speech-related texts.
- [Writing term grammar](https://www.sketchengine.eu/documentation/writing-term-grammar/) - Learn how to write a term grammar definition which identifies multi-word terms in corpora within Sketch Engine. - [Swedish part-of-speech tagset](https://www.sketchengine.eu/swedish-part-of-speech-tagset/) - Swedish POS tagset is a list of POS tags used to indicate grammatical categories for Swedish corpora in Sketch Engine. - [Catalan FreeLing part-of-speech tagset](https://www.sketchengine.eu/catalan-tagset/) - Catalan FreeLing POS tagset is a list of POS tags used to indicate grammatical categories for Catalan corpora in Sketch Engine. - [DraCor Drama Corpora](https://www.sketchengine.eu/dracor-drama-corpora/) - The Drama Corpora (DraCor) is a set of 21 corpora consisting of theater plays in 14 languages and dialects. Useful for digital humanities, literature studies, and linguistics. - [Historical collection: Early English Books Online (EEBO), ECCO, Readex's Evans](https://www.sketchengine.eu/historical-collection-eebo-ecco-evans/) - Historical collection of English books published between 1473 and 1820 from Early English Books Online (EEBO), ECCO and Readex's Evans projects. - [Local installation](https://www.sketchengine.eu/documentation/local-installation/) - Sketch Engine can be installed locally. This option is provided under special conditions. - [English CLAWS part-of-speech tagset, version 7](https://www.sketchengine.eu/english-claws7-part-of-speech-tagset/) - English CLAWS 7 POS tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine. - [euWaC – Basque Corpus from Web](https://www.sketchengine.eu/euwac-basque-corpus/) - Search euWaC, the 100-million-word Basque corpus of texts from the Basque domain. Texts were cleaned and deduplicated.
- [PICAE: Pearson International Corpus of Academic English](https://www.sketchengine.eu/picae-pearson-international-corpus-of-academic-english/) - Search PICAE, the Pearson International Corpus of Academic English with text covering various academic disciplines of the period 1975–1993. - [Hausa corpus (haWaC)](https://www.sketchengine.eu/hawac-hausa-corpus/) - Search haWaC, the 5-million-word Hausa corpus (written in Boko) of texts collected from the web. Texts were cleaned and deduplicated. - [Corpus Brasileiro](https://www.sketchengine.eu/corpus-brasileiro/) - Brazilian Portuguese corpus with 870 million words can be searched with Sketch Engine; concordances, n-grams, collocations. - [Bulgarian National Corpus](https://www.sketchengine.eu/bulgarian-national-corpus/) - Search the Bulgarian National Corpus, the 419-million-word Bulgarian corpus of texts collected from various web and non-web sources. - [Historians](https://www.sketchengine.eu/user-guide/historians/) - Historians can create corpora from historical texts or use preloaded corpora in Sketch Engine for history research. - [DGT Translation Memory parallel corpus](https://www.sketchengine.eu/dgt-translation-memory-parallel-corpus/) - Search DGT Translation Memory parallel corpus, the parallel corpora in 24 European languages. Generate n-grams, collocations, word lists or make concordances. - [Manage a corpus](https://www.sketchengine.eu/guide/manage-a-corpus/) - Manage your corpus to create subcorpora, expand its size, and enhance its usability for a better working experience - [lvWaC – Latvian corpus from the web](https://www.sketchengine.eu/latvianwac-latvian-corpus/) - Search LatvianWaC, the 57-million-word Latvian corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. 
- [Boot Camp](https://www.sketchengine.eu/bootcamp/) - Boot Camp is a 2-day course in using Sketch Engine to build, search and analyse corpora for linguistics, social sciences, lexicography, translation, IT & NLP... - [Yiddish corpus from Wikipedia](https://www.sketchengine.eu/yiddish-wikipedia-corpus/) - Search the Yiddish corpus, the 15-million-word Yiddish corpus built up from the whole content of Yiddish Wikipedia using the Wikipedia dump in December 2018. - [Toxicity Corpus](https://www.sketchengine.eu/toxicity-corpus/) - Search the Toxicity corpus, an English corpus of 2 million comments manually annotated to identify the degree of toxicity in terms of sexuality, threat, etc. - [tiWaC – Tigrynia corpus from the web](https://www.sketchengine.eu/tiwac-tigrinya-corpus/) - Search tiWaC, the 2-million-word Tigrynia corpus of texts collected from the web. Texts were cleaned and deduplicated. Generate Tigrynia n-grams and word lists or search Tigrynia collocations. - [The Digital Parisian Stage Corpus](https://www.sketchengine.eu/the-digital-parisian-stage-corpus/) - Search the Digital Parisian Stage Corpus, a French corpus of 24 Theatrical texts from The Parisian Stage written by Charles Beaumont Wicks. - [thTenTen — Thai corpus from the web](https://www.sketchengine.eu/thtenten-thai-corpus/) - Search thTenTen, the 640-million-word Thai corpus of texts from the web. Thai texts were cleaned, deduplicated and tokenised in the corpus. - [ttWaC – Tatar corpus from the web](https://www.sketchengine.eu/ttwac-tatar-corpus/) - Search ttWaC, the Tatar corpus of texts collected from the web in 2015. - [Tatar News Corpus](https://www.sketchengine.eu/tatar-news-corpus/) - Search Tatar News corpus, the 24-million-word Tatar corpus of the news texts collected from the internet. Texts were cleaned and deduplicated. 
- [Tatar Mixed Corpus from the web](https://www.sketchengine.eu/tatar-corpus-from-the-web/) - Search the Tatar Mixed corpus, the 100-million-word Tatar corpus of texts of various genres from the web. Texts were cleaned and deduplicated. - [TalkBank Persian corpus of blog posts](https://www.sketchengine.eu/talkbank-persian-corpus/) - Search TalkBank Persian corpus, the 470-million-word Persian corpus of blog posts from various Farsi websites. The corpus was part-of-speech tagged. - [Tajik Web Corpus](https://www.sketchengine.eu/tajik-web-corpus/) - Search tgWaC, the 93-million-word Tajik corpus of texts from the web. Texts were cleaned and deduplicated. - [Sorani Kurdish Corpus from Wikipedia](https://www.sketchengine.eu/sorani-kurdish-corpus-from-wikipedia/) - Search the Sorani Kurdish corpus, the 5-million-word Sorani Kurdish corpus built up from the whole content of Sorani Kurdish Wikipedia in November 2020. - [Samoan corpus (smWaC)](https://www.sketchengine.eu/samoanwac-samoan-corpus/) - Search smWaC, the 3-million-word Samoan corpus of texts collected from the web. Texts were cleaned and deduplicated. - [RapCor – French corpus of rap songs](https://www.sketchengine.eu/rapcor-french-rap-corpus/) - Search RapCor, the small domain-specific corpus of spoken French extracted from Francophone rap songs. Texts were cleaned, part-of-speech tagged and lemmatized. - [Georgian corpus (kaWaC)](https://www.sketchengine.eu/kawac-georgian-corpus/) - Search kaWaC, the 50-million-word Georgian corpus of texts collected from the web. Texts were cleaned and deduplicated. Make Georgian concordances or generate Georgian n-grams and word lists. - [Frisian corpus (fyWaC)](https://www.sketchengine.eu/fywac-frisian-corpus/) - Search fyWaC, the 3-million-word West Frisian corpus of texts from the Frisian national domain. Texts were cleaned and deduplicated.
- [Polish language of the 1960s corpus](https://www.sketchengine.eu/polish-language-of-the-1960s-corpus/) - Explore the Polish language of the 1960s corpus covering news, essays, scientific text, fiction or plays! The corpus contains metadata, lemmatization or POS tagging. - [Oxford Children's Corpus](https://www.sketchengine.eu/oxford-childrens-corpus/) - Search the English Oxford Children's Corpus. Study children's language in the hundred-million English corpus created by Oxford University Press. - [Corpus of Old and Middle French (BFM 2022)](https://www.sketchengine.eu/bfm-corpus-of-old-and-middle-french/) - Search BFM corpus of Old and Middle French texts written between the 9th and the 15th centuries. Texts are part-of-speech tagged and lemmatized. - [Vietnamese corpus (viWaC)](https://www.sketchengine.eu/viwac-vietnamese-corpus/) - Search viWaC, the 100-million-word Vietnamese corpus of texts from the Vietnamese national domain. Texts were cleaned and deduplicated. The corpus contains part-of-speech tagging. - [Urdu corpus from the Web](https://www.sketchengine.eu/urwac-urdu-corpus/) - Search urWaC, the 53-million-word Urdu corpus of texts collected from the web. Texts were cleaned and deduplicated. Make Urdu concordances or generate Urdu n-grams and word lists. - [Tamil web corpus](https://www.sketchengine.eu/tawac-tamil-corpus/) - Tamil web corpus (taWaC) is a language corpus collected from the Internet in 2015. The corpus consists of 26 million Tamil words. - [swWaC – Swahili corpus from the web](https://www.sketchengine.eu/swwac-swahili-corpus/) - Search swWaC, the 17-million-word Swahili corpus from texts collected from the web. Texts were cleaned and deduplicated. - [soWaC – Somali corpus from the web](https://www.sketchengine.eu/sowac-somali-corpus/) - Search soWaC, the 71-million-word Somali corpus of texts collected from the web. Texts were cleaned and deduplicated. Generate Somali n-grams and word lists or search Somali collocations.
- [orWaC – Oromo corpus from the web](https://www.sketchengine.eu/orwac-oromo-corpus/) - Search orWaC, the 4-million-word Oromo corpus of texts collected from the web. Texts were cleaned and deduplicated. - [Maori corpus (miWaC)](https://www.sketchengine.eu/miwac-maori-corpus/) - Search miWaC, the 7-million-word Maori corpus of texts collected from the web. Texts were cleaned and deduplicated. Make Maori concordances or generate Maori n-grams and word lists. - [Film corpus of movie scripts](https://www.sketchengine.eu/film-corpus/) - Search Film corpus, the English corpus of 1068 film scripts from The Internet Movie Script Database (IMSDb) imsdb.com. Screenplays are categorized into genres. - [COVID-19 corpus from Open Research Dataset (CORD-19)](https://www.sketchengine.eu/covid-19-corpus/) - Search Covid-19 corpus which consists of texts that were released as part of the COVID-19 Open Research Dataset (CORD-19). - [BROWN Corpus](https://www.sketchengine.eu/brown-corpus/) - Search the Brown corpus, the Brown University Standard Corpus of Present-Day American English. Generate collocations, n-grams or use thesaurus and further tools. - [ACL Anthology Reference Corpus (ARC)](https://www.sketchengine.eu/acl-anthology-reference-corpus-arc/) - Search the ACL Anthology Reference Corpus (ARC), an English corpus of academic papers from conferences on NLP and computational linguistics in 1979–2015. - [BAWE: British Academic Written English Corpus](https://www.sketchengine.eu/british-academic-written-english-corpus/) - Search BAWE corpus, the British Academic Written English corpus of English texts collected from student academic works at UK universities. - [London English corpus](https://www.sketchengine.eu/london-english-corpus/) - Search the London English corpus, which consists of sociolects of London English: Traditional London English and new Multicultural London English (MLE). 
- [pukWaC – British English corpus parsed with MaltParser](https://www.sketchengine.eu/pukwac-british-english-corpus-maltparser/) - Search pukWaC, the 40-million-word sample of the British English corpus parsed with MaltParser. It contains syntactic annotation to show the syntax dependency. - [Norwegian dictionary corpus (Nynorskkorpuset)](https://www.sketchengine.eu/norwegian-dictionary-corpus-nynorskkorpus/) - Nynorskkorpuset is a Norwegian corpus of the new written standard Nynorsk. See the corpus containing of various domains, newspapers, journals, fictions, textbooks and religious texts. - [Mueller Report corpus](https://www.sketchengine.eu/mueller-report-corpus/) - Search Mueller report corpus, the English corpus of the entire Mueller report. Texts were part-of-speech tagged and lemmatized. - [METCLIL: Corpus of Metaphor in Academic Talk](https://www.sketchengine.eu/metclil-corpus-of-metaphor-in-academic-talk/) - [English medical corpus from the web](https://www.sketchengine.eu/english-medical-corpus/) - Search Medical corpus, the 33-million-word English corpus of texts collected from websites related to medical science. - [Maldivian corpus from Wikipedia](https://www.sketchengine.eu/maldivian-wikipedia-corpus/) - Search the Maldivian corpus, the 500-thousand-word Maldivian corpus built up from the whole content of Maldivian Wikipedia in April 2019. - [Corpus of Classical Arabic (KSUCCA)](https://www.sketchengine.eu/corpus-of-classical-arabic-ksucca/) - Search KSUCCA, the 46-million-word King Saud University Corpus of Classical Arabic from texts from the 7th–11th centuries. The texts were lemmatized and tagged. - [Icelandic Gigaword Corpus 2017](https://www.sketchengine.eu/icelandic-gigaword-corpus-2017/) - Explore the Icelandic corpus, enriched by bibliographic metadata. The corpus is lemmatized and part-of-speech tagged. 
- [Hebrew Translation Corpus](https://www.sketchengine.eu/hebrew-translation-corpus/) - Also known as the Hebrew Comparable Corpus, this is a text corpus of original and translated Hebrew texts. Generate n-grams, word lists. - [Greek corpus (gkWaC)](https://www.sketchengine.eu/gkwac-greek-corpus/) - Search gkWaC, the 100-billion-word Greek corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [English Environment corpus from the web](https://www.sketchengine.eu/english-environment-corpus/) - Search Environment Corpus, the 61-million-word English corpus of texts related to environmental science. Texts were processed with named entity recognition (NER). - [EcoLexicon corpus](https://www.sketchengine.eu/ecolexicon-corpus/) - Search the EcoLexicon corpus, an English corpus of contemporary environmental texts prepared by the LexiCon Research Group at the University of Granada. - [CzechParl: Corpus of Stenographic Protocols from Czech Parliament](https://www.sketchengine.eu/czechparl-corpus-of-czech-parliament/) - CzechParl is a text corpus of Stenographic Protocols from Czech Parliament 1993–2012. Search collocations, concordances, n-grams. - [hrWaC – Croatian corpus from the web](https://www.sketchengine.eu/hrwac-croatian-corpus/) - Search hrWaC, the 1.2-billion-word Croatian corpus of texts collected from the web. Texts were cleaned and deduplicated. - [Corpus of Estonian Web sentences](https://www.sketchengine.eu/corpus-of-estonian-web-sentences/) - Corpus of Estonian Web sentences is an Estonian corpus consisting of sentences sorted by the GDEx score reflecting text quality. - [New Corpus for Ireland](https://www.sketchengine.eu/new-corpus-for-ireland/) - New Corpus for Ireland is a project of building Irish and English text corpora for the creation of an English-to-Irish dictionary.
- [myTenTen21– Burmese corpus from the web](https://www.sketchengine.eu/mytenten-burmese-corpus/) - Search myTenTen21, the 716-million-word Burmese corpus of texts from the Web. Texts were cleaned, deduplicated and part-of-speech tagged. - [British National Corpus 2014 spoken](https://www.sketchengine.eu/british-national-corpus-2014-spoken/) - Search BNC2014, the 11-million-word British National Corpus 2014 made up of spoken English language. The BNC2014 corpus contains transcripts of spoken language. - [bnWaC – Bengali corpus from the web](https://www.sketchengine.eu/bnwac-bengali-corpus/) - Search bnWaC, the 13-million-word Bengali corpus of text from the web. Texts were cleaned and deduplicated. Generate Bangla n-gram or collocation. - [Assamese corpus from Wikipedia](https://www.sketchengine.eu/assamese-corpus-from-wikipedia/) - Search the Assamese corpus, the 2.5-million-word Assamese corpus built up from the whole content of Assamese Wikipedia in May 2023. - [Armenian corpus from Wikipedia](https://www.sketchengine.eu/armenian-corpus-from-wikipedia/) - Search the Armenian corpus, the 51-million Armenian corpus built up from the whole content of Armenian Wikipedia in December 2020. - [Arabic Learner Corpus (ALC)](https://www.sketchengine.eu/arabic-learner-corpus-alc/) - Arabic Learner Corpus (ALC) is a text corpus that consists of written and spoken texts from learners of Arabic in Saudi Arabia. - [Amharic corpus (amWaC)](https://www.sketchengine.eu/amwac-amharic-corpus/) - Search amWaC, the 25-million-word Amharic corpus of texts from the Amharic Web. Texts were cleaned, deduplicated and part-of-speech tagged. - [Afrikaans corpus from Wikipedia](https://www.sketchengine.eu/afrikaans-wikipedia-corpus/) - Search the Afrikaans corpus, the 22-million-word Afrikaans corpus built up from the whole Afrikaans Wikipedia dump in October 2022. 
- [trTenTen – Turkish corpus from the web](https://www.sketchengine.eu/trtenten-turkish-corpus/) - Search trTenTen, the 4.9-billion-word Turkish corpus of texts from the web. Texts were cleaned and deduplicated and part-of-speech tagged and contain stemming. - [tlTenTen — Tagalog corpus from the web](https://www.sketchengine.eu/tltenten-tagalog-corpus/) - Search tlTenTen, the 190-million-word Tagalog corpus of texts from the web (Filipino included). Texts were cleaned and deduplicated. - [slTenTen – Slovenian corpus from the web](https://www.sketchengine.eu/sltenten-slovenian-corpus/) - Search slTenTen, the 800-million-word Slovenian corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [roTenTen – Romanian corpus from the web](https://www.sketchengine.eu/rotenten-romanian-corpus/) - Search roTenTen, the 2.7-billion-word Romanian corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [kmTenTen – Khmer corpus from the web](https://www.sketchengine.eu/kmtenten-khmer-corpus/) - Search kmTenTen, the 103-million-word Khmer corpus of texts from the web. Texts were cleaned, deduplicated and part-of-speech tagged. - [isTenTen — Icelandic corpus from the web](https://www.sketchengine.eu/istenten-icelandic-corpus/) - Search isTenTen, the 500-million-word Icelandic corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [hiTenTen – Hindi corpus from the web](https://www.sketchengine.eu/hitenten-hindi-corpus/) - Search hiTenTen, the 1.6-billion-word Hindi corpus of texts from the web. Texts were cleaned, deduplicated, part-of-speech tagged and lemmatized. - [heTenTen – Hebrew corpus from the web](https://www.sketchengine.eu/hetenten-hebrew-corpus/) - Search heTenTen, the 2.7-billion-word Hebrew corpus of texts from the web. Texts were cleaned, part-of-speech tagged, and lemmatized. 
- [cebTenTen — Cebuano corpus from the web](https://www.sketchengine.eu/cebtenten-cebuano-corpus/) - Search cebTenTen, the 4-million-word Cebuano corpus of texts from the web. Texts were cleaned and deduplicated. - [caTenTen - Catalan corpus from the web](https://www.sketchengine.eu/catenten-catalan-corpus/) - Search caTenTen, the 180-million-word Catalan corpus of texts from the web. Texts were cleaned, part-of-speech tagged, lemmatized. - [OneClick Terms support](https://www.sketchengine.eu/oneclick-terms-support/) - Are you not sure how to use OneClick terms? Is something not working? Would you like to suggest an improvement or new functionality? - [Build and search a learner corpus – error analysis](https://www.sketchengine.eu/documentation/setting-up-learner-corpus/) - Build and search a learner corpus in Sketch Engine with a dedicated learner corpus search interface "error analysis". Analyse learner's errors and mistakes. - [TickBox Lexicography](https://www.sketchengine.eu/user-guide/tickbox-lexicography/) - TickBox Lexicography (TBL) is functionality for semi-automated compiling of dictionaries by looking up collocates and corresponding good dictionary examples. - [Tools for text analysis](https://www.sketchengine.eu/tools-for-text-analysis/) - Text analysis tools using linguistic criteria in 90+ languages: text mining, cooccurrence, keyword extraction and more. Try free 30-day trial! - [Timestamped Russian corpus](https://www.sketchengine.eu/timestamped-russian-corpus/) - Search the 5.5-billion-word Timestamped Russian corpus updated with new data daily. Carry out the diachronic analysis of Russian words. - [Timestamped Italian corpus](https://www.sketchengine.eu/timestamped-italian-corpus/) - Search the 8.4-billion-word Timestamped Italian corpus updated with new data daily. Carry out the diachronic analysis of Italian words. 
- [Timestamped German corpus](https://www.sketchengine.eu/timestamped-german-corpus/) - Search the 6.9-billion-word Timestamped German corpus updated with new data daily. Carry out the diachronic analysis of German words. - [Timestamped French corpus](https://www.sketchengine.eu/timestamped-french-corpus/) - Search the 6.8-billion-word Timestamped French corpus updated with new data daily. Carry out the diachronic analysis of French words. - [Indonesian TreeTagger PoS Tagset](https://www.sketchengine.eu/indonesian-treetagger-tagset/) - See a list of part-of-speech tags used to indicate grammatical categories in Indonesian corpora in Sketch Engine. - [What can Sketch Engine do?](https://www.sketchengine.eu/what-can-sketch-engine-do/) - Search for information about a word: combinations, synonyms, examples of use in context, translations, part of speech and use other corpus tools. - [Choisir le bon corpus](https://www.sketchengine.eu/choisir-le-bon-corpus/) - (page in French) From more than 500 corpora in the Sketch Engine database, select the one that suits your needs. - [Corpus TECU – Geodetics web corpus](https://www.sketchengine.eu/corpus-tecu/) - (page in Czech) Creation of specialized data and techniques for semi-automatic thesaurus extension. Comparable specialized corpora from the field of land surveying and the real estate cadastre. The text data for the corpus was collected by two methods from publicly available internet sources. All tools referenced below were developed at Centrum ZPJ (the Natural Language Processing Centre), FI MU, Brno. The collected documents were cleaned of non-textual and low-quality - [Translators - term extraction](https://www.sketchengine.eu/user-guide/translators-term-extraction/) - Using Sketch Engine for translation: how translators can use Sketch Engine, with tools and functions for translators.
- [Access to unlimited data](https://www.sketchengine.eu/guide/access-to-unlimited-wordlists/) - There is no limit to the number of word lists or concordance lines a user can generate; however, there is a limit to the length. - [Keyboard shortcuts](https://www.sketchengine.eu/keyboard-shortcuts/) - Keyboard shortcuts enable you to speed up your work in Sketch Engine. Most of the hotkeys require pressing two letters (first letter, release, second letter). - [Trados Studio plugin](https://www.sketchengine.eu/user-guide/trados-studio-plugin/) - The Sketch Engine plugin brings thesaurus, concordance, collocations and EUR-Lex search into Trados Studio. - [ALDF – Average Logarithmic Distance Frequency](https://www.sketchengine.eu/aldf-average-logarithmic-distance-frequency/) - Average Logarithmic Distance Frequency (ALDF) is a type of frequency indicating whether a token is distributed evenly throughout the whole corpus or not. - [Boot Camp Prague Airport](https://www.sketchengine.eu/bootcamp/boot-camp-prague-airport/) - [CAMeL Arabic part-of-speech tagset](https://www.sketchengine.eu/camel-arabic-part-of-speech-tagset/) - [Irish Universal dependencies tagset](https://www.sketchengine.eu/irish-universal-dependencies-tagset/) - See the Irish Universal dependencies tags, a part-of-speech tagset of UD tags used to indicate grammatical categories in Irish corpora in Sketch Engine. - [Universal POS tags](https://www.sketchengine.eu/tagsets/universal-pos-tags/) - Universal POS tags are used in various text corpora that have been tagged with Universal dependencies tagset. - [Russian part-of-speech tagset – multilingual MULTEXT-East specifications, version 4](https://www.sketchengine.eu/russian-tagset/) - Russian part-of-speech tagset – multilingual MULTEXT-East specifications – is a list of POS tags used to indicate grammatical categories for Russian corpora in Sketch Engine.
- [MULTEXT-East Ukrainian part-of-speech tagset (version 6)](https://www.sketchengine.eu/multext-east-ukrainian-part-of-speech-tagset/) - Ukrainian multilingual MULTEXT-East specifications are lists of POS tags used to indicate grammatical categories for Ukrainian corpora in Sketch Engine. - [Dutch TreeTagger part-of-speech tagset](https://www.sketchengine.eu/dutch-treetagger-tagset/) - Dutch TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for Dutch corpora in Sketch Engine. - [Bilingual term extraction](https://www.sketchengine.eu/guide/bilingual-term-extraction/) - Bilingual term extraction finds terms in texts in two languages and exports them as a TBX term base to be imported into a CAT tool or terminology management system. - [Afrikaans part-of-speech tagset](https://www.sketchengine.eu/afrikaans-part-of-speech-tagset/) - See a list of part-of-speech tags in the Afrikaans part-of-speech tagset used to indicate grammatical categories in Afrikaans corpora. - [Find your ideal job in the Sketch Engine team](https://www.sketchengine.eu/jobs/) - Ask a question about Sketch Engine. Would this tool be right for you? Check whether Sketch Engine has the features you need. - [Access after Elexis](https://www.sketchengine.eu/access-after-elexis/) - [Create corpus in an unsupported language](https://www.sketchengine.eu/corpora-and-languages/unsupported-language/) - Sketch Engine can handle corpora in languages which are not supported directly. It uses a universal tokenizer. - [Keywords and term extraction](https://www.sketchengine.eu/guide/keywords-and-term-extraction/) - Keywords and term extraction identifies words and phrases which are typical of a document or corpus. - [Annotation schema customization](https://www.sketchengine.eu/guide/annotation-schema-customization/) - Sketch Engine supports annotation of concordance lines. A default annotation schema can be customized according to the needs of your annotation task.
- [Lexical Computing Workshop at SPP Day (May 19, 2022)](https://www.sketchengine.eu/lexical-computing-workshop-at-spp-day-may-19-2022/) - [Sketch Engine access funded by ELEXIS](https://www.sketchengine.eu/elexis/) - Sketch Engine will be funded by the ELEXIS project between 2018 and 2022. All academic EU institutions are eligible. Request access to Sketch Engine for your institution. - [Storage space](https://www.sketchengine.eu/guide/user-administration/storage-space/) - [IFD Icelandic part-of-speech tagset](https://www.sketchengine.eu/ifd-icelandic-part-of-speech-tagset/) - See IFD Icelandic part-of-speech tagset and its list of part-of-speech tags used to indicate grammatical categories in Icelandic corpora in Sketch Engine. - [ArabCC – Learner Corpus of English Essays](https://www.sketchengine.eu/arabcc-learner-corpus-of-english-essays/) - Search ArabCC, the Learner Corpus of English Essays written by the native speakers of the Arabic language. Texts were part-of-speech tagged and lemmatized. - [Italian FreeLing part-of-speech tagset](https://www.sketchengine.eu/italian-freeling-part-of-speech-tagset/) - Italian FreeLing POS tagset is a list of POS tags used to indicate grammatical categories for Italian corpora in Sketch Engine. - [Hebrew part-of-speech tagsets](https://www.sketchengine.eu/hebrew-part-of-speech-tagsets/) - Sketch Engine provides various part-of-speech tagsets for Hebrew corpora to indicate grammatical categories in Hebrew corpora. - [Find X function – word highlights](https://www.sketchengine.eu/find-x-word-highlights/) - Find X function enriches the word sketch results by displaying additional information about the word usage, e.g. the word is usually used in the plural.
- [Prices for Academic Individual Users](https://www.sketchengine.eu/prices-for-academic-individual-users/) - 1st million: default, no fee; 2nd to 10th million: 11 €; 11th to 30th million: 5.5 €; 31st to 100th million: 2.5 €; 101st and any additional million: 1.5 €. INDIVIDUAL ACADEMIC SUBSCRIPTION: I am in an academic environment and I do not conduct any commercial activities. The prices are only valid for web purchases paid online. Yearly subscription - [Estonian Treebank tagset](https://www.sketchengine.eu/estonian-treebank-tagset/) - Estonian TreeTagger POS tagset is a list of POS tags used to indicate grammatical categories for Estonian corpora in Sketch Engine. - [Prices for non-academic individual users](https://www.sketchengine.eu/prices-for-commercial-individual-users/) - 1st million: default, no fee; 2nd to 10th million: 19 €; 11th to 30th million: 9.5 €; 31st to 100th million: 4.3 €; 101st and any additional million: 2.6 €. INDIVIDUAL ACCOUNT / FREELANCE SUBSCRIPTION: I'm a freelance translator, terminologist or copywriter and I do not conduct any other commercial activities including dictionary publishing. - [Burmese part-of-speech tagset](https://www.sketchengine.eu/burmese-pos-tagset/) - [Academic subscription quote](https://www.sketchengine.eu/academic-subscription-quote/) - [Prices for Commercial Users](https://www.sketchengine.eu/prices-for-commercial-users/) - COMMERCIAL SINGLE AND MULTI-USER LICENCE: I'm a lexicographer OR I conduct other kinds of commercial activities or dictionary publishing. Request personalized quotation - [Highlight Only Part of a Complex Query](https://www.sketchengine.eu/documentation/highlight-only-part-of-a-complex-query/) - I want to align a concordance according to a part of the query. How can I do that? Alternatively: I have a complex query and I don't want to have the whole match in the middle column.
This can be done using the CQL keyword within, e.g.: [word="where"] within [word=","][word="where"][word!=","]*[word=","] within Click here to - [Compatibility Matrix](https://www.sketchengine.eu/compatibility-matrix/) - This page provides a compatibility matrix of Sketch Engine components and requirements. Each line contains only two versions that always impose a constraint (component A version X.Z requires component B version Y.V). Finlib Manatee Bonito Cheetah PCRE SWIG GCC Antlr Gdex >=2.35 >=2.135 >=2.34 >=2.134 >=2.130 >=3.80 >=2.127 >=3.76.6 >=2.32 >=2.124 >= 2.31 >= 2.121.1 >=3.56.8 - [Sketch Engine calendar 2019](https://www.sketchengine.eu/sketch-engine-calendar-2019/) - Download the Sketch Engine calendar 2019 with basics of Corpus Query Language (CQL) and regular expressions (regex). - [Chinese Wikipedia corpus](https://www.sketchengine.eu/chinese-wikipedia-corpus/) - The Chinese Wikipedia corpus is a text corpus built up from the whole content of Chinese Wikipedia using the Wikipedia dump. - [Sketch Engine calendar 2018](https://www.sketchengine.eu/calendar-2018/) - Download the Sketch Engine calendar 2018 with useful CQL examples to search our corpora. - [Allowed language names in corpus configuration](https://www.sketchengine.eu/documentation/allowed-language-names-in-corpus-configuration/) - Find the list of language names allowed in corpus configuration in Sketch Engine. Find your language and create your own corpus. - [Comment créer un corpus à partir d'Internet](https://www.sketchengine.eu/comment-creer-un-corpus-a-partir-dinternet/) - (page in French) Use Sketch Engine's automatic corpus-building tool, which finds texts on the Internet that are relevant to you, downloads them and turns them into a corpus. - [Leçon sur le Thésaurus](https://www.sketchengine.eu/lecon-sur-le-thesaurus/) - (page in French) Generate a thesaurus for a word.
Sketch Engine can automatically create, within seconds, a thesaurus from millions of words in twelve languages.
- [Leçon sur le Profil lexical](https://www.sketchengine.eu/lecon-sur-le-profil-lexical/) - Create a word sketch (Profil lexical), a brief overview of a word's characteristics. This tool gathers information from thousands and millions of attested examples and provides, on one page, collocations sorted into categories with links to examples.
- [Chinese NEUSCP part-of-speech tagset](https://www.sketchengine.eu/chinese-neuscp-part-of-speech-tagset/) - Chinese NEUSCP POS tagset is a list of POS tags used to indicate grammatical categories for Chinese corpora in Sketch Engine.
- [Blog](https://www.sketchengine.eu/blog/) - Blog posts about Sketch Engine relating to corpus linguistics, term extraction, collocations, etc.
- [Catalan Web corpus](https://www.sketchengine.eu/catalanwac-corpus/) - Catalan web corpus (CatalanWaC) is a text corpus in Sketch Engine. Generate concordance, n-grams, collocations.
- [CLAWS tagset - mapping file](https://www.sketchengine.eu/claws-tagset/) - C8 to C7 mapping file. NS 2011-5-14. APPGE -> APPGE: possessive pronoun, pre-nominal (e.g. "my", "your", "our") AT -> AT: article (e.g. "the", "no") AT1 -> AT1: singular article (e.g. "a", "an", "every") BCL -> BCL: before-clause marker (e.g. "in order (that)", "in order (to)") CC -> CC: coordinating conjunction (e.g. "and", "or") CCB ->
- [Chinese symbol part-of-speech tagset](https://www.sketchengine.eu/chinese-symbol-part-of-speech-tagset/) - Chinese symbol part-of-speech tagset is a list of POS tags used to indicate grammatical categories for English corpora in Sketch Engine.
- [Source code](https://www.sketchengine.eu/source-code/) - English – 01_dog.txt I have a nice dog. It runs a lot. German – 01_Hund.txt Ich habe einen schönen Hund. Es läuft sehr viel. Spanish – 01_perro.txt Tengo un buen perro. Corre mucho.
- [Adding sentence boundaries to a compiled corpus](https://www.sketchengine.eu/documentation/adding-sentence-boundaries-to-a-compiled-corpus/) - This document explains how structures, such as documents, paragraphs and sentences, are stored in a compiled corpus and how they can be modified. We will illustrate this with a practical example. After compiling the LEXMCI corpus, we found that some of the included documents don't have the sentence boundaries marked. We could have simply
- [CLAWS7 Tagset](https://www.sketchengine.eu/claws7-tagset/) - An overview of all tags that are in the CLAWS7 tagset. These POS tags are included in the CLAWS6 tagset except for punctuation tags.
- [FinnishWaC corpus](https://www.sketchengine.eu/finnishwac-corpus/) - Finnish web as corpus.

## Tooltip Glossary

- [POS](https://www.sketchengine.eu/glossary/pos/) - POS stands for "part of speech" such as noun, adjective, verb, adverb etc. In Sketch Engine, POS also refers to a positional attribute which is assigned to tokens in the corpus and contains information about the part of speech only. It does not include any additional information. (Do not mistake it for tag.) POS may
- [escaping](https://www.sketchengine.eu/glossary/escaping/) - In regular expressions, escaping refers to canceling the special function of certain characters, typically when searching for punctuation. These characters must be escaped if you want to search for them literally: . ^ $ * + ? ( ) [ ] { } | \ In CQL, double quotes " must also be escaped. To find a full stop (dot), it has to be escaped with a
- [CQL](https://www.sketchengine.eu/glossary/cql/) - The Corpus Query Language is a code used to set criteria for complex searches which cannot be carried out using the standard user interface controls. The criteria may include words or lemmas but also tags and other attributes, text types or structures. Conditions can be set for optional tokens or token repetition.
- [lemma](https://www.sketchengine.eu/glossary/lemma/) - Lemma is a positional attribute. It is the basic form of a word, typically the form found in dictionaries. A lemmatized corpus allows searching for the basic form and including all forms of the word in the result, e.g. searching for lemma go will find go, goes, went, going, gone.
- [token](https://www.sketchengine.eu/glossary/token/)
- [gender lemma](https://www.sketchengine.eu/glossary/gender-lemma/) - The gender lemma is an attribute used in connection with term extraction. Its purpose is to display terminology in the correct word form in languages which observe the agreement in gender between adjectives and nouns. The standard lemma would produce a grammatically unacceptable word form combination. Examples Spanish word form lemma gender lemma cámaras compactas
- [lempos](https://www.sketchengine.eu/glossary/lempos/) - Lempos is a positional attribute, i.e. an attribute assigned to each token in the corpus. It is a combination of lemma and part of speech (pos) consisting of the lemma, a hyphen and a one-letter abbreviation of the part of speech, e.g. go-v, house-n. The letters used for the part of speech (-n,
- [longest-commonest match](https://www.sketchengine.eu/glossary/longest-commonest-match/) - The term longest-commonest match (LCM) was coined by Adam Kilgarriff to name the most common realisation of a collocation, i.e. the chunk of language in which the collocation appears most frequently. The longest-commonest match is part of the word sketch result screen to facilitate the understanding of how the collocation typically behaves. The longest-commonest match can reveal some
- [relative frequency, frequency per million](https://www.sketchengine.eu/glossary/freqmill/) - (also called freq/mill in the interface) is the number of occurrences of an item per million tokens, also called i.p.m. (instances per million).
It is used to compare frequencies between corpora (or datasets) of different sizes. Formula: number of hits / corpus size in millions of tokens = frequency per million (an alternative calculation producing
- [comparable corpus](https://www.sketchengine.eu/glossary/comparable-corpus/) - A comparable corpus is a corpus consisting of texts from the same domain in several languages. In contrast to a parallel corpus, the texts are not translations of each other but belong to the same domain and share the same metadata. An example of a comparable corpus is a corpus made from Wikipedia.
- [corpus architect](https://www.sketchengine.eu/glossary/corpus-architect/) - an intuitive tool inside Sketch Engine for creating corpora from documents or the Web which does not require any expert knowledge. See the create your own corpus page.
- [corpus manager](https://www.sketchengine.eu/glossary/corpus-manager/) - a program used to manage text corpora, i.e. to build, edit, annotate and search corpora. Sketch Engine is the user interface to the corpus manager Manatee.
- [focus corpus](https://www.sketchengine.eu/glossary/focus-corpus/) - In keyword and term extraction, the focus corpus is the corpus from which keywords and terms are extracted. Compare reference corpus.
- [learner corpus](https://www.sketchengine.eu/glossary/learner-corpus/) - A collection of texts produced by learners of a language, used to study the errors and mistakes that language learners make. Learner corpora in Sketch Engine can use both error and correction annotation. A special search interface is available to search by the former or the latter or both. See also Setting up a learner corpus
- [parallel corpus](https://www.sketchengine.eu/glossary/parallel-corpus/) - A parallel corpus consists of the same text translated into one or more languages. The texts are aligned (matching segments, usually sentences, are linked). The corpus allows searches in one or both languages to look up or compare translations.
- [preloaded corpus](https://www.sketchengine.eu/glossary/preloaded-corpus/) - a ready-to-use corpus included in a Sketch Engine subscription or Trial access, not created by a user, e.g. English Trends corpus
- [reference corpus](https://www.sketchengine.eu/glossary/reference-corpus/) - A reference corpus is used in keyword extraction and term extraction. A reference corpus is a corpus to which the focus corpus is compared. When using the Keywords & Terms tool, a reference corpus is preselected but the user can use a different corpus as the reference corpus. The reference corpus can but does not have to
- [user corpus](https://www.sketchengine.eu/glossary/user-corpus/) - a corpus created by a user. Users can create corpora by uploading their own data or using Sketch Engine to collect data from the Web. User corpora are created as private. No other user can access them. However, users can grant access to the corpus to individually selected users. This is called sharing. User corpora
- [corpus](https://www.sketchengine.eu/glossary/corpus/) - A corpus is a large collection of authentic texts used for studying language or generating linguistic data. Modern corpora contain texts whose total length is billions or tens of billions of words. A corpus is usually tagged (= annotated, i.e. the words are labelled with information about the part of speech and their grammatical category).
- [ARF – Average Reduced Frequency](https://www.sketchengine.eu/glossary/arf/) - a modified frequency which prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) which contains a high concentration of the token. If the token is evenly distributed across the corpus, ARF and absolute frequency will be similar or identical. In comparison with ALDF (Average Logarithmic Distance
- [concordancer](https://www.sketchengine.eu/glossary/concordancer/) - A concordancer is a tool (a piece of software) which searches a text corpus and displays a concordance.
A concordancer is one of the features in Sketch Engine which allows for simple corpus searches as well as queries involving complex criteria that search for grammatical or lexical structures. See also concordance
- [Lemmatization](https://www.sketchengine.eu/glossary/lemmatization/) - Lemmatization is a process of assigning a lemma to each word form in a corpus using an automatic tool called a lemmatizer. Lemmatization brings the benefit of searching for a base form of a word and getting all the derived forms in the result, e.g. searching for go will also find goes, went, gone, going. See also PoS tagger
- [word sketch grammar](https://www.sketchengine.eu/glossary/word-sketch-grammar/) - Word Sketch grammar (WSG) is a set of rules defining the grammatical relations (=columns/categories) in a Word Sketch. In other words, WSG tells Sketch Engine which words should be regarded as collocations of the search word and also what type of collocation they are. WSG defines the criteria using POS tags, distance between words, and
- [simple maths](https://www.sketchengine.eu/glossary/simple-maths/) - The simple maths formula is used to calculate the keyness score in Sketch Engine. This score is used to identify terms, keywords and also key n-grams and key collocations. It identifies items which appear more frequently in the focus corpus than in the reference corpus. It uses relative (per million) frequencies and, therefore, makes it
- [term](https://www.sketchengine.eu/glossary/term/) - Term is a concept used in connection with the Keywords & Terms tool. A term is a multi-word expression (consisting of several tokens) which appears more frequently in one corpus (focus corpus) compared to another corpus (reference corpus) and, at the same time, has the format of a term in the language.
The format
- [positional attribute](https://www.sketchengine.eu/glossary/positional-attribute/) - A positional attribute is information added to each token in a corpus, typically its lemma or tag. Attributes differ between languages and, occasionally, even between corpora in the same language. Here are some examples of attributes (word, lemma, tag, POS, lempos): dogs, dog, NNS, n, dog-n; dog, dog, NN, n, dog-n
- [vertical file](https://www.sketchengine.eu/glossary/vertical-file/) - A vertical file is a text file where each token (or word) is on a separate line. It is typically used for text corpora and may contain additional metainformation.
- [alignment](https://www.sketchengine.eu/glossary/alignment/) - Alignment is a term used in connection with parallel corpora. A parallel corpus consists of a text and its translation into one or more languages. Parallel corpora need to be divided into segments. A segment usually corresponds to a sentence. Alignment refers to information that tells Sketch Engine which segment (sentence) in one language is
- [Attribute](https://www.sketchengine.eu/glossary/attribute/) - An attribute can refer to: (1) a positional attribute, i.e. information added to each token in a corpus, e.g. its lemma or part of speech; (2) a structure attribute, i.e. information added to a structure in a corpus, often called metadata
- [word list](https://www.sketchengine.eu/glossary/word-list/) - A word list is a generic name for various types of lists such as lists of words, lemmas, POS tags or other attributes with their frequency (hit counts, document counts or others). See more about the wordlist function in Sketch Engine.
- [web mining](https://www.sketchengine.eu/glossary/web-mining/) - Web mining is an application of data mining which extracts information from texts. Web mining focuses on gaining information and metadata from the web.
For this task, Sketch Engine uses the fully automated tool WebBootCaT for creating corpora from the web, which also stores metadata of the processed websites. Read about other text analysis tools.
- [timeline](https://www.sketchengine.eu/glossary/timeline/) - The timeline function displays the changing frequency of a word or phrase over time. Timelines are not a standalone tool; they are included in the Concordance and Wordlist tools. Timelines are computed the same way as the graphs in Trends (a diachronic analysis of word usage); however, they can be generated for any word or even multi-word phrase, and the graph displays more details. See also
- [TBL](https://www.sketchengine.eu/glossary/tbl/) - an application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page
- [text analysis](https://www.sketchengine.eu/glossary/text-analysis/) - Text analysis (also content analysis or text analytics) is a method for analyzing (usually unstructured) text in order to extract information. The result of the text analysis is structured data. In addition to the traditional tools, Sketch Engine also offers some unique features. The traditional tools consist of various frequency-based statistics: word or lemma frequency,
- [text mining](https://www.sketchengine.eu/glossary/text-mining/) - Text mining is an automatic process of extracting information from text, such as keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT for creating corpora from the web and keyword and term extraction, which finds terminology in your texts. Read about other text analysis tools.
- [GDEX](https://www.sketchengine.eu/glossary/gdex/) - Good Dictionary Examples is a technology in Sketch Engine which can automatically identify sentences which are suitable as dictionary example sentences or as teaching examples, i.e. are illustrative and representative. GDEX can be applied to any concordance.
It will sort the lines and will place the ones with the best GDEX score to the
- [cooccurrence](https://www.sketchengine.eu/glossary/cooccurrence/) - Cooccurrence or co-occurrence is a term which expresses how often two terms from a corpus occur alongside each other in a certain order. It usually indicates words which together create a new meaning. We call them phrasemes or multi-word expressions, e.g. black sheep or get on. Sketch Engine helps to find such words using the word sketch tool or
- [document frequency (docf)](https://www.sketchengine.eu/glossary/document-frequency/) - The document frequency is the number of documents in which the token or phrase appears. If the corpus has 100 documents and 2 documents contain the word city: document number 7 contains 17 instances of city, document number 31 contains 6 instances of city, the document frequency of city is 2, because 2 documents contain
- [stem](https://www.sketchengine.eu/glossary/stem/) - A stem is a part of a word without its affixes (suffixes, prefixes, etc.). Stems do not have to be valid word forms, e.g. stem hav for the word form having, in comparison to lemma have for the word form having. Stems are used instead of lemmas or in addition to lemmas with languages whose morphology requires
- [T-score](https://www.sketchengine.eu/glossary/t-score/) - T-score expresses the certainty with which we can argue that there is an association between the words, i.e. their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant collocations. When comparing the
- [multilevel list](https://www.sketchengine.eu/glossary/multilevel-list/) - a list sorted at more than one level, e.g. a frequency list sorted by word form followed by lemma and then tag, see this multilevel list in the BAWE corpus.
- [distributional thesaurus](https://www.sketchengine.eu/glossary/distributional-thesaurus/) - an automatically produced thesaurus which identifies words that occur in similar contexts as the target word. It draws on the theory of distributional semantics. The automatically produced thesaurus is available for each word in the corpus. The distributional thesaurus in Sketch Engine is available for every language and corpus that supports word
- [CAT tool](https://www.sketchengine.eu/glossary/cat-tool/) - CAT tool stands for computer-assisted translation tool. It is software that helps translators maintain consistency in terminology across their translation jobs and also aids the translation process by suggesting (or translating automatically) passages (segments) which the translator already translated in the past. Data exported from CAT tools (translation memories) can be used
- [word sketch triple](https://www.sketchengine.eu/glossary/word-sketch-triple/) - A word sketch triple is a data format used for representing one collocation identified by the word sketch. A word sketch triple consists of: the node as lempos; the name of the grammatical relation as displayed in the header of the column in the word sketch interface; the collocate as lempos. school-n modifiers of "%w" secondary-j (to be understood
- [word sketch](https://www.sketchengine.eu/glossary/word-sketch/) - The word sketch is a tool to display collocations (=word combinations) in a compact, easy-to-understand way. The word sketch makes it easy to understand how a word behaves, which contexts it typically appears in and which words it can be used together with. The word sketch can typically display collocations of only nouns, adjectives, verbs and
- [word form](https://www.sketchengine.eu/glossary/word-form/) - This entry is for the positional attribute: word form, lemma, lowercase, tag… For the type of token, the opposite of nonword, see word.
The word form (often shortened to word in the interface) is a positional attribute. It refers to one of the word forms that a lemma can take, e.g.
- [word](https://www.sketchengine.eu/glossary/word/) - Note: This entry is for the type of token. For the positional attribute, see word form. A word is a type of token. All tokens in a corpus are divided into two groups: words and nonwords. Words are tokens which begin with a letter of the alphabet. Tokens such as book, working, Mary, T-shirt, post-1945,
- [UMS](https://www.sketchengine.eu/glossary/ums/) - a feature available to users with a local installation for the administration of users and corpora.
- [Type/token ratio (TTR)](https://www.sketchengine.eu/glossary/type-token-ratio-ttr/) - The type/token ratio, often shortened to TTR, is a simple measure of lexical diversity. It can only be interpreted by comparing it to the TTR of a different text (corpus). The corpus with a higher TTR contains a higher variety of words than the other corpus. In other words, the authors use more different words, or richer
- [trends](https://www.sketchengine.eu/glossary/trends/) - Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of the word (or other attributes) changes over time.
- [translation memory](https://www.sketchengine.eu/glossary/translation-memory/) - A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
- [tokenizer](https://www.sketchengine.eu/glossary/tokenizer/) - A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens.
Sketch Engine contains tokenizers for many languages and also a universal tokenizer used for languages not yet supported by Sketch Engine.
- [tokenization](https://www.sketchengine.eu/glossary/tokenization/) - For the corpus to work, the corpus text should first be divided into individual tokens. Tokenization is the automatic process of dividing text into tokens. This process is performed by tools called tokenizers.
- [TMX - Translation Memory eXchange format](https://www.sketchengine.eu/glossary/tmx-translation-memory-exchange-format/) - Translation Memory eXchange (TMX) is a specific XML format used for creating parallel corpora in Sketch Engine. This format is the standard format for translation memories (TM). See more about Setting up parallel corpora in Sketch Engine. An example of a TMX document (from Wikipedia), the following structures are required for creating parallel corpora: ,
- [text type selector](https://www.sketchengine.eu/glossary/text-type-selector/) - Any search in Sketch Engine can be limited to certain text types only. The results will be taken from documents annotated with the specific text type(s). Users can include metadata in their corpora. If the metadata are in the required format, they will be converted to text types and will appear in the text type
- [text type](https://www.sketchengine.eu/glossary/text-type/) - [We follow Biber (1989) in using text type as a generic term for the many ways in which a text might be classified.] A text type refers to values assigned to structures (e.g. documents, paragraphs, sentences or others) inside a corpus. Text types can refer to the source (newspaper, book, etc.), medium (spoken, written), time (year, century),
- [term grammar](https://www.sketchengine.eu/glossary/term-grammar/) - A term grammar is a set of rules written in CQL which define the lexical structures, typically noun phrases, which should be included in term extraction. The lexical structures are defined using POS tags and CQL.
The use of a term grammar ensures a clean term extraction result which requires very little post-editing. For
- [term extraction](https://www.sketchengine.eu/glossary/term-extraction/) - the process of identifying subject-specific vocabulary in a subject-specific text, usually using specialized software. The identification of one-word and multi-word terms in Sketch Engine is based on the comparison of the frequency of such words and phrases between the reference corpus and the focus corpus. compare keywords related topics term extraction explained (blog) term
- [term base](https://www.sketchengine.eu/glossary/term-base/) - In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
- [tagset](https://www.sketchengine.eu/glossary/tagset/) - (also called tag set) is a list of part-of-speech tags used in one corpus. In Sketch Engine, corpora in the same language tend to use the same tagset but exceptions exist. To check the tagset used, access Corpus statistics and details. See our blog about POS tags.
- [tag](https://www.sketchengine.eu/glossary/tag/) - (also called part-of-speech tag, POS tag or morphological tag) is a label assigned to each token in an annotated corpus to indicate the part of speech and often also grammatical categories and morphological information. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called
- [subcorpus](https://www.sketchengine.eu/glossary/subcorpus/) - a corpus can be subdivided into an unlimited number of parts called subcorpora. Subcorpora can be used to divide the corpus by the type (fiction, newspaper), media (spoken, written) or time (e.g. by years) or by any other criteria.
A subcorpus can also be created from a concordance by including all concordance lines and the
- [structure](https://www.sketchengine.eu/glossary/structure/) - a corpus structure refers to the segments or parts into which a corpus can be divided. Typically, a corpus is divided into sentences, paragraphs and documents but the corpus author can introduce various other structures to allow the analysis to focus on smaller or larger parts of the corpus. see a list of common corpus
- [stemming](https://www.sketchengine.eu/glossary/stemming/) - Stemming is the process of stripping a word of its affixes (suffixes, prefixes, etc.) so that only the stem remains. Stemming is used to detect related words with the same stem, i.e. the word root which does not change with case, number or tense. Word stems are available in the Portuguese corpus ptTenTen or the Turkish corpus
- [segment](https://www.sketchengine.eu/glossary/segment/) - Segments refer to the parts into which a parallel (multilingual) corpus is divided for the purpose of alignment. Alignment means that the corpus contains information about which segment in one language is a translation of which segment in another language. Segments typically correspond to sentences but some corpora can be aligned at a paragraph or
- [search span](https://www.sketchengine.eu/glossary/search-span/) - the number of tokens on either side of the node that will be considered when filtering a concordance. A search span set from -5 to 5 means that the filter matches the concordance lines in which the filter criterion occurs within 5 tokens on either side of the node.
- [search attribute](https://www.sketchengine.eu/glossary/search-attribute/) - the attribute that is used for the search and for creating a word list. You can have a word list of words, lemmas, tags, etc.
- [salience](https://www.sketchengine.eu/glossary/salience/) - a statistical measure of the significance of a specific token in the given context.
This is measured with logDice; for more information, see section 3 of Statistics used in Sketch Engine.
- [relative text type frequency](https://www.sketchengine.eu/glossary/relative-text-type-frequency/) - (also called Relative density in the interface) Relative text type frequency compares the frequency in a specific text type to the frequency in the whole corpus. It shows how typical the word(s) is of a specific text type, e.g. of the spoken part of the corpus or of a particular website which the texts were downloaded
- [regular expressions](https://www.sketchengine.eu/glossary/regular-expressions/) - a collection of special symbols that can be used to search for patterns rather than specific characters, e.g. to find all words starting with, containing or ending in a specific sequence of characters. For example, .*tion will find all words ending in tion and having an unlimited number of characters at the beginning.
- [query](https://www.sketchengine.eu/glossary/query/) - a sequence of characters or words or their combinations entered by the user in order to retrieve a concordance. Often, the word query is not restricted to the concordance only but can also refer to any type of search or criteria used in connection with any Sketch Engine feature, i.e. Word Sketch, thesaurus, word list etc.
- [prevertical file](https://www.sketchengine.eu/glossary/prevertical-file/) - A prevertical file is a plain text file that contains the corpus text and structures. Usually, it is the source file from which vertical files are created by the tokenization process. An example of a prevertical file with corpus structures for documents, chapters, and paragraphs:
- [POS tagger](https://www.sketchengine.eu/glossary/pos-tagger/) - POS (part of speech) tagging is a process of annotating each token with a tag carrying information about the part of speech and often also morphological and grammatical information such as number, gender, case, tense etc.
The automatic tagging tool is called a tagger or POS tagger. See also lemmatization, stemming
- [POS tag](https://www.sketchengine.eu/glossary/pos-tag/) - A POS tag (also part-of-speech tag) is the same as tag. Do not mistake it for POS, the simplified POS tag showing only the part-of-speech information but not the additional morphological and grammatical information. See also positional attributes, lempos, lemma
- [overall score](https://www.sketchengine.eu/glossary/overall-score/) - score of the relation based on logDice in word sketches. The score is displayed in the header of each column of the relation.
- [non-word](https://www.sketchengine.eu/glossary/non-word/) - Non-words (also spelt nonwords) are tokens which do not start with a letter of the alphabet. Examples of non-words are numbers, punctuation but also tokens such as 25-hour, 16-year-old, !mportant, 3D. Tokens such as post-1945, mp3 or CO2 are words because they start with a letter. The regular expression Sketch Engine uses to identify non-words
- [node](https://www.sketchengine.eu/glossary/node/) - (talking about collocations) central word in a collocation, e.g.
strong wind consists of the collocate strong and the node wind; (talking about concordances) the search word or phrase, sometimes called a query, which appears in the centre of a KWIC concordance or highlighted in other types of concordances
- [minimum sensitivity](https://www.sketchengine.eu/glossary/minimum-sensitivity/) - a statistical measure similar to logDice which is the minimum of the following two numbers: the number of co-occurrences divided by the frequency of the collocate; the number of co-occurrences divided by the frequency of the node word. The minimum sensitivity number grows with a high number of co-occurrences and falls with a high number of occurrences of
- [MI Score](https://www.sketchengine.eu/glossary/mi-score/) - The Mutual Information score expresses the extent to which words co-occur compared to the number of times they appear separately. MI Score is strongly affected by frequency: low-frequency words tend to reach a high MI score, which may be misleading. This is why Sketch Engine allows setting a frequency limit so that low-frequency words
- [metadata](https://www.sketchengine.eu/glossary/metadata/) - information about the texts in the corpus: for example, year of publication, author name, publishing house, medium (written, spoken), register (formal, informal) etc. Metadata are automatically converted to text types in Sketch Engine. see Annotate a corpus
- [macro](https://www.sketchengine.eu/glossary/macro/) - Macro is a concordance feature that automates your usual concordance operations. Macros let you save all the actions applied to the concordance and carry them out automatically on future concordances.
- [longtag](https://www.sketchengine.eu/glossary/longtag/) - Longtag is a detailed part-of-speech tag which usually contains more information than tag. Some corpora have tags containing only basic information on parts of speech and an additional attribute, longtag, containing detailed grammatical information such as case, number, gender, etc.
Longtags are available in the Estonian corpus etTenTen or the Turkish corpus trTenTen. - [n-gram](https://www.sketchengine.eu/glossary/n-gram/) - is a sequence of items (bigram = 2 items, trigram = 3 items ... n-gram = n items). An item can refer to anything (letter, digit, syllable, token, word or others). In the context of corpora and corpus linguistics, n-grams typically refer to tokens (or words). In linguistics, n-grams are sometimes referred to as - [logDice](https://www.sketchengine.eu/glossary/logdice/) - a statistical measure for identifying co-occurrence (= two items appearing together). Sketch Engine uses it to identify collocations. It expresses the typicality (or strength) of the collocation. It is used in the word sketch feature and also when computing collocations from a concordance. It is only based on the frequency of the node and the collocate - [log-likelihood](https://www.sketchengine.eu/glossary/log-likelihood/) - one of the functions used to compute statistics in Sketch Engine. It is an association measure based on the likelihood function, used in tests for significance (see the log-likelihood calculator and more details) - [likelihood](https://www.sketchengine.eu/glossary/likelihood/) - a function of the parameters of a statistical model; it plays a key role in statistical inference and is the basis for the log-likelihood function. See Statistics in Sketch Engine - [lempos_lc](https://www.sketchengine.eu/glossary/lempos_lc/) - lempos_lc is a positional attribute. It is a lowercased version of lempos. All uppercase letters are converted to lowercase, thus House-n becomes identical with house-n. It is used for case-insensitive searching and analysis. See also lempos, list of attributes - [lemma_lc](https://www.sketchengine.eu/glossary/lemma_lc/) - lemma_lc is a positional attribute. It is a lemma converted to lowercase. apple and Apple are treated as the same thing.
It is used for case-insensitive searching and case-insensitive analysis. See lemma - [lc](https://www.sketchengine.eu/glossary/lc/) - lc (also referred to as word_lc, word lowercase or word form lowercase) is a positional attribute assigned to each token in the corpus. The lc attribute is a lowercased version of the word attribute: John becomes john, Apple becomes apple, BE becomes be. The lc attribute makes the upper case and - [KWIC](https://www.sketchengine.eu/glossary/kwic/) - KWIC is the acronym for Key Word in Context and refers to the red text highlighted in a concordance. The red text is the result that matches the search criteria. Such a concordance is referred to as a KWIC concordance. The KWIC concordance is the preferred format for displaying concordance data because it is easy to - [keyword](https://www.sketchengine.eu/glossary/keyword/) - (Not to be confused with terms, which are a related concept.) Keywords are a concept used in connection with Keyword & Term extraction. Keywords are words (single-token items) that appear more frequently in the focus corpus than in the reference corpus. They are used to identify what is specific to a corpus (focus corpus) or its - [Grammatical relation](https://www.sketchengine.eu/glossary/grammatical-relation/) - A grammatical relation, or gramrel, refers to one column in the word sketch. Each column represents a category which displays collocates with the same relation to the search word, e.g. subjects of a verb or modifiers of a noun. Some columns may also display the usage statistics of the search word instead of collocates, e.g. - [header field](https://www.sketchengine.eu/glossary/header-field/) - various types of information associated with documents of a corpus, e.g.
a corpus with documents from different domains can be structured according to these domains by using header fields and their values "nameofdomain" = - [glue](https://www.sketchengine.eu/glossary/glue/) - A glue is a special structure inserted into a corpus to tell Sketch Engine that two tokens, which would otherwise be displayed with a space in between, should actually be displayed without a space. Typically do and n't will have glue between them to be displayed as don't. A glue does not have any - [global subcorpus](https://www.sketchengine.eu/glossary/global-subcorpus/) - a subcorpus that is shared with all users. See the instructions on how to share a subcorpus with all users. - [document](https://www.sketchengine.eu/glossary/document/) - A document (called a file in old corpora) in Sketch Engine refers to any file, document or webpage the corpus is made up of. If a user uploads a file (such as .doc, .pdf, .txt), each of the files becomes a corpus document. If the user downloads content from the web, each web page becomes - [disambiguation](https://www.sketchengine.eu/glossary/disambiguation/) - a process of identifying meanings of words (lemma, part of speech) when a word has multiple meanings. The result of this process is one word with one meaning. - [deduplication](https://www.sketchengine.eu/glossary/deduplication/) - Deduplication is a process of removing duplicated content from a corpus. Only the first instance of the text is preserved, any subsequent (duplicated) occurrences are removed. Deduplication is especially important with corpora built by crawling the web. This is because lots of web content is reposted and shared to other locations. Including the same content - [CSV](https://www.sketchengine.eu/glossary/csv/) - a type of plain text document used for saving tabular data. It is seamlessly accepted by a large variety of applications and is therefore ideal for exporting Sketch Engine results to be used in other software.
CSV can be opened directly in Microsoft Excel, Open Office, Google Documents and many others. - [CoNLL format](https://www.sketchengine.eu/glossary/conll-format/) - CoNLL format is a specific format of the vertical file that represents a syntactic parse tree. - [concordance](https://www.sketchengine.eu/glossary/concordance/) - a list of all examples of the search word or phrase found in a corpus, usually in the format of a KWIC concordance with the search word highlighted in the centre of the screen and some context to the right and to the left. See also KWIC - [compile](https://www.sketchengine.eu/glossary/compile/) - A corpus compilation refers to the processing of the corpus data (text) with the tools available for the language and converting the text into a corpus. Only a compiled corpus can be searched. See corpus compilation - [collocation](https://www.sketchengine.eu/glossary/collocation/) - a collocation is a sequence or combination of words that occur together more often than would be expected by chance (from Wikipedia: Collocation). A collocation, e.g. fatal error, typically consists of a node (error) and a collocate (fatal). The words in a collocation may appear immediately next to each other or at a certain distance from each other, - [collocate](https://www.sketchengine.eu/glossary/collocate/) - a part of a collocation that is not the node. A collocate is dependent on the node. The collocate strong and the node wind make up the collocation strong wind. Further examples (collocate + node): strong wind, icy wind, cold wind. The most typical collocates for every word in the language can be generated with the word sketch tool. - [cluster](https://www.sketchengine.eu/glossary/cluster/) - a process of creating groups of words in the thesaurus or word sketch. Words are grouped by their shared collocational behavior.
See more in the Clustering Neighbours documentation - [frequency](https://www.sketchengine.eu/glossary/frequency/) - Frequency (also absolute frequency) refers to the number of occurrences or hits. If a word, phrase, tag etc. has a frequency of 10, it means it was found 10 times or it exists 10 times. It is an absolute figure. It is not calculated using a specific formula. Compare frequency per million. See also ARF - [ALDF – Average Logarithmic Distance Frequency](https://www.sketchengine.eu/glossary/aldf/) - a modified frequency that prevents the result from being excessively influenced by one part of the corpus (e.g. one or more documents) which contains a high concentration of the token. If the token is evenly distributed across the corpus, ALDF and absolute frequency will be similar or identical. In comparison with ARF (Average Reduced Frequency), ## Categories - [Events](https://www.sketchengine.eu/category/sketch-engine-events/) - Subscribe to news by email - [News](https://www.sketchengine.eu/category/news/) - Get notified by email. Subscribe to news - [blog](https://www.sketchengine.eu/category/blog/) - Sketch Engine blog provides a simple introduction to various topics relating to text corpora and their processing including illustrative examples. ## Tags - [Corpus](https://www.sketchengine.eu/tag/corpus/) - A page relevant to corpora.
- [Europarl](https://www.sketchengine.eu/tag/europarl/) - [teacher](https://www.sketchengine.eu/tag/teacher/) - [student](https://www.sketchengine.eu/tag/student/) - [faq](https://www.sketchengine.eu/tag/faq/) - [Frequently questions](https://www.sketchengine.eu/tag/frequently-questions/) - [Tutorials](https://www.sketchengine.eu/tag/tutorials/) - [video](https://www.sketchengine.eu/tag/video/) - [View Options](https://www.sketchengine.eu/tag/view-options/) - [options](https://www.sketchengine.eu/tag/options/) - [page size](https://www.sketchengine.eu/tag/page-size/) - [size](https://www.sketchengine.eu/tag/size/) - [checkbox](https://www.sketchengine.eu/tag/checkbox/) - [icon](https://www.sketchengine.eu/tag/icon/) - [configuration](https://www.sketchengine.eu/tag/configuration/) - [allowed language](https://www.sketchengine.eu/tag/allowed-language/) - [subcorpus](https://www.sketchengine.eu/tag/subcorpus/) - [non-sharing](https://www.sketchengine.eu/tag/non-sharing/) - [non-shared](https://www.sketchengine.eu/tag/non-shared/) - [sharing](https://www.sketchengine.eu/tag/sharing/) - [shared](https://www.sketchengine.eu/tag/shared/) - [attributes](https://www.sketchengine.eu/tag/attributes/) - [dynamic attribute](https://www.sketchengine.eu/tag/dynamic-attribute/) - [DYNAMIC](https://www.sketchengine.eu/tag/dynamic/) - [DYNLIB](https://www.sketchengine.eu/tag/dynlib/) - [configuration file](https://www.sketchengine.eu/tag/configuration-file/) - [attribute](https://www.sketchengine.eu/tag/attribute/) - [corpus configuration file](https://www.sketchengine.eu/tag/corpus-configuration-file/) - [registry](https://www.sketchengine.eu/tag/registry/) - [preparing corpus](https://www.sketchengine.eu/tag/preparing-corpus/) - [preparing](https://www.sketchengine.eu/tag/preparing/) - [index](https://www.sketchengine.eu/tag/index/) - [vertical text](https://www.sketchengine.eu/tag/vertical-text/) - [Dutch](https://www.sketchengine.eu/tag/dutch/) - 
[ANW](https://www.sketchengine.eu/tag/anw/) - [Nederlands](https://www.sketchengine.eu/tag/nederlands/) - [balanced](https://www.sketchengine.eu/tag/balanced/) - [Aclwac](https://www.sketchengine.eu/tag/aclwac/) - [ACL](https://www.sketchengine.eu/tag/acl/) - [Anthology Reference](https://www.sketchengine.eu/tag/anthology-reference/) - [ARC](https://www.sketchengine.eu/tag/arc/) - [Arabic](https://www.sketchengine.eu/tag/arabic/) - [WaC](https://www.sketchengine.eu/tag/wac/) - [blog](https://www.sketchengine.eu/tag/blog/) - [Persian](https://www.sketchengine.eu/tag/persian/) - [Farsi](https://www.sketchengine.eu/tag/farsi/) - [Basque](https://www.sketchengine.eu/tag/basque/) - [Euskara](https://www.sketchengine.eu/tag/euskara/) - [Brasil](https://www.sketchengine.eu/tag/brasil/) - [Portuguese](https://www.sketchengine.eu/tag/portuguese/) - [Brasileiro](https://www.sketchengine.eu/tag/brasileiro/) - [BROWN](https://www.sketchengine.eu/tag/brown/) - [American English](https://www.sketchengine.eu/tag/american-english/) - [Bulgarian](https://www.sketchengine.eu/tag/bulgarian/) - [BulgarianNC](https://www.sketchengine.eu/tag/bulgariannc/) - [English](https://www.sketchengine.eu/tag/english/) - [Academic Language](https://www.sketchengine.eu/tag/academic-language/) - [immigration](https://www.sketchengine.eu/tag/immigration/) - [COMPAS](https://www.sketchengine.eu/tag/compas/) - [Students](https://www.sketchengine.eu/tag/students/) - [lexicographer](https://www.sketchengine.eu/tag/lexicographer/) - [tutorial](https://www.sketchengine.eu/tag/tutorial/) - [translator](https://www.sketchengine.eu/tag/translator/) - [GDEX](https://www.sketchengine.eu/tag/gdex/) - [dictionary](https://www.sketchengine.eu/tag/dictionary/) - [how to](https://www.sketchengine.eu/tag/how-to/) - [syntax](https://www.sketchengine.eu/tag/syntax/) - [classifier](https://www.sketchengine.eu/tag/classifier/) - [Tickbox](https://www.sketchengine.eu/tag/tickbox/) - 
[lexicography](https://www.sketchengine.eu/tag/lexicography/) - [TBL](https://www.sketchengine.eu/tag/tbl/) - [manual](https://www.sketchengine.eu/tag/manual/) - [parallel](https://www.sketchengine.eu/tag/parallel/) - [mapping](https://www.sketchengine.eu/tag/mapping/) - [terminology](https://www.sketchengine.eu/tag/terminology/) - [keyword](https://www.sketchengine.eu/tag/keyword/) - [extract keywords](https://www.sketchengine.eu/tag/extract-keywords/) - [terms](https://www.sketchengine.eu/tag/terms/) - [extraction](https://www.sketchengine.eu/tag/extraction/) - [Clustering](https://www.sketchengine.eu/tag/clustering/) - [Neighbours documentation](https://www.sketchengine.eu/tag/neighbours-documentation/) - [n-gram](https://www.sketchengine.eu/tag/n-gram/) - [bigram](https://www.sketchengine.eu/tag/bigram/) - [trigram](https://www.sketchengine.eu/tag/trigram/) - [dialogue](https://www.sketchengine.eu/tag/dialogue/) - [historical](https://www.sketchengine.eu/tag/historical/) - [historian](https://www.sketchengine.eu/tag/historian/) - [latin](https://www.sketchengine.eu/tag/latin/) - [German](https://www.sketchengine.eu/tag/german/) - [newspaper](https://www.sketchengine.eu/tag/newspaper/) - [12](https://www.sketchengine.eu/tag/12/) - [middle english](https://www.sketchengine.eu/tag/middle-english/) - [tools](https://www.sketchengine.eu/tag/tools/) - [language resources](https://www.sketchengine.eu/tag/language-resources/) - [onion](https://www.sketchengine.eu/tag/onion/) - [unitok](https://www.sketchengine.eu/tag/unitok/) - [justext](https://www.sketchengine.eu/tag/justext/) - [chared](https://www.sketchengine.eu/tag/chared/) - [automatic collocation dictionaries](https://www.sketchengine.eu/tag/automatic-collocation-dictionaries/) - [Cantonese](https://www.sketchengine.eu/tag/cantonese/) - [Corpus Factory](https://www.sketchengine.eu/tag/corpus-factory/) - [Bengali](https://www.sketchengine.eu/tag/bengali/) - [Bosnian](https://www.sketchengine.eu/tag/bosnian/) - 
[Serbian](https://www.sketchengine.eu/tag/serbian/) - [Croatian](https://www.sketchengine.eu/tag/croatian/) - [WaC corpora](https://www.sketchengine.eu/tag/wac-corpora/) - [History](https://www.sketchengine.eu/tag/history/) - [Nineteenth](https://www.sketchengine.eu/tag/nineteenth/) - [List of WaC](https://www.sketchengine.eu/tag/list-of-wac/) - [Filipino](https://www.sketchengine.eu/tag/filipino/) - [Specific Corpora](https://www.sketchengine.eu/tag/specific-corpora/) - [environment](https://www.sketchengine.eu/tag/environment/) - [named entity](https://www.sketchengine.eu/tag/named-entity/) - [Korpus](https://www.sketchengine.eu/tag/korpus/) - [TECU](https://www.sketchengine.eu/tag/tecu/) - [zeměměřictví](https://www.sketchengine.eu/tag/zememerictvi/) - [katastr](https://www.sketchengine.eu/tag/katastr/) - [nemovitost](https://www.sketchengine.eu/tag/nemovitost/) - [geodetics](https://www.sketchengine.eu/tag/geodetics/) - [e-flex](https://www.sketchengine.eu/tag/e-flex/) - [art](https://www.sketchengine.eu/tag/art/) - [Science](https://www.sketchengine.eu/tag/science/) - [Domain specific](https://www.sketchengine.eu/tag/domain-specific/) - [List of corpora](https://www.sketchengine.eu/tag/list-of-corpora/) - [domain](https://www.sketchengine.eu/tag/domain/) - [Domain specific corpora](https://www.sketchengine.eu/tag/domain-specific-corpora/) - [danishWaC](https://www.sketchengine.eu/tag/danishwac/) - [Danish](https://www.sketchengine.eu/tag/danish/) - [FrisianWaC](https://www.sketchengine.eu/tag/frisianwac/) - [Frisian](https://www.sketchengine.eu/tag/frisian/) - [FinnishWaC](https://www.sketchengine.eu/tag/finnishwac/) - [Finnish](https://www.sketchengine.eu/tag/finnish/) - [georgianWaC](https://www.sketchengine.eu/tag/georgianwac/) - [Georgian](https://www.sketchengine.eu/tag/georgian/) - [NeuroLingo](https://www.sketchengine.eu/tag/neurolingo/) - [tagset](https://www.sketchengine.eu/tag/tagset/) - [Greek](https://www.sketchengine.eu/tag/greek/) - 
[Patakis](https://www.sketchengine.eu/tag/patakis/) - [Gujarati](https://www.sketchengine.eu/tag/gujarati/) - [Gujarathi](https://www.sketchengine.eu/tag/gujarathi/) - [gujarathiWac](https://www.sketchengine.eu/tag/gujarathiwac/) - [DANTE](https://www.sketchengine.eu/tag/dante/) - [lexical database](https://www.sketchengine.eu/tag/lexical-database/) - [UKWaC](https://www.sketchengine.eu/tag/ukwac/) - [British English](https://www.sketchengine.eu/tag/british-english/) - [UKWaCsst](https://www.sketchengine.eu/tag/ukwacsst/) - [Turkish](https://www.sketchengine.eu/tag/turkish/) - [TurkishWaC](https://www.sketchengine.eu/tag/turkishwac/) - [Thai](https://www.sketchengine.eu/tag/thai/) - [ThaiWaC](https://www.sketchengine.eu/tag/thaiwac/) - [WelshWaC](https://www.sketchengine.eu/tag/welshwac/) - [Welsh](https://www.sketchengine.eu/tag/welsh/) - [SDeWaC](https://www.sketchengine.eu/tag/sdewac/) - [TeluguWaC](https://www.sketchengine.eu/tag/teluguwac/) - [Telugu](https://www.sketchengine.eu/tag/telugu/) - [SwedishWaC](https://www.sketchengine.eu/tag/swedishwac/) - [Swedish](https://www.sketchengine.eu/tag/swedish/) - [SpanishWaC](https://www.sketchengine.eu/tag/spanishwac/) - [Spanish](https://www.sketchengine.eu/tag/spanish/) - [SetswanaWaC2](https://www.sketchengine.eu/tag/setswanawac2/) - [Setswana](https://www.sketchengine.eu/tag/setswana/) - [SamoanWaC1](https://www.sketchengine.eu/tag/samoanwac1/) - [Samoan](https://www.sketchengine.eu/tag/samoan/) - [NepaliWaC](https://www.sketchengine.eu/tag/nepaliwac/) - [Nepali](https://www.sketchengine.eu/tag/nepali/) - [MalaysianWaC](https://www.sketchengine.eu/tag/malaysianwac/) - [Malay](https://www.sketchengine.eu/tag/malay/) - [JpWaC](https://www.sketchengine.eu/tag/jpwac/) - [Japanese](https://www.sketchengine.eu/tag/japanese/) - [ItWaC](https://www.sketchengine.eu/tag/itwac/) - [Italian](https://www.sketchengine.eu/tag/italian/) - [IgboWaC](https://www.sketchengine.eu/tag/igbowac/) - 
[Igbo](https://www.sketchengine.eu/tag/igbo/) - [HindiWaC](https://www.sketchengine.eu/tag/hindiwac/) - [Hindi](https://www.sketchengine.eu/tag/hindi/) - [daTenTen](https://www.sketchengine.eu/tag/datenten/) - [TenTen](https://www.sketchengine.eu/tag/tenten/) - [arTenTen](https://www.sketchengine.eu/tag/artenten/) - [bgTenTen](https://www.sketchengine.eu/tag/bgtenten/) - [Catalan](https://www.sketchengine.eu/tag/catalan/) - [czTenTen](https://www.sketchengine.eu/tag/cztenten/) - [Czech](https://www.sketchengine.eu/tag/czech/) - [RFTagger](https://www.sketchengine.eu/tag/rftagger/) - [deTenTen](https://www.sketchengine.eu/tag/detenten/) - [elTenTen](https://www.sketchengine.eu/tag/eltenten/) - [enTenTen](https://www.sketchengine.eu/tag/ententen/) - [esAmTenTen](https://www.sketchengine.eu/tag/esamtenten/) - [Spanish American](https://www.sketchengine.eu/tag/spanish-american/) - [esTenTen](https://www.sketchengine.eu/tag/estenten/) - [European Spanish](https://www.sketchengine.eu/tag/european-spanish/) - [etTenTen](https://www.sketchengine.eu/tag/ettenten/) - [Estonian](https://www.sketchengine.eu/tag/estonian/) - [fiTenTen](https://www.sketchengine.eu/tag/fitenten/) - [frTenTen](https://www.sketchengine.eu/tag/frtenten/) - [French](https://www.sketchengine.eu/tag/french/) - [heTenTen](https://www.sketchengine.eu/tag/hetenten/) - [Hebrew](https://www.sketchengine.eu/tag/hebrew/) - [huTenTen](https://www.sketchengine.eu/tag/hutenten/) - [Hungarian](https://www.sketchengine.eu/tag/hungarian/) - [itTenTen](https://www.sketchengine.eu/tag/ittenten/) - [jpTenTen](https://www.sketchengine.eu/tag/jptenten/) - [MeCab](https://www.sketchengine.eu/tag/mecab/) - [lvTenTen](https://www.sketchengine.eu/tag/lvtenten/) - [Latvian](https://www.sketchengine.eu/tag/latvian/) - [Korean](https://www.sketchengine.eu/tag/korean/) - [HanNanum](https://www.sketchengine.eu/tag/hannanum/) - [ltTenTen](https://www.sketchengine.eu/tag/lttenten/) - 
[Lithuanian](https://www.sketchengine.eu/tag/lithuanian/) - [koTenTen](https://www.sketchengine.eu/tag/kotenten/) - [ptTenTen](https://www.sketchengine.eu/tag/pttenten/) - [plTenTen](https://www.sketchengine.eu/tag/pltenten/) - [Polish](https://www.sketchengine.eu/tag/polish/) - [noTenTen](https://www.sketchengine.eu/tag/notenten/) - [Norwegian](https://www.sketchengine.eu/tag/norwegian/) - [TreeTagger](https://www.sketchengine.eu/tag/treetagger/) - [nlTenTen](https://www.sketchengine.eu/tag/nltenten/) - [zhTenTen](https://www.sketchengine.eu/tag/zhtenten/) - [Chinese](https://www.sketchengine.eu/tag/chinese/) - [yoTenTen](https://www.sketchengine.eu/tag/yotenten/) - [Yoruba](https://www.sketchengine.eu/tag/yoruba/) - [uaTenTen](https://www.sketchengine.eu/tag/uatenten/) - [Ukrainian](https://www.sketchengine.eu/tag/ukrainian/) - [trTenTen](https://www.sketchengine.eu/tag/trtenten/) - [svTenTen](https://www.sketchengine.eu/tag/svtenten/) - [skTenTen](https://www.sketchengine.eu/tag/sktenten/) - [Slovak](https://www.sketchengine.eu/tag/slovak/) - [ruTenTen](https://www.sketchengine.eu/tag/rutenten/) - [Russian](https://www.sketchengine.eu/tag/russian/) - [CAJA](https://www.sketchengine.eu/tag/caja/) - [Academic](https://www.sketchengine.eu/tag/academic/) - [journal article](https://www.sketchengine.eu/tag/journal-article/) - [CHILDES](https://www.sketchengine.eu/tag/childes/) - [child](https://www.sketchengine.eu/tag/child/) - [children](https://www.sketchengine.eu/tag/children/) - [child language](https://www.sketchengine.eu/tag/child-language/) - [corpora](https://www.sketchengine.eu/tag/corpora/) - [CHILDES English corpus](https://www.sketchengine.eu/tag/childes-english-corpus/) - [Gigaword](https://www.sketchengine.eu/tag/gigaword/) - [Chinese simplified](https://www.sketchengine.eu/tag/chinese-simplified/) - [Chinese traditional](https://www.sketchengine.eu/tag/chinese-traditional/) - [ChineseTaiwanWaC](https://www.sketchengine.eu/tag/chinesetaiwanwac/) - 
[Taiwan](https://www.sketchengine.eu/tag/taiwan/) - [ChineseWiki](https://www.sketchengine.eu/tag/chinesewiki/) - [Wikipedia](https://www.sketchengine.eu/tag/wikipedia/) - [European parliament](https://www.sketchengine.eu/tag/european-parliament/) - [multilingual](https://www.sketchengine.eu/tag/multilingual/) - [DGT](https://www.sketchengine.eu/tag/dgt/) - [translation memory](https://www.sketchengine.eu/tag/translation-memory/) - [English Wikipedia](https://www.sketchengine.eu/tag/english-wikipedia/) - [Reference corpus](https://www.sketchengine.eu/tag/reference-corpus/) - [FeedCorpus](https://www.sketchengine.eu/tag/feedcorpus/) - [Slovene](https://www.sketchengine.eu/tag/slovene/) - [Fida PLUS](https://www.sketchengine.eu/tag/fida-plus/) - [French Web Corpus](https://www.sketchengine.eu/tag/french-web-corpus/) - [akademy](https://www.sketchengine.eu/tag/akademy/) - [project Gutenberg](https://www.sketchengine.eu/tag/project-gutenberg/) - [Internet](https://www.sketchengine.eu/tag/internet/) - [Islam](https://www.sketchengine.eu/tag/islam/) - [United Kingdom](https://www.sketchengine.eu/tag/united-kingdom/) - [London English](https://www.sketchengine.eu/tag/london-english/) - [NepaliNC](https://www.sketchengine.eu/tag/nepalinc/) - [Nepali National Corpus](https://www.sketchengine.eu/tag/nepali-national-corpus/) - [New Model Corpus](https://www.sketchengine.eu/tag/new-model-corpus/) - [SuperSenseTagger](https://www.sketchengine.eu/tag/supersensetagger/) - [Oxford](https://www.sketchengine.eu/tag/oxford/) - [OPUS](https://www.sketchengine.eu/tag/opus/) - [registry infor](https://www.sketchengine.eu/tag/registry-infor/) - [Bible](https://www.sketchengine.eu/tag/bible/) - [Swahili](https://www.sketchengine.eu/tag/swahili/) - [Polish Web Corpus](https://www.sketchengine.eu/tag/polish-web-corpus/) - [Público](https://www.sketchengine.eu/tag/publico/) - [Folha](https://www.sketchengine.eu/tag/folha/) - [Brazil](https://www.sketchengine.eu/tag/brazil/) - 
[Romanian](https://www.sketchengine.eu/tag/romanian/) - [RoWaC](https://www.sketchengine.eu/tag/rowac/) - [pukWaC](https://www.sketchengine.eu/tag/pukwac/) - [Russian Web Corpus](https://www.sketchengine.eu/tag/russian-web-corpus/) - [Scottish](https://www.sketchengine.eu/tag/scottish/) - [Gaelic](https://www.sketchengine.eu/tag/gaelic/) - [semantic](https://www.sketchengine.eu/tag/semantic/) - [SiBol](https://www.sketchengine.eu/tag/sibol/) - [Port](https://www.sketchengine.eu/tag/port/) - [jpTenTen11](https://www.sketchengine.eu/tag/jptenten11/) - [LUW](https://www.sketchengine.eu/tag/luw/) - [TED](https://www.sketchengine.eu/tag/ted/) - [talkbank](https://www.sketchengine.eu/tag/talkbank/) - [Turkic web corpora](https://www.sketchengine.eu/tag/turkic-web-corpora/) - [Turkic](https://www.sketchengine.eu/tag/turkic/) - [Urdu](https://www.sketchengine.eu/tag/urdu/) - [Ajka](https://www.sketchengine.eu/tag/ajka/) - [czes](https://www.sketchengine.eu/tag/czes/) - [newspaper sites](https://www.sketchengine.eu/tag/newspaper-sites/) - [translational](https://www.sketchengine.eu/tag/translational/) - [Preparing Corpus Text](https://www.sketchengine.eu/tag/preparing-corpus-text/) - [preparing text](https://www.sketchengine.eu/tag/preparing-text/) - [Text Types](https://www.sketchengine.eu/tag/text-types/) - [headers](https://www.sketchengine.eu/tag/headers/) - [subcorpora](https://www.sketchengine.eu/tag/subcorpora/) - [Creating Subcorpora](https://www.sketchengine.eu/tag/creating-subcorpora/) - [all users](https://www.sketchengine.eu/tag/all-users/) - [all features](https://www.sketchengine.eu/tag/all-features/) - [version](https://www.sketchengine.eu/tag/version/) - [Versioning](https://www.sketchengine.eu/tag/versioning/) - [changelog](https://www.sketchengine.eu/tag/changelog/) - [administrator](https://www.sketchengine.eu/tag/administrator/) - [Full Administrators](https://www.sketchengine.eu/tag/full-administrators/) - [Command 
line](https://www.sketchengine.eu/tag/command-line/) - [generating n-grams](https://www.sketchengine.eu/tag/generating-n-grams/) - [viewing n-grams](https://www.sketchengine.eu/tag/viewing-n-grams/) - [Compiling corpus](https://www.sketchengine.eu/tag/compiling-corpus/) - [n-grams](https://www.sketchengine.eu/tag/n-grams/) - [authentication](https://www.sketchengine.eu/tag/authentication/) - [API](https://www.sketchengine.eu/tag/api/) - [JSON](https://www.sketchengine.eu/tag/json/) - [JSON API Documentation](https://www.sketchengine.eu/tag/json-api-documentation/) - [corpcheck](https://www.sketchengine.eu/tag/corpcheck/) - [tool](https://www.sketchengine.eu/tag/tool/) - [verifying](https://www.sketchengine.eu/tag/verifying/) - [integrity](https://www.sketchengine.eu/tag/integrity/) - [complenetess](https://www.sketchengine.eu/tag/complenetess/) - [compare corpora](https://www.sketchengine.eu/tag/compare-corpora/) - [comparing](https://www.sketchengine.eu/tag/comparing/) - [methods](https://www.sketchengine.eu/tag/methods/) - [localisation](https://www.sketchengine.eu/tag/localisation/) - [localization](https://www.sketchengine.eu/tag/localization/) - [Manatee](https://www.sketchengine.eu/tag/manatee/) - [Finlib](https://www.sketchengine.eu/tag/finlib/) - [Bonito](https://www.sketchengine.eu/tag/bonito/) - [Sketch Engine changelog](https://www.sketchengine.eu/tag/sketch-engine-changelog/) - [photo](https://www.sketchengine.eu/tag/photo/) - [virtual corpus](https://www.sketchengine.eu/tag/virtual-corpus/) - [virtual](https://www.sketchengine.eu/tag/virtual/) - [User Administration](https://www.sketchengine.eu/tag/user-administration/) - [administration](https://www.sketchengine.eu/tag/administration/) - [user](https://www.sketchengine.eu/tag/user/) - [save concordance](https://www.sketchengine.eu/tag/save-concordance/) - [save](https://www.sketchengine.eu/tag/save/) - [concordance](https://www.sketchengine.eu/tag/concordance/) - [Sketch Engine 
Localisation](https://www.sketchengine.eu/tag/sketch-engine-localisation/) - [Getting Started](https://www.sketchengine.eu/tag/getting-started/) - [Getting Started with Sketch Engine](https://www.sketchengine.eu/tag/getting-started-with-sketch-engine/) - [new account](https://www.sketchengine.eu/tag/new-account/) - [Learner Corpus Functionality](https://www.sketchengine.eu/tag/learner-corpus-functionality/) - [Learner Corpus](https://www.sketchengine.eu/tag/learner-corpus/) - [functionality](https://www.sketchengine.eu/tag/functionality/) - [vertical file for learner corpus](https://www.sketchengine.eu/tag/vertical-file-for-learner-corpus/) - [Thesaurus](https://www.sketchengine.eu/tag/thesaurus/) - [Distributional Thesaurus](https://www.sketchengine.eu/tag/distributional-thesaurus/) - [Showing the different usage of two similar words](https://www.sketchengine.eu/tag/showing-the-different-usage-of-two-similar-words/) - [different usage](https://www.sketchengine.eu/tag/different-usage/) - [two similar words](https://www.sketchengine.eu/tag/two-similar-words/) - [word sketch differences](https://www.sketchengine.eu/tag/word-sketch-differences/) - [switch menu position](https://www.sketchengine.eu/tag/switch-menu-position/) - [menu position](https://www.sketchengine.eu/tag/menu-position/) - [Multi-Level Lists](https://www.sketchengine.eu/tag/multi-level-lists/) - [multi-level](https://www.sketchengine.eu/tag/multi-level/) - [Distinguish Between Lemmas](https://www.sketchengine.eu/tag/distinguish-between-lemmas/) - [Between Lemmas](https://www.sketchengine.eu/tag/between-lemmas/) - [compare corpora using wordlists](https://www.sketchengine.eu/tag/compare-corpora-using-wordlists/) - [Word Sketch](https://www.sketchengine.eu/tag/word-sketch/) - [Make a MultiWord Sketch](https://www.sketchengine.eu/tag/make-a-multiword-sketch/) - [MultiWord Sketch](https://www.sketchengine.eu/tag/multiword-sketch/) - [Search in Specific 
subcategories](https://www.sketchengine.eu/tag/search-in-specific-subcategories/) - [Specific subcategories](https://www.sketchengine.eu/tag/specific-subcategories/) - [Search Punctuation](https://www.sketchengine.eu/tag/search-punctuation/) - [Punctuation](https://www.sketchengine.eu/tag/punctuation/) - [Make Concordance](https://www.sketchengine.eu/tag/make-concordance/) - [Create Subcorpus](https://www.sketchengine.eu/tag/create-subcorpus/) - [Search Multiple Words](https://www.sketchengine.eu/tag/search-multiple-words/) - [Multiple Words](https://www.sketchengine.eu/tag/multiple-words/) - [Specify Word Context](https://www.sketchengine.eu/tag/specify-word-context/) - [Word Context](https://www.sketchengine.eu/tag/word-context/) - [Search For a Specific Word Form](https://www.sketchengine.eu/tag/search-for-a-specific-word-form/) - [Specific Word Form](https://www.sketchengine.eu/tag/specific-word-form/) - [Check Search](https://www.sketchengine.eu/tag/check-search/) - [Show Whole Sentences](https://www.sketchengine.eu/tag/show-whole-sentences/) - [Whole Sentences](https://www.sketchengine.eu/tag/whole-sentences/) - [Expand Context](https://www.sketchengine.eu/tag/expand-context/) - [KWIC Source](https://www.sketchengine.eu/tag/kwic-source/) - [Source](https://www.sketchengine.eu/tag/source/) - [Highlight Only Part of a Complex Query](https://www.sketchengine.eu/tag/highlight-only-part-of-a-complex-query/) - [Highlight Only Part](https://www.sketchengine.eu/tag/highlight-only-part/) - [Part of a Complex Query](https://www.sketchengine.eu/tag/part-of-a-complex-query/) - [Another Page](https://www.sketchengine.eu/tag/another-page/) - [move around concordance](https://www.sketchengine.eu/tag/move-around-concordance/) - [Show document identification](https://www.sketchengine.eu/tag/show-document-identification/) - [document identification](https://www.sketchengine.eu/tag/document-identification/) - [Show Part of Speech 
tags](https://www.sketchengine.eu/tag/show-part-of-speech-tags/) - [Part of Speech tags](https://www.sketchengine.eu/tag/part-of-speech-tags/) - [Show Sentence breaks](https://www.sketchengine.eu/tag/show-sentence-breaks/) - [Sentence breaks](https://www.sketchengine.eu/tag/sentence-breaks/) - [Change Number of Examples](https://www.sketchengine.eu/tag/change-number-of-examples/) - [Number of Examples](https://www.sketchengine.eu/tag/number-of-examples/) - [Copy Concordance Line](https://www.sketchengine.eu/tag/copy-concordance-line/) - [Concordance Line](https://www.sketchengine.eu/tag/concordance-line/) - [One-click copying not working](https://www.sketchengine.eu/tag/one-click-copying-not-working/) - [One-click copying](https://www.sketchengine.eu/tag/one-click-copying/) - [copying not working](https://www.sketchengine.eu/tag/copying-not-working/) - [Word Sketch Index Format](https://www.sketchengine.eu/tag/word-sketch-index-format/) - [Index Format](https://www.sketchengine.eu/tag/index-format/) - [Word Sketches definition files](https://www.sketchengine.eu/tag/word-sketches-definition-files/) - [definition files](https://www.sketchengine.eu/tag/definition-files/) - [Building sketches from parsed corpora](https://www.sketchengine.eu/tag/building-sketches-from-parsed-corpora/) - [Building sketches](https://www.sketchengine.eu/tag/building-sketches/) - [Preloaded Configuration Templates](https://www.sketchengine.eu/tag/preloaded-configuration-templates/) - [Configuration Templates](https://www.sketchengine.eu/tag/configuration-templates/) - [Sketch Engine API for IntelliWebSearch](https://www.sketchengine.eu/tag/sketch-engine-api-for-intelliwebsearch/) - [Intelli Web Search](https://www.sketchengine.eu/tag/intelli-web-search/) - [IntelliWebSearch](https://www.sketchengine.eu/tag/intelliwebsearch/) - [Sketch Engine installation packages](https://www.sketchengine.eu/tag/sketch-engine-installation-packages/) - [installation 
packages](https://www.sketchengine.eu/tag/installation-packages/) - [Scripts for adding header fields](https://www.sketchengine.eu/tag/scripts-for-adding-header-fields/) - [adding header fields](https://www.sketchengine.eu/tag/adding-header-fields/) - [Uploading multiple files to Sketch Engine](https://www.sketchengine.eu/tag/uploading-multiple-files-to-sketch-engine/) - [Uploading multiple files](https://www.sketchengine.eu/tag/uploading-multiple-files/) - [Compatibility Matrix](https://www.sketchengine.eu/tag/compatibility-matrix/) - [Slovene tagset](https://www.sketchengine.eu/tag/slovene-tagset/) - [Slovenski nabor oznak](https://www.sketchengine.eu/tag/slovenski-nabor-oznak/) - [Sketch Grammar development corpora](https://www.sketchengine.eu/tag/sketch-grammar-development-corpora/) - [Sketch](https://www.sketchengine.eu/tag/sketch/) - [development corpora](https://www.sketchengine.eu/tag/development-corpora/) - [sketch grammar](https://www.sketchengine.eu/tag/sketch-grammar/) - [Adding sentence boundaries to a compiled corpus](https://www.sketchengine.eu/tag/adding-sentence-boundaries-to-a-compiled-corpus/) - [Adding sentence boundaries](https://www.sketchengine.eu/tag/adding-sentence-boundaries/) - [Renaming Sketch Grammar relations](https://www.sketchengine.eu/tag/renaming-sketch-grammar-relations/) - [renaming](https://www.sketchengine.eu/tag/renaming/) - [General instructions on corpus data directory structure](https://www.sketchengine.eu/tag/general-instructions-on-corpus-data-directory-structure/) - [corpus data directory structure](https://www.sketchengine.eu/tag/corpus-data-directory-structure/) - [Line numbers in concordances](https://www.sketchengine.eu/tag/line-numbers-in-concordances/) - [Line numbers](https://www.sketchengine.eu/tag/line-numbers/) - [Printing concordances](https://www.sketchengine.eu/tag/printing-concordances/) - [printign](https://www.sketchengine.eu/tag/printign/) - [Getting a short permanent 
link](https://www.sketchengine.eu/tag/getting-a-short-permanent-link/) - [permanent link](https://www.sketchengine.eu/tag/permanent-link/) - [Unsupported language](https://www.sketchengine.eu/tag/unsupported-language/) - [Unsupported](https://www.sketchengine.eu/tag/unsupported/) - [Writing Sketch Grammars](https://www.sketchengine.eu/tag/writing-sketch-grammars/) - [Sketch Grammars](https://www.sketchengine.eu/tag/sketch-grammars/) - [Find Common Collocates](https://www.sketchengine.eu/tag/find-common-collocates/) - [SkEW-6](https://www.sketchengine.eu/tag/skew-6/) - [Icelandic](https://www.sketchengine.eu/tag/icelandic/) - [Icelandic sample corpus](https://www.sketchengine.eu/tag/icelandic-sample-corpus/) - [Sort concordance](https://www.sketchengine.eu/tag/sort-concordance/) - [sorting](https://www.sketchengine.eu/tag/sorting/) - [Concordance Filter](https://www.sketchengine.eu/tag/concordance-filter/) - [Filter a Concordance](https://www.sketchengine.eu/tag/filter-a-concordance/) - [Use Sample](https://www.sketchengine.eu/tag/use-sample/) - [thinning concordance](https://www.sketchengine.eu/tag/thinning-concordance/) - [Word Sketches](https://www.sketchengine.eu/tag/word-sketches/) - [Parallel corpora](https://www.sketchengine.eu/tag/parallel-corpora/) - [FrWaC](https://www.sketchengine.eu/tag/frwac/) - [Bilingual Word Sketch](https://www.sketchengine.eu/tag/bilingual-word-sketch/) - [Simple maths](https://www.sketchengine.eu/tag/simple-maths/) - [identifying keywords](https://www.sketchengine.eu/tag/identifying-keywords/) - [Glossary of Terms](https://www.sketchengine.eu/tag/glossary-of-terms/) - [Glossary](https://www.sketchengine.eu/tag/glossary/) - [Jargon Buster](https://www.sketchengine.eu/tag/jargon-buster/) - [Bibliography](https://www.sketchengine.eu/tag/bibliography/) - [Bibliography of Sketch Engine](https://www.sketchengine.eu/tag/bibliography-of-sketch-engine/) - [Bibliographies](https://www.sketchengine.eu/tag/bibliographies/) - [Sketch Engine 
for](https://www.sketchengine.eu/tag/sketch-engine-for/) - [Change interface language](https://www.sketchengine.eu/tag/change-interface-language/) - [interface language](https://www.sketchengine.eu/tag/interface-language/) - [Quering corpora](https://www.sketchengine.eu/tag/quering-corpora/) - [Sketch Engine for historians](https://www.sketchengine.eu/tag/sketch-engine-for-historians/) - [Discrepancies between API and interface results](https://www.sketchengine.eu/tag/discrepancies-between-api-and-interface-results/) - [Discrepancies between API and interface](https://www.sketchengine.eu/tag/discrepancies-between-api-and-interface/) - [Discrepancies API interface](https://www.sketchengine.eu/tag/discrepancies-api-interface/) - [Discrepancies API and interface](https://www.sketchengine.eu/tag/discrepancies-api-and-interface/) - [common corpora structures](https://www.sketchengine.eu/tag/common-corpora-structures/) - [corpora structures](https://www.sketchengine.eu/tag/corpora-structures/) - [structures](https://www.sketchengine.eu/tag/structures/) - [Variation in hit counts](https://www.sketchengine.eu/tag/variation-in-hit-counts/) - [Variation hit counts](https://www.sketchengine.eu/tag/variation-hit-counts/) - [hit counts](https://www.sketchengine.eu/tag/hit-counts/) - [Frequently asked questions](https://www.sketchengine.eu/tag/frequently-asked-questions/) - [Using WebBootCat](https://www.sketchengine.eu/tag/using-webbootcat/) - [Answers on Using WebBootCat](https://www.sketchengine.eu/tag/answers-on-using-webbootcat/) - [6th International Sketch Engine Workshop](https://www.sketchengine.eu/tag/6th-international-sketch-engine-workshop/) - [SkEW6](https://www.sketchengine.eu/tag/skew6/) - [Oxford Children's Corpus](https://www.sketchengine.eu/tag/oxford-childrens-corpus/) - [Oxford Children's](https://www.sketchengine.eu/tag/oxford-childrens/) - [Sketch Engine for students](https://www.sketchengine.eu/tag/sketch-engine-for-students/) - [Sketch Engine for 
teachers](https://www.sketchengine.eu/tag/sketch-engine-for-teachers/) - [Oxford English Corpus](https://www.sketchengine.eu/tag/oxford-english-corpus/) - [OCC](https://www.sketchengine.eu/tag/occ/) - [Varieties of Learner English corpus](https://www.sketchengine.eu/tag/varieties-of-learner-english-corpus/) - [Learner English](https://www.sketchengine.eu/tag/learner-english/) - [New Corpus for Ireland](https://www.sketchengine.eu/tag/new-corpus-for-ireland/) - [Nua-Chorpas na hÉireann](https://www.sketchengine.eu/tag/nua-chorpas-na-heireann/) - [Irish](https://www.sketchengine.eu/tag/irish/) - [chorpas](https://www.sketchengine.eu/tag/chorpas/) - [Gaeilge](https://www.sketchengine.eu/tag/gaeilge/) - [treoir don úsáideoir](https://www.sketchengine.eu/tag/treoir-don-usaideoir/) - [focloir](https://www.sketchengine.eu/tag/focloir/) - [GDEX installation](https://www.sketchengine.eu/tag/gdex-installation/) - [installation](https://www.sketchengine.eu/tag/installation/) - [permalink](https://www.sketchengine.eu/tag/permalink/) - [SkELL](https://www.sketchengine.eu/tag/skell/) - [Sketch Engine for Language Learning](https://www.sketchengine.eu/tag/sketch-engine-for-language-learning/) - [documentation](https://www.sketchengine.eu/tag/documentation/) - [doc](https://www.sketchengine.eu/tag/doc/) - [Sketch Engine Workshops](https://www.sketchengine.eu/tag/sketch-engine-workshops/) - [workshops](https://www.sketchengine.eu/tag/workshops/) - [API documentation for keyword extraction](https://www.sketchengine.eu/tag/api-documentation-for-keyword-extraction/) - [API documentation](https://www.sketchengine.eu/tag/api-documentation/) - [API parameters](https://www.sketchengine.eu/tag/api-parameters/) - [WordSketch](https://www.sketchengine.eu/tag/wordsketch/) - [Word profile](https://www.sketchengine.eu/tag/word-profile/) - [Simple Query](https://www.sketchengine.eu/tag/simple-query/) - [Simple search](https://www.sketchengine.eu/tag/simple-search/) - [Adam Kilgarriff Structured 
bibliography](https://www.sketchengine.eu/tag/adam-kilgarriff-structured-bibliography/) - [Adam Kilgarriff bibliography](https://www.sketchengine.eu/tag/adam-kilgarriff-bibliography/) - [Adam Kilgarriff](https://www.sketchengine.eu/tag/adam-kilgarriff/) - [Corpus Architect](https://www.sketchengine.eu/tag/corpus-architect/) - [Features](https://www.sketchengine.eu/tag/features/) - [Trends in diachronic corpora](https://www.sketchengine.eu/tag/trends-in-diachronic-corpora/) - [trends](https://www.sketchengine.eu/tag/trends/) - [my jobs](https://www.sketchengine.eu/tag/my-jobs/) - [my background jobs](https://www.sketchengine.eu/tag/my-background-jobs/) - [Concordance Query Error Query](https://www.sketchengine.eu/tag/concordance-query-error-query/) - [Concordance Error Query](https://www.sketchengine.eu/tag/concordance-error-query/) - [Error Query](https://www.sketchengine.eu/tag/error-query/) - [Shallow tagging](https://www.sketchengine.eu/tag/shallow-tagging/) - [BNC tagset](https://www.sketchengine.eu/tag/bnc-tagset/) - [British National Corpus tagset](https://www.sketchengine.eu/tag/british-national-corpus-tagset/) - [CLAWS-5](https://www.sketchengine.eu/tag/claws-5/) - [caTenTen corpus](https://www.sketchengine.eu/tag/catenten-corpus/) - [Symbols of Parts of Speech](https://www.sketchengine.eu/tag/symbols-of-parts-of-speech/) - [Modified Penn Treebank Tagset](https://www.sketchengine.eu/tag/modified-penn-treebank-tagset/) - [Penn Treebank Tagset](https://www.sketchengine.eu/tag/penn-treebank-tagset/) - [OPUS alignment](https://www.sketchengine.eu/tag/opus-alignment/) - [OPUS parallel corpora](https://www.sketchengine.eu/tag/opus-parallel-corpora/) - [Parole Common Morphosyntactical Tagset](https://www.sketchengine.eu/tag/parole-common-morphosyntactical-tagset/) - [gaeilge tagset](https://www.sketchengine.eu/tag/gaeilge-tagset/) - [Feed Corpus Project](https://www.sketchengine.eu/tag/feed-corpus-project/) - [feed 
project](https://www.sketchengine.eu/tag/feed-project/) - [Single sign-on](https://www.sketchengine.eu/tag/single-sign-on/) - [sign on](https://www.sketchengine.eu/tag/sign-on/) - [SSO](https://www.sketchengine.eu/tag/sso/) - [WebBootCaT](https://www.sketchengine.eu/tag/webbootcat/) - [Statistics used in Sketch Engine](https://www.sketchengine.eu/tag/statistics-used-in-sketch-engine/) - [Statistics](https://www.sketchengine.eu/tag/statistics/) - [Statistics Sketch Engine](https://www.sketchengine.eu/tag/statistics-sketch-engine/) - [Sketch Engine statistics](https://www.sketchengine.eu/tag/sketch-engine-statistics/) - [bootstrapping text](https://www.sketchengine.eu/tag/bootstrapping-text/) - [Text Corpora](https://www.sketchengine.eu/tag/text-corpora/) - [job runner](https://www.sketchengine.eu/tag/job-runner/) - [SkEW 6 workshop](https://www.sketchengine.eu/tag/skew-6-workshop/) - [Early English Books Online](https://www.sketchengine.eu/tag/early-english-books-online/) - [eebo](https://www.sketchengine.eu/tag/eebo/) - [eebo corpus](https://www.sketchengine.eu/tag/eebo-corpus/) - [ske.li](https://www.sketchengine.eu/tag/ske-li/) - [Concordance Query](https://www.sketchengine.eu/tag/concordance-query/) - [Query Type](https://www.sketchengine.eu/tag/query-type/) - [sketch differences](https://www.sketchengine.eu/tag/sketch-differences/) - [Lexicom](https://www.sketchengine.eu/tag/lexicom/) - [Telč](https://www.sketchengine.eu/tag/telc/) - [Local installations](https://www.sketchengine.eu/tag/local-installations/) - [requirements for installing](https://www.sketchengine.eu/tag/requirements-for-installing/) - [r](https://www.sketchengine.eu/tag/r/) - [CLAWS tagset](https://www.sketchengine.eu/tag/claws-tagset/) - [claws](https://www.sketchengine.eu/tag/claws/) - [C8 mapping C7](https://www.sketchengine.eu/tag/c8-mapping-c7/) - [hebwac corpus](https://www.sketchengine.eu/tag/hebwac-corpus/) - [japanese tagset](https://www.sketchengine.eu/tag/japanese-tagset/) - 
[Romanian Tagset](https://www.sketchengine.eu/tag/romanian-tagset/) - [Limba Română](https://www.sketchengine.eu/tag/limba-romana/) - [Română](https://www.sketchengine.eu/tag/romana/) - [corporaWaC](https://www.sketchengine.eu/tag/corporawac/) - [Vietnamese Tagset](https://www.sketchengine.eu/tag/vietnamese-tagset/) - [Vietnamese](https://www.sketchengine.eu/tag/vietnamese/) - [Chinese Tagset](https://www.sketchengine.eu/tag/chinese-tagset/) - [Hebrew Translational Corpus tagset](https://www.sketchengine.eu/tag/hebrew-translational-corpus-tagset/) - [Yoruba WaC corpus](https://www.sketchengine.eu/tag/yoruba-wac-corpus/) - [Kannada WaC](https://www.sketchengine.eu/tag/kannada-wac/) - [Kannada](https://www.sketchengine.eu/tag/kannada/) - [Dravidian language](https://www.sketchengine.eu/tag/dravidian-language/) - [Domain Web Corpus](https://www.sketchengine.eu/tag/domain-web-corpus/) - [hrwac](https://www.sketchengine.eu/tag/hrwac/) - [Indonesian](https://www.sketchengine.eu/tag/indonesian/) - [IndonesianWaC](https://www.sketchengine.eu/tag/indonesianwac/) - [Oxford Children's Stories](https://www.sketchengine.eu/tag/oxford-childrens-stories/) - [Beebox](https://www.sketchengine.eu/tag/beebox/) - [LithuanianWaC](https://www.sketchengine.eu/tag/lithuanianwac/) - [Dutch Web Corpus](https://www.sketchengine.eu/tag/dutch-web-corpus/) - [talk bank persian](https://www.sketchengine.eu/tag/talk-bank-persian/) - [PICAE](https://www.sketchengine.eu/tag/picae/) - [Pearson International Corpus](https://www.sketchengine.eu/tag/pearson-international-corpus/) - [Academic English](https://www.sketchengine.eu/tag/academic-english/) - [romanian wac tagset](https://www.sketchengine.eu/tag/romanian-wac-tagset/) - [SamoanWaC](https://www.sketchengine.eu/tag/samoanwac/) - [sibolport](https://www.sketchengine.eu/tag/sibolport/) - [java](https://www.sketchengine.eu/tag/java/) - [python](https://www.sketchengine.eu/tag/python/) - [JSON API Documentation 
authentication](https://www.sketchengine.eu/tag/json-api-documentation-authentication/) - [Frequency per million](https://www.sketchengine.eu/tag/frequency-per-million/) - [relative frequency](https://www.sketchengine.eu/tag/relative-frequency/) - [text type frequency](https://www.sketchengine.eu/tag/text-type-frequency/) - [Diachronic analysis](https://www.sketchengine.eu/tag/diachronic-analysis/) - [word usage over time](https://www.sketchengine.eu/tag/word-usage-over-time/) - [Find x](https://www.sketchengine.eu/tag/find-x/) - [Findx](https://www.sketchengine.eu/tag/findx/) - [finding word most x](https://www.sketchengine.eu/tag/finding-word-most-x/) - [GkWaC](https://www.sketchengine.eu/tag/gkwac/) - [greek web as corpus](https://www.sketchengine.eu/tag/greek-web-as-corpus/) - [greek corpus](https://www.sketchengine.eu/tag/greek-corpus/) - [workshop](https://www.sketchengine.eu/tag/workshop/) - [Writing Sketch Grammar](https://www.sketchengine.eu/tag/writing-sketch-grammar/) - [import CSV](https://www.sketchengine.eu/tag/import-csv/) - [CSV file](https://www.sketchengine.eu/tag/csv-file/) - [import CSV into Excel](https://www.sketchengine.eu/tag/import-csv-into-excel/) - [Sketch Engine for translators](https://www.sketchengine.eu/tag/sketch-engine-for-translators/) - [Bilingual terminology extraction](https://www.sketchengine.eu/tag/bilingual-terminology-extraction/) - [Bilingual terminology](https://www.sketchengine.eu/tag/bilingual-terminology/) - [full administration](https://www.sketchengine.eu/tag/full-administration/) - [user over quota](https://www.sketchengine.eu/tag/user-over-quota/) - [Granting user access](https://www.sketchengine.eu/tag/granting-user-access/) - [Configuring users](https://www.sketchengine.eu/tag/configuring-users/) - [Configuring corpora](https://www.sketchengine.eu/tag/configuring-corpora/) - [sonar](https://www.sketchengine.eu/tag/sonar/) - [contemporary written Dutch](https://www.sketchengine.eu/tag/contemporary-written-dutch/) - 
[Dutch reference corpus](https://www.sketchengine.eu/tag/dutch-reference-corpus/) - [Sketch Engine for terminologist](https://www.sketchengine.eu/tag/sketch-engine-for-terminologist/) - [contact](https://www.sketchengine.eu/tag/contact/) - [contact us](https://www.sketchengine.eu/tag/contact-us/) - [support](https://www.sketchengine.eu/tag/support/) - [cql](https://www.sketchengine.eu/tag/cql/) - [corpus query language](https://www.sketchengine.eu/tag/corpus-query-language/) - [delete](https://www.sketchengine.eu/tag/delete/) - [ruskell](https://www.sketchengine.eu/tag/ruskell/) - [russian skell](https://www.sketchengine.eu/tag/russian-skell/) - [Oxford Tagset](https://www.sketchengine.eu/tag/oxford-tagset/) - [english tagset](https://www.sketchengine.eu/tag/english-tagset/) - [biterm](https://www.sketchengine.eu/tag/biterm/) - [Sketch Engine for translator](https://www.sketchengine.eu/tag/sketch-engine-for-translator/) - [EUR-Lex Corpus](https://www.sketchengine.eu/tag/eur-lex-corpus/) - [EUR-Lex](https://www.sketchengine.eu/tag/eur-lex/) - [European Union](https://www.sketchengine.eu/tag/european-union/) - [parallel corpus](https://www.sketchengine.eu/tag/parallel-corpus/) - [Bulgarian tagset](https://www.sketchengine.eu/tag/bulgarian-tagset/) - [specific text types](https://www.sketchengine.eu/tag/specific-text-types/) - [restore interface language](https://www.sketchengine.eu/tag/restore-interface-language/) - [Portuguese Tagset](https://www.sketchengine.eu/tag/portuguese-tagset/) - [tutorial video](https://www.sketchengine.eu/tag/tutorial-video/) - [sketch engine video](https://www.sketchengine.eu/tag/sketch-engine-video/) - [Russian tagset](https://www.sketchengine.eu/tag/russian-tagset/) - [bulgarian pipeline](https://www.sketchengine.eu/tag/bulgarian-pipeline/) ## Tooltip Categories - [statistics](https://www.sketchengine.eu/glossary-categories/statistics/) - [feature](https://www.sketchengine.eu/glossary-categories/feature/) - 
[attribute](https://www.sketchengine.eu/glossary-categories/attribute/) - [corpus types](https://www.sketchengine.eu/glossary-categories/corpus-types/) - [text analysis](https://www.sketchengine.eu/glossary-categories/text-analysis/)