Technology
The HLT unit develops state-of-the-art technology in all of its main research areas. The group has performed consistently well in several international evaluations, and is currently engaged in international projects for open source software development (e.g. the Moses platform for statistical machine translation). Research on speech recognition also meets the highest standards, and has reached the application market on several occasions.
Moreover, members of the unit are key players in many international initiatives on evaluation and benchmarking. HLT provides technological support and high-level services to optimize the activities of the Research Unit. Providing a shared and efficient environment tailored to HLT needs ranges from the management of special hardware and software tools to the creation and management of large-scale linguistic resources.
Software
- EDITS (Edit Distance Textual Entailment Suite): an open source software package aimed at recognizing entailment relations between two portions of text
- TextPro: a suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts.
- Moses: a phrase-based decoder for statistical machine translation
- IRSTLM: a toolkit for statistical language modeling
- jSRE: an open source Java tool for Relation Extraction
- jWeb1T: an open source Java tool for efficiently searching the Web 1T 5-gram corpus
- The Tool-box for lexicographers: a web-based application for accessing and updating lexical resources
- jFex and jInFil: Java tools for Feature Extraction and Instance Filtering
- jExSLI: an open source Java tool for language identification
- jWebS: a software tool for Web people search
- jTCat: a software tool for text categorization
- StringKernel: an implementation of the string kernel
- jLSI: an open source Java tool for Latent Semantic Indexing
Databases
- MultiWordNet: a Multilingual (English/Italian) Lexical Database
- WordNet Domains: a systematic labeling of WordNet synsets with domain labels; it includes WordNet-Affect, an additional labeling of the synsets that represent affective concepts with "affective" domain labels
Corpora
- RTE3-derived CLTE dataset: 1600 English-Spanish Cross-Lingual Entailment pairs.
- Content Synchronization CLTE CoSyne Benchmark: English, Italian and German Cross-Lingual Textual Entailment aligned datasets.
- Textual Entailment Specialized Data Sets: 90 RTE-5 Test Set pairs annotated with linguistic phenomena + 203 monothematic pairs (i.e. pairs where only one linguistic phenomenon is relevant to the entailment relation) created from the 90 annotated pairs. Provided jointly with CELCT.
- MultiSemCor: an English/Italian parallel corpus
- CORPS: CORpus of tagged Political Speeches
- I-CAB: Italian Content Annotation Bank
- EVALITA 2011 NER dataset (research license): RTTR news broadcasts transcribed (both manually and automatically) and annotated with Named Entities
- QALL-ME Benchmark: annotated spoken requests in the tourism domain (Italian, Spanish, English and German)
- Wikipedia sentences in English and Italian with frame labels, automatically extracted with a WSD system
- CRIPCO: a corpus of Italian news stories annotated with information about person cross-document coreference
- SWiiT: the Italian Wikipedia annotated with entity mentions
Electronic Dictionaries/Spell Checkers
- DILF: Dizionario Italiano/Ladino Fassano
- DLS: Dizionario Ladino Standard
- Correttore ortografico del Ladino Fassano
- Correttore ortografico del Ladino Standard
Demos
- TextPro: a suite of tools for analysis of English and Italian texts
- The Wiki Machine: a tool for linking to Wikipedia


