Summer Internships for Students
Call: HLT summer 2013 - open till May 25, 2013
Continuing the initiative of the previous years, the HLT unit is happy to announce the availability of internships for MA/MS and PhD students interested in carrying out research projects on NLP at FBK-irst (Trento), during our 2013 summer internship program.
WHO
HLT is a strong research group in Human Language Technology. We are experts in Content Extraction, Machine Translation and Speech Recognition. We are happy to collaborate with motivated students pursuing a degree in Natural Language Processing or Speech Processing on topics which are of special interest to the community.
WHAT
The Summer Internship Program is an exciting opportunity for students working on their graduate thesis to enrich their scientific background by working directly with researchers from HLT. Together with her/his advisor, the successful candidate will focus on a specific language technology topic. The topics are chosen so that the implementation, the experiments and the analysis of results can be carried out over a period of a few weeks. For a limited number of positions, accommodation, meals and transport expenses will be sponsored by the HLT unit.
HOW
Motivated students with a strong background in Natural Language Processing topics are invited to submit their application.
Each application is linked to a specific topic, which is proposed and supervised by an HLT researcher. A candidate is welcome to apply for more than one topic if she/he feels qualified to do so.
To apply, send your CV directly to the researcher responsible for the topic you are interested in, before May 25, 2013. Please consult the list of topics and the short description of each topic below. Candidates should specify whether they are applying for financial support (see below).
WHEN
Applicants will be selected and accepted on a first come, first served basis. This call will remain open until all available positions are filled, and no later than May 25th, 2013. The typical duration of the internship is 6 to 8 weeks. Each advisor will contact the successful candidate and they will agree on the exact period of the internship: any time between June 1st, 2013 and September 30th, 2013.
WHERE
The FBK Trentino Research Institute is situated in the heart of the superb scenery of the Dolomites. Trento is characterized both by a rich and important cultural heritage and by a vibrant, international student life. When you come to Trento, be ready to meet students from all over the world, be part of exciting ideas and inspiring projects, and enjoy the beauty of nature.
Financial Support
FBK will cover meals at the FBK canteen for all non-resident students. For a limited number of non-resident students, FBK will also cover:
• part of travel expenses to/from Trento
• local transportation between Trento-FBK
• lodging at the university campus
• reimbursement of food expenses
Contact
For further enquiries contact:
Vivi Nastase, e-mail: hltsip AT list.fbk eu
Marco Turchi, e-mail: hltsip AT list.fbk eu
List of Topics and Short Description
Integration of Translation Memory and Machine Translation (advisor: Marcello Federico) FILLED
Modern CAT (Computer Assisted Translation) tools are nowadays capable of seamlessly providing suggestions from TMs and MT engines. In recent work, some pilot studies have shown that tighter integration of TM and MT technologies is indeed possible. This project will go one step further by trying to integrate these two sources optimally. In particular, it will investigate the completion of fuzzy matches found in the TM through statistical MT techniques. The objective is to optimize the precision-recall trade-off of TM and SMT, respectively. Empirical evidence suggests that MT has potential only when no fuzzy matches above 85% are found in the TM. The goal of the project is twofold: (1) investigate methods to automatically post-edit TM fuzzy matches in order to reduce the human labor needed to fix them; (2) study the efficient integration of long phrases into a statistical MT system so that it can incorporate the TM directly.
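As an illustration of the 85% observation above, the following is a minimal sketch of how a tool might choose between a TM suggestion and an MT suggestion. It is not the project's method: the token-level similarity from Python's difflib is only a stand-in for a real fuzzy-match score, and the names fuzzy_score, best_tm_match, choose_suggestion and the mt_translate callback are hypothetical.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # the 85% figure quoted in the topic description


def fuzzy_score(source, tm_source):
    """Token-level similarity in [0, 1]; an assumed stand-in for a CAT tool's fuzzy-match score."""
    return SequenceMatcher(None, source.split(), tm_source.split()).ratio()


def best_tm_match(source, translation_memory):
    """Return (score, tm_source, tm_target) for the closest TM entry, or None if the TM is empty."""
    scored = [(fuzzy_score(source, src), src, tgt) for src, tgt in translation_memory]
    return max(scored, key=lambda item: item[0]) if scored else None


def choose_suggestion(source, translation_memory, mt_translate):
    """Use the TM target when the best match exceeds the threshold, otherwise call MT."""
    match = best_tm_match(source, translation_memory)
    if match is not None and match[0] > FUZZY_THRESHOLD:
        return "TM", match[2]
    # mt_translate is a hypothetical callback wrapping whatever MT engine is available.
    return "MT", mt_translate(source)
```

A project along the lines described would go beyond this hard switch, for example by repairing the mismatched part of a fuzzy match with SMT rather than discarding the match.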
Processing tools for Dutch (advisor: Diego Giuliani)
The student's activity will be mainly devoted to providing linguistic knowledge to the FBK research team working on the development of an automatic speech recognizer for Dutch. In particular, the student will support the development of grapheme-to-phoneme transcoding tools and of text processing tools for the efficient treatment of compound words. In addition, she/he will perform manual transcription and annotation of some audio recordings in Dutch.
The profile of the ideal candidate we are looking for is as follows:
- master's student
- native speaker of Dutch, with preference for the Flemish variant
- good knowledge of Italian or English (as the language of work is Italian and/or English)
- some knowledge in phonetics
- knowledge of the Linux OS environment and scripting languages (shell and Perl)
- good attitude towards teamwork.
Cache-based Machine Translation in CAT Tools (advisors: Nicola Bertoldi, Mauro Cettolo)
MateCAT (www.matecat.com/matecat/the-project) is an EU-funded project with the goal of increasing the productivity of human translators by integrating statistical machine translation (SMT) into a Computer Assisted Translation (CAT) tool. Caching is the powerful mechanism we chose for implementing the online adaptation of SMT models in CAT tools, that is, for continuously injecting into the system knowledge about newly translated text and user corrections. We propose two projects in the framework of cache-based SMT for CAT tools:
- Context Reward: Typically, the recency of cached items is accounted for by decaying their score with age. This works well on average. On the other hand, less recent cached items may resemble the text currently being translated and thus be useful for its automatic translation. We propose to investigate effective ways of recovering and refreshing the cached items that are most suitable for translating the current text.
- Text Repetitiveness: Preliminary experiments suggest that the more repetitions a text contains, the higher the impact of caching on translation quality. We propose to investigate which text features correlate best with the quality of cache-based SMT models.
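The two ideas above can be made concrete with a toy sketch. It assumes an exponential decay with a hypothetical half_life parameter and a simple n-gram repetition measure; neither is claimed to be what the MateCAT system actually uses, and the names DecayingCache and repetition_rate are illustrative only.

```python
from collections import Counter


class DecayingCache:
    """Toy cache of translation pairs whose score decays exponentially with age."""

    def __init__(self, half_life=10.0):
        self.half_life = half_life  # hypothetical parameter: age at which the score halves
        self.clock = 0              # number of insertions performed so far
        self.inserted_at = {}       # (source phrase, target phrase) -> insertion time

    def add(self, source_phrase, target_phrase):
        """Insert or refresh a pair; refreshing resets its age to zero."""
        self.clock += 1
        self.inserted_at[(source_phrase, target_phrase)] = self.clock

    def score(self, source_phrase, target_phrase):
        """Return 0.5 ** (age / half_life), or 0.0 for pairs not in the cache."""
        key = (source_phrase, target_phrase)
        if key not in self.inserted_at:
            return 0.0
        age = self.clock - self.inserted_at[key]
        return 0.5 ** (age / self.half_life)


def repetition_rate(text, n=2):
    """Fraction of n-gram occurrences whose n-gram appears more than once in the text."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

In this sketch, calling add again on an old pair is the simplest form of "refreshing" an item; the Context Reward project would instead look for ways to decide automatically which old items deserve it.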
Domain-specific terminology extraction for Machine Translation (advisors: Marco Turchi, Sara Tonelli)
Computer Assisted Translation (CAT) tools have been developed to support and improve the productivity of professional translators. Several external tools are needed to enrich the CAT tool user interface with additional information such as morphological or part-of-speech tags. The aim of this internship is to automatically create domain-specific (parallel) terminological resources that will be used to highlight and suggest domain-specific multi-word expressions (MWEs) during the translation process. This will be achieved by: (1) identifying monolingual domain-specific MWEs in the multilingual thesaurus Eurovoc; (2) creating links between terms across various languages; and (3) increasing coverage by mining morphological term variations in the JRC-Acquis corpus. The final goal of the internship is to make the resource available to the research community.
Automatic Estimation of Machine Translation Output Quality (advisors: Matteo Negri, Marco Turchi)
The automatic assessment of MT quality without using reference sentences (also known as Quality Estimation) is an emerging topic, with a number of recent initiatives and published scientific papers that address the problem from different perspectives. Quality Estimation combines a number of interesting issues, including the extraction of informative features, the selection of appropriate machine learning algorithms, and the design and evaluation of metrics that correlate well with human judgements. Besides the challenging nature of the task, automatic MT quality estimation has also become a hot research topic because of its market potential. The evolution of the translation industry, which is now facing the possibility of close collaboration between humans and machines, has opened up huge revenue opportunities for solutions that make such collaboration even tighter and more productive. The HLT Unit is currently very active on both fronts, with a team of researchers targeting the design of effective MT quality estimation approaches (publication oriented) and their integration within existing Computer Assisted Translation tools (market oriented). The project, in turn, will address the development of innovative solutions with an eye to both sides (scientific impact, efficiency and robustness).
Duration: 1-2 months (any time between early June and early September)
Pre-requisites: background in machine translation and machine learning, good programming skills
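To give a flavour of two ingredients mentioned in the Quality Estimation description above, reference-free features and correlation with human judgements, here is a minimal sketch. The four shallow features are illustrative assumptions rather than the feature set used by the HLT unit, and the function names extract_features and pearson are hypothetical.

```python
import math


def extract_features(source, translation):
    """A few shallow, reference-free features of a (source, translation) pair."""
    src_tokens = source.split()
    tgt_tokens = translation.split()
    return {
        "source_length": len(src_tokens),
        "target_length": len(tgt_tokens),
        "length_ratio": len(tgt_tokens) / len(src_tokens) if src_tokens else 0.0,
        "target_type_token_ratio": len(set(tgt_tokens)) / len(tgt_tokens) if tgt_tokens else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation between predicted scores and human judgements."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return cov / math.sqrt(var_x * var_y)
```

In a real setting, features like these would feed a learning algorithm trained on human quality labels, and a correlation of this kind would be one way to judge how well the predictions track those labels.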
Big data for NLP with HBase (advisor: Roldano Cattoni)
NewsReader (www.newsreader-project.eu) is an EU-funded project with the goal of "Building structured event Indexes of large volumes of financial and economic Data for Decision Making". NLP and Knowledge-Management tools are exploited to extract and share structured information such as Mentions and Entities. Therefore, efficient storage and retrieval of massive data, as well as scalability, are crucial. We have chosen HBase as the database-like infrastructure upon which to build the storage server of the NewsReader project.
The goals of the internship are technological and related to the actual usage of HBase: (1) deployment in the cloud, (2) realization of a configurable schema for table attributes, (3) integration with OMID for transaction support, (4) management of frontend replication.
Candidates must have excellent programming skills in Java in the Unix environment. Experience with HBase or other databases will be considered a plus.
Extensions of MultiWordNet lexical resource and its access via mobile App (advisors: Carlo Strapparava and Christian Girardi) FILLED
The MultiWordNet project aims at the realisation of a large-scale multilingual computational lexicon based on WordNet. The model adopted within the MultiWordNet project stresses the usefulness of a strict alignment between the lexical databases, i.e. wordnets, of different languages. FBK resources also include the extensions WordNet-Domains, which adds semantic domain labels to each synset, and WordNet-Affect, an additional hierarchy of "affective" domain labels with which the synsets representing affective concepts are annotated.
The internship will focus on expanding and refining these resources and/or on planning and implementing a mobile App to access the lexical information.


