Biomedical Text Processing


BioTALA is developing techniques to provide easier and faster access to valuable information buried in biomedical texts, saving time and cost for biomedical researchers and clinicians, and potentially enabling new insights and discoveries.


Processing and management of clinical records and research literature is a critical component of biomedical research and clinical practice. Our biomedical research partners, based in hospitals and institutes in the Melbourne biomedical precinct, have identified the cost of this processing as a significant bottleneck in their work, and see a strong need for methods for making sense of large volumes of text. Existing technology addresses neither of these needs, nor the broader problems of searching and summarising massive specialised collections.

BioTALA --- BioMedical Text and Language Applications --- is developing "text mining" technologies to (semi-)automatically discover and visualise information from genetic and other biomedical research and clinical documents. Drawing on the team's leading strengths in information retrieval and natural language processing, we aim to develop and apply text mining techniques to a variety of practical problems faced by biomedical researchers.

We are developing fundamental algorithms and tools in the context of specific applications, basing our activities on issues identified as significant by our biomedical research collaborators. These collaborators are investigating cutting-edge biomedical and clinical research topics in world-leading research institutes, and have found that issues with text are a critical bottleneck. We are developing innovative technologies that address the specific problems they have identified, prioritising those technologies likely to be of broad value.

Our Research

Language Technology and Information Retrieval have long histories in the medical domain, dating back to the 1960s. Since the publication of the draft human genome sequence in 2001, researchers in both areas have become increasingly involved in the information management challenges that have arisen from the rate at which new publications are being added to the bibliome. However, despite the recent flurry of activity by the information retrieval, machine learning, and language technology communities in the biomedical arena, there has not been an enthusiastic uptake of these technologies by biomedical researchers. In an invited talk at the ACL-BioNLP workshop in 2007, Alfonso Valencia (Centro Nacional de Biotecnologia, Spain) stated that there is a growing gulf between what computer science researchers perceive to be of interest to biomedical personnel and the pain points these people actually experience on a day-to-day basis. He implored researchers to focus their efforts on tasks that really matter to the biomedical community.

The aim of the BioTALA project is to bridge the gap between biomedical information needs and the focus of LT/IR research. BioTALA is ideally positioned to do this, given Melbourne's status as Australia's biomedical research hub and our established links with our biomedical collaborators.

Our current focus is on the following research themes:

  • Fact extraction. This is the process of finding relationships between biological entities in the biomedical literature. In conjunction with one of our partners we are investigating the use of fact extraction for the task of curating locus-specific databases, i.e. databases containing information about the mutations associated with a specific gene.

  • Information visualisation and analysis involves constructing non-standard views of document collections and the information contained therein. We are applying statistical topic-mapping techniques to provide abstractions of large volumes of text, providing high-level views and allowing a user to more easily see topic-based relationships that may be present.

  • Information retrieval is concerned with information organisation and retrieval tasks such as document search, clustering and filtering. The collation of relevant documents is an essential preprocessing step in our information management architecture.  The success of the technologies developed in the summarisation and fact extraction strands is dependent on the correct identification of relevant documents.
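As a rough illustration of the fact extraction theme above, the following minimal sketch pairs gene names with mutation mentions that occur in the same sentence. It is not the project's method: the gene lexicon, the simplified mutation pattern, and the function name are all assumptions made for the example, and sentence-level co-occurrence is only a crude baseline for finding relationships.

```python
import re

# Hypothetical gene lexicon; a real system would draw on a curated resource.
GENE_LEXICON = {"BRCA1", "CFTR", "TP53"}

# Simplified (assumed) pattern for protein-level substitutions,
# e.g. "R117H" or "p.Arg117His". Real mutation nomenclature is far richer.
MUTATION_RE = re.compile(
    r"\b(?:p\.)?(?:[A-Z][a-z]{2}\d+[A-Z][a-z]{2}"
    r"|[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY])\b"
)


def extract_gene_mutation_pairs(text):
    """Return (gene, mutation) pairs that co-occur in the same sentence."""
    pairs = []
    # Naive sentence split: terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Candidate gene tokens are upper-case/digit strings found in the lexicon.
        genes = [tok for tok in re.findall(r"\b[A-Z0-9]+\b", sentence)
                 if tok in GENE_LEXICON]
        mutations = MUTATION_RE.findall(sentence)
        for gene in genes:
            for mutation in mutations:
                pairs.append((gene, mutation))
    return pairs


if __name__ == "__main__":
    sample = ("The R117H variant in CFTR is associated with a milder phenotype. "
              "TP53 was not examined.")
    print(extract_gene_mutation_pairs(sample))
```

Candidate pairs produced this way would still need review by a curator before entering a locus-specific database; the sketch only narrows down where to look.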

What will this research achieve?

The outcomes of this project will be a customisable platform for document processing, as well as a suite of tools that use language technology to detect information pertinent to the task at hand, such as location names or biomedical terms.

Another major outcome will be tools for organising search results in ways that make information more accessible, including clustering of related results and summarisation of documents.

These tools are designed to interface with different search engines, allowing existing engines to have their search capabilities upgraded.

Current Biomedical Collaborators


Major Outputs

  • Genomic information retrieval engine: our Genomic IR engine uses biomedical knowledge resources and recognition of entities (e.g. gene names) for query expansion and concept-based document retrieval, as well as an improved ranking algorithm. (This was the third-best performing system at the TREC 2007 Genomics IR track, a research forum for evaluating IR systems.) See (Stokes et al, 2009);
  • Statistical text analysis techniques for interpreting whole collections of documents, as well as IR result sets. We have applied topic-modeling techniques to interpreting collections (Newman et al. 2010) and the MeSH ontology (Newman et al. 2009a), and have also proposed a methodology for external evaluation of the topic models themselves (Newman et al. 2009b);
  • Integrated search and topic-based visualisation system:  To facilitate more "abstract" views of document collections and the information they contain, we have integrated a topic-based visualisation framework with a biomedical search engine, allowing semantic relationships to be visualised as an aid to developing queries that give better results. (This system was a semi-finalist in the Elsevier Grand Challenge, 2008);
  • Data collection tools: web-based annotation tool; search-proxy interface for collecting biomedical queries:  The use of supervised machine learning techniques, popular in language technologies, requires the collection and annotation of data, in particular documents. Such data is extremely scarce in the biomedical domain, so we have developed web-based tools that can be used by our collaborators, for annotating documents and collecting task-specific query logs;
  • Biomedical text mining:  We have recently begun work on "text mining", i.e. using techniques from Human Language Technology to directly extract valuable information from documents; such information would ideally be used to populate a biomedical database without the high human labour cost currently required (see the Fact Extraction research stream described above). Two novel approaches we are exploring include: processing tables, which are a rich source of information within biomedical documents (Wong et al, 2009); combining the use of machine learning and grammar-based language technology approaches (Mackinlay et al, 2009).
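The query expansion idea mentioned for the Genomic IR engine above can be sketched in a few lines. This is an illustrative toy, not the engine described in (Stokes et al, 2009): the synonym table stands in for a biomedical knowledge resource, and the function names and the term-counting score are assumptions chosen for brevity.

```python
# Hypothetical synonym table standing in for a biomedical knowledge resource.
GENE_SYNONYMS = {
    "tp53": ["p53", "tumor protein p53"],
    "brca1": ["breast cancer 1", "rnf53"],
}


def expand_query(query):
    """Return the query terms plus synonyms for any recognised gene name."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in GENE_SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def score(document, expanded_terms):
    """Count how many expanded terms occur in the document (substring match)."""
    text = document.lower()
    return sum(1 for term in expanded_terms if term in text)


def rank(documents, query):
    """Order documents by descending score against the expanded query."""
    expanded = expand_query(query)
    return sorted(documents, key=lambda doc: score(doc, expanded), reverse=True)


if __name__ == "__main__":
    docs = [
        "BRCA1 and ovarian cancer risk.",
        "Loss of p53 function drives tumour growth.",
    ]
    for doc in rank(docs, "TP53 mutation"):
        print(doc)
```

Without expansion, the query "TP53 mutation" shares no term with the document that mentions only "p53"; with expansion, that document is ranked first. A production ranking function would weight terms rather than simply count them.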

 

Further Information

For further information, contact Lawrence Cavedon: lawrence.cavedon@nicta.com.au