22 July 2004

Xerox Research Centre Europe (XRCE): Media Backgrounder (Content Analysis Research Area)

Xerox Research Centre Europe (XRCE) is structured into four complementary research areas: content analysis; document structure; image processing; and work practice technology.

The content analysis research area consists of four core linguistic technologies that are used to build different content management software applications. These core technologies are: finite state technology (FST); machine learning; parsing; and semantics.

In simple terms, FST is the use of finite-state devices (automata and transducers) to make language-processing tools more efficient in both time and space. It is a well-established technology used extensively in many areas of natural language processing. FST is particularly well suited to multilingual tools, because the relevant phenomena of each language can be expressed easily and intuitively as finite-state devices.
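As a rough illustration of what "expressing a language phenomenon as a finite-state device" can mean, the sketch below builds a toy transducer that maps English nouns to a lemma plus a number tag. It is a minimal example under stated assumptions, not Xerox's implementation: the three-word lexicon, the tag names (+N, +Sg, +Pl) and the function names are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# (state, input symbol) -> (next state, output string)
Transitions = Dict[Tuple[str, str], Tuple[str, str]]


def build_transducer(stems: Iterable[str]) -> Tuple[Transitions, Dict[str, str]]:
    """Build a toy transducer accepting each stem and stem + "s".

    Assumption: no stem + "s" is a prefix of another stem, so the
    transition table stays deterministic.
    """
    trans: Transitions = {}
    finals: Dict[str, str] = {}  # final state -> output emitted on acceptance
    for stem in stems:
        state = ""
        for ch in stem:
            nxt = state + ch
            trans[(state, ch)] = (nxt, ch)  # copy the stem letter to the output
            state = nxt
        finals[state] = "+N+Sg"  # bare stem: singular noun
        trans[(state, "s")] = (state + "+PL", "+N+Pl")  # plural suffix
        finals[state + "+PL"] = ""
    return trans, finals


def analyse(word: str, trans: Transitions, finals: Dict[str, str]) -> Optional[str]:
    """Run the transducer over a word; return its analysis or None."""
    state, out = "", []
    for ch in word:
        step = trans.get((state, ch))
        if step is None:
            return None  # no transition: the word is rejected
        state, emitted = step
        out.append(emitted)
    if state not in finals:
        return None  # input ended in a non-final state
    return "".join(out) + finals[state]


# Hypothetical lexicon, for illustration only.
TRANS, FINALS = build_transducer(["cat", "dog", "book"])

print(analyse("cats", TRANS, FINALS))  # cat+N+Pl
print(analyse("book", TRANS, FINALS))  # book+N+Sg
print(analyse("ca", TRANS, FINALS))    # None
```

Each lookup takes one table access per input character, which is the kind of time efficiency the paragraph above refers to. A production system would compile far richer rules and lexicons into the same kind of device.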

The theoretical foundations of finite state technology have been developed to a high level of sophistication over the past two decades. It is now regarded as an established fact, for example, that finite state models are suitable for modelling broad areas of syntax, particularly in spoken language. To expand the range of applications, a new weighted version of the technology (WFSC) is currently being developed, with results that promise to confirm Xerox’s position as world leader in the field.

Machine learning can be described as the study of computer algorithms that improve automatically through experience. In other words, a system learns from examples to perform tasks automatically, in particular tasks that have traditionally been performed manually. At XRCE, machine learning is applied to textual information access, opening up new options for processing document collections. The XRCE Categorizer is an example of such an application: given a few manually classified files, the system quickly learns by itself how to classify documents hierarchically into existing categories. It can also learn entirely new categories on its own by detecting emerging topics in incoming documents and suggesting new categories to the user.
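The idea of learning to classify from a few manually labelled files can be sketched with a small naive Bayes text classifier. This is a generic textbook technique chosen for illustration; the backgrounder does not say which algorithm Categorizer uses, and the sketch covers neither hierarchical categories nor the detection of new topics. The training texts, category names and function names are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the most probable category under multinomial naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# A few manually classified "files" (hypothetical data).
EXAMPLES = [
    ("invoice payment overdue account", "finance"),
    ("quarterly budget payment forecast", "finance"),
    ("printer toner paper jam", "support"),
    ("scanner driver install error", "support"),
]
WORD_COUNTS, DOC_COUNTS = train(EXAMPLES)

print(classify("payment for invoice", WORD_COUNTS, DOC_COUNTS))   # finance
print(classify("paper jam in printer", WORD_COUNTS, DOC_COUNTS))  # support
```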

More information specifically on Categorizer is available in a Xerox press release, issued in February 2004 (refer to contact details below).

Parsing is the deconstruction of text into its syntactic parts (noun phrases, verbs, etc.) so that useful functional relations between them can be analysed and identified in large collections of text (e.g. web pages, document collections). Xerox is already at the stage of enabling machines to parse text for its meaning using its incremental parser (XIP) technology. XIP is designed to build robust analysers that tackle deeper linguistic aspects than those traditionally handled by the now widespread shallow parsing technologies. Parsing is an essential building block for natural language applications.

Semantics is closely linked to parsing at XRCE, but it goes beyond syntactic structure by looking at meaning and concepts. Concepts and the relations between them are identified, and abstractions are made, so that a ‘knowledge representation’ can be built for specific applications, e.g. distinguishing “bank” as a financial institution from “bank” as the edge of a river. Semantic representations also lend themselves to inference, particularly the use of background knowledge to refine or extend the interpretation of a text.
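The “bank” example can be made concrete with a very small word-sense disambiguation sketch: each sense is given a set of cue words, and the sense whose cues overlap most with the sentence wins. This is a simplified, Lesk-style heuristic used purely for illustration, not a description of XRCE's semantic technology; the cue words and the function name are assumptions.

```python
# Hypothetical cue words for two senses of "bank".
SENSES = {
    "financial institution": {"money", "account", "loan", "deposit", "cash", "interest"},
    "edge of a river": {"river", "water", "shore", "fishing", "stream", "boat"},
}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence."""
    cleaned = sentence.lower().replace(".", " ").replace(",", " ")
    words = set(cleaned.split())
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue word at all, decline to choose a sense.
    return best if scores[best] > 0 else None


print(disambiguate("She opened an account at the bank to deposit cash."))
# financial institution
print(disambiguate("They went fishing from the bank of the river."))
# edge of a river
print(disambiguate("The bank was closed."))
# None
```

A fuller semantic analysis would draw on the parser's output and on background knowledge rather than a fixed word list, as the paragraph above describes.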

The content analysis research area generates technologies with practical applications that solve business issues, both on its own and together with the other three XRCE research areas. Individual technologies that XRCE has guided from R&D concept to development and/or commercialisation, either through a Xerox business group or via a third-party organisation, are:

• Terminology Suite
• Inxight LinguistX
• XIP
• LIRIX
• BioTIP
• Categorizer
• CopyFinder

Separate, individual fact sheets are available on all of these technologies and the business issues they solve (see below).

For more information, please refer to www.xrce.xerox.com or contact...
