The Write Stuff: July 2004

22 July 2004

Xerox Research Centre Europe (XRCE): Media Backgrounder (Content Analysis Research Area)

Xerox Research Centre Europe (XRCE) is structured into four complementary research areas: content analysis; document structure; image processing; and work practice technology.

The content analysis research area consists of four core linguistic technologies that are used to build different content management software applications. These core technologies are: finite state technology (FST); machine learning; parsing; and semantics.

FST, in simple terms, is the use of finite-state devices (such as automata and transducers) to increase time and space efficiency when creating language-processing tools. It is a well-established technology used extensively in many areas of natural language processing. FST is particularly well adapted to multilingual tools, as relevant local language phenomena can be easily and intuitively expressed as finite-state devices.
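To give a flavour of what a finite-state device looks like, here is a minimal sketch of a toy transducer in Python. It is not Xerox's technology or tooling: the two-word lexicon, the "+N+SG" and "+N+PL" tags and the function names are all assumptions made for illustration. The transducer reads a word one character at a time and emits the stem plus a grammatical tag.

```python
def build_toy_transducer(stems):
    """Build a small deterministic transducer from a list of noun stems.

    Stem letters are copied to the output; an optional final "s" is
    consumed silently and the plural tag is added on acceptance.
    Assumes no stem equals another stem plus "s" (illustrative only).
    """
    transitions = {}  # (state, input_char) -> (next_state, output_string)
    finals = {}       # accepting state -> tag appended on acceptance
    next_state = 1    # state 0 is the start state
    for stem in stems:
        state = 0
        for ch in stem:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = (next_state, ch)
                next_state += 1
            state = transitions[key][0]
        finals[state] = "+N+SG"
        plural_state = next_state
        next_state += 1
        transitions[(state, "s")] = (plural_state, "")
        finals[plural_state] = "+N+PL"
    return transitions, finals


def analyze(word, transitions, finals):
    """Run the transducer over a word; return the analysis or None."""
    state, output = 0, []
    for ch in word:
        step = transitions.get((state, ch))
        if step is None:
            return None  # no transition: the word is not in the toy lexicon
        state, emitted = step
        output.append(emitted)
    if state not in finals:
        return None  # stopped in a non-accepting state
    return "".join(output) + finals[state]


# Hypothetical two-word lexicon.
TRANSITIONS, FINALS = build_toy_transducer(["cat", "dog"])
print(analyze("cats", TRANSITIONS, FINALS))  # cat+N+PL
print(analyze("dog", TRANSITIONS, FINALS))   # dog+N+SG
```

Each lookup takes a number of steps proportional to the length of the word, whatever the size of the lexicon, which is the kind of time efficiency the paragraph above refers to.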

The theoretical foundations of finite state technologies have been developed to a high level of sophistication in the past two decades. It is now regarded as an established fact, for example, that finite state models are suitable for modelling broad areas of syntax, particularly in spoken language. To expand the range of applications, a new weighted version of the technology (WFSC) is currently being developed, with results that promise to confirm Xerox’s position as world leader in the field.

Machine learning can be described as the study of computer algorithms that improve automatically through experience. In other words, given examples, a system learns by itself to perform tasks automatically, in particular tasks that have traditionally been performed manually. At XRCE, machine learning is applied to textual information access, opening up new options for processing document collections. The XRCE Categorizer is one such application: given a few manually classified files, the system quickly learns by itself how to classify documents hierarchically into existing categories. It can also learn entirely new categories on its own by detecting emerging topics in incoming documents and suggesting new categories to the user.
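The idea of learning to classify from a few labelled examples can be sketched with a simple naive Bayes text classifier. This is a generic textbook technique, not a description of how the XRCE Categorizer works, and it handles only flat categories rather than a hierarchy. The training texts and the category names "finance" and "support" are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per category from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # words never seen in training are ignored
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually classified examples.
EXAMPLES = [
    ("invoice payment overdue account", "finance"),
    ("quarterly budget revenue forecast", "finance"),
    ("printer toner paper jam", "support"),
    ("scanner driver install error", "support"),
]
MODEL = train(EXAMPLES)
print(classify("paper jam in the printer", MODEL))  # support
print(classify("budget forecast", MODEL))           # finance
```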

More information specifically on Categorizer is available in a Xerox press release, issued in February 2004 (refer to contact details below).

Parsing is the deconstruction of text into its syntactic parts (noun phrases, verbs, etc.) so that useful functional relations between them can be analysed and identified in large collections of text (e.g. web pages, document collections). Xerox is already at the stage of enabling machines to parse text for its meaning using its incremental parser (XIP) technology. XIP is designed to build robust analysers that tackle deeper linguistic aspects than those traditionally handled by the now widespread shallow parsing technologies. Parsing is an essential building block for natural language applications.
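As a rough illustration of the two steps described above (finding syntactic parts, then relations between them), the following sketch groups already-tagged words into noun phrases and then reads off subject-verb-object relations. It is a deliberately shallow toy, not XIP: the tag names "DET", "ADJ", "NOUN" and "VERB", the function names and the example sentence are assumptions, and the part-of-speech tags are supplied by hand.

```python
NP_TAGS = {"DET", "ADJ", "NOUN"}


def chunk(tagged_tokens):
    """Group (word, tag) pairs into ("NP", text) and ("VERB", text) chunks.

    A noun phrase is a maximal run of determiners, adjectives and nouns
    that contains at least one noun. Tokens with other tags are skipped.
    """
    chunks, current = [], []

    def flush():
        # Emit the pending run as a noun phrase only if it contains a noun.
        if any(tag == "NOUN" for _, tag in current):
            chunks.append(("NP", " ".join(word for word, _ in current)))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            flush()
            if tag == "VERB":
                chunks.append(("VERB", word))
    flush()
    return chunks


def subject_verb_object(chunks):
    """Return (subject, verb, object) for every NP VERB NP sequence."""
    triples = []
    for (k1, a), (k2, b), (k3, c) in zip(chunks, chunks[1:], chunks[2:]):
        if (k1, k2, k3) == ("NP", "VERB", "NP"):
            triples.append((a, b, c))
    return triples


# Hand-tagged example sentence (tags are assumptions for illustration).
TAGGED = [
    ("the", "DET"), ("parser", "NOUN"), ("analyses", "VERB"),
    ("large", "ADJ"), ("document", "NOUN"), ("collections", "NOUN"),
]
CHUNKS = chunk(TAGGED)
print(subject_verb_object(CHUNKS))
# [('the parser', 'analyses', 'large document collections')]
```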

Semantics is closely linked to parsing at XRCE, but it goes beyond syntactic structure by looking at meaning and concepts. Concepts and the relations between them are identified, and abstractions made, so that a ‘knowledge representation’ is built for specific applications, e.g. identifying “bank” as a financial institution rather than as the edge of a river. Semantic representations also lend themselves to inference, particularly the use of background knowledge to refine or extend the interpretation of a text.
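The “bank” example can be sketched as a very simple form of word sense disambiguation: each sense is given a small set of related words, and the sense whose words overlap most with the sentence is chosen. This is a simplified, Lesk-style heuristic used here purely for illustration, not XRCE's method, and the word lists below are invented stand-ins for real background knowledge.

```python
# Hypothetical background knowledge: words associated with each sense.
SENSE_SIGNATURES = {
    "financial institution": {"money", "loan", "account", "deposit", "interest"},
    "edge of a river": {"river", "water", "fishing", "shore", "boat"},
}


def disambiguate(sentence, signatures=SENSE_SIGNATURES):
    """Pick the sense whose signature shares the most words with the sentence."""
    cleaned = sentence.lower().replace(".", " ").replace(",", " ")
    context = set(cleaned.split())
    scores = {sense: len(context & words) for sense, words in signatures.items()}
    best = max(scores, key=scores.get)
    # With no overlapping words there is no evidence for either sense.
    return best if scores[best] > 0 else None


print(disambiguate("She opened an account at the bank to deposit money."))
# financial institution
print(disambiguate("They went fishing from the bank of the river."))
# edge of a river
```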

The content analysis research area generates technologies with practical business applications, both on its own and together with the other three XRCE research areas. Individual technologies that have been guided by XRCE from R&D concept to development and/or commercialisation - either through a Xerox business group or via a third-party organisation - are:

• Terminology Suite
• Inxight LinguistX
• XIP
• LIRIX
• BioTIP
• Categorizer
• CopyFinder

Separate, individual fact sheets are available on all of these technologies and the business issues they solve (see below).

For more information, please refer to www.xrce.xerox.com or contact...

Xerox Research Centre Europe (XRCE): Media Backgrounder (Document Structure Research Area)

Xerox Research Centre Europe (XRCE) is structured into four complementary research areas: content analysis; document structure; image processing; and work practice technology.

The document structure research area of XRCE is aligned to the increased adoption of extensible mark-up language (XML) by the IT and internet industries, and to the sheer potential of XML as a language of communication between disparate systems.

While the primary benefit of XML is in exchanging data, greater benefits can be gained in content and document management. First of all, XML is naturally suited to representing the logical structure of documents (e.g. titles, sections, chapters, paragraphs) independently of their visual rendering. More importantly, it can represent the semantics or meaning of documents (i.e. varied elements such as authors, dates, organisation or product names, financial data, copyright statements, legal warnings). This provides the potential for advanced, semantic-enabled search and data mining, but also for smart processes throughout the document lifecycle, including content reuse and repurposing, quality assurance and security. It is also a natural bridge between databases and content for document validation and updating.
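The distinction between logical structure and semantics can be made concrete with a small, invented XML document. The element names (report, meta, amount and so on) and all of the values are assumptions for illustration, not a Xerox schema. The sketch uses Python's standard xml.etree.ElementTree module to query first the structure and then the meaning-bearing elements.

```python
import xml.etree.ElementTree as ET

# Invented sample document: "section", "heading" and "paragraph" carry
# logical structure, while "author", "date" and "amount" carry semantics.
DOCUMENT = """\
<report>
  <title>Annual Review</title>
  <meta>
    <author>Example Author</author>
    <date>2004-07-22</date>
  </meta>
  <section>
    <heading>Results</heading>
    <paragraph>Revenue reached <amount currency="EUR">1200000</amount> this year.</paragraph>
  </section>
</report>
"""

root = ET.fromstring(DOCUMENT)

# Logical structure: list the section headings, independent of any layout.
headings = [h.text for h in root.iter("heading")]

# Semantics: pull out typed data that a search or data-mining tool could use.
author = root.findtext("meta/author")
amounts = [(a.get("currency"), int(a.text)) for a in root.iter("amount")]

print(headings)  # ['Results']
print(author)    # Example Author
print(amounts)   # [('EUR', 1200000)]
```

Because the financial figure is tagged rather than buried in running text, it can be found, validated or updated without any knowledge of how the paragraph is rendered.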

However, the challenges of creating new documents automatically in XML, and of converting legacy documents to XML, remain. XRCE is developing and combining new methods for Legacy Document Conversion, where the research addresses the three faces of a structured document: layout, logical structure and semantics. The second research theme in this area is XML Schema management, where researchers are addressing ways to link together different XML stores, and to repurpose and reformulate XML documents in order to enable “smart processes”.

The document structure research area generates technologies with practical business applications, both on its own and together with the other three XRCE research areas. It combines expertise in machine learning, document mining and clustering, querying and visualization, and hybrid methods for document acquisition. One technology that has been guided by XRCE from R&D concept to development and commercialisation is the SmartTagger, for which a separate individual fact sheet is available (see below).

For more information, please refer to www.xrce.xerox.com or contact...
