Text Analysis Pipelines: Towards Ad-hoc Large-Scale Text Mining by Henning Wachsmuth

This monograph proposes a comprehensive and fully automated approach to designing text analysis pipelines for arbitrary information needs that are optimal in terms of run-time efficiency and that robustly mine relevant information from text of any kind. Based on state-of-the-art techniques from machine learning and other areas of artificial intelligence, novel pipeline construction and execution algorithms are developed and implemented in prototypical software. Formal analyses of the algorithms and extensive empirical experiments underline that the proposed approach represents an essential step towards the ad-hoc use of text mining in web search and big data analytics.
Both web search and big data analytics aim to fulfill people's needs for information in an ad-hoc manner. The information sought is often hidden in large amounts of natural language text. Instead of simply returning links to potentially relevant texts, leading search and analytics engines have started to directly mine relevant information from the texts. To this end, they execute text analysis pipelines that may consist of several complex information-extraction and text-classification stages. Due to practical requirements of efficiency and robustness, however, the use of text mining has so far been limited to anticipated information needs that can be fulfilled with rather simple, manually constructed pipelines.

Example text

As a consequence, all text analysis algorithms need to resolve ambiguities (Jurafsky and Martin 2009). Without sufficient context, a correct analysis is hence often hard and can even be impossible. “[…]” alone leaves it undecidable whether it refers to a fruit or to a company. Technically, natural language processing can be seen as the production of annotations (Ferrucci and Lally 2004). An annotation marks a text or a span of text that represents an instance of a particular type of information. We discuss the role of annotations more extensively in Sect.
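The notion of an annotation described above can be illustrated with a minimal sketch. The class name `Annotation`, its fields, the type labels, and the sample sentence are assumptions chosen for illustration; they are not taken from the book or from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span of text (hypothetical structure, for illustration only)."""
    begin: int  # offset of the first character of the span
    end: int    # offset one past the last character of the span
    type: str   # kind of information the span is taken to represent

    def covered_text(self, text: str) -> str:
        """Return the part of the text that this annotation marks."""
        return text[self.begin:self.end]


text = "Apple released a new phone."

# The same span admits two readings; only context can decide between them.
as_company = Annotation(0, 5, "Organization")
as_fruit = Annotation(0, 5, "Fruit")

# An annotation may also mark the whole text rather than a span within it.
sentence = Annotation(0, len(text), "Sentence")
```

Both candidate annotations cover the same characters and differ only in their type, which is one way to picture why ambiguity has to be resolved by the analysis rather than by the span itself.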

1. Information retrieval. Gather input texts that are potentially relevant for the given task.
2. Natural language processing.
3. Data mining. Discover patterns in the structured information that has been inferred from the texts.

Hearst (1999) points out that the main aspects of text mining are actually the same as those studied in empirical computational linguistics. Although computational linguistics focuses on natural language processing, some of the problems it is concerned with are also addressed in information retrieval and data mining, such as text classification or machine learning.
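The three steps listed above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the function names, the keyword filter used for retrieval, the capitalized-word heuristic standing in for natural language processing, and the frequency count standing in for data mining. Real systems use far more elaborate methods at each step.

```python
import re
from collections import Counter


def retrieve(texts, keyword):
    """Step 1: keep texts that are potentially relevant (naive keyword match)."""
    return [t for t in texts if keyword.lower() in t.lower()]


def analyze(text):
    """Step 2: derive structured information (toy heuristic: capitalized words)."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def mine(records):
    """Step 3: discover patterns in the structured output (here: frequencies)."""
    return Counter(entity for record in records for entity in record)


def text_mining(texts, keyword):
    """Run the three steps in sequence on a collection of texts."""
    relevant = retrieve(texts, keyword)
    records = [analyze(t) for t in relevant]
    return mine(records)


corpus = [
    "Apple opened a store in Berlin.",
    "The weather was mild today.",
    "Berlin hosts a store run by Apple.",
]
patterns = text_mining(corpus, "store")
```

Each step consumes the output of the previous one, so an irrelevant text that is filtered out during retrieval never reaches the later, more expensive steps.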

In addition, some parts of this book represent original contributions that have not been published before, as pointed out where given.

Chapter 2 Text Analysis Pipelines

I put my heart and my soul into my work, and have lost my mind in the process. – Vincent van Gogh

Abstract The understanding of natural language is one of the primary abilities that provide the basis for human intelligence. Since the invention of computers, people have thought about how to operationalize this ability in software applications (Jurafsky and Martin 2009).

