TIKA 1.1: property extraction extension

From EjnTricks

This article presents a modification made to TIKA in order to extract only the properties of PDF and Office files, without going through the document content.

The source code for this study is available at http://www.jouvinio.net/svn/study/branches/Tika/1.1.


Implementation

New context argument

A new class is introduced to specify, in the parsing context, whether the content must be skipped ("escaped" in the code) or not. The modification thus keeps the standard behaviour available.

The EscapeContentArg class is written as follows.

package fr.ejn.tutorial.tika.parser;

import org.apache.tika.parser.ParseContext;

/**
 * Flag indicating that document parsing should be restricted to metadata. An instance of this class is meant to be placed in the ParseContext.
 * 
 * @author Etienne Jouvin
 * 
 */
public final class EscapeContentArg {

	/**
	 * Check the EscapeContentArg value from a ParseContext instance.
	 * 
	 * @param context ParseContext instance to parse.
	 * @return True if the argument is not null in the ParseContext instance, and if the escapeContent value is set to true.
	 */
	public static boolean checkEscapeContent(ParseContext context) {
		/* Get argument from the context. */
		EscapeContentArg escapeContentArg = context.get(EscapeContentArg.class);

		/* If argument is not null and the value is set to true in the argument, return true. */
		return null != escapeContentArg && escapeContentArg.isEscapeContent();
	}

	private boolean escapeContent;

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

This class encapsulates the boolean value escapeContent. Note the checkEscapeContent function, which determines from the parsing context (a ParseContext instance passed as argument) whether the content must be skipped during extraction. Since this check is performed in several parsers, it was worth factoring it out.
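The lookup performed by checkEscapeContent can be illustrated without TIKA on the classpath. The following is only a minimal sketch: SketchContext and SkipContentFlag are hypothetical stand-ins for ParseContext and EscapeContentArg, assumed here to behave like a map keyed by class. It shows why the boolean is wrapped in its own type (the type is the lookup key) and why an absent flag means standard behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-alone sketch; SketchContext and SkipContentFlag are hypothetical, not TIKA classes. */
class EscapeFlagSketch {

    /** Minimal stand-in for a class-keyed context such as ParseContext. */
    static final class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when the entry is missing.
            return key.cast(entries.get(key));
        }
    }

    /** Stand-in for EscapeContentArg: wraps the boolean so it can be stored under its own class. */
    static final class SkipContentFlag {
        private final boolean skipContent;

        SkipContentFlag(boolean skipContent) {
            this.skipContent = skipContent;
        }

        boolean isSkipContent() {
            return skipContent;
        }

        /** Null-safe check: a missing flag is treated as "parse the content as usual". */
        static boolean check(SketchContext context) {
            SkipContentFlag flag = context.get(SkipContentFlag.class);
            return flag != null && flag.isSkipContent();
        }
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        System.out.println(SkipContentFlag.check(context)); // prints false (flag absent)

        context.set(SkipContentFlag.class, new SkipContentFlag(true));
        System.out.println(SkipContentFlag.check(context)); // prints true
    }
}
```

Centralising the null check in one static function means each parser needs a single call and cannot forget the "flag absent" case.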


New parser

Information is extracted very simply with the org.apache.tika.parser.AutoDetectParser class. This class has been extended to place an EscapeContentArg instance in the parsing context just before extraction is triggered.

package fr.ejn.tutorial.tika.parser;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * TIKA AutoDetectParser extension used to set a flag for content parsing. By default, TIKA parses the content, which performs poorly when only the file properties are
 * wanted.
 * 
 * @author Etienne Jouvin
 * 
 */
public class AutoDetectContentFilterParser extends AutoDetectParser {

	private static final long serialVersionUID = 7440689727706539222L;
	private boolean escapeContent;

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 */
	public AutoDetectContentFilterParser() {
		super();
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 */
	public AutoDetectContentFilterParser(Detector detector) {
		super(detector);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Detector detector, Parser... parsers) {
		super(detector, parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Parser... parsers) {
		super(parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the given Tika configuration.
	 * 
	 * @param config TIKA configuration instance.
	 */
	public AutoDetectContentFilterParser(TikaConfig config) {
		super(config);
	}

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		/* Build the escape content argument. */
		EscapeContentArg escapeContentArg = new EscapeContentArg();
		escapeContentArg.setEscapeContent(escapeContent);

		/* Complete the context with the argument. */
		context.set(EscapeContentArg.class, escapeContentArg);

		/* Delegate to the standard implementation. */
		super.parse(stream, handler, metadata, context);
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

The escapeContent variable is copied into the EscapeContentArg instance, so this parser can be used for either the standard or the modified behaviour.

The most important change is in the parse method, where an EscapeContentArg instance is added to the parsing context before the standard code is called.
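The "set the flag, then delegate" pattern of this parse override can be sketched in isolation. BaseParser and FlagInjectingParser below are hypothetical stand-ins (a plain Map replaces ParseContext); the sketch only illustrates the control flow, not the TIKA API.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-alone sketch of the override pattern; none of these classes belong to TIKA. */
class InjectingParserSketch {

    /** Stand-in for the standard parser: it only reads the flag from the context. */
    static class BaseParser {
        String parse(Map<String, Object> context) {
            boolean skip = Boolean.TRUE.equals(context.get("skipContent"));
            return skip ? "metadata only" : "metadata and content";
        }
    }

    /** Stand-in for AutoDetectContentFilterParser: writes its own setting into the context first. */
    static class FlagInjectingParser extends BaseParser {
        private boolean skipContent;

        void setSkipContent(boolean skipContent) {
            this.skipContent = skipContent;
        }

        @Override
        String parse(Map<String, Object> context) {
            // Complete the context, then let the inherited code run unchanged.
            context.put("skipContent", skipContent);
            return super.parse(context);
        }
    }

    public static void main(String[] args) {
        FlagInjectingParser parser = new FlagInjectingParser();
        System.out.println(parser.parse(new HashMap<String, Object>())); // prints "metadata and content"

        parser.setSkipContent(true);
        System.out.println(parser.parse(new HashMap<String, Object>())); // prints "metadata only"
    }
}
```

As in the listing above, the flag is written on every call, so a value the caller may have placed in the context beforehand is overwritten by the parser's own setting.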


Office file parser

The org.apache.tika.parser.microsoft.OfficeParser parser is used by default for Office files (doc, xls, ppt, ...). It is extended so that the parse method drops all processing of the file content.

package fr.ejn.tutorial.tika.parser.microsoft;

import java.io.IOException;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.parser.microsoft.VisibleSummaryExtractor;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains
 * the escapeContent flag set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOfficeParser extends OfficeParser {

	private static final long serialVersionUID = -1575837099583857962L;

	/** {@inheritDoc} */
	@Override
	protected void parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			// Parse summary entries first, to make metadata available early
			/* Use the custom extension, because the standard one is visible only in package. */
			/* new SummaryExtractor(metadata).parseSummaries(root); */
			new VisibleSummaryExtractor(metadata).parseSummaries(root);

			// Parse remaining document entries
			POIFSDocumentType type = POIFSDocumentType.detectType(root);

			if (type != POIFSDocumentType.UNKNOWN) {
				setType(metadata, type.getType());
			}

			// Remove all code for content extraction.
		} else {
			super.parse(root, context, metadata, xhtml);
		}
	}

	/**
	 * Store the content type meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param type MediaType instance.
	 */
	private void setType(Metadata metadata, MediaType type) {
		metadata.set(Metadata.CONTENT_TYPE, type.toString());
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class selects either the standard code or the extension, which keeps only the property extraction performed by the org.apache.tika.parser.microsoft.VisibleSummaryExtractor instance. The latter extends the org.apache.tika.parser.microsoft.SummaryExtractor class, whose "package" visibility prevents it from being used directly from this extension.

package org.apache.tika.parser.microsoft;

import org.apache.tika.metadata.Metadata;

/**
 * SummaryExtractor extension to make it visible.
 * 
 * @author Etienne Jouvin
 * 
 */
public class VisibleSummaryExtractor extends SummaryExtractor {

	/**
	 * @param metadata Metadata completed during parsing.
	 */
	public VisibleSummaryExtractor(Metadata metadata) {
		super(metadata);
	}

}

The setType method has to be copied into ContentFilterOfficeParser because it is private in the extended class.


Office Open XML file parser

The org.apache.tika.parser.microsoft.ooxml.OOXMLParser parser is used by default for Office Open XML files (docx, xlsx, pptx, ...). It is extended so that the parse method drops all processing of the file content.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office Open XML parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains
 * the escapeContent flag set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLParser extends OOXMLParser {

	private static final long serialVersionUID = 5116030503623701915L;
	protected static final Set<MediaType> VISIBLE_UNSUPPORTED_OOXML_TYPES = OOXMLParser.UNSUPPORTED_OOXML_TYPES;

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			ContentFilterOOXMLExtractorFactory.parse(stream, handler, metadata, context);
		} else {
			/* Delegate to the standard implementation. */
			super.parse(stream, handler, metadata, context);
		}
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class selects either the standard code or a new property extractor, the ContentFilterOOXMLExtractorFactory class.

The ContentFilterOOXMLExtractorFactory class is a "copy" of the OOXMLExtractorFactory class from which the content processing has been removed.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.CloseShieldInputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractor;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory;
import org.apache.tika.parser.microsoft.ooxml.POIXMLTextExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator;
import org.apache.tika.parser.pkg.ZipContainerDetector;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Skip content extraction.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLExtractorFactory extends OOXMLExtractorFactory {

	/**
	 * Parse the stream to extract properties. This is nearly identical to the parse function of the parent class, with content extraction removed.
	 * 
	 * @param stream Stream to parse.
	 * @param baseHandler Content handler.
	 * @param metadata Metadata instance where properties will be stored.
	 * @param context Parsing context.
	 * @throws IOException Exception during reading.
	 * @throws SAXException Exception during XML parsing.
	 * @throws TikaException Other exception fired.
	 */
	public static void parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		Locale locale = context.get(Locale.class, Locale.getDefault());
		ExtractorFactory.setThreadPrefersEventExtractors(true);

		try {
			OOXMLExtractor extractor;
			OPCPackage pkg;

			// Open the OPCPackage for the file
			TikaInputStream tis = TikaInputStream.cast(stream);
			if (tis != null && tis.getOpenContainer() instanceof OPCPackage) {
				pkg = (OPCPackage) tis.getOpenContainer();
			} else if (tis != null && tis.hasFile()) {
				pkg = OPCPackage.open(tis.getFile().getPath());
			} else {
				InputStream shield = new CloseShieldInputStream(stream);
				pkg = OPCPackage.open(shield);
			}

			// Get the type, and ensure it's one we handle
			MediaType type = ZipContainerDetector.detectOfficeOpenXML(pkg);
			/* Use the constant from the ContentFilterOOXMLParser class, OOXMLParser.UNSUPPORTED_OOXML_TYPES is protected. */
			if (type == null || /* OOXMLParser.UNSUPPORTED_OOXML_TYPES */ContentFilterOOXMLParser.VISIBLE_UNSUPPORTED_OOXML_TYPES.contains(type)) {
				// Not a supported type, delegate to Empty Parser
				EmptyParser.INSTANCE.parse(stream, baseHandler, metadata, context);
				return;
			}
			metadata.set(Metadata.CONTENT_TYPE, type.toString());

			// Have the appropriate OOXML text extractor picked
			POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);

			/* A previous version of this extension patched the extractors because custom properties were not extracted. */
			/* This was fixed in version 1.1, */
			/* so the earlier custom extractors are no longer necessary. */
			POIXMLDocument document = poiExtractor.getDocument();
			if (poiExtractor instanceof XSSFEventBasedExcelExtractor) {
				extractor = new XSSFExcelExtractorDecorator(context, (XSSFEventBasedExcelExtractor) poiExtractor, locale);
			} else if (document == null) {
				throw new TikaException("Expecting UserModel based POI OOXML extractor with a document, but none found. " + "The extractor returned was a " + poiExtractor);
			} else if (document instanceof XMLSlideShow) {
				extractor = new XSLFPowerPointExtractorDecorator(context, (XSLFPowerPointExtractor) poiExtractor);
			} else if (document instanceof XWPFDocument) {
				extractor = new XWPFWordExtractorDecorator(context, (XWPFWordExtractor) poiExtractor);
			} else {
				extractor = new POIXMLTextExtractorDecorator(context, poiExtractor);
			}

			/* Remove All work on document content, just keep metadata extractor. */
			// We need to get the content first, but not end
			// the document just yet
			// EndDocumentShieldingContentHandler handler = new EndDocumentShieldingContentHandler(baseHandler);
			// extractor.getXHTML(handler, metadata, context);

			// Now we can get the metadata
			extractor.getMetadataExtractor().extract(metadata);

			// Then finish up
			// handler.reallyEndDocument();
		} catch (IllegalArgumentException e) {
			if (e.getMessage() != null && e.getMessage().startsWith("No supported documents found")) {
				throw new TikaException("TIKA-418: RuntimeException while getting content" + " for thmx and xps file types", e);
			} else {
				throw new TikaException("Error creating OOXML extractor", e);
			}
		} catch (InvalidFormatException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (OpenXML4JException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (XmlException e) {
			throw new TikaException("Error creating OOXML extractor", e);

		}
	}

}

Note the VISIBLE_UNSUPPORTED_OOXML_TYPES constant added to the ContentFilterOOXMLParser class, which gives access to the UNSUPPORTED_OOXML_TYPES constant of the org.apache.tika.parser.microsoft.ooxml.OOXMLParser class.


PDF file parser

The org.apache.tika.parser.pdf.PDFParser parser is used by default for PDF files. It is extended so that the parse method drops all processing of the file content.

package fr.ejn.tutorial.tika.parser.pdf;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Calendar;
import java.util.List;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.io.RandomAccess;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.CloseShieldInputStream;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.PagedText;
import org.apache.tika.metadata.Property;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.PasswordProvider;
import org.apache.tika.parser.pdf.PDFParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * PDF parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains the
 * escapeContent flag set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterPDFParser extends PDFParser {

	private static final long serialVersionUID = -7912065246613445882L;

	/**
	 * Store a Calendar meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param property Property to store.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, Property property, Calendar value) {
		if (value != null) {
			metadata.set(property, value.getTime());
		}
	}

	/**
	 * Store a Calendar meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, Calendar value) {
		if (value != null) {
			metadata.set(name, value.getTime().toString());
		}
	}

	/**
	 * Used when processing custom metadata entries, as PDFBox won't do the conversion for us in the way it does for the standard ones.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, COSBase value) {
		if (value instanceof COSArray) {
			for (COSBase v : ((COSArray) value).toList()) {
				addMetadata(metadata, name, v);
			}
		} else if (value instanceof COSString) {
			addMetadata(metadata, name, ((COSString) value).getString());
		} else {
			addMetadata(metadata, name, value.toString());
		}
	}

	/**
	 * Store a value meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, String value) {
		if (value != null) {
			metadata.add(name, value);
		}
	}

	/**
	 * Extract metadata from the PDF document and store it in the Metadata instance.
	 * 
	 * @param document Document to parse.
	 * @param metadata Metadata instance to complete.
	 * @throws TikaException TIKA exception on properties extraction.
	 */
	private void extractMetadata(PDDocument document, Metadata metadata) throws TikaException {
		PDDocumentInformation info = document.getDocumentInformation();
		metadata.set(PagedText.N_PAGES, document.getNumberOfPages());
		addMetadata(metadata, Metadata.TITLE, info.getTitle());
		addMetadata(metadata, Metadata.AUTHOR, info.getAuthor());
		addMetadata(metadata, Metadata.CREATOR, info.getCreator());
		addMetadata(metadata, Metadata.KEYWORDS, info.getKeywords());
		addMetadata(metadata, "producer", info.getProducer());
		addMetadata(metadata, Metadata.SUBJECT, info.getSubject());
		addMetadata(metadata, "trapped", info.getTrapped());
		try {
			addMetadata(metadata, "created", info.getCreationDate());
			addMetadata(metadata, Metadata.CREATION_DATE, info.getCreationDate());
		} catch (IOException e) {
			// Invalid date format, just ignore
		}
		try {
			Calendar modified = info.getModificationDate();
			addMetadata(metadata, Metadata.LAST_MODIFIED, modified);
		} catch (IOException e) {
			// Invalid date format, just ignore
		}

		// All remaining metadata is custom
		// Copy this over as-is
		List<String> handledMetadata = Arrays.asList(new String[] { "Author", "Creator", "CreationDate", "ModDate", "Keywords", "Producer", "Subject", "Title", "Trapped" });
		for (COSName key : info.getDictionary().keySet()) {
			String name = key.getName();
			if (!handledMetadata.contains(name)) {
				addMetadata(metadata, name, info.getDictionary().getDictionaryObject(key));
			}
		}
	}

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		/* Get the escape content argument from the context. */
		if (EscapeContentArg.checkEscapeContent(context)) {
			/* Do not extract information from the content, work only on properties. */
			/* All the code must be copied, because the parent class offers no entry point. */
			/* For the same reason, several functions that handle properties are duplicated. */
			PDDocument pdfDocument = null;
			TemporaryResources tmp = new TemporaryResources();

			try {
				// PDFBox can process entirely in memory, or can use a temp file
				// for unpacked / processed resources
				// Decide which to do based on if we're reading from a file or not already
				TikaInputStream tstream = TikaInputStream.cast(stream);
				if (tstream != null && tstream.hasFile()) {
					// File based, take that as a cue to use a temporary file
					RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
					pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
				} else {
					// Go for the normal, stream based in-memory parsing
					pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
				}

				if (pdfDocument.isEncrypted()) {
					String password = null;

					// Did they supply a new style Password Provider?
					PasswordProvider passwordProvider = context.get(PasswordProvider.class);
					if (passwordProvider != null) {
						password = passwordProvider.getPassword(metadata);
					}

					// Fall back on the old style metadata if set
					if (password == null && metadata.get(PASSWORD) != null) {
						password = metadata.get(PASSWORD);
					}

					// If no password is given, use an empty string as the default
					if (password == null) {
						password = "";
					}

					try {
						pdfDocument.decrypt(password);
					} catch (Exception e) {
						// Ignore
					}
				}
				metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
				extractMetadata(pdfDocument, metadata);
				// Update, do not parse pdf content.
				// PDF2XHTML.process(pdfDocument, handler, metadata, extractAnnotationText, enableAutoSpace, suppressDuplicateOverlappingText, sortByPosition);
			} finally {
				if (pdfDocument != null) {
					pdfDocument.close();
				}
				tmp.dispose();
			}
		} else {
			/* Delegate to the standard implementation. */
			super.parse(stream, handler, metadata, context);
		}
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class selects either the standard code or the duplicated version of it.
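The last step of extractMetadata, copying every information-dictionary entry that is not one of the standard keys, can be sketched with plain collections. This is only an illustration: a Map of strings stands in for the PDFBox dictionary and for the Metadata instance, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of the "copy custom entries as-is" step; plain maps replace PDFBox and TIKA types. */
class CustomMetadataSketch {

    /** Keys already handled explicitly, same list as in extractMetadata. */
    private static final List<String> HANDLED = Arrays.asList(
            "Author", "Creator", "CreationDate", "ModDate", "Keywords", "Producer", "Subject", "Title", "Trapped");

    /** Returns only the entries whose key is not a standard one; null values are ignored. */
    static Map<String, String> copyCustomEntries(Map<String, String> infoDictionary) {
        Map<String, String> custom = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : infoDictionary.entrySet()) {
            if (!HANDLED.contains(entry.getKey()) && entry.getValue() != null) {
                custom.put(entry.getKey(), entry.getValue());
            }
        }
        return custom;
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("Title", "Report");
        info.put("Department", "R&D");
        info.put("ModDate", "D:20120101");

        System.out.println(copyCustomEntries(info)); // prints {Department=R&D}
    }
}
```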


Execution

To make TIKA and these new parsers easier to use, the TikaExtractorImpl class was written.

package fr.ejn.tutorial.metadatas.impl.tika;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.log4j.Logger;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import fr.ejn.tutorial.metadatas.Extractor;
import fr.ejn.tutorial.tika.parser.AutoDetectContentFilterParser;

/**
 * File properties reader with TIKA implementation. By default TIKA reads the file content, which causes poor performance. A custom parser that skips content parsing is used instead, but it was
 * necessary to duplicate some code from the TIKA framework.
 * 
 * @author Etienne Jouvin
 * 
 */
public class TikaExtractorImpl implements Extractor {

	private static final Logger LOGGER = Logger.getLogger(TikaExtractorImpl.class);
	private String configResource;

	/**
	 * TIKA configuration used for the extraction.
	 */
	private TikaConfig tikaConfig;

	/**
	 * Default constructor.
	 */
	public TikaExtractorImpl() {
		configResource = null;
		tikaConfig = null;
	}

	/**
	 * Convert metadata instance to a Map.
	 * 
	 * @param metadata Instance to convert.
	 * @return Converted map.
	 */
	private Map<String, String[]> convertToMap(Metadata metadata) {
		/* names() never returns null, so no null check is needed. */
		String[] names = metadata.names();
		Map<String, String[]> properties = new HashMap<String, String[]>(names.length);

		for (String name : names) {
			properties.put(name, metadata.getValues(name));
		}

		return properties;
	}

	/** {@inheritDoc} */
	@Override
	public Map<String, String[]> extract(String filePath) {
		if (null == tikaConfig) {
			return null;
		}

		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = readMetadata(filePath);

		/* Convert and return the metadata instance into a Map of values. */
		return convertToMap(metadata);
	}

	/** {@inheritDoc} */
	@Override
	public void init() {
		InputStream resource;
		String resourceToLoad;

		if (StringUtils.isBlank(configResource)) {
			resource = null;
			resourceToLoad = null;
		} else {
			/* Try to load the configuration resource set. */
			resourceToLoad = configResource;
			resource = TikaExtractorImpl.class.getResourceAsStream(resourceToLoad);
		}

		/* Initialize the Tika configuration. */
		try {
			if (null == resource) {
				/* Use the default TikaConfig constructor. */
				/* Configuration from META-INF/services will be loaded. */
				/* Detectors and parsers are then sorted according to the class name. */
				/* Classes in package org.apache.tika always come first. */
				/* Custom classes in a different package will be loaded after Tika core. */
				/* To override a default parser, keep the package as org.apache.tika and make sure the String comparison on the class name */
				/* returns the new instance first. */
				tikaConfig = new TikaConfig();
			} else {
				/* A resource was loaded, use it to build the TikaConfig instance. */
				tikaConfig = new TikaConfig(resource);
			}
		} catch (IOException ioException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_ACCESS_EXCEPTION, resourceToLoad), ioException);
		} catch (SAXException saxException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_SAX_EXCEPTION, resourceToLoad), saxException);
		} catch (TikaException tikaException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_TIKA_EXCEPTION, resourceToLoad), tikaException);
		} finally {
			IOUtils.closeQuietly(resource);
		}
	}

	/**
	 * Read file properties.
	 * 
	 * @param filePath File to parse.
	 * @return All metadatas read from the file.
	 */
	private Metadata readMetadata(String filePath) {
		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = new Metadata();

		if (null != filePath) {
			InputStream input = null;
			try {
				input = TikaInputStream.get(new File(filePath), metadata);
				/* File loaded, can call the parser. */

				/* Create a parser, with auto format detection. */
				AutoDetectContentFilterParser parser = new AutoDetectContentFilterParser(tikaConfig);
				/* Set flag to escape content reading. */
				parser.setEscapeContent(true);

				/* Parse the content. Use a default handler, the map will be used directly. */
				parser.parse(input, new DefaultHandler(), metadata);
			} catch (IOException ioException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), ioException);
			} catch (SAXException saxException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), saxException);
			} catch (TikaException tikaException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_TIKA_EXCEPTION, filePath), tikaException);
			} finally {
				IOUtils.closeQuietly(input);
			}
		}

		return metadata;
	}

	/**
	 * Set the resource name.
	 * 
	 * @param configResource Configuration resource name to set.
	 */
	public void setConfigResource(String configResource) {
		this.configResource = configResource;
	}

}

Its operation is fairly simple. The init method must be called first, to initialise a TikaConfig instance. The properties of a file, identified by its path, can then be extracted with the extract function.
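Because extract returns null when no TikaConfig is available (init not called, or failed) and each property maps to an array of values, calling code may want a small null-safe accessor. The helper below is a hypothetical sketch that works on the Map<String, String[]> shape returned by extract; it is not part of the original sources, and the sample map is built by hand.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-alone sketch: null-safe access to a property map shaped like the result of extract. */
class ExtractResultSketch {

    /** Returns the first value of a property, or null when the map, the key or the values are missing. */
    static String firstValue(Map<String, String[]> properties, String name) {
        if (properties == null) {
            // Mirrors the case where extract was called without a usable configuration.
            return null;
        }
        String[] values = properties.get(name);
        return (values == null || values.length == 0) ? null : values[0];
    }

    public static void main(String[] args) {
        // Hand-built sample standing in for the result of extract.
        Map<String, String[]> properties = new HashMap<>();
        properties.put("Content-Type", new String[] { "application/pdf" });

        System.out.println(firstValue(properties, "Content-Type")); // prints application/pdf
        System.out.println(firstValue(null, "Content-Type"));       // prints null
    }
}
```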

Configurations

Both configuration modes, explained on the [Paramétrage TIKA] page, are set up for this study. In the sources, the files are located in the test resources directory.

XML configuration

This configuration mode specifies exactly which parsers to use. Its drawback is that the parser declarations shipped in the various jars have to be copied over.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
	<!-- There is no specific rule on the root node name. But see link https://issues.apache.org/jira/browse/TIKA-527, where there is an example with node properties -->
	<!-- 
		See in org.apache.tika.config.TikaConfig, for constructor TikaConfig(Element element).
		Parser check the nodes detector and if it contains the attribute class, 
		try to instantiate a class with the value of attribute.
		
		There are multiple detector files in the default implementation and they all work fine in this case, so there is no need to override them.
		The examples below are kept only as a reminder of how to configure them.
	-->
	<detectors>
		<!-- Configurations from file org.apache.tika.detect.Detector, in jar tika-parsers-1.1.jar -->
<!--
		<detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector" />
		<detector class="org.apache.tika.parser.pkg.ZipContainerDetector" />
-->
		<!-- Configurations from file org.apache.tika.detect.Detector, in jar vorbis-java-tika-0.1.jar -->
<!--
		<detector class="org.gagravarr.tika.OggDetector" />
-->
	</detectors>
	<!-- 
		See the TikaConfig(Element element) constructor in org.apache.tika.config.TikaConfig.
		It checks the mimeTypeRepository node and, if the node has a resource attribute,
		loads the MIME types through MimeTypesFactory from that attribute value, which should be a path.
	
		If the attribute is missing, the MIME types in TikaConfig are set by MimeTypes.getDefaultMimeTypes(),
		as is the case when TikaConfig.getDefaultConfig() is used.
	-->
	<mimeTypeRepository />
	<!--
		By default, the parser configuration is read from the file META-INF/services/org.apache.tika.parser.Parser,
		which is loaded by the DefaultParser constructor.
		
		Copy every parser from the original file into this XML and substitute the custom parsers.
	-->
	<parsers>
		<!-- Copy parsers from jar tika-parsers-1.1.jar -->
		<parser class="org.apache.tika.parser.asm.ClassParser" />
		<parser class="org.apache.tika.parser.audio.AudioParser" />
		<parser class="org.apache.tika.parser.audio.MidiParser" />
		<parser class="org.apache.tika.parser.dwg.DWGParser" />
		<parser class="org.apache.tika.parser.epub.EpubParser" />
		<parser class="org.apache.tika.parser.feed.FeedParser" />
		<parser class="org.apache.tika.parser.font.AdobeFontMetricParser" />
		<parser class="org.apache.tika.parser.font.TrueTypeParser" />
		<parser class="org.apache.tika.parser.html.HtmlParser" />
		<parser class="org.apache.tika.parser.image.ImageParser" />
		<parser class="org.apache.tika.parser.image.PSDParser" />
		<parser class="org.apache.tika.parser.image.TiffParser" />
		<parser class="org.apache.tika.parser.iwork.IWorkPackageParser" />
		<parser class="org.apache.tika.parser.jpeg.JpegParser" />
		<parser class="org.apache.tika.parser.mail.RFC822Parser" />
		<parser class="org.apache.tika.parser.mbox.MboxParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.OfficeParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ContentFilterOfficeParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.microsoft.TNEFParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ooxml.ContentFilterOOXMLParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.mp3.Mp3Parser" />
		<parser class="org.apache.tika.parser.mp4.MP4Parser" />
		<parser class="org.apache.tika.parser.hdf.HDFParser" />
		<parser class="org.apache.tika.parser.netcdf.NetCDFParser" />
		<parser class="org.apache.tika.parser.odf.OpenDocumentParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.pdf.PDFParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.pkg.PackageParser" />
		<parser class="org.apache.tika.parser.rtf.RTFParser" />
		<parser class="org.apache.tika.parser.txt.TXTParser" />
		<parser class="org.apache.tika.parser.video.FLVParser" />
		<parser class="org.apache.tika.parser.xml.DcXMLParser" />
		<parser class="org.apache.tika.parser.xml.FictionBookParser" />
		<parser class="org.apache.tika.parser.chm.ChmParser" />
		<!-- Copy parsers from jar vorbis-java-tika-0.1.jar -->
		<parser class="org.gagravarr.tika.FlacParser" />
		<parser class="org.gagravarr.tika.OggParser" />
		<parser class="org.gagravarr.tika.VorbisParser" />
	</parsers>
</properties>

In this configuration, the standard parsers are replaced by the new ones.
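
To check which parsers such a file actually declares, the XML can be read with the JDK DOM API. The following is only a minimal sketch: the class ParserConfigLister and its method are hypothetical helpers written for this illustration, not part of TIKA or of the study sources. It relies on the fact that a DOM lookup by tag name returns elements only, so commented-out parser declarations are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper (not part of TIKA): lists the parser classes declared in a configuration XML. */
public class ParserConfigLister {

	/** Returns the class attribute of every active parser element, in document order. */
	public static List<String> listParserClasses(String xml) {
		try {
			Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

			List<String> classNames = new ArrayList<String>();
			/* Only element nodes are returned: declarations inside XML comments are skipped. */
			NodeList parsers = document.getElementsByTagName("parser");
			for (int i = 0; i < parsers.getLength(); i++) {
				Element parser = (Element) parsers.item(i);
				classNames.add(parser.getAttribute("class"));
			}
			return classNames;
		} catch (ParserConfigurationException | SAXException | IOException exception) {
			throw new IllegalStateException("Unable to read the configuration", exception);
		}
	}

	public static void main(String[] args) {
		/* Reduced configuration: the standard PDF parser is commented out, the custom one is active. */
		String xml = "<properties><parsers>"
				+ "<!-- <parser class=\"org.apache.tika.parser.pdf.PDFParser\" /> -->"
				+ "<parser class=\"fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser\" />"
				+ "<parser class=\"org.apache.tika.parser.rtf.RTFParser\" />"
				+ "</parsers></properties>";
		for (String className : listParserClasses(xml)) {
			System.out.println(className);
		}
	}
}
```

Run against the reduced configuration in main, only the two active declarations are listed; the commented-out standard PDF parser does not appear.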

ini configuration

A file named org.apache.tika.parser.Parser is provided in the META-INF/services directory to register the new parsers.

#  Licensed to the Apache Software Foundation (ASF) under one or more
#  contributor license agreements.  See the NOTICE file distributed with
#  this work for additional information regarding copyright ownership.
#  The ASF licenses this file to You under the Apache License, Version 2.0
#  (the "License"); you may not use this file except in compliance with
#  the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

# Override the Office parser, default one is : 
# org.apache.tika.parser.microsoft.OfficeParser
fr.ejn.tutorial.tika.parser.microsoft.ContentFilterOfficeParser

# Override the OOXML document parser, default one is : 
# org.apache.tika.parser.microsoft.ooxml.OOXMLParser
fr.ejn.tutorial.tika.parser.microsoft.ooxml.ContentFilterOOXMLParser

# Override the PDF parser, default one is : 
# org.apache.tika.parser.pdf.PDFParser
fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser
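
The file above follows the usual layout of a services file: one class name per line, with everything after a # treated as a comment. As a rough sketch of how such a file is interpreted, the hypothetical class below (ServiceFileReader, written for this illustration only) applies the conventions documented for java.util.ServiceLoader. It is an assumption that the TIKA loader behaves the same way in every detail.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of TIKA): extracts the class names listed in a services file. */
public class ServiceFileReader {

	/** Returns the class names found in the content, ignoring comments and blank lines. */
	public static List<String> readProviderNames(String content) {
		List<String> names = new ArrayList<String>();
		BufferedReader reader = new BufferedReader(new StringReader(content));
		try {
			String line;
			while ((line = reader.readLine()) != null) {
				/* Everything after the first # on a line is a comment. */
				int commentStart = line.indexOf('#');
				if (commentStart >= 0) {
					line = line.substring(0, commentStart);
				}
				line = line.trim();
				if (!line.isEmpty()) {
					names.add(line);
				}
			}
		} catch (IOException ioException) {
			throw new IllegalStateException("Unable to read the services file content", ioException);
		}
		return names;
	}

	public static void main(String[] args) {
		String content = "# Override the PDF parser, default one is : \n"
				+ "# org.apache.tika.parser.pdf.PDFParser\n"
				+ "fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser\n";
		/* The two comment lines are skipped, only the custom parser class remains. */
		for (String name : readProviderNames(content)) {
			System.out.println(name);
		}
	}
}
```

Applied to the file shown above, this would keep only the three fr.ejn.tutorial.tika.parser class names; the default parser names appear only in comments and are therefore not registered by this file.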


Extracted variables

This section lists the variables extracted by the parsers set up in this article.

Office files

doc format:
Application-Name, Author, Character Count, Comments, Company,
Content-Length, Content-Type, Creation-Date, Edit-Time, Keywords,
Last-Author, Last-Save-Date, Page-Count, Revision-Number, Template,
Word-Count, custom:MyCustomDate, custom:MyCustomString, resourceName, subject,
title, xmpTPg:NPages

ppt format:
Author, Content-Length, Content-Type, Creation-Date, Edit-Time,
Last-Author, Last-Save-Date, Revision-Number, Slide-Count, Word-Count,
custom:MyCustomDate, custom:MyCustomString, custom:myCustomBoolean, custom:myCustomNumber, custom:myCustomSecondDate,
resourceName, title, xmpTPg:NPages

xls format:
Application-Name, Author, Content-Length, Content-Type, Creation-Date,
Last-Author, Last-Save-Date, custom:MyCustomDate, custom:MyCustomString, custom:myCustomBoolean,
custom:myCustomNumber, custom:myCustomSecondDate, resourceName


Office Open XML files

docx format:
Application-Name, Application-Version, Author, Character Count, Character-Count-With-Spaces,
Content-Length, Content-Type, Creation-Date, Keywords, Last-Author,
Last-Modified, Line-Count, Page-Count, Paragraph-Count, Revision-Number,
Template, Total-Time, Word-Count, creator, custom:MyCustomDate,
custom:MyCustomString, custom:myCustomBoolean, custom:myCustomNumber, custom:myCustomSecondDate, date,
description, publisher, resourceName, subject, title,
xmpTPg:NPages

pptx format:
Application-Version, Author, Content-Length, Content-Type, Creation-Date,
Last-Author, Last-Modified, Paragraph-Count, Presentation-Format, Revision-Number,
Slide-Count, Total-Time, Word-Count, creator, custom:MyCustomDate,
custom:MyCustomString, custom:myCustomBoolean, custom:myCustomNumber, custom:myCustomSecondDate, date,
resourceName, title, xmpTPg:NPages

xlsx format:
Application-Name, Application-Version, Content-Length, Content-Type, Creation-Date,
Last-Modified, custom:MyCustomDate, custom:MyCustomString, custom:myCustomBoolean, custom:myCustomNumber,
custom:myCustomSecondDate, date, protected, resourceName


PDF files

Author, Content-Length, Content-Type, Creation-Date, created,
creator, producer, resourceName, title, xmpTPg:NPages