TIKA 1.0: extension for property extraction

From EjnTricks

This article presents a modification applied to TIKA in order to extract only the properties of PDF and Office files, without reading through the document content.

The source code for this study is available at http://www.svn.jouvinio.net/study/branches/Tika/1.0.




Implementation

New context argument

A new class is introduced to specify, in the extraction context, whether the document content must be skipped ("escaped") or not. The modification thus leaves the standard behavior available.

The EscapeContentArg class is written as follows.

package fr.ejn.tutorial.tika.parser;

import org.apache.tika.parser.ParseContext;

/**
 * Flag indicating that document parsing should be limited to metadata. An instance of this class should be set in the ParseContext.
 * 
 * @author Etienne Jouvin
 * 
 */
public final class EscapeContentArg {

	/**
	 * Check the EscapeContentArg value from a ParseContext instance.
	 * 
	 * @param context ParseContext instance to parse.
	 * @return True if the argument is not null in the ParseContext instance, and if the escapeContent value is set to true.
	 */
	public static boolean checkEscapeContent(ParseContext context) {
		/* Get argument from the context. */
		EscapeContentArg escapeContentArg = context.get(EscapeContentArg.class);

		/* If argument is not null and the value is set to true in the argument, return true. */
		return null != escapeContentArg && escapeContentArg.isEscapeContent();
	}

	private boolean escapeContent;

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

This class encapsulates the boolean value escapeContent. Note the checkEscapeContent function, which determines from the extraction context (a ParseContext instance passed as argument) whether the content must be skipped during extraction. Since this check is performed in several parsers, it was worth factoring it out into a single place.
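
The lookup performed by checkEscapeContent can be illustrated without TIKA on the classpath. The sketch below is only an illustration under stated assumptions: TypedContext is a hypothetical stand-in for ParseContext that stores one argument per class, and EscapeFlag is a hypothetical stand-in for EscapeContentArg. It shows why a context without the argument keeps the standard behavior.

```java
import java.util.HashMap;
import java.util.Map;

class EscapeFlagDemo {
	public static void main(String[] args) {
		TypedContext context = new TypedContext();
		// No argument in the context: the check reports false.
		System.out.println(EscapeFlag.check(context));

		EscapeFlag flag = new EscapeFlag();
		flag.setEscapeContent(true);
		context.set(EscapeFlag.class, flag);
		// Argument present and set to true: the check reports true.
		System.out.println(EscapeFlag.check(context));
	}
}

/** Hypothetical stand-in for ParseContext: stores one argument per class. */
class TypedContext {
	private final Map<Class<?>, Object> values = new HashMap<>();

	<T> void set(Class<T> key, T value) {
		values.put(key, value);
	}

	<T> T get(Class<T> key) {
		// Class.cast returns null when no value was stored for the key.
		return key.cast(values.get(key));
	}
}

/** Hypothetical stand-in for EscapeContentArg. */
class EscapeFlag {
	private boolean escapeContent;

	boolean isEscapeContent() {
		return escapeContent;
	}

	void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

	/** Same null-safe test as checkEscapeContent in the article. */
	static boolean check(TypedContext context) {
		EscapeFlag flag = context.get(EscapeFlag.class);
		return flag != null && flag.isEscapeContent();
	}
}
```

Because an absent argument reads as false, parsers that perform this check keep their default behavior when called with a context that was never given the argument.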


New parser

Information is extracted very simply with the org.apache.tika.parser.AutoDetectParser class. This class has been extended to place an EscapeContentArg instance in the extraction context just before triggering the extraction.

package fr.ejn.tutorial.tika.parser;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * TIKA AutoDetectParser extension used to set a flag for content parsing. By default, the content is parsed by TIKA, which hurts performance when only the file properties
 * are wanted.
 * 
 * @author Etienne Jouvin
 * 
 */
public class AutoDetectContentFilterParser extends AutoDetectParser {

	private static final long serialVersionUID = 7440689727706539222L;
	private boolean escapeContent;

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 */
	public AutoDetectContentFilterParser() {
		super();
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 */
	public AutoDetectContentFilterParser(Detector detector) {
		super(detector);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Detector detector, Parser... parsers) {
		super(detector, parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Parser... parsers) {
		super(parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the given Tika configuration.
	 * 
	 * @param config TIKA configuration instance.
	 */
	public AutoDetectContentFilterParser(TikaConfig config) {
		super(config);
	}

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		/* Build the escape content argument. */
		EscapeContentArg escapeContentArg = new EscapeContentArg();
		escapeContentArg.setEscapeContent(escapeContent);

		/* Complete the context with the argument. */
		context.set(EscapeContentArg.class, escapeContentArg);

		/* Let the super method do the work. */
		super.parse(stream, handler, metadata, context);
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

The escapeContent field is copied into the EscapeContentArg instance. This parser can therefore be used for either the standard or the modified behavior.

The most important change is in the parse method: an EscapeContentArg instance is added to the extraction context before the standard code is called.
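
One consequence of this parse method is that the flag is written into the context unconditionally, so the parser's own setting replaces any argument the caller may have placed there. The sketch below illustrates this "inject, then delegate" pattern without TIKA. It is a simplification under stated assumptions: a plain Map keyed by class plays the role of ParseContext, a Boolean plays the role of EscapeContentArg, and the hypothetical BaseParser merges the roles of AutoDetectParser and of the downstream parser that reads the flag.

```java
import java.util.HashMap;
import java.util.Map;

class FlagInjectionDemo {
	public static void main(String[] args) {
		FlagInjectingParser parser = new FlagInjectingParser();
		parser.setEscapeContent(true);

		Map<Class<?>, Object> context = new HashMap<>();
		System.out.println(parser.parse(context));
	}
}

/** Hypothetical base parser: reads the flag and reports which path it takes. */
class BaseParser {
	String parse(Map<Class<?>, Object> context) {
		Object flag = context.get(Boolean.class);
		return Boolean.TRUE.equals(flag) ? "metadata only" : "metadata and content";
	}
}

/** Hypothetical counterpart of AutoDetectContentFilterParser. */
class FlagInjectingParser extends BaseParser {
	private boolean escapeContent;

	void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

	@Override
	String parse(Map<Class<?>, Object> context) {
		// Always overwrite the flag in the context, then delegate.
		context.put(Boolean.class, escapeContent);
		return super.parse(context);
	}
}
```

With this design the choice between the two behaviors belongs to the parser instance rather than to the caller's context, which keeps calling code unchanged apart from one setEscapeContent call.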


Parser for Office files

An implementation of the org.apache.tika.parser.microsoft.OfficeParser parser is used by default for Office files (doc, xls, ppt, ...). It is extended here to remove, in the parse method, all processing of the file content.

package fr.ejn.tutorial.tika.parser.microsoft;

import java.io.IOException;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.parser.microsoft.VisibleSummaryExtractor;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office parser extension. By default, the content is parsed by TIKA, which hurts performance when only the file properties are wanted. In this extension, if the context
 * contains the flag isEscapeContent set to true, the content is not parsed. This extension uses an extension of SummaryExtractor, VisibleSummaryExtractor, originally created
 * from work-in-progress code in the TIKA svn.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOfficeParser extends OfficeParser {

	private static final long serialVersionUID = -1575837099583857962L;

	/** {@inheritDoc} */
	@Override
	protected void parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			// Parse summary entries first, to make metadata available early.
			/* Use the custom extension, because the standard one is visible only in package. */
			/* new SummaryExtractor(metadata).parseSummaries(root); */
			new VisibleSummaryExtractor(metadata).parseSummaries(root);

			for (Entry entry : root) {
				POIFSDocumentType type = POIFSDocumentType.detectType(entry);

				if (type != POIFSDocumentType.UNKNOWN) {
					setType(metadata, type.getType());
				}

				// Remove all code for content extraction.
			}
		} else {
			super.parse(root, context, metadata, xhtml);
		}
	}

	/**
	 * Store the content type meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param type MediaType instance.
	 */
	private void setType(Metadata metadata, MediaType type) {
		metadata.set(Metadata.CONTENT_TYPE, type.toString());
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class selects either the standard code or the extension, which keeps only the property extraction performed by an org.apache.tika.parser.microsoft.VisibleSummaryExtractor instance. The latter extends the org.apache.tika.parser.microsoft.SummaryExtractor class, whose visibility is "package" and which therefore cannot be used directly from this extension. This is why VisibleSummaryExtractor is declared in the org.apache.tika.parser.microsoft package.

package org.apache.tika.parser.microsoft;

import org.apache.tika.metadata.Metadata;

/**
 * SummaryExtractor extension to make it visible.
 * 
 * @author Etienne Jouvin
 * 
 */
public class VisibleSummaryExtractor extends SummaryExtractor {

	/**
	 * @param metadata Metadata completed during parsing.
	 */
	public VisibleSummaryExtractor(Metadata metadata) {
		super(metadata);
	}

}

The setType method of ContentFilterOfficeParser has to be copied because it is private in the extended OfficeParser class.


Parser for Office Open XML files

An implementation of the org.apache.tika.parser.microsoft.ooxml.OOXMLParser parser is used by default for Office Open XML files (docx, xlsx, pptx, ...). It is extended here to remove, in the parse method, all processing of the file content.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office Open XML parser extension. By default, the content is parsed by TIKA, which hurts performance when only the file properties are wanted. In this extension, if the
 * context contains the flag isEscapeContent set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLParser extends OOXMLParser {

	private static final long serialVersionUID = -710246317651988474L;

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			ContentFilterOOXMLExtractorFactory.parse(stream, handler, metadata, context);
		} else {
			/* Let the super method do the work. */
			super.parse(stream, handler, metadata, context);
		}
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class selects either the standard code or a new property extractor, the ContentFilterOOXMLExtractorFactory class.

The ContentFilterOOXMLExtractorFactory class is a "copy" of the OOXMLExtractorFactory class from which the content processing has been removed.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xslf.XSLFSlideShow;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.CloseShieldInputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractor;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Skip content extraction.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLExtractorFactory extends OOXMLExtractorFactory {

	/**
	 * Parse the stream to extract properties. This is nearly the same as the function in the parent class, but the extraction of the content is removed.
	 * 
	 * @param stream Stream to parse.
	 * @param baseHandler Content handler.
	 * @param metadata Metadata instance where properties will be stored.
	 * @param context Parsing context.
	 * @throws IOException Exception during reading.
	 * @throws SAXException Exception during XML parsing.
	 * @throws TikaException Other exception fired.
	 */
	public static void parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		Locale locale = context.get(Locale.class, Locale.getDefault());
		ExtractorFactory.setThreadPrefersEventExtractors(true);

		try {
			OOXMLExtractor extractor;

			POIXMLTextExtractor poiExtractor;
			TikaInputStream tis = TikaInputStream.cast(stream);
			if (tis != null && tis.getOpenContainer() instanceof OPCPackage) {
				poiExtractor = ExtractorFactory.createExtractor((OPCPackage) tis.getOpenContainer());
			} else if (tis != null && tis.hasFile()) {
				poiExtractor = (POIXMLTextExtractor) ExtractorFactory.createExtractor(tis.getFile());
			} else {
				InputStream shield = new CloseShieldInputStream(stream);
				poiExtractor = (POIXMLTextExtractor) ExtractorFactory.createExtractor(shield);
			}

			POIXMLDocument document = poiExtractor.getDocument();
			if (poiExtractor instanceof XSSFEventBasedExcelExtractor) {
				// Made a copy of future excel extractor, and make an extension of it to customize the metadata parser.
				extractor = new MetaFixedXSSFExcelExtractorDecorator(context, (XSSFEventBasedExcelExtractor) poiExtractor, locale);
			} else if (document == null) {
				throw new TikaException("Expecting UserModel based POI OOXML extractor with a document, but none found. " + "The extractor returned was a " + poiExtractor);
			} else if (document instanceof XSLFSlideShow) {
				// Did not check changes. We just want to extract metadatas and change the extractor.
				extractor = new MetaFixedXSLFPowerPointExtractorDecorator(context, (XSLFPowerPointExtractor) poiExtractor);
			} else if (document instanceof XWPFDocument) {
				// Did not check changes. We just want to extract metadatas and change the extractor.
				extractor = new MetaFixedXWPFWordExtractorDecorator(context, (XWPFWordExtractor) poiExtractor);
			} else {
				// No change in version 1.0. Use a new one where the metadata extraction is changed.
				extractor = new MetaFixedPOIXMLTextExtractorDecorator(context, poiExtractor);
			}

			/* Remove All work on document content, just keep metadata extractor. */
			// We need to get the content first, but not end
			// the document just yet
			// EndDocumentShieldingContentHandler handler = new EndDocumentShieldingContentHandler(baseHandler);
			// extractor.getXHTML(handler, metadata, context);

			// Now we can get the metadata
			extractor.getMetadataExtractor().extract(metadata);

			// Then finish up
			// handler.reallyEndDocument();
		} catch (IllegalArgumentException e) {
			if (e.getMessage() != null && e.getMessage().startsWith("No supported documents found")) {
				throw new TikaException("TIKA-418: RuntimeException while getting content" + " for thmx and xps file types", e);
			} else {
				throw new TikaException("Error creating OOXML extractor", e);
			}
		} catch (InvalidFormatException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (OpenXML4JException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (XmlException e) {
			throw new TikaException("Error creating OOXML extractor", e);

		}
	}

}

In this version of TIKA, custom properties are not extracted. It was therefore necessary to add new "decorator" classes that override the standard ones and use the specifically created extractor fr.ejn.tutorial.tika.parser.microsoft.ooxml.AllMetadataExtractor.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.math.BigDecimal;
import java.util.Calendar;

import org.apache.commons.lang3.BooleanUtils;
import org.apache.poi.POIXMLProperties.CustomProperties;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.MSOffice;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Property;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;
import org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperty;

/**
 * Extension of org.apache.tika.parser.microsoft.ooxml.MetadataExtractor, in which only identified properties are extracted. This extension also extracts all custom properties.
 * 
 * All methods <code>addProperty</code> are inspired from class <code>SummaryExtractor</code> or parent <code>MetadataExtractor</code>.
 * 
 * @author Etienne Jouvin
 * 
 */
public class AllMetadataExtractor extends MetadataExtractor {

	private final POIXMLTextExtractor extractor;

	/**
	 * Extractor constructor. Get the extractor from arguments, will be used during custom properties extraction.
	 * 
	 * @param extractor Extractor based on POI.
	 * @param type Mime type argument.
	 */
	public AllMetadataExtractor(POIXMLTextExtractor extractor, String type) {
		super(extractor, type);

		this.extractor = extractor;
	}

	/**
	 * Add a Calendar property. If value is null, do not add to the Metadata instance.
	 * 
	 * @param metadata Instance where extraction is stored.
	 * @param property Property extracted.
	 * @param value Value extracted.
	 */
	private void addProperty(Metadata metadata, Property property, Calendar value) {
		if (null != value) {
			/* Date format will be done by the instance of Metadata, should be an instance of DateFormattedMetadata. */
			metadata.set(property, value.getTime());
		}
	}

	/**
	 * Add a BigDecimal property. If value is null, do not add to the Metadata instance.
	 * 
	 * @param metadata Instance where extraction is stored.
	 * @param name Property name.
	 * @param value Value extracted.
	 */
	private void addProperty(Metadata metadata, String name, BigDecimal value) {
		if (null != value) {
			metadata.set(name, value.toString());
		}
	}

	/**
	 * Add a Double property. If value is null, do not add to the Metadata instance.
	 * 
	 * @param metadata Instance where extraction is stored.
	 * @param name Property name.
	 * @param value Value extracted.
	 */
	private void addProperty(Metadata metadata, String name, Double value) {
		if (null != value) {
			metadata.set(name, Double.toString(value));
		}
	}

	/**
	 * Add an Integer property. If value is null, do not add to the Metadata instance.
	 * 
	 * @param metadata Instance where extraction is stored.
	 * @param name Property name.
	 * @param value Value extracted.
	 */
	private void addProperty(Metadata metadata, String name, Integer value) {
		if (null != value) {
			metadata.set(name, Integer.toString(value));
		}
	}

	/**
	 * Add a String property. If value is null, do not add to the Metadata instance.
	 * 
	 * @param metadata Instance where extraction is stored.
	 * @param name Property name.
	 * @param value Value extracted.
	 */
	private void addProperty(Metadata metadata, String name, String value) {
		if (value != null) {
			metadata.set(name, value);
		}
	}

	/** {@inheritDoc} */
	@Override
	public void extract(Metadata metadata) throws TikaException {
		super.extract(metadata);

		// Added, read custom properties.
		if (extractor.getDocument() != null || (extractor instanceof XSSFEventBasedExcelExtractor && extractor.getPackage() != null)) {
			extractMetadata(extractor.getCustomProperties(), metadata);
		}
	}

	/**
	 * Add this method to read custom properties on document.
	 * 
	 * @param properties All custom properties.
	 * @param metadata Metadata to complete with read properties.
	 */
	private void extractMetadata(CustomProperties properties, Metadata metadata) {
		org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperties propsHolder = properties.getUnderlyingProperties();

		String propertyName;
		Property tmpProperty;

		for (CTProperty property : propsHolder.getPropertyList()) {
			propertyName = MSOffice.USER_DEFINED_METADATA_NAME_PREFIX + property.getName();

			/* Parse each property */
			if (property.isSetLpwstr()) {
				addProperty(metadata, propertyName, property.getLpwstr());
			} else if (property.isSetFiletime()) {
				tmpProperty = Property.externalDate(propertyName);
				addProperty(metadata, tmpProperty, property.getFiletime());
			} else if (property.isSetDate()) {
				tmpProperty = Property.externalDate(propertyName);
				addProperty(metadata, tmpProperty, property.getDate());
			} else if (property.isSetDecimal()) {
				addProperty(metadata, propertyName, property.getDecimal());
			} else if (property.isSetBool()) {
				addProperty(metadata, propertyName, BooleanUtils.toStringTrueFalse(property.getBool()));
			} else if (property.isSetInt()) {
				addProperty(metadata, propertyName, property.getInt());
			} else if (property.isSetLpstr()) {
				addProperty(metadata, propertyName, property.getLpstr());
			} else if (property.isSetI4()) {
				/* Number in Excel for example.... Why i4 ? Ask microsoft. */
				addProperty(metadata, propertyName, property.getI4());
			} else if (property.isSetR8()) {
				addProperty(metadata, propertyName, property.getR8());
			}
		}
	}

}
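
The effect of extractMetadata can be summarized as: each custom property that has a value is stored under a prefixed name, and properties without a value are skipped. The sketch below is a simplified illustration and not the POI or TIKA API: a plain Map stands in for both the custom properties and the Metadata instance, the hypothetical CustomPropertyMapper class replaces the isSetXxx dispatch with a single string conversion, and the "custom:" prefix is an assumption standing in for MSOffice.USER_DEFINED_METADATA_NAME_PREFIX. Date handling through Property.externalDate is left out.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

class CustomPropertyDemo {
	public static void main(String[] args) {
		Map<String, Object> custom = new LinkedHashMap<>();
		custom.put("Project", "Tika study");
		custom.put("Reviewed", Boolean.TRUE);
		custom.put("Budget", new BigDecimal("12.50"));
		custom.put("Unset", null);

		System.out.println(CustomPropertyMapper.toMetadata(custom));
	}
}

/** Hypothetical helper mirroring the null-skipping and prefixing of extractMetadata. */
class CustomPropertyMapper {

	/** Assumed prefix; stands in for MSOffice.USER_DEFINED_METADATA_NAME_PREFIX. */
	static final String PREFIX = "custom:";

	static Map<String, String> toMetadata(Map<String, Object> custom) {
		Map<String, String> metadata = new LinkedHashMap<>();
		for (Map.Entry<String, Object> entry : custom.entrySet()) {
			Object value = entry.getValue();
			// Like the addProperty methods, ignore properties without a value.
			if (value != null) {
				metadata.put(PREFIX + entry.getKey(), String.valueOf(value));
			}
		}
		return metadata;
	}
}
```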

Excel files

This case is harder because the "extractor" used walks through the sheets to detect the protected property. The extension therefore receives the standard instance and runs it before the specific processing. These files are handled by the "decorator" fr.ejn.tutorial.tika.parser.microsoft.ooxml.MetaFixedXSSFExcelExtractorDecorator, through the ExcelMetadataExtractor class.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import org.apache.poi.POIXMLTextExtractor;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;

/**
 * The standard metadata extractor adds a custom property for protected sheets, but it cannot be overridden. This extractor therefore calls the standard one
 * and then the new one, which also extracts custom properties.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ExcelMetadataExtractor extends MetadataExtractor {

	private MetadataExtractor excelMetadataExtractor;
	private AllMetadataExtractor metadataExtractor;

	/**
	 * Extractor for Excel. Create a specific extractor used to extract custom properties.
	 * 
	 * @param extractor Extractor based on POI.
	 * @param type Mime type argument.
	 * @param excelMetadataExtractor Metadata extractor for Excel.
	 */
	public ExcelMetadataExtractor(POIXMLTextExtractor extractor, String type, MetadataExtractor excelMetadataExtractor) {
		super(extractor, type);

		this.metadataExtractor = new AllMetadataExtractor(extractor, type);
		this.excelMetadataExtractor = excelMetadataExtractor;
	}

	/** {@inheritDoc} */
	@Override
	public void extract(Metadata metadata) throws TikaException {
		// Call standard metadata extractor. Can not extend it, because it used some internal fields.
		this.excelMetadataExtractor.extract(metadata);
		// Call the new one that will override already read metadatas.
		this.metadataExtractor.extract(metadata);
	}

}

The instance of this class is created in the fr.ejn.tutorial.tika.parser.microsoft.ooxml.MetaFixedXSSFExcelExtractorDecorator class.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.util.Locale;

import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;
import org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator;

/**
 * Modify the getMetadata function to use a custom one that will parse all custom properties.
 * 
 * @author Etienne Jouvin
 * 
 */
public class MetaFixedXSSFExcelExtractorDecorator extends XSSFExcelExtractorDecorator {

	/* Copy from parent class, used in getMetadataExtractor. */
	private static final String TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
	private final XSSFEventBasedExcelExtractor extractor;

	/**
	 * Decorator for Excel. Creates a specific extractor used to extract custom properties.
	 * 
	 * @param context Parsing context.
	 * @param extractor Metadata extractor for Excel.
	 * @param locale Locale used for extraction.
	 */
	public MetaFixedXSSFExcelExtractorDecorator(ParseContext context, XSSFEventBasedExcelExtractor extractor, Locale locale) {
		super(context, extractor, locale);

		// This field is private in parent class, can not get it during getMetadataExtractor.
		this.extractor = extractor;
	}

	/** {@inheritDoc} */
	@Override
	public MetadataExtractor getMetadataExtractor() {
		// Create a new metadata extractor. It will call the standard one, and then the custom one.
		return new ExcelMetadataExtractor(extractor, TYPE, super.getMetadataExtractor());
	}

}

The instance created by the getMetadataExtractor function of the parent class is passed as an argument to the ExcelMetadataExtractor instance. The other cases are much simpler.
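
The chaining used for Excel files can be sketched independently of TIKA. In the illustration below, the Extractor interface and the ChainedExtractor class are hypothetical stand-ins for MetadataExtractor and ExcelMetadataExtractor, and a Map stands in for the Metadata instance. It shows the two consequences of the ordering: values written only by the standard extractor are kept, and values written by both end up with the second extractor's result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ChainedExtractorDemo {
	public static void main(String[] args) {
		// Stand-in for the standard extractor, which also detects the protected property.
		Extractor standard = m -> {
			m.put("protected", "true");
			m.put("title", "from standard");
		};
		// Stand-in for the custom extractor, which adds custom properties.
		Extractor custom = m -> {
			m.put("title", "from custom");
			m.put("custom:project", "demo");
		};

		Map<String, String> metadata = new LinkedHashMap<>();
		new ChainedExtractor(standard, custom).extract(metadata);
		System.out.println(metadata);
	}
}

/** Hypothetical stand-in for MetadataExtractor. */
interface Extractor {
	void extract(Map<String, String> metadata);
}

/** Hypothetical stand-in for ExcelMetadataExtractor: runs two extractors in order. */
class ChainedExtractor implements Extractor {
	private final Extractor first;
	private final Extractor second;

	ChainedExtractor(Extractor first, Extractor second) {
		this.first = first;
		this.second = second;
	}

	@Override
	public void extract(Map<String, String> metadata) {
		first.extract(metadata);
		// Runs last, so it overrides any value already written by the first extractor.
		second.extract(metadata);
	}
}
```

Composition is used here rather than inheritance because, as the comment in ExcelMetadataExtractor notes, the standard Excel extractor relies on internal fields and cannot simply be extended.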

PowerPoint files

Only the "decorator" is modified, so that it creates an AllMetadataExtractor instance instead of the MetadataExtractor created by the parent class. These files are handled by the "decorator" fr.ejn.tutorial.tika.parser.microsoft.ooxml.MetaFixedXSLFPowerPointExtractorDecorator.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;
import org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator;

/**
 * Modify the getMetadata function to use a custom one that will parse all custom properties.
 * 
 * @author Etienne Jouvin
 * 
 */
public class MetaFixedXSLFPowerPointExtractorDecorator extends XSLFPowerPointExtractorDecorator {

	/**
	 * Extractor for Powerpoint. Create a specific extractor used to extract custom properties.
	 * 
	 * @param context Parsing context.
	 * @param extractor POI extractor for PowerPoint.
	 */
	public MetaFixedXSLFPowerPointExtractorDecorator(ParseContext context, XSLFPowerPointExtractor extractor) {
		super(context, extractor);
	}

	/** {@inheritDoc} */
	@Override
	public MetadataExtractor getMetadataExtractor() {
		// In parent class, type is an internal variable, built during the constructor.
		// In parent constructor (the one extended here), this type is sent as null.
		// So we use null here;
		return new AllMetadataExtractor(extractor, null);
	}

}

Word files

Only the "decorator" is modified, so that it creates an AllMetadataExtractor instance instead of the MetadataExtractor created by the parent class. These files are handled by the "decorator" fr.ejn.tutorial.tika.parser.microsoft.ooxml.MetaFixedXWPFWordExtractorDecorator.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;
import org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator;

/**
 * Modify the getMetadata function to use a custom one that will parse all custom properties.
 * 
 * @author Etienne Jouvin
 * 
 */
public class MetaFixedXWPFWordExtractorDecorator extends XWPFWordExtractorDecorator {

	private static final String TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

	/**
	 * Extractor for Word. Create a specific extractor used to extract custom properties.
	 * 
	 * @param context Parsing context.
	 * @param extractor Metadata extractor for Word.
	 */
	public MetaFixedXWPFWordExtractorDecorator(ParseContext context, XWPFWordExtractor extractor) {
		super(context, extractor);
	}

	/** {@inheritDoc} */
	@Override
	public MetadataExtractor getMetadataExtractor() {
		// In parent class, type is an internal variable, built during the constructor.
		// In parent constructor (the one extended here), this type is sent as "application/vnd.openxmlformats-officedocument.wordprocessingml.document".
		return new AllMetadataExtractor(extractor, TYPE);
	}

}

Other files

Only the "decorator" is modified, so that it creates an AllMetadataExtractor instance instead of the MetadataExtractor created by the parent class. These files are handled by the "decorator" fr.ejn.tutorial.tika.parser.microsoft.ooxml.MetaFixedPOIXMLTextExtractorDecorator.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import org.apache.poi.POIXMLTextExtractor;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.MetadataExtractor;
import org.apache.tika.parser.microsoft.ooxml.POIXMLTextExtractorDecorator;

/**
 * Modify the getMetadata function to use a custom one that will parse all custom properties.
 * 
 * @author Etienne Jouvin
 * 
 */
public class MetaFixedPOIXMLTextExtractorDecorator extends POIXMLTextExtractorDecorator {

	/**
	 * Extractor for POI XML document.
	 * 
	 * @param context Parsing context.
	 * @param extractor Extractor instance.
	 */
	public MetaFixedPOIXMLTextExtractorDecorator(ParseContext context, POIXMLTextExtractor extractor) {
		super(context, extractor);
	}

	/** {@inheritDoc} */
	@Override
	public MetadataExtractor getMetadataExtractor() {
		// In parent class, type is an internal variable, built during the constructor.
		// In parent constructor (the one extended here), this type is sent as null.
		// So we use null here;
		return new AllMetadataExtractor(extractor, null);
	}

}


Parser for PDF files

An implementation of the org.apache.tika.parser.pdf.PDFParser parser is used by default for PDF files. It is extended here to remove, in the parse method, all processing of the file content.

package fr.ejn.tutorial.tika.parser.pdf;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Calendar;
import java.util.List;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.io.RandomAccess;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.CloseShieldInputStream;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.PagedText;
import org.apache.tika.metadata.Property;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.PasswordProvider;
import org.apache.tika.parser.pdf.PDFParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * PDF parser extension. By default, the content is parsed by TIKA, which hurts performance when only the file properties are wanted. In this extension, if the context contains
 * the flag isEscapeContent set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterPDFParser extends PDFParser {

	private static final long serialVersionUID = -7912065246613445882L;

	/**
	 * Store a Calendar meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param property Property to store.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, Property property, Calendar value) {
		if (value != null) {
			metadata.set(property, value.getTime());
		}
	}

	/**
	 * Store a value for meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param property Property to store.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, Property property, String value) {
		if (value != null) {
			metadata.add(property, value);
		}
	}

	/**
	 * Store a Calendar meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, Calendar value) {
		if (value != null) {
			metadata.set(name, value.getTime().toString());
		}
	}

	/**
	 * Used when processing custom metadata entries, as PDFBox won't do the conversion for us in the way it does for the standard ones.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, COSBase value) {
		if (value instanceof COSArray) {
			for (COSBase v : ((COSArray) value).toList()) {
				addMetadata(metadata, name, v);
			}
		} else if (value instanceof COSString) {
			addMetadata(metadata, name, ((COSString) value).getString());
		} else {
			addMetadata(metadata, name, value.toString());
		}
	}

	/**
	 * Store a value meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param name Property name.
	 * @param value Property value to store.
	 */
	private void addMetadata(Metadata metadata, String name, String value) {
		if (value != null) {
			metadata.add(name, value);
		}
	}

	/**
	 * Extract and store metadata from the PDF document to Metadata instance.
	 * 
	 * @param document Document to parse.
	 * @param metadata Metadata instance to complete.
	 * @throws TikaException TIKA exception on properties extraction.
	 */
	private void extractMetadata(PDDocument document, Metadata metadata) throws TikaException {
		PDDocumentInformation info = document.getDocumentInformation();
		metadata.set(PagedText.N_PAGES, document.getNumberOfPages());
		addMetadata(metadata, TikaCoreProperties.TITLE, info.getTitle());
		addMetadata(metadata, TikaCoreProperties.CREATOR, info.getAuthor());
		addMetadata(metadata, TikaCoreProperties.CREATOR_TOOL, info.getCreator());
		addMetadata(metadata, TikaCoreProperties.KEYWORDS, info.getKeywords());
		addMetadata(metadata, "producer", info.getProducer());
		// TODO Move to description in Tika 2.0
		addMetadata(metadata, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT, info.getSubject());
		addMetadata(metadata, "trapped", info.getTrapped());
		try {
			// TODO Remove these in Tika 2.0
			addMetadata(metadata, "created", info.getCreationDate());
			addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
		} catch (IOException e) {
			// Invalid date format, just ignore
		}
		try {
			Calendar modified = info.getModificationDate();
			addMetadata(metadata, Metadata.LAST_MODIFIED, modified);
			addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
		} catch (IOException e) {
			// Invalid date format, just ignore
		}

		// All remaining metadata is custom
		// Copy this over as-is
		List<String> handledMetadata = Arrays.asList(new String[] { "Author", "Creator", "CreationDate", "ModDate", "Keywords", "Producer", "Subject", "Title", "Trapped" });
		for (COSName key : info.getDictionary().keySet()) {
			String name = key.getName();
			if (!handledMetadata.contains(name)) {
				addMetadata(metadata, name, info.getDictionary().getDictionaryObject(key));
			}
		}
	}

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			/* Do not extract information from the content, work only on properties. */
			/* The parent code must be copied, because it offers no entry point to skip the content. */
			/* For the same reason, several functions handling properties are duplicated. */
			PDDocument pdfDocument = null;
			TemporaryResources tmp = new TemporaryResources();

			try {
				// PDFBox can process entirely in memory, or can use a temp file
				// for unpacked / processed resources
				// Decide which to do based on if we're reading from a file or not already
				TikaInputStream tstream = TikaInputStream.cast(stream);
				if (tstream != null && tstream.hasFile()) {
					// File based, take that as a cue to use a temporary file
					RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
					pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
				} else {
					// Go for the normal, stream based in-memory parsing
					pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
				}

				if (pdfDocument.isEncrypted()) {
					String password = null;

					// Did they supply a new style Password Provider?
					PasswordProvider passwordProvider = context.get(PasswordProvider.class);
					if (passwordProvider != null) {
						password = passwordProvider.getPassword(metadata);
					}

					// Fall back on the old style metadata if set
					if (password == null && metadata.get(PASSWORD) != null) {
						password = metadata.get(PASSWORD);
					}

					// If no password is given, use an empty string as the default
					if (password == null) {
						password = "";
					}

					try {
						pdfDocument.decrypt(password);
					} catch (Exception e) {
						// Ignore
					}
				}
				metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
				extractMetadata(pdfDocument, metadata);
				// Update, do not parse pdf content.
				// PDF2XHTML.process(pdfDocument, handler, metadata, extractAnnotationText, enableAutoSpace, suppressDuplicateOverlappingText, sortByPosition);

				// extractEmbeddedDocuments(context, pdfDocument, handler);
			} finally {
				if (pdfDocument != null) {
					pdfDocument.close();
				}
				tmp.dispose();
			}
		} else {
			/* Let the super method do the work. */
			super.parse(stream, handler, metadata, context);
		}
	}

}

In the parse method, the call to the checkEscapeContent function of the EscapeContentArg class decides whether the standard code runs or its duplicated, properties-only variant.
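
The dispatch on the context flag can be pictured with a small stand-alone sketch. It deliberately does not use TIKA: Context, EscapeFlag, BaseParser and FilteringParser are hypothetical stand-ins for ParseContext, EscapeContentArg, PDFParser and ContentFilterPDFParser, reduced to the minimum so that the pattern runs on the JDK alone.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for ParseContext: stores one argument object per class. */
class Context {
	private final Map<Class<?>, Object> values = new HashMap<Class<?>, Object>();

	<T> void set(Class<T> key, T value) {
		values.put(key, value);
	}

	<T> T get(Class<T> key) {
		return key.cast(values.get(key));
	}
}

/** Hypothetical stand-in for EscapeContentArg. */
class EscapeFlag {
	final boolean escapeContent;

	EscapeFlag(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

	/** True only when the flag is present in the context and set to true. */
	static boolean check(Context context) {
		EscapeFlag flag = context.get(EscapeFlag.class);
		return null != flag && flag.escapeContent;
	}
}

/** Hypothetical stand-in for the standard parser: returns properties and content. */
class BaseParser {
	Map<String, String> parse(String title, String body, Context context) {
		Map<String, String> result = new LinkedHashMap<String, String>();
		result.put("title", title);
		result.put("content", body);
		return result;
	}
}

/** Hypothetical stand-in for the extension: skips the content when asked to. */
class FilteringParser extends BaseParser {
	@Override
	Map<String, String> parse(String title, String body, Context context) {
		if (EscapeFlag.check(context)) {
			/* Properties only: the content is never read. */
			Map<String, String> result = new LinkedHashMap<String, String>();
			result.put("title", title);
			return result;
		}
		/* Flag absent or false: standard behaviour. */
		return super.parse(title, body, context);
	}
}
```

With an empty Context, or with a flag set to false, FilteringParser behaves exactly like BaseParser. Only a flag set to true switches to the properties-only branch, which is why the standard behaviour is preserved for callers that know nothing about the flag.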


Execution

To make TIKA and these new parsers easier to use, the TikaExtractorImpl class was written.

package fr.ejn.tutorial.metadatas.impl.tika;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.log4j.Logger;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import fr.ejn.tutorial.metadatas.Extractor;
import fr.ejn.tutorial.tika.parser.AutoDetectContentFilterParser;

/**
 * File properties reader based on TIKA. By default TIKA reads the file content, which causes poor performance. A custom parser that skips content parsing is used instead, but it
 * was necessary to duplicate some code from the TIKA framework.
 * 
 * @author Etienne Jouvin
 * 
 */
public class TikaExtractorImpl implements Extractor {

	private static final Logger LOGGER = Logger.getLogger(TikaExtractorImpl.class);
	private String configResource;

	/**
	 * TIKA configuration used for the extraction.
	 */
	private TikaConfig tikaConfig;

	/**
	 * Default constructor.
	 */
	public TikaExtractorImpl() {
		configResource = null;
		tikaConfig = null;
	}

	/**
	 * Convert metadata instance to a Map.
	 * 
	 * @param metadata Instance to convert.
	 * @return Converted map.
	 */
	private Map<String, String[]> convertToMap(Metadata metadata) {
		/* names() never returns null, so no null check is needed. */
		String[] names = metadata.names();
		Map<String, String[]> properties = new HashMap<String, String[]>(names.length);

		for (String name : names) {
			properties.put(name, metadata.getValues(name));
		}

		return properties;
	}

	/** {@inheritDoc} */
	@Override
	public Map<String, String[]> extract(String filePath) {
		if (null == tikaConfig) {
			return null;
		}

		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = readMetadata(filePath);

		/* Convert and return the metadata instance into a Map of values. */
		return convertToMap(metadata);
	}

	/** {@inheritDoc} */
	@Override
	public void init() {
		InputStream resource;
		String resourceToLoad;

		if (StringUtils.isBlank(configResource)) {
			resource = null;
			resourceToLoad = null;
		} else {
			/* Try to load the configuration resource set. */
			resourceToLoad = configResource;
			resource = TikaExtractorImpl.class.getResourceAsStream(resourceToLoad);
		}

		/* Initialize the Tika configuration. */
		try {
			if (null == resource) {
				/* Use the default TikaConfig constructor. */
				/* Configuration from META-INF/services will be loaded. */
				/* Detectors and parsers are then sorted by class name. */
				/* Classes in package org.apache.tika always come first. */
				/* Custom classes in other packages are loaded after the Tika core ones. */
				/* To override a default parser, keep the org.apache.tika package and make sure the String comparison on class names */
				/* returns the new class first. */
				tikaConfig = new TikaConfig();
			} else {
				/* A resource was loaded, use it to build the TikaConfig instance. */
				tikaConfig = new TikaConfig(resource);
			}
		} catch (IOException ioException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_ACCESS_EXCEPTION, resourceToLoad), ioException);
		} catch (SAXException saxException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_SAX_EXCEPTION, resourceToLoad), saxException);
		} catch (TikaException tikaException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_TIKA_EXCEPTION, resourceToLoad), tikaException);
		} finally {
			IOUtils.closeQuietly(resource);
		}
	}

	/**
	 * Read file properties.
	 * 
	 * @param filePath File to parse.
	 * @return All metadatas read from the file.
	 */
	private Metadata readMetadata(String filePath) {
		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = new Metadata();

		if (null != filePath) {
			InputStream input = null;
			try {
				input = TikaInputStream.get(new File(filePath), metadata);
				/* File loaded, can call the parser. */

				/* Create a parser, with auto format detection. */
				AutoDetectContentFilterParser parser = new AutoDetectContentFilterParser(tikaConfig);
				/* Set flag to escape content reading. */
				parser.setEscapeContent(true);

				/* Parse the content. Use a default handler, the map will be used directly. */
				parser.parse(input, new DefaultHandler(), metadata);
			} catch (IOException ioException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), ioException);
			} catch (SAXException saxException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), saxException);
			} catch (TikaException tikaException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_TIKA_EXCEPTION, filePath), tikaException);
			} finally {
				IOUtils.closeQuietly(input);
			}
		}

		return metadata;
	}

	/**
	 * Set the resource name.
	 * 
	 * @param configResource Configuration resource name to set.
	 */
	public void setConfigResource(String configResource) {
		this.configResource = configResource;
	}

}

It works in a fairly simple way. The init method must be called first, in order to initialize a TikaConfig instance. The properties of a file, identified by its path, can then be extracted with the extract function. If init has not produced a TikaConfig instance, extract returns null.
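
The map returned by extract associates each property name with an array of values. As a minimal sketch of how a caller might render it, the helper below formats such a map as text. PropertyFormatter and its format method are hypothetical and not part of the sources of this study. The null check is there because extract can return null.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: renders the Map returned by an Extractor as text. */
class PropertyFormatter {

	/**
	 * @param properties Map as returned by extract, may be null.
	 * @return One "name = [values]" line per property, sorted by name.
	 */
	static String format(Map<String, String[]> properties) {
		if (null == properties) {
			/* Nothing was extracted, return an empty text. */
			return "";
		}

		StringBuilder builder = new StringBuilder();
		/* TreeMap gives a stable order (natural String order) for the output. */
		for (Map.Entry<String, String[]> entry : new TreeMap<String, String[]>(properties).entrySet()) {
			builder.append(entry.getKey()).append(" = ").append(Arrays.toString(entry.getValue())).append('\n');
		}
		return builder.toString();
	}
}
```

Assuming an initialized extractor, a caller would pass the result of extract straight to PropertyFormatter.format and print the returned text. Names are sorted with the natural String order, so capitalized names such as Author come before lowercase ones such as title.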

Configurations

Both configuration modes, described on the [Paramétrage TIKA] page, are set up for this study. In the sources, the files are located in the test resources directory.

XML configuration

This configuration mode specifies exactly which parsers to use. The drawback is that the parser declarations shipped in the various jars must be copied over.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
	<!-- There is no specific rule for the root node name; see https://issues.apache.org/jira/browse/TIKA-527 for an example that uses a properties node. -->
	<!-- 
		Since TIKA 1.0.
		See the constructor TikaConfig(Element element) in org.apache.tika.config.TikaConfig.
		The configuration reader checks each detector node and, if it has a class attribute,
		tries to instantiate the class named by that attribute.
		Only the DefaultDetector is declared below: it loads all the detectors listed in
		the file META-INF/services/org.apache.tika.detect.Detector.
		It is also the detector used by default,
		as is the case with the function TikaConfig.getDefaultConfig().
	-->
	<detectors>
		<!-- Declare the default detector. This one will load all default ones. -->
		<detector class="org.apache.tika.detect.DefaultDetector" />
	</detectors>
	<!-- 
		See the constructor TikaConfig(Element element) in org.apache.tika.config.TikaConfig.
		The configuration reader checks the mimeTypeRepository node and, if it has a resource attribute,
		loads the mime types through MimeTypesFactory from this attribute value, which should be a path.
	
		If the attribute is not found, the mime types in TikaConfig are set by the function MimeTypes.getDefaultMimeTypes(),
		as is the case when the function TikaConfig.getDefaultConfig() is used.
	-->
	<mimeTypeRepository />
	<!--
		By default, the parser configuration is read from the file META-INF/services/org.apache.tika.parser.Parser,
		loaded by the DefaultParser constructor.
		
		All parsers from the original file are copied into this XML, with the custom parsers substituted.
	-->
	<parsers>
		<parser class="org.apache.tika.parser.asm.ClassParser" />
		<parser class="org.apache.tika.parser.audio.AudioParser" />
		<parser class="org.apache.tika.parser.audio.MidiParser" />
		<parser class="org.apache.tika.parser.dwg.DWGParser" />
		<parser class="org.apache.tika.parser.epub.EpubParser" />
		<parser class="org.apache.tika.parser.feed.FeedParser" />
		<parser class="org.apache.tika.parser.font.TrueTypeParser" />
		<parser class="org.apache.tika.parser.html.HtmlParser" />
		<parser class="org.apache.tika.parser.image.ImageParser" />
		<parser class="org.apache.tika.parser.image.PSDParser" />
		<parser class="org.apache.tika.parser.image.TiffParser" />
		<parser class="org.apache.tika.parser.iwork.IWorkPackageParser" />
		<parser class="org.apache.tika.parser.jpeg.JpegParser" />
		<parser class="org.apache.tika.parser.mail.RFC822Parser" />
		<parser class="org.apache.tika.parser.mbox.MboxParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.OfficeParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ContentFilterOfficeParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.microsoft.TNEFParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ooxml.ContentFilterOOXMLParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.mp3.Mp3Parser" />
		<parser class="org.apache.tika.parser.hdf.HDFParser" />
		<parser class="org.apache.tika.parser.netcdf.NetCDFParser" />
		<parser class="org.apache.tika.parser.odf.OpenDocumentParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.pdf.PDFParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.pkg.PackageParser" />
		<parser class="org.apache.tika.parser.rtf.RTFParser" />
		<parser class="org.apache.tika.parser.txt.TXTParser" />
		<parser class="org.apache.tika.parser.video.FLVParser" />
		<parser class="org.apache.tika.parser.xml.DcXMLParser" />
		<parser class="org.apache.tika.parser.xml.FictionBookParser" />
		<parser class="org.apache.tika.parser.chm.ChmParser" />
	</parsers>
</properties>

In this configuration, the standard parsers are replaced by the new ones. Do not forget to declare the detectors as well, above all the org.apache.tika.detect.DefaultDetector class; otherwise most files are handled by an instance of org.apache.tika.parser.pkg.ZipContainerDetector.
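
Because a missing DefaultDetector is easy to overlook, a configuration file can be checked before it is handed to TikaConfig. The sketch below is one possible way to do so with the XML API of the JDK only. TikaConfigCheck and its methods are hypothetical names, and the check assumes the detector and parser nodes carry a class attribute as in the listing above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: inspects a TIKA XML configuration without loading TIKA. */
class TikaConfigCheck {

	/** Parse the XML text of a configuration into a DOM document. */
	static Document parse(String xml) {
		try {
			InputStream input = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
			return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input);
		} catch (Exception exception) {
			throw new IllegalStateException("Unreadable configuration", exception);
		}
	}

	/** Collect the class attribute of every node with the given tag name, in document order. */
	static List<String> classNames(Document document, String tagName) {
		List<String> names = new ArrayList<String>();
		NodeList nodes = document.getElementsByTagName(tagName);
		for (int i = 0; i < nodes.getLength(); i++) {
			names.add(((Element) nodes.item(i)).getAttribute("class"));
		}
		return names;
	}

	/** True when a detector node declares org.apache.tika.detect.DefaultDetector. */
	static boolean declaresDefaultDetector(Document document) {
		return classNames(document, "detector").contains("org.apache.tika.detect.DefaultDetector");
	}
}
```

The same classNames function, called with "parser", lists the declared parsers, which helps to confirm that each standard parser was replaced rather than declared alongside its extension.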


ini configuration

Using a file in the META-INF/services directory does not work for this version. The goal is to supply new parsers that override the existing ones, but the way this file is loaded makes it unpredictable which parser is used. This mode is therefore not used.


Extracted variables

This section lists the variables extracted by the parsers set up in this article.

Office files

doc format
Application-Name Author Character Count Comments Company
Content-Length Content-Type Creation-Date Edit-Time Keywords
Last-Author Last-Save-Date Page-Count Revision-Number Template
Word-Count custom:MyCustomDate custom:MyCustomString resourceName subject
title xmpTPg:NPages
ppt format
Author Content-Length Content-Type Creation-Date Edit-Time
Last-Author Last-Save-Date Revision-Number Slide-Count Word-Count
custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber custom:myCustomSecondDate
resourceName title xmpTPg:NPages
xls format
Application-Name Author Content-Length Content-Type Creation-Date
Last-Author Last-Save-Date custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean
custom:myCustomNumber custom:myCustomSecondDate resourceName


Office Open XML files

docx format
Application-Name Application-Version Author Character Count Character-Count-With-Spaces
Content-Length Content-Type Creation-Date Keywords Last-Author
Last-Modified Line-Count Page-Count Paragraph-Count Revision-Number
Template Total-Time Word-Count creator custom:MyCustomDate
custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber custom:myCustomSecondDate date
description publisher resourceName subject title
xmpTPg:NPages
pptx format
Application-Version Author Content-Length Content-Type Creation-Date
Last-Author Last-Modified Paragraph-Count Presentation-Format Revision-Number
Slide-Count Total-Time Word-Count creator custom:MyCustomDate
custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber custom:myCustomSecondDate date
resourceName title xmpTPg:NPages
xlsx format
Application-Name Application-Version Content-Length Content-Type Creation-Date
Last-Modified custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber
custom:myCustomSecondDate date protected resourceName


PDF files

Author Content-Length Content-Type Creation-Date created
creator producer resourceName title xmpTPg:NPages