TIKA 1.6: extending property extraction

From EjnTricks

This article presents a modification made to TIKA in order to extract only the properties of PDF and Office files, without walking through the document content. Few differences were found compared with the study carried out on version 1.5.

The source code for this study is available at http://www.jouvinio.net/svn/study/branches/Tika/1.6.




Implementation

New context argument

A new class is introduced to specify, in the parse context, whether the content must be skipped ("escaped") or not. The modification thus leaves the standard behaviour available.

The EscapeContentArg class is written as follows.

package fr.ejn.tutorial.tika.parser;

import org.apache.tika.parser.ParseContext;

/**
 * During document parsing, it may be done only on metadata. Instance of this class should be used in ParseContext.
 * 
 * @author Etienne Jouvin
 * 
 */
public final class EscapeContentArg {

	/**
	 * Check the EscapeContentArg value from a ParseContext instance.
	 * 
	 * @param context ParseContext instance to parse.
	 * @return True if the argument is not null in the ParseContext instance, and if the escapeContent value is set to true.
	 */
	public static boolean checkEscapeContent(ParseContext context) {
		/* Get argument from the context. */
		EscapeContentArg escapeContentArg = context.get(EscapeContentArg.class);

		/* If argument is not null and the value is set to true in the argument, return true. */
		return null != escapeContentArg && escapeContentArg.isEscapeContent();
	}

	private boolean escapeContent;

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

This encapsulates the boolean value escapeContent. Note the function checkEscapeContent, which tells from the parse context, a ParseContext instance passed as argument, whether the content must be skipped during extraction. Since this check is performed in several parsers, it was worth factoring it out into one place.
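The three cases handled by checkEscapeContent (no argument in the context, argument set to false, argument set to true) can be illustrated with a minimal sketch. It does not use TIKA: SketchContext and SketchEscapeArg are hypothetical stand-ins for ParseContext and EscapeContentArg, assumed here only to show a typed lookup with a null-safe default.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a typed parse context: one value per class key. */
final class SketchContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Returns null when nothing was registered for this key.
        return key.cast(values.get(key));
    }
}

/** Hypothetical stand-in for EscapeContentArg: wraps a single boolean. */
final class SketchEscapeArg {
    private boolean escapeContent;

    boolean isEscapeContent() {
        return escapeContent;
    }

    void setEscapeContent(boolean escapeContent) {
        this.escapeContent = escapeContent;
    }

    /** True only when an argument is present in the context and its flag is set. */
    static boolean check(SketchContext context) {
        SketchEscapeArg arg = context.get(SketchEscapeArg.class);
        return arg != null && arg.isEscapeContent();
    }
}

class EscapeArgDemo {
    /** Builds a context holding an argument with the given flag value. */
    static SketchContext contextWith(boolean flag) {
        SketchEscapeArg arg = new SketchEscapeArg();
        arg.setEscapeContent(flag);
        SketchContext context = new SketchContext();
        context.set(SketchEscapeArg.class, arg);
        return context;
    }

    public static void main(String[] args) {
        System.out.println(SketchEscapeArg.check(new SketchContext())); // false: no argument registered
        System.out.println(SketchEscapeArg.check(contextWith(false)));  // false
        System.out.println(SketchEscapeArg.check(contextWith(true)));   // true
    }
}
```

A context without the argument therefore behaves exactly like the standard extraction, which is what keeps the modification opt-in.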


New parser

Information is extracted very simply with the class org.apache.tika.parser.AutoDetectParser. This class has been extended to place an instance of EscapeContentArg in the parse context just before the extraction is triggered.

package fr.ejn.tutorial.tika.parser;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * TIKA AutoDetectParser extension used to set a flag for content parsing. By default, TIKA parses the content, which performs poorly when only the file properties are
 * wanted.
 * 
 * @author Etienne Jouvin
 * 
 */
public class AutoDetectContentFilterParser extends AutoDetectParser {

	private static final long serialVersionUID = 7440689727706539222L;
	private boolean escapeContent;

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 */
	public AutoDetectContentFilterParser() {
		super();
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 */
	public AutoDetectContentFilterParser(Detector detector) {
		super(detector);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param detector type detector.
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Detector detector, Parser... parsers) {
		super(detector, parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the default Tika configuration.
	 * 
	 * @param parsers Parsers instance used to read file properties.
	 */
	public AutoDetectContentFilterParser(Parser... parsers) {
		super(parsers);
	}

	/**
	 * Creates an auto-detecting parser instance using the given Tika configuration.
	 * 
	 * @param config TIKA configuration instance.
	 */
	public AutoDetectContentFilterParser(TikaConfig config) {
		super(config);
	}

	/**
	 * @return the escapeContent.
	 */
	public boolean isEscapeContent() {
		return escapeContent;
	}

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		/* Build the escape content argument. */
		EscapeContentArg escapeContentArg = new EscapeContentArg();
		escapeContentArg.setEscapeContent(escapeContent);

		/* Complete the context with the argument. */
		context.set(EscapeContentArg.class, escapeContentArg);

		/* Let the super method do the work. */
		super.parse(stream, handler, metadata, context);
	}

	/**
	 * @param escapeContent the escapeContent to set.
	 */
	public void setEscapeContent(boolean escapeContent) {
		this.escapeContent = escapeContent;
	}

}

The value of the escapeContent field is copied into the EscapeContentArg instance. This parser can therefore be used for either the standard or the modified behaviour.

The most important change is in the parse method: an instance of EscapeContentArg is added to the parse context before the standard code is called.
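As a rough illustration of this "publish the flag, then delegate" pattern, here is a self-contained sketch that does not depend on TIKA. BaseParserSketch and FlaggingParserSketch are hypothetical names, and the context is simplified to a plain map with a string key instead of the class-keyed ParseContext.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical base parser: reports which mode it sees in the context. */
class BaseParserSketch {
    static final String ESCAPE_KEY = "escapeContent";

    String parse(Map<String, Object> context) {
        boolean escape = Boolean.TRUE.equals(context.get(ESCAPE_KEY));
        return escape ? "metadata only" : "metadata and content";
    }
}

/** Hypothetical extension: publishes its own flag in the context, then delegates. */
class FlaggingParserSketch extends BaseParserSketch {
    private boolean escapeContent;

    void setEscapeContent(boolean escapeContent) {
        this.escapeContent = escapeContent;
    }

    @Override
    String parse(Map<String, Object> context) {
        // Complete the context with the flag before the standard code runs.
        context.put(ESCAPE_KEY, escapeContent);
        return super.parse(context);
    }
}

class FlaggingParserDemo {
    /** Runs one parse with the flag set as requested and returns the observed mode. */
    static String run(boolean escape) {
        FlaggingParserSketch parser = new FlaggingParserSketch();
        parser.setEscapeContent(escape);
        return parser.parse(new HashMap<>());
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // metadata and content
        System.out.println(run(true));  // metadata only
    }
}
```

Because the flag is written on every call, the same parser instance can switch between the two modes simply by calling the setter again.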


Office file parser

The parser org.apache.tika.parser.microsoft.OfficeParser is used by default for Office files (doc, xls, ppt, ...). It is extended here so that the parse method no longer performs any processing on the file content.

package fr.ejn.tutorial.tika.parser.microsoft;

import java.io.IOException;

import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParser;
import org.apache.tika.parser.microsoft.SummaryExtractor;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains
 * the EscapeContentArg flag and it is set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOfficeParser extends OfficeParser {

	private static final long serialVersionUID = -1575837099583857962L;

	/** {@inheritDoc} */
	@Override
	protected void parse(DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			// Parse summary entries first, to make metadata available early
			new SummaryExtractor(metadata).parseSummaries(root);

			// Parse remaining document entries
			POIFSDocumentType type = POIFSDocumentType.detectType(root);

			if (type != POIFSDocumentType.UNKNOWN) {
				setType(metadata, type.getType());
			}

			// Remove all code for content extraction.
		} else {
			super.parse(root, context, metadata, xhtml);
		}
	}

	/**
	 * Store the content type meta data.
	 * 
	 * @param metadata Metadata instance to complete.
	 * @param type MediaType instance.
	 */
	private void setType(Metadata metadata, MediaType type) {
		metadata.set(Metadata.CONTENT_TYPE, type.toString());
	}

}

In the parse method, the call to the function checkEscapeContent of the class EscapeContentArg selects either the standard code or the extension, which keeps only the property extraction performed by the SummaryExtractor instance. Note that SummaryExtractor has been public since TIKA version 1.5.

The setType method has to be copied because it is private in the extended class.
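The branching described above can be sketched independently of TIKA and POI. In the following example, FullParserSketch and PropertiesOnlyParserSketch are hypothetical classes, the document is reduced to a string, and the "length" property is an invented placeholder for the summary properties.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the standard parser: properties first, then content. */
class FullParserSketch {
    protected void parse(String document, boolean escapeContent, Map<String, String> metadata, StringBuilder content) {
        readProperties(document, metadata);
        content.append(document); // stands in for the costly walk through the content
    }

    protected void readProperties(String document, Map<String, String> metadata) {
        metadata.put("length", Integer.toString(document.length()));
    }
}

/** Hypothetical extension: keeps only the property step when the flag is set. */
class PropertiesOnlyParserSketch extends FullParserSketch {
    @Override
    protected void parse(String document, boolean escapeContent, Map<String, String> metadata, StringBuilder content) {
        if (escapeContent) {
            readProperties(document, metadata);
        } else {
            super.parse(document, escapeContent, metadata, content);
        }
    }
}

class PropertiesOnlyDemo {
    /** Returns "<length property>|<extracted content>" for the sample document. */
    static String resultFor(boolean escapeContent) {
        Map<String, String> metadata = new HashMap<>();
        StringBuilder content = new StringBuilder();
        new PropertiesOnlyParserSketch().parse("hello", escapeContent, metadata, content);
        return metadata.get("length") + "|" + content;
    }

    public static void main(String[] args) {
        System.out.println(resultFor(true));  // 5|
        System.out.println(resultFor(false)); // 5|hello
    }
}
```

In both modes the properties are filled in; only the content step differs.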


Office Open XML file parser

The parser org.apache.tika.parser.microsoft.ooxml.OOXMLParser is used by default for Office Open XML files (docx, xlsx, pptx, ...). It is extended here so that the parse method no longer performs any processing on the file content.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * Office Open XML parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains
 * the EscapeContentArg flag and it is set to true, the content is not parsed.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLParser extends OOXMLParser {

	private static final long serialVersionUID = 5116030503623701915L;
	protected static final Set<MediaType> VISIBLE_UNSUPPORTED_OOXML_TYPES = OOXMLParser.UNSUPPORTED_OOXML_TYPES;

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		if (EscapeContentArg.checkEscapeContent(context)) {
			ContentFilterOOXMLExtractorFactory.parse(stream, handler, metadata, context);
		} else {
			/* Let the super method do the work. */
			super.parse(stream, handler, metadata, context);
		}
	}

}

In the parse method, the call to checkEscapeContent of the class EscapeContentArg selects either the standard code or a new property extractor, the class ContentFilterOOXMLExtractorFactory.

The class ContentFilterOOXMLExtractorFactory is a "copy" of the class OOXMLExtractorFactory from which the processing of the content has been removed.

package fr.ejn.tutorial.tika.parser.microsoft.ooxml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.CloseShieldInputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaMetadataKeys;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractor;
import org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory;
import org.apache.tika.parser.microsoft.ooxml.POIXMLTextExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator;
import org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator;
import org.apache.tika.parser.pkg.ZipContainerDetector;
import org.apache.xmlbeans.XmlException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/**
 * Skip content extraction.
 * 
 * @author Etienne Jouvin
 * 
 */
public class ContentFilterOOXMLExtractorFactory extends OOXMLExtractorFactory {

	/**
	 * Parse the stream to extract properties. This is nearly the same as the function in the parent class, but the extraction of the content is removed.
	 * 
	 * @param stream Stream to parse.
	 * @param baseHandler Content handler.
	 * @param metadata Metadata instance where properties will be stored.
	 * @param context Parsing context.
	 * @throws IOException Exception during reading.
	 * @throws SAXException Exception during XML parsing.
	 * @throws TikaException Other exception fired.
	 */
	public static void parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		Locale locale = context.get(Locale.class, Locale.getDefault());
		ExtractorFactory.setThreadPrefersEventExtractors(true);

		try {
			OOXMLExtractor extractor;
			OPCPackage pkg;

			// Locate or Open the OPCPackage for the file
			TikaInputStream tis = TikaInputStream.cast(stream);
			if (tis != null && tis.getOpenContainer() instanceof OPCPackage) {
				pkg = (OPCPackage) tis.getOpenContainer();
			} else if (tis != null && tis.hasFile()) {
				pkg = OPCPackage.open(tis.getFile().getPath(), PackageAccess.READ);
				tis.setOpenContainer(pkg);
			} else {
				InputStream shield = new CloseShieldInputStream(stream);
				pkg = OPCPackage.open(shield);
			}

			// Get the type, and ensure it's one we handle
			MediaType type = ZipContainerDetector.detectOfficeOpenXML(pkg);
			/* Use the constant from the ContentFilterOOXMLParser class, OOXMLParser.UNSUPPORTED_OOXML_TYPES is protected. */
			if (type == null || /* OOXMLParser.UNSUPPORTED_OOXML_TYPES */ContentFilterOOXMLParser.VISIBLE_UNSUPPORTED_OOXML_TYPES.contains(type)) {
				// Not a supported type, delegate to Empty Parser
				EmptyParser.INSTANCE.parse(stream, baseHandler, metadata, context);
				return;
			}
			metadata.set(Metadata.CONTENT_TYPE, type.toString());

			// Have the appropriate OOXML text extractor picked
			POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg);

			POIXMLDocument document = poiExtractor.getDocument();
			if (poiExtractor instanceof XSSFEventBasedExcelExtractor) {
				extractor = new XSSFExcelExtractorDecorator(context, (XSSFEventBasedExcelExtractor) poiExtractor, locale);
			} else if (document == null) {
				throw new TikaException("Expecting UserModel based POI OOXML extractor with a document, but none found. " + "The extractor returned was a " + poiExtractor);
			} else if (document instanceof XMLSlideShow) {
				extractor = new XSLFPowerPointExtractorDecorator(context, (XSLFPowerPointExtractor) poiExtractor);
			} else if (document instanceof XWPFDocument) {
				extractor = new XWPFWordExtractorDecorator(context, (XWPFWordExtractor) poiExtractor);
			} else {
				extractor = new POIXMLTextExtractorDecorator(context, poiExtractor);
			}

			// Get the bulk of the metadata first, so that it's accessible during
			// parsing if desired by the client (see TIKA-1109)
			extractor.getMetadataExtractor().extract(metadata);

			/* Remove All work on document content, just keep metadata extractor. */
			// Extract the text, along with any in-document metadata
			// extractor.getXHTML(baseHandler, metadata, context);
		} catch (IllegalArgumentException e) {
			if (e.getMessage() != null && e.getMessage().startsWith("No supported documents found")) {
				throw new TikaException("TIKA-418: RuntimeException while getting content" + " for thmx and xps file types", e);
			} else {
				throw new TikaException("Error creating OOXML extractor", e);
			}
		} catch (InvalidFormatException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (OpenXML4JException e) {
			throw new TikaException("Error creating OOXML extractor", e);
		} catch (XmlException e) {
			throw new TikaException("Error creating OOXML extractor", e);

		}
	}

}

Note that, within this study, the property protected, which is derived from the sheets, has no longer been extracted since version 1.5. Before that version, it was easy to extract it alone without walking through the sheet content. In version 1.5, the property is extracted together with the sheet content, which makes customisation much harder. It is feasible but would require copying too much code, so it was decided to leave this property out.

PDF file parser

In version 1.6, an important change relevant to this study was made to the PDF parser: the content is not walked through when the handler argument is null. In the class org.apache.tika.parser.pdf.PDFParser, the function parse was modified as follows.

    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
       
        PDDocument pdfDocument = null;
        TemporaryResources tmp = new TemporaryResources();
        //config from context, or default if not set via context
        PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
        try {
            // PDFBox can process entirely in memory, or can use a temp file
            //  for unpacked / processed resources
            // Decide which to do based on if we're reading from a file or not already
            TikaInputStream tstream = TikaInputStream.cast(stream);
            if (tstream != null && tstream.hasFile()) {
                // File based, take that as a cue to use a temporary file
                RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
                if (localConfig.getUseNonSequentialParser() == true) {
                    pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile);
                } else {
                    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
                }
            } else {
                // Go for the normal, stream based in-memory parsing
                if (localConfig.getUseNonSequentialParser() == true) {
                    pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer()); 
                } else {
                    pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
                }
            }
            
           
            if (pdfDocument.isEncrypted()) {
                String password = null;
                
                // Did they supply a new style Password Provider?
                PasswordProvider passwordProvider = context.get(PasswordProvider.class);
                if (passwordProvider != null) {
                   password = passwordProvider.getPassword(metadata);
                }
                
                // Fall back on the old style metadata if set
                if (password == null && metadata.get(PASSWORD) != null) {
                   password = metadata.get(PASSWORD);
                }
                
                // If no password is given, use an empty string as the default
                if (password == null) {
                   password = "";
                }
               
                try {
                    pdfDocument.decrypt(password);
                } catch (Exception e) {
                    // Ignore
                }
            }
            metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
            extractMetadata(pdfDocument, metadata);
            if (handler != null) {
                PDF2XHTML.process(pdfDocument, handler, context, metadata, localConfig);
            }
            
        } finally {
            if (pdfDocument != null) {
               pdfDocument.close();
            }
            tmp.dispose();
        }
    }

For this article, it is therefore enough for the handler parameter to be null. This change removes a large amount of code that had to be copied in previous versions, and the extension of the parser org.apache.tika.parser.pdf.PDFParser becomes the following.

package fr.ejn.tutorial.tika.parser.pdf;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import fr.ejn.tutorial.tika.parser.EscapeContentArg;

/**
 * PDF parser extension. By default, TIKA parses the content, which performs poorly when only the file properties are wanted. In this extension, if the context contains the
 * EscapeContentArg flag and it is set to true, the content is not parsed.
 *
 * @author Etienne Jouvin
 *
 */
public class ContentFilterPDFParser extends PDFParser {

	private static final long serialVersionUID = -7912065246613445882L;

	/** {@inheritDoc} */
	@Override
	public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
		ContentHandler realHandler = null;
		if (!EscapeContentArg.checkEscapeContent(context)) {
			/* Do not force null as content handler. */
			realHandler = handler;
		}
		super.parse(stream, realHandler, metadata, context);
	}

}

In the parse method, the call to checkEscapeContent of the class EscapeContentArg determines whether to use the original handler argument or to replace it with null.
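The handler substitution can be sketched without TIKA or PDFBox. The example below assumes a hypothetical base parser, NullAwareParserSketch, that skips the content step when its handler is null, as the PDFParser listing above does. HandlerDroppingParserSketch and the "length" property are likewise invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical base parser: always reads properties, walks content only for a non-null handler. */
class NullAwareParserSketch {
    void parse(String document, Consumer<String> handler, Map<String, String> metadata) {
        metadata.put("length", Integer.toString(document.length()));
        if (handler != null) {
            handler.accept(document); // stands in for the content extraction
        }
    }
}

/** Hypothetical extension: replaces the handler with null when content must be skipped. */
class HandlerDroppingParserSketch extends NullAwareParserSketch {
    private final boolean escapeContent;

    HandlerDroppingParserSketch(boolean escapeContent) {
        this.escapeContent = escapeContent;
    }

    @Override
    void parse(String document, Consumer<String> handler, Map<String, String> metadata) {
        Consumer<String> realHandler = escapeContent ? null : handler;
        super.parse(document, realHandler, metadata);
    }
}

class HandlerDroppingDemo {
    /** Returns whatever content reached the caller's handler. */
    static String received(boolean escapeContent) {
        StringBuilder sink = new StringBuilder();
        new HandlerDroppingParserSketch(escapeContent).parse("hello", sink::append, new HashMap<>());
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + received(true) + "]");  // []
        System.out.println("[" + received(false) + "]"); // [hello]
    }
}
```

The extension contains no extraction logic of its own, which is why this approach needs far less copied code than the Office parsers.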


Execution

To make TIKA and these new parsers easier to use, the class TikaExtractorImpl was written.

package fr.ejn.tutorial.metadatas.impl.tika;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.log4j.Logger;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import fr.ejn.tutorial.metadatas.Extractor;
import fr.ejn.tutorial.tika.parser.AutoDetectContentFilterParser;

/**
 * File properties reader with TIKA implementation. By default, TIKA reads the file content, which causes poor performance. A custom parser that skips the content parsing is used instead, but it was
 * necessary to duplicate some code from the TIKA framework.
 * 
 * @author Etienne Jouvin
 * 
 */
public class TikaExtractorImpl implements Extractor {

	private static final Logger LOGGER = Logger.getLogger(TikaExtractorImpl.class);
	private String configResource;

	/**
	 * TIKA configuration used for the extraction.
	 */
	private TikaConfig tikaConfig;

	/**
	 * Default constructor.
	 */
	public TikaExtractorImpl() {
		configResource = null;
		tikaConfig = null;
	}

	/**
	 * Convert metadata instance to a Map.
	 * 
	 * @param metadata Instance to convert.
	 * @return Converted map.
	 */
	private Map<String, String[]> convertToMap(Metadata metadata) {
		/* names() never returns null, so no null check is needed. */
		String[] names = metadata.names();
		Map<String, String[]> properties = new HashMap<String, String[]>(names.length);

		for (String name : names) {
			properties.put(name, metadata.getValues(name));
		}

		return properties;
	}

	/** {@inheritDoc} */
	@Override
	public Map<String, String[]> extract(String filePath) {
		if (null == tikaConfig) {
			return null;
		}

		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = readMetadata(filePath);

		/* Convert and return the metadata instance into a Map of values. */
		return convertToMap(metadata);
	}

	/** {@inheritDoc} */
	@Override
	public void init() {
		InputStream resource;
		String resourceToLoad;

		if (StringUtils.isBlank(configResource)) {
			resource = null;
			resourceToLoad = null;
		} else {
			/* Try to load the configuration resource set. */
			resourceToLoad = configResource;
			resource = TikaExtractorImpl.class.getResourceAsStream(resourceToLoad);
		}

		/* Initialize the Tika configuration. */
		try {
			if (null == resource) {
				/* Use the default TIKAConfig constructor. */
				/* Configuration from META-INF/services will be loaded. */
				/* Then detectors, parsers are sorted according the class name. */
				/* Classes into package org.apache.tika always come first. */
				/* Custom class in different package will be loaded after Tika core. */
				/* To override a default parser, just keep the package as org.apache.tika and make sure the String compare on class name */
				/* return first the new instance. */
				tikaConfig = new TikaConfig();
			} else {
				/* A resource was loaded, use it to build the TikaConfig instance. */
				tikaConfig = new TikaConfig(resource);
			}
		} catch (IOException ioException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_ACCESS_EXCEPTION, resourceToLoad), ioException);
		} catch (SAXException saxException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_SAX_EXCEPTION, resourceToLoad), saxException);
		} catch (TikaException tikaException) {
			LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_CONFIG_TIKA_EXCEPTION, resourceToLoad), tikaException);
		} finally {
			IOUtils.closeQuietly(resource);
		}
	}

	/**
	 * Read file properties.
	 * 
	 * @param filePath File to parse.
	 * @return All metadatas read from the file.
	 */
	private Metadata readMetadata(String filePath) {
		/* Create the metadata handler, where all metadata will be stored during content parsing. */
		Metadata metadata = new Metadata();

		if (null != filePath) {
			InputStream input = null;
			try {
				input = TikaInputStream.get(new File(filePath), metadata);
				/* File loaded, can call the parser. */

				/* Create a parser, with auto format detection. */
				AutoDetectContentFilterParser parser = new AutoDetectContentFilterParser(tikaConfig);
				/* Set flag to escape content reading. */
				parser.setEscapeContent(true);

				/* Parse the content. Use a default handler, the map will be used directly. */
				parser.parse(input, new DefaultHandler(), metadata);
			} catch (IOException ioException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), ioException);
			} catch (SAXException saxException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_FILE_IO, filePath), saxException);
			} catch (TikaException tikaException) {
				LOGGER.error(MessageFormat.format(TikaExtractorImplMsgLog.ERR_LOG_PARSE_TIKA_EXCEPTION, filePath), tikaException);
			} finally {
				IOUtils.closeQuietly(input);
			}
		}

		return metadata;
	}

	/**
	 * Set the resource name.
	 * 
	 * @param configResource Configuration resource name to set.
	 */
	public void setConfigResource(String configResource) {
		this.configResource = configResource;
	}

}

It works in a fairly simple way. The init method must be run first in order to initialise a TikaConfig instance. The properties of a file, identified by its path, can then be extracted with the function extract.
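The function extract returns a Map<String, String[]>, or null when init has not produced a TikaConfig instance. A small helper such as the following could be used to display that result. It is only a sketch: PropertiesPrinterSketch is a hypothetical class, and the sample property names and values are invented for the example rather than taken from a real extraction.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that renders the map returned by an extractor. */
class PropertiesPrinterSketch {

    /** Renders each property as "name = v1, v2", sorted by name; a null map yields an empty string. */
    static String render(Map<String, String[]> properties) {
        if (properties == null) {
            // Mirrors the case where extract returns null because init was not run or failed.
            return "";
        }
        Map<String, String[]> sorted = new TreeMap<>(properties);
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String[]> entry : sorted.entrySet()) {
            out.append(entry.getKey()).append(" = ").append(String.join(", ", entry.getValue())).append('\n');
        }
        return out.toString();
    }

    /** Invented sample data standing in for a real extraction result. */
    static Map<String, String[]> sample() {
        Map<String, String[]> properties = new HashMap<>();
        properties.put("dc:creator", new String[] {"A", "B"});
        properties.put("Content-Type", new String[] {"application/pdf"});
        return properties;
    }

    public static void main(String[] args) {
        System.out.print(render(sample()));
    }
}
```

With the real class, the map passed to render would come from calling init and then extract on a TikaExtractorImpl instance.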

Configurations

The two configuration modes explained on the page [Paramétrage TIKA] are both set up for this analysis. In the sources, the files are located in the test resources directory.

XML configuration

This configuration mode specifies exactly which parsers to use. The drawback is that the parser declarations from the various supplied jars have to be repeated.

<?xml version="1.0" encoding="UTF-8"?>
<properties>
	<!-- There is no specific rule on the root node name. But see link https://issues.apache.org/jira/browse/TIKA-527, where there is an example with node properties -->
	<!--
		Since TIKA 1.0.
		See the constructor TikaConfig(Element element) in org.apache.tika.config.TikaConfig.
		The configuration reader checks each detector node and, if it carries the attribute class,
		tries to instantiate the class named by that attribute.
		All values from the file META-INF/services/org.apache.tika.detect.Detector are reproduced in the following configuration.
		The DefaultDetector is added as well. This detector is the one used by default,
		as is the case with the function TikaConfig.getDefaultConfig().
	-->
	<detectors>
		<detector class="org.apache.tika.parser.microsoft.POIFSContainerDetector" />
		<detector class="org.apache.tika.parser.pkg.ZipContainerDetector" />
<!-- Do not forget to add the default detector, otherwise almost everything is considered as application/octet-stream. -->
		<detector class="org.apache.tika.detect.DefaultDetector" />
<!--  -->
	</detectors>
	<!--
		See the constructor TikaConfig(Element element) in org.apache.tika.config.TikaConfig.
		The configuration reader checks the node mimeTypeRepository and, if it carries the attribute resource,
		MimeTypesFactory is loaded from this attribute value, which should be a path.

		If the attribute is not found, the mime types in TikaConfig are set by the function MimeTypes.getDefaultMimeTypes(),
		as is the case when the function TikaConfig.getDefaultConfig() is used.
	-->
	<mimeTypeRepository />
	<!--
		By default, the parser configuration is read from the file META-INF/services/org.apache.tika.parser.Parser,
		loaded in the DefaultParser constructor.

		Just copy all parsers from the original file into this XML and set the custom parsers.
	-->
	<parsers>
		<parser class="org.apache.tika.parser.asm.ClassParser" />
		<parser class="org.apache.tika.parser.audio.AudioParser" />
		<parser class="org.apache.tika.parser.audio.MidiParser" />
		<parser class="org.apache.tika.parser.crypto.Pkcs7Parser" />
		<parser class="org.apache.tika.parser.dwg.DWGParser" />
		<parser class="org.apache.tika.parser.epub.EpubParser" />
		<parser class="org.apache.tika.parser.executable.ExecutableParser" />
		<parser class="org.apache.tika.parser.feed.FeedParser" />
		<parser class="org.apache.tika.parser.font.AdobeFontMetricParser" />
		<parser class="org.apache.tika.parser.font.TrueTypeParser" />
		<parser class="org.apache.tika.parser.html.HtmlParser" />
		<parser class="org.apache.tika.parser.image.ImageParser" />
		<parser class="org.apache.tika.parser.image.PSDParser" />
		<parser class="org.apache.tika.parser.image.TiffParser" />
		<parser class="org.apache.tika.parser.iptc.IptcAnpaParser" />
		<parser class="org.apache.tika.parser.iwork.IWorkPackageParser" />
		<parser class="org.apache.tika.parser.jpeg.JpegParser" />
		<parser class="org.apache.tika.parser.mail.RFC822Parser" />
		<parser class="org.apache.tika.parser.mbox.MboxParser" />
		<parser class="org.apache.tika.parser.mbox.OutlookPSTParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.OfficeParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ContentFilterOfficeParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.microsoft.TNEFParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.microsoft.ooxml.ContentFilterOOXMLParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.mp3.Mp3Parser" />
		<parser class="org.apache.tika.parser.mp4.MP4Parser" />
		<parser class="org.apache.tika.parser.hdf.HDFParser" />
		<parser class="org.apache.tika.parser.netcdf.NetCDFParser" />
		<parser class="org.apache.tika.parser.odf.OpenDocumentParser" />
<!-- Start parser update -->
<!--
		<parser class="org.apache.tika.parser.pdf.PDFParser" />
-->
		<parser class="fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser" />
<!-- End parser update -->
		<parser class="org.apache.tika.parser.pkg.CompressorParser" />
		<parser class="org.apache.tika.parser.pkg.PackageParser" />
		<parser class="org.apache.tika.parser.rtf.RTFParser" />
		<parser class="org.apache.tika.parser.txt.TXTParser" />
		<parser class="org.apache.tika.parser.video.FLVParser" />
		<parser class="org.apache.tika.parser.xml.DcXMLParser" />
		<parser class="org.apache.tika.parser.xml.FictionBookParser" />
		<parser class="org.apache.tika.parser.chm.ChmParser" />
		<parser class="org.apache.tika.parser.code.SourceCodeParser" />
		<parser class="org.apache.tika.parser.mat.MatParser" />
		<!-- Copy parsers from jar vorbis-java-tika-0.1.jar -->
		<parser class="org.gagravarr.tika.FlacParser" />
		<parser class="org.gagravarr.tika.OggParser" />
		<parser class="org.gagravarr.tika.OpusParser" />
		<parser class="org.gagravarr.tika.SpeexParser" />
		<parser class="org.gagravarr.tika.VorbisParser" />		
	</parsers>
</properties>

In this configuration, the standard Office, OOXML and PDF parsers are commented out and replaced by the new ContentFilter parsers.
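
To check which parsers such a file actually activates, the declarations can be listed with the JDK's DOM API. The following is only a sketch: the class name ParserConfigLister and the method parserClasses are hypothetical helpers, not part of Tika or of the study's source code. It relies on the fact that parser elements wrapped in XML comments are not returned as elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/* Hypothetical helper: lists the parser classes declared in a Tika XML configuration. */
final class ParserConfigLister {

    /* Returns the "class" attribute of every active parser element; commented ones are skipped. */
    static List<String> parserClasses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = document.getElementsByTagName("parser");
            List<String> classes = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid configuration XML", e);
        }
    }

    public static void main(String[] args) {
        /* Reduced sample of the configuration: the standard PDF parser is commented out. */
        String xml = "<properties><parsers>"
                + "<!-- <parser class=\"org.apache.tika.parser.pdf.PDFParser\" /> -->"
                + "<parser class=\"fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser\" />"
                + "<parser class=\"org.apache.tika.parser.rtf.RTFParser\" />"
                + "</parsers></properties>";
        for (String name : parserClasses(xml)) {
            System.out.println(name);
        }
    }
}
```

Run against the full file, the list should contain the three fr.ejn.tutorial.tika.parser classes and none of the standard parsers they replace.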

INI configuration

A file named org.apache.tika.parser.Parser is provided in the META-INF/services directory in order to register the new parsers.

#  Licensed to the Apache Software Foundation (ASF) under one or more
#  contributor license agreements.  See the NOTICE file distributed with
#  this work for additional information regarding copyright ownership.
#  The ASF licenses this file to You under the Apache License, Version 2.0
#  (the "License"); you may not use this file except in compliance with
#  the License.  You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

# Override the Office parser, default one is : 
# org.apache.tika.parser.microsoft.OfficeParser
fr.ejn.tutorial.tika.parser.microsoft.ContentFilterOfficeParser

# Override the OOXML document parser, default one is : 
# org.apache.tika.parser.microsoft.ooxml.OOXMLParser
fr.ejn.tutorial.tika.parser.microsoft.ooxml.ContentFilterOOXMLParser

# Override the PDF parser, default one is : 
# org.apache.tika.parser.pdf.PDFParser
fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser
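
The file above follows the usual layout of a Java service provider file: one class name per line, with anything after a # treated as a comment. As a rough illustration of how such a file is interpreted, and assuming Tika applies these same conventions, the sketch below extracts the registered class names. ServiceFileReader and providerNames are hypothetical names introduced for this example only.

```java
import java.util.ArrayList;
import java.util.List;

/* Hypothetical helper: extracts provider class names from a META-INF/services style file. */
final class ServiceFileReader {

    /* Drops comments (from # to end of line) and blank lines, keeps one class name per line. */
    static List<String> providerNames(String content) {
        List<String> names = new ArrayList<String>();
        for (String line : content.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "# Override the PDF parser, default one is : \n"
                + "# org.apache.tika.parser.pdf.PDFParser\n"
                + "fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser\n"
                + "\n";
        for (String name : providerNames(content)) {
            System.out.println(name);
        }
    }
}
```

With the file shown above, only the three ContentFilter parser names remain, since the standard parser names appear in comments.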


Extracted variables

This section lists the variables extracted by the parsers set up in this article. Note that, starting with this version, the classes of the parsers used are exposed in the extracted properties, under X-Parsed-By. Be careful, however: these values can depend on how the configuration is loaded. When an XML file is used, the parser org.apache.tika.parser.CompositeParser is reported, whereas loading from the extension in META-INF results in org.apache.tika.parser.DefaultParser.
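
The dependency on the loading mode can be sketched as a small check on the X-Parsed-By values. This is only an illustration of the observation above, not Tika code: ParsedByInspector and loadingMode are hypothetical names, and the list of values is assumed to have been read from the extracted metadata beforehand.

```java
import java.util.Arrays;
import java.util.List;

/* Hypothetical helper: guesses how the configuration was loaded from X-Parsed-By values. */
final class ParsedByInspector {

    static final String COMPOSITE_PARSER = "org.apache.tika.parser.CompositeParser";
    static final String DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser";

    /* Maps the top-level parser reported in X-Parsed-By to the loading mode described above. */
    static String loadingMode(List<String> parsedBy) {
        if (parsedBy.contains(DEFAULT_PARSER)) {
            return "META-INF/services";
        }
        if (parsedBy.contains(COMPOSITE_PARSER)) {
            return "XML configuration";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        /* Sample values as they might appear for a PDF parsed with the XML configuration. */
        List<String> parsedBy = Arrays.asList(
                COMPOSITE_PARSER,
                "fr.ejn.tutorial.tika.parser.pdf.ContentFilterPDFParser");
        System.out.println(loadingMode(parsedBy));
    }
}
```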

Office files

doc format
Application-Name Author Character Count Comments Company
Content-Length Content-Type Creation-Date Edit-Time Keywords
Last-Author Last-Modified Last-Save-Date Page-Count Revision-Number
Template Word-Count comment cp:revision cp:subject
creator custom:MyCustomDate custom:MyCustomString date dc:creator
dc:subject dc:title dcterms:created dcterms:modified extended-properties:Application
extended-properties:Company extended-properties:Template meta:author meta:character-count meta:creation-date
meta:keyword meta:last-author meta:page-count meta:save-date meta:word-count
modified resourceName subject title w:comments
xmpTPg:NPages X-Parsed-By
ppt format
Author Content-Length Content-Type Creation-Date Edit-Time
Last-Author Last-Modified Last-Save-Date Revision-Number Slide-Count
Word-Count cp:revision creator custom:MyCustomDate custom:MyCustomString
custom:myCustomBoolean custom:myCustomNumber custom:myCustomSecondDate date dc:creator
dc:title dcterms:created dcterms:modified meta:author meta:creation-date
meta:last-author meta:save-date meta:slide-count meta:word-count modified
resourceName title xmpTPg:NPages X-Parsed-By
xls format
Application-Name Author Content-Length Content-Type Creation-Date
Last-Author Last-Modified Last-Save-Date creator custom:MyCustomDate
custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber custom:myCustomSecondDate date
dc:creator dcterms:created dcterms:modified extended-properties:Application meta:author
meta:creation-date meta:last-author meta:save-date modified resourceName
X-Parsed-By


Office Open XML files

docx format
Application-Name Application-Version Author Character Count Character-Count-With-Spaces
Content-Length Content-Type Creation-Date Keywords Last-Author
Last-Modified Last-Save-Date Line-Count Page-Count Paragraph-Count
Revision-Number Template Total-Time Word-Count cp:revision
cp:subject creator custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean
custom:myCustomNumber custom:myCustomSecondDate date dc:creator dc:description
dc:publisher dc:subject dc:title dcterms:created dcterms:modified
description extended-properties:AppVersion extended-properties:Application extended-properties:Company extended-properties:Template
extended-properties:TotalTime meta:author meta:character-count meta:character-count-with-spaces meta:creation-date
meta:keyword meta:last-author meta:line-count meta:page-count meta:paragraph-count
meta:save-date meta:word-count modified publisher resourceName
subject title xmpTPg:NPages X-Parsed-By
pptx format
Application-Version Author Content-Length Content-Type Creation-Date
Last-Author Last-Modified Last-Save-Date Paragraph-Count Presentation-Format
Revision-Number Slide-Count Total-Time Word-Count cp:revision
creator custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean custom:myCustomNumber
custom:myCustomSecondDate date dc:creator dc:title dcterms:created
dcterms:modified extended-properties:AppVersion extended-properties:PresentationFormat extended-properties:TotalTime meta:author
meta:creation-date meta:last-author meta:paragraph-count meta:save-date meta:slide-count
meta:word-count modified resourceName title xmpTPg:NPages
X-Parsed-By
xlsx format
Application-Name Application-Version Content-Length Content-Type Creation-Date
Last-Modified Last-Save-Date custom:MyCustomDate custom:MyCustomString custom:myCustomBoolean
custom:myCustomNumber custom:myCustomSecondDate date dcterms:created dcterms:modified
extended-properties:AppVersion extended-properties:Application meta:creation-date meta:save-date modified
resourceName X-Parsed-By


PDF files

Author Content-Length Content-Type Creation-Date created
creator dc:creator dc:format dc:title dcterms:created
meta:author meta:creation-date pdf:encrypted pdf:PDFVersion producer
resourceName title xmp:CreatorTool xmpTPg:NPages X-Parsed-By