Read Adobe XMP / XML in PHP

I’ve found a few snippets of PHP code to read XMP / XML meta data from an image file, but none that I would call very robust or efficient. I ended up writing my own for Underwater Focus, and I’m quite pleased with the result. In fact, after adding support for a shortcode, I packaged it as an Adobe XMP plugin for WordPress.

The first part of using XMP meta data is reading the XMP information from the image. I’ve seen a few solutions that read the whole file into memory, and others that read in just a small part. If the XMP / XML contains a lot of information, that small part may be incomplete. And each time the XMP meta data is required, the original (and sometimes quite large) image file must be re-read. Since the XMP doesn’t change unless the original image is updated, there’s no reason to keep re-reading the same large file time and time again.

The method I wrote reads a 64k chunk at a time, and keeps reading additional chunks until it finds the `</x:xmpmeta>` tag, or reaches the hard-coded limit of 500k (512,000 bytes). The extracted XMP information is then saved on disk, and if the function is called again for the same file, the cached information is used instead (provided the cache file’s modification time is newer than the original image’s).

function __construct() {
        $this->use_cache = true;
        $this->cache_dir = dirname ( __FILE__ ) . '/cache/';
        if ( ! is_dir( $this->cache_dir ) ) mkdir( $this->cache_dir );
}

function get_xmp_raw( $filepath ) {

        $max_size = 512000;     // maximum size read
        $chunk_size = 65536;    // read 64k at a time
        $start_tag = '<x:xmpmeta';
        $end_tag = '</x:xmpmeta>';
        $cache_file = $this->cache_dir . md5( $filepath ) . '.xml';
        $xmp_raw = null;
        $chunk = '';            // accumulates the data read so far

        if ( $this->use_cache == true && file_exists( $cache_file ) && 
                filemtime( $cache_file ) > filemtime( $filepath ) && 
                $cache_fh = fopen( $cache_file, 'rb' ) ) {

                $xmp_raw = fread( $cache_fh, filesize( $cache_file ) );
                fclose( $cache_fh );

        } elseif ( $file_fh = fopen( $filepath, 'rb' ) ) {

                $file_size = filesize( $filepath );
                while ( ( $file_pos = ftell( $file_fh ) ) < $file_size  && $file_pos < $max_size ) {
                        $chunk .= fread( $file_fh, $chunk_size );
                        if ( ( $end_pos = strpos( $chunk, $end_tag ) ) !== false ) {
                                if ( ( $start_pos = strpos( $chunk, $start_tag ) ) !== false ) {

                                        $xmp_raw = substr( $chunk, $start_pos, 
                                                $end_pos - $start_pos + strlen( $end_tag ) );

                                        if ( $this->use_cache == true && $cache_fh = fopen( $cache_file, 'wb' ) ) {

                                                fwrite( $cache_fh, $xmp_raw );
                                                fclose( $cache_fh );
                                        }
                                }
                                break;  // stop reading after finding the xmp data
                        }
                }
                fclose( $file_fh );
        }
        return $xmp_raw;
}
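
To see the chunked read in isolation, here is a minimal standalone sketch of the same idea, without the class or the cache. The function name `xmp_scan_chunks()` and the fake "image" (filler bytes around a made-up XMP packet) are assumptions for illustration only; they are not part of the plugin.

```php
<?php
// Hypothetical standalone helper (not part of the plugin): scan a file in
// chunks and return the XMP packet, or null if none is found within $max_size.
function xmp_scan_chunks( $filepath, $chunk_size = 65536, $max_size = 512000 ) {
    $start_tag = '<x:xmpmeta';
    $end_tag   = '</x:xmpmeta>';
    $buffer    = '';    // accumulates every chunk read so far
    $xmp_raw   = null;

    $fh = fopen( $filepath, 'rb' );
    if ( ! $fh ) {
        return null;
    }
    while ( ! feof( $fh ) && strlen( $buffer ) < $max_size ) {
        $data = fread( $fh, $chunk_size );
        if ( $data === false || $data === '' ) {
            break;
        }
        $buffer .= $data;

        // Searching the whole buffer means a tag split across two reads is still found.
        $end_pos = strpos( $buffer, $end_tag );
        if ( $end_pos !== false ) {
            $start_pos = strpos( $buffer, $start_tag );
            if ( $start_pos !== false && $start_pos < $end_pos ) {
                $xmp_raw = substr( $buffer, $start_pos,
                    $end_pos - $start_pos + strlen( $end_tag ) );
            }
            break;      // stop reading after finding the end tag
        }
    }
    fclose( $fh );
    return $xmp_raw;
}

// Demo: a fake "image" made of filler bytes around a made-up XMP packet.
$packet = '<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF></rdf:RDF></x:xmpmeta>';
$tmp    = tempnam( sys_get_temp_dir(), 'xmp' );
file_put_contents( $tmp, str_repeat( "\0", 100 ) . $packet . str_repeat( "\0", 100 ) );

// A tiny chunk size forces the packet to span several reads.
$found = xmp_scan_chunks( $tmp, 16 );
unlink( $tmp );
var_dump( $found === $packet );
```

Because each chunk is appended to the buffer before searching, the chunk size only affects how much is read per call, not whether a packet that straddles a chunk boundary is found.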

The second part of using XMP meta data is turning it into an `array()`. You could parse the extracted XML, but Adobe uses a variety of namespaces, which makes it difficult and tedious to code, especially since we’re probably just interested in a few values anyway.
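
For comparison, here is a rough sketch of what the XML-parsing route can look like with PHP’s DOM extension. The XMP fragment is made up for illustration, and this is only one possible way to do it: every namespace prefix used in a query has to be registered first, and a real packet would need more of them (`xmp`, `aux`, `Iptc4xmpCore`, `lr` and so on).

```php
<?php
// Made-up XMP fragment, for illustration only.
$xmp_raw = '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    . '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    . '<rdf:Description rdf:about=""'
    . ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    . ' xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"'
    . ' photoshop:City="Cozumel">'
    . '<dc:subject><rdf:Bag><rdf:li>reef</rdf:li><rdf:li>turtle</rdf:li></rdf:Bag></dc:subject>'
    . '</rdf:Description></rdf:RDF></x:xmpmeta>';

$doc = new DOMDocument();
$doc->loadXML( $xmp_raw );

// Each prefix used in an XPath expression must be registered with its URI.
$xpath = new DOMXPath( $doc );
$xpath->registerNamespace( 'rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );
$xpath->registerNamespace( 'dc', 'http://purl.org/dc/elements/1.1/' );
$xpath->registerNamespace( 'photoshop', 'http://ns.adobe.com/photoshop/1.0/' );

// A value stored as an attribute of rdf:Description.
$city = $xpath->evaluate( 'string(//rdf:Description/@photoshop:City)' );

// A value stored as a list of rdf:li elements.
$keywords = array();
foreach ( $xpath->query( '//dc:subject/rdf:Bag/rdf:li' ) as $li ) {
    $keywords[] = $li->textContent;
}

print_r( array( 'City' => $city, 'Keywords' => $keywords ) );
```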

I chose an easier method by using regular expressions to parse the XML instead. I use an associative array to describe the regular expressions, and then check the matched values for `<rdf:li>` XML tags to create second dimension arrays. The `<lr:hierarchicalSubject>` XML tag needs additional parsing to create a third dimension array from its pipe-delimited values.

function get_xmp_array( &$xmp_raw ) {
        $xmp_arr = array();
        foreach ( array(
                'Creator Email' => '<Iptc4xmpCore:CreatorContactInfo[^>]+?CiEmailWork="([^"]*)"',
                'Owner Name'    => '<rdf:Description[^>]+?aux:OwnerName="([^"]*)"',
                'Creation Date' => '<rdf:Description[^>]+?xmp:CreateDate="([^"]*)"',
                'Modification Date'     => '<rdf:Description[^>]+?xmp:ModifyDate="([^"]*)"',
                'Label'         => '<rdf:Description[^>]+?xmp:Label="([^"]*)"',
                'Credit'        => '<rdf:Description[^>]+?photoshop:Credit="([^"]*)"',
                'Source'        => '<rdf:Description[^>]+?photoshop:Source="([^"]*)"',
                'Headline'      => '<rdf:Description[^>]+?photoshop:Headline="([^"]*)"',
                'City'          => '<rdf:Description[^>]+?photoshop:City="([^"]*)"',
                'State'         => '<rdf:Description[^>]+?photoshop:State="([^"]*)"',
                'Country'       => '<rdf:Description[^>]+?photoshop:Country="([^"]*)"',
                'Country Code'  => '<rdf:Description[^>]+?Iptc4xmpCore:CountryCode="([^"]*)"',
                'Location'      => '<rdf:Description[^>]+?Iptc4xmpCore:Location="([^"]*)"',
                'Title'         => '<dc:title>\s*<rdf:Alt>\s*(.*?)\s*<\/rdf:Alt>\s*<\/dc:title>',
                'Description'   => '<dc:description>\s*<rdf:Alt>\s*(.*?)\s*<\/rdf:Alt>\s*<\/dc:description>',
                'Creator'       => '<dc:creator>\s*<rdf:Seq>\s*(.*?)\s*<\/rdf:Seq>\s*<\/dc:creator>',
                'Keywords'      => '<dc:subject>\s*<rdf:Bag>\s*(.*?)\s*<\/rdf:Bag>\s*<\/dc:subject>',
                'Hierarchical Keywords' => '<lr:hierarchicalSubject>\s*<rdf:Bag>\s*(.*?)\s*<\/rdf:Bag>\s*<\/lr:hierarchicalSubject>'
        ) as $key => $regex ) {

                // get a single text string
                $xmp_arr[$key] = preg_match( "/$regex/is", $xmp_raw, $match ) ? $match[1] : '';

                // if string contains a list, then re-assign the variable as an array with the list elements
                $xmp_arr[$key] = preg_match_all( "/<rdf:li[^>]*>([^>]*)<\/rdf:li>/is", $xmp_arr[$key], $match ) ? $match[1] : $xmp_arr[$key];

                // hierarchical keywords need to be split into a third dimension
                if ( $key == 'Hierarchical Keywords' && is_array( $xmp_arr[$key] ) ) {
                        foreach ( $xmp_arr[$key] as $li => $val ) $xmp_arr[$key][$li] = explode( '|', $val );
                        unset ( $li, $val );
                }
        }
        return $xmp_arr;
}
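
To make the shape of the returned array concrete, here is a trimmed-down sketch of the same regular-expression approach, run against a made-up XMP fragment. The function name `xmp_to_array_sketch()` and the sample values are hypothetical; only three of the patterns from the listing above are included.

```php
<?php
// Trimmed-down, hypothetical version of the approach above: three patterns only.
function xmp_to_array_sketch( $xmp_raw ) {
    $patterns = array(
        'City'     => '<rdf:Description[^>]+?photoshop:City="([^"]*)"',
        'Keywords' => '<dc:subject>\s*<rdf:Bag>\s*(.*?)\s*<\/rdf:Bag>\s*<\/dc:subject>',
        'Hierarchical Keywords' => '<lr:hierarchicalSubject>\s*<rdf:Bag>\s*(.*?)\s*<\/rdf:Bag>\s*<\/lr:hierarchicalSubject>',
    );
    $xmp_arr = array();
    foreach ( $patterns as $key => $regex ) {
        // first dimension: a single text string (or '' when nothing matches)
        $value = preg_match( "/$regex/is", $xmp_raw, $match ) ? $match[1] : '';

        // second dimension: a list of <rdf:li> elements becomes an array
        if ( preg_match_all( '/<rdf:li[^>]*>([^>]*)<\/rdf:li>/is', $value, $items ) ) {
            $value = $items[1];
        }

        // third dimension: split each hierarchical keyword on the pipe character
        if ( $key === 'Hierarchical Keywords' && is_array( $value ) ) {
            foreach ( $value as $i => $path ) {
                $value[$i] = explode( '|', $path );
            }
        }
        $xmp_arr[$key] = $value;
    }
    return $xmp_arr;
}

// Made-up sample data, for illustration only.
$sample = '<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF>'
    . '<rdf:Description rdf:about="" photoshop:City="Cozumel">'
    . '<dc:subject><rdf:Bag><rdf:li>reef</rdf:li><rdf:li>turtle</rdf:li></rdf:Bag></dc:subject>'
    . '<lr:hierarchicalSubject><rdf:Bag><rdf:li>animal|reptile|turtle</rdf:li></rdf:Bag></lr:hierarchicalSubject>'
    . '</rdf:Description></rdf:RDF></x:xmpmeta>';

$result = xmp_to_array_sketch( $sample );
print_r( $result );
```

With this sample, `City` comes back as a plain string, `Keywords` as a flat array of two strings, and `Hierarchical Keywords` as an array holding one array of three path segments.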

In the case of WordPress, a usage example might look like this (with `$pid` being the attachment’s post ID).

global $adobeXMP;
$xmp_raw = $adobeXMP->get_xmp_raw( get_attached_file( $pid ) );    // the by-reference argument below needs a variable
$xml = $adobeXMP->get_xmp_array( $xmp_raw );
