Parsing keywords. This piece of code may come in handy

flyingdutchie · November 18, 2007

Hi everyone,

I wrote, for my SmugFig API, this piece of code that transforms a string of keywords into an ArrayList of these keywords. It is a snippet of Java code.

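import java.util.ArrayList;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SomeClass {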
...
...
    private static final String quotePattern    = "(\"([^\"]+)\"[\\s,;]*)";
    private static final String commaPattern    = "(([^,;]+)\\s*[,;]?)";
    private static final String spacePattern    = "(([^,;\\s]+)\\s*)";
 
    public static ArrayList<String> toKeywords(String keywords) {
        return quoteParse(keywords);
    }
 
    private static ArrayList<String> quoteParse(String keywordsString) {
        if (keywordsString == null)
            return null;
        
        if (keywordsString.length() == 0)
            return new ArrayList<String>();
        
        final ArrayList<String> retValue = new ArrayList<String>();

        Pattern pat     = Pattern.compile(quotePattern);
        Matcher matcher = pat.matcher(keywordsString);
        
        Set<String> keywords = new TreeSet<String>();
        while (matcher.find()) {
            String matchResult = matcher.group(2);
            if (matchResult.length() > 0)
                keywords.add(matchResult);
        }

        final String remainingNonQuotedWords = matcher.replaceAll("");
        
        if (remainingNonQuotedWords.length() > 0) {
            boolean isSpaceDelimited = 
                    remainingNonQuotedWords.indexOf(',')<0 && 
                    remainingNonQuotedWords.indexOf(';')<0;

            if (isSpaceDelimited) {
                pat     = Pattern.compile(spacePattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            }
            else {
                pat     = Pattern.compile(commaPattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            }
        }
        
        retValue.addAll(keywords);
        return retValue;
    }
...
...
}

It can parse:

Keywords separated by spaces (single-word keywords only), e.g.
    wedding anderson ceremony
Keywords separated by commas or semicolons (single- or multi-word keywords), e.g.
    boston, red sox; world championship
Quoted keywords separated by spaces, commas or semicolons, e.g.
    "boston" "red sox"; "world championship"
Or any combination thereof:
    "boston", red sox, world championship "parade"; hello

It took me a while to figure out how Smugmug parses keywords. The code above comes close to it, I think. It is not perfect, and I have tested it only a little bit. Let me know if it works for you. :D
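
For anyone who wants to try the delimiter rules in isolation, here is a minimal, self-contained sketch of them. It is not the SmugFig code and not SmugMug's actual parser: the class and method names (KeywordSplitSketch, split) are made up for illustration, and it assumes that the delimiter following a closing quote should stay in place so that neighbouring unquoted keywords remain separated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeywordSplitSketch {
    // One quoted keyword; group 1 is the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"");

    public static List<String> split(String input) {
        Set<String> keywords = new TreeSet<>();   // sorted, no duplicates
        if (input == null) {
            return new ArrayList<>(keywords);
        }

        // Step 1: collect everything between double quotes.
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            addIfNotBlank(keywords, m.group(1));
        }

        // Step 2: blank out the quoted parts, keeping the delimiters around them.
        String rest = m.replaceAll(" ");

        // Step 3: if a comma or semicolon is present, split on those (multi-word
        // keywords allowed); otherwise every whitespace-separated word is a keyword.
        boolean hasListDelimiter = rest.indexOf(',') >= 0 || rest.indexOf(';') >= 0;
        String delimiter = hasListDelimiter ? "[,;]" : "\\s+";
        for (String piece : rest.split(delimiter)) {
            addIfNotBlank(keywords, piece);
        }
        return new ArrayList<>(keywords);
    }

    private static void addIfNotBlank(Set<String> target, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            target.add(trimmed);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("\"boston\", red sox, world championship \"parade\"; hello"));
    }
}
```

Running main prints [boston, hello, parade, red sox, world championship]. The result is sorted because of the TreeSet, the same choice the original snippet makes.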

(Updated it with new code to also handle keywords separated by semicolons)

darryl · November 19, 2007

Hey flyingdutchie... funny, I wrote a parser too. Gotta watch out for semi-colons too:

sub parsekeys {

    my $oldkeywords = '' ;
    my (@allkeywords, @quotedkeywords, @splitkeywords) ;

    $oldkeywords = shift ;

# Decode the HTML entities
    $oldkeywords =~ s|\\&quot;|&quot;|g ;
    $oldkeywords = decode_entities($oldkeywords) ;

# Let's pluck out the quoted strings ;

    while ($oldkeywords =~ m/"(.+?)"/) {
        push (@quotedkeywords,$1) ;
        $oldkeywords =~ s/".+?"// ;
    }

    $oldkeywords =~ s/^\s*// ;
    $oldkeywords =~ s/\s*$// ;
    $oldkeywords =~ s/\s\s+/ /g ;
    $oldkeywords =~ s/;\s+/;/g ;

    if ($oldkeywords =~ /[;,]/) {
        @splitkeywords = split (/[;,]\s*/, $oldkeywords) ;
    } else {
        @splitkeywords = split (/\s/, $oldkeywords) ;
    }

    @allkeywords = (@quotedkeywords, @splitkeywords) ;

    @allkeywords = sort @allkeywords ;
    return @allkeywords ;
}

flyingdutchie · November 19, 2007

darryl wrote:

Hey flyingdutchie... funny, I wrote a parser too. Gotta watch out for semi-colons too:

Aha, those as well! I'll change my code to handle these as well.
Thanks for your code snippet!

Quick question: What does this part of your code do?

# Decode the HTML entities
    $oldkeywords =~ s|\\&quot;|&quot;|g ;
    $oldkeywords = decode_entities($oldkeywords) ;

flyingdutchie · November 19, 2007

How does Smugmug actually handle keyword-strings?
I hope Smugmug sees this message:

Hello Smugmug
Instead of us doing guesswork: how are keywords parsed by Smugmug's system? Do you have any code, or a description of how that code works?

Thanks!

-- Anton.

flyingdutchie · November 20, 2007

Updated code original message
I updated the code in the original message to handle keywords that are separated by semicolons as well.

darryl · November 27, 2007

flyingdutchie wrote:
Aha, those as well! I'll change my code to handle these as well.
Thanks for your code snippet!

Quick question: What does this part of your code do?
# Decode the HTML entities
    $oldkeywords =~ s|\\&quot;|&quot;|g ;
    $oldkeywords = decode_entities($oldkeywords) ;

Hi, I thought I wrote back on this.

I discovered that for some reason, SmugMug converts " marks into \&quot;. Additionally, it converts things like & into &amp;, etc.

I didn't want to deal with entities at all, and it turns out that if you submit them back through the API, SmugMug converts them for you. So I end up using the Perl decode_entities function (from the HTML::Entities module) to convert all of those &amp;s and &quot;s into & and ".
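
For the Java side of this thread, the same clean-up could look roughly like the sketch below. It is only an illustration under a few assumptions: the escaped quote arrives as \&quot; (which is what the first regexp in the Perl snippet strips), only a handful of common entities matter, and the class and method names (KeywordEntitySketch, decode) are invented here. The Java standard library has no counterpart to decode_entities, so the entities are replaced by hand.

```java
public class KeywordEntitySketch {

    /**
     * Undo the escaping seen in keywords read back from the API:
     * \&quot; and &quot; become ", &amp; becomes &, and so on.
     * Only a few common entities are handled in this sketch.
     */
    public static String decode(String raw) {
        if (raw == null) {
            return null;
        }
        return raw
                .replace("\\&quot;", "&quot;")  // drop the stray backslash first
                .replace("&quot;", "\"")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");         // last, so "&amp;quot;" is decoded only once
    }

    public static void main(String[] args) {
        System.out.println(decode("\\&quot;red sox\\&quot;, Tom &amp; Jerry"));
    }
}
```

Running main prints "red sox", Tom & Jerry (with the double quotes around red sox). Decoding &amp; last is deliberate: doing it first would turn &amp;quot; into &quot; and then into a quote, which decodes the text twice.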

FormerLurker · November 30, 2007

Could this be tweaked?
(from a previous thread without resolution)

I suspect this will probably require JS (if it's even possible) and, since I haven't done anything with JS other than cut and paste, I'm hoping one of the JS experts might point me in the right direction.

I have two distinct headers for different sections of my SM site: the default header for everything but one specific category, and a different header for the one category.

I am playing around with customizing my search/keyword pages and am trying to figure out a way to keep the same structure. That is, if a certain keyword is searched for, the category-specific header is used instead of the default header.

I am constructing my search box in php so it will be no problem to add a "hidden" keyword that the user won't have to type, so really all I think I need is to see if there is a way I can use JS to "catch" the keywords and switch out the header if the category-specific keyword is part of the search string (and therefore part of the URL).

Thoughts? Can this be done? Thanks in advance!

darryl · December 10, 2007

darryl wrote:

Hi, I thought I wrote back on this.

I discovered that for some reason, SmugMug converts " marks into \&quot;. Additionally, it converts things like & into &amp;, etc.

I didn't want to deal with entities at all, and it turns out that if you submit them back through the API, SmugMug converts them for you. So I end up using the Perl decode_entities function (from the HTML::Entities module) to convert all of those &amp;s and &quot;s into & and ".

Oh, the reason for the first regexp is that SmugMug's API seems broken (or at least inconsistent) when it comes to double quotes. If you use the web UI to enter a keyword with quotes, like "this word", then if you use the API to read the keywords back, you'll see &quot;this word&quot;

That makes sense.

However, if you use the API to set a keyword like "that word", when you read the keyword back, it'll be \&quot;thatword\&quot;

What's odd is that other characters like & simply get turned into &amp; without any leading backslash.

I actually started a thread on this (post 658237) a while back, but forgot about it. It's still annoying though. (/me waves at Dev)

mouellette · December 10, 2007

Here is my code for fixing keywords in Ruby

def self.fixKeywords(keywords)
	if keywords.nil?
		return ""
	end
	newKeywords = keywords.collect do |key|
		# remove any quotes or punctuation
		newkey = key.gsub(/["'.:?~`!\@#\$%\^&\*\(\)<>\?,\/\-\+\=]/, '')
		# we need to make sure that keywords are restricted to 40 characters.
		if newkey.size > 40
			newkey = newkey[0..39]
		end
		if key =~ /[\s\d]/
			newkey = '"' + newkey + '"'
		end
		newkey
	end
	returnValue = newKeywords.join(", ")
	return returnValue
end

Then when I read the keywords back in I turn them back into an array with:

@keywords = info.attributes['Keywords'].gsub(/[\\"]+|&quot;/,'').split(/\s*,\s*/)

NOTE: I found you also need to escape quotes in captions with their entity equivalents and then convert them back when reading them in.

LIMITS: I also found the following limits to keywords. Only the first 15 keywords will be displayed and used in searching. A keyword can only have up to 40 characters.
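
In Java, applying those two limits before sending keywords could look something like the sketch below. The numbers 15 and 40 are simply the limits reported above, not something checked against SmugMug documentation, and the names (KeywordLimitsSketch, applyLimits) are hypothetical. Dropping everything after the 15th keyword is a choice made for this sketch; whether SmugMug stores the extra keywords is not known from this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordLimitsSketch {
    // Limits as reported in this thread; treat them as assumptions.
    static final int MAX_KEYWORDS = 15;
    static final int MAX_LENGTH = 40;

    /** Keeps at most the first 15 keywords and cuts each one to 40 characters. */
    public static List<String> applyLimits(List<String> keywords) {
        List<String> limited = new ArrayList<>();
        for (String keyword : keywords) {
            if (limited.size() == MAX_KEYWORDS) {
                break;  // anything past this point would not be displayed or searched
            }
            limited.add(keyword.length() > MAX_LENGTH
                    ? keyword.substring(0, MAX_LENGTH)
                    : keyword);
        }
        return limited;
    }
}
```

The input list is left untouched, so the caller can still report which keywords were cut or dropped.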

largelylivin · January 3, 2008

This thread seems to be in the neighborhood of something that I need to do. I need the ability to have 'named' keywords like Brand: Sea Ray that I can search for to build my own dynamic galleries. Actually, a better example is BoatName: argus because the name is completely unknown and cannot be separated from the dozens of other keywords on the photo.

Could this be used to accomplish something like this or should I dig into IPTC keywords?
