Parsing keywords. This piece of code may come in handy

flyingdutchie Registered Users Posts: 1,286 Major grins
Hi everyone,

For my SmugFig API I wrote this piece of Java code, which transforms a string of keywords into an ArrayList of those keywords.
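import java.util.ArrayList;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SomeClass {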
...
...
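    // Match the quoted keyword only; delimiters after it stay in the remaining text
    // so that unquoted keywords on either side of it are still kept apart.
    private static final String quotePattern    = "(\"([^\"]+)\")";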
    private static final String commaPattern    = "(([^,;]+)\\s*[,;]?)";
    private static final String spacePattern    = "(([^,;\\s]+)\\s*)";
 
    public static ArrayList<String> toKeywords(String keywords) {
        return quoteParse(keywords);
    }
 
    private static ArrayList<String> quoteParse(String keywordsString) {
        if (keywordsString == null)
            return null;
        
        if (keywordsString.length() == 0)
            return new ArrayList<String>();
        
        final ArrayList<String> retValue = new ArrayList<String>();

        Pattern pat     = Pattern.compile(quotePattern);
        Matcher matcher = pat.matcher(keywordsString);
        
        Set<String> keywords = new TreeSet<String>();
        while (matcher.find()) {
            String matchResult = matcher.group(2);
            if (matchResult.length() > 0)
                keywords.add(matchResult);
        }

        final String remainingNonQuotedWords = matcher.replaceAll("");
        
        if (remainingNonQuotedWords.length() > 0) {
            boolean isSpaceDelimited = 
                    remainingNonQuotedWords.indexOf(',')<0 && 
                    remainingNonQuotedWords.indexOf(';')<0;

            if (isSpaceDelimited) {
                pat     = Pattern.compile(spacePattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            }
            else {
                pat     = Pattern.compile(commaPattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            }
        }
        
        retValue.addAll(keywords);
        return retValue;
    }
...
...
}

It can parse
  • keywords separated by spaces
    Single-word keywords. E.g.
    wedding anderson ceremony
  • keywords separated by commas or semicolons
    Single or multi-word keywords. E.g.
    boston, red sox; world championship
  • quoted keywords separated by spaces, commas or semicolons
    "boston" "red sox"; "world championship"
  • Or any combination thereof:
    "boston", red sox, world championship "parade"; hello
It took me a while to figure out how Smugmug parses keywords. The code above comes close to it, I think. It is not perfect, and I have tested it only a little bit. Let me know if it works for you. :D

(Updated it with new code to also handle keywords separated by semicolons)
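
The four input styles listed above can be tried out with a compact, self-contained sketch. It follows the same idea (quoted keywords first, then commas/semicolons if any are present, otherwise whitespace), but it is a simplified restatement that uses String.split rather than the patterns from the post. The names KeywordDemo and parse are hypothetical, and unlike the code above it returns an empty list for null input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of SmugFig or of any SmugMug API.
final class KeywordDemo {
    // A quoted keyword: one or more non-quote characters between double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]+)\"");

    static List<String> parse(String input) {
        // TreeSet drops duplicates and sorts, as in the post above.
        TreeSet<String> keywords = new TreeSet<>();
        if (input == null) {
            return new ArrayList<>(keywords);
        }

        // Step 1: collect the quoted keywords.
        Matcher quoted = QUOTED.matcher(input);
        while (quoted.find()) {
            addIfNotBlank(keywords, quoted.group(1));
        }

        // Step 2: blank out the quoted parts, keeping the delimiters around them.
        String rest = quoted.replaceAll(" ");

        // Step 3: split on commas/semicolons if there are any, else on whitespace.
        boolean hasListDelimiter = rest.indexOf(',') >= 0 || rest.indexOf(';') >= 0;
        for (String part : rest.split(hasListDelimiter ? "[,;]" : "\\s+")) {
            addIfNotBlank(keywords, part);
        }
        return new ArrayList<>(keywords);
    }

    private static void addIfNotBlank(TreeSet<String> target, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            target.add(trimmed);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("wedding anderson ceremony"));
        System.out.println(parse("boston, red sox; world championship"));
        System.out.println(parse("\"boston\" \"red sox\"; \"world championship\""));
        System.out.println(parse("\"boston\", red sox, world championship \"parade\"; hello"));
    }
}
```

With this sketch the last call prints [boston, hello, parade, red sox, world championship]; the result is sorted because of the TreeSet, not because of the input order.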
I can't grasp the notion of time.

When I hear the earth will melt into the sun,
in two billion years,
all I can think is:
    "Will that be on a Monday?"
==========================
http://www.streetsofboston.com
http://blog.antonspaans.com

Comments

  • darryl Registered Users Posts: 997 Major grins
    edited November 19, 2007
    Hey flyingdutchie... funny, I wrote a parser too. Gotta watch out for semi-colons too:
    use HTML::Entities ;    # provides decode_entities

    sub parsekeys {
    
        my $oldkeywords = '' ;
        my (@allkeywords, @quotedkeywords, @splitkeywords) ;
    
        $oldkeywords = shift ;
    
    # Decode the HTML entities
        $oldkeywords =~ s|\\&quot;|&quot;|g ;
        $oldkeywords = decode_entities($oldkeywords) ;
    
    # Let's pluck out the quoted strings ;
    
        while ($oldkeywords =~ m/"(.+?)"/) {
            push (@quotedkeywords,$1) ;
            $oldkeywords =~ s/".+?"// ;
        }
    
        $oldkeywords =~ s/^\s*// ;
        $oldkeywords =~ s/\s*$// ;
        $oldkeywords =~ s/\s\s+/ /g ;
        $oldkeywords =~ s/;\s+/;/g ;
    
        if ($oldkeywords =~ /[;,]/) {
            @splitkeywords = split (/[;,]\s*/, $oldkeywords) ;
        } else {
            @splitkeywords = split (/\s/, $oldkeywords) ;
        }
    
        @allkeywords = (@quotedkeywords, @splitkeywords) ;
    
    # sort returns a new list rather than sorting in place, so return its result
        return sort @allkeywords ;
    }
    
  • flyingdutchie Registered Users Posts: 1,286 Major grins
    edited November 19, 2007
    darryl wrote:
    Hey flyingdutchie... funny, I wrote a parser too. Gotta watch out for semi-colons too:

    Aha, those as well! I'll change my code to handle these as well.
    Thanks for your code snippet!

    Quick question: What does this part of your code do?
    # Decode the HTML entities
        $oldkeywords =~ s|\\&quot;|&quot;|g ;
        $oldkeywords = decode_entities($oldkeywords) ;
    
  • flyingdutchie Registered Users Posts: 1,286 Major grins
    edited November 19, 2007
    How does Smugmug actually handle keyword-strings?
    I hope Smugmug sees this message:

    Hello Smugmug
    Instead of me doing guesswork: how are keywords parsed by Smugmug's system? Do you have any code, or a description of that code?

    Thanks!

    -- Anton.
  • flyingdutchie Registered Users Posts: 1,286 Major grins
    edited November 20, 2007
    Updated code original message
    I updated the code in the original message to handle keywords that are separated by semicolons as well.
  • darryl Registered Users Posts: 997 Major grins
    edited November 27, 2007
    flyingdutchie wrote:
    Aha, those as well! I'll change my code to handle these as well.
    Thanks for your code snippet!

    Quick question: What does this part of your code do?
    # Decode the HTML entities
        $oldkeywords =~ s|\\&quot;|&quot;|g ;
        $oldkeywords = decode_entities($oldkeywords) ;
    

    Hi, I thought I wrote back on this.

    I discovered that, for some reason, SmugMug converts " marks into \&quot;. Additionally, it converts things like & into &amp;, etc.

    I didn't want to deal with entities at all, and it turns out that if you submit them back through the API, SmugMug converts them for you. So I end up using the Perl decode_entities function (from the HTML::Entities module) to convert all of those &amp;s and &quot;s into & and ".
  • FormerLurker Registered Users Posts: 82 Big grins
    edited November 30, 2007
    Could this be tweaked?
    (from a previous thread without resolution)

    I suspect this will probably require JS (if it's even possible) and, since I haven't done anything with JS other than cut and paste, I'm hoping one of the JS experts might point me in the right direction.

    I have two distinct headers for different sections of my SM site: the default header for everything but one specific category, and a different header for the one category.

    I am playing around with customizing my search/keyword pages and am trying to figure out a way to keep the same structure. That is, if a certain keyword is searched for, the category-specific header is used instead of the default header.

    I am constructing my search box in php so it will be no problem to add a "hidden" keyword that the user won't have to type, so really all I think I need is to see if there is a way I can use JS to "catch" the keywords and switch out the header if the category-specific keyword is part of the search string (and therefore part of the URL).

    Thoughts? Can this be done? Thanks in advance!
  • darryl Registered Users Posts: 997 Major grins
    edited December 10, 2007
    darryl wrote:
    Hi, I thought I wrote back on this.

    I discovered that, for some reason, SmugMug converts " marks into \&quot;. Additionally, it converts things like & into &amp;, etc.

    I didn't want to deal with entities at all, and it turns out that if you submit them back through the API, SmugMug converts them for you. So I end up using the Perl decode_entities function (from the HTML::Entities module) to convert all of those &amp;s and &quot;s into & and ".

    Oh, the reason for the first regexp is that SmugMug's API seems broken (or at least inconsistent) when it comes to double quotes. If you use the web UI to enter a keyword with quotes, like "this word", then if you use the API to read the keywords back, you'll see &quot;this word&quot;

    That makes sense.

    However, if you use the API to set a keyword like "that word", when you read the keyword back, it'll be \&quot;thatword\&quot;

    What's odd is that other entities like & simply get turned into &amp; without any leading backslash.

    I actually started a thread on this (post 658237) a while back, but forgot about it. It's still annoying though. (/me waves at Dev)
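
    For Java readers, the two Perl lines discussed in this post (drop the stray backslash in front of an encoded quote, then decode the entities) might be sketched as follows. This is a minimal sketch, not SmugMug's code and not a port of HTML::Entities: the class KeywordEntities and the method decode are hypothetical names, and only &quot; and &amp; are handled, whereas decode_entities covers the full set of HTML entities.

    ```java
    // Hypothetical helper; handles only the two entities discussed in this thread.
    final class KeywordEntities {
        static String decode(String raw) {
            return raw
                .replace("\\&quot;", "&quot;")  // drop the stray backslash before an encoded quote
                .replace("&quot;", "\"")        // decode quotes
                .replace("&amp;", "&");         // decode ampersands last, so "&amp;quot;" stays literal text
        }

        public static void main(String[] args) {
            System.out.println(decode("\\&quot;that word\\&quot;"));
            System.out.println(decode("Tom &amp; Jerry"));
        }
    }
    ```

    Decoding &amp; last is a deliberate choice: doing it first would turn the literal text &amp;quot; into a quote character by accident.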
  • mouellette Registered Users Posts: 11 Big grins
    edited December 10, 2007
    Here is my code for fixing keywords in Ruby
    def self.fixKeywords(keywords)
    	if keywords.nil?
    		return ""
    	end
    	newKeywords = keywords.collect do |key|
    		# remove any quotes or punctuation
    		newkey = key.gsub(/["'.:?~`!\@#\$%\^&\*\(\)<>\?,\/\-\+\=]/, '')
    		# we need to make sure that keywords are restricted to 40 characters.
    		if newkey.size > 40
    			newkey = newkey[0..39]
    		end
    		if key =~ /[\s\d]/
    			newkey = '"' + newkey + '"'
    		end
    		newkey
    	end
    	returnValue = newKeywords.join(", ")
    	return returnValue
    end
    


    Then when I read the keywords back in I turn them back into an array with:
    @keywords = info.attributes['Keywords'].gsub(/[\\"]+|&quot;/,'').split(/\s*,\s*/)
    

    NOTE: I found you also need to escape quotes in captions with their entity equivalents and then convert them back when reading them in.

    LIMITS: I also found the following limits to keywords. Only the first 15 keywords will be displayed and used in searching. A keyword can only have up to 40 characters.
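
    The two limits reported above could be applied before submitting keywords roughly like this. It is a minimal Java sketch under stated assumptions: KeywordLimits and format are hypothetical names, only double quotes are stripped (the Ruby code above strips much more punctuation), and dropping everything after the 15th keyword is just one possible policy, not something the API is known to require.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Hypothetical helper; the limits are the ones reported in this thread.
    final class KeywordLimits {
        static final int MAX_KEYWORDS = 15; // only the first 15 are reported to be used
        static final int MAX_LENGTH = 40;   // reported maximum keyword length

        // Like the Ruby code, quote keywords that contain whitespace or a digit.
        private static final Pattern NEEDS_QUOTES = Pattern.compile("[\\s\\d]");

        static String format(List<String> keywords) {
            List<String> out = new ArrayList<>();
            for (String key : keywords) {
                if (out.size() == MAX_KEYWORDS) {
                    break; // assumption: silently drop the surplus keywords
                }
                String cleaned = key.replace("\"", "").trim();
                if (cleaned.isEmpty()) {
                    continue;
                }
                if (cleaned.length() > MAX_LENGTH) {
                    cleaned = cleaned.substring(0, MAX_LENGTH);
                }
                boolean quote = NEEDS_QUOTES.matcher(cleaned).find();
                out.add(quote ? "\"" + cleaned + "\"" : cleaned);
            }
            return String.join(", ", out);
        }

        public static void main(String[] args) {
            List<String> sample = new ArrayList<>();
            sample.add("boston");
            sample.add("red sox");
            System.out.println(format(sample));
        }
    }
    ```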
  • largelylivin Registered Users Posts: 561 Major grins
    edited January 3, 2008
    This thread seems to be in the neighborhood of something that I need to do. I need the ability to have 'named' keywords like Brand: Sea Ray that I can search for to build my own dynamic galleries. Actually, a better example is BoatName: argus because the name is completely unknown and cannot be separated from the dozens of other keywords on the photo.

    Could this be used to accomplish something like this or should I dig into IPTC keywords?
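
    One way the 'named' keyword idea might be layered on top of an already parsed keyword list is sketched below in Java. It assumes that a keyword such as Brand: Sea Ray survives a round trip through SmugMug with its colon intact, which nothing in this thread confirms; NamedKeywords and extract are hypothetical names.

    ```java
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical helper: picks "Name: value" keywords out of a parsed keyword list.
    final class NamedKeywords {
        static Map<String, String> extract(List<String> keywords) {
            Map<String, String> named = new LinkedHashMap<>();
            for (String keyword : keywords) {
                int colon = keyword.indexOf(':');
                if (colon <= 0) {
                    continue; // no colon, or nothing in front of it: an ordinary keyword
                }
                String name = keyword.substring(0, colon).trim();
                String value = keyword.substring(colon + 1).trim();
                if (!name.isEmpty() && !value.isEmpty()) {
                    named.put(name, value);
                }
            }
            return named;
        }

        public static void main(String[] args) {
            System.out.println(extract(List.of("boat", "Brand: Sea Ray", "BoatName: argus")));
        }
    }
    ```

    Ordinary keywords such as boat are simply skipped, so the same keyword list can still be used unchanged for normal searches.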
    Brad Newby

    http://blue-dog.smugmug.com
    http://smile-123.smugmug.com
    http://vintage-photos.blogspot.com/

    Canon 7D, 100-400L, Mongoose 3.5, hoping for a 500L real soon.