Parsing keywords. This piece of code may come in handy
flyingdutchie
Registered Users Posts: 1,286 Major grins
Hi everyone,
I wrote, for my SmugFig API, this piece of code that transforms a string of keywords into an ArrayList of these keywords. It is a snippet of Java code.
import java.util.ArrayList;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SomeClass {
    ...
    ...
    // A quoted keyword, followed by optional whitespace, commas or semicolons.
    private static final String quotePattern = "(\"([^\"]+)\"[\\s,;]*)";
    // A run of characters up to the next comma or semicolon (may contain spaces).
    private static final String commaPattern = "(([^,;]+)\\s*[,;]?)";
    // A single word delimited by whitespace.
    private static final String spacePattern = "(([^,;\\s]+)\\s*)";

    public static ArrayList<String> toKeywords(String keywords) {
        return quoteParse(keywords);
    }

    private static ArrayList<String> quoteParse(String keywordsString) {
        if (keywordsString == null)
            return null;
        if (keywordsString.length() == 0)
            return new ArrayList<String>();

        final ArrayList<String> retValue = new ArrayList<String>();

        // First pass: collect the quoted keywords.
        Pattern pat = Pattern.compile(quotePattern);
        Matcher matcher = pat.matcher(keywordsString);
        // A TreeSet removes duplicates and sorts the keywords.
        Set<String> keywords = new TreeSet<String>();
        while (matcher.find()) {
            String matchResult = matcher.group(2);
            if (matchResult.length() > 0)
                keywords.add(matchResult);
        }

        // Second pass: strip the quoted keywords and parse whatever is left.
        final String remainingNonQuotedWords = matcher.replaceAll("");
        if (remainingNonQuotedWords.length() > 0) {
            // Without any comma or semicolon, the spaces are the delimiters.
            boolean isSpaceDelimited = remainingNonQuotedWords.indexOf(',') < 0
                    && remainingNonQuotedWords.indexOf(';') < 0;
            if (isSpaceDelimited) {
                pat = Pattern.compile(spacePattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            } else {
                pat = Pattern.compile(commaPattern);
                matcher = pat.matcher(remainingNonQuotedWords);
                while (matcher.find()) {
                    String matchResult = matcher.group(2).trim();
                    if (matchResult.length() > 0)
                        keywords.add(matchResult);
                }
            }
        }

        retValue.addAll(keywords);
        return retValue;
    }
    ...
    ...
}
It can parse:
- keywords separated by spaces
  Single-word keywords. E.g.
  wedding anderson ceremony
- keywords separated by commas or semicolons
  Single or multi-word keywords. E.g.
  boston, red sox; world championship
- keywords separated by spaces, commas or semicolons and which are quoted
  "boston" "red sox"; "world championship"
- Or any combination thereof:
  "boston", red sox, world championship "parade"; hello
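To make the cases above concrete, here is a compact, self-contained sketch that uses the same three regular expressions as the snippet. The class name KeywordParseDemo and its layout are my own assumptions for illustration, not part of SmugFig. Because the keywords are collected in a TreeSet, they come back sorted and without duplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the three patterns are taken from the post.
final class KeywordParseDemo {
    private static final Pattern QUOTED = Pattern.compile("(\"([^\"]+)\"[\\s,;]*)");
    private static final Pattern COMMA = Pattern.compile("(([^,;]+)\\s*[,;]?)");
    private static final Pattern SPACE = Pattern.compile("(([^,;\\s]+)\\s*)");

    static List<String> toKeywords(String input) {
        if (input == null) {
            return null;
        }
        Set<String> keywords = new TreeSet<>();

        // Collect the quoted keywords first.
        Matcher quoted = QUOTED.matcher(input);
        while (quoted.find()) {
            keywords.add(quoted.group(2));
        }

        // replaceAll resets the matcher, then strips every quoted keyword.
        String rest = quoted.replaceAll("");

        // No comma or semicolon left: treat spaces as the delimiter.
        boolean spaceDelimited = rest.indexOf(',') < 0 && rest.indexOf(';') < 0;
        Matcher m = (spaceDelimited ? SPACE : COMMA).matcher(rest);
        while (m.find()) {
            String word = m.group(2).trim();
            if (!word.isEmpty()) {
                keywords.add(word);
            }
        }
        return new ArrayList<>(keywords);
    }

    public static void main(String[] args) {
        // prints [anderson, ceremony, wedding]
        System.out.println(toKeywords("wedding anderson ceremony"));
        // prints [boston, red sox, world championship]
        System.out.println(toKeywords("boston, red sox; world championship"));
        // prints [boston, red sox, world championship]
        System.out.println(toKeywords("\"boston\" \"red sox\"; \"world championship\""));
    }
}
```

One thing to watch for in the mixed case: with these patterns, stripping "parade"; also strips the semicolon behind it, so in this sketch the last example comes back with "world championship hello" as a single keyword.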
(Updated it with new code to also handle keywords separated by semicolons)
I can't grasp the notion of time.
When I hear the earth will melt into the sun,
in two billion years,
all I can think is:
"Will that be on a Monday?"
==========================
http://www.streetsofboston.com
http://blog.antonspaans.com
Comments
Aha, those as well! I'll change my code to handle these as well.
Thanks for your code snippet!
Quick question: What does this part of your code do?
I hope Smugmug sees this message:
Hello Smugmug
Instead of doing guess-work: how are keywords parsed by Smugmug's system? Do you have any code, or a description of that code?
Thanks!
-- Anton.
I updated the code in the original message to handle keywords that are separated by semicolons as well.
Hi, I thought I wrote back on this.
I discovered that, for some reason, SmugMug converts " marks into \&quot;. Additionally, it converts things like & into &amp;, etc.
I didn't want to deal with entities at all, and it turns out that if you submit them back through the API, SmugMug converts them for you. So I end up using the Perl decode_entities function (from the HTML::Entities module) to convert all of those &amp;s and &quot;s into & and ".
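For anyone doing this from Java rather than Perl, a rough sketch of the same clean-up could look like the following. It is not a port of decode_entities: it only handles the backslash-quote form and the two entities mentioned in this thread, and the class and method names are made up for illustration.

```java
// Hypothetical helper; covers only \&quot;, &quot; and &amp;.
final class EntityDecodeSketch {
    static String decode(String raw) {
        return raw
                // The API sometimes returns the quote entity with a leading backslash.
                .replace("\\&quot;", "\"")
                .replace("&quot;", "\"")
                // Decode &amp; last so that "&amp;quot;" becomes "&quot;" and not a quote.
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints "that word" (including the double quotes)
        System.out.println(decode("\\&quot;that word\\&quot;"));
        // prints Tom & Jerry
        System.out.println(decode("Tom &amp; Jerry"));
    }
}
```

A full entity table would need a library; this is only meant to show the order of the replacements.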
(from a previous thread without resolution)
I suspect this will probably require JS (if it's even possible) and, since I haven't done anything with JS other than cut and paste, I'm hoping one of the JS experts might point me in the right direction.
I have two distinct headers for different sections of my SM site: the default header for everything but one specific category, and a different header for the one category.
I am playing around with customizing my search/keyword pages and am trying to figure out a way to keep the same structure. That is, if a certain keyword is searched for, the category-specific header is used instead of the default header.
I am constructing my search box in php so it will be no problem to add a "hidden" keyword that the user won't have to type, so really all I think I need is to see if there is a way I can use JS to "catch" the keywords and switch out the header if the category-specific keyword is part of the search string (and therefore part of the URL).
Thoughts? Can this be done? Thanks in advance!
Oh, the reason for the first regexp is that SmugMug's API seems broken (or at least inconsistent) when it comes to double quotes. If you use the web UI to enter a keyword with quotes, like "this word", then if you use the API to read the keywords back, you'll see &quot;this word&quot;
That makes sense.
However, if you use the API to set a keyword like "that word", when you read the keyword back, it'll be \&quot;thatword\&quot;
What's odd is that other entities like & simply get turned into &amp; without any leading backslash.
I actually started a [post=658237]thread on this[/post] a while back, but forgot about it. It's still annoying though. (/me waves at Dev)
Then when I read the keywords back in I turn them back into an array with:
NOTE: I found you also need to escape quotes in captions with their entity equivalents and then convert them back when reading them in.
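As a rough illustration of that caption round trip in Java (the class and method names are my own, and I am assuming the entity in question is &quot; as elsewhere in this thread):

```java
// Hypothetical helper for the caption round trip described above.
final class CaptionQuoteSketch {
    // Replace double quotes with their entity before sending the caption.
    static String escapeQuotes(String caption) {
        return caption.replace("\"", "&quot;");
    }

    // Turn the entity back into a double quote after reading the caption.
    static String unescapeQuotes(String stored) {
        return stored.replace("&quot;", "\"");
    }

    public static void main(String[] args) {
        String caption = "The \"Green Monster\" at Fenway";
        String escaped = escapeQuotes(caption);
        // prints The &quot;Green Monster&quot; at Fenway
        System.out.println(escaped);
        // prints true
        System.out.println(unescapeQuotes(escaped).equals(caption));
    }
}
```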
LIMITS: I also found the following limits to keywords. Only the first 15 keywords will be displayed and used in searching. A keyword can only have up to 40 characters.
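If you want to apply those limits on the client side before submitting, something along these lines might do. The numbers 15 and 40 come from the observation above. Skipping an over-long keyword (rather than truncating it) is purely my assumption, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the limits are the ones reported in this thread.
final class KeywordLimitsSketch {
    static final int MAX_KEYWORDS = 15;
    static final int MAX_LENGTH = 40;

    static List<String> applyLimits(List<String> keywords) {
        List<String> kept = new ArrayList<>();
        for (String keyword : keywords) {
            if (kept.size() == MAX_KEYWORDS) {
                break; // anything beyond the first 15 would not be used anyway
            }
            // Assumption: drop keywords longer than 40 characters instead of cutting them.
            if (keyword.length() <= MAX_LENGTH) {
                kept.add(keyword);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> many = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            many.add("keyword" + i);
        }
        // prints 15
        System.out.println(applyLimits(many).size());
    }
}
```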
Could this be used to accomplish something like this or should I dig into IPTC keywords?
http://blue-dog.smugmug.com
http://smile-123.smugmug.com
http://vintage-photos.blogspot.com/
Canon 7D, 100-400L, Mongoose 3.5, hoping for a 500L real soon.