Split a string ignoring quoted sections
Given a string like this:
a,"string, with",various,"values, and some",quoted
What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?
The output should be an array:
[ "a", "string, with", "various", "values, and some", "quoted" ]
If my language of choice didn't offer a way to do this without thinking then I would initially consider two options as the easy way out:
Pre-parse and replace the commas inside the quoted sections with a control character, split on the remaining commas, then post-parse the array to turn the control character back into commas.
Alternatively, split on the commas, then post-parse the resulting array into another array, checking for a leading quote on each entry and concatenating entries until I reach a terminating quote.
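For what it's worth, the second option (split first, then stitch the quoted pieces back together) might look something like the sketch below. The function name `split_then_rejoin` is made up for illustration, and it assumes quotes only ever wrap whole fields and are never escaped.

```python
def split_then_rejoin(text, sep=",", quote='"'):
    """Naive split on sep, then re-join pieces that belong to one quoted field."""
    fields = []
    pending = None  # pieces of a quoted field still waiting for its closing quote
    for piece in text.split(sep):
        if pending is None:
            # A piece like '"abc"' is already a complete quoted field
            complete = len(piece) > 1 and piece.endswith(quote)
            if piece.startswith(quote) and not complete:
                pending = [piece]
            else:
                fields.append(piece)
        else:
            pending.append(piece)
            if piece.endswith(quote):
                # Put back the separators the naive split removed
                fields.append(sep.join(pending))
                pending = None
    if pending is not None:
        # Unterminated quote: keep the remainder as one field, quote included
        fields.append(sep.join(pending))

    def unquote(field):
        # Strip the surrounding quotes from fully quoted fields only
        if len(field) >= 2 and field[0] == quote and field[-1] == quote:
            return field[1:-1]
        return field

    return [unquote(f) for f in fields]


print(split_then_rejoin('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```

It works for the sample input, but it has no notion of escaped or doubled quotes.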
These are hacks however, and if this is a pure 'mental' exercise then I suspect they will prove unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.
Looks like you've got some good answers here.
For those of you looking to handle your own CSV file parsing, heed the advice from the experts: don't roll your own CSV parser.
Your first thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
    import csv

    # newline="" lets the csv module handle line endings itself
    with open("some.csv", newline="") as f:
        for row in csv.reader(f):
            print(row)
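If the data is already in a string rather than a file, the same module can be pointed at it directly, since `csv.reader` accepts any iterable of lines. A minimal sketch, where `parse_line` is just a helper name made up for this example:

```python
import csv


def parse_line(line):
    """Parse one CSV-formatted line into a list of fields."""
    # csv.reader takes any iterable of strings, so a one-element list works
    return next(csv.reader([line]))


print(parse_line('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']

# Doubled quotes inside a quoted field come back as one literal quote
print(parse_line('a,"say ""hi"", ok",b'))
# ['a', 'say "hi", ok', 'b']
```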
Of course using a CSV parser is better but just for the fun of it you could:
Loop over the string letter by letter:
    If current_letter == quote: toggle the inside_quote variable.
    Else if current_letter == comma and not inside_quote: push current_word into the array and clear current_word.
    Else: append current_letter to current_word.
When the loop is done, push current_word into the array.
What if an odd number of quotes appear in the original string?
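With the toggle approach, an odd number of quotes simply leaves the flag set when the loop ends, so everything after the last quote lands in one final field. One possible answer is to treat that as an error instead. A rough sketch, with `split_strict` as a made-up name and raising `ValueError` as an arbitrary choice:

```python
def split_strict(text, sep=",", quote='"'):
    """Quote-aware split that rejects input with an unbalanced quote."""
    fields = []
    current = []
    inside_quote = False
    for ch in text:
        if ch == quote:
            inside_quote = not inside_quote
        elif ch == sep and not inside_quote:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if inside_quote:
        # The flag is still set, so the number of quotes was odd
        raise ValueError("unbalanced quote in input")
    fields.append("".join(current))
    return fields


print(split_strict('a,"b,c",d'))  # ['a', 'b,c', 'd']

try:
    split_strict('a,"b,c')
except ValueError as err:
    print(err)  # unbalanced quote in input
```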
This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only escaped if the whole field is delimited with double quotes, so:
field1, "field2, field3", field4, "field5, field6" field7
Notice that if a field doesn't both start and end with a double quote, then it's not a quoted field and the double quotes are simply treated as literal double quotes.
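As one data point, Python's `csv` module only recognises a quote as opening a quoted field when it is the first character of the field. The sketch below (the `parse_line` helper is a made-up name) also shows that a space after the comma, as in the example line, is enough to change the result unless `skipinitialspace=True` is passed.

```python
import csv


def parse_line(line, **fmt):
    """Parse a single line with csv.reader, passing through format options."""
    return next(csv.reader([line], **fmt))


# Quote at the very start of the field: quoted field, inner comma is kept
print(parse_line('"a,b",c'))          # ['a,b', 'c']

# Quote in the middle of a field: kept as a literal character
print(parse_line('5" floppy,disk'))   # ['5" floppy', 'disk']

# A leading space means the field no longer starts with a quote
print(parse_line('x, "a,b"'))                         # ['x', ' "a', 'b"']
print(parse_line('x, "a,b"', skipinitialspace=True))  # ['x', 'a,b']
```

Other CSV libraries may make different choices here, so it's worth checking how yours behaves on input like this.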
Incidentally, my code that someone linked to doesn't actually handle this correctly, if I recall correctly.
Here's a simple Python implementation based on Pat's pseudocode:
    def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
        string_split = []
        current_word = ""
        inside_quote = False
        for letter in string:
            if letter == "'":
                # Keep the quote character unless asked to strip it
                if not remove_quotes:
                    current_word += letter
                inside_quote = not inside_quote
            elif letter == split_char and not inside_quote:
                # Separator outside quotes: the current word is finished
                string_split.append(current_word)
                current_word = ""
            else:
                current_word += letter
        # Push whatever is left after the last separator
        string_split.append(current_word)
        return string_split