I have the following string:
string text = "1. This is first sentence. 2. This is the second sentence. 3. This is the third sentence. 4. This is the fourth sentence.";
I want to split it according to 1. 2. 3. and so on:
result[0] == "This is first sentence."
result[1] == "This is the second sentence."
result[2] == "This is the third sentence."
result[3] == "This is the fourth sentence."
Is there any way I can do this in C#?
5 Answers
Assuming your sentences never contain the pattern "X. " (an integer, followed by a period, followed by a space), this should work:
String[] result = Regex.Split(text, @"[0-9]+\. ");
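Note that because the text starts with "1. ", Regex.Split returns an empty string as the first element, and each piece except the last keeps the space that preceded the next marker. A minimal sketch of one way to clean that up, assuming LINQ is available (the class and method names here are hypothetical, chosen only for illustration):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberedSplitter
{
    // Hypothetical helper: splits on "<digits>. " markers and tidies the pieces.
    public static string[] SplitNumbered(string text)
    {
        return Regex.Split(text, @"[0-9]+\. ")
            .Select(part => part.Trim())      // remove the space left before the next marker
            .Where(part => part.Length > 0)   // remove the empty entry produced before "1. "
            .ToArray();
    }
}
```

Calling NumberedSplitter.SplitNumbered(text) on the string from the question should give four entries, with result[0] being "This is first sentence.". The same assumption as above still applies: a sentence that itself contains a number followed by ". " will be split there too.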
5 Comments
\d matches any Unicode digit (fileformat.info/info/unicode/category/Nd/list.htm). For instance, the character ௮ will be split on if you use \d but won't be if you use [0-9].
\d is equivalent to [0-9]. :)
\d is a rather leaky abstraction if you start running into cases where your regex matches special characters you hadn't intended it to match...
Is it possible that there will be numbers in the sentence too?
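To make the \d versus [0-9] point concrete, here is a small sketch of how the two patterns behave in .NET on a non-ASCII digit. By default \d matches any character in the Unicode Nd category; with RegexOptions.ECMAScript it is restricted to [0-9]. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class DigitClassDemo
{
    // Default .NET behavior: \d matches any Unicode decimal digit (category Nd).
    public static bool MatchesBackslashD(string s) => Regex.IsMatch(s, @"\d");

    // An explicit range matches ASCII digits only.
    public static bool MatchesAsciiRange(string s) => Regex.IsMatch(s, @"[0-9]");

    // With RegexOptions.ECMAScript, \d is treated as [0-9].
    public static bool MatchesEcmaScriptD(string s) =>
        Regex.IsMatch(s, @"\d", RegexOptions.ECMAScript);
}
```

For "\u0BEE" (the Tamil digit ௮ from the comment), the first method returns true and the other two return false, so [0-9] is the safer choice when only ASCII numbering is expected.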
As I do not know your formatting, and you already said you cannot split on EOL/newline, I would try something like...
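// Requires: using System.Collections.Generic;
// input is the string to split
List<string> lines = new List<string>();
string buffer = "";
int count = 1;
foreach (char c in input)
{
    // A new item starts when the current character is the next expected number.
    // Note: this only works for single-digit numbers (1 to 9).
    if (c.ToString() == count.ToString())
    {
        if (!string.IsNullOrEmpty(buffer))
        {
            lines.Add(buffer);
            buffer = "";
        }
        count++;
    }
    buffer += c;
}
// Add the last item, which is not followed by another number
if (!string.IsNullOrEmpty(buffer))
{
    lines.Add(buffer);
}
// lines now contains your split data (each entry still starts with its number)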
You can then access each sentence like this...
string s1 = lines[0];
string s2 = lines[1];
string s3 = lines[2];
Important: Make sure you check the count of lines before getting a sentence, like...
string s1 = lines.Count > 0 ? lines[0] : "";
This makes a big assumption: that a given sentence will not contain the number of the next line (i.e. sentence 2 will not contain the number 3).
If this does not help, then provide your input in its original format (do not add line breaks if there are none).
EDIT: Fixed my code (wrong variable sorry)
int index = 1;
String[] result = Regex.Split(text, @"[0-9]+\. ").Where(i => !string.IsNullOrEmpty(i)).Select(i => (index++).ToString() + ". " + i).ToArray();
result will contain your sentences, including the "line number".
You could split on the '.' char and drop anything shorter than 2 characters from the resulting array.
Of course, this relies on there being no 1-character pieces other than the numeric indicators; if there are, you could also check whether each piece is a numeric value.
This approach also drops the period from your sentences, so you'd have to add that back in. There is a lot of manipulation, but it saves you from having to read each char and decide on it independently.
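A rough sketch of that idea, assuming LINQ is available and that no sentence contains an internal period (such as "e.g."). The class and method names are hypothetical. It applies both checks described above: pieces shorter than 2 characters and purely numeric pieces are dropped, and the period is added back.

```csharp
using System;
using System.Linq;

public static class PeriodSplitter
{
    // Hypothetical helper: splits on '.', discards the numeric markers,
    // and restores the period that Split removed.
    public static string[] SplitOnPeriods(string text)
    {
        return text.Split('.')
            .Select(part => part.Trim())
            // Drop 1-character pieces and pieces made only of digits (e.g. "10")
            .Where(part => part.Length >= 2 && !part.All(char.IsDigit))
            .Select(part => part + ".")
            .ToArray();
    }
}
```

PeriodSplitter.SplitOnPeriods(text) on the string from the question should return the four sentences without their numbers.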
This is the easiest way:
var str = "1. This is first sentence." +
"2. This is the second sentence." +
"3. This is the third sentence." +
"n. This is the nth sentence";
//set your max number e.g 10000
var num = Enumerable.Range(1, 10000).Select(x=>x.ToString()+".").ToArray();
var res=str.Split(num ,StringSplitOptions.RemoveEmptyEntries);
Hope this helps ;)
1. First line 2. Second Numbered2. 2. Third Line