Extracting a number from provided URLs inside a text file

Question 1

I need to extract an ID (an integer) from a URL.

Example: http://www.example.com/foo/bar/12345

For this I wrote four methods. The first one, ReadIDsFromFile(), is called by my constructor, and its return value is assigned to a property of the class. The methods are called in the order shown below.

private List<string> ReadIDsFromFile(string path)
{
    // path is the full qualified path to a txt file. C:\text.txt
    List<string> TweetIDsList = new List<string>();
    string temp = string.Empty;
    using (StreamReader sr = new StreamReader(path))
    {
        while ((temp = sr.ReadLine()) != null)
        {
            if (ValidateUrl(temp))
            {
                TweetIDsList.Add(ExtractID(temp));
            }
            else
            {
                Logger.Log("Invalid URL: {0}", temp);
            }
        }
    }
    return TweetIDsList;
}

private bool ValidateUrl(string url)
{
    Uri uriResult;
    bool result;
    return result = Uri.TryCreate(url, UriKind.Absolute, out uriResult) && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps);
}

private string ExtractID(string url)
{
    string id = string.Empty;
    char[] urlArray = url.ToCharArray();
    int result;
    for (int i = urlArray.Length - 1; i >= 0; i--)
    {
        if (Int32.TryParse(urlArray[i].ToString(), out result))
        {
            id += result;
        }
        else
        {
            break; // break loop. If tryparse fails it means we have reached a character which is not a number, probably a forward slash.
        }
    }
    return ReverseNumber(id);
}

private string ReverseNumber(string id)
{
    char[] tempArray = id.ToCharArray();
    string result = string.Empty;
    for (int i = tempArray.Length - 1; i >= 0; i--)
    {
        result += tempArray[i];
    }
    return result;
}

My code works without problems so far, but I feel it is unnecessarily complicated. I am especially concerned that the ID comes out in reversed order and that I then have to reverse it again. Is there a better way?

mjolka · Answer 1 · 2015-04-07 02:25:02Z

You can use a regular expression to find the ID. The pattern [0-9]+$ will match one or more occurrences of 0-9 at the end of the string. You can use it like this:

private static readonly Regex UrlId = new Regex("[0-9]+$");

private static string ExtractID(string url)
{
    var match = UrlId.Match(url);
    // match.Value is the matched text; it is equivalent to match.Captures[0].Value.
    return match.Success
        ? match.Value
        : string.Empty;
}
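
As a quick check of how the pattern behaves, here is a minimal sketch. The class name `UrlIdDemo` and the sample URLs are illustrative assumptions, not part of the original post. Because `$` anchors the match at the end of the input, a URL that ends in a slash or a query string yields no ID with this approach.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, used only to make the sketch self-contained.
public static class UrlIdDemo
{
    // Same pattern as in the answer: one or more digits at the very end of the input.
    private static readonly Regex UrlId = new Regex("[0-9]+$");

    public static string ExtractID(string url)
    {
        var match = UrlId.Match(url);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Calling `UrlIdDemo.ExtractID("http://www.example.com/foo/bar/12345")` gives `"12345"`, while the same URL with a trailing `/` gives an empty string, so callers may want to treat an empty result as "no ID found".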

Instead of using a StreamReader, consider using File.ReadLines:

foreach (var line in File.ReadLines(path))
{
    if (ValidateUrl(line))
    {
        TweetIDsList.Add(ExtractID(line));
    }
    else
    {
        Logger.Log("Invalid URL: {0}", line);
    }
}

You can remove the variable result from ValidateUrl:

private static bool ValidateUrl(string url)
{
    Uri uriResult;
    return Uri.TryCreate(url, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps);
}

Thank you very much. Can you elaborate on why I should use File.ReadLines()?

Since you're already processing the file line by line, File.ReadLines() lets you use foreach, so the code expresses more clearly what you're doing. Leave the null test to the library.

@Takeru As Snowbody said, File.ReadLines makes the code clearer. Note the difference between ReadLines and ReadAllLines: ReadLines enumerates the lines lazily, while ReadAllLines reads the whole file into an array first. I forgot to mention that you should specify the encoding when reading the file.
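
The advice about specifying the encoding could look roughly like the sketch below. It assumes the file is UTF-8, which the original post does not state; the class and method names (`LineReaderDemo`, `ReadUrlLines`) are hypothetical, and skipping blank lines is an extra convenience rather than something the thread asks for.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, used only to make the sketch self-contained.
public static class LineReaderDemo
{
    // Lazily yields the trimmed, non-blank lines of the file.
    // Assumption: the file is UTF-8 encoded.
    public static IEnumerable<string> ReadUrlLines(string path)
    {
        foreach (var line in File.ReadLines(path, Encoding.UTF8))
        {
            if (!string.IsNullOrWhiteSpace(line))
            {
                yield return line.Trim();
            }
        }
    }
}
```

Because the method is an iterator over File.ReadLines, the file is read as the caller enumerates, so only one line needs to be held in memory at a time.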

user33306 · Answer 2 · 2015-04-07 15:36:18Z

In this case all those helper functions are probably redundant, since you can use the Segments array of the Uri class to get the ID from the last segment:

private List<string> ReadIDsFromFile(string path)
{
    // path is the full qualified path to a txt file. C:\text.txt
    List<string> TweetIDsList = new List<string>();
    string temp = string.Empty;
    using (StreamReader sr = new StreamReader(path))
    {
        while ((temp = sr.ReadLine()) != null)
        {
            Uri uriResult;
            if (Uri.TryCreate(temp, UriKind.Absolute, out uriResult) && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
            {
                //TweetIDsList.Add(new string(uriResult.Segments.Last().Reverse().ToArray()));
                //Since reverse isn't necessary the segment itself can be passed
                TweetIDsList.Add(uriResult.Segments.Last());
            }
            else
            {
                Logger.Log("Invalid URL: {0}", temp);
            }
        }
    }
    return TweetIDsList;
}

Wow, thank you. I didn't know about that property of Uri. The Reverse() is unneeded in this case; I only had to reverse the string because I was reading it from end to start.

@Takeru - In that case you can use the segment itself. BTW, File.ReadAllLines reads the whole file into memory. That can be faster, but it is impractical for large files, where a streaming approach such as StreamReader is needed.

You can also add a check that the last segment is a valid int.
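
One way to add that check is sketched below. The names `SegmentIdDemo` and `TryGetId` are hypothetical, and parsing into `long` rather than `int` is an assumption made in case the IDs exceed the range of `int`. Each entry in Uri.Segments keeps its trailing slash, so the sketch trims it before parsing.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, used only to make the sketch self-contained.
public static class SegmentIdDemo
{
    // Returns true and sets id when url is an absolute http/https URL
    // whose last path segment consists only of digits.
    public static bool TryGetId(string url, out long id)
    {
        id = 0;
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            return false;
        }

        // Segments entries keep their trailing '/', so remove it first.
        string last = uri.Segments.Last().TrimEnd('/');

        // NumberStyles.None accepts digits only: no sign, whitespace, or separators.
        return long.TryParse(last, NumberStyles.None, CultureInfo.InvariantCulture, out id);
    }
}
```

With this shape, the reading loop can log a line as invalid both when it is not an http/https URL and when its last segment is not numeric.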
