I have this Java program that asks for a name of the file containing data rows in the format $$ x_1\ y_1 \\ x_2\ y_2 \\ \vdots \\ x_n\ y_n $$, and outputs on the command line the Pearson correlation coefficient of the input data:
com.github.coderodde.stat.pearson.PearsonCorrelation.java:
package com.github.coderodde.stat.pearson;
import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
public class PearsonCorrelation {
public static void main(String[] args) throws IOException {
if (args.length != 1) {
return;
}
String dataFileName = args[0];
List<Point2D.Double> dataPoints = readDataPointsFromFile(dataFileName);
double correlationCoefficient =
computeCorrelationCoefficient(dataPoints);
System.out.println("r = " + correlationCoefficient);
}
private static List<Point2D.Double>
readDataPointsFromFile(String dataFileName) throws IOException {
List<String> dataFileLines =
Files.readAllLines(new File(dataFileName).toPath());
List<Point2D.Double> dataPoints = new ArrayList<>();
for (String dataFileLine : dataFileLines) {
String[] parts = dataFileLine.split("\\s+");
double xCoordinate = Double.parseDouble(parts[0]);
double yCoordinate = Double.parseDouble(parts[1]);
dataPoints.add(new Point2D.Double(xCoordinate, yCoordinate));
}
return dataPoints;
}
private static double computeCorrelationCoefficient(
List<Point2D.Double> dataPoints) {
double averageX = computeMeanX(dataPoints);
double averageY = computeMeanY(dataPoints);
double numerator = computeNumerator(dataPoints, averageX, averageY);
double denominator = computeDenominator(dataPoints, averageX, averageY);
return numerator / denominator;
}
private static double
computeNumerator(List<Point2D.Double> dataPoints,
double averageX,
double averageY) {
double sum = 0.0;
for (Point2D.Double dataPoint : dataPoints) {
sum += (dataPoint.x - averageX) * (dataPoint.y - averageY);
}
return sum;
}
private static double
computeDenominator(List<Point2D.Double> dataPoints,
double averageX,
double averageY) {
double factor1Sum = 0.0;
double factor2Sum = 0.0;
for (Point2D.Double dataPoint : dataPoints) {
double dx = dataPoint.x - averageX;
double dy = dataPoint.y - averageY;
factor1Sum += dx * dx;
factor2Sum += dy * dy;
}
return Math.sqrt(factor1Sum * factor2Sum);
}
private static double computeMeanX(List<Point2D.Double> dataPoints) {
double sum = 0.0;
for (Point2D.Double dataPoint : dataPoints) {
sum += dataPoint.x;
}
return sum / dataPoints.size();
}
private static double computeMeanY(List<Point2D.Double> dataPoints) {
double sum = 0.0;
for (Point2D.Double dataPoint : dataPoints) {
sum += dataPoint.y;
}
return sum / dataPoints.size();
}
}
Critique request
Seems to work, but I would like to hear comments on how to make it more mature.
1 Answer 1
Don't repeat yourself
The methods that compute the means of x and y have the same structure, just operate on a different field of the input list.
It would be good to extract the common logic to a helper method that takes a function to extract a double value from the input list.
Actually, the other methods also have the same shape, looping over the same list, completing a value for each item and summing, so all of them can be implemented in terms of the suggested helper method, though at the expense of one additional pass in the computation of the denominator. It's not a big problem though, as written is easy to understand.
The CLI could be more friendly
When the number of arguments is unexpected, it could print a usage message rather than quietly ignore.
When the path parameter is not a file, it could print an error message rather than crash with a stack trace.
Avoid strange dependencies
I get that Point2D.Double
has what you need, but it's defined under awt
, a library for graphical UI, so it looks like a strange dependency for a command line tool.
It's not a big issue, but I think it would be cleaner to just define a custom static inner class for this.