A Python tool that uses XGBoost and sentence-transformers to perform schema matching on tables. It supports matching on multilingual column names and instances, and it can be used without column names. Both CSV and JSON files are supported.
Schema matching is the problem of finding potential associations between elements (most often attributes or relations) of two schemas.
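To make the definition concrete, here is a toy sketch that pairs the columns of two tables by name similarity alone, using only the Python standard library. It is not how this tool works (the tool combines many features in an XGBoost model and uses sentence-transformers embeddings); the function name `match_columns` and the 0.6 cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def match_columns(cols_a, cols_b, cutoff=0.6):
    """Pair each column in cols_a with its most similar name in cols_b.

    Hypothetical helper for illustration; not part of schema_matching.
    """
    pairs = []
    for a in cols_a:
        # Score every candidate by character-level similarity in [0, 1].
        scored = [
            (SequenceMatcher(None, a.lower(), b.lower()).ratio(), b)
            for b in cols_b
        ]
        score, best = max(scored)
        # Keep the best candidate only if it clears the cutoff.
        if score >= cutoff:
            pairs.append((a, best, round(score, 2)))
    return pairs


pairs = match_columns(["lang", "link", "words"], ["language", "url", "keywords"])
print(pairs)
```

With these inputs the sketch pairs `lang` with `language` and `words` with `keywords`, but leaves `link` unmatched because its name shares little with `url`. Cases like that are why a matcher benefits from looking at the column values as well as the names.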
pip install schema-matching
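from schema_matching import schema_matching

# Match the columns of two tables; the inputs may be CSV or JSON files.
df_pred, df_pred_labels, predicted_pairs = schema_matching("Test Data/QA/Table1.json", "Test Data/QA/Table2.json")
print(df_pred)
print(df_pred_labels)
# Print each predicted column pair on its own line.
for pair_tuple in predicted_pairs:
    print(pair_tuple)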
See the data format in the Training Data and Test Data folders. To add training data, create a new folder under Training Data and put mapping.txt, Table1.csv and Table2.csv in it. Test Data folders do not need mapping.txt.
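Before training, it can help to confirm that every folder under Training Data is complete. The following is a minimal sketch, assuming the layout described above; the helper `find_incomplete_folders` is hypothetical and not part of this repository, and it checks only that the files exist, not their contents.

```python
from pathlib import Path

# File names taken from the README; every training folder needs all three.
REQUIRED_TRAINING_FILES = ("mapping.txt", "Table1.csv", "Table2.csv")


def find_incomplete_folders(root, required=REQUIRED_TRAINING_FILES):
    """Return {folder_name: [missing files]} for each subfolder of root."""
    problems = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        missing = [name for name in required if not (folder / name).is_file()]
        if missing:
            problems[folder.name] = missing
    return problems


if __name__ == "__main__":
    root = Path("Training Data")
    if root.is_dir():
        # An empty dict means every folder has the three required files.
        print(find_incomplete_folders(root))
```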
python relation_features.py
python train.py
Example:
python cal_column_similarity.py -p Test\ Data/self -m /model/2022-04-12-12-06-32 -s one-to-one
python cal_column_similarity.py -p Test\ Data/authors -m /model/2022-04-12-12-06-32-11 -t 0.9
Parameters:
Output:
Features: "is_url", "is_numeric", "is_date", "is_string", "numeric:mean", "numeric:min", "numeric:max", "numeric:variance", "numeric:cv", "numeric:unique/len(data_list)", "length:mean", "length:min", "length:max", "length:variance", "length:cv", "length:unique/len(data_list)", "whitespace_ratios:mean", "punctuation_ratios:mean", "special_character_ratios:mean", "numeric_ratios:mean", "whitespace_ratios:cv", "punctuation_ratios:cv", "special_character_ratios:cv", "numeric_ratios:cv", "colname:bleu_score", "colname:edit_distance", "colname:lcs", "colname:tsm_cosine", "colname:one_in_one", "instance_similarity:cosine"
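As a rough illustration of what some of these per-column features measure, the sketch below computes the `length:*` statistics and two character-ratio means for a small column of strings. The exact definitions used by the project are not given here, so several choices are assumptions: population variance, `cv` as standard deviation divided by mean, and `unique` counted over distinct lengths. The helper names are hypothetical.

```python
import statistics
import string


def length_features(values):
    """Compute length statistics for one column of string values."""
    lengths = [len(v) for v in values]
    mean = statistics.fmean(lengths)
    variance = statistics.pvariance(lengths)  # population variance (assumption)
    return {
        "length:mean": mean,
        "length:min": min(lengths),
        "length:max": max(lengths),
        "length:variance": variance,
        # Coefficient of variation: standard deviation relative to the mean.
        "length:cv": (variance ** 0.5) / mean if mean else 0.0,
        # Share of distinct lengths among all values (assumed definition).
        "length:unique/len(data_list)": len(set(lengths)) / len(lengths),
    }


def mean_char_ratio(values, charset):
    """Average fraction of characters in each value that belong to charset."""
    ratios = [sum(ch in charset for ch in v) / len(v) for v in values if v]
    return statistics.fmean(ratios) if ratios else 0.0


column = ["New York", "Paris", "Rio de Janeiro"]
features = length_features(column)
features["whitespace_ratios:mean"] = mean_char_ratio(column, string.whitespace)
features["punctuation_ratios:mean"] = mean_char_ratio(column, string.punctuation)
print(features)
```

Features of this kind describe the shape of a column's values, which is what allows two columns to be compared even when their names are missing or uninformative.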
Average Confusion Matrix:

| | Negative(Truth) | Positive(Truth) |
|----------------|-----------------|-----------------|
| Negative(pred) | 0.94343111 | 0.05656889 |
| Positive(pred) | 0.17135417 | 0.82864583 |
Data: https://github.com/fireindark707/Schema_Matching_XGboost/tree/main/Test%20Data/self
| | title | text | summary | keywords | url | country | language | domain | name | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| col1 | 1(FN) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| col2 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| col3 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| words | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 | 0 |
| link | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 | 0 |
| col6 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 | 0 |
| lang | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 | 0 |
| col8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) | 0 | 0 |
| website | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0(FN) | 0 |
| col10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1(TP) |
F1 score: 0.889
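The reported score can be reproduced from the cells marked in the table above: eight true positives (TP) and two false negatives (FN). The sketch below assumes zero false positives, since the table marks none; the function `f1_score` is written here for illustration and is not taken from the project.

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts read from the table: 8 cells marked TP, 2 marked FN, none marked FP.
score = f1_score(tp=8, fp=0, fn=2)
print(round(score, 3))  # 0.889
```

Precision is 8/8 = 1.0 and recall is 8/10 = 0.8, so F1 = 2 × 1.0 × 0.8 / 1.8 ≈ 0.889.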
@software{fireinfark707_Schema_Matching_by_2022,
author = {fireindark707},
license = {MIT},
month = {4},
title = {{Schema Matching by XGboost}},
url = {https://github.com/fireindark707/Schema_Matching_XGboost},
year = {2022}
}