90% accuracy claim. #1
It is mentioned on the website (https://codist-ai.com/) that this model gives 90% accuracy.
Can you elaborate on what exactly this accuracy is and how it is measured?

Comments
Hello @varadhbhatnagar, we ran a pre-training on a corpus of over 6M lines of logical Python code, injecting special tokens such as (indent), (dedent), etc. to keep the logical structure of the code. We then fine-tuned the model on a binary classification problem: the model is shown a pair of token sequences, the first taken from the code and the second from the comments, and the task is to predict whether they match or not. We fine-tuned for this task using about 35K pairs, and at training time the F1 score on this task reaches 90%. Hope this answers your question.
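To make that input format concrete, here is a minimal sketch of how a (code, comment) pair could be turned into a single sequence with explicit structure tokens, using only the standard-library tokenize module. The (indent)/(dedent)/(newline) spellings, the [CLS]/[SEP] packing, and the helper names are assumptions for illustration, not the released model's actual preprocessing.

```python
import io
import tokenize

def add_structure_tokens(source: str) -> str:
    """Rewrite Python source with explicit markers for its block structure."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("(indent)")
        elif tok.type == tokenize.DEDENT:
            out.append("(dedent)")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("(newline)")
        elif tok.string.strip():
            out.append(tok.string)
    return " ".join(out)

def pack_pair(code: str, comment: str) -> str:
    """Pack code tokens and comment tokens into one BERT-style sequence."""
    return f"[CLS] {add_structure_tokens(code)} [SEP] {comment} [SEP]"

snippet = "def add(a, b):\n    return a + b\n"
# Label would be 1 here because the comment actually describes the code.
print(pack_pair(snippet, "Add two numbers and return the sum."))
```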
Thanks. Is there a paper associated with this project? And is this related to Microsoft's CodeBERT in any way?
It is not related to MS CodeBERT (apart from sharing the same name). The methodology we followed is inspired by the CuBERT paper, with our own methods and ideas blended into it. We have not published a paper on it yet, but the model is open-sourced for everyone to use.
Thanks for asking the questions 👍
I wanted to get an idea of the method complexity that this model can handle. For training and testing, did you use simple methods similar to the files in
Hi, we fine-tuned this model on the task using the py150k dataset, which, just to clarify, contains 150K Python files. We used our open-source library tree-hugger to mine those files and create a (method, docstring) tuple dataset. We then swapped about 50% of those docstrings and marked them as the negative class, while the rest are positive, and then used the pretrained model for fine-tuning on this task.
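For reference, here is a rough sketch of the docstring-swapping step described above. It assumes the (method, docstring) tuples have already been mined (e.g. with tree-hugger); the `pairs` input, the function name, and the 50% split knob are placeholders, not the project's actual data-preparation code.

```python
import random

def build_match_dataset(pairs, neg_fraction=0.5, seed=42):
    """pairs: list of (method_source, docstring) tuples.

    Returns (method_source, docstring, label) triples where label 1 means the
    docstring belongs to the method and 0 means it was swapped in from another.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    n_neg = int(len(pairs) * neg_fraction)
    neg_idx = set(rng.sample(range(len(pairs)), n_neg))

    dataset = []
    for i, (method, doc) in enumerate(pairs):
        if i in neg_idx and len(pairs) > 1:
            # Replace the docstring with one taken from a different method.
            j = rng.randrange(len(pairs))
            while j == i:
                j = rng.randrange(len(pairs))
            dataset.append((method, pairs[j][1], 0))
        else:
            dataset.append((method, doc, 1))
    rng.shuffle(dataset)
    return dataset
```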