
Unix for Poets: Basic NLP Tasks Using Unix Tools

Submitted by Style Pass
2024-10-15 14:00:05

Often, we become so captivated by complexity and sophistication that we overlook the profound effectiveness of simple, fundamental methods.

In his wonderful Unix™ for Poets, Kenneth Ward Church shows how basic natural language processing (NLP) tasks, such as splitting text into words, counting words, sorting words in rhyming order, and even computing n-grams, can be done by anybody with simple Unix tools, without any fancy software or algorithms, even on a modest PC. In this article, we look at some basic NLP tasks using Unix tools, based on Unix™ for Poets, with Moby-Dick as the corpus, which is 214,492 words, 19,390 lines, and 1.2M in size.

1. Count Word Frequencies in a Text
2. Sort Words by "Rhyming" Order
3. N-grams
4. Find Palindrome Words
5. Conclusion
References

tr can be used to translate or delete characters. The -c option means complement, so in this example we use tr to convert all non-alphabetical characters to \n. With -s, we tell tr to replace each sequence of a repeated character with a single occurrence of that character. Hence every time tr encounters a run of non-alphabetical characters (e.g. spaces, punctuation, numbers), it replaces the whole run with a single \n character. Finally, we replace all uppercase letters with lowercase ones and store the result in the words.txt file.
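
The steps described above can be sketched as a two-stage pipeline. This is a minimal illustration, not necessarily the exact command used on the full corpus: sample.txt and its one-line content are hypothetical stand-ins for the Moby-Dick text, so substitute the path to your own copy.

```shell
# Hypothetical sample input (assumption); replace with your Moby-Dick file.
printf 'Call me Ishmael. Some years ago, never mind how long!\n' > sample.txt

# First tr:  -c complements the set 'A-Za-z' (i.e. every non-letter),
#            -s squeezes each run of resulting newlines into one.
# Second tr: maps uppercase letters to lowercase.
tr -sc 'A-Za-z' '\n' < sample.txt | tr 'A-Z' 'a-z' > words.txt

# Show the result: one lowercase word per line.
cat words.txt
```

For this sample, words.txt holds ten lines, starting with call, me, ishmael. Lowercasing after splitting keeps the first stage simple, since the letter set only has to distinguish letters from non-letters.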
