I am analyzing my own tweets and have loaded the data into a Hive table using the Hive JSON SerDe. I want to find the frequency of every two-word phrase in my tweets, as a table. The output should look something like:
phrase frequency
["the","room"] 1248.0
["a","boy"] 1039.0
["rt","to"] 1032.0
["to","ct"] 986.0
Right now I can do this for single-word phrases, and I get the following output:
phrase frequency
["the"] 1248.0
["a"] 1039.0
["rt"] 1032.0
["to"] 986.0
["you"] 828.0
For the single-word output, my code is:
create table ng(new_ar array<struct<ngram:array<string>,estfrequency:double>>);
insert overwrite table ng select context_ngrams(sentences(lower(text)),array(null),100) as word from tweets;
create table wordFreq (ngram array<string>, estfrequency double);
INSERT OVERWRITE TABLE wordFreq SELECT X.ngram, X.estfrequency from ng LATERAL VIEW explode(new_ar) Z as X;
select * from wordFreq;
How do I modify the above code for my desired output?
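One direction that may work, as an untested sketch: in `context_ngrams`, each `null` in the context array marks one free word position, so `array(null, null)` should ask for the most frequent pairs of consecutive words instead of single words. The table names `ng2` and `phraseFreq` below are hypothetical, chosen only so the existing single-word tables are not overwritten; `tweets` and `text` are the names from the code above.

```sql
-- Hypothetical table ng2: same layout as ng, but holding two-word phrases.
create table ng2 (new_ar array<struct<ngram:array<string>,estfrequency:double>>);

-- array(null, null) requests two consecutive free words (bigrams);
-- 100 is the number of top phrases to keep, as in the single-word version.
insert overwrite table ng2
select context_ngrams(sentences(lower(text)), array(null, null), 100) as phrase
from tweets;

-- Hypothetical table phraseFreq: one row per phrase after exploding the array.
create table phraseFreq (ngram array<string>, estfrequency double);

insert overwrite table phraseFreq
select X.ngram, X.estfrequency
from ng2 LATERAL VIEW explode(new_ar) Z as X;

select * from phraseFreq;
```

Hive also provides `ngrams(sentences(lower(text)), 2, 100)`, which should produce the same kind of result for plain two-word phrases without a context array. In both cases the frequencies are estimates and only the top 100 phrases are returned, so this is not an exact count of every phrase.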