Saturday, 27 June 2015

How do I get the ngram string array and estfrequency as separate elements in a Hive table using HiveQL?

I am analyzing my own tweets and have loaded the data into a Hive table using the Hive JSON SerDe. I want a table showing the frequency of every two-word phrase in my tweets. The output should look something like:

phrase             frequency
["the","room"]      1248.0
["a","boy"]        1039.0
["rt","to"]        1032.0
["to","ct"]         986.0

Right now I can do this for single-word phrases, and I get the following output:

phrase     frequency
["the"]     1248.0
["a"]       1039.0
["rt"]      1032.0
["to"]      986.0
["you"]     828.0

For the one-word phrase output, my code is:

create table ng(new_ar array<struct<ngram:array<string>,estfrequency:double>>);

insert overwrite table ng select context_ngrams(sentences(lower(text)),array(null),100) as word from tweets;

create table wordFreq (ngram array<string>,  estfrequency double);

INSERT OVERWRITE TABLE wordFreq SELECT X.ngram, X.estfrequency from ng LATERAL VIEW explode(new_ar) Z as X;    

select * from wordFreq;

How do I modify the above code for my desired output?
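
One possible modification, offered as a sketch rather than a confirmed answer: Hive's built-in ngrams(sentences, N, K) function returns the top K N-grams in the same array<struct<ngram, estfrequency>> shape that context_ngrams produces, so passing N = 2 should yield two-word phrases. The column aliases phrase and frequency and the subquery alias t are illustrative only; the tweets table and its text column are taken from the code above.

```sql
-- Sketch: top 100 two-word phrases with their estimated frequencies.
-- ngrams(..., 2, 100) asks for bigrams (N = 2) and keeps the top 100 (K = 100).
SELECT X.ngram        AS phrase,     -- array<string> holding the two words
       X.estfrequency AS frequency   -- double, estimated occurrence count
FROM (
  SELECT ngrams(sentences(lower(text)), 2, 100) AS new_ar
  FROM tweets
) t
-- explode turns the array of structs into one row per bigram
LATERAL VIEW explode(t.new_ar) Z AS X;
```

Alternatively, to keep the existing ng and wordFreq tables, changing array(null) to array(null, null) in the context_ngrams call should also produce two-word phrases, since each null marks one word position to be estimated.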
